
08/02/2026 9:37 PM

⬅️ [07/02/2026 11:23 AM](<./07_02_2026 11_23 AM.md>) | ⬆️ [2026 - February](<./README.md>) | [09/02/2026 10:12 AM](<./09_02_2026 10_12 AM.md>) ➡️

Well, that was a frustrating bug. I've been working with Ray and vLLM so that I can specify which model runs on which GPU. It turns out vLLM uses Ray internally, which causes conflicts, so you need to be careful. To manually pick the GPU, you have to do this song and dance:

```python
import os

# Necessary to prevent Ray from overriding CUDA_VISIBLE_DEVICES to empty
os.environ["RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO"] = "0"

cq_model = CQModelWorker.options(
    runtime_env={"env_vars": {"CUDA_VISIBLE_DEVICES": "6"}},
    num_gpus=0,  # 0 so we don't conflict with vLLM trying to auto-assign GPUs
).remote(cq_model_cfg, lora_checkpoint_path, n_gpus=1)
```

You need to specify `num_gpus=0` so that vLLM does not try to allocate its own GPUs; if it does, it overwrites `CUDA_VISIBLE_DEVICES` and causes errors when you try to run another model. But of course, with `num_gpus=0`, Ray overwrites `CUDA_VISIBLE_DEVICES` to an empty string. So you also have to set `os.environ["RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO"] = "0"`, which tells Ray not to overwrite your visible devices (apparently this behavior will become the default in the future). I hate it.
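When debugging this kind of thing, the cheapest sanity check is to confirm that `CUDA_VISIBLE_DEVICES` survives as you set it. A minimal sketch (the `visible_devices` helper is my own, not part of Ray or vLLM; in practice you would run it inside the actor, not the driver, to see what Ray actually handed it):

```python
import os

# Set the override flag and the target GPU *before* anything Ray-related
# gets a chance to clobber them.
os.environ["RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO"] = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = "6"

def visible_devices() -> list[str]:
    """Parse CUDA_VISIBLE_DEVICES into a list of device ids ('' -> [])."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [d for d in raw.split(",") if d]

print(visible_devices())  # expect ['6']; [] means something overwrote it
```

If that list comes back empty inside the worker, Ray blanked the variable and you are back in the song and dance above.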

I've now discovered this is only sometimes true, and sometimes it's the opposite. So… good luck.


⬅️ [07/02/2026 11:23 AM](<./07_02_2026 11_23 AM.md>) | ⬆️ [2026 - February](<./README.md>) | [09/02/2026 10:12 AM](<./09_02_2026 10_12 AM.md>) ➡️