08/02/2026 9:37 PM
Well that was a frustrating bug. I've been working with Ray and vLLM so that I can specify which model to run where. Turns out vLLM uses Ray internally, which causes conflicts, so you need to be careful. To manually specify which GPU an actor uses, you need to do this song and dance:
os.environ["RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO"] = "0" # Necessary to prevent Ray from overriding CUDA_VISIBLE_DEVICES to empty
cq_model = CQModelWorker.options(
runtime_env={"env_vars": {"CUDA_VISIBLE_DEVICES": "6"}},
num_gpus=0 # 0 so that we do not get conflicts with vLLM trying to auto-assign GPUs
).remote(cq_model_cfg, lora_checkpoint_path, n_gpus=1)
You need to specify num_gpus=0 so that vLLM doesn't try to allocate its own GPUs, because if it does it will overwrite CUDA_VISIBLE_DEVICES and cause errors when you try to run another model. But of course if you set num_gpus=0, Ray overwrites CUDA_VISIBLE_DEVICES to empty instead. So you also need to set os.environ["RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO"] = "0", which tells Ray not to overwrite your visible devices (apparently this will be the default in the future). I hate it.
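For posterity, here's a minimal self-contained sketch of the whole pattern. PinnedVLLMWorker and the opt-125m model are hypothetical placeholders (my real class is CQModelWorker above), and I'm assuming the env var has to be set before ray.init() so actors inherit it:

import os

# Assumption: this has to be set before ray.init() so actors inherit it
os.environ["RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO"] = "0"

import ray
from vllm import LLM, SamplingParams

@ray.remote
class PinnedVLLMWorker:  # hypothetical stand-in for CQModelWorker
    def __init__(self, model_name):
        # Inside the actor, CUDA_VISIBLE_DEVICES is whatever runtime_env
        # set, so vLLM only sees (and loads onto) that one GPU.
        self.llm = LLM(model=model_name)

    def generate(self, prompt):
        out = self.llm.generate([prompt], SamplingParams(max_tokens=64))
        return out[0].outputs[0].text

ray.init()
worker = PinnedVLLMWorker.options(
    runtime_env={"env_vars": {"CUDA_VISIBLE_DEVICES": "6"}},  # pin to GPU 6
    num_gpus=0,  # keep Ray's GPU bookkeeping out of vLLM's way
).remote("facebook/opt-125m")  # placeholder model
print(ray.get(worker.generate.remote("Hello")))

In theory you can then spin up a second worker with a different CUDA_VISIBLE_DEVICES value and the two models stay on separate GPUs.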
I’ve now discovered this is only sometimes true, and sometimes it’s the opposite. So… good luck