09/02/2026 10:12 AM
You have to be so careful with these things; they never issue warnings. I mislabeled part of a dict as mm_data instead of multi_modal_data (both names show up in the code), and instead of warning me it simply put no data into the image placeholder, so the model hallucinated random things.
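For reference, the shape vLLM actually expects, as a minimal sketch (the model and prompt template are placeholders, not what I'm actually running; the point is the dict keys):

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Placeholder model and prompt template.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

# The key must be exactly "multi_modal_data". An unknown key like "mm_data" is
# silently ignored, the <image> placeholder gets no pixels, and the model
# hallucinates whatever it likes.
outputs = llm.generate(
    {
        "prompt": "USER: <image>\nDescribe the picture.\nASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```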
09/02/2026 11:04 AM
Freezing happening again. Tried reducing GPUs to 1, increasing Ray CPUs to 32, and parallel execution to 32 as well.
Noted that RAM usage is high at freeze. Around 15 GB. CPU on process 100%, but that is normal. Perhaps freeze is due to image tokenization? But that happens inside vLLM and I don't know if I can manage that myself.
23.2 GB RAM at start -> consistently rising. CPU hovering around 98-99% for the main process. nvtop shows CPU usage around 200%, GPU 85%. Suddenly shifts to 100% CPU usage and 100% GPU usage. Losing access to CPUs? -> 28.1 GB RAM usage at the end.
Seeing many, many ray::IDLE processes in htop. Note that the freeze only happens when processing images; when images are not passed to vLLM it finishes normally.
09/02/2026 11:22 AM
Trying reducing ray params to @ray.remote(max_concurrency=8, num_cpus=1) and num_concurrent_loops = 8 to match the number of allowed ray concurrent calls.
GPU usage remaining around 80%. CPU around 105% on nvtop and 99% on htop.
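For reference, the rough shape of the Ray setup at this point (a sketch; the actor and method names are stand-ins for my actual code):

```python
import time
import ray

ray.init(num_cpus=8)

# max_concurrency=8 makes this a threaded actor that allows up to 8 calls inside
# it at once; num_cpus=1 is the CPU reservation for the actor process itself.
@ray.remote(max_concurrency=8, num_cpus=1)
class InferenceWorker:
    def process(self, sample):
        time.sleep(0.1)  # stand-in for one vLLM call on one sample
        return sample

worker = InferenceWorker.remote()

# Keep the number of in-flight loops equal to max_concurrency so nothing piles
# up beyond what Ray will actually run in parallel.
num_concurrent_loops = 8
refs = [worker.process.remote(i) for i in range(num_concurrent_loops)]
print(ray.get(refs))
```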
09/02/2026 11:23 AM
ValueError: The decoder prompt (length 11968) is longer than the maximum model length of 4096. Make sure that `max_model_len` is no smaller than the number of text tokens plus multimodal tokens. For image inputs, the number of image tokens depends on the number of images, and possibly their aspect ratios as well.
Got an interesting error. Maybe a very large image? Perhaps I should finally do the thing I meant to do a while back and have the dataset itself resize the images to a consistent small size.
ValueError: The decoder prompt (length 4568) is longer than the maximum model length of 4096. Make sure that `max_model_len` is no smaller than the number of text tokens plus multimodal tokens. For image inputs, the number of image tokens depends on the number of images, and possibly their aspect ratios as well.
Got another one as well. Seems like images are too large.
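One thing I could add is a preflight check before anything reaches the engine, sketched below. IMAGE_TOKEN_BUDGET is a guess I'd have to calibrate; the real image token count comes from vLLM's processor and depends on resolution.

```python
# Rough preflight on the text side only; vLLM adds the image feature tokens on
# top of this, so leave generous headroom rather than trusting the raw count.
MAX_MODEL_LEN = 4096
IMAGE_TOKEN_BUDGET = 2048  # guess; depends on image resolution and the model

tokenizer = llm.get_tokenizer()  # llm is the vLLM LLM instance
n_text_tokens = len(tokenizer.encode(prompt))
if n_text_tokens + IMAGE_TOKEN_BUDGET > MAX_MODEL_LEN:
    raise ValueError(f"Prompt likely too long: {n_text_tokens} text tokens plus image budget")
```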
In any case it is doing better. About half the speed though.
I was thinking that perhaps limiting batch size on the LLM Engine would be an option. That's how I fixed the freezes during training, reducing batch size.
It still feels odd to me that there are thousands of processes running ray::IDLE with different PIDs.
09/02/2026 11:28 AM
Trying leaving the Ray params small, but upping the concurrent loops.
Upping the loops did not accelerate inference, so Ray's low concurrency is still the throttle. Let's see if it at least helps with the freezes.
Next step is to try limiting batch size instead of Ray concurrency.
09/02/2026 12:15 PM
Trying cropping the image down to a consistent small square before processing. I also suspect that having many more parallel loops than allowed concurrent ray calls is causing some kind of backup.
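The preprocessing itself is just a center crop to a square plus a downscale in the dataset, roughly like this (the 336 px target is a placeholder; the point is that every image comes out the same size):

```python
from PIL import Image

def to_small_square(img: Image.Image, size: int = 336) -> Image.Image:
    """Center-crop to a square, then resize, so every prompt contributes the
    same number of image tokens regardless of the source resolution."""
    w, h = img.size
    side = min(w, h)
    left = (w - side) // 2
    top = (h - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size), Image.Resampling.BICUBIC)
```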
That seems to have stabilized things. Trying increasing ray concurrency to 16 to see if I can get some speed back.
09/02/2026 12:28 PM
I think a lesson here is to keep things as consistent as possible. Inputs to the model all being the same size makes it much less likely that weird issues pop up halfway through processing a dataset.
Concurrency 16 worked. Trying increasing image size to 512. Immediately failed. Trying lower concurrency. Still didn't work. Trying setting max_num_seqs to 1 instead. If that doesn't work then something is seriously wrong with using large images.
So slow. I'm going up to 8 and seeing what happens.
Seems to work. Going up to 16.
Note that I should also probably swap to max_num_batched_tokens. I just don't know how many tokens my normal input is.
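Both knobs live on the engine constructor; a sketch with placeholder values (the model name is a stand-in):

```python
from vllm import LLM

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",  # stand-in for the actual model
    max_model_len=4096,
    max_num_seqs=8,                    # cap on sequences scheduled per step
    # max_num_batched_tokens=4096,     # alternative cap in tokens per step;
    #                                  # needs a measured per-sample token count
)
```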
Once I find the min batch size that causes a freeze I will shift to using 2 GPUs and see if that helps.
Ok the answer is 16. Froze quickly. Going to 2 GPUs.
09/02/2026 1:35 PM
Ok, with 2 GPUs it has not crashed. So it is about how many batches are being processed in parallel on the GPUs. Which kinda sucks actually; I was hoping it had to do with image processing on the CPU. But this does mean I can try out a bunch of values, find a safe per-GPU batch size, and then limit based on that. Generation of data will be a lot slower than I initially thought, but that isn't the end of the world.
I'm going to try 12 batch size on 1 GPU.
09/02/2026 2:03 PM
Went on a walk and came back and it was still running so I'm going to assume that's ok. Trying 12 on each GPU for 2 GPUs.
09/02/2026 2:20 PM
I also just realized I've been using tensor parallel, which parallelizes a single model across GPUs, instead of data parallel, which puts a copy on each GPU. Data parallel seems slightly faster for my simple test, but of course it also reduces the KV cache size, which might make the tree processing slower. So I should try both.
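Roughly the two configurations I mean, as a sketch (model name is a placeholder; the data-parallel version here is just one single-GPU engine per Ray actor, which is how I'd get a full copy on each card). These are alternatives, not meant to run in the same process:

```python
import ray
from vllm import LLM

MODEL = "llava-hf/llava-1.5-7b-hf"  # placeholder

# Option A, tensor parallel: one engine sharded across both GPUs, one big KV cache.
tp_llm = LLM(model=MODEL, tensor_parallel_size=2)

# Option B, data parallel: a full model copy per GPU, each with its own smaller KV cache.
@ray.remote(num_gpus=1)
class EngineReplica:
    def __init__(self):
        self.llm = LLM(model=MODEL, tensor_parallel_size=1)

    def generate(self, prompts, sampling_params=None):
        return self.llm.generate(prompts, sampling_params)

replicas = [EngineReplica.remote() for _ in range(2)]
```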
I also need to make sure I'm properly using https://docs.vllm.ai/en/latest/features/multimodal_inputs/#cached-inputs when using the image multiple times in tree creation. To be safe I will pass the real image the first time and then just an ID on subsequent calls, so I'm not even re-sending the image for the model to re-process.
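The engine-side caching is whatever that doc describes; the part I need on my side is just stable ID bookkeeping so the same image maps to the same ID across the whole tree. A sketch of that bookkeeping (the hashing choice is arbitrary):

```python
import hashlib
from PIL import Image

_seen_images: set[str] = set()

def image_id(img: Image.Image) -> tuple[str, bool]:
    """Return a stable ID for an image plus whether this is its first use.
    First use: send the ID together with the real pixels; later uses in the
    same tree: send only the ID so the engine can reuse its cached version."""
    digest = hashlib.sha256(img.tobytes()).hexdigest()[:16]
    first_use = digest not in _seen_images
    _seen_images.add(digest)
    return digest, first_use
```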
09/02/2026 2:24 PM
On the other hand, with data parallelism it just crashed at a batch size of 12, so... I'm moving back to 8 and seeing what happens. Maybe tensor parallelism is actually the way to go. It does have higher throughput and a bigger KV cache.
We get around 6 samples/s with batch size 8 on 2 GPUs.
09/02/2026 2:48 PM
Crashed after a few thousand. Trying tensor parallel at 12. It gets around 5.5 samples/s. So slightly slower, but larger KV cache.
09/02/2026 2:53 PM
I could also try using their utility for spinning up the models and then calling them externally. Maybe they have things in place that make it more stable. Now that I'm thinking about it, that's probably a good idea to at least try. It would be nice to be able to run multiple experiments without needing to spin up the models again, because that takes a while.
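By "their utility" I mean the OpenAI-compatible server, so the weights load once and every experiment just talks to it. A sketch of what that would look like (model name, port, and flags are placeholders for whatever I end up serving):

```python
# Server side, run once in a shell:
#   vllm serve llava-hf/llava-1.5-7b-hf --tensor-parallel-size 2 --max-num-seqs 8
# Client side: standard OpenAI-compatible chat calls against the local server.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("example.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # must match the served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the picture."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```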