16/04/2026 8:05 AM - SNU presentation prep
Start with a slide outlining the
Swapping back and forth between PEFT adapters is a bad idea.
The best method is to merge your reference LoRA into the base model and initialize a new LoRA on top of the merged weights.
In other words, keep your reference model as a single model: if you trained it as a LoRA, merge it into the base model (sketch below).
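A minimal sketch of that merge-then-reinitialize pattern with PEFT; the model id, adapter path, and LoRA hyperparameters are placeholders, not the ones from this project:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

# Hypothetical model id and adapter path.
base = AutoModelForCausalLM.from_pretrained(
    "base-model-id", torch_dtype=torch.bfloat16
)

# Fold the reference LoRA's deltas into the base weights, leaving a
# single plain HF model that serves as the reference policy.
ref = PeftModel.from_pretrained(base, "path/to/reference-lora")
merged = ref.merge_and_unload()

# Start a fresh LoRA on top of the merged weights for the new run;
# no adapter swapping is needed during training.
new_lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
policy = get_peft_model(merged, new_lora)
```

After merge_and_unload the reference weights are ordinary model weights, so the reference and the trainable policy never share adapter state.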
Or better yet, precompute your reference model logprobs. The problem: if you are using something like vLLM to accelerate inference, you cannot compute logprobs under the reference model without loading an entirely separate HF model into memory, because vLLM does not let you get logprobs for an existing sequence with a multi-modal model.
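A minimal sketch of that precomputation, assuming a plain HF reference model and batches of (input_ids, attention_mask); the names and batch format are assumptions, not the project's actual data layout:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_ref_logprobs(ref_model, batches):
    cache = []
    for input_ids, attention_mask in batches:
        logits = ref_model(input_ids=input_ids,
                           attention_mask=attention_mask).logits
        # Shift so position t scores token t+1, then pick the realized tokens.
        logprobs = F.log_softmax(logits[:, :-1], dim=-1)
        picked = logprobs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        # Mask out padding and move the (small) result off the GPU.
        cache.append((picked * attention_mask[:, 1:]).cpu())
    return cache
```

Done once up front, this removes the reference model from the training loop entirely; only the small per-token logprob tensors need to be kept around.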
Aggressively delete tensors. Modern language model outputs are absolutely massive: once sequence lengths exceed roughly 2,000 tokens, storing a single model output's logits already takes gigabytes (e.g., fp32 logits for one 2,048-token sequence over a ~128k vocabulary come to about 1 GB). Computing in batches easily pushes this into tens of GB.
PyTorch can manage memory well, but only if you as the user remember to actively delete the Python objects pointing at tensors you no longer need. Otherwise they may linger until the garbage collector comes to get them (sketch below).
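A sketch of that deletion discipline in a training step; the model, loss, and shapes are illustrative, and the comments flag what autograd keeps alive regardless:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, input_ids):
    logits = model(input_ids).logits                  # (B, T, V): the big one
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    # log_softmax saves its *output* for backward, not the raw logits,
    # so this del drops the last reference to a multi-GB tensor.
    del logits
    nll = -logprobs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    loss = nll.mean()
    # Autograd has already saved what it needs for backward; dropping the
    # Python names stops these from lingering past this step.
    del logprobs, nll
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.detach()
```

torch.cuda.empty_cache() can additionally hand cached blocks back to the driver, but it is dropping the Python references that lets PyTorch's allocator reuse the memory within the process.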
So far these have mostly been practical considerations, but I am still looking for methods to compare against: both methods built for language models in multi-turn conversations and plain RL methods that could be applied to my data.
Show the ones I have collected so far as highly relevant.