04/02/2026 9:14 AM
The outputs get really uncreative when I train the LoRA on the cold-start CQ data. I'm wondering if I should try prompt tuning and save the LoRA for the RL stage. I should look at more past papers; I don't remember them having much detail on the cold-start mechanism.
Need other metrics besides loss for measuring a model's ability to ask clarifying questions. Again, look closer at past papers. I felt like most of them were using things like final task accuracy.
04/02/2026 11:48 AM
I'm now thinking about which layers to apply the transform to, and I know I'm getting too far down in the weeds. That's something to explore later, after I have the actual RL working.
I also need to actually build the diverse output generation. I think the way to do that is to hijack beam search so that it returns all of the beams, since those have enforced diversity to a certain extent. Otherwise I can use random sampling with some mechanism to ensure we get enough diversity. A rough sketch of the beam route is below.
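If I go the beam route, Hugging Face's generate already supports returning every beam and even group (diverse) beam search; a minimal sketch, where the beam counts and penalty are placeholder values I'd need to tune:

generated_ids = model.generate(
    **inputs,
    num_beams=10,               # Total beams; must be divisible by num_beam_groups
    num_beam_groups=5,          # Group beam search enforces diversity across groups
    diversity_penalty=1.0,      # Penalize tokens already chosen by other groups
    num_return_sequences=10,    # Return every beam, not just the best one
    do_sample=False,            # Group beam search is deterministic
    max_new_tokens=self.max_new_tokens,
)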
Before my 1-on-1 today I want to have the diverse generation built and integrated into the demo so that I can generate multiple possible paths and then have the user select which they want to use.
04/02/2026 1:13 PM
OK, the one-on-one isn't happening, so I'll just have that stuff ready by tomorrow for the SNU meeting.
04/02/2026 1:31 PM
Example of an actual functional multi-step dialog that needed multiple turns to disambiguate the intention. The answer model correctly hid the fact that it was asking about the height in stories, and the CQ model correctly identified that there was ambiguity in the units and clarified it. Good job, models.
04/02/2026 3:22 PM
https://arxiv.org/abs/2410.09038
This paper describes a CoverageQA dataset that uses questions with multiple equally plausible answers to measure how diverse a model's outputs are.
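If I adopt that, the metric itself is simple. A minimal sketch of my reading of it; the exact matching and normalization rules in the paper may differ:

def coverage(gold_answers, samples):
    # Fraction of the equally plausible gold answers that appear among
    # the model's sampled outputs (higher = better diversity coverage).
    gold = {a.strip().lower() for a in gold_answers}
    hits = {s.strip().lower() for s in samples} & gold
    return len(hits) / len(gold)

# Toy example: coverage(["paris", "lyon", "marseille"], ["Paris", "Paris", "Lyon"]) -> 2/3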
04/02/2026 3:36 PM
I've gotten diverse sampling somewhat working. I tried group beam search, but the quality of the outputs was very low. Instead I just used normal temperature-based sampling, which worked much better:
generated_ids = model.generate(
    **inputs,
    do_sample=True,                      # Enable sampling
    temperature=1.2,                     # High temperature to encourage diverse answers
    top_p=0.95,                          # Nucleus sampling
    top_k=50,                            # Limit sample pool to the top 50 tokens
    num_return_sequences=num_samples,    # Generate num_samples different samples per prompt
    max_new_tokens=self.max_new_tokens,
)
Another question is how not to waste compute. From these responses it's clear that often there is really only one distinct output in the "diverse" set. This happens more often with the answer model than the CQ model, but it happens in both. One option is to use a sentence embedding model and enforce some minimum distance for two outputs to count as different, or to run a clustering algorithm like DBSCAN as a preprocessing step and keep only the most diverse samples. A sketch of that idea is below.
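A minimal sketch of the DBSCAN dedup, assuming sentence-transformers and scikit-learn are available; the embedding model name and the eps threshold are placeholder choices I'd have to tune:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def dedupe_samples(samples, eps=0.3):
    # Cluster near-duplicate generations and keep one representative per cluster.
    embeddings = embedder.encode(samples, normalize_embeddings=True)
    labels = DBSCAN(eps=eps, min_samples=1, metric="cosine").fit_predict(embeddings)
    seen, kept = set(), []
    for text, label in zip(samples, labels):
        if label not in seen:  # min_samples=1 means no noise points; every sample gets a cluster
            seen.add(label)
            kept.append(text)
    return kept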
04/02/2026 3:51 PM
I'm struggling a bit to get the answer model to be vague. It tends to provide much more information than was asked for in the question.

Like here, it even seems to try to be vague, but then gives away the answer after the clarification model flounders for a while.

If we look at the diverse answers, we can see that some of them remain ambiguous, so some branches of the tree would be useful, but many will just be useless. Perhaps we frame this as a bit of a minimax, where we train on the branches that are the deepest, since those correspond to cases where the answer model is vague or even incorrect (which I have seen happen). So depth relative to the other branches in the tree may be a proxy for how hazy the answer model is being; a rough sketch of that selection rule follows.
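A minimal sketch of that selection rule; everything here, including the keep_frac threshold, is a hypothetical placeholder:

def select_training_branches(branches, keep_frac=0.75):
    # branches: list of (dialog, depth) pairs from one dialog tree.
    # Keep the relatively deepest branches, on the idea that depth is a
    # proxy for how long the answer model stayed vague.
    max_depth = max(depth for _, depth in branches)
    return [dialog for dialog, depth in branches if depth >= keep_frac * max_depth]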
- [ ] @TODO Implement diverse sampling
- [ ] @TODO 504 Homework
- [ ] @TODO Read this article Jason sent https://ai.engin.umich.edu/2023/08/17/eight-lessons-learned-in-two-years-of-ph-d/