13/02/2026 2:52 PM
I've been thinking through the LLM-as-a-judge part of this work.
This survey is an amazing source.
Pairwise comparison is generally better than direct scoring.
I'm not sure it fits my situation, though: if we rank responses, how do we assign the actual reward? Also, our answers tend to be either very right or very wrong with little in between, and a ranking doesn't capture that well.
One way to get a score: section 2.3.2 outlines a strategy for deriving continuous scores from the logits of a yes/no binary judgment. So I could ask, "Is this a correct answer to the question, or equivalent to the correct answer for the question?" and then use the logit scores to estimate the probability of the answer being correct.
I don't see other options, but maybe there are. I can implement both and see which aligns best with my intuition for the score.
I need prompts for both approaches: one asking the model for a score (maybe 1-10) and one asking the yes/no question.
The prompts should include relevant examples, so I'll collect some sample flows here.
******* Sample *******
Unambiguous Question: What are the clear circles on the front of the car made of?
Ambiguous Question: What are they made of?
Gold Answer: glass
Answers: ['glass', 'glass', 'glass', 'lights', 'glass', 'glass', 'glass', 'glass', 'glass', 'silver rod']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000000.jpg
Caption: a blue and white truck parked next to a crowd of people in a grassy field with trees in the background
**********************
>> Inference Response:
>> Metal and paint.
This would get 0.
******* Sample *******
Unambiguous Question: What are the clear circles on the front of the car made of?
Ambiguous Question: What are they made of?
Gold Answer: glass
Answers: ['glass', 'glass', 'glass', 'lights', 'glass', 'glass', 'glass', 'glass', 'glass', 'silver rod']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000000.jpg
Caption: a blue and white truck parked next to a crowd of people in a grassy field with trees in the background
**********************
>> Inference Response:
>> Glass and metal.
This would get around a 5 or 6 because it got the correct answer but also included a wrong one. The model would have performed better if it had further clarified exactly which part was being asked about.
******* Sample *******
Unambiguous Question: What are the clear circles on the front of the car made of?
Ambiguous Question: What are they made of?
Gold Answer: glass
Answers: ['glass', 'glass', 'glass', 'lights', 'glass', 'glass', 'glass', 'glass', 'glass', 'silver rod']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000000.jpg
Caption: a blue and white truck parked next to a crowd of people in a grassy field with trees in the background
**********************
>> Inference Response:
>> Glass
This of course gets a 10 as it gets the answer exactly.
******* Sample *******
Unambiguous Question: Where should the reflective items on the peoples high foreheads really be?
Ambiguous Question: Where should these items be positioned?
Gold Answer: over eyes
Answers: ['covering eyes', 'over eyes', 'over eyes', 'over eyes', 'goggles', 'googles', 'on eyes', 'over eyes', 'over eyes', 'eyes']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000011.jpg
Caption: a man posing with a child on skis in front of a lodge at a ski resort during the winter season
**********************
>> Inference Response:
>> On their heads, above eyes.
This one is much less clear. I would argue that this is also a correct answer to the question. So I would say this gets a 9 or 10.
******* Sample *******
Unambiguous Question: There is a large white structure with which animal directly in front of it?
Ambiguous Question: What is directly in front of the large structure?
Gold Answer: zebra
Answers: ['zebra', 'wall', 'zebra', 'zebra', 'zebra', 'roof', 'tent', 'zebra', 'giraffe', 'wall']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000103.jpg
Caption: a giraffe standing in the middle of a lush green field with trees and other animals in the background on a cloudy day
**********************
>> Inference Response:
>> Giraffe
This is a case where one of the annotator answers is actually wrong. The model answers giraffe, and "giraffe" does appear in the answer list, but in reality the answer is zebra. There are also distractor "correct" answers like "wall", "roof", and "tent". In this case only the gold answer is correct.
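Based on the samples above, here is a first draft of the two prompt templates. The field names (`question`, `gold_answer`, `answers`, `response`) are my own placeholders for the sample fields shown, and the wording is a sketch to be refined against the flows collected here:

```python
# Draft prompt for the 1-10 scoring strategy.
SCORE_PROMPT = """You are grading an answer to a visual question.
Question: {question}
Gold answer: {gold_answer}
Other annotator answers: {answers}
Model response: {response}

On a scale of 1-10, how correct is the model response? A response that
exactly matches the gold answer (or is clearly equivalent) scores 10; a
response mixing a correct answer with an incorrect one scores around 5-6;
a fully incorrect response scores at the bottom of the scale. Reply with
only the number."""

# Draft prompt for the yes/no binary strategy (scored from logits).
YES_NO_PROMPT = """You are grading an answer to a visual question.
Question: {question}
Gold answer: {gold_answer}
Model response: {response}

Is the model response a correct answer to the question, or equivalent to
the gold answer? Reply with only Yes or No."""

# Filling the scoring template with the first sample above.
prompt = SCORE_PROMPT.format(
    question="What are the clear circles on the front of the car made of?",
    gold_answer="glass",
    answers="glass, lights, silver rod",
    response="Glass and metal.",
)
```

Few-shot examples from the graded samples above could be prepended to either template once the target scores feel stable.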
Also, check section 5.1.1 to see whether others have done similar NLP games.