13/02/2026 2:52 PM
I've been thinking through the LLM-as-a-judge part of this work.
This survey is an amazing source.
Pairwise comparison is generally better than direct scoring.
I'm not sure it fits my situation, though: if we rank responses, how do we assign the actual reward? Also, our answers tend to be either very right or very wrong with little in between, and a ranking doesn't capture that well.
One way to get a score: section 2.3.2 outlines a strategy for deriving continuous scores from the logits of a yes/no binary judgment. So I could ask, "Is this a correct answer to the question, or equivalent to the correct answer for the question?" and then use the logit scores to estimate the probability of the answer being correct.
I don't see other options, but maybe there are. I can implement both and see which aligns best with my intuition for the score.
I need prompts for both approaches: one asking the model for a score (maybe 1-10) and one asking the yes/no question.
The prompts should include relevant examples, so I'll collect some sample flows here.
******* Sample *******
Unambiguous Question: What are the clear circles on the front of the car made of?
Ambiguous Question: What are they made of?
Gold Answer: glass
Answers: ['glass', 'glass', 'glass', 'lights', 'glass', 'glass', 'glass', 'glass', 'glass', 'silver rod']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000000.jpg
Caption: a blue and white truck parked next to a crowd of people in a grassy field with trees in the background
**********************
>> Inference Response:
>> Metal and paint.
This would get 0.
******* Sample *******
Unambiguous Question: What are the clear circles on the front of the car made of?
Ambiguous Question: What are they made of?
Gold Answer: glass
Answers: ['glass', 'glass', 'glass', 'lights', 'glass', 'glass', 'glass', 'glass', 'glass', 'silver rod']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000000.jpg
Caption: a blue and white truck parked next to a crowd of people in a grassy field with trees in the background
**********************
>> Inference Response:
>> Glass and metal.
This would get around a 5 or 6 because it got the correct answer but also included a wrong one. The model would have performed better if it had further clarified exactly which part was being asked about.
******* Sample *******
Unambiguous Question: What are the clear circles on the front of the car made of?
Ambiguous Question: What are they made of?
Gold Answer: glass
Answers: ['glass', 'glass', 'glass', 'lights', 'glass', 'glass', 'glass', 'glass', 'glass', 'silver rod']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000000.jpg
Caption: a blue and white truck parked next to a crowd of people in a grassy field with trees in the background
**********************
>> Inference Response:
>> Glass
This of course gets a 10 as it gets the answer exactly.
******* Sample *******
Unambiguous Question: Where should the reflective items on the peoples high foreheads really be?
Ambiguous Question: Where should these items be positioned?
Gold Answer: over eyes
Answers: ['covering eyes', 'over eyes', 'over eyes', 'over eyes', 'goggles', 'googles', 'on eyes', 'over eyes', 'over eyes', 'eyes']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000011.jpg
Caption: a man posing with a child on skis in front of a lodge at a ski resort during the winter season
**********************
>> Inference Response:
>> On their heads, above eyes.
This one is much less clear. I would argue that this is also a correct answer to the question. So I would say this gets a 9 or 10.
******* Sample *******
Unambiguous Question: There is a large white structure with which animal directly in front of it?
Ambiguous Question: What is directly in front of the large structure?
Gold Answer: zebra
Answers: ['zebra', 'wall', 'zebra', 'zebra', 'zebra', 'roof', 'tent', 'zebra', 'giraffe', 'wall']
Image Path: /scratch4/home/adempst/projects/clarification-trees-v2/data/clearvqa/images/train_000103.jpg
Caption: a giraffe standing in the middle of a lush green field with trees and other animals in the background on a cloudy day
**********************
>> Inference Response:
>> Giraffe
This is a case where one of the annotator answers is actually wrong. The model answers giraffe, and "giraffe" does appear in the answer list, but in reality the answer is zebra. There are also distractor "correct" answers like "wall", "roof", and "tent". In this case only the gold answer is correct.
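Based on the samples above, here is a first draft of the two prompt templates. The field names (`question`, `gold_answer`, `answers`, `response`) are my own placeholders for the sample fields shown, and the wording is a sketch to be refined against the flows collected here:

```python
# Draft prompt for the 1-10 scoring strategy.
SCORE_PROMPT = """You are grading an answer to a visual question.
Question: {question}
Gold answer: {gold_answer}
Other annotator answers: {answers}
Model response: {response}

On a scale of 1-10, how correct is the model response? A response that
exactly matches the gold answer (or is clearly equivalent) scores 10; a
response mixing a correct answer with an incorrect one scores around 5-6;
a fully incorrect response scores at the bottom of the scale. Reply with
only the number."""

# Draft prompt for the yes/no binary strategy (scored from logits).
YES_NO_PROMPT = """You are grading an answer to a visual question.
Question: {question}
Gold answer: {gold_answer}
Model response: {response}

Is the model response a correct answer to the question, or equivalent to
the gold answer? Reply with only Yes or No."""

# Filling the scoring template with the first sample above.
prompt = SCORE_PROMPT.format(
    question="What are the clear circles on the front of the car made of?",
    gold_answer="glass",
    answers="glass, lights, silver rod",
    response="Glass and metal.",
)
```

Few-shot examples from the graded samples above could be prepended to either template once the target scores feel stable.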
Also, check section 5.1.1 to see whether others have done similar NLP games.