
# 20/01/2026 10:11 AM - SK Presentation Prep

⬅️ [15/01/2026 2:00 PM](<./15_01_2026 2_00 PM.md>) | ⬆️ [Lab Meetings](<./README.md>) | [22/01/2026 8:00 AM - SK Team](<./22_01_2026 8_00 AM - SK Team.md>) ➡️


## Components

- Datasets
- Ambiguity categorization
- Current small model capabilities
- Current large model capabilities
- Methods
- Progress
- Future

- [ ] Justify why the type of ambiguity is important before presenting the datasets. This can probably go inside the problem statement.
- [ ] Get sizes of datasets

I am thinking that before we launch into the datasets, we need to present the problem statement and the types of questions we can ask that are of scientific interest:

1. Checking whether models can accurately produce ambiguity that matches the way humans are ambiguous. We can start by using existing datasets, but we later want to design a human experiment to get a more accurate picture. Basically: do LLM-generated ambiguous VQA datasets match the data distribution of actual human data?
2. Evaluating "depth-accuracy curves", with the number of clarifying questions on the x axis and accuracy on the y axis, over multiple models as well as humans (through the same human experiment). This allows direct comparison to human performance on the clarification-dialog task, which is not currently available in the literature. (I would like advice on further metrics that would be useful for comparing humans and LLMs on this task.)
3. Proposing a method to learn a policy for multi-turn clarification dialogs that can be easily tuned to balance accuracy and effort.
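The depth-accuracy curve is cheap to compute once we have per-example logs; a minimal sketch (the record format is a hypothetical stand-in for whatever the evaluation harness emits):

```python
from collections import defaultdict

def depth_accuracy_curve(records):
    """Accuracy as a function of the number of clarifying questions asked.

    records: iterable of (num_clarifying_questions, answer_correct) pairs.
    Returns a dict mapping depth -> accuracy, ready to plot with depth on
    the x axis and accuracy on the y axis.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for depth, is_correct in records:
        totals[depth] += 1
        correct[depth] += int(is_correct)
    return {d: correct[d] / totals[d] for d in sorted(totals)}

# Toy example: accuracy improves as more clarifying questions are asked.
records = [(0, False), (0, True), (1, True), (1, True), (2, True)]
curve = depth_accuracy_curve(records)
# curve == {0: 0.5, 1: 1.0, 2: 1.0}
```

Running the same function over model logs and human-experiment logs gives directly comparable curves on shared axes.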

## Datasets

Our theme here is the different ways we can get ambiguous data and the trade-offs we incur. Basically, I am arguing that we trade off between "good labels on the type of ambiguity", "amount of data (a function of the human effort to create it)", and "matching the real human data distribution".

There is also a fourth consideration: whether we get ground-truth unambiguous questions.

1. Take unambiguous questions and remove information from them to make them ambiguous.
   1. Pros: Lets us choose the type of ambiguity. Automatically gives us a ground-truth unambiguous question to bootstrap training.
   2. Cons: Relies on extensive human work to convert the questions, or on LLMs with verification.
   3. Example: ClearVQA
2. Find ambiguous questions within existing datasets.
   1. Pros: Data distribution matches human ambiguity. Finding ambiguous questions often requires defining the type of ambiguity.
   2. Cons: Relies on heuristics to find ambiguous questions. Does not naturally produce ground-truth unambiguous questions. Produces relatively small datasets.
   3. Example: Ambiguous subset from "Why did the chicken cross the road?"
3. Find a domain in which ambiguity is a natural part of the problem.
   1. Pros: Data distribution matches human ambiguity within the problem. Often rich in data.
   2. Cons: Restricted domain of questions. Does not produce the type of ambiguity. Not explicitly intended for ambiguous question answering. Converting it into an ambiguous VQA dataset would be a research project in itself.
   3. Example: Derm1M

We get a pick-two among our three desires for a dataset. However, this may not be as big a problem as it appears. As long as LLM-produced ambiguous VQA datasets have a data distribution similar to human ambiguity, training on a dataset like ClearVQA should produce a model capable of operating over human ambiguity. This can be evaluated by using a dataset such as the ambiguous subset as a test set.
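A cheap first-pass check on whether an LLM-produced dataset matches the human distribution is to compare category frequencies (e.g. ambiguity-type labels) between the two sets. A sketch using total variation distance; the labels below are illustrative, not from any of the datasets above:

```python
from collections import Counter

def total_variation(sample_a, sample_b):
    """Total variation distance between the empirical category
    distributions of two samples: 0 = identical, 1 = disjoint support."""
    pa, pb = Counter(sample_a), Counter(sample_b)
    na, nb = len(sample_a), len(sample_b)
    cats = set(pa) | set(pb)  # Counter returns 0 for missing categories
    return 0.5 * sum(abs(pa[c] / na - pb[c] / nb) for c in cats)

# Illustrative ambiguity-type labels for a human set and an LLM set.
human = ["referential"] * 6 + ["lexical"] * 3 + ["scope"] * 1
llm = ["referential"] * 4 + ["lexical"] * 4 + ["scope"] * 2
tv = total_variation(human, llm)  # ≈ 0.2
```

This only compares marginal label frequencies, so it is a necessary-but-not-sufficient check; the human experiment would still be needed to compare the distributions of the questions themselves.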

## Methods

## Future directions

- Is training a regression head the best way to predict ambiguity?
- It would be nice to generate my own dataset based on this research.
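Before deciding whether a regression head is the best way to predict ambiguity, a linear head over frozen encoder embeddings is a cheap baseline to beat. A sketch in plain Python with synthetic stand-ins for the embeddings and ambiguity scores (no real encoder is assumed here):

```python
import random

def fit_linear_head(X, y, lr=0.1, epochs=2000):
    """Fit a linear regression head (w, b) on frozen feature vectors X
    to predict a scalar ambiguity score y, via batch gradient descent
    on mean squared error."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            grad_b += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Synthetic "embeddings" whose first coordinate drives the ambiguity score.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(50)]
y = [2.0 * x[0] + 0.5 for x in X]
w, b = fit_linear_head(X, y)
```

If this baseline is already strong, a learned head is doing little beyond reading information the encoder exposes linearly, which would argue for comparing it against alternatives such as directly prompting a model for an ambiguity judgment.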

One option: run a human experiment that also collects examples of humans making a question ambiguous, and use those examples to better prompt a model to make questions ambiguous.


⬅️ [15/01/2026 2:00 PM](<./15_01_2026 2_00 PM.md>) | ⬆️ [Lab Meetings](<./README.md>) | [22/01/2026 8:00 AM - SK Team](<./22_01_2026 8_00 AM - SK Team.md>) ➡️