Skip to content

AQUA: Toward Strategic Response Generation for Ambiguous Visual Questions

⬅️ [Bayesian teaching enables probabilistic reasoning in large language models](<./Bayesian teaching enables probabilistic reasoning in large language models.md>) | ⬆️ [Reading List](<./README.md>) | [Acknowledging Focus Ambiguity in Visual Questions](<./Acknowledging Focus Ambiguity in Visual Questions.md>) ➡️

AQUA: Toward Strategic Response Generation for Ambiguous Visual Questions
https://openreview.net/forum?id=7b1MpD6IF8

Introduce a dataset and concept of levels of ambiguity. They classify into 4 levels.

Similar idea to mine, except we want to use a continuous version.

Dataset generated using the COCO dataset. Look at objects in each image and find where there may be ambiguity. Speicifically targets objects and asks questions about them. Levels of ambiguity depend on the number and prominance of similar objects. Generates questions whole cloth using an image and the class of ambiguous object.

Does have a model with multi-step clarification for level 3 ambiguity?
Uses LLM (gpt5 mini) as a judget to assign reward for correct or incorrect clarification.

Use SFT and GRPO
What are they predicting with SFT? Level of ambiguity? Then what would the GRPO be for?

It is still unclear to me what they are getting the VLM to do. Do they predict the final answer? Is that what the GRPO is based on?

3.6K for training and 3.6K for evaluation.
Each split is evenly balanced across the four ambiguity levels, with 0.9K instances per level.

Evaluation:
Accuracy of prediction level of ambiguity

They don't have accuracy of final answer? Yea, it is only if it chose the right strategy, not if it gets the correct final answer.

Screenshot 2026-02-04 at 5.55.49 PM.png

Screenshot 2026-02-04 at 5.56.02 PM.png

Screenshot 2026-02-04 at 5.56.11 PM.png

Screenshot 2026-02-04 at 5.56.20 PM.png


⬅️ [Bayesian teaching enables probabilistic reasoning in large language models](<./Bayesian teaching enables probabilistic reasoning in large language models.md>) | ⬆️ [Reading List](<./README.md>) | [Acknowledging Focus Ambiguity in Visual Questions](<./Acknowledging Focus Ambiguity in Visual Questions.md>) ➡️