11/03/2026 6:53 PM - SNU Presentation Prep

⬅️ [11/03/2026 4:00 PM - Sebastian](<./11_03_2026 4_00 PM - Sebastian.md>) | ⬆️ [2026 - March](<./README.md>) | [12/03/2026 9:41 AM - SNU Meeting](<./12_03_2026 9_41 AM - SNU Meeting.md>) ➡️

Uhhhhh what do I do this on?

Formulating a Bellman equation for optimal stopping?
Stopping criteria in general?

Reward shaping?

No, I think stopping and how we choose a stopping criterion.

Something about why we want to continue even when we could get it correct by inferring. More data is good, especially for learning to ask questions in deeper layers. It also helps because even when the model can guess right, a human may still see multiple possibilities, so asking a later question can disambiguate further.

Start with the problem of infer vs. defer. Note that it is a decision that can be made separately from finding a good clarifying question, but that technically the decision does affect the optimal policy.

Formalize the task as an MDP, like I do in the paper. Note that we treat the infer/defer decision as a function of state, but do not learn it; we instead compute it classically from information we have access to.
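One classical, non-learned way to compute the infer/defer decision from the state could be a threshold on the entropy of the model's answer distribution — a hypothetical sketch for the slides (the entropy rule, threshold value, and function names are illustrative, not the paper's actual method):

```python
import math

def infer_or_defer(answer_probs, entropy_threshold=0.5):
    """Classically decide infer vs. defer from the model's current answer
    distribution. Hypothetical rule: infer when the distribution is
    concentrated (low entropy), defer and ask a question otherwise."""
    entropy = -sum(p * math.log(p) for p in answer_probs if p > 0)
    return "infer" if entropy <= entropy_threshold else "defer"

print(infer_or_defer([0.97, 0.01, 0.01, 0.01]))  # confident -> "infer"
print(infer_or_defer([0.25, 0.25, 0.25, 0.25]))  # uncertain -> "defer"
```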

Present the Bellman equation with the infer/defer decision integrated. Then give the formulation as optimal stopping and present the Bellman equation for that. Explain that this makes training possible, but that the choice of infer/defer decision is largely arbitrary; it is just one reasonable choice.
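The optimal-stopping Bellman equation could look something like the following sketch, where $R_{\text{stop}}$, $c$, $T$, and $\gamma$ are assumed notation (stopping = infer an answer now; continuing = ask another clarifying question):

```latex
% Stop (infer now, collect terminal reward R_stop(s)) or continue
% (ask clarifying question a, pay cost c, transition to s').
V(s) = \max\Big\{
    \underbrace{R_{\text{stop}}(s)}_{\text{infer now}},\;
    \underbrace{\max_a \mathbb{E}_{s' \sim T(\cdot \mid s, a)}
        \big[ -c + \gamma V(s') \big]}_{\text{defer / ask}}
\Big\}
```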

Then present our desired infer/defer decision as a bound on the expected error. Then present distributional RL as a way to get the distribution of expected error and compute the value we need.
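If the error distribution is represented by K learned quantile atoms (as in quantile-based distributional RL, e.g. a QR-DQN-style head), then P(error < z) can be read off as the fraction of atoms below z — a minimal sketch with illustrative names:

```python
import numpy as np

def p_error_below(quantile_values, z):
    """Estimate P(error < z) from learned quantile estimates of the error
    distribution. Each of the K quantile atoms carries probability mass
    1/K, so the probability is just the fraction of atoms below z."""
    q = np.asarray(quantile_values)
    return float(np.mean(q < z))

# Toy example: 10 evenly spaced quantile estimates of expected error
quantiles = np.linspace(0.05, 0.95, 10)
print(p_error_below(quantiles, 0.5))  # -> 0.5
```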

End by saying that this can be integrated at inference time to bring the Bellman equation back to a probabilistic decision, and show a new form where p(infer | state) is replaced by p(error < z). Point out that z is a hyperparameter that in practice depends on the situation: in a medical question-answering task it should probably be a very low value, while in lower-stakes settings like a chatbot it might be better to prioritize effort and not ask unnecessary follow-ups.
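The resulting inference-time form could be sketched as follows, where $E(s)$ is the random error at state $s$ and $R_{\text{stop}}$, $c$, $\gamma$, $T$ are assumed notation for the terminal reward, question cost, discount, and transition model:

```latex
% Probabilistic decision: mix the infer and defer branches by the
% probability that the error at s falls below the tolerance z.
V(s) = \Pr\big(E(s) < z\big)\, R_{\text{stop}}(s)
     + \big(1 - \Pr\big(E(s) < z\big)\big)
       \max_a \mathbb{E}_{s' \sim T(\cdot \mid s, a)}
       \big[ -c + \gamma V(s') \big]
```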

