12/02/2026 8:03 PM
Questions coming up on how to compute the advantage.
Once we have the reward at the leaves we can propagate it back through the tree as the expected sum of future rewards, as is standard in RL. But then what? The two things to think about are scale and baseline.
You can subtract anything from the rewards as long as what you subtract is independent of the actions that the Q-values depend on. So do you subtract off the average reward for the entire batch? For the individual tree? For just the siblings? It depends on what you are trying to achieve. It seems most reasonable to me to subtract off the sibling mean so that you are reinforcing or disincentivizing behavior at each decision point, the logic being that you shouldn't get reward just for being in a good branch of the tree that started with a good question.
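A minimal sketch of the sibling-mean option, assuming `child_rewards` holds the propagated rewards of the children at one decision point (names are hypothetical):

```python
import numpy as np

def sibling_mean_advantages(child_rewards: np.ndarray) -> np.ndarray:
    """Advantage of each child relative to the mean of its siblings.

    Subtracting the mean at the decision point means a child is only
    reinforced for being better than its alternatives, not for sitting
    in a branch that was already good.
    """
    return child_rewards - child_rewards.mean()
```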
Another possible thing to consider would be to take the idea of additional error (AE) and apply it here. Additional error is the difference between the error you can currently achieve and the best error among the descendants. I think that the best error from the descendants holds as an allowed baseline, but perhaps the expected error of the descendants would be lower variance... which I think brings us back around to having the baseline be the expected error of the siblings, because of the linearity of expectation. The AE given the max future reward, by contrast, would correspond to a minimax version of the problem.
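One possible formalization, with hypothetical notation $e(s)$ for the error at node $s$ and $\mathcal{D}(s)$ for its descendants: the minimax flavor is $\mathrm{AE}_{\max}(s) = e(s) - \min_{d \in \mathcal{D}(s)} e(d)$, while the lower-variance flavor is $\mathrm{AE}_{\mathbb{E}}(s) = e(s) - \mathbb{E}_{d \sim \mathcal{D}(s)}[e(d)]$. The linearity argument is that $\mathbb{E}_{d \sim \mathcal{D}(p)}[e(d)] = \mathbb{E}_{c \sim \mathrm{children}(p)}\big[\mathbb{E}_{d \sim \mathcal{D}(c)}[e(d)]\big]$, so the expected-descendant baseline of a parent decomposes into an expectation over its children, which is where the sibling-expectation baseline reappears.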
There are other options like the leave-one-out baseline, which might just be a better way of implementing the average baseline. (Why is it better to leave out the branch you are actually on from the average? Perhaps because the plain average includes the reward of the branch you are currently looking at, so the baseline is not independent of the action taken; leaving it out removes that coupling and shows how the branch differs from the rest.)
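For contrast, a sketch of the leave-one-out version (hypothetical names again); note that algebraically it is just the include-self mean baseline rescaled by $n/(n-1)$:

```python
import numpy as np

def loo_advantages(child_rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantage: each child's reward minus the mean of the others.

    Keeping the current child out of its own baseline keeps the baseline
    independent of the sampled action; algebraically the result is just
    n/(n-1) times the include-self version.
    """
    n = len(child_rewards)
    assert n >= 2, "need at least two siblings for a leave-one-out baseline"
    loo_mean = (child_rewards.sum() - child_rewards) / (n - 1)
    return child_rewards - loo_mean
```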
For scaling, this same logic doesn't hold. There will be many situations where all of the decisions the model made were pretty much equally bad, or where there is no better or worse decision to make. In either of these cases we want small gradients. If we normalized by a group-level std at that point, however, it would inflate noise-level differences and force the optimizer to consider one of the options as far superior to another. Instead, we want to consider the variance of possible reward at the tree level. We want to incentivize decisions that got us to the best outcome possible for this tree, so we should scale by the std of the rewards in the tree. That way, decisions that have the largest effect on the outcome of the tree as a whole are reinforced or suppressed.
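And a sketch of the tree-level scaling, assuming `tree_rewards` collects the propagated rewards of every CQ node in one tree:

```python
import numpy as np

def scale_by_tree_std(advantages: np.ndarray, tree_rewards: np.ndarray,
                      eps: float = 1e-8) -> np.ndarray:
    """Normalize baselined advantages by the reward spread of the whole tree.

    Sibling differences that are tiny relative to the tree-wide spread yield
    small gradients, instead of being inflated to +/-1 the way a per-group
    std normalization would inflate them.
    """
    return advantages / (tree_rewards.std() + eps)
```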
I should implement this in such a way that it is easy to swap between these options, though, as only experimentation will tell us which is really best.
Also I should read TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees and papers that cite it to make sure I am not missing something in existing theory.
12/02/2026 8:36 PM
Oh, I also got the inference model working. It's just the answer model with a different prompt. I did rollouts of entire dialog trees and they work well.
I have noticed that the CQ model often switches to giving answers once it becomes abundantly clear what the unambiguous question is. In reality this is fine, but for the purpose of the dialog trees we want only questions from our clarification model. This means we will need additional reward components. In this case, the CQ model producing something that is not a question should count as getting the wrong answer.
I was also thinking that giving negative reward for generating a question that has high biconditional entailment with questions earlier in the dialog would be good. Basically, suppress the model's tendency to keep asking the same question over and over.
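A hedged sketch of that penalty; `biconditional_entailment` is a placeholder for whatever entailment scorer ends up being used:

```python
def biconditional_entailment(q1: str, q2: str) -> float:
    """Placeholder: score in [0, 1] for 'q1 entails q2 AND q2 entails q1'.
    In practice this would wrap an NLI model run in both directions."""
    raise NotImplementedError

def repeat_question_penalty(new_question: str, earlier_questions: list[str],
                            weight: float = 1.0, threshold: float = 0.8) -> float:
    """Negative reward if the new question is effectively a repeat of an earlier one."""
    if not earlier_questions:
        return 0.0
    worst = max(biconditional_entailment(new_question, q) for q in earlier_questions)
    return -weight * worst if worst > threshold else 0.0
```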
So quick summary of what I've been up to and the future steps:
We have three models: The clarification question (CQ) model which is the policy to be trained, the answer model, and the inference model. The CQ model and the answer model are in a kind of game where the CQ model is trying to extract the unambiguous question and the answer model is trying to give true answers while concealing the maximum amount of information.
We start with the ambiguous question/image pair, and the CQ model generates a set of D clarifying questions. We use biconditional entailment to convert this list of D questions into L groups of semantically equivalent questions. We then further prune down to N groups, where N is a hyperparameter that controls the tree branching factor. We use the normalized group size as a proxy for the transition probability, since in expectation the sample frequency of a question should match the policy's probability of asking it. For each question we then run the answer model and group the answers using the same entailment process.
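A sketch of that branching step, reusing the `biconditional_entailment` placeholder from the earlier sketch; the greedy first-fit clustering and the renormalization over kept groups are my assumptions, not settled choices:

```python
def build_branches(questions: list[str], n_branches: int, threshold: float = 0.8):
    """Cluster the D sampled questions into equivalence groups, keep the N largest,
    and use normalized group size as the empirical transition probability.
    """
    groups: list[list[str]] = []
    for q in questions:
        for g in groups:
            if biconditional_entailment(q, g[0]) > threshold:
                g.append(q)
                break
        else:
            groups.append([q])
    groups.sort(key=len, reverse=True)
    kept = groups[:n_branches]          # prune from L groups down to N
    total = sum(len(g) for g in kept)   # renormalize over the kept groups
    return [(g[0], len(g) / total) for g in kept]  # (representative question, prob)
```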
For each answer we then both run the inference model, to see whether the correct answer can be derived from the information given so far, and rerun the CQ model to produce the next node in the tree.
These steps are repeated until we reach depth $D_{max}$. Technical note: we consider depth to increase by 1 after every clarifying answer, not after every new node. So depth 2 means that two pairs of clarifying questions and answers have been generated, for a path of 4 nodes + root = 5. Inference after only the ambiguous question is presented is a depth-0 inference.
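In general, then, a depth-$d$ path contains $2d + 1$ nodes (the root plus $d$ question/answer pairs), and a depth-$d$ inference conditions on $d$ QA pairs.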
We then process the trees by assigning rewards and terminal states. This is the part I am still most unsure of, because really it should depend on the probability of making an inference, which we do not have: it depends on our deferred-inference criterion, which does not yet exist. But here is the general idea:
Reward is assigned at each inference for whether the answer was correct or not. We can use LLM-as-judge to get a soft score for how close it was.
Reward is assigned at clarifying question nodes for things like suppressing repeated questions and incentivizing asking questions even deep in the tree. Reward is propagated up the tree using the true expected future reward (think about how things like punishing repeated questions could affect this: will the model learn to ask uninformative questions early so that it can avoid asking the same informative question again later on?). The question then becomes how to allocate probability between making an inference and asking another clarifying question, since this affects our expected future reward. For now, we will assume that we always choose to ask another clarifying question until an inference that is completely correct is reached.
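A minimal sketch of that propagation, under an assumed node layout (all names hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    local_reward: float = 0.0                     # e.g. inference score or repeat penalty
    value: float = 0.0                            # filled in by backup()
    children: list = field(default_factory=list)  # list of (transition_prob, Node) pairs

def backup(node: Node) -> float:
    """Propagate the expected sum of future rewards from the leaves to the root."""
    expected_future = sum(p * backup(child) for p, child in node.children)
    node.value = node.local_reward + expected_future
    return node.value
```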
Now we have a tree with rewards assigned at each node. The next step is, per the discussion above, to choose a baseline and a scale to compute the advantages. For each tree, we take the std of the rewards at the clarification question nodes as the scale. For each clarification question, we take the leave-one-out average of its siblings as the baseline.
Then we train using standard policy gradient methods where each sample has the entire dialog up to a clarification question as the context and the part to generate is the clarification question.
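Roughly, each training sample would look like the following, with the loss masked so that only the question tokens contribute (a sketch, not the actual training code):

```python
from dataclasses import dataclass

@dataclass
class PGSample:
    context: str      # entire dialog up to (not including) this clarifying question
    target: str       # the clarifying question generated at this node
    advantage: float  # baselined, tree-std-scaled advantage for this node

# The per-sample loss is the usual policy-gradient surrogate,
#   loss = -advantage * sum_t log p(target_t | context, target_<t),
# with the log-prob sum taken only over the target (question) tokens.
```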
This completes our RL loop. We then go back and generate more trees. Think about reward shaping as needed.
In the future we will train a model, similar to a value model, that predicts the additional error incurred by making an inference now rather than waiting. This gives us a principled way to decide whether to defer inference and ask another question. It could be trained alongside the CQ model to output the probability of using the current clarifying answer to make an inference.
12/02/2026 11:08 PM
I think I figured out the problem of when to make an inference and when not to. We assume that we have a perfect deferral model that knows exactly when it is better to defer, and we take some inspiration from a minimax approach. As we propagate back through the tree, when we hit a clarifying answer we check the inference that was made at that point. If it leads to a better reward than what has propagated back to that point, we take that one. So basically we take a max over two avenues: defer or infer.
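Concretely, this turns the earlier backup sketch into a max at each clarifying-answer node; `inference_reward` is a hypothetical field holding the soft correctness of inferring at that node:

```python
def backup_with_deferral(node) -> float:
    """Max over the two avenues at each clarifying-answer node: infer now or defer.

    Assumes node.inference_reward holds the (soft) correctness of the inference
    made at this node, and node.children is a list of (transition_prob, child)
    pairs as in the earlier Node sketch. A leaf has no choice but to infer.
    """
    if not node.children:
        return node.inference_reward
    defer_value = sum(p * backup_with_deferral(c) for p, c in node.children)
    return max(node.inference_reward, defer_value)
```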
I am not sure exactly how this affects the theoretical mathematics; I think it just changes the game that is being played. Even if an earlier decision would have resulted in inference, we still train on the deeper levels so the model learns how to clarify further even when a perfect infer-defer computation says it is unnecessary. This endows the model with the capability to be careful once we move to an imperfect infer-defer system.
What are possible results:
Perhaps the CQ model will learn to ask very risky questions early on, since it can get better reward now? I don't think this risk is any greater than under any other early-stopping strategy.
Gemini points out that the general topic that this falls under is optimal stopping theory.