06/02/2026 10:52 PM
The damn AI was trying to be flattering all the time, but this conversation was somewhat useful for getting the ideas down.
I think it makes sense to frame the game as expectimax (which might correspond to MCTS?).
We roll out all the trajectories, with leaves being the points where the model gets the answer right (or perhaps we sample multiple times, count how many it gets right, and only give credit if it's confident). Leaves that never reach the right answer get 0 credit. Compute the group-relative advantage for each leaf, then propagate the advantage back through the tree, at each node taking the expected value of the children times a discount factor to incentivize getting to the answer quickly.
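A minimal sketch of the backup step described above, assuming a simple recursive tree (the `Node` structure and field names are hypothetical, and the group-relative advantage here is just reward minus group mean, without std normalization):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)
    value: float = 0.0      # leaves hold their advantage; internal nodes filled by backup
    is_leaf: bool = False

def group_relative_advantage(rewards):
    # Leaves that never reach the right answer contribute 0 raw reward;
    # advantage is each leaf's reward minus the group mean.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def backup(node, gamma=0.9):
    # Expectimax-style backup: each internal node's value is the discounted
    # expectation (here a plain average) of its children's values, so longer
    # paths to the answer get geometrically less credit.
    if node.is_leaf:
        return node.value
    vals = [backup(child, gamma) for child in node.children]
    node.value = gamma * sum(vals) / len(vals)
    return node.value
```

With a non-uniform proxy for branch probabilities, the plain average would become a probability-weighted sum.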
Before I proceed with this, I really need to read "TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Tr". It is highly similar, but I do not think it plays as a game; they are trying to teach trees of thought. This could be a substantial extension to their work. It is a 2024 paper with 14 citations, so relatively important.
We can also replace the binary "correct or incorrect" signal at the end with an LLM-as-judge score of how close the model got.
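As a stand-in for that judge, here is the shape of the swap: a reward function returning a score in [0, 1] instead of a 0/1 check. The string-similarity scorer below is just a placeholder for what would really be a judge-model call:

```python
import difflib

def judge_reward(answer: str, reference: str) -> float:
    # Placeholder "how close was it" score in [0, 1]; in the real setup this
    # would be an LLM judge comparing the model's answer to the reference.
    return difflib.SequenceMatcher(None, answer.lower(), reference.lower()).ratio()
```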
A question with expectimax: do we actually try to estimate the probability of each path, or do we just pick randomly? Actually, I guess the easiest thing would be to check how many responses fall into each response cluster and use that as a proxy for probability. This should be correct in expectation as long as the clusters really do correspond to distinct ideas of how to ask/answer a question.
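The cluster-count proxy is just empirical frequencies over sampled responses. A sketch, assuming cluster assignment has already happened upstream (how responses get clustered is the open question):

```python
from collections import Counter

def cluster_probabilities(cluster_ids):
    # Empirical branch probabilities: the fraction of sampled responses that
    # landed in each semantic cluster serves as P(branch) in the expectimax backup.
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return {cluster: k / n for cluster, k in counts.items()}
```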