12/03/2026 12:30 PM

⬅️ [12/03/2026 9:41 AM - SNU Meeting](<./12_03_2026 9_41 AM - SNU Meeting.md>) | ⬆️ [2026 - March](<./README.md>) | [12/03/2026 1:30 PM - Lab Meeting](<./12_03_2026 1_30 PM - Lab Meeting.md>) ➡️

From RL class

Take insights from empirical regret minimization for my tree thing; I was thinking about that previously. Think about the advantage and the maximum score over all descendants of the current node.
"Regret is the difference between the payoff you could have received if you had perfect information (the optimal decision in hindsight) and the payoff you actually received based on your model's decision."
We know the optimal policy at train time, so we can use empirical risk minimization.
I should go through the math and see whether my current baseline is already doing this implicitly.
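A minimal sketch of that regret idea over the tree, assuming each node stores a scalar score and a list of children (the `Node` layout and names are hypothetical, not the current implementation):

```python
# Hypothetical sketch: regret of a node choice relative to the best
# score achievable anywhere in the tree (the "optimal in hindsight").
from dataclasses import dataclass, field

@dataclass
class Node:
    score: float
    children: list = field(default_factory=list)

def best_descendant_score(node: Node) -> float:
    """Maximum score over the node and all of its descendants."""
    best = node.score
    for child in node.children:
        best = max(best, best_descendant_score(child))
    return best

def regret(root: Node, chosen: Node) -> float:
    """Best achievable score in the tree minus the score we actually got."""
    return best_descendant_score(root) - chosen.score

# Toy example
leaf_a, leaf_b = Node(0.25), Node(0.75)
root = Node(0.1, [leaf_a, leaf_b])
print(regret(root, leaf_a))  # 0.75 - 0.25 = 0.5
```

The per-node advantage against a baseline would be the analogous quantity computed against the baseline's value rather than the global best.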

For using UCB in my system, we could first generate a large number of potentials and then cluster them using entailment. We would then choose which cluster to expand based on the number of times we have explored that path. This would replace the current expansion strategy of selecting a path at random.


⬅️ [12/03/2026 9:41 AM - SNU Meeting](<./12_03_2026 9_41 AM - SNU Meeting.md>) | ⬆️ [2026 - March](<./README.md>) | [12/03/2026 1:30 PM - Lab Meeting](<./12_03_2026 1_30 PM - Lab Meeting.md>) ➡️