
⬅️ [26/02/2026 12:05 AM](<./26_02_2026 12_05 AM.md>) | ⬆️ [2026 - February](<./README.md>)

26/02/2026 10:12 AM - Presentation Prep

GRPO

Prior Algos and Motivation

Start with policy gradient for large action spaces. Introduce the concept of a baseline for variance reduction and small step size for stabilizing training.
- Note that we need the baseline to be independent of the chosen action.
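
For the slide, the standard score-function gradient with a baseline (the baseline $b(s)$ must not depend on the sampled action for the estimator to stay unbiased):

$$
\nabla_\theta J(\theta) = E_{a \sim \pi_\theta(\cdot \mid s)}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \big(R(s, a) - b(s)\big) \right]
$$

Subtracting $b(s)$ changes only the variance, not the expectation, precisely because $b(s)$ is independent of $a$.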

Move to natural policy gradient. Computationally infeasible for large models.
Why do we need the Fisher information matrix? What is it actually optimizing for?
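
For the "what is it actually optimizing" question: the natural gradient step maximizes the first-order improvement subject to a KL trust region in distribution space, which is where the Fisher information matrix $F$ comes in (standard result, stated here for the slide):

$$
\theta_{k+1} = \theta_k + \alpha\, F^{-1} \nabla_\theta J(\theta_k), \qquad F = E\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]
$$

$F$ is the second-order expansion of the KL divergence around $\theta_k$, so step size is measured by how much the policy's distribution changes rather than how much the parameters change. Storing and inverting $F$ is exactly what is infeasible at LLM scale.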

Move to PPO. Still widely used. Use a heuristic for step size instead of second-order optimization. Note that we still have a very heavy value model which, in industry, is often trained as a separate model of almost the same size as the original. In research we are often just training a LoRA, which means we can train the value model as a LoRA as well (or even just as a head on the base model).
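
The clipped surrogate that replaces the second-order trust region, with per-token importance ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{old}(a_t \mid s_t)$ (standard PPO objective):

$$
L^{\text{CLIP}}(\theta) = E_t\left[ \min\left( r_t(\theta)\, \hat A_t,\; \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat A_t \right) \right]
$$

The value model's job is producing $\hat A_t$ (e.g. via GAE), and that is exactly the piece GRPO removes.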

GRPO

Remove that value model. Make it even simpler. If we have multiple rollouts from a single prompt, this gives us another way to create a baseline. Functions that take all sampled actions, $f(x) = g(x, \{a_1, a_2, \ldots, a_n\})$, marginalize out the action and are also valid choices.
- Pretty sure there is a very general class of functions that satisfy this. The expectation is one, but so are mins and maxes, and general distributional operations for distributional RL (although we then generally take an expectation over that for the value function when we compute the advantage).
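
Concretely, with $G$ rollouts for one prompt and rewards $\{r_1, \ldots, r_G\}$, the group itself supplies the baseline; this is the group-normalized advantage from the GRPO paper:

$$
\hat A_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}
$$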

Why is the KL regularization to the base (SFT) model so important in LLMs? Basically the state space is so huge that LLMs can easily reward hack and start generating meaningless tokens.
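
For the KL term itself, the GRPO paper uses an unbiased, always non-negative per-token estimator rather than the naive log-ratio (worth a bullet, since this is what implementations actually compute):

$$
D_{KL}\left[ \pi_\theta \,\|\, \pi_{ref} \right] \approx \frac{\pi_{ref}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})} - \log \frac{\pi_{ref}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})} - 1
$$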

GRPO Practicalities

  1. $\pi_{old}$: The logprobs of the tokens we sample when we roll out the trajectories. These can be stored in the sidecar as we are doing the generation. They are used to compute the importance ratio in the loss.
  2. $\pi_{ref}$: The logprobs of the trajectory tokens under the base model. These are used to estimate the KL divergence between the new policy and the reference policy. This is generally computed right before the loss is calculated since it is just a forward pass and does not increase VRAM usage. You just disable the PEFT adapter and compute the logprobs.
  3. $\pi_{\theta}$: The logprobs of the trajectory tokens under the current model. Used for both the KL divergence and the importance ratios.
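
A minimal per-token loss sketch tying the three together. The tensor names are hypothetical; it assumes the logprobs above are already gathered for the sampled token IDs and that the group-normalized advantage is broadcast per token:

```python
import torch

def grpo_loss(logp_theta, logp_old, logp_ref, advantages, mask,
              clip_eps: float = 0.2, kl_beta: float = 0.04):
    """Per-token GRPO loss from precomputed token logprobs.

    logp_theta: logprobs under the current (trainable) policy, shape [B, T]
    logp_old:   logprobs recorded at rollout time (pi_old),    shape [B, T]
    logp_ref:   logprobs under the frozen reference model,     shape [B, T]
    advantages: group-normalized advantage per sequence,       shape [B, 1]
    mask:       1 for generated tokens, 0 for prompt/padding,  shape [B, T]
    """
    # Importance ratio between the current policy and the rollout policy.
    ratio = torch.exp(logp_theta - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = -torch.min(ratio * advantages, clipped * advantages)

    # Unbiased per-token KL estimator to the reference policy:
    # exp(logp_ref - logp_theta) - (logp_ref - logp_theta) - 1
    log_ratio_ref = logp_ref - logp_theta
    kl_term = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    per_token = policy_term + kl_beta * kl_term
    return (per_token * mask).sum() / mask.sum()
```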

What do we have to use in terms of vLLM and huggingface for training?
Extracting logprobs and the actual token IDs from vLLM requires specific generation parameters; find that webpage that shows what happens when you don't use them.
During training we use PEFT for the LoRA, which lets us turn the adapter off to compute the reference logprobs needed for the divergence between the current and reference model (and potentially the value function).
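
A rough sketch of both mechanics. The vLLM side assumes the offline `LLM.generate` API with `SamplingParams(logprobs=...)` so the sampled token IDs and their logprobs come back with the request (field names can differ between vLLM versions), and the checkpoint name is a placeholder; the PEFT side relies on `disable_adapter()` on the wrapped model so the reference logprobs come from the same weights with the LoRA switched off:

```python
import torch
from vllm import LLM, SamplingParams

# --- Rollout side: keep the sampled token IDs and their logprobs (pi_old). ---
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder checkpoint
params = SamplingParams(temperature=1.0, max_tokens=512, n=8, logprobs=1)

outputs = llm.generate(["What is 2 + 2?"], params)
for completion in outputs[0].outputs:          # n completions = the GRPO group
    token_ids = list(completion.token_ids)
    # completion.logprobs: one {token_id: Logprob} dict per generated token.
    old_logprobs = [completion.logprobs[i][tid].logprob
                    for i, tid in enumerate(token_ids)]

# --- Training side: reference logprobs by switching the LoRA off (pi_ref). ---
def token_logprobs(model, input_ids, attention_mask):
    """Logprob of each realized next token under `model`."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    return torch.gather(logp, 2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# `peft_model` is the PEFT-wrapped policy (base model + LoRA adapter):
# logp_theta = token_logprobs(peft_model, input_ids, attention_mask)
# with torch.no_grad(), peft_model.disable_adapter():
#     logp_ref = token_logprobs(peft_model, input_ids, attention_mask)
```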

Tree Relative Policy Optimization

Not to be confused with Tree Policy Optimization, which somebody else did.

This generalizes GRPO by allowing generation to continue in rounds until a decision is made. It applies both to single-player games, such as chain of thought or multiple tool calls, and to multi-player games, such as my clarification question task.

The reward function in this case closely follows the usual Bellman equation, except that we do not have to use single-sample Monte Carlo rollouts because we generate the full trees.
A second problem is how we deal with the ability to infer or defer.
Two solutions:
1. Train distributional RL and have a risk-averse policy that uses a threshold on the probability of error at inference time. I want this to be the eventual solution, as it theoretically converges to a policy that balances risk well.
2. Assume a specific form of the infer/defer policy. In this case the natural approach is to use an optimal stopping policy. This policy uses its perfect knowledge of the future to always make the decision that maximizes value. This biases a policy toward one that is more willing to take larger risks, but removes the need for the value function.
We compute the reward as:
$$
V(x_k) = \max \left\lbrace \begin{array}{l} \text{Infer}(x_k) \\ E\!\left[ r(x, u) + V(x_{k+1}) \right] \end{array} \right.
$$
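
A sketch of how this optimal-stopping value could be computed by backward induction over a generated tree. The `Node` structure is hypothetical (`infer_reward`, `step_reward`, `children`); this is solution 2, where the stopping decision gets perfect hindsight over the tree we already expanded, with the expectation estimated by averaging over the expanded children:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    infer_reward: float            # reward if we commit to an answer (Infer) here
    step_reward: float = 0.0       # immediate reward r(x, u) for reaching this node
    children: list["Node"] = field(default_factory=list)

def stopping_value(node: Node) -> float:
    """V(x_k) = max( Infer(x_k), E[ r(x, u) + V(x_{k+1}) ] )."""
    if not node.children:
        return node.infer_reward
    continue_value = sum(c.step_reward + stopping_value(c)
                         for c in node.children) / len(node.children)
    return max(node.infer_reward, continue_value)
```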

The theory here is relatively simple. We have two objectives:
1. At each decision point, we want to be able to reduce variance and intuitively make it easy for the model to tell which decision was better and which was worse.
2. We also want to know how impactful the decision was in terms of the entire tree.

This is where we differ slightly from GRPO. The definition of our group becomes the entire tree, and to satisfy the second objective we normalize our advantage by the standard deviation of the reward over the entire tree. However, when we compute the baseline, we still want a "better" and "worse" option, so we compute the mean over the siblings.
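
A minimal sketch of that advantage computation, assuming a hypothetical flat tree representation in which each node has a reward and a parent id; the baseline is the sibling mean, the normalization is the tree-wide standard deviation:

```python
import statistics
from collections import defaultdict

def tree_relative_advantages(rewards: dict[int, float],
                             parents: dict[int, int | None]) -> dict[int, float]:
    """Advantage per node: (reward - mean over siblings) / std over the whole tree."""
    tree_std = statistics.pstdev(rewards.values()) or 1.0  # guard against zero variance

    # Group nodes by their parent so siblings share a baseline.
    siblings = defaultdict(list)
    for node_id, parent_id in parents.items():
        siblings[parent_id].append(node_id)

    advantages = {}
    for node_id, reward in rewards.items():
        group = siblings[parents[node_id]]
        baseline = sum(rewards[s] for s in group) / len(group)
        advantages[node_id] = (reward - baseline) / tree_std
    return advantages
```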


⬅️ [26/02/2026 12:05 AM](<./26_02_2026 12_05 AM.md>) | ⬆️ [2026 - February](<./README.md>)