24/02/2026 12:02 PM - Lecture
REINFORCE has high variance and takes a long time to converge, so we need a very small learning rate. Or... we do variance reduction.
Variance Reduction/Control Variates Method
Comes from stats. Can we find a better estimator with lower variance but the same mean as the value we want?
As long as the function we subtract has mean 0 (under the distribution of x), subtracting it off does not change the mean. So we want a function that is highly correlated with the original estimator, so that subtracting it suppresses the total variance.
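A minimal statement of the control variate idea (my notation, not the lecture's): for an estimator $f(x)$ and a zero-mean function $g(x)$,

$$
\hat f_c(x) = f(x) - c\,g(x), \qquad \mathbb{E}[\hat f_c] = \mathbb{E}[f] \;\text{ since }\; \mathbb{E}[g]=0,
$$
$$
\operatorname{Var}[\hat f_c] = \operatorname{Var}[f] - 2c\operatorname{Cov}(f,g) + c^2\operatorname{Var}[g],
$$

which is minimized at $c^* = \operatorname{Cov}(f,g)/\operatorname{Var}[g]$, giving $\operatorname{Var}[\hat f_{c^*}] = (1-\rho^2)\operatorname{Var}[f]$ with $\rho$ the correlation between $f$ and $g$. The more correlated the subtracted function is with the original estimator, the more variance we remove.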
In the original REINFORCE algorithm we multiply the score function by the Q value (the return) to estimate the policy gradient. We can instead use the TD error. We can do this because of the variance reduction theorem, which says we can subtract off any function as long as it is action independent. That independence is what gives us that the expected value of the score function times this "G(x)" function is always 0.
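Sketch of why the action-independent term has zero mean (written for a discrete action space; the continuous case is analogous):

$$
\mathbb{E}_{u\sim\pi_\theta(\cdot\mid x)}\big[\nabla_\theta\log\pi_\theta(u\mid x)\,G(x)\big]
= G(x)\sum_u \pi_\theta(u\mid x)\,\frac{\nabla_\theta\pi_\theta(u\mid x)}{\pi_\theta(u\mid x)}
= G(x)\,\nabla_\theta\sum_u \pi_\theta(u\mid x)
= G(x)\,\nabla_\theta 1 = 0.
$$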
The value function is a good G function because the only difference between Q and V is that Q commits to a specific action and then follows the future V. So the two are very highly correlated.
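Concretely (standard definitions, not specific to this lecture):

$$
Q^\pi(x,u) = r(x,u) + \mathbb{E}_{x'\sim p(\cdot\mid x,u)}\big[V^\pi(x')\big], \qquad
V^\pi(x) = \mathbb{E}_{u\sim\pi(\cdot\mid x)}\big[Q^\pi(x,u)\big],
$$

so subtracting $V^\pi$ leaves the advantage $A^\pi(x,u) = Q^\pi(x,u) - V^\pi(x)$, which only measures how much better the chosen action is than the policy's average.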
Other choices can be made. If we are doing controls, a Lyapunov function might be good: in practice it is highly correlated with the value, and we don't actually need to learn the value. And in something like GRPO we use a sample mean as the baseline, as in the sketch below.
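A rough sketch of the group-mean baseline in a GRPO-style update (my own toy version; the real algorithm also has a clipped ratio objective and a KL penalty, and details may differ):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for a group of rollouts from the same prompt/state.

    The baseline is the sample mean of the group's rewards, so no value
    function has to be learned; dividing by the std is a common extra
    normalization step.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()  # control variate: the group sample mean
    return (rewards - baseline) / (rewards.std() + eps)

# Example: 4 rollouts of the same prompt with scalar rewards.
print(group_relative_advantages([1.0, 0.0, 0.5, 2.0]))
```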
Remember that the expectations are over the x_k, u_k given that we start in an initial state. This separates into the policy and the probability of being in a state given the policy, since the policy is Markovian and independent of k; it only depends on the current state. Remember also that the gradient only applies to the score function.
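My attempt at writing the full expression out (a reconstruction, so the notation may not match the lecture exactly): the expectation is over whole trajectories, but it decomposes per step,

$$
\nabla_\theta J(\theta) =
\mathbb{E}_{x_0\sim\rho_0,\; u_k\sim\pi_\theta(\cdot\mid x_k),\; x_{k+1}\sim p(\cdot\mid x_k,u_k)}
\Big[\sum_k \nabla_\theta\log\pi_\theta(u_k\mid x_k)\,\big(Q^\pi(x_k,u_k) - G(x_k)\big)\Big],
$$

and because the policy is Markovian and stationary, each term only needs the marginal distribution of $x_k$ under the policy, not the whole history.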
Figure out what is going on with the x_k's. What is the expectation over, exactly? All x_k's? Or just a single step x_k?
Note that G cannot depend on the future state either (although it could technically depend on past states). It cannot depend on the decision you make at this state.
Wait, but isn't the value function a function of the future state? Oh, no, it's an expectation over the distribution of future states. But it still depends on your policy? Just not on the actual choice you made? It cannot depend on the actual future state chosen. Is this a problem for me? I don't think so? Or maybe? Because I am propagating reward back for the current rollout instead of learning the value function?
Maybe I should do the analysis taking G equal to the expectation over all future rewards for the tree and showing that it still has 0 mean?

How does this optimization perspective relate to a Taylor expansion? Does it? Should we technically have the Hessian appearing in this, and we just substitute a learning rate that we expect to overestimate the penalty?
But in general this is just a better basis for why we are using this gradient update step. What is it doing? It is finding the max of a function that has been linearized, with a penalty for deviating too far from the original point.
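The picture I have in mind (the standard proximal/trust-region view, not from the lecture slides): the plain gradient step solves

$$
\theta_{k+1} = \arg\max_\theta\; \nabla J(\theta_k)^\top(\theta - \theta_k) - \frac{1}{2\eta}\,\lVert\theta - \theta_k\rVert^2,
$$

i.e. maximize the linearization of $J$ while paying a quadratic penalty for moving the weights, with the step size $\eta$ standing in for the curvature (Hessian) term we don't compute.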
But in this case we really want to penalize changes to the output. Keeping the weights of a neural network from deviating much doesn't do much to ensure the policy doesn't change much, because of the weird nonlinear relationships.
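Swapping the weight-space penalty for a penalty on the policy's outputs gives the natural-gradient/TRPO-style step (again my own summary, hedged):

$$
\theta_{k+1} = \arg\max_\theta\; \nabla J(\theta_k)^\top(\theta - \theta_k) - \frac{1}{\eta}\,\mathbb{E}_{x}\big[\mathrm{KL}\big(\pi_{\theta_k}(\cdot\mid x)\,\Vert\,\pi_\theta(\cdot\mid x)\big)\big],
$$

which, to second order, replaces the Euclidean metric with the Fisher information, so "don't move far" is measured in policy space rather than weight space.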
This is making me wonder if there is a way to regularize my policy to make it ask similar questions, not necessarily to minimize the KL divergence. For me the difference between policies is about information gain, not absolute KL divergence. I should see if there is research using a model as the policy distance.
- [ ] @TODO Find a paper for this.