17/02/2026 12:00 PM - Lecture
Start with end of lec 11.
Advantage
$A(s, a) = Q(s, a) - V(s)$
Wait, so theoretically the advantage is negative for any non-optimal action, and 0 when we take the optimal action (assuming $V(s) = \max_a Q(s, a)$, i.e. the greedy/optimal policy).
Dueling DQN: Have one head for the value function and one for the advantage. So predict how good the state is and how much better each action is than the others. The backbone is shared.
The selection of the action depends only on the advantage, since V is the same for every action in a given state.
Why? It removes the correlation between action outputs and requires the network to find the contrast between the actions. Otherwise the network will often focus on learning the value because of the way the loss is set up.
So then what is the loss? Have we now introduced a hyperparameter to weight how much we care about the advantage vs. the value? Or I guess not, because the loss is still applied to the Q value? So maybe I'm wrong about my intuition of why this is better. Look back at the lecture to understand that slide.
We get non-unique results if we add a constant to A and subtract the same constant from V. A solution is to force the advantage to be a real advantage by setting $Q = V + (A - \max_{a'} A(s, a'))$. Now if A is a valid advantage the subtracted term is 0. Basically we have a modified advantage function that matches our definition of advantage. There are apparently other methods.
Another solution is to subtract the mean over actions. This also ensures uniqueness, but means that V and A no longer strictly fit their definitions. We still get the benefit of small-magnitude A values, and this is more stable in general.
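To pin down how the two heads recombine, here's a minimal sketch of a dueling head (my own reconstruction in PyTorch, not code from the lecture):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture head: shared features in, Q values out."""

    def __init__(self, feature_dim: int, num_actions: int):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)                 # V(s): one scalar per state
        self.advantage = nn.Linear(feature_dim, num_actions)   # A(s, a): one per action

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)       # shape (batch, 1)
        a = self.advantage(features)   # shape (batch, num_actions)
        # Mean-subtraction for identifiability; swapping in
        # a.max(dim=1, keepdim=True).values gives the max variant above.
        return v + (a - a.mean(dim=1, keepdim=True))
```

If I understand correctly, the loss never touches V or A directly; it is still the usual TD loss on the recombined Q, which would partly answer my question above.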
Does this relate to the baseline later on? Does it also ensure uniqueness? I don't think so, but good to look at.
A: It is still possible for A to drift to $\infty$ with these, but it probably doesn't matter in practice, because you can just add some regularization to stabilize training.
Multi-step target: Expand the Q value around the optimal policy until the last step where we take the max to get the target.
Wait how do we have the optimal policy? Is this the current policy? Oh this is just the theory.
Go back and study SARSA to understand. I think this makes the training on-policy no matter what, since the multi-step reward depends on the current policy. So maybe for systems with sparse rewards that still show up every N-ish steps, it is useful to stabilize your learning of the Q value.
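Writing the target out for myself (my notation, so double-check against the slides): with $n$ steps we accumulate real rewards and only bootstrap with the max at the end,

$$y_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s_{t+n}, a').$$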
Prioritized Replay: Instead of sampling uniformly from your replay buffer, sample in proportion to the TD error raised to a hyperparameter exponent that controls how strong the prioritization is.
Oh, is this useful for my research? Should I sample in proportion to the magnitude of the advantage for faster training? Should look up research on this. Maybe that's doubling up on the advantage?
We do need to weight the gradient update based on the importance weight to correct the bias.
I would like to see the proof that this cancels the bias.
It is probably quite similar to the policy gradient version.
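A minimal sketch of how I understand the sampling and the correction fitting together (my reconstruction; the $\alpha$/$\beta$ names follow the prioritized replay paper's convention, not necessarily the lecture's):

```python
import numpy as np

def sample_batch(td_errors: np.ndarray, batch_size: int,
                 alpha: float = 0.6, beta: float = 0.4, eps: float = 1e-6):
    # Sampling probability proportional to |TD error|^alpha.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    # Importance weights correct the bias from non-uniform sampling;
    # normalizing by the max keeps updates from blowing up.
    n = len(td_errors)
    weights = (n * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights
```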
Distributional RL: Useful for risk-sensitive environments. Instead of the Bellman operator estimating the mean value, you learn to predict the entire distribution of returns. The conventional way to do this is to quantize the distribution.
This is very interesting for my research. Specifically the paper on stopping. This can give the distribution over AE. We would expect to get things like multi-modal distributions if you ask a risky question, and just a narrowing of the distribution with generic questions.
- [ ] @TODO Look into distributional RL for research.
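To make the quantization idea concrete, a tiny sketch (my reconstruction, C51-style; the atom range is an assumption): the network outputs a categorical distribution over fixed return atoms, and the scalar Q value is just its expectation.

```python
import numpy as np

# Fixed support z_1..z_51 over an assumed return range.
atoms = np.linspace(-10.0, 10.0, num=51)

def q_from_logits(logits: np.ndarray) -> float:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over atoms
    return float(np.dot(probs, atoms))    # Q(s, a) = E[Z(s, a)]

print(q_from_logits(np.zeros(51)))        # uniform distribution -> mean of support (0.0)
```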
Rainbow paper: Analyzed the relative performance of different methods. Showed that not only are these useful alone, they are useful together.
Policy iteration basically has two steps: Estimate the value of the policy, improve the policy using the value.
These correspond to the critic and actor in actor-critic methods.
Critic: Estimate Q function for the given policy. We can use TD, double-Q, etc...
Wait how? I thought we needed discrete actions for these? Or maybe you can use the same strategies, but with continuous outputs?
A: Yea, it's different. Now we go back to the old idea of inputting state and action and getting out Q value.
Actor is trained differently. We need to derive policy gradients for this.
Policy is now a stochastic model $\pi_w(u | x)$.
So input is state and output is usually the distribution over actions.
Some algorithms just output an action, which is a sample from the underlying distribution.
Why not just discretize actions? Uncertainty about where to discretize, and the joint discretized space grows exponentially with the number of continuous action outputs.
How do we parameterize the distribution of the policy?
If we have a finite action space, we can use a categorical distribution with something like a Gibbs policy (which is softmax/Boltzmann exploration).
Why not just use DQN in this case? Do we still outperform?
It depends on the size of the action space. He says up to around 30 actions should be Q-learning; for larger spaces, probably swap to policy gradients.
A possibility for an infinite action space is something like a Gaussian policy.
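A minimal sketch of a Gaussian policy head (my reconstruction; the class and parameter names are mine, and the state-independent log-std is one common choice among several):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Continuous-action policy: outputs a Normal distribution over actions."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Linear(state_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned, state-independent

    def forward(self, state: torch.Tensor):
        dist = torch.distributions.Normal(self.mean(state), self.log_std.exp())
        action = dist.rsample()  # reparameterized sample, keeps gradients flowing
        return action, dist.log_prob(action).sum(-1)  # log pi_w(u | x) for the gradient
```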
Policy Gradient Theorem: Policy iteration depends on having finite actions (for the greedy improvement step). Go back to the original goal of maximizing the value function of a policy parameterized by $w$, and take the gradient with respect to $w$ to learn.
Difficult to know how the Q value changes when $w$ changes. The critic is only indirectly a function of $w$, through training on data generated by the policy $\pi_w$, so we cannot directly take the derivative.
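For reference (from memory, not from this slide): the policy gradient theorem gets around this, because the gradient ends up not needing $\partial Q / \partial w$ at all:

$$\nabla_w J(w) = \mathbb{E}_{x, u \sim \pi_w}\!\left[ \nabla_w \log \pi_w(u \mid x)\, Q^{\pi_w}(x, u) \right].$$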