12/02/2026 11:56 AM - Lecture
Starting with end of lecture 10.
The gradient tutorial seems to kinda miss the more complex case of weights earlier in the network.
lecture-11.pdf
My prediction for DQN is that we rely on the fact that Q learning is off policy to allow for the use of a replay buffer.
Review the derivation of tabular Q learning. I'm having trouble remembering why the objective for DQN is correct for finding the optimal Q value from offline data.
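My reconstruction of the two updates for reference (not from the slides; w^- denotes the frozen target-network weights):

Tabular:  Q(s,a) \leftarrow Q(s,a) + \alpha \big( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big)
DQN:      L(w) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \big[ \big( r + \gamma \max_{a'} Q_{w^-}(s',a') - Q_w(s,a) \big)^2 \big]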
Look into tabular Q learning with a stochastic policy.
Would like to see what an objective would be for learning the Q function for an already existing policy. Would we then directly use the full rollout reward?
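My guess at such an objective (my own assumption, not from the lecture): either regress onto the full rollout return, or use a SARSA-style target with the action the policy itself takes next.

Monte Carlo:      L(w) = \mathbb{E}_{\tau \sim \pi} \big[ ( G_t - Q_w(s_t, a_t) )^2 \big], \quad G_t = \sum_{k \ge 0} \gamma^k r_{t+k}
SARSA-style TD:   y = r + \gamma Q_{w^-}(s', a'), \quad a' \sim \pi(\cdot \mid s')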
Interesting that the label depends on the Q function you are learning. Network updating => labels change.
We need to ignore this because we cannot take the derivative over the max function.
Why? Can't we do it?
Can we quantify how much effect it has to ignore this part of the gradient?
So we just say y_k is constant and treat it as a label. We will slowly update it over time.
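A minimal PyTorch-style sketch of the "treat y_k as a constant label" step, assuming the usual DQN setup (q_net, target_net, and the batch layout are my own names, not from the lecture):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: tensors (states, actions, rewards, next_states, dones)
    s, a, r, s_next, done = batch

    # Q_w(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # y_k is built from the frozen target network and detached,
    # so no gradient flows back through the max term
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

    return F.mse_loss(q_sa, y)
```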
Convergence is slow if we feed trajectories in directly as we get them. Related to samples not being IID.
You can think of it as learning to play one part of the game while destroying the rest of the information we gained.
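A minimal replay-buffer sketch (my own, just to make the IID point concrete): sampling random minibatches breaks up the correlation within a single trajectory.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # old transitions fall off the left end once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)
```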
Why even get rid of old parts of the replay buffer?
Won't this make me forget how to play parts of the game that are outside of my strategy?
I guess changing the target slowly also aligns with how tabular Q learning works. In tabular Q learning we don't just update one Q value, we update all of them before moving to the next iteration. So we should really only update the target network after the Q network has gone through a full round of updates.
Can also use a soft update where you have the target network be a moving average of the online network.
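Sketch of hard vs soft target updates, assuming PyTorch-style parameter access (function and variable names are mine):

```python
def hard_update(target_net, q_net):
    # copy the online weights into the target every N steps
    target_net.load_state_dict(q_net.state_dict())

def soft_update(target_net, q_net, tau=0.005):
    # target is an exponential moving average of the online network
    for tp, p in zip(target_net.parameters(), q_net.parameters()):
        tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
```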
In practice, don't have the action be an input to the network. Use multiple heads. One for each action. There is a large correlation between Q values in a given state so sharing information is useful to stabilize training and inference.
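Sketch of the "one head per action" idea: the state is the only input and the network outputs a vector of Q values, so all actions share the same features (my own minimal version):

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        # shared trunk: all actions reuse the same state features
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # one output per action instead of taking (s, a) as input
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, s):
        return self.head(self.trunk(s))  # shape: (batch, n_actions)
```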
I mean, why not include the estimated Q network in the loss? Why treat y as constant with respect to w? We can take a derivative over a max as long as we are taking a small step.
When we swap away from model based we swap the order of the max and expectation causing overestimates.
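My attempt to pin this down: with noisy estimates \hat Q, swapping the order gives

\max_a \mathbb{E}[\hat Q(s,a)] \;\le\; \mathbb{E}\big[ \max_a \hat Q(s,a) \big]

e.g. two actions both with true value 0 and independent ±1 estimation noise: the max of the means is 0, but the expected max is 0.5.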
In the DDQN analysis we are assuming that Q is the same for all actions, so it can be taken out of the max. And the max over Q can be dropped because all actions have the same value.
How do you actually compute the expected max?
We did it on the annotated slides.
Double estimators allow us to estimate a max from data. Solves our exact problem.
One dataset solves the argmax problem and the second solves the estimation of max problem.
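Minimal sketch of the double estimator on sample data (my reconstruction of the idea, not the lecture's code): split the samples per action into two halves, pick the argmax with one half, read off the value with the other.

```python
import numpy as np

def single_estimator(samples):
    # samples: dict action -> array of sampled returns
    # biased upward: the same data picks the argmax and estimates its value
    return max(s.mean() for s in samples.values())

def double_estimator(samples, seed=0):
    rng = np.random.default_rng(seed)
    half_a, half_b = {}, {}
    for a, s in samples.items():
        s = rng.permutation(s)
        half_a[a], half_b[a] = s[: len(s) // 2], s[len(s) // 2:]
    # half A picks the action, half B estimates its value
    best = max(half_a, key=lambda a: half_a[a].mean())
    return half_b[best].mean()
```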
But why is this useful? I can also just choose -infinity as my estimate. How close is the double estimator to the true max? Is it tight? Is there a tighter estimator?
The error is the same between DQN and DDQN, but DDQN converges faster due to a higher learning rate.
So in double DQN we basically take the target network and swap it with the online network instead of copying it in.
Uhh no, I think we update both at the same time; we just treat each as the target of the other.
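Sketch of the deep Double DQN target as I understand it from elsewhere (online net chooses the action, target net evaluates it; names carried over from the loss sketch above):

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # online network chooses the action...
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        # ...target network evaluates it
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)
    return r + gamma * (1 - done) * q_eval
```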
Another thing that works well is just to take the min of the two estimated Q values. In this case we use the max value from our own network instead of from the other one.
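Sketch of the "take the min" variant as I read it (clipped double Q style; each network takes its own max, then we keep the smaller one; q1, q2 are my own names):

```python
import torch

def clipped_double_q_target(q1, q2, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # each network takes the max over its own action values...
        m1 = q1(s_next).max(dim=1).values
        m2 = q2(s_next).max(dim=1).values
        # ...then we keep the smaller of the two to fight overestimation
        q_min = torch.min(m1, m2)
    return r + gamma * (1 - done) * q_min
```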