10/03/2026 11:57 AM - Lecture
DPG
The policy now outputs a single action instead of the parameters of a distribution.
Why not just use the MLE as the deterministic policy during inference? Why is it important to have a deterministic policy during training?
In general with policy gradients: start with some reasonable policy, then iteratively improve it.
With stochastic policies we maximize the value function, which takes an expectation over the distribution of actions. Here we maximize the Q value directly, since we don't have a distribution. We can also consider it just $V(x_0)$, since the policy is deterministic.
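Writing that out: for a deterministic policy $V^\pi(x) = Q^\pi(x, \pi(x))$, so the objective is

$$J(\theta) = V^{\pi_\theta}(x_0) = Q^{\pi_\theta}\big(x_0, \pi_\theta(x_0)\big)$$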
We are assuming that our Q function estimate is good. So when we maximize the reward over the Q function we have, it improves the true Q function.
We still need to average (really, take an expectation) over the next state, since the environment can still be stochastic.
The proof just expands the Q value and then takes the gradient of the expanded expression.
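For reference, the resulting deterministic policy gradient theorem (with $Q^w$ the critic estimate and $\mu$ the state distribution):

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mu}\Big[ \nabla_\theta \pi_\theta(x) \, \nabla_a Q^w(x, a)\big|_{a = \pi_\theta(x)} \Big]$$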
Is the first equality valid? Isn't that only true when $Q^w$ is perfect?
A: Yes, but since we trained the Q function to satisfy this equation, once it converges we should be fine using it like this.
Does this only work when actions are continuous?
There was something about a gradient being 0 in the stochastic case, but not here. Didn't understand. The gradient of the reward? Was it 0 before?
For the second term: the gradient of the actual action taken with respect to $\theta$.
So we take the gradient of Q with respect to its action input instead of with respect to the parameters. This also means the action space needs to be continuous in order to take that gradient.
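A minimal PyTorch sketch of that chain rule (network sizes and names are made up; autograd computes $\nabla_a Q \cdot \nabla_\theta \pi_\theta$ for us when we backprop through the predicted action):

```python
import torch

# Hypothetical networks: actor maps state -> continuous action,
# critic maps (state, action) -> scalar Q value.
actor = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
critic = torch.nn.Sequential(torch.nn.Linear(4 + 2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

states = torch.randn(32, 4)  # stand-in for a batch of states

# Deterministic policy gradient: maximize Q(s, pi(s)).
# The gradient flows through the continuous *action* into the actor's
# parameters, which is exactly why the action space must be continuous.
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor_opt.zero_grad()
actor_loss.backward()  # dQ/da * da/dtheta, via autograd
actor_opt.step()       # only the actor's parameters are stepped here
```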
In reality we generally use an off-policy method with a replay buffer, often a prioritized buffer that biases the behavior distribution toward the current policy.
Why can we do this here, but not with other methods like PPO? Why is this different?
DPG is when the critic isn't actually a deep NN. DDPG (Deep DPG) is when we are using a neural network for the critic.
When we sample from the replay buffer, we don't use the action that was taken in the past; we predict a new action from the stored state, as sketched below. I think this is why it is more off-policy than older policy gradient methods. The reward gets hidden inside the gradient of the Q value with respect to the action: looking at the math, the gradient of Q implicitly takes the gradient of the reward.
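Continuing the sketch above (and assuming target copies `actor_targ`/`critic_targ` plus a `critic_opt`, all hypothetical), one DDPG-style update from a replay batch might look like this; note that the stored action `a` only enters the critic loss, while the actor loss re-predicts the action from the stored state:

```python
import torch

def ddpg_update(s, a, r, s_next, done, gamma=0.99):
    # Critic: the TD target uses the stored action a and the target networks.
    with torch.no_grad():
        a_next = actor_targ(s_next)
        q_next = critic_targ(torch.cat([s_next, a_next], dim=1))
        target = r + gamma * (1.0 - done) * q_next
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = torch.nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ignore the stored action entirely and re-predict from s.
    # This is what lets us reuse arbitrarily old transitions (off-policy).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```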
TD3 is basically DDPG, but with a clipped double Q as the critic.
TD here is not temporal difference, it is Twin Delayed.
The actor trains on a slow timescale while the critic trains at a higher rate, to really make sure the critic is good.
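A sketch of those two TD3 tweaks, under the same assumptions as above plus two critics `critic1`/`critic2` with target copies:

```python
import torch

def td3_target(s_next, r, done, gamma=0.99):
    # Clipped double Q: take the pessimistic min of the two target critics
    # to fight the overestimation bias of a single critic.
    with torch.no_grad():
        a_next = actor_targ(s_next)
        sa = torch.cat([s_next, a_next], dim=1)
        q_next = torch.min(critic1_targ(sa), critic2_targ(sa))
        return r + gamma * (1.0 - done) * q_next

def maybe_update_actor(step, s, policy_delay=2):
    # Delayed actor: the critics update every step, the actor only every
    # `policy_delay` steps, keeping the critic accurate ahead of the actor.
    if step % policy_delay != 0:
        return
    actor_loss = -critic1(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```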
Reward Shaping
If the model will never get the reward in the first place, it can never learn. How can we bias it to learn what we want?
A common problem is not handling the end of an episode well; this promotes reward hacking, e.g., an agent moving back and forth repeatedly to farm the shaped reward.
We want "policy-invariant reward shaping": reward shaping that does not change the optimal policy.
Total reward can be different, but the optimal policy doesn't change.
Here we are considering additive reward shaping, where the shaped reward is the original reward plus a shaping term $F$:
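$$\tilde{r}(x, a, x') = r(x, a, x') + F(x, a, x')$$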
What about multiplicative reward shaping? I think that's what I'm considering for my problem, where reward is sampled from a Bernoulli distribution.
We will look at a special class of reward shaping functions:
Potential-based reward shaping
Also called a Lyapunov-based shaping function.
If our function has the form $F(x, x') = \alpha \phi(x') - \phi(x)$, where $\phi: \mathcal{S} \to \mathbb{R}$, then it is a policy-invariant reward shaping function. (In the standard potential-based shaping result, the coefficient is the discount factor, i.e. $\alpha = \gamma$.)
For mountain car, this can be something like $\phi(x) = h$, the car's height. This means we get extra reward for going up.
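As a sketch (assuming Gymnasium's wrapper API, with $\alpha$ set to the discount $\gamma$ and the car's track height $h \approx \sin(3 \cdot \text{position})$ as the potential; all names illustrative):

```python
import gymnasium as gym
import numpy as np

class PotentialShaping(gym.Wrapper):
    """Adds F = gamma * phi(x') - phi(x) to the reward at every step."""

    def __init__(self, env, gamma=0.99):
        super().__init__(env)
        self.gamma = gamma
        self.prev_phi = None

    def phi(self, obs):
        # Potential = height of the car; MountainCar's track is roughly
        # h = sin(3 * position), so climbing yields a higher potential.
        return np.sin(3.0 * obs[0])

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.prev_phi = self.phi(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        new_phi = self.phi(obs)
        reward += self.gamma * new_phi - self.prev_phi  # shaped reward
        self.prev_phi = new_phi
        return obs, reward, terminated, truncated, info

env = PotentialShaping(gym.make("MountainCar-v0"))
```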
If $\alpha$ is too large, couldn't this cause issues by inverting the reward? Especially when $h$ gets large, $\alpha$ would cause a large change, so you would need very high velocity.
Every optimal policy is still an optimal policy under the shaped MDP.
If $F$ is not potential-based, then there exists some MDP in which the shaping is not policy-invariant. That's not saying it is not invariant in your situation, just that it fails in some situation. So it might still be fine, but not always; it's situational.
Since $\phi$ is independent of the action, it does not affect the max.
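The telescoping argument behind this (taking $\alpha = \gamma$): the shaped return is

$$\sum_{t=0}^{\infty} \gamma^t \big( r_t + \gamma \phi(x_{t+1}) - \phi(x_t) \big) = \sum_{t=0}^{\infty} \gamma^t r_t - \phi(x_0)$$

so $\tilde{Q}^\pi(x, a) = Q^\pi(x, a) - \phi(x)$; the shift is the same for every action, leaving the $\arg\max_a$ unchanged.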
This cannot work for me, since the action is the exact thing being punished. We actually don't want the optimal policy to stay the same, since the optimal policy is probably not asking a question at all.