14/04/2026 12:03 PM - Lecture
Continuing with last lec slides.
The average reward problem basically makes us subtract a baseline: the long-run average reward is subtracted from each step's reward, and values are measured relative to it.
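Concretely (standard definitions, assuming the limits exist, e.g. in a unichain MDP): the baseline is the average reward ρ, the relative (differential) value function h accumulates the leftover rewards, and the two are tied together by the evaluation equation:

```latex
\rho^\pi = \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}^\pi\!\left[\sum_{t=0}^{T-1} r_t\right],
\qquad
h^\pi(s) = \mathbb{E}^\pi\!\left[\sum_{t=0}^{\infty} \left(r_t - \rho^\pi\right) \,\middle|\, s_0 = s\right],
\qquad
h^\pi(s) = r(s,\pi(s)) - \rho^\pi + \sum_{s'} P(s' \mid s, \pi(s))\, h^\pi(s').
```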
Policy iteration
Basically the same as the discounted case, except that policy evaluation now solves for the relative value function and subtracts off the baseline ρ, which is also computed at each step.
Since we are now computing two quantities, the evaluation equations no longer have a unique solution (h is only determined up to an additive constant), so we need to pin one variable down. Generally we just set the relative value of a reference (e.g. initial) state to 0.
Then be greedy w.r.t. the relative value function, and iterate (see the sketch below).
The policy may oscillate at the end, so instead of waiting for the policy to settle, check whether the average reward and relative value function are still evolving.
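A minimal sketch of this loop for a tabular MDP with known dynamics (the function name, array shapes, and the unichain assumption are mine, not from the slides):

```python
import numpy as np

def avg_reward_policy_iteration(P, r, s0=0, max_iters=100):
    """Average-reward policy iteration on a tabular MDP.

    P: (S, A, S) transition probabilities, r: (S, A) rewards.
    Assumes a unichain MDP so that (rho, h) is well defined.
    """
    S, _ = r.shape
    pi = np.zeros(S, dtype=int)            # arbitrary initial policy
    rho, h = 0.0, np.zeros(S)
    for _ in range(max_iters):
        P_pi = P[np.arange(S), pi]         # (S, S) dynamics under pi
        r_pi = r[np.arange(S), pi]         # (S,) rewards under pi
        # Evaluation: solve (I - P_pi) h + rho * 1 = r_pi together
        # with h[s0] = 0, which pins down the additive constant in h.
        A_eq = np.zeros((S + 1, S + 1))
        A_eq[:S, :S] = np.eye(S) - P_pi
        A_eq[:S, S] = 1.0
        A_eq[S, s0] = 1.0
        b_eq = np.append(r_pi, 0.0)
        x = np.linalg.solve(A_eq, b_eq)
        h, rho = x[:S], x[S]
        # Improvement: greedy w.r.t. the relative value function.
        # rho is the same for every action, so it drops out of the argmax.
        q = r + P @ h                      # (S, A) relative action values
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):     # the policy itself may oscillate;
            break                          # checking that rho and h have
        pi = new_pi                        # stopped changing is more robust
    return pi, rho, h
```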
It turns out that training directly for the average reward objective is rather unstable. So what do we do instead?
We convert back to discounted RL: choose a discount factor close to 1 to approximate the average-reward problem. This usually works well.
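The standard justification (the Laurent expansion of the discounted value, stated informally; not derived in lecture):

```latex
V^\pi_\gamma(s) = \frac{\rho^\pi}{1-\gamma} + h^\pi(s) + o(1)
\quad (\gamma \to 1),
\qquad\text{so}\qquad
(1-\gamma)\, V^\pi_\gamma(s) \to \rho^\pi .
```

For γ near 1 the ρ/(1−γ) term dominates, so maximizing discounted return approximately maximizes the average reward.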
Blackwell optimality. A policy is Blackwell optimal if it is optimal for every discount factor in a range from some quantity up to, but not including, 1, i.e. for all γ in [γ̄, 1). So it is a stronger form of optimality.
If the MDP has finitely many states and actions, there is a Blackwell optimal policy for some such discount range.
If a policy is Blackwell optimal, then it is optimal for the average reward problem. So we can transfer a model trained on a discounted problem and have it work for the average reward problem.
We don't get this nice property in most modern RL settings, due to infinite state and action spaces. But does it still work in practice?
What is the lower bound γ̄ usually? Is it like 0.9? Or is it 0.9999999999999? For the existence proof, this also means that there could be multiple optimal policies though, right? You have a monotonic subsequence, but you can have multiple monotonic subsequences corresponding to different policies.
Proof idea: take a sequence of discount factors increasing to 1; since a finite MDP has only finitely many deterministic policies, some policy must be optimal infinitely often along that sequence. We prove by contradiction that this infinitely repeating policy is Blackwell optimal.
The i-bar is basically just a chosen state that appears in the stationary distribution (i.e., a state visited recurrently under the policy).
Basically, if we had two different monotonic subsequences with different optimal policies, then define a function of the discount factor as the difference between the two policies' Q values; that function would have to change sign (oscillate) infinitely often as γ → 1.
The Q value is a discounted future reward, which is continuous in the discount factor. It is also a rational function of the discount factor (why it's rational wasn't obvious to me; see the note below).
Together this forms a contradiction, since a continuous rational function cannot change sign infinitely often. So the policy that is optimal along the monotonic subsequence is the only policy that can be optimal as γ → 1, i.e. it is optimal on a whole interval [γ̄, 1).
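Filling in the rationality step (a standard linear-algebra fact, not from the lecture): the discounted value is the solution of a linear system, and by Cramer's rule every entry of the solution is a ratio of polynomials in γ:

```latex
V^\pi_\gamma = (I - \gamma P_\pi)^{-1} r_\pi ,
\qquad
Q^\pi_\gamma(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi_\gamma(s') .
```

Hence Q is rational in γ; a rational function that is not identically zero has only finitely many zeros, so the difference between two policies' Q values cannot change sign infinitely often on [γ̄, 1).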
Safe RL (Constrained RL)
Maximize expected reward subject to a constraint (e.g. a bound on an expected cost).
Options:
1. Formulate it as a multi-objective problem and look at the Pareto front.
2. Pick one objective as primary, and put the others in as constraints.
We are going to look at the second approach, using tools from the field of constrained optimization.
Basically, the Lagrangian method provides us with a way to shift the multiplier values (λ for inequality constraints, μ for equality constraints) until the constraints are satisfied, alternating multiplier updates with policy updates. A sketch is below.
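A minimal primal-dual sketch for a single inequality constraint (the function name, arguments, and plain-SGD form are illustrative assumptions, not a specific algorithm from lecture):

```python
def primal_dual_step(theta, lam, grad_reward, grad_cost,
                     cost_value, cost_limit,
                     lr_theta=1e-3, lr_lam=1e-2):
    """One primal-dual update for: max_theta J_r(theta)
    subject to J_c(theta) <= cost_limit, via the Lagrangian
    L(theta, lam) = J_r(theta) - lam * (J_c(theta) - cost_limit).

    grad_reward / grad_cost are (estimated) policy gradients of
    J_r and J_c at theta; cost_value is an estimate of J_c(theta).
    """
    # Primal step: gradient ascent on the Lagrangian in theta.
    theta = theta + lr_theta * (grad_reward - lam * grad_cost)
    # Dual step: raise lam when the constraint is violated, shrink it
    # toward 0 when there is slack; lam must stay nonnegative.
    lam = max(0.0, lam + lr_lam * (cost_value - cost_limit))
    return theta, lam
```

Intuitively, λ acts as an adaptive penalty weight: persistent constraint violation drives λ up until the policy update is forced to respect the constraint.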
HW9 Quiz.