27/01/2026 12:00 PM - Lecture
This lecture
- Given a stationary policy, why does the Bellman equation give the value function?
- Prove the contraction mapping theorem
- Prove that value iteration is a contraction mapping and so we have a unique solution.
- Prove that you can obtain the optimal policy from $V^*$
We are assuming that the reward is bounded below by 0 and above by some constant $r_{max}$. Really we just need it to be bounded, since we can always add a constant to make it non-negative.
Note that something like a Gaussian reward is also valid (I think because only the expected reward enters the value function).
Slide 2 restates what the value function is. Given our first state, what is the expected discounted reward over the infinite horizon.
We then define the truncated expected reward $a_N$ (the expected discounted reward over the first $N$ steps). Since rewards are non-negative, $a_N$ is an increasing sequence, and it is bounded above by $\frac{r_{max}}{1-\alpha}$, so it converges.
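A one-line sketch of that bound (writing $r_t$ for the reward at step $t$, with $0 \leq r_t \leq r_{max}$):
$$
a_N = \mathbb{E}\left[\sum_{t=0}^{N-1} \alpha^t r_t\right] \leq r_{max} \sum_{t=0}^{\infty} \alpha^t = \frac{r_{max}}{1-\alpha}
$$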
Slide 3
Simple proof that the value function satisfies the Bellman equation (assuming the policy is stationary and the state and action spaces are finite; it could be done for infinite spaces, but then we need an integral). We expand the expectation over the first step and note that the remaining expectation is exactly the value function at the next state. That only works because the policy is stationary: if it varied with time, the remaining expectation would be over a different distribution. It also only works over the infinite horizon, because otherwise the indexing of the sum doesn't line up.
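A sketch of the expansion (using the notation from the later slides: $\bar{r}(i, \mu(i))$ for the expected immediate reward and $P_{ij}$ for the transition probability under the action $\mu(i)$):
$$
V^\mu(i) = \mathbb{E}\left[\sum_{t=0}^{\infty} \alpha^t r_t \,\middle|\, s_0 = i\right] = \bar{r}(i, \mu(i)) + \alpha \sum_j P_{ij} \, \mathbb{E}\left[\sum_{t=1}^{\infty} \alpha^{t-1} r_t \,\middle|\, s_1 = j\right] = \bar{r}(i, \mu(i)) + \alpha \sum_j P_{ij} V^\mu(j)
$$
The last step is exactly where stationarity is used: the tail expectation started from $s_1 = j$ is the same object as $V^\mu(j)$.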
Think through what this implies. Is it the only function that satisfies the Bellman equation? If something satisfies the Bellman equation is it the value function?
Yes we prove that later with the contraction mapping. Whoops.
I will also note that the final line can be condensed into an expectation of the value function at the next state, conditioned on the action we take.
I also note that this shows, for a given policy, that its value function satisfies the Bellman equation. But we have not stated anything about what the optimal policy satisfies.
Also we are assuming that the policy is deterministic (state => action without any randomness). Introducing a stochastic policy is easy though: we just get an extra sum over actions in the expectation, as sketched below.
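A sketch of the stochastic-policy version (my notation: $\mu(a \mid i)$ is the probability of taking action $a$ in state $i$, and $P_{ij}(a)$ the transition probability under action $a$):
$$
V^\mu(i) = \sum_a \mu(a \mid i) \left[ \bar{r}(i, a) + \alpha \sum_j P_{ij}(a) V^\mu(j) \right]
$$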
Slide 4 - Contraction Mapping
How can we show that the Bellman equation has only one solution? We have shown that the value function is one solution, but what if there are multiple?
If we show that the Bellman operator is a contraction mapping, we get that it has a unique fixed point and that we reach it by iterating the map. This also gives us value iteration: value iteration is just repeatedly applying the Bellman operator, so if the operator is a contraction, iterating it converges to the unique solution, which is the value function.
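In symbols (using $T$ for the Bellman operator defined on the next slides), value iteration for a fixed policy is just
$$
V_{k+1} = T(V_k), \qquad V_k \to V^\mu \text{ as } k \to \infty
$$
and the contraction property is what guarantees the limit exists and is the same from any starting $V_0$.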
But still, this gives us the value function for a given policy, but not for the optimal policy.
Contraction mapping: a map from a space to itself (same dimension in and out) where the distance between the outputs is strictly smaller than the distance between the inputs, by a fixed factor $\alpha < 1$.
We can define this over any norm, not just the 2 norm.
Mathematically: $||T(x) - T(y)|| \leq \alpha ||x - y||$ where $\alpha \in [0, 1)$.
Do we also have a special case of contraction mappings that only holds on part of the domain? Do we still get convergence guarantees if the initial point is inside that domain?
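A toy example (mine, not from the lecture): $T(x) = \frac{1}{2}x + 1$ on $\mathbb{R}$ is a contraction with $\alpha = \frac{1}{2}$, since
$$
|T(x) - T(y)| = \tfrac{1}{2}|x - y|,
$$
and its unique fixed point is $x^* = 2$; iterating from any start halves the distance to 2 each step.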
We never showed that there is a unique solution lol. So I guess we didn't prove the contraction mapping theorem.
Bellman Contraction Mapping:
Define a Bellman operator which takes the previous value function and returns a new one, by applying the right-hand side of the Bellman equation componentwise:
$$
T\big(V(1), V(2), \dots\big) = \big(\tilde{V}(1), \tilde{V}(2), \dots\big), \qquad \tilde{V}(i) = \bar{r}(i, \mu(i)) + \alpha \sum_j P_{ij} V(j)
$$
Slide 6
By combining the policy with the transition probabilities we can define this $T$ in a simple matrix form. This is much easier to see in the annotated slides, where he wrote out how we get this new $P$.
$$
T_\mu(V) = \bar{r}_\mu + \alpha P V
$$
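A minimal NumPy sketch of iterating this operator for a fixed policy (the names `r_mu`, `P_mu` and `evaluate_policy` are mine, not from the slides; `P_mu[i, j]` is the probability of moving from state $i$ to state $j$ when the policy's action $\mu(i)$ is taken):

```python
import numpy as np

def bellman_operator(V, r_mu, P_mu, alpha):
    """One application of T_mu(V) = r_mu + alpha * P_mu @ V."""
    return r_mu + alpha * P_mu @ V

def evaluate_policy(r_mu, P_mu, alpha, tol=1e-8, max_iters=10_000):
    """Iterate T_mu until successive iterates are close in the infinity norm.

    r_mu  : (n,) vector, r_mu[i] = expected reward in state i under the policy
    P_mu  : (n, n) matrix, P_mu[i, j] = P(next state j | state i, action mu(i))
    alpha : discount factor in [0, 1)
    """
    V = np.zeros_like(r_mu, dtype=float)  # any starting point works for a contraction
    for _ in range(max_iters):
        V_new = bellman_operator(V, r_mu, P_mu, alpha)
        if np.max(np.abs(V_new - V)) < tol:  # infinity-norm stopping rule
            return V_new
        V = V_new
    return V
```

Starting from $V = 0$ is arbitrary; the contraction property shown next is what guarantees convergence to the same fixed point from any start.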
We choose the infinity norm for this proof because the bound works out nicely with it. Showing contraction under any valid norm would be enough to get uniqueness.
Does it also work with other norms?
The $P$ matrix contains the probabilities of transitioning between states given the policy. So it's basically turning the MDP into a Markov chain determined by the policy.
Note that $\bar{r}_\mu$ is a constant vector, so it cancels in the subtraction $T_\mu(x) - T_\mu(y)$.
For line 2 of the proof, look at the annotated slides; I was still a bit confused about how it works. Lines 3 to 4 are also a bit odd; that step works because the transition probabilities are non-negative. I've tried to write out the full chain of inequalities below.
The last equation we get to on this slide has $||x-y||_\infty$, which is independent of $i$ and $j$, so it can be pulled out of the sum. Then the sum over $j$ of the transition probabilities is always just 1, so the max is 1. We are left with $\alpha ||x-y||_\infty$, so we have proven the contraction mapping.
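Writing out the whole chain (my reconstruction of the slide's argument, with $x$ and $y$ two candidate value vectors):
$$
\begin{aligned}
\|T_\mu(x) - T_\mu(y)\|_\infty &= \|\alpha P (x - y)\|_\infty && (\bar{r}_\mu \text{ cancels}) \\
&= \alpha \max_i \Big| \sum_j P_{ij} (x_j - y_j) \Big| \\
&\leq \alpha \max_i \sum_j P_{ij} |x_j - y_j| && (\text{triangle inequality, } P_{ij} \geq 0) \\
&\leq \alpha \max_i \sum_j P_{ij} \|x - y\|_\infty \\
&= \alpha \|x - y\|_\infty && (\text{each row of } P \text{ sums to } 1)
\end{aligned}
$$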
Therefore, since the value function is a solution, it is the unique solution, and value iteration will find the value of the policy.
Slide 8 - Proof of contraction mapping theorem
I was confused by the last 2 lines. How is the bound independent of $l$? I think it is because $(1 + \alpha + \dots + \alpha^{l-1})$ is less than $\sum^\infty_{i=0} \alpha^i = \frac{1}{1-\alpha}$, so we can substitute the infinite sum and get a bound independent of $l$. That bound then goes to 0 as $n$ goes to infinity, so the distance between terms goes to 0. This works since $0 \leq \alpha < 1$.
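My reconstruction of that bound (using the triangle inequality on consecutive iterates, where $\|x_{k+1} - x_k\| \leq \alpha^k \|x_1 - x_0\|$):
$$
\|x_{n+l} - x_n\| \leq \sum_{k=0}^{l-1} \|x_{n+k+1} - x_{n+k}\| \leq \alpha^n (1 + \alpha + \dots + \alpha^{l-1}) \|x_1 - x_0\| \leq \frac{\alpha^n}{1 - \alpha} \|x_1 - x_0\| \to 0 \text{ as } n \to \infty
$$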
This gives us that $x_n$ is a Cauchy sequence, so convergence to a limit is guaranteed (we are in a complete space, where Cauchy sequences converge).
Slide 10 - Fixed point uniqueness
Do we always converge to the same point?
Take $x^*$ to be a limit of the sequence. Is it the only one? Since $T$ is a contraction mapping, repeated application of $T$ from any starting point always moves us closer to $x^*$ (by a factor $\alpha$ each step). Therefore the iterates get arbitrarily close to $x^*$ no matter where we start, so $x^*$ is the unique fixed point.
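One tidy way to pin this down (my phrasing): suppose $y^*$ were another fixed point. Then
$$
\|x^* - y^*\| = \|T(x^*) - T(y^*)\| \leq \alpha \|x^* - y^*\|
$$
and since $\alpha < 1$ this forces $\|x^* - y^*\| = 0$, i.e. $y^* = x^*$.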