
20/01/2026 12:03 PM - Lecture

⬅️ [19/01/2026 4:51 PM - HW1](<./19_01_2026 4_51 PM - HW1.md>) | ⬆️ [ECE 567](<./README.md>) | [22/01/2026 11:59 AM - Lecture](<./22_01_2026 11_59 AM - Lecture.md>) ➡️

Lecture 4
lecture-04.pdf

Value Iteration

Estimate the value function and then greedily choose actions to maximize value.

Policy iteration:
Start with a policy and iteratively improve it.

Solve the Bellman equation exactly:
$V^*(x) = \max_a \mathbb{E}[r(x, a) + \alpha V^*(x')]$

Specifically, this is the time-discounted version, where future rewards carry an exponential (geometric) decay factor $\alpha$.
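For concreteness, this is the standard definition of the discounted value of a policy $\pi$, written with the lecture's discount factor $\alpha \in (0, 1)$:

$$V^{\pi}(x) = \mathbb{E}\left[\sum_{t=0}^{\infty} \alpha^t \, r(x_t, a_t) \,\middle|\, x_0 = x\right]$$

Each step into the future is weighted by another factor of $\alpha$, which is the exponential decay referred to above.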

Model based

We have the model: the state space, action space, transition probabilities (conditioned on the action), and rewards are all known.

If all of these are finite, then we can do tabular learning, where we literally keep a table of values.

Problem: We must know $V$ to compute $V$.
Solution: Start from a random value table. Pretend that our guess is the true value and solve the Bellman equation for each state given our table. Once we have done that for all states, replace the old table with the new one, and repeat until we converge.

Note that we are in the infinite-horizon setting, so the subscript of the value function no longer denotes the stage; it denotes the iteration of value iteration.

$V_{k+1}(x) \leftarrow \max_a \left(\bar{r}(x, a) + \alpha \sum_{x'} P_{x, x'}(a) V_k(x')\right)$

It is provable that this converges to the optimal $V^*$.

We can then recover the optimal action as the argmax of the value-iteration (Bellman equation) step.

We only know that $V_k$ gets closer to $V^*$, but that does not guarantee optimal policy at any given point.

Generally we stop iterating once all entries change by less than $\epsilon$ in an iteration.
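A minimal NumPy sketch of tabular value iteration with this stopping rule (not from the lecture; the array names and shapes are my own choices, with `P[a, x, x']` the transition probabilities and `r_bar[x, a]` the expected one-step rewards):

```python
import numpy as np

def value_iteration(P, r_bar, alpha, eps=1e-6):
    """Tabular value iteration.

    P     : (n_actions, n_states, n_states) array, P[a, x, x'] = Pr(x' | x, a).
    r_bar : (n_states, n_actions) array of expected one-step rewards.
    alpha : discount factor in (0, 1).
    eps   : stop once every entry of V changes by less than eps.
    """
    n_states, n_actions = r_bar.shape
    V = np.zeros(n_states)  # arbitrary initial value table
    while True:
        # Q[x, a] = r_bar(x, a) + alpha * sum_x' P_{x,x'}(a) V_k(x')
        Q = r_bar + alpha * np.einsum('axy,y->xa', P, V)
        V_new = Q.max(axis=1)                # Bellman update for every state
        if np.max(np.abs(V_new - V)) < eps:  # all entries changed by < eps
            return V_new, Q.argmax(axis=1)   # value table and greedy policy
        V = V_new
```

The returned argmax is exactly the greedy-recovery step described above: once $V_k$ is close to $V^*$, acting greedily with respect to it gives a near-optimal policy.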

Policy Iteration

In value iteration, if we stop somewhere in the iteration process, a greedy approach may not lead to a good policy.

In policy iteration, we instead fix an initial policy, compute the value of each state under that policy, and then derive a new greedy policy from those values. Repeat until the policy does not change between iterations.

Note that the value function under a fixed policy is defined by a set of linear equations and so can be solved exactly and quickly. It has the form of the Bellman equation, except that the action in each state is fixed, so it is just evaluation instead of a maximization over actions.
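In matrix form (a standard identity, using the lecture's notation for $P$ and $\bar{r}$), evaluating a fixed policy $\pi$ amounts to solving a linear system:

$$V^{\pi} = \bar{r}^{\pi} + \alpha P^{\pi} V^{\pi} \quad\Longrightarrow\quad V^{\pi} = (I - \alpha P^{\pi})^{-1} \bar{r}^{\pi},$$

where $P^{\pi}_{x, x'} = P_{x, x'}(\pi(x))$ and $\bar{r}^{\pi}(x) = \bar{r}(x, \pi(x))$.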

In policy iteration, the value function provably improves at each iteration. If it cannot be improved, the policy is optimal.

Each iteration is more costly than in value iteration, but the algorithm converges in a finite number of iterations and improves monotonically.
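A minimal NumPy sketch of this loop, reusing the same hypothetical `P` and `r_bar` arrays as the value-iteration sketch above and solving the policy-evaluation linear system exactly:

```python
import numpy as np

def policy_iteration(P, r_bar, alpha):
    """Tabular policy iteration with exact policy evaluation."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - alpha * P_pi) V = r_pi exactly.
        P_pi = P[policy, np.arange(n_states)]        # row x is P[policy[x], x, :]
        r_pi = r_bar[np.arange(n_states), policy]    # reward under the policy
        V = np.linalg.solve(np.eye(n_states) - alpha * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = r_bar + alpha * np.einsum('axy,y->xa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # no change => optimal
            return policy, V
        policy = new_policy
```

The termination test is exactly the stopping condition stated above: once the greedy improvement step leaves the policy unchanged, no further improvement is possible and the policy is optimal.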


⬅️ [19/01/2026 4:51 PM - HW1](<./19_01_2026 4_51 PM - HW1.md>) | ⬆️ [ECE 567](<./README.md>) | [22/01/2026 11:59 AM - Lecture](<./22_01_2026 11_59 AM - Lecture.md>) ➡️