09/04/2026 12:02 PM - Lecture
Average reward RL
The discount factor is 1, so everything we built that relies on the discount factor being less than 1 breaks down.
Instead, maximize the average reward per step as the horizon $N$ goes to infinity.
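Written out (my notation; $\pi$ is the policy and $r(s_t, a_t)$ the per-step reward), the objective is

$$\rho^\pi \;=\; \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\!\left[\sum_{t=1}^{N} r(s_t, a_t)\right],$$

and we maximize $\rho^\pi$ over policies.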
Since the chain has a stationary distribution, the average reward is just the stationary distribution times the reward you get for being in each state.
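A minimal sketch of that computation for a made-up two-state chain (the transition matrix and rewards below are invented purely for illustration):

```python
import numpy as np

# Markov chain induced by some fixed policy (rows sum to 1) and per-state rewards.
# Numbers are made up for illustration.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
r = np.array([1.0, 5.0])

# Stationary distribution mu solves mu P = mu with the entries summing to 1.
# Stack (P^T - I) mu = 0 with the normalization row and solve by least squares.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)

# Long-run average reward = stationary distribution dotted with the rewards.
avg_reward = mu @ r
print(mu, avg_reward)   # mu ~ [0.8, 0.2], avg_reward ~ 1.8
```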
This also means that it does not matter what the initial state is since we always converge to the stationary distribution and stay there forever.
I guess if it is an absorbing Markov chain, the average reward just ends up equal to the reward for being in the absorbing state, as long as the policy does converge to that state.
This also means that there is no way to use this value function to choose actions, since every state has the same long-run average value.
So consider a finite-horizon policy: cap the number of steps we can take. Take a value function where we start at time $m$ in state $i$ and proceed until $N$ steps (globally; $N$ is not counted from the start time, it is the total number of steps including those already taken), and maximize the average reward over that horizon.
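In symbols (my notation, assuming zero terminal reward at step $N$), the backward recursion would be

$$V_m(i) \;=\; \max_a \Big[ r(i,a) + \sum_j p_{ij}(a)\, V_{m+1}(j) \Big], \qquad V_N(i) = 0,$$

and the average over the remaining horizon is $V_m(i)/(N-m)$.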
When we look at the total reward (which goes to infinity) and don't take the average, the difference between different start states remains: it goes to 0 on average, but in sum it does not go to 0. So we can't use the total reward directly because it diverges, but if we subtract off the long-run average at each step, what remains is finite, and we can use that relatively tiny difference to tell which action is best.
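Concretely, the standard definition of this relative (bias) value function, which I think is what is being used here, subtracts the long-run average from every step's reward:

$$h^\pi(i) \;=\; \lim_{N \to \infty} \mathbb{E}\!\left[\sum_{t=1}^{N} \big( r(s_t, a_t) - \rho^\pi \big) \;\Big|\; s_1 = i \right],$$

which stays finite even though the total reward diverges, and the differences $h^\pi(i) - h^\pi(j)$ say how much better it is to start in $i$ than in $j$.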
So if we know $V^*$ and $h$ that satisfy the Bellman equation, and we take greedy actions with respect to the relative value function $h$, then we... what? What does that policy actually guarantee?
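For reference, the average-reward Bellman optimality equation being referred to (with $V^*$ the optimal average reward, a single scalar, matching the notes' usage) is

$$V^* + h(i) \;=\; \max_a \Big[ r(i,a) + \sum_j p_{ij}(a)\, h(j) \Big] \quad \text{for all } i,$$

and the greedy policy picks the maximizing action in each state.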
Obviously the long-run average reward is optimal, but what about over finite horizons?
Thinking it through: we lose potential total reward whenever we are not yet at the stationary distribution, so the relative value function is highest when we get to the stationary distribution as quickly as possible. So during the transient phase we try to reach the stationary distribution quickly.
The requirement that $h$ is bounded is interesting (wait, why not sub-linear? That question doesn't make sense since $h$ is not time varying). What does that imply for the problem?
A finite state space also helps here: since $h$ is not time varying and takes only finitely many values, terms like $h(s_N)/N$ converge to 0.
We proved that if the Bellman equation holds, then the average reward of any policy is bounded above by $V^*$.
We didn't go over the proof, but it can also be shown that $V^*$ is achievable by an optimal policy.
You need to pick a reference state to get a unique solution, since any constant offset of $h$ is also valid. We choose $h(0)=0$ for our example.
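A minimal sketch of solving the evaluation version of the Bellman equation for a fixed policy with the reference choice $h(0)=0$ (the chain and rewards are the same made-up numbers as above; the point is just that pinning $h(0)=0$ makes the linear system uniquely solvable):

```python
import numpy as np

# Markov chain induced by a fixed policy, plus per-state rewards (made-up numbers).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
r = np.array([1.0, 5.0])
n = P.shape[0]

# Unknowns: x = (rho, h(0), ..., h(n-1)).
# One equation per state:  rho + h(i) = r(i) + sum_j P[i, j] * h(j),
# plus the normalization h(0) = 0 that removes the constant-offset ambiguity.
A = np.zeros((n + 1, n + 1))
b = np.zeros(n + 1)
for i in range(n):
    A[i, 0] = 1.0                      # coefficient of rho
    A[i, 1:] = np.eye(n)[i] - P[i]     # h(i) - sum_j P[i, j] h(j)
    b[i] = r[i]
A[n, 1] = 1.0                          # reference state: h(0) = 0
x = np.linalg.solve(A, b)
rho, h = x[0], x[1:]
print("average reward:", rho)          # ~1.8, matching the stationary computation
print("relative values:", h)           # h(0) = 0, h(1) ~ 8
```

Without the $h(0)=0$ row the system is singular, since adding a constant to every $h(i)$ gives another solution with the same $\rho$.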