13/01/2026 12:05 PM
In an SDP we need to account for being in every state, since we cannot know in advance where we will be. So our policy is a function that maps each state to an action, for all states.
If we knew the future, the problem would be easy to solve. So start where there is no future: the final stage. There, just maximize your reward, or in this case your expected reward given an action.
So to find your policy, maximize your reward-to-go:
Make sure you have the expected reward-to-go for every state one step ahead. Then, for each action, compute its expected reward by weighting the reward and value of each possible next state by the probability of transitioning into it. Choose the action with the maximum.
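For instance (illustrative numbers only, not from the notes): if an action leads with probability 0.4 to a future worth 1 and with probability 0.6 to a future worth 0, its expected reward-to-go is 0.4 · 1 + 0.6 · 0 = 0.4; the policy picks whichever action's expectation is largest.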
The value function (of a given policy) is the expected value of the reward at the final state plus the rewards for each transition made along the way.
In general, the optimal value function is the max over all policies of this expected total reward.
The Bellman equation relates the optimal value function at time k to the optimal value function at time k+1 and the rewards for taking different actions: for each action we compute the expected immediate reward together with the expected value of the resulting future state.
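In symbols (standard MDP notation, assumed here rather than defined in the notes), with terminal stage K:

$$V_k^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,\big[r(s, a, s') + V_{k+1}^*(s')\big], \qquad V_K^*(s) = \text{terminal reward}(s).$$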
Backward-forward algorithm: compute the optimal value function starting from the last stage and working backward; then, moving forward, select the action that maximizes the expected reward-to-go at each step (a sketch follows).
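A minimal sketch of this algorithm, assuming a finite-horizon MDP described by callables `actions(t, s)`, `transitions(t, s, a)` (yielding `(prob, next_state, reward)` triples), and `terminal_reward(s)`; this interface is an assumption made for illustration, not something fixed by the notes:

```python
def backward_induction(num_stages, states, actions, transitions, terminal_reward):
    """Backward pass: fill in the optimal value function V[t][s] and the
    greedy policy, starting from the horizon and working backward."""
    V = [dict() for _ in range(num_stages + 1)]
    policy = [dict() for _ in range(num_stages)]
    for s in states:
        V[num_stages][s] = terminal_reward(s)  # no future at the horizon
    for t in reversed(range(num_stages)):
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions(t, s):
                # Expected reward-to-go of action a: weight each outcome's
                # reward plus next-stage value by its transition probability.
                q = sum(p * (r + V[t + 1][s2])
                        for p, s2, r in transitions(t, s, a))
                if q > best_q:
                    best_a, best_q = a, q
            V[t][s], policy[t][s] = best_q, best_a
    return V, policy
```

The forward pass is then just a lookup: at stage t in state s, take policy[t][s].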
Example
You have $2 and have to play a game 3 times. In each game, the chance of winning is 0.4, in which case the money you bet is doubled; the chance of losing is 0.6, in which case you lose the money you bet. You can only bet an integer dollar amount.
Goal: Find a policy that maximizes your chance of ending up with $4 or more.
Stage: Index of turns
State: Amount of money
Action: Amount of money we bet
Reward: At time = 3 the reward is 1 if we have $4 or more and 0 otherwise (see the worked sketch after this list).
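A worked sketch of the backward induction for this game (pure Python; the function names are mine, not from the notes):

```python
from functools import lru_cache

P_WIN, P_LOSE = 0.4, 0.6
HORIZON = 3  # number of games

@lru_cache(maxsize=None)
def value(t, money):
    """Optimal probability of finishing with at least $4
    from stage t with the given bankroll."""
    if t == HORIZON:
        return 1.0 if money >= 4 else 0.0
    # Try every legal integer bet, including betting nothing.
    return max(
        P_WIN * value(t + 1, money + bet) + P_LOSE * value(t + 1, money - bet)
        for bet in range(money + 1)
    )

def best_bet(t, money):
    """Recover the optimal action: the bet with the largest expected value-to-go."""
    return max(
        range(money + 1),
        key=lambda bet: P_WIN * value(t + 1, money + bet)
                        + P_LOSE * value(t + 1, money - bet),
    )

print(value(0, 2), best_bet(0, 2))  # optimal success probability and first bet
```

Note that ties are possible: several bets can achieve the same expected value, and `max` simply returns the first one it finds.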
Infinite Horizon (as opposed to episodic):
Let's transition to something with the Markov property. Now we don't care about time; we only care about our current state and action.
We use a discount factor to prevent the accumulated reward from growing to infinity. Is there no better way to do that?
We care less about the rewards we might get in the far future; the value we can get now is more important. (With a discount factor γ < 1, the discounted rewards form a bounded geometric series, so the total stays finite.)
In this case, the value function also collapses to just a function of the state. Time is no longer a factor.
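In the standard discounted form (notation assumed), time drops out and the Bellman equation becomes a fixed-point condition on a single function:

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,\big[r(s, a, s') + \gamma V^*(s')\big], \qquad 0 \le \gamma < 1.$$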
We can use the Bellman equation as a loss function in deep RL later on.
If the number of states and actions is large, solving the Bellman equation exactly is infeasible.
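As a preview of that deep-RL use, here is a minimal sketch of a squared Bellman-residual loss over a batch of transitions, in the style of DQN-type methods; the array interface and names are assumptions, not anything specified in the notes:

```python
import numpy as np

def bellman_loss(q_taken, next_q_all, rewards, dones, gamma=0.99):
    """Mean squared Bellman residual for a batch of (s, a, r, s') transitions.

    q_taken:    Q(s, a) predicted for the actions actually taken, shape (B,)
    next_q_all: Q(s', a') for every action in the next state, shape (B, A)
    dones:      1.0 where s' is terminal, else 0.0, shape (B,)
    """
    # Bellman target: reward plus discounted best next-state value,
    # zeroed out at terminal states.
    targets = rewards + gamma * (1.0 - dones) * next_q_all.max(axis=1)
    return np.mean((q_taken - targets) ** 2)
```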