17/03/2026 12:02 PM - Lecture
Project phase 1 deadline: March 31st.
Still on lecture 16/17. It's slightly different from last time.
The intuition: pick the arm where we think the reward might be highest. Always be greedy with respect to the best reward, but be optimistic about what that reward might be.
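A minimal sketch of that "greedy but optimistic" idea (UCB1 on Bernoulli arms, my own illustration — arm probabilities, horizon, and the `sqrt(2 ln t / n_i)` bonus are the standard textbook choices, not anything specific from the lecture):

```python
import math
import random

def ucb1(arm_probs, horizon, seed=0):
    """Play the arm with the highest optimistic estimate:
    empirical mean + exploration bonus sqrt(2 ln t / n_i)."""
    rng = random.Random(seed)
    n = len(arm_probs)
    counts = [0] * n   # times each arm was pulled
    sums = [0.0] * n   # total reward from each arm
    for t in range(1, horizon + 1):
        if t <= n:
            i = t - 1  # pull each arm once to initialize
        else:
            # UCB index = greedy term + optimism term
            i = max(range(n),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_probs[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts

counts = ucb1([0.3, 0.7], horizon=5000)
# the better arm (index 1) ends up pulled far more often
```

The bonus shrinks as an arm is pulled more, so under-explored arms look optimistic until we've sampled them enough to trust their empirical mean.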
The argument is that regret is bounded by the UCB bonus.
We know that the UCB for arm 2 was greater than for arm 1 (that's why arm 2 was selected). This gives us an inequality that lets us replace the arm 1 terms with arm 2 terms, so everything is bounded by quantities of arm 2 alone, which then lets us bound the regret using only arm 2.
But why do we do this? Why not just leave it as arm 1 and arm 2 together? That's a tighter bound?
Remember that selecting arm 1 gives us 0 regret by definition in the 2 arm scenario so we don't need to include the arm 1 term.
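A sketch of the chain of inequalities above, assuming arm 1 is optimal, arm 2 is pulled at time $t$, the standard bonus $B_i(t) = \sqrt{2 \ln t / N_i(t)}$, and that the confidence intervals hold (the notation is mine, not necessarily the lecture's):

```latex
% Arm 2 was selected, so UCB_2(t) >= UCB_1(t). With high probability
% the true means lie inside their confidence intervals:
%   \mu_1 \le \hat{\mu}_1(t) + B_1(t),  \qquad  \hat{\mu}_2(t) \le \mu_2 + B_2(t).
% Chaining these replaces all arm-1 terms with arm-2 terms:
\mu_1 \;\le\; \hat{\mu}_1(t) + B_1(t)
      \;\le\; \hat{\mu}_2(t) + B_2(t)
      \;\le\; \mu_2 + 2\,B_2(t)
\quad\Longrightarrow\quad
\Delta_2 \;=\; \mu_1 - \mu_2 \;\le\; 2\,B_2(t).
```

So the per-pull regret of the suboptimal arm is at most twice its bonus, and since pulling arm 1 incurs zero regret, only arm 2's terms appear in the final bound.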
Remember that I is the number of arms.
Thompson Sampling performs better for multi-armed bandits, but it's not very applicable to modern RL.