17/03/2026 12:02 PM - Lecture
Project phase 1 deadline: March 31st.
Still on lecture 16/17. It's slightly different from last time.
The intuition: pick the arm where we think the reward might be highest. Always be greedy with respect to the best reward, but be optimistic about what that reward might be.
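A minimal sketch of that "greedy but optimistic" idea (UCB1 on Bernoulli arms, my own illustration — arm probabilities, horizon, and the `sqrt(2 ln t / n_i)` bonus are the standard textbook choices, not anything specific from the lecture):

```python
import math
import random

def ucb1(arm_probs, horizon, seed=0):
    """Play the arm with the highest optimistic estimate:
    empirical mean + exploration bonus sqrt(2 ln t / n_i)."""
    rng = random.Random(seed)
    n = len(arm_probs)
    counts = [0] * n   # times each arm was pulled
    sums = [0.0] * n   # total reward from each arm
    for t in range(1, horizon + 1):
        if t <= n:
            i = t - 1  # pull each arm once to initialize
        else:
            # UCB index = greedy term + optimism term
            i = max(range(n),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_probs[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts

counts = ucb1([0.3, 0.7], horizon=5000)
# the better arm (index 1) ends up pulled far more often
```

The bonus shrinks as an arm is pulled more, so under-explored arms look optimistic until we've sampled them enough to trust their empirical mean.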
The argument is that regret is bounded by the UCB bonus.
We know that the UCB for arm 2 was greater than for arm 1 (that's why arm 2 was selected). This gives us an inequality that lets us replace the arm 1 terms with arm 2 terms, so everything is bounded by quantities of arm 2 alone, which then lets us bound the regret using only arm 2.
But why do we do this? Why not just leave it as arm 1 and arm 2 together? That's a tighter bound?
Remember that selecting arm 1 gives us 0 regret by definition in the 2 arm scenario so we don't need to include the arm 1 term.
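A sketch of the chain of inequalities above, assuming arm 1 is optimal, arm 2 is pulled at time $t$, the standard bonus $B_i(t) = \sqrt{2 \ln t / N_i(t)}$, and that the confidence intervals hold (the notation is mine, not necessarily the lecture's):

```latex
% Arm 2 was selected, so UCB_2(t) >= UCB_1(t). With high probability
% the true means lie inside their confidence intervals:
%   \mu_1 \le \hat{\mu}_1(t) + B_1(t),  \qquad  \hat{\mu}_2(t) \le \mu_2 + B_2(t).
% Chaining these replaces all arm-1 terms with arm-2 terms:
\mu_1 \;\le\; \hat{\mu}_1(t) + B_1(t)
      \;\le\; \hat{\mu}_2(t) + B_2(t)
      \;\le\; \mu_2 + 2\,B_2(t)
\quad\Longrightarrow\quad
\Delta_2 \;=\; \mu_1 - \mu_2 \;\le\; 2\,B_2(t).
```

So the per-pull regret of the suboptimal arm is at most twice its bonus, and since pulling arm 1 incurs zero regret, only arm 2's terms appear in the final bound.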
Remember that I is the number of arms.
Thompson Sampling performs better for multi-armed bandits, but it's not very applicable to modern RL.