12/03/2026 12:03 PM - Lecture
Exploration & Exploitation
If we haven't tried an option, we can't know whether it is optimal.
Also, even if we tried an option in the past, it may be worth trying again now that we have new skills.
But we also need to exploit and refine strategies we already have to see how good they are.
Multi-arm bandit
The arm you pull may be chosen randomly, and the reward you get is random. But the reward is stationary: the same distribution at every timestep.
Epsilon-greedy: with probability 1 - epsilon, greedily choose the arm with the highest Monte-Carlo estimated expected value; with probability epsilon, pick a random arm.
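A minimal sketch of epsilon-greedy for a stationary bandit. The `pull(arm)` interface is an assumption (the lecture doesn't specify one); the Monte-Carlo estimate is maintained as an incremental mean.

```python
import random

def epsilon_greedy_bandit(pull, n_arms, n_steps, epsilon=0.1):
    """Epsilon-greedy on a stationary multi-armed bandit.
    `pull(arm)` is an assumed interface returning a stochastic reward."""
    counts = [0] * n_arms       # times each arm was pulled
    means = [0.0] * n_arms      # Monte-Carlo estimate of each arm's mean
    total = 0.0
    for _ in range(n_steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                    # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
        total += r
    return means, total
```

Note the fixed epsilon means we keep exploring at a constant rate forever, which is exactly what the diminishing-epsilon question below is about.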
Upper confidence bound
Works best for discrete (even very large) action sets
- Play each arm once (collect an initial dataset)
- Play greedy actions, but add a UCB exploration bonus: an upper confidence bound chosen so that, with high confidence, the arm's true value is less than that bound.
What if the means are very large? The bonus itself doesn't change, but relative to large means it becomes less important. Shouldn't simply rescaling every reward leave our exploration behavior unchanged?
Here K is the total number of plays we will make, so it doesn't change after we set it initially? So this is not an infinite-horizon setting?
But what about epsilon greedy with diminishing epsilon?
Take insights from empirical regret minimization for my tree idea; I was thinking about that previously. Think about the advantage and the maximum score over all descendants of the current node.
"Regret is the difference between the payoff you could have received if you had perfect information (the optimal decision in hindsight) and the payoff you actually received based on your model's decision."
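The quoted definition reduces to a simple computation: sum, over all steps, the gap between the hindsight-optimal mean and the mean of the arm actually chosen. A tiny sketch:

```python
def cumulative_regret(optimal_mean, chosen_means):
    """Regret after T steps: sum over t of (mu* - mu_{a_t}), where mu*
    is the mean of the best arm in hindsight and mu_{a_t} is the true
    mean of the arm the policy actually chose at step t."""
    return sum(optimal_mean - m for m in chosen_means)
```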
We know the optimal policy at training time, so we can use empirical risk minimization.
I should go through the math and see whether my current baseline is already doing this implicitly.
To use UCB in my system, we could first generate a large number of candidates and then cluster them using entailment. Then choose which cluster to expand based on the number of times we have explored that path. This would replace the current expansion strategy of selecting a path uniformly at random.
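A hypothetical sketch of that selection step, UCT-style. Everything here is illustrative: `clusters` as (score estimate, visit count) pairs, the constant `c`, and the rule that unvisited clusters are expanded first are all my assumptions, not details from the notes.

```python
import math

def select_cluster(clusters, total_visits, c=1.4):
    """Hypothetical: pick which candidate cluster to expand next,
    UCT-style, instead of uniform random selection.
    `clusters` is a list of (score_estimate, visit_count) pairs."""
    best, best_val = None, float("-inf")
    for i, (score, n) in enumerate(clusters):
        if n == 0:
            return i  # expand every cluster at least once
        val = score + c * math.sqrt(math.log(total_visits) / n)
        if val > best_val:
            best, best_val = i, val
    return best
```

Under-visited clusters get a large bonus, so exploration concentrates on paths that are either promising or rarely tried.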
Note that K cannot depend on the state, otherwise we cannot pull the expectation inside. I think?
The first term is the least optimistic difference between the optimal arm and arm i_k, and the least optimistic difference between the estimated and true mean. But this also corresponds to being most optimistic about how good each arm could be.
But for a confidence interval you do have to set a parameter: the probability that the true value lies inside the interval.