
07/04/2026 12:05 PM - Lecture

⬅️ [24/03/2026 12:02 PM - Lecture](<./24_03_2026 12_02 PM - Lecture.md>) | ⬆️ [ECE 567](<./README.md>) | [09/04/2026 12:02 PM - Lecture](<./09_04_2026 12_02 PM - Lecture.md>) ➡️


lecture-22-contrastive-rl.pdf

Contrastive Reinforcement Learning

How do we compute the partition function if we just learn the probability distribution directly?

Not sure what's actually going on on these early slides

Turn density estimation into classification.
Is this data or is it noise?

data is drawn from the data distribution.
Noise is drawn from another chosen distribution that is different from the data distribution.
This can be fit into standard classification frameworks, e.g. if your data is cats and your noise is dogs (or anything else).
Except that for our data and noise distributions, the same sample can be valid under both.

Why does this avoid normalization?

Learning a classifier gives us the density function.

What constraints do we have on the noise distribution? Is there a theoretically optimal noise distribution?

Thm: If sampling from data or noise is equally likely (prob 1/2 each; if not, the equation is modified accordingly), then the optimal model is $f_\theta(x) = \log\frac{P_+(x)}{P_-(x)}$

This is proven on the slides.

We know $P_-$. So then $P_+(x) = P_-(x)\cdot e^{f_\theta(x)}$
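A minimal sketch of this whole setup (noise-contrastive estimation), with illustrative choices that are my own and not from the slides: the data is one Gaussian, the noise is another Gaussian we can evaluate, and a logistic-regression classifier's logit plays the role of $f_\theta(x)$, so $P_+$ can be read off without ever computing a partition function.

```python
# Minimal NCE sketch (illustrative; distributions and names are my own assumptions).
# Data ~ N(2, 1), noise ~ N(0, 1), so the true log-ratio f*(x) = log P_+(x) - log P_-(x)
# is linear in x, and a logistic regression is expressive enough to recover it.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
x_data = rng.normal(2.0, 1.0, n)    # samples from the (unknown) data distribution P_+
x_noise = rng.normal(0.0, 1.0, n)   # samples from the known noise distribution P_-

# Binary classification: label 1 = data, label 0 = noise (equal counts -> prior 1/2).
X = np.concatenate([x_data, x_noise]).reshape(-1, 1)
y = np.concatenate([np.ones(n), np.zeros(n)])
clf = LogisticRegression().fit(X, y)

# The classifier's logit is f_theta(x) ~= log(P_+(x) / P_-(x)),
# so P_+(x) ~= P_-(x) * exp(f_theta(x)), with no partition function needed
# because P_- is already a proper density.
x = np.linspace(-3, 6, 7).reshape(-1, 1)
f = clf.decision_function(x)                        # estimated log density ratio
p_plus_est = norm.pdf(x.ravel(), 0.0, 1.0) * np.exp(f)
print(np.round(p_plus_est, 3))
print(np.round(norm.pdf(x.ravel(), 2.0, 1.0), 3))   # ground-truth P_+ for comparison
```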

Assume the positive and negative distributions have the same support (their PDFs are defined over the same space of $x$).

We only consider the ideal scenario where our model is expressive enough to solve the problem perfectly, i.e. it can pick the value $f(x)$ that optimizes the integrand at each point independently, and so can globally minimize the loss.

Then if we take the derivative of the integrand with respect to $f(x)$ at each point individually, it must be zero at the optimum, since the optimal $f$ need not change.

If the model were not expressive enough, we would want to change it when looking at any individual point.
Also note that this isn't really over samples; it is over the domain $x$.

I'm not getting the next step immediately. How does the optimality condition give us the next part? Does this just have to hold at every $x$ because the gradient is 0?

We actually took the gradient with respect to $f$.
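A sketch of that step (my reconstruction, writing $\sigma$ for the sigmoid and using the equal-prior case): the loss is an integral over the domain, so we can set the derivative of the integrand with respect to the value $f(x)$ to zero at each $x$ separately.

$$\mathcal{L}(f) = -\int \tfrac{1}{2}\Big[\, P_+(x)\log \sigma\big(f(x)\big) + P_-(x)\log\big(1-\sigma(f(x))\big) \Big]\, dx$$

$$\frac{\partial}{\partial f(x)}: \quad P_+(x)\big(1-\sigma(f(x))\big) - P_-(x)\,\sigma(f(x)) = 0 \;\;\Rightarrow\;\; e^{f^*(x)} = \frac{\sigma(f^*(x))}{1-\sigma(f^*(x))} = \frac{P_+(x)}{P_-(x)}$$

So the optimality condition holds at every $x$ because the pointwise gradient is zero everywhere, and that is exactly what gives $f^*(x) = \log\frac{P_+(x)}{P_-(x)}$.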

Goal-Conditioned RL

No reward, only a goal at the end. Or basically just very sparse reward.
Goal-conditioned policy: $\pi(a \mid s, g)$. Basically, give the goal as part of the state.
$\rho(g)$ is a prior distribution over goals. We choose this.

We can then define a random-goal policy (or marginalized policy): at each step, a goal is sampled and then the action is chosen using the goal-conditioned policy.
$\pi(a | s) = \sum_g \pi( a | s, g) \rho (g)$
Considers all possible goals you may have.

We want to define a goal-dependent reward: if you can take just one step and achieve the goal, you get a reward for that.

If we pick a random future time using a geometric distribution and then ask what the probability of being in state $s$ at that time is, that is exactly the discounted probability of being in a state at a future time from the occupancy-measure slide.

This then allows us to define the Q function in terms of this distribution. In fact it simply is this distribution.
This is intuitive. Basically we are asking what the probability is that we reach the goal state in the future (discounted by being in the future).
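Written out (my reconstruction, following the usual contrastive-RL convention for the reward; the exact time indexing may differ from the slides): with the goal-dependent reward $r_g(s, a) = (1-\gamma)\, p(s_{t+1} = g \mid s_t = s, a_t = a)$, unrolling the Bellman equation gives

$$Q^\pi(s, a, g) = (1-\gamma) \sum_{\Delta = 0}^{\infty} \gamma^{\Delta}\, p^\pi(s_{t+1+\Delta} = g \mid s_t = s, a_t = a),$$

which is exactly the probability of landing on $g$ at a future step chosen geometrically (each extra step kept with probability $\gamma$), i.e. the discounted occupancy measure.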

We use the discount to make it non-infinite, but what if we assume an absorbing Markov chain?

We converted something related to a Q function into a distribution over $(s, a, g)$. Take $x = (s, a, g)$; then our Q distribution can be thought of as $f(x)$, just a distribution over states.

This is actually interesting for my setup. I have a similar situation with a final goal and some probability of reaching it in the future, and here we are learning the probability of getting it right in the future, which I do need. But it's not as good as distributional RL, since we only get yes or no answers.

We sample a time and then create a goal that is satisfied by being in the state.
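A rough sketch of how that turns into training data (the buffer layout, names, and shapes here are my assumptions, not the lecture's): positives pair a state-action with a future state of the same trajectory, with the offset drawn geometrically; negatives reuse the same state-action but pair it with a state from a random other trajectory. The critic is then trained exactly like the classifier in the density-ratio example above.

```python
# Sketch of the data-sampling step for a contrastive goal-conditioned critic
# (array shapes and names are my assumptions, not from the lecture).
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99
num_traj, horizon, obs_dim, act_dim = 128, 100, 4, 2
# Fake replay buffer: states[i, t] and actions[i, t] for trajectory i, step t.
states = rng.normal(size=(num_traj, horizon, obs_dim))
actions = rng.normal(size=(num_traj, horizon, act_dim))

def sample_batch(batch_size=256):
    """Positive goals: a future state of the same trajectory, offset ~ Geometric(1 - gamma), >= 1.
    Negative goals: a state from a different, randomly chosen trajectory."""
    i = rng.integers(num_traj, size=batch_size)                    # which trajectory
    t = rng.integers(horizon - 1, size=batch_size)                 # current step
    delta = rng.geometric(1.0 - gamma, size=batch_size)            # geometric future offset
    t_goal = np.minimum(t + delta, horizon - 1)                    # clip to trajectory end
    s, a = states[i, t], actions[i, t]
    goal_pos = states[i, t_goal]                                   # label 1: actually reached later
    j = rng.integers(num_traj, size=batch_size)
    goal_neg = states[j, rng.integers(horizon, size=batch_size)]   # label 0: random goal
    return s, a, goal_pos, goal_neg

# A critic f_theta(s, a, g) would then be trained as a binary classifier on these pairs
# (same cross-entropy loss as before); its logit estimates the log ratio of the
# discounted probability of reaching g to the marginal probability of g.
s, a, gp, gn = sample_batch()
print(s.shape, a.shape, gp.shape, gn.shape)
```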

Yea, this is kinda interesting. Perhaps if we can generate questions that satisfy the given answer, then we can use something similar to this. But we need a distribution over goals? Wait, what is the goal in this case? We need to know the probability of the goal in order to recover Q, but I don't know the goal? No, you don't need to know the goal, you just need to take the argmax since we are taking discrete actions. But I don't have discrete actions, so...

What if we have a specific goal set and we cannot just sample a random time?

We might want to use this because it is much easier to train a classifier, but it also has its downsides.

I should look more into this.


⬅️ [24/03/2026 12:02 PM - Lecture](<./24_03_2026 12_02 PM - Lecture.md>) | ⬆️ [ECE 567](<./README.md>) | [09/04/2026 12:02 PM - Lecture](<./09_04_2026 12_02 PM - Lecture.md>) ➡️