Missed lectures - Andy's notes
Reinforcement Learning from Human Feedback
Reward Misspecification
- The handcrafted reward fails to capture the true objective
Reward Hacking
- The agent identifies a "loophole" that maximizes the reward but doesn't learn the task at hand
- Examples
- Navigation/Robotics: checkpoints are set up for a car to pass on its way to the goal, but the robot learns to cross one checkpoint back and forth instead of driving forward
- This motivates the transition from hand-crafted reward signals to preference-based rewards
RLHF
- Famous Example
- GPT-3
- What is a GPT Transformer?
- The goal here is to predict the next token
- GPT Transformer
- Step 1: Input text > tokens > token embeddings and positional embeddings
- Step 2: Embedding > Transformer block 1 > Transformer block n
- Step 3: Linear layer > Softmax over vocab > Next-token distribution (sketched in code below)
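A minimal sketch of those three steps, assuming PyTorch; the class name `TinyGPT`, the layer sizes, and the use of `nn.TransformerEncoderLayer` with a causal mask as a stand-in for a GPT block are illustrative assumptions, not GPT-3's actual architecture code.

```python
# Illustrative sketch of a GPT-style forward pass (not GPT-3's real implementation).
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=50257, d_model=128, n_heads=4, n_blocks=2, max_len=256):
        super().__init__()
        # Step 1: token embeddings + positional embeddings
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Step 2: a stack of transformer blocks (encoder layers run with a causal mask)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_blocks)
        # Step 3: linear layer mapping hidden states back to vocabulary logits
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        pos = torch.arange(seq_len, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)                  # Step 1
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)  # causal attention
        x = self.blocks(x, mask=mask)                                    # Step 2
        logits = self.lm_head(x)                                         # Step 3
        return logits.softmax(dim=-1)                                    # next-token distribution

# Usage: a distribution over the vocabulary for the next token at each position.
model = TinyGPT()
probs = model(torch.randint(0, 50257, (1, 16)))   # shape (1, 16, vocab_size)
```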
RL from human preferences
- Why preferences and not ratings?
- Humans are better at relative rather than absolute judgments
- Different humans interpret absolute scores differently (e.g., what a 5 means vs. a 7)
- Relative judgments allow for more consistency across annotators
- Example
- Consider an MDP (S, A, P, H)
- We do not assume access to the true reward
- Instead, we observe preferences over trajectories
- Our goal here is to learn a policy that induces trajectories preferred by humans
Reward model learning from preferences
- For every prompt we have two answers
- We do not give rewards/preferences to partial answers
- Once we are given the two complete responses, the annotator marks one as preferred
- Use the Bradley-Terry model to get the probability that one trajectory is preferred over another (see the sketch below)
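A minimal sketch of that probability, assuming a learned reward model has already assigned scalar scores to the two complete responses (the scores in the example are made-up numbers):

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(trajectory A preferred over trajectory B) under the Bradley-Terry model:
    P(A > B) = exp(r_A) / (exp(r_A) + exp(r_B)) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Example: if the reward model scores answer A at 2.0 and answer B at 0.5,
# A is preferred with probability sigmoid(1.5) ~= 0.82.
print(bradley_terry_prob(2.0, 0.5))
```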
Training an LLM with human feedback using PPO-Clip
- We need to train three large models (reward, critic, actor)
- We first collect trajectories (prompt, response 1, response 2)
- We then annotate which of the two responses is preferred
- We train a reward model on these preference pairs using binary cross-entropy (a sketch follows)
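A hedged sketch of that loss: under the Bradley-Terry model the "chosen vs. rejected" label makes binary cross-entropy reduce to -log sigmoid(r_chosen - r_rejected); the tensors of reward-model scores below are placeholders.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on preference pairs.

    r_chosen / r_rejected are the scalar scores the reward model assigns to the
    preferred and non-preferred responses for the same prompt. Since the label is
    always 'chosen wins', BCE reduces to -log sigmoid(r_chosen - r_rejected).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scores for a batch of 3 preference pairs.
loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.1, 0.4, -0.5]))
print(loss.item())
```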
- The reward model is hard to train, requires a lot of engineering, and often ends up misspecified
- Solution: Move to DPO
- This works because the KL-regularized objective has a closed-form optimal policy proportional to the reference policy reweighted by the exponentiated reward (scaled by the KL parameter); rearranging, the reward can be written in terms of the policy itself and used as the probability of choosing the preferred trajectory
- This eliminates the need for training a reward model and a critic
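A minimal sketch of the resulting DPO loss, assuming summed log-probabilities of each response under the policy and a frozen reference model are already computed; `beta` is the KL-regularization strength.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss (sketch).

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)),
    so the Bradley-Terry preference probability is written directly in terms of
    the policy -- no separate reward model or critic is needed.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Dummy summed log-probs for a batch of 2 preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -7.0]), torch.tensor([-6.5, -7.2]),
                torch.tensor([-5.5, -7.1]), torch.tensor([-6.0, -7.0]))
print(loss.item())
```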
RL from Verifiable Rewards (RLVR)
- Key difference from RLHF: a human gives a preference (chooses one output over the other), whereas RLVR uses a logical/programmatic check of whether the answer is right (only one output is needed, and the verifier checks it)
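A toy illustration of a verifiable reward: a programmatic check (here, exact match on a math answer, with the "Answer:" parsing convention assumed purely for illustration) replaces the human preference query.

```python
def verifier_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the known ground truth, else 0.0.
    Assumes (for illustration) the answer is written after 'Answer:'."""
    answer = model_output.split("Answer:")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(verifier_reward("Reasoning... Answer: 42", "42"))  # 1.0
print(verifier_reward("Reasoning... Answer: 41", "42"))  # 0.0
```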
GRPO - Group Relative Policy Optimization
- For PPO we need a critic that estimates the value function so we can form the advantage A(s,a) = Q(s,a) - V(s); this needs lots of data, and error from the critic can be the dominating factor
- For GRPO, we remove the critic network and move to group-based relative rewards: for each initial state (prompt) we sample a group of outputs from the current policy, get a reward from the verifier for each trajectory, and convert the absolute rewards into relative ones by comparing within the group
- We get a normalized advantage by subtracting the group mean from r_i and dividing by the group standard deviation: A_i = (r_i - mean(r)) / std(r) (see the sketch after this list)
- We then replace the critic-based advantage with this group-relative advantage, so GRPO is otherwise structurally similar to PPO
- This is great because we avoid a learned critic/value model
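A minimal sketch of the group-relative advantage described above: score one group of sampled outputs with the verifier, then standardize the rewards within the group.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed within one group of outputs sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 6 sampled answers to one prompt, the verifier says 2 are correct.
print(group_relative_advantages([1, 0, 0, 1, 0, 0]))
```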
Zeroth-Order Policy Optimization for RLHF
- Optimization review: gradient descent takes the gradient, scales it by a learning rate, and subtracts it from the current point to get the next point, which is closer to the minimum: x_{k+1} = x_k - h * grad F(x_k)
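A tiny sketch of that update on a quadratic objective (the objective and step size are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, num_steps=100):
    """Plain gradient descent: x_{k+1} = x_k - h * grad F(x_k)."""
    x = np.array(x0, dtype=np.float64)
    for _ in range(num_steps):
        x = x - step_size * grad_f(x)
    return x

# Minimize F(x) = ||x||^2, whose gradient is 2x; the minimizer is the origin.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))
```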
Zeroth-Order Method: Gradient Free
- Pick a random direction from a standard Gaussian and move to the perturbed point x_k + mu_k * u. The goal is to make the function value as small as possible, so compare the value at the perturbed point with the value at x_k: if the difference is negative (smaller), keep moving in that direction; if positive (bigger), go the opposite way (sketched after this list)
- There is a connection to RLHF here: the Bradley-Terry model uses a similar comparison to decide whether one trajectory is preferred over another
- REMEMBER: if you can compute the gradient, using it is faster; zeroth-order methods are only for when you cannot get a gradient
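A sketch of one gradient-free step under those assumptions: perturb along a random Gaussian direction, compare function values, and move toward whichever side makes F smaller (the finite difference encodes the "smaller, keep going / bigger, go the other way" check).

```python
import numpy as np

def zeroth_order_step(f, x, mu=1e-2, step_size=0.1, rng=np.random.default_rng(0)):
    """One gradient-free update. The two-point estimate
    g = (F(x + mu*u) - F(x)) / mu * u  approximates the gradient in expectation,
    so moving against it decreases F on average."""
    u = rng.standard_normal(x.shape)             # random Gaussian direction
    g = (f(x + mu * u) - f(x)) / mu * u           # finite difference along u
    return x - step_size * g

# Minimize F(x) = ||x||^2 using only function evaluations.
f = lambda x: float(np.sum(x ** 2))
x = np.array([3.0, -4.0])
for _ in range(500):
    x = zeroth_order_step(f, x)
print(x)  # close to the origin
```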
Why does a random perturbation work?
- The mean of the update is the gradient of the function
- There is a proof on lec 21 slide 12
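As a hedged aside, the standard Gaussian-smoothing identity behind this claim (stated here as an assumption about what the slide proof shows; the slide is the authoritative version):

```latex
% For u ~ N(0, I_d) and smoothing radius \mu > 0, define F_\mu(x) = \mathbb{E}_u[F(x + \mu u)].
% Then the two-point estimator is an unbiased estimate of the smoothed gradient:
\mathbb{E}_{u}\!\left[\frac{F(x+\mu u)-F(x)}{\mu}\,u\right] = \nabla F_{\mu}(x),
\qquad \nabla F_{\mu}(x) \to \nabla F(x) \ \text{as}\ \mu \to 0,
% so on average the random-perturbation update moves along the true gradient.
```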
Stochastic Zeroth-Order Policy Optimization for RLHF
- Use a perturbation to get a slightly different actor; now we have two actors, which produce two different sets of trajectories, and we ask a human which trajectories are preferred
- Steps
- Sample a random direction v_t and form a perturbed policy theta' = theta + mu_t*v_t
- For each comparison: sample a batch from each actor, query M human evaluators on which batch is better, and record the answers
- Aggregate the preferences into a policy preference to find the improvement direction
- Update the policy
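A sketch of steps 1-3, where `sample_batch` and `evaluator_prefers_second` are hypothetical stand-ins for rolling out a policy and querying one human evaluator; the final update (step 4) uses either the estimated value gap (known preference model) or just the majority direction (unknown model), as sketched in the two sections below.

```python
import numpy as np

def szpo_collect_preferences(theta, sample_batch, evaluator_prefers_second,
                             mu=0.1, num_evaluators=5, rng=np.random.default_rng(0)):
    """Steps 1-3 of one SZPO iteration (sketch).

    sample_batch(theta)             -> a batch of trajectories from policy theta
    evaluator_prefers_second(a, b)  -> True if one evaluator prefers batch b over batch a
    """
    v = rng.standard_normal(theta.shape)     # step 1: random direction v_t
    theta_pert = theta + mu * v              # perturbed policy theta' = theta + mu_t * v_t
    base = sample_batch(theta)               # step 2: rollouts from the current actor
    pert = sample_batch(theta_pert)          #         rollouts from the perturbed actor
    votes = [evaluator_prefers_second(base, pert) for _ in range(num_evaluators)]
    return v, float(np.mean(votes))          # step 3: direction and fraction preferring theta'
```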
SZPO w/ Known Preference Model
- If we know the preferences come from the Bradley-Terry model, we can estimate the probability that a human will think the perturbed policy is better than the current policy
- After estimating that probability, we can invert the logistic function (the RHS of the Bradley-Terry model) and recover the difference
- Remember: the function differences here are differences between the value functions of the two policies
- So we need M evaluators to estimate the probability that one actor is preferred over the other, and N batches to get the average difference
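A small sketch of that inversion: the empirical preference rate from the M evaluators is pushed through the inverse sigmoid (the logit) to recover the estimated value difference; the clipping constant is an illustrative detail to keep the logit finite.

```python
import math

def estimate_value_difference(num_prefer_perturbed: int, num_evaluators: int,
                              eps: float = 1e-3) -> float:
    """If P(perturbed > current) = sigmoid(V_perturbed - V_current) (Bradley-Terry),
    then V_perturbed - V_current = logit(p) = log(p / (1 - p))."""
    p_hat = num_prefer_perturbed / num_evaluators    # empirical preference rate
    p_hat = min(max(p_hat, eps), 1.0 - eps)          # keep the logit finite
    return math.log(p_hat / (1.0 - p_hat))

# Example: 8 of 10 evaluators prefer the perturbed actor -> gap ~ logit(0.8) ~= 1.39
print(estimate_value_difference(8, 10))
```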
SZPO w/ Unknown Preference Model
- The issue here is that we cannot estimate the value difference between the perturbed and current policies
- Who cares about the difference: just estimate the direction of improvement and then move in that direction with a constant step size
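A minimal sketch of that constant-step update, paired with the preference-collection sketch above; the names are illustrative.

```python
def szpo_unknown_model_update(theta, v, prefer_perturbed_rate, step_size=0.05):
    """Constant-size step along the perturbation direction v if the perturbed
    actor won the majority vote, otherwise a step the other way; only the
    direction of improvement is used, never its magnitude."""
    sign = 1.0 if prefer_perturbed_rate > 0.5 else -1.0
    return theta + step_size * sign * v
```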