Missed lectures - Andy's notes
Reinforcement Learning from Human Feedback
Reward Misspecification
- The handcrafted reward fails to capture the true objective
Reward Hacking
- The agent identifies a "loophole" that maximizes the reward but doesn't learn the task at hand
- Examples
- Navigation/Robotics: checkpoints are set up for a car to pass on its way to the goal, but the robot learns to cross one checkpoint back and forth instead of driving forward
- This motivates the transition from hand-crafted reward signals to preference-based rewards
RLHF
- Famous Example
- GPT-3
- What is a GPT Transformer?
- The goal here is to predict the next token
- GPT Transformer
- Step 1: Input text > tokens > token embeddings and positional embeddings
- Step 2: Embedding > Transformer block 1 > Transformer block n
- Step 3: Linear layer > Softmax over vocab > Next-token distribution (sketched in code below)
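A minimal sketch of those three steps, assuming PyTorch; the class name `TinyGPT`, the layer sizes, and the use of `nn.TransformerEncoderLayer` with a causal mask as a stand-in for a GPT block are illustrative assumptions, not GPT-3's actual architecture code.

```python
# Illustrative sketch of a GPT-style forward pass (not GPT-3's real implementation).
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=50257, d_model=128, n_heads=4, n_blocks=2, max_len=256):
        super().__init__()
        # Step 1: token embeddings + positional embeddings
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Step 2: a stack of transformer blocks (encoder layers run with a causal mask)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_blocks)
        # Step 3: linear layer mapping hidden states back to vocabulary logits
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        pos = torch.arange(seq_len, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)                  # Step 1
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)  # causal attention
        x = self.blocks(x, mask=mask)                                    # Step 2
        logits = self.lm_head(x)                                         # Step 3
        return logits.softmax(dim=-1)                                    # next-token distribution

# Usage: a distribution over the vocabulary for the next token at each position.
model = TinyGPT()
probs = model(torch.randint(0, 50257, (1, 16)))   # shape (1, 16, vocab_size)
```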
RL from human preferences
- Why preferences and not ratings?
- Humans are better at relative rather than absolute judgments
- Different humans interpret absolute scores differently (e.g., what a 5 means vs. a 7)
- Relative judgments allow for more consistency across annotators
- Example
- Consider an MDP (S, A, P, H)
- We do not assume access to the true reward
- Instead, we observe preferences over trajectories
- Our goal here is to learn a policy that induces trajectories preferred by humans
Reward model learning from preferences
- For every prompt we have two answers
- We do not give rewards/preferences to partial answers
- Once we are given the two complete responses, the annotator marks one as preferred
- Use the Bradley-Terry model to get the probability that one trajectory is preferred over another (see the sketch below)
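A minimal sketch of that probability, assuming a learned reward model has already assigned scalar scores to the two complete responses (the scores in the example are made-up numbers):

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(trajectory A preferred over trajectory B) under the Bradley-Terry model:
    P(A > B) = exp(r_A) / (exp(r_A) + exp(r_B)) = sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

# Example: if the reward model scores answer A at 2.0 and answer B at 0.5,
# A is preferred with probability sigmoid(1.5) ~= 0.82.
print(bradley_terry_prob(2.0, 0.5))
```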
Training an LLM with human feedback using PPO-Clip
- We need to train three large models (reward, critic, actor)
- We first collect trajectories (prompt, response 1, response 2)
- We then annotate which of the two responses is preferred
- We train a reward model on these preference pairs using binary cross-entropy (a sketch follows)
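A hedged sketch of that loss: under the Bradley-Terry model the "chosen vs. rejected" label makes binary cross-entropy reduce to -log sigmoid(r_chosen - r_rejected); the tensors of reward-model scores below are placeholders.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on preference pairs.

    r_chosen / r_rejected are the scalar scores the reward model assigns to the
    preferred and non-preferred responses for the same prompt. Since the label is
    always 'chosen wins', BCE reduces to -log sigmoid(r_chosen - r_rejected).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scores for a batch of 3 preference pairs.
loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.1, 0.4, -0.5]))
print(loss.item())
```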
- The reward model is hard to train, requires a lot of engineering, and often ends up misspecified
- Solution: Move to DPO
- This works because the KL-regularized objective has a closed-form optimal policy proportional to the reference policy reweighted by the exponentiated reward (scaled by the KL parameter); rearranging, the reward can be written in terms of the policy itself and used as the probability of choosing the preferred trajectory
- This eliminates the need for training a reward model and a critic
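A minimal sketch of the resulting DPO loss, assuming summed log-probabilities of each response under the policy and a frozen reference model are already computed; `beta` is the KL-regularization strength.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss (sketch).

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)),
    so the Bradley-Terry preference probability is written directly in terms of
    the policy -- no separate reward model or critic is needed.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Dummy summed log-probs for a batch of 2 preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -7.0]), torch.tensor([-6.5, -7.2]),
                torch.tensor([-5.5, -7.1]), torch.tensor([-6.0, -7.0]))
print(loss.item())
```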
RL from Verifiable Rewards (RLVR)
- Key difference from RLHF: a human gives a preference (chooses one output over the other), whereas RLVR uses a logical/programmatic check of whether the answer is right (only one output is needed, and the verifier checks it)
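A toy illustration of a verifiable reward: a programmatic check (here, exact match on a math answer, with the "Answer:" parsing convention assumed purely for illustration) replaces the human preference query.

```python
def verifier_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the known ground truth, else 0.0.
    Assumes (for illustration) the answer is written after 'Answer:'."""
    answer = model_output.split("Answer:")[-1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(verifier_reward("Reasoning... Answer: 42", "42"))  # 1.0
print(verifier_reward("Reasoning... Answer: 41", "42"))  # 0.0
```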
GRPO - Group Relative Policy Optimization
- For PPO we need a critic that estimates the value function so we can form the advantage A(s,a) = Q(s,a) - V(s); this needs lots of data, and error from the critic can be the dominating factor
- For GRPO, we remove the critic network and move to group-based relative rewards: for each initial state (prompt) we sample a group of outputs from the current policy, get a reward from the verifier for each trajectory, and convert the absolute rewards into relative ones by comparing within the group
- We get a normalized advantage by subtracting the group mean from r_i and dividing by the group standard deviation: A_i = (r_i - mean(r)) / std(r) (see the sketch after this list)
- We then replace the critic-based advantage with this group-relative advantage, so GRPO is otherwise structurally similar to PPO
- This is great because we avoid a learned critic/value model
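A minimal sketch of the group-relative advantage described above: score one group of sampled outputs with the verifier, then standardize the rewards within the group.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed within one group of outputs sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 6 sampled answers to one prompt, the verifier says 2 are correct.
print(group_relative_advantages([1, 0, 0, 1, 0, 0]))
```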
Zeroth-Order Policy Optimization for RLHF
- Optimization review: gradient descent takes the gradient, scales it by a learning rate, and subtracts it from the current point to get the next point, which is closer to the minimum: x_{k+1} = x_k - h * grad F(x_k)
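A tiny sketch of that update on a quadratic objective (the objective and step size are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, num_steps=100):
    """Plain gradient descent: x_{k+1} = x_k - h * grad F(x_k)."""
    x = np.array(x0, dtype=np.float64)
    for _ in range(num_steps):
        x = x - step_size * grad_f(x)
    return x

# Minimize F(x) = ||x||^2, whose gradient is 2x; the minimizer is the origin.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))
```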
Zeroth-Order Method: Gradient Free
- Pick a random direction from a standard Gaussian and move to the perturbed point x_k + mu_k * u. The goal is to make the function value as small as possible, so compare the value at the perturbed point with the value at x_k: if the difference is negative (smaller), keep moving in that direction; if positive (bigger), go the opposite way (sketched after this list)
- There is a connection to RLHF here: the Bradley-Terry model uses a similar comparison to decide whether one trajectory is preferred over another
- REMEMBER: if you can compute the gradient, using it is faster; zeroth-order methods are only for when you cannot get a gradient
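A sketch of one gradient-free step under those assumptions: perturb along a random Gaussian direction, compare function values, and move toward whichever side makes F smaller (the finite difference encodes the "smaller, keep going / bigger, go the other way" check).

```python
import numpy as np

def zeroth_order_step(f, x, mu=1e-2, step_size=0.1, rng=np.random.default_rng(0)):
    """One gradient-free update. The two-point estimate
    g = (F(x + mu*u) - F(x)) / mu * u  approximates the gradient in expectation,
    so moving against it decreases F on average."""
    u = rng.standard_normal(x.shape)             # random Gaussian direction
    g = (f(x + mu * u) - f(x)) / mu * u           # finite difference along u
    return x - step_size * g

# Minimize F(x) = ||x||^2 using only function evaluations.
f = lambda x: float(np.sum(x ** 2))
x = np.array([3.0, -4.0])
for _ in range(500):
    x = zeroth_order_step(f, x)
print(x)  # close to the origin
```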
Why does a random perturbation work?
- The mean of the update is the gradient of the function
- There is a proof on lec 21 slide 12
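As a hedged aside, the standard Gaussian-smoothing identity behind this claim (stated here as an assumption about what the slide proof shows; the slide is the authoritative version):

```latex
% For u ~ N(0, I_d) and smoothing radius \mu > 0, define F_\mu(x) = \mathbb{E}_u[F(x + \mu u)].
% Then the two-point estimator is an unbiased estimate of the smoothed gradient:
\mathbb{E}_{u}\!\left[\frac{F(x+\mu u)-F(x)}{\mu}\,u\right] = \nabla F_{\mu}(x),
\qquad \nabla F_{\mu}(x) \to \nabla F(x) \ \text{as}\ \mu \to 0,
% so on average the random-perturbation update moves along the true gradient.
```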
Stochastic Zeroth-Order Policy Optimization for RLHF
- Use a perturbation to get a slightly different actor; now we have two actors, which produce two different sets of trajectories, and we ask a human which trajectories are preferred
- Steps
- Sample a random direction v_t and form a perturbed policy theta' = theta + mu_t*v_t
- For each comparison: sample a batch from each actor, query M human evaluators on which batch is better, and record the answers
- Aggregate the preferences into a policy preference to find the improvement direction
- Update the policy
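A sketch of steps 1-3, where `sample_batch` and `evaluator_prefers_second` are hypothetical stand-ins for rolling out a policy and querying one human evaluator; the final update (step 4) uses either the estimated value gap (known preference model) or just the majority direction (unknown model), as sketched in the two sections below.

```python
import numpy as np

def szpo_collect_preferences(theta, sample_batch, evaluator_prefers_second,
                             mu=0.1, num_evaluators=5, rng=np.random.default_rng(0)):
    """Steps 1-3 of one SZPO iteration (sketch).

    sample_batch(theta)             -> a batch of trajectories from policy theta
    evaluator_prefers_second(a, b)  -> True if one evaluator prefers batch b over batch a
    """
    v = rng.standard_normal(theta.shape)     # step 1: random direction v_t
    theta_pert = theta + mu * v              # perturbed policy theta' = theta + mu_t * v_t
    base = sample_batch(theta)               # step 2: rollouts from the current actor
    pert = sample_batch(theta_pert)          #         rollouts from the perturbed actor
    votes = [evaluator_prefers_second(base, pert) for _ in range(num_evaluators)]
    return v, float(np.mean(votes))          # step 3: direction and fraction preferring theta'
```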
SZPO w/ Known Preference Model
- If we know the preferences come from the Bradley-Terry model, we can estimate the probability that a human will think the perturbed policy is better than the current policy
- After estimating that probability, we can invert the logistic function (the RHS of the Bradley-Terry model) and recover the difference
- Remember: the function differences here are differences between the value functions of the two policies
- So we need M evaluators to estimate the probability that one actor is preferred over the other, and N batches to get the average difference
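A small sketch of that inversion: the empirical preference rate from the M evaluators is pushed through the inverse sigmoid (the logit) to recover the estimated value difference; the clipping constant is an illustrative detail to keep the logit finite.

```python
import math

def estimate_value_difference(num_prefer_perturbed: int, num_evaluators: int,
                              eps: float = 1e-3) -> float:
    """If P(perturbed > current) = sigmoid(V_perturbed - V_current) (Bradley-Terry),
    then V_perturbed - V_current = logit(p) = log(p / (1 - p))."""
    p_hat = num_prefer_perturbed / num_evaluators    # empirical preference rate
    p_hat = min(max(p_hat, eps), 1.0 - eps)          # keep the logit finite
    return math.log(p_hat / (1.0 - p_hat))

# Example: 8 of 10 evaluators prefer the perturbed actor -> gap ~ logit(0.8) ~= 1.39
print(estimate_value_difference(8, 10))
```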
SZPO w/ Unknown Preference Model
- The issue here is that we cannot estimate the value difference between the perturbed and current policies
- Who cares about the difference: just estimate the direction of improvement and then move in that direction with a constant step size
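A minimal sketch of that constant-step update, paired with the preference-collection sketch above; the names are illustrative.

```python
def szpo_unknown_model_update(theta, v, prefer_perturbed_rate, step_size=0.05):
    """Constant-size step along the perturbation direction v if the perturbed
    actor won the majority vote, otherwise a step the other way; only the
    direction of improvement is used, never its magnitude."""
    sign = 1.0 if prefer_perturbed_rate > 0.5 else -1.0
    return theta + step_size * sign * v
```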