18/03/2026 2:13 PM - Vikas Talk
How long did it take the student to get RL working? Two years? What?
Uniform space problems
I was thinking about this: how do you go back and explore old strategies with new skills? How do you separate strategies from skills? This is exploration vs. exploitation, but we need to differentiate what exactly it is we're exploring.
How do we specify rewards?
Also relevant to my recent idea of diffusers with optimal paths.
Try not to use custom implementations.
Good hyperparameters. Good strategies.
Lower learning rate.
Some weird things, like small batch sizes, improve RL performance, due to higher gradient noise.
So perhaps even just artificially lower the batch size.
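A quick way to see the batch-size/noise link: the spread of a mini-batch estimate grows as the batch shrinks. A minimal sketch (not from the talk), using a mini-batch mean of synthetic data as a stand-in for a mini-batch gradient:

```python
import random
import statistics

random.seed(0)
# synthetic "per-sample gradients"
population = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def grad_estimate(batch_size):
    # the mini-batch mean stands in for a noisy mini-batch gradient
    return statistics.fmean(random.sample(population, batch_size))

def estimate_noise(batch_size, trials=500):
    # spread of the estimate across repeated mini-batches
    return statistics.stdev(grad_estimate(batch_size) for _ in range(trials))

small_batch_noise = estimate_noise(8)
large_batch_noise = estimate_noise(256)
```

Smaller batches give visibly noisier updates, which is the extra exploration pressure the note is pointing at.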
Domain shifts are useful for generalization, but reduce simulation performance.
Perhaps this is similar to
Reduce the distribution as much as possible.
How do you balance different rewards? It's important to avoid reward hacking: stop your robot from just standing still, you know.
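One common pattern for this is a weighted sum of reward terms plus an explicit guard against the degenerate policy. A minimal sketch with hypothetical weights and term names (`forward_velocity`, `energy_used`, and the weights are my assumptions, not from the talk):

```python
def shaped_reward(forward_velocity, energy_used, alive):
    # hypothetical weights; these need tuning per task
    w_vel, w_energy, w_alive = 1.0, 0.05, 0.1
    r = w_vel * forward_velocity - w_energy * energy_used
    if alive:
        r += w_alive
    # guard against the standing-still exploit:
    # no survival bonus without actual progress
    if abs(forward_velocity) < 0.01:
        r -= w_alive
    return r
```

The guard makes the "stand still and collect the alive bonus" policy strictly worse than moving.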
Just use existing encoders. We can't afford to be sample-inefficient, so start from a good place. This is also better for generalization.
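Reusing an existing encoder usually means excluding its weights from the RL update. A toy sketch of that split, assuming parameters live in a name-to-value dict and pretrained encoder weights share an `encoder.` prefix (both assumptions for illustration):

```python
def split_parameters(params):
    # pretrained encoder weights stay frozen; only the rest is trained by RL
    frozen = {k: v for k, v in params.items() if k.startswith("encoder.")}
    trainable = {k: v for k, v in params.items() if not k.startswith("encoder.")}
    return frozen, trainable

frozen, trainable = split_parameters({"encoder.conv1": 1, "head.fc": 2})
```

In a real framework this would be the equivalent of marking encoder parameters as not requiring gradients and passing only the rest to the optimizer.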
Pretraining -> Training on simulated data -> Training on example human data -> RL
So actually perhaps
Also, the stages just become more problem-specific as you go.
Also, "training on example human data" can just mean oracle data that has some way to cheat to do better.
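The staged pipeline above can be sketched as a simple curriculum loop. The stage names and the `train_stage` callback are placeholders I'm assuming, not anything specified in the talk:

```python
def run_curriculum(model, train_stage):
    # stages ordered general -> problem-specific, per the notes
    stages = [
        "pretraining",
        "simulated_data",
        "human_or_oracle_demos",  # may be oracle data that "cheats"
        "rl_finetuning",
    ]
    for stage in stages:
        model = train_stage(model, stage)
    return model
```

The point of keeping it a plain ordered list is that each stage only has to hand a model to the next, so stages can be swapped or dropped independently.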
Look into the literature on language games.