

⬅️ [19/02/2026 10:28 AM - SURE Applicants](<./19_02_2026 10_28 AM - SURE Applicants.md>) | ⬆️ [2026 - February](<./README.md>) | [25/02/2026 1:23 PM - One-on-one](<./25_02_2026 1_23 PM - One-on-one.md>) ➡️

25/02/2026 12:59 PM

I have discovered `return_tokens_as_token_ids` and `return_token_ids` as ways to get the actual token IDs back from generation so that I can do RL.
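
A minimal sketch of the `return_tokens_as_token_ids` route, assuming these are the vLLM OpenAI-compatible server's extra parameters; the server URL, model name, and prompt are placeholders:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="my-model",                  # placeholder for whatever is being served
    prompt="The capital of France is",
    max_tokens=16,
    logprobs=1,
    extra_body={"return_tokens_as_token_ids": True},
)

# With return_tokens_as_token_ids, each logprob token comes back as a
# string like "token_id:1234"; parse the IDs out for RL bookkeeping.
lp = resp.choices[0].logprobs
token_ids = [int(t.split(":", 1)[1]) for t in lp.tokens]
logprobs = lp.token_logprobs
```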

There are basically three distributions that we need to get the logprobs from.
1. $\pi_{old}$: The logprobs of the tokens as they were generated when we rolled out the trajectories. These can be stored in the sidecar during generation and are used to compute the importance ratio in the loss.
2. $\pi_{ref}$: The logprobs of the trajectory tokens under the base model. These are used to estimate the KL divergence between the new policy and the reference policy. This is generally computed right before the loss is calculated, since it is just a forward pass (no gradients needed) and so does not meaningfully increase VRAM usage. You just disable the PEFT adapter and compute the logprobs; see the sketches after this list.
3. $\pi_{\theta}$: The logprobs of the trajectory tokens under the current model. Used for both the KL divergence and the importance ratios; this is the only one of the three that needs gradients.
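
A minimal sketch of items 2 and 3, assuming the model is a `peft.PeftModel` wrapping a Hugging Face causal LM; the function names and batch tensors are illustrative:

```python
import torch
import torch.nn.functional as F

def token_logprobs(model, input_ids, attention_mask):
    """Per-token logprobs of input_ids under model (next-token shift)."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logps = F.log_softmax(logits[:, :-1].float(), dim=-1)
    return torch.gather(logps, 2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

def policy_and_ref_logprobs(model, input_ids, attention_mask):
    # pi_theta: the one forward pass that needs gradients.
    logp_theta = token_logprobs(model, input_ids, attention_mask)
    # pi_ref: the same weights with the PEFT adapter disabled. Under
    # no_grad the pass keeps no activations, so VRAM stays flat.
    with torch.no_grad(), model.disable_adapter():
        logp_ref = token_logprobs(model, input_ids, attention_mask)
    return logp_theta, logp_ref
```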
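
And a sketch of how the three sets of logprobs combine, using a PPO-style clipped surrogate with the k3 KL estimator (as in GRPO-style losses); `logp_old` is the rollout logprobs from item 1, and `advantages`, `mask`, `eps`, and `beta` are all placeholders:

```python
import torch

def grpo_style_loss(logp_theta, logp_old, logp_ref, advantages, mask,
                    eps=0.2, beta=0.04):
    # Importance ratio between the current policy and the rollout policy.
    ratio = torch.exp(logp_theta - logp_old)
    # PPO-style clipped surrogate, per token.
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1 - eps, 1 + eps) * advantages,
    )
    # k3 estimator of KL(pi_theta || pi_ref), per token.
    kl = torch.exp(logp_ref - logp_theta) - (logp_ref - logp_theta) - 1
    # Maximize the surrogate, penalize the KL; average over completion tokens.
    return -((surrogate - beta * kl) * mask).sum() / mask.sum()
```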

25/02/2026 11:03 PM
Cool, now I have a GitHub repo with my notebook.


⬅️ [19/02/2026 10:28 AM - SURE Applicants](<./19_02_2026 10_28 AM - SURE Applicants.md>) | ⬆️ [2026 - February](<./README.md>) | [25/02/2026 1:23 PM - One-on-one](<./25_02_2026 1_23 PM - One-on-one.md>) ➡️