16/03/2026 9:13 AM
⬅️ [12/03/2026 9:26 PM - Paper Review](<./12_03_2026 9_26 PM - Paper Review.md>) | ⬆️ [2026 - March](<./README.md>) | [17/03/2026 9:24 AM](<./17_03_2026 9_24 AM.md>) ➡️
TODO
- [x] Add tokens to tree generation
- [x] Change transformers model handler to allow for multiple LoRAs to be loaded
RL Implementation Plan
- Edit the sidecar to return the per-token logprobs
  - I think `return_tokens_as_token_ids` is what I want: it should give me the logprobs while still giving me access to the token IDs for further iterations.
- Work on the training script
  - Write the loss function for tree-normalized policy optimization
  - Note that I should be using the LoRA with SFT applied as my base model, not the true base model. So really I'm swapping back and forth between the SFT adapter and the current policy adapter.
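A sketch of the sidecar parsing step above. This assumes the sidecar is fronting vLLM's OpenAI-compatible completions format, where enabling `return_tokens_as_token_ids` makes the logprob keys come back as strings like `"token_id:12345"` instead of decoded text; the `parse_token_logprobs` helper and the exact response shape are assumptions, so adjust to whatever the sidecar actually emits.

```python
def parse_token_logprobs(logprob_entries):
    """Convert [{"token_id:123": -0.5, ...}, ...] into (token_id, logprob) pairs.

    Assumes the "token_id:<int>" key format vLLM uses when
    `return_tokens_as_token_ids` is enabled; keys in any other
    format are ignored.
    """
    pairs = []
    for entry in logprob_entries:
        for key, logprob in entry.items():
            if key.startswith("token_id:"):
                # Keep the raw ID so the tokens can be fed back in for the
                # next iteration without re-tokenizing decoded text.
                pairs.append((int(key.split(":", 1)[1]), logprob))
    return pairs
```

The point of keeping IDs rather than decoded strings is that re-tokenizing text can split differently from the original generation, which would desync the logprobs from the sequence on the next pass.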
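The loss function itself isn't spelled out in this note, so here is only a rough sketch of what a tree-normalized objective might look like, assuming a GRPO-style scheme where each completion's reward is normalized against its siblings in the generation tree and the loss is the negative advantage-weighted mean of sequence logprobs. The `branches` structure is hypothetical.

```python
import math

def tree_normalized_loss(branches):
    """Sketch of a sibling-normalized policy-gradient loss.

    branches: list of (sum_logprob, reward) pairs for sibling completions
    that share the same parent node in the generation tree (a hypothetical
    structure, not from the note).
    """
    rewards = [r for _, r in branches]
    mean = sum(rewards) / len(rewards)
    # Guard against zero std when all sibling rewards are identical.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    # Advantage = reward normalized over siblings; minimize the negative
    # advantage-weighted logprob, averaged over the branches.
    return -sum(lp * (r - mean) / std for lp, r in branches) / len(branches)
```

In a real training step this would run per tree node over token-level logprobs from the current adapter (with the SFT adapter as the reference/base), but those details depend on how the tree generation code structures its nodes.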