16/03/2026 9:13 AM
⬅️ [12/03/2026 9:26 PM - Paper Review](<./12_03_2026 9_26 PM - Paper Review.md>) | ⬆️ [2026 - March](<./README.md>) | [17/03/2026 9:24 AM](<./17_03_2026 9_24 AM.md>) ➡️
TODO
- [x] Add tokens to tree generation
- [x] Change transformers model handler to allow for multiple LoRAs to be loaded
RL Implementation Plan
- Edit the sidecar to return the per-token logprobs
  - I think `return_tokens_as_token_ids` is what I want: it should give me the logprobs while still giving me access to the token IDs for further iterations.
- Work on the training script
  - Write the loss function for tree-normalized policy optimization
  - Note that I should be using the LoRA with SFT applied as my base model, not the true base model. So really I'm swapping back and forth between the SFT adapter and the current policy adapter.
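A sketch of the sidecar parsing step above. This assumes the sidecar is fronting vLLM's OpenAI-compatible completions format, where enabling `return_tokens_as_token_ids` makes the logprob keys come back as strings like `"token_id:12345"` instead of decoded text; the `parse_token_logprobs` helper and the exact response shape are assumptions, so adjust to whatever the sidecar actually emits.

```python
def parse_token_logprobs(logprob_entries):
    """Convert [{"token_id:123": -0.5, ...}, ...] into (token_id, logprob) pairs.

    Assumes the "token_id:<int>" key format vLLM uses when
    `return_tokens_as_token_ids` is enabled; keys in any other
    format are ignored.
    """
    pairs = []
    for entry in logprob_entries:
        for key, logprob in entry.items():
            if key.startswith("token_id:"):
                # Keep the raw ID so the tokens can be fed back in for the
                # next iteration without re-tokenizing decoded text.
                pairs.append((int(key.split(":", 1)[1]), logprob))
    return pairs
```

The point of keeping IDs rather than decoded strings is that re-tokenizing text can split differently from the original generation, which would desync the logprobs from the sequence on the next pass.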
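The loss function itself isn't spelled out in this note, so here is only a rough sketch of what a tree-normalized objective might look like, assuming a GRPO-style scheme where each completion's reward is normalized against its siblings in the generation tree and the loss is the negative advantage-weighted mean of sequence logprobs. The `branches` structure is hypothetical.

```python
import math

def tree_normalized_loss(branches):
    """Sketch of a sibling-normalized policy-gradient loss.

    branches: list of (sum_logprob, reward) pairs for sibling completions
    that share the same parent node in the generation tree (a hypothetical
    structure, not from the note).
    """
    rewards = [r for _, r in branches]
    mean = sum(rewards) / len(rewards)
    # Guard against zero std when all sibling rewards are identical.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    # Advantage = reward normalized over siblings; minimize the negative
    # advantage-weighted logprob, averaged over the branches.
    return -sum(lp * (r - mean) / std for lp, r in branches) / len(branches)
```

In a real training step this would run per tree node over token-level logprobs from the current adapter (with the SFT adapter as the reference/base), but those details depend on how the tree generation code structures its nodes.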