16/02/2026 10:45 AM

⬅️ [13/02/2026 2:52 PM](<./13_02_2026 2_52 PM.md>) | ⬆️ [2026 - February](<./README.md>) | [17/02/2026 11:12](<./17_02_2026 11_12.md>) ➡️

16/02/2026 10:45 AM
Got in late.
Starting the day with some RGSC stuff, then checking up on how much homework I have, and then getting the reward calculation working.
Oh, also: I need to respond to my old advisor about revisions. I ignored his email last week because of my grandfather, but I need to tackle that now. I'll tell him I'll have it done by next Monday.

Tasks:
- [x] @TODO RGSC Stuff
- [x] Minutes to officers
- [x] Meeting reminder
- [x] Revision update to Laschowski
- [x] Homework check
- [x] Calculate reward and add to visual
- [x] Add question asking reward
- [x] Add questions repetition cost
- [x] Add reward to visual

16/02/2026 12:03 PM
Courses check:
- Vision:
  - Still need to do pre-lecture reading/lecture
  - Still need to do PACES
  - Next PS due the 24th EOD (next Tuesday). Mostly implementation, and the derivation part is something we have done in lecture. Expect around 4 hours.
- RL:
  - Need to review previous HW for quizzes
  - This week's homework is just algorithm implementation. Should take a couple of hours.

16/02/2026 1:51 PM
Wait, do I need to use average future reward to prevent reward buildup from things like the question-presence score? Or should I make those terms costs, so that future nodes can only lose you points for doing something bad, never gain you points for doing something right? Because the max strategy makes it impossible to build up reward from correct future inferences, but you can still get reward for having a question at each node.
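
A minimal sketch of the three aggregation rules I'm weighing, assuming each node hands up a flat list of its descendants' reward totals (names are hypothetical, not anything in the codebase):

```python
def aggregate_future(future_rewards: list[float], mode: str = "max", gamma: float = 0.9) -> float:
    """Fold the rewards of a node's descendants into the current node."""
    if not future_rewards:
        return 0.0
    if mode == "max":
        # Only the best future path counts, so per-node bonuses like
        # question presence can't stack across depth.
        return gamma * max(future_rewards)
    if mode == "mean":
        # Smooths over children, but positive per-node bonuses still
        # accumulate along a path, inviting reward buildup.
        return gamma * sum(future_rewards) / len(future_rewards)
    if mode == "cost_only":
        # Future nodes can only subtract: clip each descendant's
        # contribution at zero before summing the penalties.
        return gamma * sum(min(0.0, r) for r in future_rewards)
    raise ValueError(f"unknown mode: {mode!r}")
```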

Also, does it make sense to accumulate future reward for things like question presence or entailment? Those rewards really only make sense for the current node: a future question being entailed by the current one doesn't make the current question bad, it just means the model hasn't learned to ask different questions yet.
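
If I split things that way, the per-node computation might look like the sketch below; the arguments are stand-ins for whatever the real scorers produce:

```python
def node_reward(local_terms: list[float], child_totals: list[float], gamma: float = 0.9) -> float:
    """Combine per-node scores with propagated inference reward.

    local_terms: scores that only describe this node (question presence,
    minus entailment and repetition costs) and must not propagate.
    child_totals: each child's already-propagated inference reward.
    """
    # Local terms are paid exactly once, here.
    local = sum(local_terms)
    # Only inference quality compounds up the tree (max strategy shown).
    return local + gamma * max(child_totals, default=0.0)
```

With this split, a bonus like question presence is counted once per node and never flows upward, while inference quality still compounds.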

16/02/2026 2:58 PM
The thing that takes by far the most time is generating the rewards for the inferences. I am going to try scoring them without the prompt section that justifies the reward; the task is basically just matching against the existing answers, so it may be easy enough without a justification.
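
Roughly what I have in mind for the stripped-down judge: score only, no justification. The prompt wording and the `complete` callable are placeholders for whatever client I end up using:

```python
import re

SCORE_PROMPT = """Given the reference answers below, score the model's inference
from 0 (contradicts the references) to 5 (matches a reference exactly).
Respond with the score only, no explanation.

Reference answers:
{references}

Inference:
{inference}

Score:"""

def score_inference(inference: str, references: list[str], complete) -> int:
    """`complete` is any callable mapping a prompt string to a completion."""
    prompt = SCORE_PROMPT.format(references="\n".join(references), inference=inference)
    reply = complete(prompt)
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no score in judge reply: {reply!r}")
    return int(match.group())
```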

16/02/2026 10:23 PM
I am thinking about training strategies now. Perhaps it makes sense to experiment with breadth versus depth. Like, is it better to initially train on depth-2 trees with 4 or 5 children per node?
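
Quick node-count arithmetic for comparing budgets (nothing project-specific here):

```python
def tree_size(branching: int, depth: int) -> int:
    # Nodes in a complete b-ary tree, counting the root as depth 0.
    return sum(branching ** d for d in range(depth + 1))

print(tree_size(4, 2))  # depth-2, 4 children: 1 + 4 + 16 = 21 nodes
print(tree_size(5, 2))  # depth-2, 5 children: 1 + 5 + 25 = 31 nodes
print(tree_size(2, 4))  # depth-4, 2 children: 1 + 2 + 4 + 8 + 16 = 31 nodes
```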

Well, actually, this brings me to thinking about evaluation. I guess I evaluate by doing what I am doing now: take a sampling of trees, compute the rewards, and record the results. Then I can graph things like depth versus reward and depth versus inference score, as well as track the prevalence of failure cases like repeated questions or failing to ask a question.
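
The evaluation pass could then be a loop like this; `sample_trees`, `compute_rewards`, the failure predicates, and the node attributes are all assumed interfaces rather than existing code:

```python
from collections import defaultdict
from statistics import mean

def evaluate(sample_trees, compute_rewards, is_repeat, asked_question):
    """Aggregate per-depth rewards and failure prevalence over sampled trees."""
    rewards_by_depth = defaultdict(list)
    inference_by_depth = defaultdict(list)
    failures = {"repeated_question": 0, "no_question": 0}
    total = 0
    for tree in sample_trees():
        for node in compute_rewards(tree):  # assumed to yield scored nodes
            total += 1
            rewards_by_depth[node.depth].append(node.reward)
            inference_by_depth[node.depth].append(node.inference_score)
            failures["repeated_question"] += is_repeat(node)
            failures["no_question"] += not asked_question(node)
    # Per-depth curves for plotting, plus failure-case prevalence.
    depth_vs_reward = {d: mean(v) for d, v in sorted(rewards_by_depth.items())}
    depth_vs_inference = {d: mean(v) for d, v in sorted(inference_by_depth.items())}
    prevalence = {k: v / max(total, 1) for k, v in failures.items()}
    return depth_vs_reward, depth_vs_inference, prevalence
```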

- [ ] @TODO Also, should I change the system that selects which clusters to use so it picks those with the most samples in them? Or sample in proportion to the number of samples in each? Or is it good to force diversity by sampling uniformly over the clusters? (Options sketched below.)
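
A sketch of the three options, assuming `cluster_sizes` maps cluster id to sample count (the function itself is hypothetical):

```python
import random

def pick_clusters(cluster_sizes: dict[str, int], k: int, strategy: str = "uniform"):
    ids = list(cluster_sizes)
    if strategy == "largest":
        # Take the k biggest clusters: the most data, the least diversity.
        return sorted(ids, key=lambda i: cluster_sizes[i], reverse=True)[:k]
    if strategy == "proportional":
        # Draw in proportion to cluster size (with replacement).
        weights = [cluster_sizes[i] for i in ids]
        return random.choices(ids, weights=weights, k=k)
    if strategy == "uniform":
        # Ignore size entirely: forced diversity across clusters.
        return random.sample(ids, k)
    raise ValueError(f"unknown strategy: {strategy!r}")
```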

⬅️ [13/02/2026 2:52 PM](<./13_02_2026 2_52 PM.md>) | ⬆️ [2026 - February](<./README.md>) | [17/02/2026 11:12](<./17_02_2026 11_12.md>) ➡️