29/04/2026 8:57 PM - Presentation
Based on Evolution Strategies at the Hyperscale.
Potentially better for long-trajectory fine-tuning, as it has much-reduced memory usage for many-token inputs. They demonstrate training a 14B model (I think at full precision). Unfortunately it does not play well with the KV cache, which means you need to modify the transformer forward pass to make it work, but in general it has better memory complexity.
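A minimal sketch of the memory idea, assuming the paper's low-rank perturbation trick: each population member's noise on a weight matrix is an outer product A Bᵀ rather than a dense matrix, so per-member storage is r·(d_out + d_in) instead of d_out·d_in. All names and hyperparameters here are illustrative, not the paper's actual implementation:

```python
import numpy as np

def lowrank_es_step(W, fitness_fn, pop=32, rank=4, sigma=0.01, lr=0.05):
    """One ES step on a weight matrix W using low-rank perturbations.

    Instead of sampling a dense (d_out, d_in) noise matrix per population
    member, sample A (d_out, r) and B (d_in, r) and use E = A @ B.T.
    Per-member memory drops from d_out*d_in to r*(d_out + d_in).
    """
    d_out, d_in = W.shape
    A = np.random.randn(pop, d_out, rank) / np.sqrt(rank)
    B = np.random.randn(pop, d_in, rank)
    scores = np.empty(pop)
    for i in range(pop):
        E = A[i] @ B[i].T                      # materialised only transiently here;
        scores[i] = fitness_fn(W + sigma * E)  # in practice you'd fold A, B into the matmul
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalise fitness
    # Pseudo-gradient: fitness-weighted combination of the perturbations.
    grad = sum(scores[i] * (A[i] @ B[i].T) for i in range(pop)) / (pop * sigma)
    return W + lr * grad
```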
If we want an objective that trades off depth against precision, that is easier to define here, since ES only requires a fitness function and that fitness can be non-differentiable.
Obviously a more natural fit for non-differentiable objectives.
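As a concrete (made-up) example of the depth/precision trade-off above: nothing in the fitness below needs a gradient, so it can be handed straight to an ES loop like the one sketched earlier. `extract_facts` is a hypothetical rollout helper, not a real API:

```python
def fitness(model_weights, doc, reference_facts, lam=0.5):
    """Non-differentiable fitness: reward precise extraction, penalise depth.

    `extract_facts` (hypothetical) runs the model over the document and
    returns a set of extracted facts plus how many reasoning steps (depth)
    the rollout used. None of this needs to be differentiable.
    """
    facts, depth = extract_facts(model_weights, doc)
    precision = len(facts & reference_facts) / max(len(facts), 1)
    return precision - lam * depth  # any scalar works as an ES fitness
```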
Want to match the distribution of information extraction? There are a few options:
1. Fine-tune on human data
    1. This has the flaw that you pick up whatever biases are in the dataset. Perhaps you want to train on this dataset and transfer to another, in which case those dataset-specific biases come along.
2. Define a non-differentiable objective of directly matching the information extracted, without caring about the content of the text.
    1. This then has to be trained using either RL, which converts the non-differentiable objective into scores over sampled pairs or groups and builds a differentiable surrogate from those scores (see the group-advantage sketch after this list), or
    2. ES, which perturbs the actual model weights to form a pseudo-gradient. This is conceptually neater and has some advantages, but introduces practical scaling problems with transformers.
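For option 1 above, a minimal sketch of GRPO-style group scoring: each response in a group sampled from the same prompt gets a scalar reward, and advantages are standardised within the group. This also makes the limitation concrete: every score is attached to a single sample.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages.

    Takes per-response scalar rewards for a group of responses to the same
    prompt and standardises them within the group. Note the constraint this
    illustrates: the objective must be scoreable one response at a time.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```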
But in fact, ES has an advantage over GRPO in this domain. If you have a set of different responses to a question and you want to match the distribution of information extracted instead of the exact values, GRPO cannot do that: it acts exclusively at the level of individual samples. ES handles this naturally because the fitness can be truly anything. If you want it to be the KL divergence between the distribution of information extracted by humans and by the LLM, that works fine; it just means the fitness function has to roll out multiple trajectories and compute the KL divergence between the extracted distributions (sketch below).
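A hedged sketch of such a fitness function, using a hypothetical `extract_items` rollout helper (analogous to `extract_facts` above); the human counts would come from your annotated dataset:

```python
import numpy as np
from collections import Counter

def kl_fitness(model_weights, docs, human_counts, rollouts_per_doc=8):
    """Distribution-matching fitness for ES (negated KL, so higher is better).

    Rolls out the perturbed model several times per document, pools the
    extracted items into an empirical distribution, and compares it to the
    human extraction distribution. No per-sample score exists here, which
    is exactly what GRPO cannot express.
    """
    model_counts = Counter()
    for doc in docs:
        for _ in range(rollouts_per_doc):
            model_counts.update(extract_items(model_weights, doc))  # hypothetical helper
    support = set(human_counts) | set(model_counts)
    p = np.array([human_counts.get(k, 0) + 1e-6 for k in support])  # smoothed human dist
    q = np.array([model_counts.get(k, 0) + 1e-6 for k in support])  # smoothed model dist
    p, q = p / p.sum(), q / q.sum()
    return -np.sum(p * np.log(p / q))  # -KL(human || model)
```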
In conclusion: a space to keep an eye on for RL in general, but not mature enough to run experiments with yet.