Evolution Strategies at the Hyperscale
https://arxiv.org/abs/2511.16652
Can I tie this to my research? Maybe I start by showing my lackluster initial results.
Better for fitting models to non-differentiable objectives (like matching the distribution of information to humans: information similarity cannot be measured until after generation finishes, so we otherwise need methods like DPO or GRPO).
Why use Evolutionary Algorithms?
More natural for handling things like outcome-only rewards.
More robust to noisy and ill-conditioned optimisation landscapes.
Lots of others.
No need to communicate gradients across devices; a scalar fitness is enough.
Can fine-tune in the datatype used for inference, whereas with backprop you need to convert between datatypes to get both differentiability and precision.
So why aren't we using them?
Past works have used full-rank matrices as perturbations, which are very memory-intensive.
If you need to load a whole new matrix into memory every time you want to compute activations, it will be incredibly slow; loading just a small vector is incredibly fast.
I wonder if you could use this same strategy in something like Craftax to train the RNN using only perturbations instead of backprop.
What do they do?
Make perturbations low rank.
If our perturbation is
$$
E \in \mathbb{R}^{m\times n}
$$
then take two low-rank matrices
$$
A \in \mathbb{R}^{m\times r} \text{ and } B \in \mathbb{R}^{n\times r} \text{ where } E = \frac{1}{\sqrt{r}} AB^T
$$
much like a LoRA. Now instead of each perturbation being of size $mn$, each is only $(m+n)r$, which for large language models is orders of magnitude smaller (e.g. for $m = n = 4096$ and $r = 16$: $mn \approx 1.7 \times 10^7$ versus $(m+n)r \approx 1.3 \times 10^5$).
Also, they generate perturbations on demand from a seed instead of storing each perturbation, so they don't even need to store the low-rank matrices.
When they evaluate a batch, they share the base activations and design a forward pass that applies all the perturbations in parallel. Then, when the update is actually applied, they weight all the perturbations together, which means the true update is not low rank.
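A minimal sketch of the seed trick, assuming standard normal factor entries (the paper's exact entry distribution may differ): regenerating $E$ from an integer seed means workers only ever exchange seeds and scalar fitnesses.

```python
# Minimal sketch (not the paper's code): regenerate a low-rank perturbation
# E = (1/sqrt(r)) A B^T deterministically from an integer seed, so nothing
# but the seed ever needs to be stored or communicated.
import numpy as np

def low_rank_perturbation(seed: int, m: int, n: int, r: int) -> np.ndarray:
    rng = np.random.default_rng(seed)      # same seed -> same A and B
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return (A @ B.T) / np.sqrt(r)          # materialized here only for illustration

# Two workers with the same seed reconstruct the identical perturbation.
assert np.allclose(low_rank_perturbation(7, 8, 4, 2),
                   low_rank_perturbation(7, 8, 4, 2))
```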
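Here is how I imagine the batched perturbed forward pass, as a hedged sketch (shapes and names are my assumptions, not the paper's API): the dense matmul against the base weight $W$ is computed once per input, and each population member only adds its cheap rank-$r$ correction.

```python
# Sketch of a batched perturbed linear layer: one dense matmul against the
# shared base weight, plus a cheap per-member low-rank correction.
import numpy as np

def perturbed_forward(x, W, A, B, sigma=1.0):
    """x: (N, n) one input row per population member; W: (m, n) base weight;
    A: (N, m, r) and B: (N, n, r) per-member low-rank factors."""
    r = A.shape[-1]
    base = x @ W.T                              # shared dense part, (N, m)
    xB = np.einsum("ni,nir->nr", x, B)          # project input to rank r, (N, r)
    delta = np.einsum("nr,nmr->nm", xB, A)      # expand back to outputs, (N, m)
    return base + sigma * delta / np.sqrt(r)    # W x + sigma * E x per member
```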
They also dedicate a lot of work to deriving the optimal noise-scaling parameter and to showing that the low-rank perturbations converge quickly to a Gaussian as the rank grows.
Evolution strategies
Define a distribution over the fitness parameter space that is parameterized by some variables (separate from the model parameters).
$$
\pi(x | \theta)
$$
Note that in ES, the parameters that directly affect the fitness are called the fitness parameters. When the fitness measures the extent to which a model solves a problem, the fitness parameters are just the model parameters.
The goal can then be reframed as maximizing the expected fitness over the distribution of fitness parameters (models).
$$
J(\theta) = \mathbb{E}_{x \sim \pi(x | \theta)}[f(x)]
$$
As in RL, we can then differentiate this to get an expression involving the score function.
Is this literally the policy gradient theorem? Or just something close to it?
$$
\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \pi(x | \theta)}[\nabla_\theta \log \pi(x | \theta) f(x)]
$$
So the score function comes back in.
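For my own reference, the step from the expectation to the score function is just the log-derivative trick $\nabla_\theta \pi = \pi \nabla_\theta \log \pi$, the same identity that underlies REINFORCE-style policy gradients:
$$
\nabla_\theta J(\theta) = \nabla_\theta \int \pi(x|\theta) f(x)\,dx = \int \pi(x|\theta)\,\nabla_\theta \log \pi(x|\theta)\, f(x)\,dx = \mathbb{E}_{x \sim \pi(x|\theta)}[\nabla_\theta \log \pi(x|\theta) f(x)]
$$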
We can then draw samples from this space that are slight perturbations of the current model and form the gradient through Monte Carlo sampling.
$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \nabla_\theta \log \pi(x_i | \theta) f(x_i)
$$
Note that we do not differentiate through the fitness function, only through the fitness parameter distribution. This makes the method a zeroth-order optimization method: we take the zeroth derivative of the fitness function, i.e. we just evaluate the function itself.
They use a Gaussian population distribution.
$$
\pi(x | \theta) = \mathcal{N}(\mu, I_d\sigma^2)
$$
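A minimal sketch of this estimator for the Gaussian case (my toy code, not the paper's): for $\mathcal{N}(\mu, \sigma^2 I)$ the score is $(x - \mu)/\sigma^2$, so the Monte Carlo gradient reduces to an average of fitness-weighted noise.

```python
# Vanilla ES gradient estimate for a Gaussian population N(mu, sigma^2 I).
# Only forward evaluations of the (black-box) fitness f are needed.
import numpy as np

rng = np.random.default_rng(0)

def es_gradient(f, mu, sigma=0.1, pop_size=64):
    eps = rng.standard_normal((pop_size, mu.size))    # epsilon_i ~ N(0, I)
    fit = np.array([f(mu + sigma * e) for e in eps])  # zeroth-order: evaluate only
    fit = (fit - fit.mean()) / (fit.std() + 1e-8)     # fitness shaping (my addition)
    return (eps * fit[:, None]).mean(axis=0) / sigma  # grad of J wrt mu

# Gradient *ascent* on a toy fitness: walks mu toward the optimum at 3.
mu = np.zeros(5)
for _ in range(300):
    mu += 0.05 * es_gradient(lambda x: -np.sum((x - 3.0) ** 2), mu)
```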
They cite many other works that use ES to rival GRPO in fine-tuning, but say that theirs is the only one that works well for pre-training. I think it basically comes down to larger population size.
Implementation
Apparently the low-rank approximation is an issue because the score function no longer has an analytical solution, so they approximate it under some assumptions on the entry distribution: its mean, symmetry, continuity, unit variance, and finite fourth-order moments.
They do a lot of involved math to prove everything, but in the end they just approximate the score function with the score function of a Gaussian, as it performs best. They tried several mean-field approximations, but they all performed worse.
So they update the model parameters in a really simple way. I think the idea is that the $\theta$ of the parameterized distribution is the mean of the normal (since the variance is held constant), so they are effectively forming a Monte Carlo estimate of the gradient for the weight matrix.
So can the variance be changed at all? They state above that you need unit variance, so I assume it can only ever have $\sigma = 1$. A learned variance, which they do not use, would presumably just make everything collapse. I assume they tried that.
Then they just take a gradient ascent step.
OK so no, the variance definitely does vary with time: they use a decay rate.
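Putting the update loop together as I understand it (the schedule numbers below are placeholders, not the paper's derived values):

```python
# Outer loop sketch: theta is the Gaussian mean, updated by plain gradient
# ascent on the ES estimate, while sigma is decayed over time.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, lr, decay = np.zeros(5), 0.1, 0.05, 0.999   # placeholder hyperparameters
fitness = lambda x: -np.sum((x - 3.0) ** 2)            # stand-in black-box fitness

for step in range(500):
    eps = rng.standard_normal((64, mu.size))
    fit = np.array([fitness(mu + sigma * e) for e in eps])
    fit = (fit - fit.mean()) / (fit.std() + 1e-8)
    mu += lr * (eps * fit[:, None]).mean(axis=0) / sigma   # ascent step on the mean
    sigma *= decay                                         # exploration noise decays
```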
Results
I mean, impressive.
They do say some things I don't understand. Like:
We use this recipe to train a 14 billion parameter RWKV 7 model on the DeepScaleR dataset and evaluate in more challenging maths reasoning tasks. In this regime, GRPO is infeasible due to the extra memory used by the Adam optimiser Kingma & Ba (2014).
Like, is this true? You can easily train 14B models with backprop. GRPO doesn't change that memory consumption.
An interesting thing they point out: since they can optimize for any fitness function, they can directly optimize metrics like pass@k, which they say is a known limitation of GRPO, although I haven't specifically heard about that.
They also point out that they can directly fine-tune integer-quantized LLMs, which is actually nice.
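For instance, a pass@k fitness only needs sampled completions and a verifier, never a per-token gradient. A sketch using the standard unbiased pass@k estimator (my illustration, not the paper's code):

```python
# Unbiased pass@k estimate from n sampled completions, c of which pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer than k failures: every k-subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples, 2 correct, pass@4 as the scalar fitness for one prompt:
print(pass_at_k(n=16, c=2, k=4))   # ~0.45; average over prompts for the fitness
```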
My Opinion
This is a mathematically much more elegant way to form gradients for fitness functions like those often used in RL. The details with very specific variance schedules over time to match the first-order gradient are unfortunate in my eyes, if I am understanding them correctly.
The fact that you just don't have to worry about differentiability of the objective is very nice. Perhaps in the future we will see ES used for more non-differentiable objectives instead of DPO or GRPO.
Definitely shows promise for language models, but it is a real shame that the efficiency gains don't apply as much to transformers due to the KV cache. If only there were a way to have a low-rank approximation of the KV cache as well.
The reduced memory usage from not storing activations allows larger batches, which in turn allows faster convergence; that's neat. Forward-pass-only methods are definitely nice.
What?
So, what's with the weighted-average thing? It seems like they just make a Monte Carlo estimate, so where's the weight? Looking at (6), I guess the weight is the fitness: they weight each perturbation $E_i$ by its fitness, which aligns with their first figure.