"strategy conditioned policy" which is kinda similar to a goal conditioned polic
"strategy conditioned policy" which is kinda similar to a goal conditioned policy except that the input isn't a goal, but a vector that somehow defines a possible strategy like by enforcing that the discounted future state distribution has high variance when the strategy vector is highly different and low variance when the strategy vector is similar. I was thinking that if you can train a strategy like that with something like a VAE smoothing term then you can then have a sort of hierarchical RL system that learns to move through the strategy space to find strategies that are the real maximizers of reward. I was thinking about this because I wanted a way to simulate evolutionary processes where we have high diversity, but still allow the combination of individuals. having multiple different policies does not achieve this because you cannot combine the policies and is also inefficient because you need to have multiple networks.
Also, perhaps I can take some insight from the multi-modal VAE paper, where we allow strategy islands that do not have a continuous path between them as long as there is a continuous path within each island.
Perhaps we could even take inspiration from competition in the natural world as an objective. Niche crowding could itself be an objective that pushes the agent to learn. If too many strategy conditioned policies have ended up in the same "place", as defined by some criterion, then any strategy that finds a new uninhabited section of the state space will have a massive advantage even if it does not provide a lot of reward. I know methods already do this, but perhaps if we find a good way to have distance in the strategy space reflect some sort of distance between outcomes, then that gives the hierarchical RL system a way to start incentivizing the exploration of new strategies.
Maybe we focus on environments with many distinct rewards and find strategies that achieve different distributions of rewards. This actually goes well with the evolutionary perspective. Niches are more about filling up the energy availability of a particular strategy, and that is about achieving enough goals within that strategy. Perhaps instead of tuning reward weights we tune an "energy availability", which describes how many individual agents can take a reward before it is depleted. This is more about tuning how much we think those rewards need to be explored than about saying how good it is to achieve them. We still have a population-dependent loss that stops too many agents from collecting the same distribution of rewards. I guess you can think of this as a goal-conditioned RL problem, except that the goal is to reach a specific distribution of rewards, and we learn a latent space where being close means being close in the distribution of rewards you receive. So if we know that a specific distribution of rewards is best, we can find a place in the latent space where we expect to get that distribution of rewards.
I am imagining applying this to something like craftax, where rewards act as a way to provide guidance but we don't want to get too stuck maximizing any single reward. In highly complex environments like craftax, things like curiosity fail because everything is a noisy TV, and we really do need some way to reduce the space of "interesting". However, I think this would generally work in environments with few rewards as well. Think about a walker environment where you both need to stay upright and move to a goal. We don't know the balance of these objectives, but we know that there is a distribution of strategies and that the optimal policy lies somewhere in that distribution. So the best thing to do is to learn a set of policies that parameterizes that distribution and then search that distribution for the strategy that actually maximizes reward.
These papers are good for seeing how other people have tried to create loss functions that force diversity in strategy.
- Continuous Skill Discovery / DIAYN (for the state-distribution variance based on latent vectors).
  - diversity_is_all_you_need.pdf
  - dynamics_aware_skill_discovery.pdf
- Trajectory VAEs in RL / SPiRL (for the VAE smoothing term applied to behaviors).
  - spirl.pdf
  - discovery_of_continuous_skills.pdf (this wasn't what the AI was talking about, but I think it's interesting)
- Quality-Diversity RL (QD-RL) and Latent Space Evolution (for the evolutionary combination aspects).
  - skillchain-tech.pdf (also interesting, but not what I was talking about)
- https://szhaovas.github.io/2022-09-15-me/ (MAP-Elites is a similar idea, but not differentiable)
https://gemini.google.com/app/dfcbbfe09dee9bcc
Diverse strategy conditioned policy optimization
A major problem in reinforcement learning is the collapse of policies down to exploiting a single strategy. Hierarchical RL addresses this by learning policies that achieve specific objectives and can be strung together by a coordinator. Classically, this has been achieved either by specifying a single goal for each agent or by learning representations of single goals (or skills) during training.
Specifying a single goal makes it difficult to naturally find goals that depend on a hierarchy of earlier goals; the coordinator must be involved to tell the policy to achieve the earlier goals. Learning unsupervised representations of single goals makes it difficult to discover complex goals, because it does not take into account the information induced by providing sub-goals.
We want to bridge the gap between the unsupervised learning of goals and the specification of single goals. We propose that we inspect not any individual goal, but the distribution of types of rewards that the agent is receiving.
We present "strategy conditioned policies", $\pi(a | x, z)$ where $z$ is a latent vector that represents a strategy (desired distribution of reward types). $z$ has the property that if $z_i$ and $z_j$ are close in the latent space, then the distribution of rewards that the two conditioned agents recieves will be similar by some distribution similarity metric.
To do this we probably need an EM-like strategy. First, roll out with randomly initialized latent strategies. Then look at the reward distributions we achieve with each. Then learn an encoder that maps the reward distributions we achieved back to the random latent strategies we initialized with.
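A rough sketch of that EM-like loop, assuming "reward distribution" just means a normalized histogram over reward types, and where `collect_rollout` is a hypothetical helper that returns per-reward-type counts for one rollout:

```python
import torch


def reward_histogram(reward_type_counts):
    """Turn per-reward-type counts from one rollout into a normalized 'reward distribution'."""
    h = torch.tensor(reward_type_counts, dtype=torch.float32)
    return h / h.sum().clamp(min=1.0)


def e_step(policy, collect_rollout, strategy_dim, n_rollouts):
    """Roll out under randomly drawn strategy vectors and record the achieved distributions."""
    zs, hists = [], []
    for _ in range(n_rollouts):
        z = torch.randn(strategy_dim)
        zs.append(z)
        hists.append(reward_histogram(collect_rollout(policy, z)))
    return torch.stack(zs), torch.stack(hists)


def m_step(encoder, optimizer, zs, hists, epochs=10):
    """Fit an encoder mapping achieved reward distributions back to the latents they came from."""
    for _ in range(epochs):
        loss = ((encoder(hists) - zs) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```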
For the VAE assumption that later enables exploration to hold, we need some way to enforce smoothness in the strategy space. That is, $z$ being close means the reward distributions are close. This is very non-trivial. The simplest strategy is to enforce smoothness of the policy itself: the KL divergence between $p(a | s, z)$ and $p(a | s, z')$ is small. This may work, but due to the exponential butterfly effect it will likely still lead to wildly different reward distributions.
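A sketch of that simplest version, reusing the strategy-conditioned policy interface from above; the perturbation scale `sigma` is a made-up hyperparameter:

```python
import torch
from torch.distributions import kl_divergence


def policy_smoothness_loss(policy, states, z, sigma=0.1):
    """Penalize KL(pi(.|s, z) || pi(.|s, z + eps)) for a small perturbation of the strategy."""
    eps = sigma * torch.randn_like(z)
    z_batch = z.expand(states.shape[0], -1)
    zp_batch = (z + eps).expand(states.shape[0], -1)
    return kl_divergence(policy(states, z_batch), policy(states, zp_batch)).mean()
```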
Successor feature smoothing. This is a direct optimization of the quantity we want, but I'm not exactly sure how it would work. The successor vector $\psi_{\pi}(s, z)$ predicts the distribution of future rewards for a given policy, but is there a way to translate that back to the policy itself? Perhaps giving strategic rewards during training to push the policy network into states that are likely to produce the future reward vector we want? But that would make the successor vector prediction invalid, defeating the whole purpose. Maybe there's a smarter way? I think there is a known way to get policies to align with a successor vector, but I'm not sure.
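For reference, the successor-feature machinery I'm half-remembering is the standard Barreto et al. setup: if the reward decomposes as a weight vector dotted with a fixed feature map, then the successor features give the value of any weight vector, and generalized policy improvement (GPI) is the known way to act greedily with respect to a chosen weight vector:

$$\psi^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t} \phi(s_t, a_t) \,\Big|\, s_0 = s,\, a_0 = a\Big], \qquad r_w(s, a) = w^{\top} \phi(s, a) \;\Rightarrow\; Q^{\pi}_{w}(s, a) = w^{\top} \psi^{\pi}(s, a)$$

$$\text{GPI:}\quad \pi'(s) \in \operatorname*{arg\,max}_{a} \; w^{\top} \psi^{\pi}(s, a)$$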
Perhaps just sample many, many rollouts with slightly perturbed strategy vectors and punish those that diverge greatly, with a loss that remains small for small distribution differences but blows up quickly outside a given range?
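One possible shape for that penalty, assuming the reward distributions are histograms and using total variation with a squared hinge so small differences are free (the margin is an arbitrary knob):

```python
import torch


def divergence_penalty(hist_z, hist_z_perturbed, margin=0.1):
    """Small for nearby reward distributions; grows quadratically once the total-variation
    distance between the two rollouts' reward histograms exceeds the margin."""
    tv = 0.5 * (hist_z - hist_z_perturbed).abs().sum(dim=-1)
    return torch.relu(tv - margin).pow(2).mean()
```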
We take inspiration from biological systems to define the concept of niches and niche capacity. Strategies that are close in the latent space are said to belong to the same niche.
How do we actually get exploration of new niches? Can we apply a loss that acts on the policy that is dependent on the niche fullness? Can we apply reward for taking actions that are unlikely for policies in a full niche? Will niches naturally separate in the latent space into islands if we do not apply a prior?
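One concrete answer to the first question, as a sketch: measure niche fullness by counting how many population strategy vectors sit within some radius in the latent space, and pay an exploration bonus that shrinks as the niche fills up (the radius and scale are invented knobs):

```python
import torch


def niche_fullness(z, population, radius=0.5):
    """How many population strategy vectors sit within `radius` of z in the latent space."""
    dists = torch.cdist(z.unsqueeze(0), population).squeeze(0)
    return int((dists < radius).sum())


def crowding_bonus(z, population, scale=1.0, radius=0.5):
    # A strategy that lands in an uninhabited region pays out even without much task reward.
    return scale / (1.0 + niche_fullness(z, population, radius))
```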
Perhaps we really do model evolution. We maintain a population of latent vectors; those that lie too close together in the latent space die, and those that lie in empty regions reproduce (sample nearby strategy vectors). Maybe also maintain a rolling buffer of previous generations so that we don't start cycling back and forth between strategies. The trivial strategy of receiving no reward quickly fills up, and we heavily sample any strategy vectors that get any other distribution. The sampling-based reproduction also gives us a way to have multiple similar-strategy policies to enforce the VAE smoothness: we sample the new strategy vectors as normal perturbations of the original, and part of the loss enforces that they are still similar to their ancestor's predictions. But do we actually maintain the ancestor network? Like, every EM cycle we keep both the ancestor and the current? Or just use an EMA model like normal? Or maintain just the successor vector network, if that is getting used somehow?
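A sketch of that cull/reproduce loop over the latent population; `min_dist`, `sigma`, and `target_size` are made-up knobs, and the crowding test here uses latent distance directly where the real criterion should probably be distance between achieved reward distributions:

```python
import torch


def evolve_population(population, min_dist=0.3, sigma=0.1, target_size=64):
    """Cull strategy vectors that crowd each other, then refill by perturbing survivors."""
    keep = []
    for z in population:  # population: (N, strategy_dim) tensor
        if all((z - k).norm() >= min_dist for k in keep):
            keep.append(z)
    survivors = torch.stack(keep)
    children = []
    while len(keep) + len(children) < target_size:
        parent = survivors[torch.randint(len(survivors), (1,)).item()]
        children.append(parent + sigma * torch.randn_like(parent))  # offspring = nearby strategy
    return torch.cat([survivors, torch.stack(children)]) if children else survivors
```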
We can also take inspiration from RLE. There we have state-dependent rewards that are nonetheless completely random. They also include the task reward in the total reward, but maybe we make the state-dependent rewards the only reward the agent receives, and we optimize the state-dependent reward network while evolving the population of inputs to it. The state-dependent reward network must be optimized so that population samples that are close together result in similar reward distributions, while we also resample the population itself to put higher density in regions of the latent space that are not yet crowded.
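The RLE-flavoured piece could look something like this sketch, where a trainable reward network of the state and the sampled latent stands in for the task reward entirely (architecture and sizes are arbitrary):

```python
import torch
import torch.nn as nn


class StateDependentReward(nn.Module):
    """r(s, z): the intrinsic, strategy-dependent reward that replaces the task reward."""

    def __init__(self, state_dim, strategy_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + strategy_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, z):
        # Training should eventually make nearby z induce similar reward distributions.
        return self.net(torch.cat([state, z], dim=-1)).squeeze(-1)
```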
Needs
- Strategy vector as input to $p(a | s, z)$ such that if $z_i$ and $z_j$ are close, then the distributions of achieved rewards are close.
- If $z$ is a valid strategy, then $z' = z + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma)$ is a valid strategy.
More fleshed out ideas
The strategy vector network learns to generate state-dependent reward structures that cause the agent to learn a policy with a given reward distribution.
Potential Algo 1 (Hierarchical RLE)
Sample a population $Z = \{z_1, z_2, \ldots, z_n\}$.
Initialize the state dependent reward network.
Train the randomly initialized policy $\pi(a | s, z_i)$ for each sampled $z_i$ to collect trajectories under its random reward structure.
This gives us a dataset that maps state-dependent reward structures to the distribution of final rewards. Then basically optimize backwards, shifting the $z$s using a contrastive loss or something so that distance in the latent space more exactly reflects the distance between achieved reward distributions (a sketch of one stand-in for that loss is below).
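A stand-in for that backwards optimization step, sketched as an MDS-style stress loss rather than a contrastive one: shift the $z$s (with gradients enabled on them) so pairwise latent distances track pairwise distances between the achieved reward histograms.

```python
import torch


def latent_geometry_loss(zs, hists):
    """Shift the latent strategy vectors so pairwise latent distances track pairwise
    total-variation distances between the achieved reward histograms.

    `zs` is an (n, d) tensor with requires_grad=True; `hists` are fixed targets.
    """
    z_d = torch.cdist(zs, zs)                   # pairwise latent distances
    h_d = 0.5 * torch.cdist(hists, hists, p=1)  # pairwise total-variation distances
    return ((z_d - h_d) ** 2).mean()
```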
Ok, first problem. We now want to optimize the state-dependent reward network as well as the strategy vectors, but shifting either one changes what the already-trained policies were conditioned on.
Perhaps the input to the network is a function of z that we can jointly optimize to remain the same? Or optimize the policy model to take the same actions it did before under the new zs?