
⬆️ [Ideas](<./README.md>) | [Singular value regularization](<./Singular value regularization.md>) ➡️

Textual diffusion through repeated summarization

This idea comes from combining autoencoders with textual diffusion. The idea is that we can train textual diffusion models to take more logical paths to the full text by repeatedly summarizing long texts into shorter and shorter versions and then training the diffusion model to recover the original text. How we bridge the gap between the different summarization levels is a problem.

The nice part of this is that you can use a local-global strategy where you build sections of text in an expanding tree: at the base of the tree you have an initial seed, and as you move into the higher sections you diffuse one block at a time while conditioning on all the lower-down blocks, which still contain valid information at another level of abstraction.
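
As a concrete sketch of the data-construction side, the chain of progressively shorter summaries and the ancestor-conditioned training targets might look like the following (everything here is illustrative: `summarize_once`, the target lengths, and the pair format are my assumptions, not part of the idea itself):

```python
def summarize_once(text: str, max_words: int) -> str:
    # Stand-in for a real summarization model; word-level truncation is only a placeholder.
    return " ".join(text.split()[:max_words])

def build_summary_chain(text: str, target_words=(256, 64, 16, 4)) -> list[str]:
    """Repeatedly summarize `text` into shorter and shorter versions.
    Returns [full text, ..., seed], ordered from most detailed to most abstract."""
    chain = [text]
    for n in target_words:
        chain.append(summarize_once(chain[-1], max_words=n))
    return chain

def training_pairs(chain: list[str]):
    """Each level becomes a denoising target conditioned on all of its more
    abstract ancestors (the 'lower-down blocks' in the tree)."""
    for i in range(len(chain) - 1):
        yield chain[i + 1:], chain[i]  # (ancestors, target to reconstruct)
```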

Perhaps it would simply be better to work in a continuous embedding space where it is easier to define the flow between levels of summarization.

Or really just split it out into multiple textual diffusions in a tree, with each one conditioned on all of its ancestors. But then why not just use an autoregressive model for it? The whole point is that the summary levels should remain editable, with edits propagating through to the more detailed levels.

Perhaps more generally, I am thinking it would be interesting to know whether there are ways to jointly learn a noising and denoising process such that the noising process is constrained to end up at pure random noise at the far end, but at each step it is as easy as possible for the denoiser to reconstruct the original.

Perhaps we can train somewhat like a GAN, except that the two networks work together. We have the normal denoiser $p_\theta (x_{t-1} | x_t)$ that learns as usual. But we also have a learned noiser $p_\phi (x_t | x_{t-1})$ that is trained to minimize the reconstruction loss at each timestep. Naturally the optimal solution is for $p_\phi$ to be the identity, so there needs to be a mechanism that ensures information is actually removed. That could be a term in the loss for $p_\phi$, such as a KL divergence between the proposed noised sequence and pure random noise, whose weight increases as you go deeper into the process. Maybe there's a smarter way to do that.
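
A minimal sketch of that cooperative training loop, assuming we work on continuous embeddings with Gaussian noiser and denoiser steps (the network sizes, the $t/T$ KL schedule, and the per-step detach are all my own illustrative choices):

```python
import torch
import torch.nn as nn

DIM, T = 64, 10  # embedding size and number of diffusion steps (arbitrary)

class Step(nn.Module):
    """Predicts a Gaussian over the next state given the current state and timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + 1, 256), nn.ReLU(), nn.Linear(256, 2 * DIM))

    def forward(self, x, t):
        tt = torch.full((x.shape[0], 1), float(t) / T)
        mu, log_var = self.net(torch.cat([x, tt], dim=-1)).chunk(2, dim=-1)
        return mu, log_var

noiser, denoiser = Step(), Step()  # p_phi(x_t | x_{t-1}) and p_theta(x_{t-1} | x_t)
opt = torch.optim.Adam(list(noiser.parameters()) + list(denoiser.parameters()), lr=1e-4)

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions, averaged over the batch.
    return 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()

def training_step(x0):
    x_prev, loss = x0, 0.0
    for t in range(1, T + 1):
        # Noiser proposes x_t; the KL term pushes it towards pure noise, more strongly as t grows.
        mu_n, lv_n = noiser(x_prev, t)
        x_t = mu_n + torch.randn_like(mu_n) * (0.5 * lv_n).exp()
        # Denoiser tries to reconstruct x_{t-1} from x_t (simple MSE on its predicted mean).
        mu_d, _ = denoiser(x_t, t)
        recon = (mu_d - x_prev).pow(2).mean()
        kl_weight = t / T  # schedule: near-identity allowed early, noise enforced late
        loss = loss + recon + kl_weight * kl_to_standard_normal(mu_n, lv_n)
        x_prev = x_t.detach()  # stop gradients flowing across timesteps (a simplification)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage: training_step(torch.randn(32, DIM))  # stand-in batch of text embeddings
```

Because the reconstruction loss is shared, the noiser only gets credit for corruptions the denoiser can undo, which is what should push it towards removing the least informative content first.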

After training, we throw away the noiser and keep only the denoiser, which has learned to generate text along a natural course in which every step is as easy as it can be. We might expect the noiser to learn to remove filler words first, then condense grammatical structures, and only at the end remove key information.

Not sure how novel this is. AI suggests connections to Schrödinger Bridges or Flow Matching networks. I see the point. I think the idea might really be exactly Schrödinger Bridges applied to a multi-step process. If it is already standard to do this, then the novelty would lie only in the application to a domain. Less interesting.

They might have done this: https://arxiv.org/abs/2404.12940

For actual summarization, we could add an objective through a text encoder that maximizes the similarity between the noised text and the denoised text, if we do think a summarization objective would be useful.
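
If we did want that, it could be as simple as a cosine-similarity term computed with a frozen encoder (a sketch only; `encoder` stands in for any pretrained text or embedding encoder, which is my assumption rather than anything specified above):

```python
import torch.nn.functional as F

def summary_consistency_loss(encoder, noised, denoised):
    """Penalize semantic drift: push the noised (summarized) text and the denoised
    reconstruction towards similar encoder representations."""
    a, b = encoder(noised), encoder(denoised)
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()
```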

https://arxiv.org/pdf/2407.10998
I think this paper might have done exactly this. Wait, no: I think they do it based on attention scores.


⬆️ [Ideas](<./README.md>) | [Singular value regularization](<./Singular value regularization.md>) ➡️