30/04/2026 9:35 AM - SNU Presentation
https://openaccess.thecvf.com/content/CVPR2025/papers/Seo_Efficient_Personalization_of_Quantized_Diffusion_Model_without_Backpropagation_CVPR_2025_paper.pdf
They use zeroth-order optimization!
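The linked CVPR paper personalizes a quantized diffusion model without backpropagation; the note says they use zeroth-order optimization. A minimal sketch of a two-point zeroth-order gradient estimator on a toy quadratic objective (my illustration, not the paper's diffusion loss):

```python
import numpy as np

def zo_grad(f, x, mu=1e-3, n_dirs=8, rng=None):
    """Two-point zeroth-order gradient estimate: average
    (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over random Gaussian
    directions u. Only function evaluations, no backpropagation."""
    rng = np.random.default_rng(rng)
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_dirs

# Toy demo: minimize ||x - target||^2 with SGD on the ZO estimate.
target = np.array([1.0, -2.0, 0.5])
f = lambda x: float(np.sum((x - target) ** 2))
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(300):
    x = x - 0.05 * zo_grad(f, x, rng=rng)
```

On a quadratic the two-point difference recovers the directional derivative exactly, so the loop behaves like noisy gradient descent and converges to the target.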
Hayeon
VLAs do not rely on visual understanding. They replay fixed movement patterns.
The model does not appear to be using the prompt. It also does not seem to be…
New Question: Why does the model lose its ability to generalize after fine-tuning?
Overfitting to specific downstream tasks during fine-tuning.
Improve generalization to new environments.
So why would overfitting to the environment cause insensitivity to position?
Early stopping is a clear way to improve generalization, but then we get an under-adapted model that sometimes fails and sometimes succeeds purely based on the initial noise. So: generate multiple action trajectories and then select the reliable one.
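A sketch of that generate-then-select idea, assuming a hypothetical `policy` (maps an initial noise to an action trajectory) and a hypothetical `score` (reliability measure; the notes don't specify the paper's actual criterion):

```python
import numpy as np

def best_of_n(policy, score, n, horizon, dim, rng):
    """Sample n initial noises, roll each out to a trajectory with
    `policy`, and return the (noise, trajectory) pair whose trajectory
    maximizes the reliability `score`."""
    noises = rng.standard_normal((n, horizon, dim))
    trajs = [policy(z) for z in noises]
    best = int(np.argmax([score(t) for t in trajs]))
    return noises[best], trajs[best]

# Toy demo: the "policy" is tanh-denoising (hypothetical), and the
# "score" rewards smooth trajectories (small step-to-step changes).
rng = np.random.default_rng(0)
policy = lambda z: np.tanh(z)
smooth = lambda t: -float(np.sum(np.diff(t, axis=0) ** 2))
z_star, t_star = best_of_n(policy, smooth, n=16, horizon=10, dim=2, rng=rng)
```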
How do you select the noise then?
Have they explored consistency? Like generating multiple and then checking if there is a cluster.
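One concrete version of that consistency check (my sketch, not something from the talk): pick the medoid trajectory, i.e. the sample with the smallest total distance to all other samples, which lands inside the dominant cluster and rejects outlier rollouts:

```python
import numpy as np

def most_consistent(trajectories):
    """Return the index of the medoid trajectory: the sample with the
    smallest summed L2 distance to all other samples. If most rollouts
    agree, the medoid sits inside that majority cluster."""
    flat = np.asarray(trajectories, dtype=float).reshape(len(trajectories), -1)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    return int(np.argmin(dists.sum(axis=1)))
```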
Do attention scores correlate with the trajectory selected noise?
Can you compute token importance using attention scores alone? Why compute multiple actual paths?
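A common proxy for that question (an assumption of mine, not the paper's definition): score each token by the attention mass it receives as a key, averaged over layers, heads, and query positions:

```python
import numpy as np

def token_importance(attn):
    """attn: array of shape (layers, heads, queries, keys) holding
    attention weights. Returns one importance score per key token:
    the attention mass it receives, averaged over layers, heads,
    and query positions."""
    return np.asarray(attn, dtype=float).mean(axis=(0, 1, 2))
```

Uniform attention over 4 keys yields 0.25 for every token; concentrating all weight on one key makes that token's score 1 and the rest 0.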
What is the runtime of the algorithm? How many trajectories do you need to roll out to select noise?
They use a single batch to select the trajectory.
What is the flow policy? How