30/04/2026 9:35 AM - SNU Presentation

⬅️ [30/04/2026 8:47 AM - One-on-one](<./30_04_2026 8_47 AM - One-on-one.md>) | ⬆️ [2026 - April](<./README.md>) | [30/04/2026 3:36 PM - Amber CQE Prep](<./30_04_2026 3_36 PM - Amber CQE Prep.md>) ➡️

https://openaccess.thecvf.com/content/CVPR2025/papers/Seo_Efficient_Personalization_of_Quantized_Diffusion_Model_without_Backpropagation_CVPR_2025_paper.pdf

They do zeroth-order optimization!
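Zeroth-order methods estimate gradients from function evaluations alone, so no backpropagation through the (quantized) model is needed. A minimal sketch of the generic random-direction finite-difference estimator (not the paper's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def zeroth_order_grad(f, x, eps=1e-3, n_samples=8):
    """Estimate grad f(x) using only function evaluations:
    probe random directions u with a symmetric finite difference."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        grad += (f(x + eps * u) - f(x - eps * u)) / (2 * eps) * u
    return grad / n_samples

# Toy usage: minimize a quadratic without ever computing its gradient.
f = lambda x: float(np.sum((x - 1.0) ** 2))
x = np.zeros(3)
for _ in range(300):
    x -= 0.05 * zeroth_order_grad(f, x)
```

The estimator is unbiased (E[u uᵀ] = I for standard normal probes), which is what makes SGD-style updates work without backprop.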

Hayeon

VLAs do not rely on visual understanding. They replay fixed movement patterns.

Does not appear to be using the prompt. Also does not seem to be…

New Question: Why does the model lose its ability to generalize after fine-tuning?

Overfitting to specific downstream tasks during fine-tuning.
Goal: improve generalization to new environments.

So why would overfitting to the environment cause insensitivity to position?

Early stopping is a clear way to improve generalization, then. But that leaves an under-adapted model that sometimes fails and sometimes succeeds purely depending on the initial noise. So: generate multiple action trajectories, then select the reliable one.
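The generate-and-select step could look roughly like this; the `policy` interface and the reliability criterion (proximity to the candidate mean) are assumptions, since the talk did not spell out the actual selection rule:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_trajectory(policy, obs, n_candidates=8, horizon=16, action_dim=7):
    """Sample one action trajectory per initial noise, then keep the
    candidate closest to the mean trajectory (a crude reliability proxy)."""
    noises = rng.standard_normal((n_candidates, horizon, action_dim))
    trajs = np.stack([policy(obs, z) for z in noises])   # (n, horizon, dim)
    mean = trajs.mean(axis=0)
    dists = np.linalg.norm((trajs - mean).reshape(n_candidates, -1), axis=1)
    best = int(np.argmin(dists))
    return trajs[best], noises[best]

# Toy usage with a stand-in "policy" that lightly perturbs the observation.
obs = np.ones(7)
toy_policy = lambda obs, z: obs + 0.1 * z
traj, noise = select_trajectory(toy_policy, obs)
```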

How do you select the noise then?

Have they explored consistency? E.g. generating multiple trajectories and checking whether they form a cluster.

Do attention scores correlate with the noise selected for the trajectory?
Can you compute token importance using attention scores alone? Why compute multiple actual paths?
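One way the attention-only idea could look (a hypothetical sketch; how the model exposes attention, and which query position matters, are assumptions): average attention weights over layers and heads, then read the row for the query/action token to rank input tokens.

```python
import numpy as np

def token_importance(attn, query_idx=-1):
    """attn: (layers, heads, seq, seq) attention weights.
    Returns token indices sorted by how much attention the query
    position pays to them, averaged over layers and heads."""
    avg = attn.mean(axis=(0, 1))        # (seq, seq)
    scores = avg[query_idx]            # attention from query to each token
    order = np.argsort(scores)[::-1]
    return order, scores

# Toy check: make token 2 clearly dominant for the last query position.
attn = np.full((4, 8, 5, 5), 0.1)
attn[:, :, -1, 2] = 0.9
order, scores = token_importance(attn)
```

If this ranking matched the noise picked by full rollouts, the rollouts might indeed be redundant, which is the point of the question.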

What is the runtime of the algorithm? How many trajectories do you need to roll out to select noise?

They use a single batch to select the trajectory.

What is the flow policy? How…

