19/02/2026 8:03 AM - SNU

⬅️ [22/01/2026 2:09 PM - Lab Meeting](<./22_01_2026 2_09 PM - Lab Meeting.md>) | ⬆️ [Lab Meetings](<./README.md>) | [19/02/2026 1:50 PM](<./19_02_2026 1_50 PM.md>) ➡️

VLAs basically overfit to object layouts. Are they memorizing trajectories?

They are running simulations right now.

What model were they using for this?
A: OpenVLA

So were these exact tasks seen at train time?
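
A minimal sketch of the kind of layout-perturbation sweep being discussed, to separate "memorized trajectory" from "visually grounded" behavior. `make_task_env`, `shift_object`, and `policy` are hypothetical placeholders, not the lab's actual evaluation code.

```python
def success_rate_vs_offset(make_task_env, policy, offsets_cm, episodes=20, horizon=300):
    """For each offset, shift the target object away from its training-time
    position, roll out the policy, and record the success rate. A sharp drop
    at small offsets suggests the policy replays a memorized trajectory
    instead of tracking the object."""
    results = {}
    for dx in offsets_cm:
        successes = 0
        for seed in range(episodes):
            env = make_task_env(seed=seed)                  # hypothetical env factory
            obs = env.reset()
            shift_object(env, "alphabet_soup", dx_cm=dx)    # hypothetical helper
            info = {}
            for _ in range(horizon):
                action = policy.act(obs)                    # e.g., a wrapper around the VLA
                obs, _, done, info = env.step(action)
                if done:
                    break
            successes += int(info.get("success", False))
        results[float(dx)] = successes / episodes
    return results

# rates = success_rate_vs_offset(make_task_env, policy, offsets_cm=[0, 2, 4, 6, 8, 10])
```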

I would be interested to see the attention maps. Is it attending to the image at all?
A: They did check. In one camera view it attends exclusively to the wrist position; in the other view it attends to basically every object equally.
Are they checking the attention map for the specific alphabet soup token? Or just generally?
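
A minimal sketch of what that per-token check could look like, assuming a HuggingFace-style OpenVLA checkpoint whose forward pass honors the standard `output_attentions=True` flag, and assuming the multimodal sequence is laid out as [BOS, image patches, text tokens]. The checkpoint ID, patch count, prompt, and token lookup are illustrative assumptions, not confirmed details of the setup above.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",   # flash attention does not return weights
).to("cuda")

image = Image.open("third_person_view.png")       # one camera view from the rollout
prompt = "In: What action should the robot take to pick up the alphabet soup?\nOut:"
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)  # assumes the flag is supported

# Head-averaged last-layer attention: (seq, seq)
attn = out.attentions[-1][0].float().mean(dim=0)

num_patches = 256          # assumed patch count; check the vision backbone config
ids = inputs["input_ids"][0].tolist()
# Index of the "soup" token within the text, then offset by the patches
# assumed to be inserted after BOS (layout assumption above).
text_pos = max(i for i, t in enumerate(ids)
               if "soup" in processor.tokenizer.decode([t]).lower())
soup_pos = text_pos + num_patches

# Attention mass the "soup" token places on the image patches.
patch_attn = attn[soup_pos, 1:1 + num_patches]
print("attention on image patches:", patch_attn.sum().item())
# patch_attn.reshape(16, 16) can be plotted as a heat map over the image.
```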

Interesting. The earlier LIBERO-Pro paper argues that low diversity in the language inputs can cause this overfitting.

But if there is high diversity in the visual inputs, how is it overfitting to position? Isn't position a visual input?

This seems to be conserved across models.


⬅️ [22/01/2026 2:09 PM - Lab Meeting](<./22_01_2026 2_09 PM - Lab Meeting.md>) | ⬆️ [Lab Meetings](<./README.md>) | [19/02/2026 1:50 PM](<./19_02_2026 1_50 PM.md>) ➡️