
12/03/2026 9:41 AM - SNU Meeting

⬅️ [11/03/2026 6:53 PM - SNU Presentation Prep](<./11_03_2026 6_53 PM - SNU Presentation Prep.md>) | ⬆️ [2026 - March](<./README.md>) | [12/03/2026 12:30 PM](<./12_03_2026 12_30 PM.md>) ➡️


VLAs are not robust to environment changes.

Have they looked at the attention? Yes, I think so; they did that last time.

Sometimes it succeeds without any prompt.

So what is passed in with no prompt? What is the task? A specific object? How would the same model perform across multiple tasks without a prompt?

The models always attend to graspable objects, but not necessarily to the specific object they should grasp.

Pass in additional information by changing the attention map to focus on the correct object.
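A minimal sketch of one way such biasing could work (an assumption, not the group's actual method): add a log-space bonus to the attention logits of image patches that fall inside a target bounding box, before the softmax. The function name, shapes, and the `bias` parameter are all hypothetical.

```python
import numpy as np

def biased_attention(q, k, v, bbox, grid_hw, bias=2.0):
    """Single-head attention with an additive log-bias toward patches
    inside a bounding box. Hypothetical sketch, not the paper's method.
    q: (n_queries, d), k/v: (h*w, d) patch tokens in row-major order.
    bbox: (x0, y0, x1, y1) in patch-grid coordinates."""
    h, w = grid_hw
    d = q.shape[-1]
    # Build a per-patch bias map: +bias inside the box, 0 elsewhere.
    x0, y0, x1, y1 = bbox
    mask = np.zeros((h, w))
    mask[y0:y1, x0:x1] = bias
    mask = mask.reshape(-1)  # flatten to match the key/token order
    # Standard scaled dot-product logits, plus the additive bias.
    logits = q @ k.T / np.sqrt(d) + mask
    # Numerically stable softmax over the key axis.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because the bias is additive in log space, zero bias recovers plain attention exactly, and the softmax renormalizes so attention weights still sum to one; the "vector field" question would then be about how the bbox (and hence the mask) is produced.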

Action coherence guidance.

How do you generate the vector field for bounding box biased attention?

How are you actually biasing the attention mechanism?

Are you running with and without bias and then combining them?
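If biased and unbiased runs are combined, one plausible form (an assumed analogy to classifier-free guidance from diffusion models, not confirmed as their method) is to extrapolate from the unbiased action prediction toward the biased one:

```python
import numpy as np

def guided_action(a_uncond, a_biased, scale=1.5):
    """Classifier-free-guidance-style blend (assumed analogy):
    scale=0 returns the unbiased prediction, scale=1 the biased one,
    and scale>1 extrapolates past it to amplify the bias's effect."""
    return a_uncond + scale * (a_biased - a_uncond)
```

This would also explain why two forward passes are needed per step: one with the attention bias applied and one without.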

Is the camera actuatable? Yes. Ooh.

Have they thought about learning the metric for failure detection? Is it still standard to use pre-defined metrics?


⬅️ [11/03/2026 6:53 PM - SNU Presentation Prep](<./11_03_2026 6_53 PM - SNU Presentation Prep.md>) | ⬆️ [2026 - March](<./README.md>) | [12/03/2026 12:30 PM](<./12_03_2026 12_30 PM.md>) ➡️