
⬅️ [22/01/2026 8:00 AM - SK Team](<./22_01_2026 8_00 AM - SK Team.md>) | ⬆️ [Lab Meetings](<./README.md>) | [19/02/2026 8:03 AM - SNU](<./19_02_2026 8_03 AM - SNU.md>) ➡️

# 22/01/2026 2:09 PM - Lab Meeting

I got Spiedo again. Jason loves it, so good move.

## Yayuan Presentation

Same as this morning.

For the in-context visual instruction, would it not be useful to also have video as input? Basically, style-transfer the current situation onto the existing instructional video?

Can we tell where there is a conflict between the prompt frame and the auxiliary prompt? For example, if we say "He cuts the carrot with the knife in his left hand" but the knife is in his right hand, what should we do? What about generating the video and then using the mistake-detection model to check for incongruence?

How do you get ground truth for the bounding boxes and for the part of the prompt that is the mistake?

This got answered: it comes from a contrastive model pointing out where the incongruence is.
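
So I remember the shape of this, here is a minimal sketch assuming a CLIP-style contrastive model from Hugging Face `transformers` (not necessarily the model in the presentation): score the frame against the stated instruction and a minimally edited variant, and flag the edited phrase if the variant fits the frame better.

```python
# Minimal sketch of contrastive incongruence detection, assuming a
# CLIP-style model (NOT necessarily the model used in the paper).
# Idea: score the frame against the stated instruction and against a
# minimally edited variant; if the edit fits better, the edited phrase
# is where the incongruence lives.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame.png")  # hypothetical prompt frame
candidates = [
    "He cuts the carrot with the knife in his left hand",   # stated prompt
    "He cuts the carrot with the knife in his right hand",  # edited variant
]

inputs = processor(text=candidates, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # similarity per caption

if logits[1] > logits[0]:
    print("Incongruence: the frame matches the 'right hand' variant better.")
```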

Do the existing dataset annotations lend themselves well to the "action-subject" (verb-noun) model? What kinds of mistakes are conducive to creating the dataset? Does this bias the dataset toward specific types of mistakes? Does it need to be very specific?
Could you extend this dataset by using an LLM to rephrase the instruction into something higher-level or more complex, so that you can learn to catch mistakes in more difficult samples that are not just verb-noun? A rough sketch of that is below.
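
Assuming an OpenAI-style chat API (any instruction-tuned LLM would do; the model name and prompt wording are placeholders), the rephrasing step might look like this:

```python
# Sketch of extending a verb-noun mistake dataset by LLM rephrasing.
# Assumes the OpenAI Python client; model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

def rephrase(instruction: str) -> str:
    """Rewrite a flat verb-noun instruction into a higher-level one."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rewrite the cooking instruction at a higher level of "
                        "abstraction, keeping the same underlying action."},
            {"role": "user", "content": instruction},
        ],
    )
    return resp.choices[0].message.content

# e.g. "cut carrot" -> something like "prepare the vegetables for the stew";
# the original verb-noun pair still labels which mistake the clip contains.
print(rephrase("cut carrot"))
```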

Maybe I missed it. How did they get the ground truth for the point-of-no-return (PNR) frame?

For the temporal attribution ground truth, did they train it to predict a single frame? Or use a falloff around where the PNR is? Or have edge frames that are not included in the loss?
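
To keep the three options straight, here is a toy sketch (my own illustration, not from the talk) of the target each scheme would produce:

```python
# Toy illustration of three possible temporal ground-truth schemes around
# a point-of-no-return (PNR) frame. My own sketch, not from the talk.
import numpy as np

T, pnr = 16, 7  # clip length and PNR frame index (hypothetical values)

# (a) single-frame target: one-hot at the PNR
one_hot = np.zeros(T)
one_hot[pnr] = 1.0

# (b) soft falloff: Gaussian centered on the PNR, renormalized
sigma = 1.5  # falloff width is a guess
falloff = np.exp(-0.5 * ((np.arange(T) - pnr) / sigma) ** 2)
falloff /= falloff.sum()

# (c) don't-care band: PNR is positive, adjacent frames are excluded from
# the loss (too ambiguous to call), everything else is negative
target = np.zeros(T)
target[pnr] = 1.0
loss_mask = np.ones(T, dtype=bool)
loss_mask[[pnr - 1, pnr + 1]] = False  # band width is a guess
```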

## Jason

Can we have a mathematical way to treat a region in an embedding space as a token, instead of just a point?
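
One hedged guess at what that could look like: treat each token as a Gaussian over the embedding space (mean plus diagonal covariance) and replace the point dot product with the closed-form Gaussian overlap integral. This is my own sketch of one known formulation (Gaussian embeddings), not something anyone proposed in the meeting:

```python
# Sketch: a token as a *region* (Gaussian with diagonal covariance) rather
# than a point. Similarity is the log probability-product kernel
#   log integral N(x; mu1, S1) N(x; mu2, S2) dx = log N(mu1; mu2, S1 + S2),
# which reduces to a variance-discounted squared distance. My own sketch.
import numpy as np

def gaussian_token_logsim(mu1, var1, mu2, var2):
    """Closed-form log overlap of two diagonal-covariance Gaussian tokens."""
    var = var1 + var2
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (mu1 - mu2) ** 2 / var)

rng = np.random.default_rng(0)
mu_a, mu_b = rng.normal(size=8), rng.normal(size=8)
tight, broad = np.full(8, 0.1), np.full(8, 2.0)

print(gaussian_token_logsim(mu_a, tight, mu_b, tight))
print(gaussian_token_logsim(mu_a, broad, mu_b, broad))
```

A broad covariance makes a token claim a larger region, so it overlaps distant tokens more readily than a tight one; attention logits built from this overlap would then depend on the region, not just the center point.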


⬅️ [22/01/2026 8:00 AM - SK Team](<./22_01_2026 8_00 AM - SK Team.md>) | ⬆️ [Lab Meetings](<./README.md>) | [19/02/2026 8:03 AM - SNU](<./19_02_2026 8_03 AM - SNU.md>) ➡️