12/03/2026 9:26 PM - Paper Review
Typo: "Most methods follow a rigid procedure that the model blindly crops the video once"
"""
Others (Li et al., 2024; Hyun et al., 2025; Wang et al., 2025c) resort to visual token compression
"""
Oh... interesting. I should look at these.
Review format:
Summary
The paper presents a unified framework that uses a single model to perform both temporal grounding and question answering. The model first performs a coarse sampling of the entire video to roughly localize events of interest, crops the video to that segment, repeats these steps until an answer is evident, and then answers the question. They implement a data generation pipeline to facilitate SFT and then propose a reward structure for a GRPO RL pipeline to self-improve the model's capabilities. They report SOTA performance on final-answer metrics and competitive performance on the intermediate task of temporal grounding.
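Sketching the inference loop for myself (a minimal Python sketch; Step, ground_or_answer, and the video sampling methods are my placeholders, not the paper's actual API):
"""
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Step:
    answer: Optional[str]                    # set once the model commits to an answer
    interval: Optional[Tuple[float, float]]  # else, the segment to zoom into

def answer_question(model, video, question, max_iters=5, frame_budget=64):
    start, end = 0.0, video.duration
    for _ in range(max_iters):
        # Coarse pass: uniformly sample a fixed frame budget from the current segment.
        frames = video.sample(start, end, frame_budget)
        step = model.ground_or_answer(frames, question)
        if step.answer is not None:   # answer is evident -> stop searching
            return step.answer
        start, end = step.interval    # crop to the localized event and repeat
    # Iteration budget exhausted: answer from the finest crop reached.
    return model.answer(video.sample(start, end, frame_budget), question)
"""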
Strengths & Weaknesses
Soundness - 4
The paper compares the learned model to the SOTA using empirical results on known long-video comprehension datasets as well as on the benchmark generated for this paper. These cover a large range of video lengths and provide support for the claim that this model is now SOTA in long video question answering for videos up to 20 minutes.
Presentation - 3
Figures explain the method well.
Significance - 3
Long-form video QA is an important task with broad applications, and methods developed for it also translate well to other long-sequence understanding tasks. This paper will likely influence future work on producing SFT training samples and rewards for RL fine-tuning.
Originality - 2, but I could be convinced otherwise. I don't know how novel papers in this field generally are.
The paper follows the strategy of prior works like LongVT by using a chain of thought strategy with repeated global-local reductions to search videos for context and extract relevant information from fine-grained crops.
This paper introduces a novel masking strategy for generating multi-turn SFT data that is meaningfully different from prior works like LongVT and may contribute to the increased performance.
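I don't remember the exact masking details, so this is just generic multi-turn SFT label masking for my own reference (supervise only the model's turns via the -100 ignore index), not the paper's specific strategy:
"""
import torch

def build_labels(token_ids, turn_spans):
    # token_ids: LongTensor of shape [seq_len]
    # turn_spans: list of (start, end, role) index ranges over token_ids
    labels = torch.full_like(token_ids, -100)  # -100 is ignored by cross-entropy
    for start, end, role in turn_spans:
        if role == "assistant":                # loss only on model-generated turns
            labels[start:end] = token_ids[start:end]
    return labels
"""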
Specific Issues
The unified framework used in this paper creates a sequential processing bottleneck: each crop depends on the previous model output, so multiple cropped video clips cannot be analyzed in parallel. This could increase inference time significantly, but runtime metrics are not reported, so the overhead is difficult to quantify.
My questions
Is the reason that long videos (>20 minutes) still present a challenge that the initial sparse sampling of frames misses key events?
If there are events in a video that last for only a few frames, is it probable that these events will be missed even by this unified model? Are other mechanisms needed to search videos for such events?
Given that one reward term depends on ground-truth intervals, can this method self-improve from data consisting only of (video, question, answer) tuples? Or is it restricted to datasets consisting of (video, question, temporal interval, answer) tuples?
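To make this question concrete, here is how I imagine the reward decomposes (weights and names are mine, not the paper's): without a ground-truth interval, only the answer term survives.
"""
def temporal_iou(pred, gt):
    # IoU of two (start, end) intervals; 0 if they do not overlap.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def reward(pred_answer, gt_answer, pred_interval, gt_interval=None,
           w_ans=1.0, w_ground=0.5):
    r = w_ans * float(pred_answer == gt_answer)
    if gt_interval is not None:   # grounding term needs a GT interval
        r += w_ground * temporal_iou(pred_interval, gt_interval)
    return r
"""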
Things like typos
"Most methods follow a rigid procedure that the model blindly crops the video once" - slight grammar issue