⬅️ [16/04/2026 8:05 AM - SNU presentation prep](<./16_04_2026 8_05 AM - SNU presentation prep.md>) | ⬆️ [2026 - April](<./README.md>) | [20/04/2026 10:09 AM](<./20_04_2026 10_09 AM.md>) ➡️

17/04/2026 12:45 PM - ReID More explanation

Cody is supposed to be getting some numbers from the videos. On top of that we have training metrics, which do not include between-tracklet tests but are useful for comparing different methods.

https://wandb.ai/veldrovive/sam3-reid/runs/9yk5duon
This run is the one we actually use. It uses a transformer as our contrastive model and DINOv3 Large to produce the input embeddings. It also includes a large set of augmentations that allow it to generalize to inter-scene comparisons.
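Roughly, the pipeline looks like this (a sketch only; the `ReIDHead` name, layer sizes, and CLS-token pooling are illustrative, not the actual config): frozen DINOv3 tokens go through a small transformer encoder that pools them into a single normalized ReID embedding.

```python
import torch
import torch.nn as nn

class ReIDHead(nn.Module):
    """Illustrative contrastive head over frozen DINOv3 token embeddings.
    All dimensions and layer counts here are made up for the sketch."""

    def __init__(self, dino_dim=1024, embed_dim=256, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(dino_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, dino_tokens):
        # dino_tokens: (B, N, dino_dim) from the frozen DINOv3 backbone
        x = self.proj(dino_tokens)
        cls = self.cls.expand(x.shape[0], -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        # L2-normalize so cosine similarity is just a dot product
        return nn.functional.normalize(x[:, 0], dim=-1)
```

The augmentation stack used during training: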

```python
from torchvision.transforms import v2

# RandomSubjectZoom and ApplyBackgroundMask are our custom transforms
# (illustrative sketches of both appear below).
transform = v2.Compose([
    # --- 1. Safe Spatial Transforms ---
    v2.RandomHorizontalFlip(p=0.5),

    v2.RandomApply([RandomSubjectZoom(scale_range=(1.0, 2.0))], p=0.25),

    # Slight rotation (±5 degrees), translation (±5%), and scaling (95% to 105%).
    # The mask will perfectly track with these changes.
    v2.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05)),

    # --- 2. Color and Lighting (Photometric) ---
    v2.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    # v2.RandomGrayscale(p=0.1),

    # Randomly apply Gaussian blur to 10% of images to simulate poor focus.
    v2.RandomApply([v2.GaussianBlur(kernel_size=(5, 9), sigma=(0.1, 5.0))], p=0.1),

    v2.RandomApply([ApplyBackgroundMask(bg_val=0.0)], p=0.5),
])
```
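The "mask will perfectly track" comment works because torchvision's v2 transforms apply geometric ops jointly to every input, including `tv_tensors.Mask`, while photometric ops leave masks alone. A minimal illustration with built-in transforms only (shapes are made up):

```python
import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

img = tv_tensors.Image(torch.rand(3, 224, 224))
mask = tv_tensors.Mask(torch.zeros(224, 224, dtype=torch.uint8))

geo = v2.Compose([
    v2.RandomHorizontalFlip(p=0.5),
    v2.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05)),
])

# The same randomly sampled flip/affine is applied to both inputs,
# so the mask stays aligned with the image.
out_img, out_mask = geo(img, mask)
```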

The important ones are RandomSubjectZoom, which randomly crops in close to the student so the model learns scale invariance (in early tests without it, the model assigned everyone who appeared large in the frame to the same individual), and ApplyBackgroundMask, which sometimes blacks out the background and keeps only the segmented student, teaching the model not to rely on the background.
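For reference, here are minimal sketches of those two custom transforms, reconstructed from the descriptions above. The real implementations are v2-compatible and also transform the mask; these illustrative versions elide that, and RandomSubjectZoom crops at the center rather than around the segmented student.

```python
import torch
from torchvision.transforms.v2 import functional as F

class RandomSubjectZoom(torch.nn.Module):
    """Sketch: crop in by a random factor and resize back, so the
    apparent scale of the subject varies between samples."""

    def __init__(self, scale_range=(1.0, 2.0)):
        super().__init__()
        self.scale_range = scale_range

    def forward(self, img):
        h, w = img.shape[-2:]
        zoom = torch.empty(1).uniform_(*self.scale_range).item()
        ch, cw = int(h / zoom), int(w / zoom)
        top, left = (h - ch) // 2, (w - cw) // 2
        return F.resized_crop(img, top, left, ch, cw, size=[h, w])

class ApplyBackgroundMask(torch.nn.Module):
    """Sketch: black out everything outside the subject mask."""

    def __init__(self, bg_val=0.0):
        super().__init__()
        self.bg_val = bg_val

    def forward(self, img, mask):
        # mask: (H, W) bool, True on the student
        return torch.where(mask, img, torch.full_like(img, self.bg_val))
```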

https://wandb.ai/veldrovive/sam3-reid/runs/aj63hfls
This is the same as the previous run, except that it uses the DINOv3 Small model; the comparison demonstrates that a more expressive DINO model can be helpful.

https://wandb.ai/veldrovive/sam3-reid/runs/ar6s2sk4?nw=nwuserveldrovive
This run uses the DINOv3 Huge model and demonstrates diminishing returns from larger DINO models. It takes much longer to run but does not perform any better on our metrics than DINOv3 Large.

https://wandb.ai/veldrovive/sam3-reid/runs/qiqgistp
This run uses the same augmentations, except that it always applies the background mask. It demonstrates that always removing the background actually reduces performance. I don't know exactly why, but the model seems to be more robust when it has the full context. Perhaps this is just an effect of DINO being trained on full images rather than images with most of the scene masked out.

https://wandb.ai/veldrovive/sam3-reid/runs/c10pie5n
This is an old version that did not use the augmentations. It gets only slightly worse results on our metrics, but the metrics hide the real problem: it performs badly on inter-tracklet cases because it lacks scale invariance.

We also note that, in general, the model separates negative groups well but struggles to consistently pull positive groups together. We hypothesize this is simply because many frames are uninformative and therefore cannot uniquely identify an individual. The best the model can do in that case is place such a frame at a medium distance from everything, so that it falls between the margins and is not counted as a hard negative or a hard positive by circle loss. This is why it is important to use SAM3 tracklets to group multiple frames together and use all of them jointly for the ReID task.
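To make the margin point concrete, here is a minimal circle-loss sketch for a single anchor (following Sun et al., CVPR 2020; the margin m and scale gamma values shown are hypothetical, not our trained config):

```python
import torch

def circle_loss(sim_pos, sim_neg, m=0.25, gamma=64.0):
    """Circle loss for one anchor, given cosine similarities to its
    positives (sim_pos) and negatives (sim_neg)."""
    # Optima and relaxed margins from the paper.
    o_p, o_n = 1 + m, -m
    delta_p, delta_n = 1 - m, m
    # Self-paced weights: pairs that are already good enough get weight 0.
    alpha_p = torch.clamp(o_p - sim_pos, min=0.0)
    alpha_n = torch.clamp(sim_neg - o_n, min=0.0)
    logit_p = -gamma * alpha_p * (sim_pos - delta_p)
    logit_n = gamma * alpha_n * (sim_neg - delta_n)
    # log(1 + sum_n exp(.) * sum_p exp(.)) == softplus(lse_n + lse_p)
    return torch.nn.functional.softplus(
        torch.logsumexp(logit_n, dim=-1) + torch.logsumexp(logit_p, dim=-1))
```

With m = 0.25 the margins are Δ_n = 0.25 and Δ_p = 0.75, so an uninformative frame sitting at similarity ≈ 0.5 to everything produces much smaller logits than a genuinely hard pair, and the logsumexp terms are dominated by the informative frames.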


⬅️ [16/04/2026 8:05 AM - SNU presentation prep](<./16_04_2026 8_05 AM - SNU presentation prep.md>) | ⬆️ [2026 - April](<./README.md>) | [20/04/2026 10:09 AM](<./20_04_2026 10_09 AM.md>) ➡️