SAM Model Video Segmentation Project

Exploring two ways of combining SAM with optical flow (flow as the input, or RGB as the input with flow as a prompt), together with continuously tracking the identity of the same object across frames.

The goal of this project is motion segmentation: discovering and segmenting moving objects in video. This is a widely researched area with many careful and sometimes complex methods and training schemes, including self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and more. The question explored here is whether the Segment Anything Model (SAM) can help accomplish this task.

Two models that combine SAM with optical flow are studied, leveraging SAM's segmentation capabilities and optical flow's ability to discover and group moving objects. In the first, SAM is adapted to take optical flow (rather than RGB) as input. In the second, SAM takes RGB as input and uses flow as a segmentation prompt. These surprisingly simple methods, without any further modifications, significantly outperform all previous methods on both single-object and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identities. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.
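For the first model, a flow field has two channels (horizontal and vertical displacement), while SAM's image encoder expects a three-channel image. A common way to bridge this gap is the standard colour-wheel encoding of flow, where hue encodes direction and brightness encodes magnitude. The sketch below is an illustrative assumption about this preprocessing step, not the project's exact pipeline; the function name `flow_to_rgb` is hypothetical.

```python
import numpy as np

def flow_to_rgb(flow):
    """Encode a 2-channel optical flow field (H, W, 2) as an RGB image.

    Standard colour-wheel convention: hue encodes flow direction,
    brightness encodes (normalised) flow magnitude. One simple way to
    present flow to an RGB-pretrained encoder such as SAM's.
    """
    u, v = flow[..., 0], flow[..., 1]
    mag = np.sqrt(u ** 2 + v ** 2)
    ang = np.arctan2(v, u)                  # direction in [-pi, pi]

    hue = (ang + np.pi) / (2 * np.pi)       # [0, 1)
    val = mag / (mag.max() + 1e-8)          # normalised magnitude
    sat = np.ones_like(hue)

    # Minimal vectorised HSV -> RGB conversion (no OpenCV dependency).
    i = np.floor(hue * 6).astype(int) % 6
    f = hue * 6 - np.floor(hue * 6)
    p = val * (1 - sat)
    q = val * (1 - f * sat)
    t = val * (1 - (1 - f) * sat)
    lut = np.stack([
        np.stack([val, t, p], -1), np.stack([q, val, p], -1),
        np.stack([p, val, t], -1), np.stack([p, q, val], -1),
        np.stack([t, p, val], -1), np.stack([val, p, q], -1),
    ], 0)
    rgb = np.take_along_axis(lut, i[None, ..., None], axis=0)[0]
    return (rgb * 255).astype(np.uint8)
```

Static regions (zero flow) map to black, so a flow-driven SAM naturally sees moving objects as salient coloured blobs against a dark background.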
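Extending frame-level masks to sequence-level segmentation requires associating each mask in the current frame with an object identity from previous frames. A minimal sketch of one standard approach is greedy matching by mask IoU; this illustrates the idea of identity maintenance only, and the helper names (`mask_iou`, `link_masks`) and the 0.5 threshold are assumptions, not the paper's method.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def link_masks(prev_masks, curr_masks, iou_thresh=0.5):
    """Greedily match current-frame masks to previous-frame identities.

    prev_masks: dict {object_id: bool mask} from the last frame
    curr_masks: list of bool masks from frame-level segmentation
    Returns dict {object_id: mask}; unmatched masks get fresh ids.
    """
    next_id = max(prev_masks, default=-1) + 1
    # All (iou, id, index) candidates, best matches first.
    pairs = sorted(
        ((mask_iou(pm, cm), oid, j)
         for oid, pm in prev_masks.items()
         for j, cm in enumerate(curr_masks)),
        reverse=True)
    out, used_ids, used_j = {}, set(), set()
    for iou, oid, j in pairs:
        if iou < iou_thresh:
            break  # remaining candidates are even weaker
        if oid in used_ids or j in used_j:
            continue
        out[oid] = curr_masks[j]
        used_ids.add(oid)
        used_j.add(j)
    # Any unmatched current mask starts a new object track.
    for j, cm in enumerate(curr_masks):
        if j not in used_j:
            out[next_id] = cm
            next_id += 1
    return out
```

Running this frame by frame propagates identities through the video, which is what the multi-object video segmentation benchmarks evaluate.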


This research was supported by the UK EPSRC CDT in AIMS (EP/S024050/1), Clarendon Fellowships and the UK EPSRC Project Grant for Visual Artificial Intelligence (EP/T028572/1).


Project address: https://robots.ox.ac.uk/~vgg/research/flowsam/
