Object segmentation is a foundational and critically important field in modern computer vision. It plays an important role in applications that rely on extensive visual components, such as object localization and identification, and that demand real-time, fast, and accurate segmentation. This importance has made object segmentation a consistently hot research topic, with significant work done in areas like instance segmentation, semantic segmentation, and panoptic segmentation.
With the evolution of object segmentation, the Segment Anything Model (SAM) has emerged as a remarkable tool, showcasing outstanding segmentation abilities and quickly being adopted in various computer vision applications. Frameworks using a pre-trained SAM architecture have achieved impressive performance in downstream vision tasks. Nevertheless, despite its capabilities and high accuracy in segmentation tasks, SAM’s complex and heavy architecture necessitates substantial computational power, hindering its implementation on computationally constrained devices.
Addressing SAM’s computational challenges, researchers have developed the Tiny Segment Anything Model (TinySAM), which retains the zero-shot performance of the original framework while being far more lightweight. TinySAM uses a full-stage knowledge distillation method with online hard prompts to create a more efficient student model. Post-training quantization adapted to promptable segmentation tasks further reduces computational needs. Moreover, TinySAM’s hierarchical segment-everything strategy nearly doubles the inference speed without compromising performance.
This article delves into the TinySAM framework, exploring its foundational principles, architecture, and performance in comparison with other state-of-the-art segmentation frameworks. Let’s explore these facets in more detail.
The Segment Anything Model has driven rapid progress in several computer vision applications owing to its commendable segmentation capabilities, coupled with a massive segmentation dataset that houses over 11 million images and over a billion image masks. Owing to its exceptional performance on tasks segmenting objects of arbitrary categories and shapes, it serves as the foundation for frameworks performing downstream tasks like image inpainting, object tracking, 3D vision, and more. Moreover, the Segment Anything Model offers remarkable zero-shot segmentation performance that has benefitted sensitive industries that work with a limited amount of data, including the medical research and medical imaging industries.
Although one cannot question the remarkable segmentation capabilities that the Segment Anything Model offers across a wide selection of downstream vision tasks, it does have its downsides: a complex and heavy architecture, high computational requirements, and significant operational costs. For a system running on a modern GPU, the inference time of a SAM model can be as high as 2 seconds for a 1024×1024 image. As a result, it is highly difficult to deploy SAM applications on devices with limited computational abilities. To overcome this hurdle, recent works like MobileSAM and FastSAM have tried to develop a SAM model with more computational efficiency. The MobileSAM framework attempts to replace the heavy component in the image encoder with the architecture of the TinyViT framework, whereas the FastSAM model transfers the segment-anything task to an instance segmentation task with just one category using the YOLOv8 model. Although these methods achieve some level of success in reducing the computational requirements, they cannot maintain the performance, especially on downstream zero-shot tasks.
TinySAM, or the Tiny Segment Anything Model, is an attempt to reduce the computational requirements of the existing SAM model without hindering performance on zero-shot downstream tasks. The TinySAM framework proposes a full-stage knowledge distillation method in its architecture with the aim of improving the capability of the compact student network. The framework distills the student network in an end-to-end manner under the supervision of the teacher network at different stages. To boost performance further, the framework allows the distillation process to attend more to hard examples by implementing an additional online hard prompt sampling strategy. Moreover, to further reduce computational costs, the TinySAM framework applies post-training quantization adapted to promptable segmentation tasks.
A major chunk of the computational requirement of the Segment Anything Model comes from generating massive numbers of masks from grid prompt points to segment everything in the image. To overcome the computational requirement of this segmentation strategy, the TinySAM framework employs a hierarchical segment-everything strategy that nearly doubles the inference speed without degrading performance. With these methods employed in its architecture, the TinySAM framework offers a significant reduction in computational requirements and sets new limits for efficient segment-anything tasks.
TinySAM: Architecture and Methodology
Before we talk about the architecture and methodology of the TinySAM framework, it is important to first take a look at its predecessor, the SAM framework. Ever since its introduction, the Segment Anything Model has demonstrated remarkable performance, versatility, and generalization capabilities across a range of downstream vision and object segmentation tasks.
At its core, the SAM model consists of three subnetworks: the prompt encoder, the image encoder, and the mask decoder. The primary aim of the prompt encoder is to encode arbitrarily shaped masks, input points and boxes, and free-form text along with positional information. The image encoder is a heavy ViT, or vision transformer, based network that encodes the input image into embeddings. The model uses different networks to process the geometric and the text prompts. Finally, the mask decoder comprises a two-way transformer that receives the outputs of the prompt encoder and the image encoder to generate the final mask prediction. With its dataset, the SAM framework demonstrates remarkable high-quality segmentation capabilities for objects regardless of their shape and category. Moreover, the Segment Anything Model demonstrates remarkable performance and efficiency across zero-shot downstream vision tasks including object proposal, edge detection, text-to-mask prediction, and instance segmentation. Owing to its high-quality segmentation abilities and versatile prompt offerings, the SAM framework forms the foundation for many vision applications. With that being said, one cannot ignore the high computational requirement of the original SAM architecture, whose large number of parameters makes it almost impossible for developers to deploy SAM-based applications on devices with constrained resources.
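To make the three-subnetwork layout concrete, the sketch below wires up toy stand-ins for the image encoder, prompt encoder, and mask decoder in PyTorch. Every class name, layer, and tensor size here is an illustrative placeholder rather than SAM's actual implementation; the point is only to show how a dense image embedding and sparse prompt tokens meet in a two-way decoder to produce mask logits.

```python
# A minimal schematic of SAM's three-part design, using toy PyTorch modules.
# All class names and sizes are illustrative placeholders, not SAM's real code.
import torch
import torch.nn as nn

class ToyImageEncoder(nn.Module):
    """Stands in for the heavy ViT image encoder: image -> dense embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)  # patchify-like
    def forward(self, image):                 # (B, 3, 1024, 1024)
        return self.net(image)                # (B, 256, 64, 64)

class ToyPromptEncoder(nn.Module):
    """Stands in for the prompt encoder: points/boxes -> sparse prompt tokens."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.point_embed = nn.Linear(2, embed_dim)  # encode (x, y) coordinates
    def forward(self, points):                # (B, N, 2)
        return self.point_embed(points)       # (B, N, 256)

class ToyMaskDecoder(nn.Module):
    """Stands in for the two-way transformer mask decoder."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.to_mask = nn.Linear(embed_dim, 64 * 64)
    def forward(self, image_embed, prompt_tokens):
        b, c, h, w = image_embed.shape
        img_tokens = image_embed.flatten(2).transpose(1, 2)          # (B, H*W, C)
        fused, _ = self.attn(prompt_tokens, img_tokens, img_tokens)  # prompts attend to image
        return self.to_mask(fused).view(b, -1, h, w)                 # low-res mask logits

image = torch.randn(1, 3, 1024, 1024)
points = torch.tensor([[[512.0, 384.0]]])     # one foreground click
mask_logits = ToyMaskDecoder()(ToyImageEncoder()(image), ToyPromptEncoder()(points))
print(mask_logits.shape)                      # torch.Size([1, 1, 64, 64])
```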
Knowledge Distillation
Knowledge distillation is an important approach for boosting the performance of compact networks during the training phase. The knowledge distillation method uses the output of the teacher network to supervise the training of the lightweight student network. It can be split into two subcategories: distillation for intermediate features and distillation for network outputs, with the majority of research work around knowledge distillation focusing on image classification tasks.
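The snippet below sketches both subcategories side by side: a feature-matching loss on intermediate activations and a softened-output loss of the kind common in classification-oriented distillation. The tensor shapes, temperature, and loss choices are assumptions for illustration, not TinySAM's exact recipe.

```python
# A generic knowledge-distillation step, shown for intuition only; shapes,
# the temperature, and the loss choices below are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_losses(student_feat, teacher_feat, student_logits, teacher_logits,
                        temperature=2.0):
    # Intermediate-feature distillation: pull student features toward teacher features.
    feat_loss = F.mse_loss(student_feat, teacher_feat)
    # Output distillation: match softened output distributions (classification-style KD).
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    out_loss = F.kl_div(log_soft_student, soft_teacher,
                        reduction="batchmean") * temperature ** 2
    return feat_loss, out_loss

# Toy tensors standing in for one training batch.
s_feat, t_feat = torch.randn(4, 256), torch.randn(4, 256)
s_logits, t_logits = torch.randn(4, 10), torch.randn(4, 10)
print(distillation_losses(s_feat, t_feat, s_logits, t_logits))
```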
With that being said, the following figure demonstrates the generic architecture of the TinySAM framework along with a performance overview on zero-shot instance segmentation tasks.
In the first stage, the TinySAM framework implements knowledge distillation designed specifically for the SAM framework, and to facilitate the distillation process further, the model uses online hard prompt sampling to mine hard knowledge from the teacher network for the student network. In the second stage, the TinySAM framework adapts the post-training quantization method to promptable segmentation tasks and applies it to the lightweight student network. Finally, the model implements the hierarchical segment-everything inference mode designed for segmentation tasks, which doubles the inference speed with negligible accuracy loss.
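The exact sampling rule is not reproduced here, but one plausible reading of online hard prompt sampling is sketched below: score each candidate prompt by how strongly the student's mask disagrees with the teacher's, and keep only the hardest prompts for the distillation loss. The disagreement measure and keep ratio are assumptions, not TinySAM's published rule.

```python
# A hedged sketch of online hard prompt sampling: keep the prompts on which the
# student disagrees most with the teacher. The metric and ratio are assumptions.
import torch

def select_hard_prompts(student_masks, teacher_masks, keep_ratio=0.5):
    """student_masks, teacher_masks: (num_prompts, H, W) mask logits."""
    disagreement = (torch.sigmoid(student_masks) -
                    torch.sigmoid(teacher_masks)).abs().mean(dim=(1, 2))
    num_keep = max(1, int(keep_ratio * disagreement.numel()))
    return disagreement.topk(num_keep).indices        # indices of the hardest prompts

student = torch.randn(64, 256, 256)   # masks for 64 candidate point prompts
teacher = torch.randn(64, 256, 256)
print(select_hard_prompts(student, teacher)[:5])
```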
Full-Stage Knowledge Distillation
As mentioned earlier, the Segment Anything Model consists of three subnetworks at its core: the prompt encoder, the image encoder, and the mask decoder, with the image encoder component built on a vision transformer and carrying high computational requirements. To tackle this issue, the MobileSAM framework replaced the vision transformer with a TinyViT, or Tiny Vision Transformer, although the substitution wasn't entirely effective given the significant performance decay. To avoid such performance decay, the TinySAM framework implements a full-stage knowledge distillation method that guides the lightweight image encoder with knowledge drawn from multiple levels of the teacher. In addition to the conventional loss between the ground-truth labels and the predicted results, the TinySAM framework introduces a number of distillation losses at different stages, as shown in the following figure.
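The sketch below shows how several stage-wise distillation losses might be combined with the ordinary supervised loss. The stage names, dictionary keys, and loss weights are assumptions chosen for illustration; the paper defines its own exact loss terms.

```python
# A sketch of combining stage-wise distillation losses with the task loss.
# Keys and weights are illustrative assumptions, not TinySAM's exact definitions.
import torch
import torch.nn.functional as F

def full_stage_loss(student_out, teacher_out, gt_mask,
                    w_embed=1.0, w_token=1.0, w_mask=1.0, w_task=1.0):
    """Each *_out is a dict with 'image_embed', 'mask_tokens', 'mask_logits'."""
    # Image-encoder stage: match the dense image embeddings.
    l_embed = F.mse_loss(student_out["image_embed"], teacher_out["image_embed"])
    # Decoder stage: match intermediate mask tokens.
    l_token = F.mse_loss(student_out["mask_tokens"], teacher_out["mask_tokens"])
    # Output stage: match the teacher's final mask logits.
    l_mask = F.mse_loss(student_out["mask_logits"], teacher_out["mask_logits"])
    # Conventional supervised loss against the ground-truth mask.
    l_task = F.binary_cross_entropy_with_logits(student_out["mask_logits"], gt_mask)
    return w_embed * l_embed + w_token * l_token + w_mask * l_mask + w_task * l_task

fake = lambda: {"image_embed": torch.randn(1, 256, 64, 64),
                "mask_tokens": torch.randn(1, 4, 256),
                "mask_logits": torch.randn(1, 1, 256, 256)}
student_out, teacher_out = fake(), fake()
gt = (torch.rand(1, 1, 256, 256) > 0.5).float()
print(full_stage_loss(student_out, teacher_out, gt).item())
```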
Quantization
Model quantization is a popular approach in computer vision frameworks, used to compress a model by quantizing weights or activations from higher to lower bit-widths in an attempt to reduce computational complexity and storage requirements without significantly degrading output quality.
The primary aim of quantization in TinySAM is to project the floating-point tensor to a low-bit integer tensor using a scaling factor, with the metric measuring the distance between the original matrix multiplication and the quantized one playing an important role in optimizing that scaling factor.
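The following minimal sketch illustrates this idea: weights are fake-quantized to int8 with a scaling factor, and the factor is chosen by searching for the value that keeps the quantized matrix multiplication closest to the floating-point one. The search grid and mean-squared-error metric are assumptions for illustration, not TinySAM's exact calibration procedure.

```python
# Post-training quantization sketch: pick the scale that minimizes the distance
# between the float matmul and the quantized matmul. Grid and metric are assumed.
import torch

def quantize(x, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return (x / scale).round().clamp(-qmax, qmax) * scale   # fake-quantize back to float

def best_scale(w, a, num_candidates=50):
    """Pick the weight scale whose quantized matmul is closest to the float matmul."""
    reference = a @ w                                  # floating-point result
    base = w.abs().max() / 127
    best, best_err = base, float("inf")
    for ratio in torch.linspace(0.5, 1.0, num_candidates):
        scale = base * ratio
        err = (a @ quantize(w, scale) - reference).pow(2).mean()
        if err < best_err:
            best, best_err = scale, err
    return best, best_err

weights = torch.randn(256, 256)
acts = torch.randn(32, 256)
scale, err = best_scale(weights, acts)
print(f"chosen scale={float(scale):.5f}, mse={float(err):.6f}")
```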
Hierarchical Segment Anything
The Segment Anything Model proposes to use an automatic mask generator that samples points as a grid to segment everything in the image. Nevertheless, it has been shown that using a dense point grid results in over-fine-grained segmentation outputs, and the process demands massive computation and incurs high operational costs. Moreover, on the one hand, too many sampling points for a complete object may cause different sections of the object to be incorrectly segmented as separate masks; on the other hand, since the image encoder has already been shrunk significantly, the dense grid of prompts now dominates the time cost of everything-mode inference. To cut the operational cost of the everything mode, the TinySAM framework uses a hierarchical mask generation approach, with the difference in strategy from the original SAM framework demonstrated in the following image.
Different from the approach implemented in the original SAM framework, the TinySAM model uses only 25% of the points along each side, thus utilizing only 1/16 of the points available in the original setting. The model then runs the prompt encoder and the mask decoder with these prompts and gets the output. It then filters the masks whose confidence exceeds a certain threshold and marks the corresponding locations as areas of potential final predictions. Since the model treats these regions as the segmentation results of high-confidence instances, it has no need to generate further point prompts there. This strategy not only helps prevent over-fine-grained segmentation of objects but also brings down the operational costs and computational requirements significantly. The framework then merges and post-processes the results of the two rounds to obtain the final masks.
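A sketch of this two-round everything mode is shown below. The `predict_masks` callable, grid sizes, and confidence threshold are placeholder assumptions standing in for TinySAM's prompt encoder and mask decoder pass; the dummy predictor exists only so the sketch runs end to end.

```python
# Two-round hierarchical everything mode, sketched with placeholder components.
import torch

def point_grid(points_per_side, size):
    xs = torch.linspace(0, size - 1, points_per_side)
    return torch.cartesian_prod(xs, xs)                       # (P*P, 2) point prompts

def hierarchical_everything(predict_masks, image, size=1024,
                            dense_side=32, conf_thresh=0.9):
    # Round 1: only 1/4 of the points per side, i.e. 1/16 of the dense grid.
    sparse_points = point_grid(dense_side // 4, size)
    masks, scores = predict_masks(image, sparse_points)       # (N, H, W), (N,)
    confident = scores > conf_thresh
    covered = masks[confident].any(dim=0)                     # regions already segmented
    # Round 2: dense points, but only where round 1 produced no confident mask.
    dense_points = point_grid(dense_side, size)
    px, py = dense_points[:, 0].long(), dense_points[:, 1].long()
    remaining = dense_points[~covered[py, px]]
    masks2, scores2 = predict_masks(image, remaining)
    # Merge both rounds (post-processing such as deduplication omitted in this sketch).
    return (torch.cat([masks[confident], masks2]),
            torch.cat([scores[confident], scores2]))

# Dummy predictor so the sketch runs end to end: random masks, one per prompt point.
def dummy_predict(image, points):
    n = points.shape[0]
    return torch.rand(n, 1024, 1024) > 0.5, torch.rand(n)

all_masks, all_scores = hierarchical_everything(dummy_predict, torch.zeros(3, 1024, 1024))
print(all_masks.shape, all_scores.shape)
```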
TinySAM: Experiments and Results
To speed up the distillation process, the TinySAM framework computes and stores the image embeddings from the teacher network in advance, so it is no longer necessary to run the heavy image encoder of the teacher network repeatedly during the training phase. For post-training quantization, the TinySAM framework quantizes all of the matrix-multiply layers, the convolution layers, the deconvolution layers, and the linear layers, with the model using channel-wise scaling factors for both the convolution and the deconvolution layers. For the matrix-multiply layers, the model implements head-wise scaling factors, whereas for the linear layers, it implements linear-wise scaling factors. The model also conducts evaluation on zero-shot downstream tasks.
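The embedding-caching trick is simple enough to sketch: run the heavy teacher image encoder once per training image and reuse the stored embedding at every distillation step. The file layout, function names, and toy encoder below are illustrative assumptions, not TinySAM's actual training code.

```python
# Caching teacher image embeddings so the heavy encoder runs only once per image.
import os
import torch

@torch.no_grad()
def cache_teacher_embeddings(teacher_encoder, dataset, cache_dir="teacher_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    for idx, image in enumerate(dataset):                  # image: (3, 1024, 1024)
        embed = teacher_encoder(image.unsqueeze(0))        # heavy forward pass, run once
        torch.save(embed.cpu(), os.path.join(cache_dir, f"{idx}.pt"))

def load_cached_embedding(idx, cache_dir="teacher_cache"):
    # During distillation the stored embedding replaces the teacher encoder call.
    return torch.load(os.path.join(cache_dir, f"{idx}.pt"))

# Toy stand-ins so the sketch runs: a fake "teacher encoder" and a two-image dataset.
teacher_encoder = torch.nn.Conv2d(3, 256, kernel_size=16, stride=16)
dataset = [torch.randn(3, 1024, 1024) for _ in range(2)]
cache_teacher_embeddings(teacher_encoder, dataset)
print(load_cached_embedding(0).shape)                      # torch.Size([1, 256, 64, 64])
```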
For instance segmentation tasks in a zero-shot setting, the TinySAM framework follows the experimental settings of its predecessor, the Segment Anything Model, and uses the object detection results of the Vision Transformer Det-H, or ViTDet-H, framework for instance segmentation. As demonstrated in the following image, the TinySAM framework outperforms existing methods in terms of instance segmentation accuracy and FLOPs.
Moreover, the qualitative performance of the TinySAM model on zero-shot instance segmentation is demonstrated in the following image, with the green boxes representing the box prompts.
In terms of zero-shot point-prompted valid mask evaluation, the TinySAM model outperforms the MobileSAM framework significantly on different datasets, and delivers substantially better results when fewer points are utilized as prompts.
Moreover, the following table summarizes the acceleration and the reduction in computational requirements achieved as a result of the hierarchical everything-mode strategy. The model applies the same stability score and threshold values across the different strategies for a fair comparison, and the results are summarized below.
Final Thoughts
In this article, we have talked about TinySAM, a proposed framework that pushes the boundaries of segment-anything tasks and obtains an efficient model architecture with lower computational requirements and accuracy on par with the original SAM framework, while maintaining its zero-shot performance. The TinySAM framework first implements a full-stage knowledge distillation method that uses online hard prompts to distill a lightweight student model. It then adapts post-training quantization to promptable segmentation tasks, which further helps reduce the computational requirements. Moreover, the framework also segments everything hierarchically, which nearly doubles the inference speed without affecting performance.