Microsoft Research Introduces Florence-2: A Novel Vision Foundation Model with a Unified Prompt-based Representation for a Number of Computer Vision and Vision-Language Tasks

There is a noticeable trend in Artificial General Intelligence (AGI) systems toward using pre-trained, adaptable representations, which offer task-agnostic benefits across a variety of applications. Natural language processing (NLP) is a prime example of this tendency, since sophisticated models exhibit flexibility and broad knowledge spanning many domains and tasks, driven by straightforward instructions. The success of NLP encourages a complementary strategy in computer vision. Unique obstacles arise from the need for broad perceptual capacity in a universal representation for diverse vision-related tasks. Whereas NLP deals primarily with text, computer vision must handle complex visual data such as object attributes, masked contours, and object locations. Achieving universal representation in computer vision therefore requires skillful handling of varied, difficult tasks arranged along two dimensions, as shown in Figure 1.

Figure 1

Spatial Hierarchy: The model has to capture spatial information at different scales, comprehending fine-grained pixel details as well as image-level concepts. To support the complex spatial hierarchy in vision, the model must be able to manage a variety of granularities.

Semantic Granularity: Universal representation in computer vision should also cover a range of semantic granularities. The paradigm moves from coarse, high-level captions to more detailed descriptions, enabling flexible comprehension for diverse uses.
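
To make these two dimensions more concrete, here is a purely illustrative sketch of what annotations spanning both axes could look like for a single image. The field names and values are hypothetical and are not the schema used in the paper; they only visualize how image-level, region-level, and pixel-level information can coexist with captions of increasing detail.

```python
# Hypothetical annotation record for one image, illustrating the two axes:
# semantic granularity (brief vs. detailed text) and spatial hierarchy
# (image-level tags -> region boxes -> pixel-level masks).
single_image_annotations = {
    # Semantic granularity: from a coarse caption to a detailed description.
    "brief_caption": "A dog chasing a ball.",
    "detailed_caption": (
        "A brown dog runs across a grassy park chasing a red ball "
        "while two people watch from a bench in the background."
    ),
    # Spatial hierarchy: image-level concepts down to fine-grained regions.
    "image_level": {"tags": ["dog", "ball", "park"]},
    "region_level": {
        "boxes": [
            {"label": "dog", "bbox": [34, 50, 210, 180]},    # [x1, y1, x2, y2] in pixels
            {"label": "ball", "bbox": [250, 140, 290, 178]},
        ],
        "region_captions": [
            {"bbox": [34, 50, 210, 180], "text": "a brown dog in mid-stride"},
        ],
    },
    "pixel_level": {"masks": "polygon or RLE-encoded segmentation would go here"},
}
```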

This pursuit is characterised by distinctiveness and substantial challenges. A key hurdle is the scarcity of comprehensive visual annotation data, which hinders the development of a foundational model capable of capturing the intricate nuances of spatial hierarchy and semantic granularity. Existing datasets, such as ImageNet, COCO, and Flickr30k Entities, are tailored to specialised applications and extensively labeled by humans. To overcome this constraint, it is imperative to generate extensive annotations for every image at a much larger scale. Another challenge is the absence of a unified pre-training framework that seamlessly integrates spatial hierarchy and semantic granularity in computer vision. With task-specific designs, traditional models perform well in tasks like semantic segmentation, object detection, and image captioning. Nevertheless, it is essential to create a complete, cohesive model that can adapt to different vision tasks in a task-agnostic way, even taking on new tasks with little to no task-specific fine-tuning.

Through unified pre-training and network design, the original Florence model pioneered the integration of spatial, temporal, and multi-modal aspects in computer vision. That first evolutionary iteration excels at transfer learning through task-specific fine-tuning with customized adapters and pre-training on noisy text-image pairs. However, its reliance on large task-specific datasets and adapters leaves gaps when it comes to tackling the two major issues mentioned above. In this work, researchers from Azure AI provide a universal backbone attained through multitask learning with rich visual annotations. This results in a prompt-based, unified representation for diverse vision tasks, which successfully tackles the problems of incomplete comprehensive data and the lack of a uniform architecture.

Large-scale, high-quality annotated data is crucial for multitask learning. Rather than depending on time-consuming human annotation, their data engine creates an extensive visual dataset named FLD-5B, which contains 5.4B annotations for 126M images. The engine has two effective processing modules. The first module departs from the traditional single, manual annotation strategy by using specialized models to annotate images jointly and autonomously. Similar to the wisdom-of-crowds theory, many models collaborate to reach a consensus, leading to a more impartial and reliable image interpretation. Using well-trained foundational models, the second module iteratively refines and filters these automatic annotations.
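
The following is a minimal, hypothetical sketch of that two-module pipeline, not the paper's actual data engine: several specialist annotators label the same image, their outputs are merged by agreement, and a learned filter iteratively removes low-confidence labels. The annotator stand-ins, the consensus rule, and the threshold are all illustrative placeholders.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Annotation = Tuple[str, float]  # (label text, confidence score)


def annotate_jointly(image, annotators: List[Callable]) -> List[Annotation]:
    """Module 1: collect candidate labels from several specialist models and
    keep only those that at least two models agree on (a simple consensus rule)."""
    candidates: List[Annotation] = []
    for annotator in annotators:
        candidates.extend(annotator(image))
    votes = Counter(label for label, _ in candidates)
    agreed: Dict[str, float] = {}
    for label, conf in candidates:
        if votes[label] >= 2:
            agreed[label] = max(conf, agreed.get(label, 0.0))
    return list(agreed.items())


def refine_and_filter(annotations: List[Annotation],
                      scorer: Callable[[str], float],
                      threshold: float = 0.5,
                      rounds: int = 3) -> List[Annotation]:
    """Module 2: iteratively re-score annotations with a learned model and drop
    anything that falls below a confidence threshold."""
    for _ in range(rounds):
        annotations = [(label, scorer(label)) for label, _ in annotations]
        annotations = [(label, conf) for label, conf in annotations if conf >= threshold]
    return annotations


# Toy stand-ins for real specialist annotators and the learned filter model.
detector = lambda img: [("dog", 0.9), ("ball", 0.7)]
captioner = lambda img: [("dog", 0.8), ("grass", 0.4)]
learned_scorer = lambda label: 0.9 if label == "dog" else 0.3

candidates = annotate_jointly(image=None, annotators=[detector, captioner])
print(refine_and_filter(candidates, learned_scorer))  # [('dog', 0.9)]
```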

Leveraging this huge dataset, their model uses a sequence-to-sequence (seq2seq) architecture that integrates an image encoder with a multi-modality encoder-decoder. This architecture supports a variety of vision tasks without requiring task-specific architectural adjustments, in keeping with the NLP community's goal of building flexible models on a uniform foundation. Every annotation in the dataset is consistently standardized into textual outputs. This permits the consistent optimization of a single multitask learning strategy with the same loss function as the objective. The result is a versatile vision foundation model, Florence-2, that can handle a variety of functions, including object detection, captioning, and grounding, all under the control of a single model with a standardized set of parameters. Textual prompts are used to activate tasks, consistent with the methodology employed by large language models (LLMs).
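
Below is a minimal, hypothetical PyTorch sketch of this idea, not the authors' implementation: a toy patch embedding stands in for the image encoder, its visual tokens are concatenated with an embedded text prompt and fed to a small encoder-decoder, and every task's annotation is serialized into text tokens so a single cross-entropy loss covers all tasks. Layer sizes, the prompt and target tokens, and the tiny backbone are placeholders.

```python
import torch
import torch.nn as nn


class UnifiedSeq2SeqVisionModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, patch=16):
        super().__init__()
        # Image encoder: a simple patch embedding stands in for the real backbone.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Multi-modality encoder-decoder over concatenated visual + prompt tokens.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, prompt_ids, target_ids):
        vis = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, N, D)
        prompt = self.token_embed(prompt_ids)                       # (B, P, D)
        encoder_in = torch.cat([vis, prompt], dim=1)                # fused visual + text input
        tgt = self.token_embed(target_ids)                          # (B, T, D)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(encoder_in, tgt, tgt_mask=causal)
        return self.lm_head(hidden)                                 # (B, T, vocab)


# One objective for every task: captions, boxes, masks, etc. are all serialized
# into token sequences, so the loss below never changes per task.
# (Teacher forcing; the usual one-token target shift is omitted for brevity.)
model = UnifiedSeq2SeqVisionModel()
images = torch.randn(2, 3, 224, 224)
prompt_ids = torch.randint(0, 32000, (2, 8))    # tokens of a task prompt, e.g. a captioning or detection instruction
target_ids = torch.randint(0, 32000, (2, 20))   # the annotation, serialized as text tokens
logits = model(images, prompt_ids, target_ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, 32000), target_ids.reshape(-1))
```

Because boxes, masks, and captions are all emitted as token sequences, switching tasks only changes the prompt and the target text, not the architecture or the loss.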

Their method achieves a universal representation and has wide-ranging applicability across visual tasks. Key findings include:

  • The model is a versatile vision foundation model that delivers new state-of-the-art zero-shot performance in tasks including referring expression comprehension on RefCOCO, visual grounding on Flickr30k, and captioning on COCO.
  • Despite its compact size, it competes with larger specialized models after being fine-tuned on publicly available human-annotated data. Most notably, the fine-tuned model sets new state-of-the-art benchmark scores on RefCOCO.
  • The pre-trained backbone outperforms supervised and self-supervised models on downstream tasks such as COCO object detection and instance segmentation and ADE20K semantic segmentation. Used with the Mask R-CNN, DINO, and UperNet frameworks, their model delivers significant gains of 6.9, 5.5, and 5.9 points on the COCO and ADE20K datasets, respectively, and quadruples the training efficiency of models pre-trained on ImageNet.

Check out the Paper. All credit for this research goes to the researchers of this project.

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

