Mobile-Agents: Autonomous Multi-modal Mobile Device Agent With Visual Perception

The emergence of Multimodal Large Language Models (MLLMs) has ushered in a new era of mobile device agents capable of understanding and interacting with the world through text, images, and voice. These agents mark a major advancement over traditional AI, providing a richer and more intuitive way for users to interact with their devices. By leveraging MLLMs, these agents can process and synthesize vast amounts of data from various modalities, enabling them to provide personalized assistance and enhance user experiences in ways previously unimaginable.

These agents are powered by state-of-the-art machine learning techniques and advanced natural language processing capabilities, allowing them to grasp and generate human-like text, as well as interpret visual and auditory data with remarkable accuracy. From recognizing objects and scenes in images to understanding spoken commands and analyzing text sentiment, these multimodal agents are equipped to handle a wide range of inputs seamlessly. The potential of this technology is vast, offering more sophisticated and contextually aware services, such as virtual assistants attuned to human emotions and educational tools that adapt to individual learning styles. They also have the potential to revolutionize accessibility, making technology more approachable across language and sensory barriers.

In this article, we will be discussing Mobile-Agent, an autonomous multi-modal device agent that first leverages visual perception tools to accurately identify and locate the visual and textual elements within a mobile application's front-end interface. Using this perceived visual context, the Mobile-Agent framework plans and decomposes complex operation tasks autonomously, and navigates through mobile apps via step-by-step operations. The Mobile-Agent framework differs from existing solutions in that it does not depend on mobile system metadata or the XML files of mobile applications, allowing room for enhanced adaptability across diverse mobile operating environments in a vision-centric way. The approach followed by the Mobile-Agent framework eliminates the requirement for system-specific customizations, leading to enhanced performance and lower computing requirements.

In the fast-paced world of mobile technology, one pioneering concept stands out: Large Language Models, and especially Multimodal Large Language Models (MLLMs), capable of generating a wide range of text, images, videos, and speech across different languages. The rapid development of MLLM frameworks in the past few years has given rise to a new and powerful application of MLLMs: autonomous mobile agents. Autonomous mobile agents are software entities that act, move, and function independently, without direct human commands, and are designed to traverse networks or devices to perform tasks, collect information, or solve problems.

Mobile agents are designed to operate the user's mobile device on the basis of user instructions and the screen's visuals, a task that requires the agents to possess both semantic understanding and visual perception capabilities. However, existing mobile agents are far from perfect: they are built on multimodal large language models, and even current state-of-the-art MLLM frameworks, including GPT-4V, lack the visual perception abilities required to serve as an effective mobile agent. Moreover, although existing frameworks can generate effective operations, they struggle to accurately locate the positions of those operations on the screen, limiting the ability of mobile agents to operate on mobile devices.

To tackle this issue, some frameworks opted to leverage user interface layout files to assist GPT-4V or other MLLMs with localization, with some frameworks managing to extract actionable positions on the screen by accessing the XML files of the application, whereas other frameworks opted to use the HTML code of web applications. As can be seen, a majority of these frameworks depend on access to underlying, native application files, rendering the approach almost ineffective when the framework cannot access those files. To address this issue and eliminate the dependency of localization methods on underlying files, developers have worked on Mobile-Agent, an autonomous mobile agent with impressive visual perception capabilities. Using its visual perception module, the Mobile-Agent framework uses screenshots from the mobile device to locate operations accurately. The visual perception module houses OCR and detection models that are responsible for identifying text within the screen and describing the content within a specific region of the mobile screen. The Mobile-Agent framework employs carefully crafted prompts and facilitates efficient interaction between the tools and the agent, thus automating mobile device operations.

Moreover, the Mobile-Agent framework aims to leverage the contextual capabilities of cutting-edge MLLM frameworks like GPT-4V to achieve self-planning capabilities that enable the model to plan tasks holistically based on the operation history, user instructions, and screenshots. To further enhance the agent's ability to identify incomplete instructions and flawed operations, the Mobile-Agent framework introduces a self-reflection method. Under the guidance of carefully crafted prompts, the agent consistently reflects on incorrect and invalid operations, and halts once the task or instruction has been completed.

Overall, the contributions of the Mobile-Agent framework can be summarized as follows:

  1. Mobile-Agent acts as an autonomous mobile device agent, utilizing visual perception tools to perform operation localization. It methodically plans each step and engages in introspection. Notably, Mobile-Agent relies exclusively on device screenshots, without the use of any system code, showcasing a solution that is purely vision-based.
  2. Mobile-Agent introduces Mobile-Eval, a benchmark designed to evaluate mobile device agents. This benchmark includes the 10 most commonly used mobile apps, together with instructions for these apps, categorized into three levels of difficulty.

Mobile-Agent: Architecture and Methodology

At its core, the Mobile-Agent framework consists of a cutting-edge Multimodal Large Language Model, GPT-4V, and a text detection module used for text localization tasks. Along with GPT-4V, Mobile-Agent also employs an icon detection module for icon localization.

Visual Perception

As mentioned earlier, the GPT-4V MLLM delivers satisfactory results when given instructions and screenshots, but it fails to effectively output the locations where the operations should take place. Owing to this limitation, the Mobile-Agent framework, built on the GPT-4V model, must depend on external tools to assist with operation localization, thus facilitating the execution of operations on the mobile screen.

Text Localization

The Mobile-Agent framework implements an OCR tool to detect the position of the corresponding text on the screen whenever the agent needs to tap on specific text displayed on the mobile screen. There are three distinct text localization scenarios.

Scenario 1: No Specified Text Detected

Issue: The OCR fails to detect the specified text, which may occur in complex images or due to OCR limitations.

Response: Instruct the agent to either:

  • Reselect the text for tapping, allowing for a manual correction of the OCR’s oversight, or
  • Select an alternative operation, such as using a different input method or performing another action relevant to the task at hand.

Reasoning: This flexibility is necessary to manage the occasional inaccuracies or hallucinations of GPT-4V, ensuring the agent can still proceed effectively.

Scenario 2: Single Instance of Specified Text Detected

Operation: Automatically generate an action to tap on the center coordinates of the detected text box.

Justification: With only one instance detected, the likelihood of correct identification is high, making it efficient to proceed with a direct action.

Scenario 3: Multiple Instances of Specified Text Detected

Assessment: First, evaluate the number of detected instances:

Many Instances: Indicates a screen cluttered with similar content, complicating the selection process.

Action: Request the agent to reselect the text, aiming to refine the selection or adjust the search parameters.

Few Instances: A manageable number of detections allows for a more nuanced approach.

Action: Crop the regions around these instances, expanding the text detection boxes outward to capture additional context. This expansion ensures that more information is preserved, aiding in decision-making.

Next Step: Draw detection boxes on the cropped images and present them to the agent. This visual aid helps the agent decide which instance to interact with, based on contextual clues or task requirements.

This structured approach optimizes the interaction between OCR results and agent operations, enhancing the system's reliability and adaptability in handling text-based tasks across various scenarios. The complete process is demonstrated in the following image, and the decision logic is sketched in code below.
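As a reading aid, here is a minimal Python sketch of the three-scenario decision logic described above. It is not the framework's actual code: the callables run_ocr, ask_agent, and crop_with_margin are hypothetical stand-ins for the OCR tool and the MLLM call, and MANY_THRESHOLD is an illustrative cutoff rather than a value taken from the framework.

```python
# Sketch of the three text-localization scenarios. All callables are supplied
# by the caller and are hypothetical stand-ins for the real tools.
MANY_THRESHOLD = 5  # illustrative cutoff between "few" and "many" matches

def locate_text(screenshot, target_text, run_ocr, ask_agent, crop_with_margin):
    boxes = run_ocr(screenshot, target_text)  # bounding boxes matching the specified text

    if len(boxes) == 0:
        # Scenario 1: nothing detected -- let the agent reselect the text or pick another operation.
        return ask_agent("Text not found; reselect the text or choose a different operation.")

    if len(boxes) == 1:
        # Scenario 2: a single match -- tap the center of the detected box.
        x1, y1, x2, y2 = boxes[0]
        return ("tap", ((x1 + x2) / 2, (y1 + y2) / 2))

    if len(boxes) > MANY_THRESHOLD:
        # Scenario 3a: too many matches -- ask the agent to reselect or refine the search.
        return ask_agent("Too many matches on screen; please reselect the text.")

    # Scenario 3b: a few matches -- crop expanded regions around each box, draw the
    # detection boxes on the crops, and let the agent pick the instance to tap.
    crops = [crop_with_margin(screenshot, box) for box in boxes]
    choice = ask_agent("Which of these regions should be tapped?", images=crops)
    x1, y1, x2, y2 = boxes[choice]
    return ("tap", ((x1 + x2) / 2, (y1 + y2) / 2))
```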

Icon Localization

The Mobile-Agent framework implements an icon detection tool to locate the position of an icon whenever the agent needs to tap on it on the mobile screen. To be more specific, the framework first requests the agent to provide specific attributes of the icon, including shape and color, and then applies the Grounding DINO method with the prompt "icon" to detect all the icons contained within the screenshot. Finally, Mobile-Agent employs CLIP to calculate the similarity between the description of the click region and each of the detected icons, and selects the region with the highest similarity for a tap.
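This flow can be summarized in a short sketch. It is not the framework's actual code: detect_icons (standing in for a Grounding DINO wrapper prompted with "icon"), crop, and clip_score (standing in for a CLIP text-image similarity call) are hypothetical callables supplied by the caller.

```python
def locate_icon(screenshot, icon_description, detect_icons, crop, clip_score):
    # Detect every icon candidate on the screen (Grounding DINO with the prompt "icon"
    # in the actual framework).
    boxes = detect_icons(screenshot, prompt="icon")
    # Score each candidate crop against the agent's textual description (CLIP similarity).
    crops = [crop(screenshot, box) for box in boxes]
    scores = [clip_score(c, icon_description) for c in crops]
    # Tap the center of the highest-similarity candidate.
    best = max(range(len(scores)), key=scores.__getitem__)
    x1, y1, x2, y2 = boxes[best]
    return ("tap", ((x1 + x2) / 2, (y1 + y2) / 2))
```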

Instruction Execution

To translate the agent's actions into operations on the screen, the Mobile-Agent framework defines eight different operations, listed below; a sketch of how such operations might be parsed follows the list.

  • Launch Application (App Name): Initiate the designated application from the desktop interface.
  • Tap on Text (Text Label): Interact with the screen portion displaying the label “Text Label”.
  • Interact with Icon (Icon Description, Location): Target and tap the specified icon area, where “Icon Description” details attributes like the color and shape of the icon. Select “Location” from options such as top, bottom, left, right, or center, possibly combining two for precise navigation and to reduce mistakes.
  • Enter Text (Input Text): Input the given “Input Text” into the active text field.
  • Scroll Up & Down: Navigate upwards or downwards through the content of the current page.
  • Go Back: Revert to the previously viewed page.
  • Close: Navigate back to the desktop directly from the current screen.
  • Halt: Conclude the operation once the task is achieved.
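As referenced above, the following sketch shows one way the agent's textual operations could be mapped onto structured actions. The exact output format the agent emits is not specified in the source, so the operation strings and regular expressions here are illustrative assumptions.

```python
import re

# Hypothetical mapping from operation text to an action name; the format mirrors
# the list above but is an assumption, not the framework's defined syntax.
PATTERNS = {
    "open_app":  re.compile(r"Launch Application \((?P<app>.+)\)"),
    "tap_text":  re.compile(r"Tap on Text \((?P<text>.+)\)"),
    "tap_icon":  re.compile(r"Interact with Icon \((?P<desc>.+), (?P<loc>.+)\)"),
    "type_text": re.compile(r"Enter Text \((?P<input>.+)\)"),
    "scroll":    re.compile(r"Scroll (?P<direction>Up|Down)"),
    "back":      re.compile(r"Go Back"),
    "close":     re.compile(r"Close"),
    "halt":      re.compile(r"Halt"),
}

def parse_operation(raw: str):
    """Map the agent's raw textual operation onto (action, arguments), or None if unrecognized."""
    for action, pattern in PATTERNS.items():
        match = pattern.fullmatch(raw.strip())
        if match:
            return action, match.groupdict()
    return None

# Example: parse_operation("Tap on Text (Settings)") -> ("tap_text", {"text": "Settings"})
```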

Self-Planning

Every step of the operation is executed iteratively by the framework. Before the iterations begin, the user is required to provide an input instruction, and the Mobile-Agent model uses this instruction to generate a system prompt for the entire process. Before the start of each iteration, the framework captures a screenshot of the mobile screen and feeds it to the agent. The agent then observes the screenshot, the operation history, and the system prompt to output the next step of the operation.
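A minimal sketch of this iterative loop is shown below, assuming hypothetical callables capture_screenshot, query_mllm (a wrapper around the GPT-4V call), and execute (the operation executor); the prompt text and step limit are illustrative.

```python
def run_agent(instruction, capture_screenshot, query_mllm, execute, max_steps=20):
    # System prompt generated once from the user instruction, before the iterations begin.
    system_prompt = f"You are a mobile device agent. User instruction: {instruction}"
    history = []                              # operation history across iterations

    for _ in range(max_steps):
        screenshot = capture_screenshot()     # fresh screenshot before each iteration
        operation = query_mllm(system_prompt, history, screenshot)
        if operation.strip() == "Halt":       # the agent decides the task is complete
            break
        execute(operation)                    # carry the operation out on the device
        history.append(operation)             # keep the history for the next iteration

    return history
```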

Self-Reflection

During its operations, the agent might face errors that prevent it from successfully executing a command. To improve the instruction success rate, a self-reflection approach has been implemented, activating under two specific circumstances. First, if the agent executes a flawed or invalid action that halts progress, such as when it recognizes that the screenshot remains unchanged after an operation or displays an incorrect page, it will be directed to consider alternative actions or adjust the current operation's parameters. Second, the agent might miss some elements of a complex directive. Once the agent has executed a series of actions based on its initial plan, it will be prompted to review its action sequence, the latest screenshot, and the user's directive to assess whether the task has been completed. If discrepancies are found, the agent is tasked with autonomously generating new actions to fulfill the directive.
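The two reflection triggers could look roughly like the following sketch, where screenshots_equal and query_mllm are hypothetical helpers and the prompt wording is illustrative rather than taken from the framework.

```python
def reflect_after_step(before, after, operation, screenshots_equal, query_mllm):
    # Trigger 1: the screen did not change, so the operation was likely invalid;
    # ask the agent for an alternative action or adjusted parameters.
    if screenshots_equal(before, after):
        return query_mllm(
            f"The operation '{operation}' did not change the screen. "
            "Choose a different action or adjust the current operation's parameters.",
            after,
        )
    return None

def reflect_after_plan(instruction, history, final_screenshot, query_mllm):
    # Trigger 2: after the planned actions finish, ask the agent whether the
    # instruction is fully satisfied; if not, it should plan additional actions.
    return query_mllm(
        f"Instruction: {instruction}\nActions taken so far: {history}\n"
        "Based on the latest screenshot, is the task complete? "
        "If not, list the remaining actions.",
        final_screenshot,
    )
```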

Mobile-Agent: Experiments and Results

To evaluate its abilities comprehensively, the Mobile-Agent framework introduces the Mobile-Eval benchmark, consisting of 10 commonly used applications, and designs three instructions for each application. The first instruction is straightforward and only covers basic application operations, whereas the second is somewhat more complex than the first, as it adds some additional requirements. Finally, the third instruction is the most complex of all, as it contains an abstract user instruction in which the user does not explicitly specify which app to use or what operation to perform.

To assess performance from different perspectives, the Mobile-Agent framework designs and implements four different metrics, defined below; a short sketch computing them follows the list.

  • SU or Success: If the mobile agent completes the instruction, it is considered a success.
  • Process Score or PS: The Process Score metric measures the accuracy of each step during the execution of the user instruction, and is calculated by dividing the number of correct steps by the total number of steps.
  • Relative Efficiency or RE: The Relative Efficiency score compares the number of steps it takes a human to perform the instruction manually with the number of steps it takes the agent to execute the same instruction.
  • Completion Rate or CR: The Completion Rate metric divides the number of human-operated steps that the framework completes successfully by the total number of steps taken by a human to complete the instruction. The value of CR is 1 when the agent completes the instruction successfully.
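As mentioned before the list, here is an illustrative computation of these metrics, under the assumption that a run is summarized by simple step counts; the exact bookkeeping used by the benchmark may differ.

```python
def mobile_eval_metrics(success, correct_steps, agent_steps, human_steps, completed_human_steps):
    return {
        "SU": bool(success),                        # Success: instruction fully completed
        "PS": correct_steps / agent_steps,          # Process Score: correct steps / total agent steps
        "RE": human_steps / agent_steps,            # Relative Efficiency: human steps vs. agent steps
        "CR": completed_human_steps / human_steps,  # Completion Rate: 1.0 when fully completed
    }

# Example: a task a human completes in 5 steps, where the agent took 6 steps,
# 5 of them correct, and covered 4 of the human-equivalent steps.
print(mobile_eval_metrics(success=False, correct_steps=5, agent_steps=6,
                          human_steps=5, completed_human_steps=4))
```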

The results are shown in the following figure.

First, across the three given tasks, the Mobile-Agent attained completion rates of 91%, 82%, and 82%, respectively. While not all tasks were executed flawlessly, the achievement rate for each category of task surpassed 90%. Moreover, the PS metric reveals that the Mobile-Agent consistently demonstrates a high likelihood of executing accurate actions across the three tasks, with success rates around 80%. Furthermore, according to the RE metric, the Mobile-Agent exhibits 80% efficiency in performing operations at a level comparable to human optimality. These outcomes collectively underscore the Mobile-Agent's proficiency as a mobile device assistant.

The following figure illustrates the Mobile-Agent's capability to understand user commands and independently orchestrate its actions. Even in the absence of explicit operation details in the instructions, the Mobile-Agent adeptly interpreted the user's needs, converting them into actionable tasks. Following this understanding, the agent executed the instructions via a systematic planning process.

Final Thoughts

In this article, we have discussed Mobile-Agent, a multi-modal autonomous device agent that first utilizes visual perception technologies to precisely detect and pinpoint both visual and textual components within the interface of a mobile application. With this visual context in mind, the Mobile-Agent framework autonomously outlines and breaks down intricate tasks into manageable actions, navigating through mobile applications step by step. This framework stands out from existing methodologies as it does not rely on the mobile system's metadata or the mobile apps' XML files, thereby facilitating greater flexibility across various mobile operating systems with a focus on visual-centric processing. The approach employed by the Mobile-Agent framework obviates the need for system-specific adaptations, resulting in improved efficiency and reduced computational demands.
