A Machine Learning researcher recently shared the release of their latest project, GPT-4V-Act, with the Reddit community. The idea was sparked by a recent discussion of the visual grounding technique known as Set-of-Mark prompting for GPT-4V. Intriguingly, tests demonstrated that GPT-4V with this capability could analyze a user interface screenshot and provide the precise pixel coordinates needed to guide a mouse and keyboard through a given task.
So far, despite only limited testing, the agent has proven capable of composing posts on Reddit, searching for products, and starting the checkout process. Interestingly, it also recognized flaws in the auto-labeler while attempting to play a game and tried to correct its behavior.
GPT-4V-Act is a multimodal AI assistant that pairs GPT-4V(ision) with a web browser. It mimics human operation of a computer, down to low-level mouse and keyboard input and output. The goal is to enable a seamless division of labor between humans and computers, paving the way for technologies that greatly improve the usability of any UI, facilitate workflow automation, and make automated UI testing possible.
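For a concrete sense of what that low-level control looks like, here is a minimal Python sketch using the pyautogui library. It illustrates the general idea only; GPT-4V-Act's actual implementation may differ, and the coordinates below are placeholders.

```python
# Minimal sketch of low-level mouse/keyboard simulation with pyautogui.
# This is not GPT-4V-Act's implementation; coordinates are placeholders.
import pyautogui

def click_at(x: int, y: int) -> None:
    """Move the cursor to (x, y) and issue a left click."""
    pyautogui.moveTo(x, y, duration=0.2)  # glide to the target like a human
    pyautogui.click()

def type_text(text: str) -> None:
    """Type a string into whichever element currently has focus."""
    pyautogui.write(text, interval=0.05)  # brief pause between keystrokes

# Hypothetical usage: click a search box, enter a query, submit.
click_at(640, 120)
type_text("mechanical keyboard")
pyautogui.press("enter")
```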
How It Works
GPT-4V-Act combines GPT-4V(ision) and Set-of-Mark prompting with an auto-labeler. This auto-labeler assigns a numeric ID to every user interface element that can be interacted with.
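To illustrate what such an auto-labeler might do, the sketch below uses the Playwright library to enumerate interactable elements on a page and map a numeric ID to each element's center coordinates. This is a rough approximation under stated assumptions, not GPT-4V-Act's own auto-labeler, and the step of drawing the IDs onto the screenshot (the Set-of-Mark overlay) is omitted for brevity.

```python
# Rough approximation of an auto-labeler using Playwright. It enumerates
# interactable elements and assigns each a numeric ID; GPT-4V-Act uses its
# own auto-labeler, and the ID overlay on the screenshot is omitted here.
from playwright.sync_api import sync_playwright

INTERACTABLE = "a, button, input, textarea, select, [role='button']"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL

    labels = {}  # numeric ID -> center pixel coordinates of the element
    for i, el in enumerate(page.query_selector_all(INTERACTABLE), start=1):
        box = el.bounding_box()  # None if the element is not rendered
        if box:
            labels[i] = (box["x"] + box["width"] / 2,
                         box["y"] + box["height"] / 2)

    page.screenshot(path="labeled_ui.png")  # the screenshot sent to GPT-4V
    print(labels)  # e.g. {1: (660.0, 187.5), 2: ...}
    browser.close()
```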
Given a task description and a screenshot, GPT-4V-Act can infer the steps needed to complete the task. The numeric labels act as references that are resolved to precise pixel coordinates when the input is delivered via mouse or keyboard.
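The sketch below illustrates that resolution step: parsing an action from a model response, looking up the label's pixel coordinates, and dispatching the input. The JSON action schema shown is a hypothetical stand-in for illustration, not GPT-4V-Act's documented format.

```python
# Hedged illustration of resolving a numeric label to pixel coordinates
# and dispatching the input. The JSON action schema is hypothetical, not
# GPT-4V-Act's documented format.
import json
import pyautogui

labels = {3: (640.0, 120.0)}  # label ID -> (x, y), from the auto-labeler

def dispatch(action_json: str, labels: dict) -> None:
    """Execute a single model-proposed action against the live UI."""
    action = json.loads(action_json)
    x, y = labels[action["label"]]  # resolve label to pixel coordinates
    if action["type"] == "click":
        pyautogui.click(x, y)
    elif action["type"] == "type":
        pyautogui.click(x, y)  # focus the element before typing
        pyautogui.write(action["text"], interval=0.05)

# e.g. the model replies with an action targeting label 3:
dispatch('{"type": "type", "label": 3, "text": "laptop stand"}', labels)
```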
Important note
Since GPT-4V(ision) has not been released to the general public, an active ChatGPT Plus subscription is required for multimodal prompting in this project. Note that this project's use of an unofficial GPT-4V API may violate the corresponding condition of the ChatGPT Terms of Service.
The use of language models (LMs) with capabilities like function calling is on the rise. These rely entirely on APIs and textual representations of state. Agents that operate a user interface (UI) can be more useful in general situations where those are impractical. Because such an agent interacts with the computer in the same way a human does, it can be trained from expert demonstrations without requiring extensive specialized knowledge.
Check out the Project Page. All credit for this research goes to the researchers on this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies spanning the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today's evolving world, making everyone's life easier.