Multimodal AI Evolves as ChatGPT Gains Sight with GPT-4V(ision)

In the continuing effort to make AI more like humans, OpenAI’s GPT models have continually pushed the boundaries. GPT-4 can now accept prompts containing both text and images.

Multimodality in generative AI denotes a model’s capability to produce varied outputs such as text, images, or audio based on the input. These models, trained on specific data, learn underlying patterns to generate similar new data, enriching AI applications.

Recent Strides in Multimodal AI

A notable recent leap in this field is the integration of DALL-E 3 into ChatGPT, a big upgrade in OpenAI’s text-to-image technology. This integration allows for a smoother interaction in which ChatGPT helps craft precise prompts for DALL-E 3, turning user ideas into vivid AI-generated art. So, while users can interact with DALL-E 3 directly, having ChatGPT in the mix makes the process of creating AI art far more user-friendly.

Read more about DALL-E 3 and its integration with ChatGPT here. This collaboration not only showcases the advancement in multimodal AI but also makes AI art creation a breeze for users.


Google Health, on the other hand, introduced Med-PaLM M in June this year. It’s a multimodal generative model adept at encoding and interpreting diverse biomedical data. This was achieved by fine-tuning PaLM-E, a language model, for medical domains using an open-source benchmark, MultiMedBench. This benchmark consists of over 1 million samples across 7 biomedical data types and 14 tasks, such as medical question answering and radiology report generation.

Various industries are adopting modern multimodal AI tools to fuel business expansion, streamline operations, and elevate customer engagement. Progress in voice, video, and text AI capabilities is propelling multimodal AI’s growth.

Enterprises seek multimodal AI applications capable of overhauling business models and processes, opening growth avenues across the generative AI ecosystem, from data tools to emerging AI applications.

After GPT-4’s launch in March, some users observed a decline in its response quality over time, a concern echoed by notable developers and on OpenAI’s forums. Although OpenAI initially dismissed the concern, a later study confirmed the issue. It reported that GPT-4’s accuracy on a prime-number identification task dropped from 97.6% to 2.4% between March and June, indicating a decline in answer quality with subsequent model updates.


ChatGPT (Blue) & Artificial intelligence (Red) Google Search Trend

The hype around OpenAI’s ChatGPT is back. It now comes with a vision feature, GPT-4V, that lets users have GPT-4 analyze images they provide. This is the latest feature to be opened up to users.

Adding image analysis to large language models (LLMs) like GPT-4 is seen by some as a huge step forward in AI research and development. This type of multimodal LLM opens up new possibilities, taking language models beyond text to offer new interfaces and solve new kinds of tasks, creating fresh experiences for users.

The training of GPT-4V was finished in 2022, with early access rolled out in March 2023. The visual feature in GPT-4V is powered by GPT-4 tech, and the training process remained the same. Initially, the model was trained to predict the next word in a text using a large dataset of both text and images from various sources, including the internet.

Later, it was fine-tuned with more data, using a method named reinforcement learning from human feedback (RLHF), to generate outputs that humans preferred.
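The pre-training objective mentioned above, next-word prediction, can be illustrated with a minimal sketch. This is not OpenAI’s implementation: the four-word vocabulary, the scores, and the function names are all hypothetical, and the sketch covers only the prediction loss, not the RLHF stage.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, target_ids):
    """Average negative log-likelihood of the true next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Hypothetical 4-word vocabulary and made-up model scores.
vocab = ["a", "cat", "sat", "down"]
logits = [
    [0.1, 2.0, 0.3, 0.1],  # scores after seeing "a"; the true next word is "cat"
    [0.2, 0.1, 1.5, 0.4],  # scores after seeing "a cat"; the true next word is "sat"
]
targets = [vocab.index("cat"), vocab.index("sat")]
print(round(next_token_loss(logits, targets), 3))
```

Training lowers this loss by raising the probability the model assigns to the word that actually comes next.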

GPT-4 Vision Mechanics

GPT-4’s vision-language capabilities are impressive, but the methods behind them have not been disclosed. The researchers behind MiniGPT-4 hypothesized that these capabilities stem from the use of a more advanced large language model.

To explore this hypothesis, a new vision-language model, MiniGPT-4, was introduced, utilizing an advanced LLM named Vicuna. This model uses a vision encoder with pre-trained components for visual perception, aligning encoded visual features with the Vicuna language model through a single projection layer. The architecture of MiniGPT-4 is simple yet effective, with a focus on aligning visual and language features to improve visual conversation capabilities.


MiniGPT-4’s architecture features a vision encoder with pre-trained ViT and Q-Former, a single linear projection layer, and an advanced Vicuna large language model.
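The role of the single projection layer can be sketched in a few lines. This is a toy illustration, not the MiniGPT-4 code: the sizes, random weights, and function names are assumptions, plain Python lists stand in for the tensors a real model would use, and the vision encoder and language model are replaced by random stand-in vectors.

```python
import random

# Toy sizes, chosen only so the example runs instantly (assumptions).
NUM_VISUAL_TOKENS = 4   # visual tokens produced by the vision encoder
VISION_DIM = 6          # width of each visual feature vector
LLM_DIM = 8             # width of the language model's token embeddings


def init_projection(vision_dim, llm_dim, seed=0):
    """Create weights (vision_dim x llm_dim) and a bias for one linear layer."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(llm_dim)]
              for _ in range(vision_dim)]
    bias = [0.0] * llm_dim
    return weight, bias


def project(visual_tokens, weight, bias):
    """Apply y = xW + b to every visual token."""
    return [
        [sum(x * w_row[j] for x, w_row in zip(token, weight)) + bias[j]
         for j in range(len(bias))]
        for token in visual_tokens
    ]


def build_llm_input(visual_embeds, text_embeds):
    """Prepend the projected visual tokens to the embedded text prompt."""
    return visual_embeds + text_embeds


rng = random.Random(1)
# Stand-in for the output of the pre-trained vision encoder.
visual_tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
                 for _ in range(NUM_VISUAL_TOKENS)]
# Stand-in for three embedded prompt tokens.
text_embeds = [[rng.gauss(0.0, 1.0) for _ in range(LLM_DIM)]
               for _ in range(3)]

weight, bias = init_projection(VISION_DIM, LLM_DIM)
visual_embeds = project(visual_tokens, weight, bias)
llm_input = build_llm_input(visual_embeds, text_embeds)
print(len(visual_embeds), len(visual_embeds[0]), len(llm_input))
```

After projection, each visual token has the same width as a text token embedding, so the language model can consume both in one sequence.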

The use of autoregressive language models in vision-language tasks has also grown, capitalizing on cross-modal transfer to share knowledge between language and multimodal domains.

MiniGPT-4 bridges the visual and language domains by aligning visual information from a pre-trained vision encoder with an advanced LLM. The model utilizes Vicuna as the language decoder and follows a two-stage training approach. Initially, it’s trained on a large dataset of image-text pairs to acquire vision-language knowledge, followed by fine-tuning on a smaller, high-quality dataset to improve generation reliability and usability.

To enhance the naturalness and usability of the language generated by MiniGPT-4, researchers developed a two-stage alignment process, addressing the shortage of adequate vision-language alignment datasets. They curated a specialized dataset for this purpose.

Initially, the model generated detailed descriptions of input images, with the level of detail increased by a conversational prompt aligned with the Vicuna language model’s format. This stage aimed at generating more comprehensive image descriptions.

Initial Image Description Prompt:

For data post-processing, any inconsistencies or errors in the generated descriptions were corrected using ChatGPT, followed by manual verification to ensure high quality.

Second-Stage Fine-tuning Prompt:

This exploration opens a window into the mechanics of multimodal generative AI like GPT-4, shedding light on how vision and language modalities can be effectively integrated to generate coherent and contextually rich outputs.

Exploring GPT-4 Vision

Determining Image Origins with ChatGPT

GPT-4 Vision enhances ChatGPT’s ability to analyze images and pinpoint their geographical origins. This feature shifts user interactions from text alone to a mixture of text and visuals, making it a handy tool for those curious about different places through image data.


Asking ChatGPT where a landmark image was taken

Complex Math Concepts

GPT-4 Vision excels at delving into complex mathematical ideas by analyzing graphical or handwritten expressions. This feature is a useful tool for people trying to solve intricate mathematical problems, making GPT-4 Vision a notable aid in educational and academic fields.


Asking ChatGPT to explain a complex math concept

Converting Handwritten Input to LaTeX Codes

One of GPT-4V’s remarkable abilities is its capability to translate handwritten inputs into LaTeX code. This feature is a boon for researchers, academics, and students who often need to convert handwritten mathematical expressions or other technical information into a digital format. The transformation from handwriting to LaTeX expands the horizon of document digitization and simplifies the technical writing process.

GPT-4V’s ability to convert handwritten input into LaTeX codes
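To show what such a conversion produces, here is a hypothetical example that is not taken from the article’s screenshot: a handwritten quadratic formula would correspond to LaTeX source along these lines.

```latex
% Hypothetical output for a handwritten quadratic formula
\[
  x = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}
\]
```

Source like this can then be pasted directly into a LaTeX document and compiled.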

Extracting Table Details

GPT-4V shows skill in extracting details from tables and answering related questions, a significant asset in data analysis. Users can utilize GPT-4V to sift through tables, gather key insights, and resolve data-driven questions, making it a strong tool for data analysts and other professionals.

GPT-4V deciphering table details and responding to related queries

Comprehending Visual Pointing

The unique ability of GPT-4V to understand visual pointing adds a new dimension to user interaction. By understanding visual cues, GPT-4V can respond to queries with a better contextual understanding.


GPT-4V showcases the distinct ability to understand visual pointing

Building Simple Mock-Up Websites Using a Drawing

Motivated by this tweet, I attempted to create a mock-up for the unite.ai website.

While the final result didn’t quite match my initial vision, here’s the result I achieved.

ChatGPT Vision based output HTML Frontend

Limitations & Flaws of GPT-4V(ision)

To evaluate GPT-4V, the OpenAI team carried out qualitative and quantitative assessments. Qualitative ones included internal tests and external expert reviews, while quantitative ones measured model refusals and accuracy in various scenarios such as identifying harmful content, demographic recognition, privacy concerns, geolocation, cybersecurity, and multimodal jailbreaks.

Still, the model is not perfect.

The paper highlights limitations of GPT-4V, like incorrect inferences and missing text or characters in images. It may hallucinate or invent facts. In particular, it isn’t suited to identifying dangerous substances in images, often misidentifying them.

In medical imaging, GPT-4V can provide inconsistent responses and lacks awareness of standard practices, which can lead to misdiagnoses.

Unreliable performance for medical purposes (Source)

It also fails to understand the nuances of certain hate symbols and may generate inappropriate content based on visual inputs. OpenAI advises against using GPT-4V for critical interpretations, especially in medical or sensitive contexts.

The arrival of GPT-4 Vision (GPT-4V) brings along a bunch of cool possibilities and new hurdles to overcome. Before rolling it out, a lot of effort went into ensuring that risks, especially those relating to images of people, were well examined and reduced. It’s impressive to see how GPT-4V has stepped up, showing a lot of promise in tricky areas like medicine and science.

Now, there are some big questions on the table. For instance, should these models be able to identify famous people from photos? Should they guess a person’s gender, race, or feelings from an image? And should there be special adjustments to help visually impaired individuals? These questions open up a can of worms about privacy, fairness, and how AI should fit into our lives, which is something everyone should have a say in.

