
Will LMMs completely replace large language models? GPT-V, a new milestone in artificial intelligence, is in early public beta in the United States: a first look at medical and OCR use in practice, plus the 166-page GPT-V trial report

ChatGPT Vision, also widely known as GPT-V or GPT-4V, represents a new milestone in artificial intelligence. As a representative LMM (Large Multimodal Model), it inherits the text processing capabilities of an LLM (Large Language Model) and adds image processing, enabling multimodal interaction across text and images. Compared with a traditional LLM, GPT-V is more powerful and flexible, and can understand and generate image-related content in greater depth. This opens up many new applications: from image description and creative design to tasks that combine images and text, GPT-4V has shown strong performance and broad potential.

How to use: GPT-V is currently open to ChatGPT Plus accounts in the United States.

Related links: ChatGPT can now see, hear, and speak

Related introduction: GPTV_System_Card.pdf

166-page GPT-V trial report: Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Performance: Relative to GPT-4, image prompts run about 40% slower than plain-text Chinese prompts. (Note that after switching from GPT-4 to GPT-V, the speed of plain-text Chinese prompts increased by about 200%.)

Cost: USD 20 per month, rate-limited to 25 messages every 3 hours; the API is not yet open.

Version studied: 10-12 ChatGPT Vision (image chat / GPT multimodal feature), ChatGPT September 25 Version

Some of the pictures in this article are from practice (OCR/medicine, etc.), and some are from the Internet (Twitter/Arxiv trial report). The title picture is generated by ChatGPT DALL-E 3.

This article explores LMM, the new AI milestone, through the cases below.

TL;DR summary:

1. ChatGPT Vision is better at creative image understanding than at machine-like recognition of details.

2. ChatGPT Vision uses the emergent capabilities of large models to understand the whole image at a high level, treating the image much like text, rather than doing OCR-style matching. The difference between ChatGPT Vision and OCR is similar to the difference between semantic search and keyword search.

3. ChatGPT Vision has a major hallucination problem, and prompting methods such as few-shot and CoT (chain-of-thought) do not fix it (its emergent ability is still insufficient). This suggests that ChatGPT Vision is at an early stage, much as ChatGPT v3 was. Over the next few years, as parameter counts grow further, large multimodal models may repeat the development path of large text models.

(Figure 1: The few-shot method does not improve the accuracy of GPT-4V in reading dashboard information. Red marks incorrect answers.)

(Figure 2: The CoT method does not reliably or substantially improve the accuracy of GPT-4V in counting apples. Red marks incorrect answers and green marks correct ones.)

Applicable scenarios :

1. Contextual conceptual work, such as automatic review and preliminary screening of article header images, photo-assisted screening of skin diseases, DR/CT photo imaging diagnosis, providing HTML alt text, etc.

2. Creative work with pictures, such as generating product sales copy, product usage suggestions, and creative product names in Section 1.3.
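
For the HTML alt text use case listed above, a minimal sketch of how a model-generated description could be placed into a page is shown here. The function name `img_tag` and the sample description are illustrative assumptions, not part of the original article; since the API is not yet open, the description is assumed to have been copied out of a ChatGPT Vision conversation by hand.

```python
import html


def img_tag(src: str, description: str) -> str:
    """Build an <img> element whose alt text is a model-generated description.

    `description` is assumed to be text copied from a ChatGPT Vision reply.
    """
    # Collapse line breaks and repeated spaces so the alt text is one line.
    alt = " ".join(description.split())
    # Escape both attribute values; quote=True also escapes quote characters,
    # so the description cannot break out of the alt attribute.
    safe_src = html.escape(src, quote=True)
    safe_alt = html.escape(alt, quote=True)
    return f'<img src="{safe_src}" alt="{safe_alt}">'


if __name__ == "__main__":
    # Hypothetical description, used only to show the escaping behaviour.
    print(img_tag("apple.png", 'A "red" apple\non a <wooden> table'))
```

Because the model can hallucinate, the description is best treated as a draft that a person reviews before publishing.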

Unsuitable scenarios:

1. Text work that requires accuracy, for example OCR of patient diagnosis and treatment orders in Section 1.1.

2. Complex pictures with too many subjects, such as reproducing front-end web page code in Section 1.3.

Potential scenarios:

1. Summarizing clear pictures or screenshots, including the complex rational/perceptual analysis of pictures in Section 1.3, such as explaining network heat maps and educational scenes.

2. High-dimensional reasoning with large models, such as identifying movies/attractions, customer service, climate analysis, and suggestions on design drawings/website design in Section 1.3.

1.0 Basic functions

Multi-picture context recognition capability

(Figure 3: The researcher entered multiple shopping bills/menu pictures, and GPT-4V gave the correct answer)

Understand the symbols (such as arrows) and emphasis (such as red circles) in the picture

(Figure 4: GPT-4V can correctly understand and answer pictures marked with arrows/circles)

1.1 OCR function (serious hallucinations, poor results on Chinese)

Chinese: computer screenshot (accurate)

Chinese: clear picture of a Chinese table (serious hallucinations)

Chinese: real patient diagnosis and treatment sheet with complex information (severe hallucinations)

Patient diagnosis and treatment slip (GPT was given the original picture; the pictures below were post-processed to blur the patient information)

English (accurate): images with strong contrast, slight lens distortion, and moderate pixel noise

Arabic (partially accurate): handwritten text translated into English

muin on X: "GPT-4V will be able to transcribe and translate manuscripts and texts. I am excited to try out Arabic manuscripts to see how well it does. It does a phenomenal job with transcription even better than most humans. https://t.co/K6y6WffLvz" / X

1.2 Medical diagnosis (high accuracy)

Lung X-ray: Determining pneumonia (accurate)

Image source: GitHub - linhandev/dataset: List of medical imaging data sets "An Index for Medical Imaging Datasets" https://github.com/linhandev/dataset https://www.kaggle.com/datasets/nih-chest-xrays/data#:~:text=Class%20descriptions,Hernia

Image input

GPT-V discrimination result: ( Pneumonia, 0.7 )

1. Pneumonia:

◦Confidence: 0.7

◦Cause: Increased density in the upper and lower lung areas on X-rays may indicate the presence of pneumonia, but this cannot be clearly determined based on the pictures.

2. Pulmonary edema:

◦Confidence: 0.6

◦Cause: Increased density in the lower lung area may be related to fluid accumulation, such as pulmonary edema due to heart failure.

3. Interstitial lung disease:

◦Confidence: 0.5

◦Cause: There is some evidence of interstitial changes, but comparison with previous imaging is required.

4. Bronchitis:

◦Confidence: 0.4

◦Cause: Chronic bronchitis may cause some of the opacity observed.

5. Tuberculosis:

◦Confidence: 0.3

◦Cause: Although TB can appear opaque and possibly solidified, this image does not strongly demonstrate the typical features of TB.

6. Lung cancer:

◦Confidence: 0.2

◦Cause: Although there are no clear nodules or masses, lung cancer cannot be completely ruled out due to the limitations of X-ray in showing small nodules behind other structures.

Skin diseases: melanoma diagnosis (partially accurate)

Data source: ISIC Archive

https://gallery.isic-archive.com/#!/topWithHeader/onlyHeaderTop/gallery?filter=%5B%5D

Image input

GPT-V discrimination results ( common nevus 0.6, melanoma 0.3 )

1. Melanocytic nevus (common nevus)

◦Confidence score: 0.6

▪Cause : This lesion exhibits similar characteristics to common moles, with a dark central area. It is common among people around the age of 50.

2. Senile keratosis

◦Confidence score: 0.4

▪Cause : Senile keratosis is a common noncancerous skin growth in older adults that may vary in color.

3. Abnormal moles

◦Confidence score: 0.5

▪Cause : This type of mole is different from common moles and is usually larger. Given the size (1.4mm) and location (lower extremity), this diagnosis needs to be considered.

4. Melanoma

◦Confidence score: 0.3

▪Cause : Melanoma is a type of skin cancer, and while the images provided have features that may be consistent with melanoma, a thorough evaluation by a dermatologist is required to confirm the diagnosis.

5. Basal cell carcinoma

◦Confidence score: 0.2

▪Cause : It is the most common form of skin cancer, but the images provided do not have the typical signs of basal cell carcinoma. However, given the age of the patient, it still needs to be considered.

6. Hemangioma

◦Confidence score: 0.2

▪Cause : Hemangioma is a benign tumor composed of blood vessels. The red in the image might suggest this, but from the overall appearance, it's less likely.
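
The six candidates above appear in the order the model listed them, which is not sorted by score. A minimal sketch of ranking them by confidence follows; the names and scores are copied from the list above, while the helper names `rank` and `rank_of` are illustrative assumptions.

```python
# (diagnosis, confidence) pairs in the order GPT-V returned them.
candidates = [
    ("Melanocytic nevus (common nevus)", 0.6),
    ("Senile keratosis", 0.4),
    ("Abnormal moles", 0.5),
    ("Melanoma", 0.3),
    ("Basal cell carcinoma", 0.2),
    ("Hemangioma", 0.2),
]


def rank(cands):
    """Return candidates sorted by confidence, highest first.

    sorted() is stable, so ties keep the order the model gave them.
    """
    return sorted(cands, key=lambda c: c[1], reverse=True)


def rank_of(name, cands):
    """1-based position of `name` after ranking; raises ValueError if absent."""
    names = [n for n, _ in rank(cands)]
    return names.index(name) + 1


if __name__ == "__main__":
    for position, (name, score) in enumerate(rank(candidates), start=1):
        print(f"{position}. {name}: {score}")
    print("Melanoma rank:", rank_of("Melanoma", candidates))
```

After sorting, common nevus is first and melanoma is fourth, which matches the headline result of common nevus 0.6 and melanoma 0.3.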

Age warts (0.7)

Data source: ISIC Archive

Image input

GPT-V discrimination results ( senile warts 0.7 )

1. Age Warts - Confidence: 0.7

2. Melanocytic nevus - Confidence: 0.5

3. Basal cell carcinoma - Confidence: 0.4

4. Actinosis - Confidence: 0.4

5. Squamous cell carcinoma - Confidence: 0.3

The image shows a magnified view of the skin. Age warts are more common in the head and neck area and often appear in middle-aged and older people, so I gave them a higher level of confidence.

Other brain/chest X-ray diagnostics (high accuracy)

(Green represents correct, red represents incorrect, and yellow represents hallucination)

1.3 High-dimensional recognition ability (high quality, although there are hallucination problems)

Dianping/Xiaohongshu Copywriting

Product usage suggestions (left), product sales copy (right)

Car accident vehicle identification

Red represents errors

Understand flowcharts

Explain a joke

Novel product naming

Circuit diagram description

Landmark description

Recipe generation

Analyze ball game situations/athlete postures (with hallucinations)

Abran Maldonado on X: "One of my favorite demos I tested out, in honor of football season, ChatGPT Vision will forever change coaching and sports analytics. Whether i build it or not, ChatGPT for coaching will be on every sideline in the league. Mark my words. https://t.co/uUYhsKpEGh" / X

Hallucination (the player is using a backhand, not a forehand):

Identify movies/cameras/attractions, etc.

Provide users with advice and assistance on product installation/photography, etc.

X: "ChatGPT can now see, hear, and speak. Rolling out over next two weeks, Plus users will be able to have voice conversations with ChatGPT (iOS & Android) and to include images in conversations (all platforms). https://t.co/uNZjgbR5Bm https://t.co/paG0hMshXb" / X

Poker tutoring (with hallucinations)

Screenshot of webpage to HTML code (lack of understanding)

Result:

Whiteboard skeleton to front-end project

Mckay Wrigley on X: "You can give ChatGPT a picture of your team’s whiteboarding session and have it write the code for you. This is absolutely insane. https://t.co/bGWT5bU8MK" / X

https://twitter.com/mckaywrigley/status/1707101465922453701

Complex rational/emotional analysis of clear pictures

Pietro Schirano on X: "This is absolutely wild. I am completely speechless. https://t.co/wGTAx1hFgS" / X

https://twitter.com/skirano/status/1706874309124194707?

Mckay Wrigley on X: "ChatGPT breaks down this diagram of a human cell for a 9th grader. This is the future of education. https://t.co/L0Za0ZB5rs" / X

Complex rational analysis of complex pictures with many subjects

Alex Northstar on X: "Thanks ChatGPT, that can read & understand better than humans! https://t.co/TgVSuHgf8j" / X

https://twitter.com/NorthstarBrain/status/1707668600281063514

1.4 Shortcomings and risks

Injection risks of images (including text invisible to the human eye)

(In the picture, the user added a Sephora cosmetics promotion as a watermark in very light-colored text, which the human eye does not notice but GPT-V reads.)

(In the picture, the user added very light-colored text instructing GPT-4V to recommend hiring the person in this resume.)
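
One way to reason about text that is "invisible to the human eye" is its contrast against the background. The sketch below computes the WCAG 2.x contrast ratio between a text color and a background color and flags near-invisible combinations. It assumes the two colors are already known (for example from a document's text layer); it does not locate text inside a raster image. The 1.5 threshold and the function names are illustrative assumptions, not something the article or OpenAI specifies.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    """Relative luminance of an (R, G, B) color with 0-255 channels."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


# Hypothetical cut-off for "a person will probably not notice this text".
# For comparison, WCAG asks for at least 4.5 for normal body text.
NEAR_INVISIBLE = 1.5


def is_near_invisible(fg, bg, threshold: float = NEAR_INVISIBLE) -> bool:
    """True when the text color barely differs from the background."""
    return contrast_ratio(fg, bg) < threshold


if __name__ == "__main__":
    white = (255, 255, 255)
    print(contrast_ratio((0, 0, 0), white))         # black on white
    print(is_near_invisible((250, 250, 250), white))  # off-white on white
```

A check like this only covers the light-colored-text trick shown in the two pictures; other ways of hiding instructions in an image would need different defenses.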

Cannot be used for face recognition

When GPT-4 was first released in March 2023, the facial recognition capability of GPT-4V raised potential security and privacy concerns, so the release of GPT-4V (GPT-4 with vision capabilities) was postponed.

In early models, users could in theory upload photos of people and ask the model to identify them, an obvious invasion of privacy. According to the technical paper, GPT-4V (which powers ChatGPT Vision) now rejects such requests 98% of the time.

GPT-4V(ision) technical work and authors

Author: JD Health Li Zhuolun

Source: JD Cloud Developer Community. Please indicate the source when reprinting.

Multi-modal GPT-V is born! 36 scene analysis capabilities of ChatGPT Vision, will LMM fully replace large language models? | JD Cloud Technical Team
