Will LMM completely replace large language models? GPT-V, a new milestone in artificial intelligence, is in pre-public beta testing in the United States, and the first interpretation of the medical field/OCR practice + 166-page GPT-V trial report
ChatGPT Vision , also widely known as GPT-V or GPT-4V, represents a new milestone in artificial intelligence technology. As a representative of LMM (Large Multimodal Model) , it not only inherits the text processing capabilities of LLM (Large Language Model), but also adds image processing functions to realize multi-modal interaction of text and images. Compared with traditional LLM, GPT-V is more powerful and flexible, capable of deeper understanding and generation of image-related content. This evolution has opened up countless new application possibilities. From image description and creative design to complex image and text combination tasks, GPT-4V has demonstrated excellent performance and broad potential.
How to use : GPT-V is currently open to ChatGPT Plus accounts in the United States.
Related links : ChatGPT can now see, hear, and speak
Related introduction : GPTV_System_Card.pdf
166-page GPT-V trial report : Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
Performance : For ChatGPT4, the speed is about 40% slower than plain text Chinese Prompt. (Note that after switching from GPT4 to GPT-V, the speed of plain text Chinese Prompt increased by about 200%)
Cost : $20 USD per month, speed limit 3 hours and 25 messages, API not yet open
Research version : 10-12 ChatGPT Vision (picture chat/GPT multi-modal function) ChatGPT September 25 Version
Some of the pictures in this article are from practice (OCR/medicine, etc.), and some are from the Internet (Twitter/Arxiv trial report). The title picture is generated by ChatGPT DALL-E 3.
This article attempts to explore the new AI milestone LMM from the following cases :
TLDR summary :
(Figure 1: The few-shot method cannot improve the accuracy of GPT-4V in obtaining dashboard information, and red represents incorrect answers)
(Figure 2: The CoT method cannot stably and qualitatively improve the accuracy of GPT-4V in counting apples. Red represents incorrect answers and green represents correct)
Applicable scenarios :
Not applicable scenarios :
Potential scenarios :
1.0 Basic functions
Multi-picture context recognition capability
(Figure 3: The researcher entered multiple shopping bills/menu pictures, and GPT-4V gave the correct answer)
Understand the symbols (such as arrows) and emphasis (such as red circles) in the picture
(Figure 4: GPT-4V can correctly understand and answer pictures marked with arrows/circles)
1.1 OCR function (serious hallucinations, poor Chinese effect)
Chinese : Computer screenshot (accurate)
Chinese : Clear Chinese table pictures (serious hallucinations)
Chinese: Real patient diagnosis and treatment sheet with complex information (severe hallucinations)
Patient diagnosis and treatment slip (the original picture is given to GPT, the following pictures have been processed twice to blur the patient information)
English: Accurate: strong contrast, slight lens distortion, moderate pixel noise images
Arabic: Partially accurate: handwritten translation into English
muin on X: "GPT-4V will be able to transcribe and translate manuscripts and texts. I am excited to try out Arabic manuscripts to see how well it does. It does a phenomenal job with transcription even better than most humans. https://t.co/K6y6WffLvz" / X
1.2 Medical diagnosis (high accuracy)
Lung X-ray: Determining pneumonia (accurate)
Image source: GitHub - linhandev/dataset: List of medical imaging data sets "An Index for Medical Imaging Datasets" https://github.com/linhandev/dataset https://www.kaggle.com/datasets/nih-chest-xrays /data#:~:text=Class%20descriptions,Hernia
Image input
GPT-V discrimination result: ( Pneumonia, 0.7 )
1.Pneumonia :
◦Confidence: 0.7
◦Cause: Increased density in the upper and lower lung areas on X-rays may indicate the presence of pneumonia, but this cannot be clearly determined based on the pictures.
2. Pulmonary edema:
◦Confidence: 0.6
◦Cause: Increased density in the lower lung area may be related to fluid accumulation, such as pulmonary edema due to heart failure.
3. Interstitial lung disease:
◦Confidence: 0.5
◦Reason: There is some evidence of stromal changes, but comparison with previous imaging is required.
4. Bronchitis:
◦Confidence: 0.4
◦Cause: Chronic bronchitis may cause some of the opacity observed.
5. Tuberculosis:
◦Confidence: 0.3
◦Cause: Although TB can appear opaque and possibly solidified, this image does not strongly demonstrate the typical features of TB.
6. Lung cancer:
◦Confidence: 0.2
◦Cause: Although there are no clear nodules or masses, lung cancer cannot be completely ruled out due to the limitations of X-ray in showing small nodules behind other structures.
Skin diseases: melanoma diagnosis (partially accurate)
Data source: ISIC Archive
https://gallery.isic-archive.com/#!/topWithHeader/onlyHeaderTop/gallery?filter=%5B%5D
Image input
GPT-V discrimination results ( common nevus 0.6, melanoma 0.3 )
Age warts (0.7)
Data source: ISIC Archive
Image input
GPT-V discrimination results ( senile warts 0.7 )
The image shows a magnified view of the skin. Age warts are more common in the head and neck area and often appear in middle-aged and older people, so I gave them a higher level of confidence.
Other brain/chest X-ray diagnostics (high accuracy)
(Green represents correct, red represents incorrect, and yellow represents hallucination)
1.3 High-latitude recognition ability (high quality, although there are hallucination problems)
Dianping/Xiaohongshu Copywriting
Product usage suggestions (left), product sales copy (right)
Car accident vehicle identification
Red represents errors
Understand flowcharts
Explain a joke
Novel product naming
Circuit diagram description
Landmark description
Recipe generation
Analyze ball game situations/athlete postures (with hallucinations)
Abran Maldonado on X: "One of my favorite demos I tested out, in honor of football season, ChatGPT Vision will forever change coaching and sports analytics. Whether i build it or not, ChatGPT for coaching will be on every sideline in the league. Mark my words. https://t.co/uUYhsKpEGh" / X
Illusion (the player is using a backhand instead of a forehand):
Identify movies/cameras/attractions, etc.
Provide users with advice and assistance on product installation/photography, etc.
X: "ChatGPT can now see, hear, and speak. Rolling out over next two weeks, Plus users will be able to have voice conversations with ChatGPT (iOS & Android) and to include images in conversations (all platforms). https://t.co/uNZjgbR5Bm https://t.co/paG0hMshXb" / X
Tutoring Poker (Existence of Illusions)
Screenshot of webpage to HTML code (lack of understanding)
result:
Whiteboard skeleton to front-end project
Mckay Wrigley on X: "You can give ChatGPT a picture of your team’s whiteboarding session and have it write the code for you. This is absolutely insane. https://t.co/bGWT5bU8MK" / X
https://twitter.com/mckaywrigley/status/1707101465922453701
Complex rational/emotional analysis of clear pictures
Pietro Schirano on X: "This is absolutely wild. I am completely speechless. https://t.co/wGTAx1hFgS" / X
https://twitter.com/skirano/status/1706874309124194707?
Mckay Wrigley on X: "ChatGPT breaks down this diagram of a human cell for a 9th grader. This is the future of education. https://t.co/L0Za0ZB5rs" / X
Complex rational analysis of complex pictures with many subjects
Alex Northstar on X: "Thanks ChatGPT, that can read & understand better than humans! https://t.co/TgVSuHgf8j" / X
https://twitter.com/NorthstarBrain/status/1707668600281063514
1.4 Shortcomings and risks
Injection risks of images (including text invisible to the human eye)
(In the picture, the user added the Sephora cosmetics promotion watermark in invisible light-colored words, which is imperceptible to the human eye but perceptible to GPT-V)
(In the picture, the user uses invisible light-colored words to tell GPT-4V to definitely give employment suggestions for this resume)
Cannot be used for face recognition
When GPT-4 was first released in March 2023, the GPT-4V facial recognition function may have security and privacy issues, so the release of GPT-4V (GPT-4 with vision capabilities) was postponed.
In early models, users could have theoretically uploaded photos of people and asked to identify them, an obvious invasion of privacy. According to the technical paper, GPT-4V (which powers ChatGPT Vision) now rejects such requests 98% of the time.
GPT-4V(ision) technical work and authors
Lei Jun: The official version of Xiaomi’s new operating system ThePaper OS has been packaged. A pop-up window on the Gome App lottery page insults its founder. The U.S. government restricts the export of NVIDIA H800 GPU to China. The Xiaomi ThePaper OS interface is exposed. A master used Scratch to rub the RISC-V simulator and it ran successfully. Linux kernel RustDesk remote desktop 1.2.3 released, enhanced Wayland support After unplugging the Logitech USB receiver, the Linux kernel crashed DHH sharp review of "packaging tools": the front end does not need to be built at all (No Build) JetBrains launches Writerside to create technical documentation Tools for Node.js 21 officially releasedAuthor: JD Health Li Zhuolun
Source: JD Cloud Developer Community Please indicate the source when reprinting