A major evolution for the ChatGPT app! It can see, listen, and speak, and details of the multimodal model were announced at the same time.

This article comes from the WeChat public account "Qubit" (ID: QbitAI), author: Mengchen; it is republished by Jinglianwen Technology with authorization.

OpenAI released two pieces of big news in a row. First, ChatGPT can now see, listen, and speak.

The new version of ChatGPT opens up a more intuitive way of interacting: you can show the AI what you are talking about.

Take a photo, for example, and ask how to adjust the bike seat height.

OpenAI also offered another practical scenario: open the refrigerator, take a photo, ask the AI what to make for dinner, and have it generate a complete recipe.

The update will be rolled out to ChatGPT Plus subscribers and Enterprise users over the next two weeks, and is supported on both iOS and Android.

At the same time, more details of the multimodal GPT-4V model have also been released.

The most surprising part is that the multimodal version had already finished training as early as March 2022...

Seeing this, some netizens asked: how many startups just died in the space of five minutes?

It can see, hear, and speak: a brand-new way of interacting

In the updated ChatGPT mobile app, you can upload photos directly and ask questions about what is in them.

For example, "How to adjust the height of a bicycle seat", ChatGPT will give detailed steps.

It doesn't matter if you are completely unfamiliar with how a bicycle is put together: you can circle a part of the photo and ask ChatGPT, "Is this what you are talking about?"

Just like pointing something out to someone with your hand in the real world.

If you don't know which tool to use, you can even open the toolbox and send ChatGPT a photo of it. Not only will it point out that the required tool is on the left, it can even read the text on the label.
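The article demonstrates all of this through the mobile app's interface. For readers who would rather try the same idea in code, below is a minimal sketch that sends a photo and a question to a vision-capable model through OpenAI's chat completions API. The file name, model name, and prompt wording are placeholder assumptions for illustration, not details from the demo.

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local photo (hypothetical file name) as a base64 data URL.
with open("bike_seat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable chat model available to your account
    messages=[
        {
            "role": "user",
            "content": [
                # The question, plus the photo as an image part of the same message.
                {"type": "text", "text": "How do I lower the seat on this bike?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```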

Users who got access ahead of time have also shared some test results.

It can analyze automated workflow diagrams.

But it did not recognize which movie one of the stills was from.

△Friends who recognize it are welcome to reply in the comments

The voice demo, meanwhile, doubles as a tie-in Easter egg with last week's DALL·E 3 demonstration.

ChatGPT is asked to turn a 5-year-old's imaginary "super sunflower hedgehog" into a complete bedtime story.

△DALL·E 3 demonstration

For an excerpt of the story ChatGPT tells, the details of the multi-turn voice interaction along the way, and samples of the voices themselves, please refer to the video.

Multimodal GPT-4V capabilities revealed

Combining the published video demos with the contents of the GPT-4V System Card, quick-moving netizens have summarized GPT-4V's visual capabilities:

Object detection: GPT-4V can detect and identify common objects in images, such as cars, animals, and household items. Its object recognition capabilities were evaluated on standard image datasets (one way to probe this and the next item through the API is sketched after this list).

Text recognition: The model has optical character recognition (OCR) capabilities that detect printed or handwritten text in images and transcribe it into machine-readable text. This was tested on images of documents, logos, headers, etc.

Face recognition: GPT-4V can locate and identify faces in images. It has the ability to identify gender, age and racial attributes based on facial features. Its facial analysis capabilities are measured on datasets such as FairFace and LFW.

CAPTCHA solving: GPT-4V showed visual reasoning when solving text- and image-based CAPTCHAs, a sign of the model's advanced puzzle-solving abilities.

Geolocation: GPT-4V’s ability to identify cities or geographic locations depicted in landscape images demonstrates that the model incorporates knowledge about the real world, but also represents a risk of privacy breaches.

Complex images: The model struggles to accurately interpret complex scientific diagrams, medical scans, or images with multiple overlapping text components. It misses contextual details.
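As a rough way to probe the first two items on this list (object detection and text recognition) yourself, here is a small sketch against the same chat completions API as above. The image URL, model name, and prompt are assumptions for illustration, not anything taken from the System Card.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical, publicly reachable image URL; swap in your own photo.
IMAGE_URL = "https://example.com/storefront.jpg"

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable chat model available to your account
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "List the objects you can identify in this image, "
                        "then transcribe any printed or handwritten text you can see."
                    ),
                },
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```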

The same summary also lists the current limitations of GPT-4V:

Spatial relationships: The model can have difficulty understanding the precise spatial layout and location of objects in an image, and may not correctly convey the relative positions of objects.

Object overlap: When objects in an image heavily overlap, GPT-4V sometimes cannot distinguish where one object ends and the next object begins. It can mix different objects together.

Background/foreground: The model does not always accurately perceive objects in the foreground and background of an image. It may incorrectly describe object relationships.

Occlusion: When objects in an image are partially blocked or hidden by other objects, GPT-4V may fail to recognize them or may miss their relationship to surrounding objects.

Details: The model often misses or misinterprets very small objects, text, or intricate details in an image, leading to incorrect descriptions of relationships.

Contextual reasoning: GPT-4V lacks strong visual reasoning capabilities to deeply analyze the context of images and describe implicit relationships between objects.

Confidence: The model may describe object relationships incorrectly, in ways inconsistent with the image content.

At the same time, the System Card also emphasized that "the performance is currently unreliable in scientific research and medical use."

In addition, research will continue into whether the model should be allowed to identify public figures, and whether it should be allowed to infer attributes such as gender, race, or emotion from images of people.

Some netizens have already decided that the first thing they will ask after the update is what is in the backpack in Sam Altman’s photo.

So, have you thought about what to ask first?

Origin: blog.csdn.net/weixin_55551028/article/details/133384921