Exploring high-performance computing and multimodal processing: NVIDIA GH200 performance optimization and the computing power behind GPT-4V

★Multimodal large models; GPU computing power; LLM; LMM; GPT-4V; GH200; image recognition; object localization; image description; visual question answering; visual dialogue; NVIDIA; H100; L40S; A100; A800; H800; AI computing power; AI algorithms

With the continuous development of artificial intelligence technology, multimodal large models have become an increasingly important trend. Multimodal large models extend language models by integrating additional perceptual abilities such as vision, moving toward more powerful general artificial intelligence. GPT-4V (GPT-4 with vision, GPT-4's recently opened visual modality) is a Large Multimodal Model (LMM) that extends Large Language Models (LLMs) with multi-sensory skills such as visual understanding. This article offers an in-depth analysis of GPT-4V to deepen the understanding of LMMs. The core of the analysis is the range of tasks GPT-4V can perform, together with the test samples used to probe the quality and generality of its capabilities.

The research results show that GPT-4V has unprecedented capabilities in processing interleaved multimodal inputs, and its versatility makes it a powerful multimodal integrated intelligent system. A distinctive ability of GPT-4V is its understanding of visual markers drawn on the input image, which enables new human-computer interaction methods such as visual referring prompts. This article discusses the preliminary exploration of GPT-4V, the impact of multimodality on computing power, the strength of NVIDIA's most powerful AI chip, the GH200, and the Blue Ocean Brain large model training platform.

Preliminary exploration of GPT-4V

This article uses a qualitative case-design method to comprehensively explore GPT-4V. The evaluation focuses on case studies rather than traditional quantitative evaluation, aiming to inspire subsequent research to establish evaluation benchmarks for large multimodal models. Considering that different interaction modes may affect model performance, zero-shot prompts are mainly used to reduce dependence on in-context examples and thus better evaluate GPT-4V's ability to independently process complex multimodal inputs.

1. Input modes of GPT-4V

GPT-4V supports three input modes: text-only input, a single image-text pair, and interleaved image-text input. As a pure text model, GPT-4V shows powerful language processing capabilities: for text input it needs only plain text input and output to handle a variety of language and coding tasks. The second mode accepts a single image-text pair and can complete various vision and vision-language tasks, such as image recognition, object localization, image description, visual question answering, visual dialogue, and dense captioning. In addition, GPT-4V supports interleaved image-text input. This flexible input method suits a wider range of application scenarios, such as computing the total tax across multiple receipt images, extracting queried information from multiple images, and associating information across interleaved images and text. Handling interleaved input is also the basis for few-shot learning and other advanced prompting techniques, further enhancing GPT-4V's applicability.

GPT-4V supports the use of multi-image and interleaved image-text input
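As a hedged illustration (not from the report), the sketch below shows how an interleaved image-text request of this kind can be expressed with the OpenAI Python client, where a message's content is a list that mixes text and image entries. The receipt file names, the question, and the model identifier are placeholder assumptions.

```python
import base64

from openai import OpenAI  # pip install openai


def to_data_url(path: str) -> str:
    """Read a local JPEG and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model identifier
    messages=[{
        "role": "user",
        # Interleaved content: text, image, text, image, then the actual question.
        "content": [
            {"type": "text", "text": "Here is the first receipt:"},
            {"type": "image_url", "image_url": {"url": to_data_url("receipt_1.jpg")}},
            {"type": "text", "text": "And here is the second receipt:"},
            {"type": "image_url", "image_url": {"url": to_data_url("receipt_2.jpg")}},
            {"type": "text", "text": "How much tax did I pay in total across the two receipts?"},
        ],
    }],
)
print(response.choices[0].message.content)
```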

2. Working modes and prompting techniques of GPT-4V

GPT-4V can understand and follow textual instructions to produce the desired text output or learn to complete a new task. Red indicates less informative answers.

The unique advantage of GPT-4V lies in its powerful ability to understand and follow natural language instructions. Instructions can specify, in natural language, the output text format required for various vision-language tasks. In addition, GPT-4V is able to complete challenging tasks, such as abstract reasoning problems involving intermediate steps, by understanding complex instructions. GPT-4V therefore has great potential to adapt to unseen applications and tasks.

1. Visual pointing and visual referring prompts

Pointing is a basic aspect of human interaction. To provide a comparable interaction channel, various forms of "pointing" are explored to indicate spatial regions of interest in an image (such as numerical coordinate boxes, arrows, boxes, circles, and hand drawings). Given the flexibility of drawing on images, a new prompting method is proposed, namely "visual referring prompting", which specifies the target by editing the pixels of the input image (for example, drawing a visual pointer or handwriting scene text). Unlike traditional text prompts, visual referring prompts accomplish tasks through image pixel editing. For example, the model can generate a focused description of a pointed-at object while maintaining an understanding of the overall scene, associate a pointed object with an index written in scene text, or answer questions about the indicated region, including tricky ones.
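To make the idea concrete, here is a minimal sketch (not from the report) of preparing a visual referring prompt with Pillow: a pointer is drawn directly into the pixels of the image, which is then encoded and sent to the model together with a short question, exactly as in the interleaved example above. The file name, coordinates, and overlay text are hypothetical.

```python
import base64
from io import BytesIO

from PIL import Image, ImageDraw  # pip install pillow

# Load the input image and draw a red ellipse around the region of interest.
# This pixel-level edit itself is the visual referring prompt.
image = Image.open("scene.jpg").convert("RGB")                  # placeholder file name
draw = ImageDraw.Draw(image)
draw.ellipse([(420, 310), (560, 380)], outline="red", width=6)  # hypothetical coordinates
draw.text((430, 390), "what is this?", fill="red")              # optional handwritten-style scene text

# Encode the edited image; it can then be attached to a chat request with a
# question such as "Describe the object circled in red."
buffer = BytesIO()
image.save(buffer, format="JPEG")
edited_image_b64 = base64.b64encode(buffer.getvalue()).decode()
```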

2. Visual + text prompts

Visual referring prompts can be combined with other image and text prompts to present a clean and nuanced interface. GPT-4V exhibits strong prompt flexibility, especially in integrating different input formats and seamlessly mixing instructions. It has strong generalization and flexibility, can understand multimodal instructions much as humans do, and can adapt to unseen tasks.

At the same time, GPT-4V can handle multimodal instructions (including images, sub-images, text, scene text, and visual pointers), which makes it more scalable and versatile. In addition, GPT-4V can associate abstract language instructions with visual examples as multimodal demonstrations, which is closer to human learning styles than text-only instructions or in-context few-shot learning.

 

Constrained prompting: the model is asked to return its answer in JSON format. The images are sample inputs. Red highlights incorrect answers.
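As a hedged sketch of the constrained-prompting idea in the figure above: the prompt pins the reply to a fixed JSON template so it can be parsed programmatically. The field names and the sample reply are hypothetical, not taken from the report.

```python
import json

# The instruction constrains the model to a fixed JSON schema.
prompt = (
    "Read the receipt in the image and return ONLY valid JSON of the form "
    '{"merchant": "", "date": "", "total": ""}. '
    "Use null for any field that is not visible."
)

# `reply` stands in for the text the vision model returns for this prompt.
reply = '{"merchant": "Example Mart", "date": "2023-10-01", "total": "12.50"}'

fields = json.loads(reply)        # fails loudly if the model broke the format
print(fields["merchant"], fields["total"])
```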

In large language models (LLMs), a new in-context few-shot learning capability was observed and is discussed in the report "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)": an LLM can generate the expected output simply by being given in-context examples in the same format, without any parameter updates. A similar capability is observed in multimodal models, where the query inputs are formatted image-text pairs. The report demonstrates the in-context few-shot learning capability of GPT-4V and emphasizes that in some cases a sufficient number of examples is crucial, especially when zero-shot or one-shot instructions are insufficient.

For example, in a complex speedometer-reading scene, GPT-4V successfully predicts the correct reading after being provided with two in-context examples. In another case involving multi-step reasoning over a line chart, GPT-4V reached the correct conclusion only after being given additional examples. These examples demonstrate the important role of in-context few-shot learning in improving LMM performance, making it a viable alternative to fine-tuning.
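A minimal sketch (my own construction, not the report's code) of how such two-shot prompting can be assembled with a chat-style API: each worked example is an image paired with the expected answer in an assistant turn, followed by the actual query image. The image URLs, answers, and model identifier are placeholder assumptions.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set


def image_part(url: str) -> dict:
    """Wrap an image URL in the chat-content format."""
    return {"type": "image_url", "image_url": {"url": url}}


question = {"type": "text", "text": "What is the speed shown on this speedometer?"}

# Placeholder images: two worked examples plus the query to be answered.
example_1 = "https://example.com/speedometer_a.jpg"
example_2 = "https://example.com/speedometer_b.jpg"
query_img = "https://example.com/speedometer_query.jpg"

messages = [
    # Example 1: image followed by the expected answer as an assistant turn.
    {"role": "user", "content": [question, image_part(example_1)]},
    {"role": "assistant", "content": "The speedometer reads approximately 90 mph."},
    # Example 2.
    {"role": "user", "content": [question, image_part(example_2)]},
    {"role": "assistant", "content": "The speedometer reads approximately 40 mph."},
    # The actual query, in the same format as the examples.
    {"role": "user", "content": [question, image_part(query_img)]},
]

response = client.chat.completions.create(model="gpt-4-vision-preview", messages=messages)
print(response.choices[0].message.content)
```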

Zero-shot performance in the challenging speedometer-reading scenario: GPT-4V fails to accurately read the speedometer even with different prompting methods. Red indicates incorrect answers.

3. Visual language ability

1. Image description in different domains

This section examines GPT-4V's ability and generalization when processing image-text pair inputs. It is asked to generate natural language descriptions covering the following topics: celebrity recognition, landmark recognition, food recognition, medical image understanding, logo recognition, scene understanding, and counterfactual examples.

In terms of celebrity recognition, GPT-4V can accurately identify celebrities from different backgrounds and understand the scene and background information, such as identifying a presidential speech at the 2023 G7 Summit.

In terms of landmark recognition, GPT-4V can accurately describe landmarks and generate vivid, detailed narratives that capture their essence.

In terms of food recognition, GPT-4V can accurately identify various dishes and capture their intricate details.

In terms of medical image understanding, GPT-4V can identify dental structures in X-rays and point out potential problems in CT scans.

In terms of logo recognition, GPT-4V can accurately describe the design and meaning of a logo.

In terms of scene understanding, GPT-4V can describe the position and color of vehicles in road scenes and read speed limits on road signs.

In terms of counterfactual examples, when presented with misleading questions, GPT-4V correctly describes the image content and is not misled.

Celebrity recognition and description results: GPT-4V can recognize a variety of celebrities and describe visual details, including their profession, actions, background, and the event.

2. Object localization, counting, and dense captioning

GPT-4V performs well in understanding the spatial relationships between people and objects in images; it can analyze spatial information and correctly understand relative positions. Its object counting capability can successfully count the number of objects appearing in an image, such as apples, oranges, and people, but counting can go wrong when objects are occluded or the scene is cluttered.

 

Spatial relationship understanding results: GPT-4V can identify the spatial relationship between objects in the image

3. Object localization

Object localization is a difficult problem in computer vision. In preliminary experiments, GPT-4V was able to generate bounding-box coordinates to locate people in images from simple text prompts, but it can struggle in complex scenes. When the scene or background is relatively simple and uncluttered, the localization results are promising; in more complex scenes (for example, with object occlusion), further prompting techniques are still needed to improve localization performance. Overall, GPT-4V can approximate the bounding-box coordinates of a specified object, but the model still has limitations in more complex scenes.
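As a hedged illustration of the workflow described above (not the report's code): the model is prompted to return normalized [x0, y0, x1, y1] bounding boxes as plain text, and the boxes are then drawn back onto the image to inspect localization quality. The reply string, file name, and coordinate convention are assumptions.

```python
import json

from PIL import Image, ImageDraw  # pip install pillow

# Suppose the model was prompted with:
#   "Locate every person in the image and return a JSON list of bounding boxes
#    as [x0, y0, x1, y1] in normalized (0-1) coordinates."
# and `reply` is the hypothetical text it returned.
reply = "[[0.12, 0.35, 0.28, 0.90], [0.55, 0.30, 0.71, 0.88]]"
boxes = json.loads(reply)

image = Image.open("street.jpg").convert("RGB")  # placeholder file name
draw = ImageDraw.Draw(image)
width, height = image.size

for x0, y0, x1, y1 in boxes:
    # Scale normalized coordinates back to pixels and overlay the box.
    draw.rectangle([x0 * width, y0 * height, x1 * width, y1 * height],
                   outline="lime", width=4)

image.save("street_with_boxes.jpg")
```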

 

4. Dense captioning

Dense captioning requires a detailed description of each image region and usually relies on a complex system including an object detector, a celebrity recognition model, and an image captioning model. To examine GPT-4V's ability at dense captioning, text prompts were used. The results show that the model successfully located and identified individuals in the image and provided concise descriptions.

 

4. Multimodal knowledge and common sense

GPT-4V performs well at interpreting memes and understanding humorous elements, gathering information from both text and images to understand the humorous effect. In scientific knowledge reasoning tasks, GPT-4V is also able to correctly answer questions covering a wide range of topics. In addition, GPT-4V shows strong multimodal commonsense reasoning, using bounding boxes in images to identify actions performed by individuals and to infer details of the scene. With more specific input prompts, it can also discern subtle clues in images and offer plausible hypotheses.

 

Results for Joke and Meme Understanding: GPT-4V demonstrates impressive ability to understand humor in memes

5. Scene text, tables, charts, and document reasoning

GPT-4V can accurately recognize and interpret scene text in images, including handwritten and printed text, and can extract key mathematical information to solve problems. It can also understand and reason about details in charts and flowcharts, including x-axis and y-axis information, and can convert the details of a flowchart into Python code. GPT-4V can also understand various types of documents (such as floor plans, posters, and exam papers) and provide reasonable answers. In more challenging cases, GPT-4V demonstrates impressive results, though some implementation details may occasionally be missed.

Scene text recognition results: GPT-4V can recognize scene text in many challenging scenarios

6. Multi-language and multi-modal understanding

GPT-4V successfully recognized input text prompts in different languages in natural image tests and generated image descriptions in the corresponding languages. In multilingual scene text recognition scenarios, GPT-4V can correctly identify and understand text in different scenes and translate it into different languages. Furthermore, in tests of multicultural understanding, GPT-4V was able to understand cultural nuances and generate reasonable multilingual descriptions.

Results of multilingual image description: GPT-4V is able to generate descriptions in different languages based on the image

7. Visual referring prompts for interacting with humans

The ability to point to specific spatial locations is crucial in human-computer interaction, especially for visual dialogue in multimodal systems. GPT-4V understands visual pointers drawn directly on an image well. A new interaction method, "visual referring prompts", is therefore proposed: its core idea is to edit the pixel space of the image, drawing visual pointers or scene text, as human referring instructions.

Finally, the researchers explored ways to enable GPT-4V to generate visual pointer output for interaction with humans. These visual pointers are intuitive to both humans and machines, making them a good channel for human-machine interaction. GPT-4V can recognize different types of visual markers as pointers and generate grounded descriptions for them. Compared with traditional vision-language models, it can handle the more challenging problem of generating descriptions focused on specific regions of interest. Additionally, GPT-4V understands coordinates and can perform spatial referencing without additional box-token fine-tuning. Despite some spatial inaccuracies, GPT-4V works more reliably with prompts that use overlaid visual pointers than with text coordinates.

 

Inspired by GPT-4V's ability to understand and process visual pointing, a new way of interacting with GPT-4V is proposed, namely visual referring prompts. This approach edits directly in the pixel space of the input image, adding new possibilities for human-computer interaction. For example, GPT-4V can naturally associate the object pointed to by an arrow with a given object index, can understand a question written on the image and point to the corresponding edges or angles, and can refer to arbitrary regions in the image.

Visual referring prompts provide a new way to interact and are expected to facilitate a variety of use cases. GPT-4V is also able to generate its own pointing output, further enabling closed-loop human-computer interaction. For example, visual pointing output can be produced by asking GPT-4V to predict region coordinates in text format. Including example instructions in the prompt helps GPT-4V understand how the coordinates are defined and thus generate better pointing output. This ability to iteratively generate, understand, and execute pointing instructions should help GPT-4V achieve better performance on a variety of complex visual reasoning tasks.

 

8. Emotional Intelligence Test

GPT-4V demonstrates empathy and emotional intelligence in human interactions, understanding and sharing human emotions. Following the definition of human emotional intelligence tests, its abilities include:

1. Recognize and interpret emotions in facial expressions

2. Understand how visual content elicits emotions

3. Generate appropriate text output under the desired emotions and emotional attitudes

GPT-4V understands how different visual content inspires human emotions

Next, we explore GPT-4V's ability to understand how visual content elicits emotions. This ability is critical for predicting how different visual content evokes human emotions and responses (such as anger, wonder, and fear), and is extremely important in usage scenarios such as home robots.

 

GPT-4V judges image aesthetics based on social standards and norms

In addition to understanding visual emotions, GPT-4V can also align with human subjective judgments, such as aesthetic perspectives. As shown in the figure, GPT-4V can judge the aesthetics of images based on social standards.

Discussion on the impact of multi-modality on computing power

1. CLIP opens the door to image and text alignment and may become the core foundation for realizing multi-modality

At present, the mainstream approach to vision-language multimodal large models is to take a pre-trained large language model and an image encoder and connect them with an image-text feature alignment module, so that the language model can understand image features and perform deeper question-answering and reasoning.

The news and papers officially released by OpenAI and Microsoft about GPT-4V do not reveal in detail how its multimodality is implemented, especially the specifics of the vision model. However, a preliminary understanding of how multimodal large models are implemented can be gained from CLIP, released by OpenAI, and its successors such as BLIP and BLIP-2.

1. The CLIP model achieves feature alignment between images and text; this foundational work was released in 2021

In the past, computer vision systems were primarily trained as image classification models, which limits their ability to generalize to unseen categories. To obtain a large amount of broad, weakly supervised training data, learning visual representations directly from raw text becomes a more promising approach.

The CLIP model proposed by OpenAI in 2021 uses an image-text contrastive learning pre-training method. The pre-trained model learns to associate visual features of images with their matching text on large-scale data. Even without fine-tuning, it can be used directly for downstream visual tasks and achieve good results. CLIP thus overcomes the earlier dependence on large amounts of annotated data.

 

2. CLIP takes paired image-text inputs, outputs their corresponding features, and performs contrastive learning on those features, enabling zero-shot image classification

The CLIP model takes as input a batch of training pairs, each consisting of an image and its corresponding description text. Images are passed through an image encoder to extract visual features, while texts are passed through a text encoder to extract semantic features. The model treats the similarity between each image's visual features and its matching text features as a positive sample, and the similarity between each image's visual features and all non-matching text features as negative samples. CLIP's training objective is to maximize the similarity of all positive pairs and minimize the similarity of all negative pairs; that is, features of matching image-text pairs should be as similar as possible, while features of non-matching pairs should be as dissimilar as possible. Through this pre-training method, the CLIP model can be widely used in downstream image understanding tasks without additional fine-tuning.
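A minimal PyTorch-style sketch of the symmetric image-text contrastive objective described above, loosely following the pseudocode in the CLIP paper; the encoders that produce the embeddings and the temperature value are placeholders, not OpenAI's actual training code.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Both tensors have shape [batch, dim]; row i of the image features is the
    positive match for row i of the text features, every other row is a negative.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: diagonal entries are positives, the rest negatives.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```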

 

To use the CLIP model for zero-shot image classification, first design a description text for each category, such as "a photo of a {label}". These description texts are fed through the text encoder to obtain text features; with n categories, this yields n text feature vectors. Then the image to be predicted is fed through the image encoder to obtain its image features, and the similarity between this image feature and the n category text features is computed. The category whose text feature has the highest similarity is the model's prediction for the image. The similarities can further be treated as logits and passed through a softmax to obtain a predicted probability for each category. The pre-trained CLIP model can be used directly for this zero-shot classification without any additional training or fine-tuning.
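A short, hedged example of the zero-shot classification recipe above, using the Hugging Face transformers port of CLIP; the label set, prompt template, and image path are assumptions made for illustration.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor  # pip install transformers pillow torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]                            # assumed label set
prompts = [f"a photo of a {label}" for label in labels]   # one description text per category
image = Image.open("example.jpg")                         # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax turns them
# into per-category probabilities for this single image.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```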

3. The biggest innovation of CLIP is the use of extremely large-scale data sets for direct training, which is simple and effective

The innovation of the CLIP model is not a new network architecture; rather, it adopts an efficient image-text matching model and trains it on a very large dataset. Before CLIP, major vision datasets such as COCO and Visual Genome were manually annotated with very high quality, but contained only millions of samples. In comparison, YFCC100M has 100 million samples, but the quality is uneven; after filtering, only about 15 million remain, roughly the size of ImageNet. Because existing data was insufficient, OpenAI constructed the WIT (WebImageText) dataset of about 400 million image-text pairs, gathered using roughly 500,000 queries with up to about 20,000 image-text pairs per query; its total word count is comparable to the WebText corpus used to train GPT-2. WIT's scale allows the CLIP model to be trained far more thoroughly.

4. In 2021, the best model required approximately 256 NVIDIA V100 GPUs and 12 days of training, and its performance was significantly better than traditional vision systems

OpenAI trained a series of CLIP models based on various ResNet and Vision Transformer architectures. The largest ResNet model took 18 days of training on 592 NVIDIA V100 GPUs, while the largest ViT model took 12 days on 256 V100 GPUs. The results show that the ViT models outperform the ResNet models, and larger ViT models outperform smaller ones. The final best model is ViT-L/14@336px. Compared with earlier work, CLIP's zero-shot classification performance is significantly improved, showing that its zero-shot learning ability reached a new height.

 

Comparison of the effects of CLIP and previous visual classification models

CLIP bridges text and image understanding by mapping visual and semantic features into a unified embedding space through pre-trained image-text matching. This technology makes reasoning in multimodal contexts possible. Building on models such as CLIP, large language models such as ChatGPT have gained visual understanding capabilities. The CLIP family of models lays the foundation for unified vision-language pre-training and is key to realizing a multimodal ChatGPT.

2. The multi-modal application space is vast, and the computing power requirements may increase by orders of magnitude.

Training multimodal models requires an order-of-magnitude increase in computing power and may require tens of thousands of GPUs. It is reported that Inflection's large language model, roughly equivalent to GPT-3.5, used approximately 3,500 NVIDIA H100 GPUs during training. For startups, training a large language model typically requires thousands of H100 GPUs, while fine-tuning requires tens to hundreds. There are also reports that GPT-4 may have been trained on 10,000 to 25,000 NVIDIA A100 GPUs, while the number of H100 GPUs required for GPT-5 may be 25,000 to 50,000, roughly a tenfold increase over GPT-3.5.

In the inference stage, in terms of data volume, images, video, and voice involve several orders of magnitude more data than text interaction, leading to a sharp expansion in computing power requirements.

1. In terms of text, mainstream software from search to email has gradually integrated these capabilities.

Major email services such as Outlook and Gmail already support ChatGPT-based functions. Outlook can automatically generate email replies for different needs, Gmail users can have ChatGPT draft complete emails, and the Chrome browser also offers free support. According to statistics, more than 330 billion emails are sent around the world every day, nearly half of which are spam. Among email clients, Gmail and Outlook have market shares of 27.2% and 7.8% respectively. Estimating the volume of non-spam email, Outlook handles roughly 13.7 billion emails per day. Based on statistics on average email length and accounting for text storage format, the daily volume of Outlook email text is estimated at approximately 25.52 TB. Assuming that ChatGPT is used for 1% of Outlook email, the amount of data that would need to be processed every day is approximately 261 GB, nearly 8 times the current question-and-answer scenario.
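The arithmetic behind this estimate, reproduced as a short script; every input is the article's own assumption (the ~2 KB average email size is implied by its totals), not a measured value.

```python
# Rough reproduction of the article's Outlook estimate.
outlook_emails_per_day = 13.7e9   # estimated non-spam Outlook e-mails per day (article's figure)
avg_email_size_bytes = 2 * 1024   # ~2 KB of text per e-mail, implied by the article's totals
chatgpt_usage_ratio = 0.01        # assume 1% of Outlook mail goes through ChatGPT

daily_bytes = outlook_emails_per_day * avg_email_size_bytes
daily_tb = daily_bytes / 1024**4                                  # ~25.5 TB per day
chatgpt_daily_gb = daily_bytes * chatgpt_usage_ratio / 1024**3    # ~261 GB per day

print(f"Outlook e-mail text per day:           {daily_tb:.2f} TB")
print(f"Volume routed through ChatGPT per day: {chatgpt_daily_gb:.0f} GB")
```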

 

Outlook uses GPT to generate emails

2. Voice: Teams has been integrated with OpenAI to greatly improve the efficiency of online meetings. 

Microsoft's Teams platform has integrated OpenAI to support functions such as automatically generated meeting minutes, chapter divisions, and timestamps. For $10 per month, users can use the GPT-3.5 model and obtain services such as automatic meeting minutes, real-time translation, chapter division, and timeline marking. Teams offers a variety of such functions, including automatic minutes, real-time translation in 40 languages, AI chapter division, personalized timestamps, and privacy-protecting watermarks and encryption. These functions help users improve work efficiency, save time and cost, and enrich the meeting experience; the automatically generated minutes and chapter divisions are particularly useful. The integration of Teams with GPT-3.5 represents a new direction for productivity tools, providing users with more intelligent services.

Reduce language barriers during meetings with real-time translation and subtitles

As the application of voice input in large models becomes more and more widely used in the Teams platform, the demand for new data volume will also increase accordingly. The storage principle of digital audio shows that the sampling frequency, the number of quantization bits and the number of channels will affect its storage capacity. In phone-quality audio, an 8kHz sampling rate, 8bit quantization, and two-channel storage method are used, and the storage capacity is approximately 2 bytes per second. Assuming that in the voice interaction scenario of Teams, ChatGPT needs to process 1 hour of audio data every day, then the daily new data volume requirement is about 7200 bytes, or 7.03KB.

Considering that Teams currently has over 100 million daily active users, we can estimate that if all users use 1 hour of audio interaction, the daily new data requirement will be approximately 7.03KB * 100 million = 703GB. Compared with current text interaction, the demand for voice data has increased by about 200 times. Therefore, the introduction of voice interaction scenarios will bring a significant increase in data level to the AI ​​system.

The data volume after audio digitization is calculated as follows: the storage size (in bytes) of an analog audio waveform after digitization, assuming no compression, is storage size = sampling frequency (Hz) × quantization bits (bit) / 8 × number of channels × duration (seconds). This formula helps in understanding and predicting audio storage needs.

According to Microsoft’s public data, the number of daily active users of the Teams platform has increased from 115 million in 2020 to 270 million in 2022. Assuming that the total meeting time of Teams grows in proportion to the number of users, the total meeting time of Teams in 2022 is estimated to be approximately 6 billion minutes. According to the principle of audio storage and estimated based on phone quality parameters, the storage amount corresponding to 6 billion minutes of audio is approximately 671GB. Assuming that about 50% of users use ChatGPT to generate meeting minutes, the new voice data demand for Teams is about 336GB. It should be noted that this is only a parameter estimate based on phone sound quality, and in fact the difference in audio sampling rate and bit rate may lead to a larger actual data amount. In addition, the proportion of users who use ChatGPT to generate minutes may also be adjusted, thus affecting the final demand.

3. Images: Filmora connects to the OpenAI service to enable text-to-image and image-to-image generation

Filmora video editing software has integrated OpenAI functionality that can intelligently generate image material with one click. Wondershare Technology provides Filmora with support for OpenAI's AI drawing capabilities: users only need to roughly trace a shape to get a complete AI-generated image within seconds. In its latest Valentine's Day release, Filmora added text-to-image generation: users only need to enter simple text to obtain high-quality AI-generated images. This represents a new direction in combining creative tools with AI. By integrating OpenAI, Filmora helps ordinary users easily obtain high-quality images to assist video creation. In the future, Filmora is expected to add more AI-generated content features, providing users with a more intelligent and efficient creative experience.

 

Based on Filmora's image parameters, the daily output of its OpenAI-generated images is estimated at approximately 586 GB: Filmora's default resolution is 1920×1080, each image is about 6 MB, and assuming 3 million monthly active users and 100,000 OpenAI calls per day, the daily data volume is approximately 586 GB. Wondershare's mind-mapping tool EdrawMind has also integrated AI content generation: users only need to enter text to automatically generate various mind maps. The application scenarios of this technology are very broad, including marketing, publishing, art, and medicine, and the application space for AI-generated images is expected to expand further.
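The same kind of back-of-the-envelope arithmetic for the 586 GB figure, with the article's assumed call volume and per-image size made explicit.

```python
# Daily volume of AI-generated images in Filmora, using the article's assumptions.
calls_per_day = 100_000    # assumed OpenAI image-generation calls per day
image_size_mb = 6          # ~6 MB per 1920x1080 image (article's figure)

daily_gb = calls_per_day * image_size_mb / 1024
print(f"Generated image data per day: {daily_gb:.0f} GB")  # ~586 GB
```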

4. Video: AIGC assists in generating animation, and the journey to the sea of stars begins

AIGC technology has broad application prospects, as shown by the commercial anime "Dog and Boy", jointly created by Netflix, Xiaoice's Japan branch (rinna), and WIT STUDIO. Xiaoice is an independent technology R&D company, formerly Microsoft's Xiaoice artificial intelligence team, which was spun off into an independent company in 2020. On November 7, 2022, Xiaoice completed new financing totaling 1 billion yuan to accelerate R&D of the Xiaoice framework for AI Beings, and announced an upgrade of its AI Being Employee product line, including a large-model dialogue engine, 3D neural rendering, super-natural speech, and AIGC content generation. Xiaoice's business covers many countries and regions around the world, with a large user base and audience.

 

"Dog and Boy" AI participated in the production

Runway has opened Gen-2, with video generation costing about $0.20 per clip. Runway announced that its Gen-1 and Gen-2 models are free for public trial: a generated video is 4 seconds long and consumes 5 credits per second. Once free credits are exhausted, users can pay $0.01 per credit, i.e., about $0.20 to generate one video. Gen-2 can quickly generate a video from text only, an image only, or text plus an image description, and is the first publicly available text-to-video model on the market. With video output on the order of 1 MB per second, this marks only the prelude to a much larger future. As AIGC technology gradually penetrates film, television, and promotional video production, the efficiency of video creation is expected to improve significantly.

YouTube's recommended bitrates for SDR videos

To sum up: at present, the application scenarios of ChatGPT and AIGC are far from fully explored. Various forms of input and output such as voice, images, and video will bring revolutionary changes to content creation. Broader data forms, more application scenarios, and deeper user experiences will increase the demand for AI computing power, potentially ushering in an era of rapid computing power expansion.

 

Estimated data volumes for various OpenAI large model application scenarios

3. What is the strength of Nvidia’s most powerful AI chip GH200?

GH200 and H100 are products of the same generation, with the same AI compute chip architecture and comparable compute capability. However, GH200 offers 3.5 times the memory capacity of H100, which benefits AI tasks that must handle more complex models or larger amounts of data. The advantage of GH200 over H100 therefore lies in its larger memory capacity rather than in raw computing power.

GH200 combines a Grace CPU chip and a Hopper GPU chip, interconnected via high-speed NVLink-C2C with up to 900 GB/s of bandwidth, enabling tight CPU-GPU data exchange and allowing the GH200's GPU to access CPU memory directly. By comparison, in H100 systems the CPU and GPU are typically connected only via PCIe, and even the latest generation offers only 128 GB/s, less than one-seventh of the GH200's NVLink-C2C bandwidth. Through this chip-level co-design, GH200 achieves more efficient CPU-GPU memory sharing, which is friendlier to AI workloads that require frequent CPU-GPU data exchange.

Blue Ocean Brain Large Model Training Platform

The Blue Ocean Brain large model training platform provides powerful computing power support, including AI accelerators based on high-speed interconnection of open acceleration modules. It is configured with high-speed memory and supports a fully interconnected topology to meet the communication requirements of tensor parallelism in large model training. It supports high-performance I/O expansion and can scale out to clusters of tens of thousands of accelerator cards to meet the communication needs of pipeline and data parallelism. It features a powerful liquid cooling system, hot-swappable components, and intelligent power management: when the BMC receives a PSU failure or error warning (such as power outage, surge, or overheating), it automatically forces the system's CPUs into ULFM (ultra-low frequency mode) to minimize power consumption. The platform is committed to providing customers with environmentally friendly, low-carbon, energy-saving high-performance computing solutions, and is mainly used in deep learning, academic education, biomedicine, earth exploration, meteorology and oceanography, supercomputing centers, AI, and big data.

 

1. Why do we need large models?

1. The model effect is better

The effect of large models in various scenes is better than that of ordinary models

2. Stronger creative ability

Large models can perform content generation (AIGC) to facilitate large-scale content production

3. Flexible customization of scenarios

By giving examples, we can customize a large number of application scenarios for large models.

4. Less labeled data

By learning a small amount of industry data, large models can cope with the needs of specific business scenarios.

2. Platform features

1. Heterogeneous computing resource scheduling

A comprehensive solution based on general-purpose servers and dedicated hardware for scheduling and managing multiple heterogeneous computing resources, including CPUs, GPUs, etc. Through powerful virtualization management functions, underlying computing resources can be easily deployed and various models can be run efficiently. At the same time, the hardware acceleration capabilities of different heterogeneous resources are fully utilized to speed up the running and generation speed of the model.

2. Stable and reliable data storage

Supports multiple storage type protocols, including block, file and object storage services. Pool storage resources to achieve free circulation of models and generated data, improving data utilization. At the same time, data protection mechanisms such as multiple copies, multi-level fault domains, and fault self-recovery are adopted to ensure the safe and stable operation of models and data.

3. High-performance distributed network

Provides networking for compute and storage resources, forwarding traffic through distributed network mechanisms that pass physical network performance through transparently, significantly improving the efficiency and performance of model computation.

4. Comprehensive security guarantee

In terms of model hosting, a strict permission management mechanism is adopted to ensure the security of the model warehouse. In terms of data storage, measures such as privatized deployment and data disk encryption are provided to ensure the security and controllability of data. At the same time, during the model distribution and operation process, comprehensive account authentication and log audit functions are provided to fully ensure the security of the model and data.

3. Common configurations

1. Processor CPU:

Intel Xeon Gold 8358P, 32C/64T, 2.6GHz, 48MB cache, DDR4-3200, Turbo, HT, 240W

Intel Xeon Platinum 8350C, 32C/64T, 2.6GHz, 48MB cache, DDR4-3200, Turbo, HT, 240W

Intel Xeon Platinum 8458P, 28C/56T, 2.7GHz, 38.5MB cache, DDR4-2933, Turbo, HT, 205W

Intel Xeon Platinum 8468, 48C/64T, 2.1GHz, 105MB cache, 350W

AMD EPYC™ 7742, 64C/128T, 2.25GHz (up to 3.4GHz boost), 256MB cache, DDR4-3200MT/s, 225W

AMD EPYC™ 9654, 96C/192T, 2.4GHz (boost up to 3.55GHz / 3.7GHz), 384MB cache, DDR5-4800MT/s, 360W

2. Graphics card GPU:

NVIDIA L40S GPU 48GB

NVIDIA NVLink-A100-SXM640GB

NVIDIA HGX A800 80GB

NVIDIA Tesla H800 80GB HBM2

NVIDIA A800-80GB-400Wx8-NvlinkSW×8


Source: blog.csdn.net/LANHYGPU/article/details/133922720