Exploring the future of AIGC: CPU source code optimization, multi-GPU programming and China’s computing power bottleneck and development

★Artificial intelligence; big data technology; AIGC; Turbo; DALL·E 3; multi-modal large model; MLLM; LLM; Agent; Llama2; domestic GPU chip; GPU; CPU; high-performance computer; edge computing; large model memory usage; 5G; deep learning; A100; H100; A800; H800; L40s; Intel; NVIDIA; computing power

In recent years, AIGC technology has made great progress. One of the most important technologies is source code-based CPU tuning, which can effectively improve the training speed and efficiency of artificial intelligence models, thereby accelerating the application process of artificial intelligence. At the same time, multi-GPU programming technology is also constantly developing, greatly improving the computing power of artificial intelligence models and better meeting the needs of practical applications.

This article will analyze the latest progress of AIGC, delve into the above topics, as well as the bottlenecks and trends of China’s computing industry.

AIGC development status

After experiencing the "Battle of Hundreds of Models" and "Letting a Hundred Flowers Bloom" stages in the first half of the year, the AIGC industry is now standing in a critical period from "toys" to "tools". The market structure of large models has undergone profound changes, and the focus of the industry has also turned to the “ultimate proposition” of artificial intelligence development—application and commercialization. The transformation of AIGC's research and development paradigm fundamentally improves data production efficiency and lowers the threshold for users and developers.

In order to improve the capabilities and effectiveness of models, the industry has jointly focused on effective ways to amplify model capabilities, such as fine-tuning, prompt engineering, retrieval-augmented generation (RAG), AI Agents and other techniques. At the same time, open-source models have developed rapidly and products have been extended to terminal devices, integrating more AI application technologies and diversifying application scenarios. However, as policies set entry barriers on the consumer (C) side and standards systems covering multiple industries emphasize data, algorithms, models and security, the "Battle of a Hundred Models" is returning to rationality and the industry structure is entering a stage of consolidation.

In Q3 2023, there were 35 financing events in the domestic AIGC industry, involving 33 companies and 51 investment institutions, with a total financing amount of 3.961 billion yuan; 21 companies were at the seed-to-angel stage (63.64%). The general large model (6 deals) and tool platform (6 deals) segments were relatively active. At the application layer, metaverse/digital humans (5 deals) and marketing (5 deals) saw the most frequent financing events. One domestic AIGC company completed a listing: Fourth Paradigm, a decision-making artificial intelligence company. Q3 2023 also saw one merger and acquisition in the domestic AIGC industry: Meituan acquired Lightyear for 2.065 billion yuan.

1. Technology iteration

1. Multi-modal large model DALL·E 3 brings industry impact

Multimodal large language model (MLLM) is a model that combines training with multimodal information such as text, images, audio and video. Compared with large language model (LLM), it is more in line with the way humans perceive the world. With the support of multi-modal input, users can interact with intelligent assistants in a more flexible way and utilize powerful large models as brains to perform multi-modal tasks.

DALL·E 3 captures subtle differences in semantic descriptions much better, follows prompts far more faithfully, and effectively avoids confusing elements in detailed requests, marking significant progress in image generation. At the same time, combining the text-to-image model with ChatGPT greatly reduces the burden of prompt engineering.

2. Long text technology enhances product user experience

"Context length" in LLM refers to the length of input text that the large language model considers when generating predictions. The ability to model longer text enables the model to observe longer context, avoiding the loss of important information. The application effect of large models depends on two core indicators: the amount of model parameters and the context length. Among them, the context length determines the "memory" capacity of the large model. Long text can provide more context and detailed information to assist the model in judging semantics, reduce ambiguity, and improve the accuracy of induction and inference.

At present, exploration of context length at home and abroad has not yet reached a "critical point", and long text will continue to play an important role in future Agents and AI-native applications: Agents need historical information for planning and decision-making, while AI-native applications rely on context to maintain a coherent, personalized user experience. This is why large model companies such as Moonshot AI (Dark Side of the Moon) and OpenAI pay close attention to long-text technology.

3. Llama2 sets off a new pattern in the large model market

Llama is a large language model released by Meta and trained on publicly available datasets. It was welcomed by the AI community for its openness and reproducibility; however, its license restricted the first-generation LLaMA to academic research, so it could not be used commercially.

Compared with Llama 1, Llama 2's pre-training corpus grew by 40%, to 2 trillion tokens. In September its context window was extended to 32,768 tokens, and it adopts a grouped-query attention (GQA) mechanism for a deeper understanding of text semantics. On the MMLU and GSM8K benchmarks, Llama 2 70B performs close to GPT-3.5.

 

 

4. AI Agent deeply explores the potential of large models

An Agent is a software or hardware entity with intelligent characteristics such as autonomy, reactivity, sociality, proactiveness, deliberation and cognition; it can be roughly summarized as large model + memory + planning + tools. An AI Agent can understand, plan, execute and self-correct to solve more complex problems. Compared with a bare LLM, an AI Agent can reason on its own and call tools to work toward a goal step by step; unlike RPA, it can handle unknown environment information.

 

 

Development and comparison of advantages and disadvantages of AI Agent and other technology selection solutions

2. Technology trends

1. Embracing the open source spirit, the rise of domestic models has become a prairie fire.

With strong national support and promotion by leading manufacturers, domestic models have become an important force in the large language model camp. Although they started late and face restricted access to foreign high-end GPU chips, the rise of domestic models has spread like a prairie fire, and major Internet companies are actively building open-source ecosystems.

Domestic AI large model development process (as of Q3 2023)

2. Extend large-model products to terminals to promote the diversified development of application scenarios

Technologies such as open-source large models, multi-modality and Agents will bring new, personalized and more human interactive experiences. In the future, large models will be deployed on mobile phones, PCs, cars, humanoid robots and other terminals, addressing the cost, energy consumption, performance, privacy, security and personalization problems of cloud-only AI, and expanding diversified applications in scenarios such as autonomous driving, smart education and smart homes. However, lightweight deployment and deep integration of software and hardware on the device side remain difficult problems.

 

3. The comprehensive cost of enterprises' private deployment of large models continues to decrease

The cost of implementing large model applications includes data, model and application development costs; model costs consist of licensing costs and computing power costs. As Llama 2 pushes domestic models toward free commercial licensing and MaaS gains market acceptance, the barrier of high licensing costs is disappearing. With QLoRA fine-tuning and GPTQ quantization, small and medium-sized enterprises can also use hundred-billion-parameter models, significantly reducing computing power costs. The overall cost of private enterprise deployment keeps falling, which helps large models penetrate the B-side market.

 

3. How to determine the video memory usage of large models?

When deploying a large model, GPU memory (video memory) usage is a key issue. Because of their sheer size, large models either fail to run at all due to out-of-memory errors, or run inference slowly because the model is so big. Optimizing inference for large models differs from optimizing inference for small CNN models. The following mainly discusses how to estimate the GPU memory usage of large models.

Take the popular Llama 2 family as an example: it has three main versions, 7B, 13B and 70B. B (billion) means one billion and M (million) one million, so models such as Llama 2 are billion- to tens-of-billions-parameter-scale models.

For deep learning models, common precisions include float32, float16, int8 and int4; the lower precisions such as int8 and int4 are mainly used to accelerate inference. A float32 value occupies 4 bytes (32 bits), float16 halves that to 2 bytes, int8 needs 1 byte and int4 only half a byte. The parameter count and the precision together determine the model's memory footprint. Take Llama2-13B as an example:

For float32 precision: 13 × 10^9 × 4 / 1024^3 ≈ 48.42 GB
For float16 precision: 13 × 10^9 × 2 / 1024^3 ≈ 24.21 GB

By analogy, the memory usage of Llama2-7B:

For float32 precision: 7 × 10^9 × 4 / 1024^3 ≈ 26.08 GB;

For float16 precision, the memory is halved: ≈ 13 GB;

For int8, it is halved again: ≈ 6.5 GB;

For int4, it is halved once more: ≈ 3.26 GB.

This shows the importance of low-bit quantization for memory management when deploying large models. The figures above cover only forward inference and do not apply to training: the training process is also affected by gradients, optimizer state, batch size and other factors, and memory usage during training is generally several times, sometimes more than ten times, that of inference. The inference figures are theoretical values; in practice the requirement is always higher, so some margin must be reserved. For example, for Llama2-13B the theoretical value is about 48.4 GB, but in measurements it actually needs about 52 GB of GPU memory. This method also works for estimating the forward-inference memory footprint of CNN models.
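For illustration, the estimate above can be written as a small helper (a minimal sketch; the function name is an assumption):

#include <cstdio>

// Theoretical inference footprint of a dense model: parameter count times
// bytes per parameter, converted to GiB. A real deployment also needs room
// for activations, the KV cache and framework overhead, so keep some margin.
double paramMemoryGiB(double paramsBillion, double bytesPerParam) {
    return paramsBillion * 1e9 * bytesPerParam / (1024.0 * 1024.0 * 1024.0);
}

int main() {
    const double bytesPer[] = {4.0, 2.0, 1.0, 0.5};   // fp32 / fp16 / int8 / int4
    const char*  names[]    = {"fp32", "fp16", "int8", "int4"};
    for (int i = 0; i < 4; ++i)
        std::printf("Llama2-13B %s: %6.2f GiB\n", names[i],
                    paramMemoryGiB(13.0, bytesPer[i]));   // 48.43 / 24.21 / 12.11 / 6.05
    return 0;
}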

Source code based CPU tuning

For high-performance applications, such as cloud services, scientific computing, and AAA games, the hardware foundation is crucial. Ignoring hardware factors can lead to performance bottlenecks. Standard algorithms and data structures may not provide optimal performance in certain scenarios.

1. CPU front-end optimization

With the popularity of "flat" data structures, linked lists are gradually falling out of favor. A traditional linked list dynamically allocates memory for every node, causing memory access latency and fragmentation, which makes traversing a linked list more time-consuming than traversing an array. Some data structures, such as binary trees, have a naturally linked-list-like shape, so pointer-chasing implementations may seem natural; however, more efficient flat alternatives such as boost::flat_map and boost::flat_set exist.

The optimal algorithm for a particular problem may not be the best choice in a particular scenario. For example, binary search is efficient at finding elements in a sorted array, but branch prediction errors can cause it to perform poorly on large-scale data. Therefore, linear search may be more efficient when dealing with small arrays of integers. In short, for high-performance applications, it is necessary to have an in-depth understanding of hardware and algorithm performance, and to flexibly select and optimize appropriate algorithms and data structures to adapt to different scenarios.

"Data-driven" optimization is an important tuning technique based on a deep understanding of the data a program processes. Focus on how data is distributed and transformed within the program. One typical example is converting a structure of arrays (SOA) to an array of structures (AOS). Which layout you choose depends on how your code accesses the data. If the program traverses the data structure and accesses only field b, SOA is more efficient, mainly because all memory accesses are performed sequentially; if the program traverses the data structure and accesses all fields of the object (i.e. a, b, and c) a lot operation, AOS is better because all members may be stored in the same cache line, reducing cache line reads and improving memory bandwidth utilization. To perform such optimizations, you need to understand what data your program will process and the distribution of the data, and modify your program accordingly.

Another important data-driven optimization is "small size optimization", which pre-allocates a fixed amount of inline storage in a container to avoid dynamic memory allocation. It is used extensively in the LLVM infrastructure and can significantly improve performance (for example llvm::SmallVector; boost::static_vector is built on the same idea). Modern CPUs are very complex devices, and it is nearly impossible to predict exactly how a given piece of code will run: instruction execution is subject to numerous, constantly changing factors.
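The idea behind such containers can be sketched roughly as follows (a simplified toy, not the LLVM or Boost implementation):

#include <cstddef>
#include <memory>

// A toy "small vector": the first N elements live in an inline buffer, so no
// heap allocation happens until the container outgrows it.
template <typename T, std::size_t N>
class SmallVec {
    T inlineBuf_[N];               // pre-allocated inline storage
    std::unique_ptr<T[]> heap_;    // used only after growing past N
    T* data_ = inlineBuf_;
    std::size_t size_ = 0, cap_ = N;
public:
    void push_back(const T& v) {
        if (size_ == cap_) {                         // spill to the heap
            std::size_t newCap = cap_ * 2;
            std::unique_ptr<T[]> p(new T[newCap]);
            for (std::size_t i = 0; i < size_; ++i) p[i] = data_[i];
            heap_ = std::move(p);
            data_ = heap_.get();
            cap_ = newCap;
        }
        data_[size_++] = v;
    }
    std::size_t size() const { return size_; }
    T& operator[](std::size_t i) { return data_[i]; }
};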

1. Machine code layout

Machine code layout refers to how the compiler arranges the generated instruction bytes in memory. Because this layout affects the performance of the binary, the compiler takes into account the offsets at which instructions will be placed when translating source code into machine code.

2. Basic blocks

A basic block refers to an instruction sequence with a single entry and exit. It can have multiple predecessors and successors, but there is no instruction in the middle of the basic block that can jump out of the basic block. This structure ensures that each piece of code in a basic block will only be executed once, thereby greatly reducing control flow graph analysis and transformation problems.

3. Basic block layout

// hot path

if (cond)

    coldFunc();

// hot path again

If the condition cond is usually true, the default layout is preferable, since the alternative would incur two jumps instead of one. However, coldFunc here is an error-handling function that is rarely executed, so we choose to keep the hot code on the fall-through path, converting a taken branch into a not-taken one (a sketch of hinting this to the compiler follows the list below). The reasons for preferring this layout are as follows:

1) The CPU can execute two not-taken branches per clock cycle, but only one taken branch every two clock cycles, so a not-taken branch is cheaper than a taken one.

2) All the hot code is contiguous, with no cache-line fragmentation, so the instruction cache and micro-op cache are used more fully.

3) Every taken jump means the bytes fetched after the jump are wasted, so taken branches also cost fetch bandwidth.
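A sketch of hinting the expected branch direction so the compiler keeps the hot path as straight-line fall-through code; __builtin_expect is the GCC/Clang builtin, C++20 offers [[likely]]/[[unlikely]] for the same purpose, and the function names here are illustrative:

void coldFunc();                            // error-handling routine, rarely executed

void process(bool cond) {
    if (__builtin_expect(cond, 0)) {        // tell the compiler cond is expected to be false
        coldFunc();                         // cold path, laid out away from the hot code
    }
    // hot path continues as straight-line (not-taken) code
}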

4. Basic block alignment

Performance can vary with the offset at which instructions sit in memory. If a hot loop spans multiple cache lines, the CPU front end may suffer; nop instructions can therefore be inserted before the loop to shift its starting offset so that the whole loop fits within one cache line.

LLVM can align all basic blocks with -mllvm -align-all-blocks, but this may actually degrade performance: inserting nop instructions adds overhead, especially on the critical path. Although a nop does no useful work, it still has to be fetched from memory, decoded and retired, consuming front-end structures and buffer space.

To precisely control alignment, use the ALIGN assembly directive. Developers first generate assembly lists and then insert ALIGN instructions to meet the needs of specific experimental scenarios.

5. Function splitting

Function splitting is used to optimize functions that have a complex CFG and large amounts of cold code inside hot paths. Moving the cold code into a separate function avoids loading unneeded code at run time and reduces the memory footprint.

In the optimized code, the original function is split into two: one containing the hot code and one containing the cold code. __attribute__((noinline)) is applied to the cold function to keep it from being inlined back into the hot code and hurting performance.

By separating hot and cold code, the CPU front-end data structures (instruction cache and DSB) can be better utilized and CPU utilization improved. Placing the new cold function outside the hot part of the .text section also avoids loading unnecessary code at runtime, reducing the memory footprint.
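A minimal sketch of the pattern (the function names are illustrative):

#include <cstdio>

// Cold path moved out of line; noinline keeps it from being merged back
// into the hot function, and the compiler may place it in a cold section.
__attribute__((noinline))
static void handleError(int code) {
    std::fprintf(stderr, "error %d\n", code);
}

// Hot function: only the frequently executed code remains, so it packs
// densely into the instruction cache.
int processItem(int value) {
    if (value < 0) {              // rare error case
        handleError(value);
        return -1;
    }
    return value * 2;             // hot path
}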

6. Function grouping

Hot functions can be grouped together to improve CPU front-end cache utilization and reduce the number of cache lines that must be read. The linker decides the arrangement and layout of all functions in the program; the LLD linker can optimize function layout through --symbol-ordering-file, and the HFSort tool can generate such an ordering file automatically from profiling data.

7. Compilation optimization based on profiling files

Most compilers provide a set of transformations that adjust their decisions based on profiling data; this is called PGO (Profile-Guided Optimization). Profiling data can be generated in two ways: code instrumentation and sampling-based profiling.

1) With the LLVM compiler, build instrumented code using -fprofile-instr-generate, run the instrumented program to collect profiling data (typically merged with llvm-profdata), and then recompile with -fprofile-instr-use to produce a PGO-tuned binary.

2) Generate the profiling data the compiler needs from sampling, converting the samples produced by Linux perf into a form GCC and LLVM can understand via the AutoFDO tool. Keep in mind that the compiler will then assume the profiled workload is representative of all workloads.

8. Optimization of ITLB

Virtual-to-physical address translation is another key area of front-end optimization. Mapping performance-critical code onto huge pages can relieve pressure on the ITLB (instruction TLB); this requires relinking the binary so that code sections are aligned on the appropriate page boundary. Besides huge pages, other techniques help instruction cache performance, such as reordering functions so that hot functions are grouped together, using LTO (link-time optimization) / IPO (interprocedural optimization) to shrink hot regions, applying PGO, and avoiding excessive inlining.

 

2. CPU backend optimization

During computer processing, after the front-end completes instruction fetching and decoding, if the back-end resources are insufficient to process new micro-operations, the front-end will be unable to continue delivering micro-operations. For example, when the data cache misses or the division unit is overloaded, the backend cannot process instructions efficiently, causing the frontend to stall.

1. Memory bound

When an application spends a large share of its time on memory accesses and waits long for them to complete, it is considered memory bound. This calls for optimizing memory accesses, reducing their number, or upgrading the memory subsystem.

In TMA, Memory Bound counts the pipeline slots that are stalled because of demand load or store instructions. The first step in solving such performance problems is to locate the memory accesses that drive the high Memory Bound metric and identify the specific access operations, and then start tuning.

1) Cache-friendly data types

Cache-friendly algorithms and data structures are one of the key elements of performance, focusing on the principle of temporal and spatial locality, with the goal of efficiently reading the required data from the cache.

  • Access data sequentially

The best way to exploit cache space locality is to access memory sequentially. The standard implementation of binary search does not exploit spatial locality, and one way to solve this problem is to store array elements in an Eytzinger layout. The idea is to maintain an implicit binary search tree and pack the binary search tree into an array using a breadth-first search-like layout.
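For illustration, a compact sketch of the Eytzinger layout and a branchless search over it (index 0 is left unused; the helper names are illustrative):

#include <vector>

// Build an Eytzinger (breadth-first) layout from a sorted array: node k has
// children 2k and 2k+1, so a search walks nearly contiguous prefixes of the
// array instead of jumping across it.
// Usage: std::vector<int> e(sorted.size() + 1); std::size_t i = 0;
//        buildEytzinger(sorted, e, i);
void buildEytzinger(const std::vector<int>& sorted, std::vector<int>& out,
                    std::size_t& i, std::size_t k = 1) {
    if (k < out.size()) {
        buildEytzinger(sorted, out, i, 2 * k);       // left subtree
        out[k] = sorted[i++];
        buildEytzinger(sorted, out, i, 2 * k + 1);   // right subtree
    }
}

// Returns the Eytzinger index of the first element >= x, or 0 if none exists.
int searchEytzinger(const std::vector<int>& e, int x) {
    int n = (int)e.size() - 1;      // e[1..n] holds the data
    int k = 1;
    while (k <= n)
        k = 2 * k + (e[k] < x);     // branchless descent
    k >>= __builtin_ffs(~k);        // undo the trailing "right turns"
    return k;
}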

  • Use appropriate containers

Almost all languages ​​provide a variety of ready-made containers, and it is crucial to understand their underlying storage mechanisms and performance impact. When processing data, you need to choose an appropriate data storage method based on the specific conditions of the code.

  • Packed data

One way to improve memory hierarchy utilization is to make data more compact. A classic example of packed data is bit storage (bit fields), which greatly reduces the amount of memory moved back and forth while saving cache space. However, when fields such as a and b share a machine word with c, the compiler must emit extra shift and mask operations to access them; packing the data is worthwhile only when the cost of this extra computation is lower than the cost of the inefficient memory transfers it avoids.

Programmers can also rearrange the order of fields in a structure or class to reduce the padding the compiler inserts and thus the memory used. For example, ordering members from largest to smallest avoids alignment gaps, and in some cases a Boolean flag can even be packed into spare bits of an integer field to save memory.

  • Alignment and padding

Access is most efficient when a variable is stored at a memory address that is a multiple of its size. Alignment can leave unused bytes as gaps, reducing memory bandwidth utilization, and structure members sometimes need to be padded to avoid edge cases such as cache contention and false sharing: for example, when two threads access different fields of the same structure, cache-coherence traffic can slow the program dramatically, and padding ensures that those fields land in different cache lines. When using malloc for dynamic allocation, make sure the returned address meets the minimum alignment requirement of the target platform. The bottom line is that SIMD code using compiler vectorization intrinsics usually needs addresses divisible by 16, 32 or 64.

  • Dynamic memory allocation

Alternatives to malloc tend to be faster, more scalable, and handle memory fragmentation issues more efficiently. One challenge with dynamic memory allocation is that multiple threads may try to apply for memory at the same time, resulting in reduced efficiency.

To solve this problem, you can use a custom allocator to speed up memory allocation. The advantage of this type of allocator is that it has lower overhead because it avoids making a system call for every memory allocation. At the same time, it is also highly flexible, allowing developers to implement their own allocation strategies based on the memory area of ​​the operating system. One strategy is to maintain two different allocators, one for hot data and one for cold data. Keeping hot data together allows cache lines to be shared, improving memory bandwidth utilization and spatial locality. At the same time, this strategy can also improve TLB utilization because hot data occupies fewer memory pages. Additionally, a custom memory allocator can leverage thread-local storage to enable independent allocation for each thread, thereby eliminating inter-thread synchronization issues.

  • Tuning code for memory hierarchy

The performance of some applications depends on the size of a specific layer's cache, the most famous example being the use of loop blocking to improve matrix multiplication.
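The matrix multiplication example can be sketched with loop blocking as follows (row-major n×n matrices; the tile size B is an assumption to be tuned to the target cache, and C is assumed to be zero-initialized):

#include <algorithm>
#include <vector>

// Blocked (tiled) matrix multiplication: each BxB tile is reused while it is
// resident in cache, instead of streaming whole rows and columns through
// memory for every output element.
void matmulBlocked(const std::vector<float>& A, const std::vector<float>& Bm,
                   std::vector<float>& C, int n, int B = 64) {
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < std::min(ii + B, n); ++i)
                    for (int k = kk; k < std::min(kk + B, n); ++k) {
                        float a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + B, n); ++j)
                            C[i * n + j] += a * Bm[k * n + j];
                    }
}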

2) Explicit memory prefetching

When the arr array is large, the hardware prefetcher may fail to recognize the access pattern and fetch the needed data ahead of time. To manually insert a prefetch in the time window between the computation of j and the access to arr[j], __builtin_prefetch can be used as follows:

for (int i = 0; i < N; ++i) {
    int j = calNextIndex();                 // the next index is known well before it is used
    __builtin_prefetch(arr + j, 0, 1);      // hint: read access (0), low temporal locality (1)
    // ...
    doSomeExtensiveComputation();           // enough work to hide the prefetch latency
    // ...
    x = arr[j];                             // by now the cache line should already be resident
}

For prefetching to be effective, prefetch hints need to be inserted early to ensure that the values ​​used for calculations are loaded into the cache at the time of calculation, while avoiding premature insertion of prefetch hints to avoid polluting the cache.

Explicit memory prefetching is not portable, and performance improvements on one platform are not guaranteed to have the same effect on another. Additionally, explicit prefetch instructions increase code size and increase pressure on the CPU front end.

3) Optimized for DTLB

The TLB is split into an ITLB and a DTLB at L1, with a unified TLB at L2. The penalty of an L1 ITLB miss is very small and usually hidden by out-of-order execution; a miss in the unified TLB invokes the page walker and may cause a noticeable performance loss.

The Linux default page size is 4KB; using larger pages reduces the number of TLB entries needed and the number of misses. Intel 64 and AMD64 support 2MB and 1GB huge pages. With huge pages the TLB is more compact, and walking the kernel page table is cheaper.

In Linux systems, applications use huge pages in two ways: explicit huge pages and transparent huge pages. Hugepage memory can be dynamically allocated using the libhugetlbfs library. Developers can control access to huge pages in the following ways: use mmap with the MAP_HUGETLB parameter; use mmap to mount files in the hugetlbfs file system; use shmget with the SHM_HUGETLB parameter.
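A minimal sketch of requesting explicit huge pages through mmap with MAP_HUGETLB (Linux-only; the call fails unless huge pages have been reserved, for example via /proc/sys/vm/nr_hugepages):

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    const std::size_t kSize = 2u * 1024 * 1024;   // one 2MB huge page
    // MAP_HUGETLB asks the kernel to back this anonymous mapping with huge pages.
    void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        std::perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    std::memset(p, 0, kSize);                     // touch the mapping
    munmap(p, kSize);
    return 0;
}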

2. Compute bound

There are two main kinds of compute-bound bottleneck: a shortage of hardware compute resources and dependencies in the software's instructions. The former means execution-unit overload or execution-port contention, which occurs when a workload frequently executes many heavy instructions; the latter means dependencies in the program's data or instruction flow limit performance. Common optimizations such as function inlining, vectorization and loop optimization are discussed below; they aim to reduce the total number of instructions executed and improve performance.

1) Function inlining

Inlining a function not only eliminates call overhead but also widens the compiler's analysis scope, enabling further optimization. However, inlining may increase binary size and compilation time. Compilers usually decide whether to inline based on a cost model, such as LLVM's, which considers computational cost and the number of call sites. In general, small functions and functions with a single call site are more likely to be inlined, while large and recursive functions usually are not. A call made through a function pointer can sometimes be replaced by a direct call so that it becomes inlinable. Developers can use hints such as C++11's gnu::always_inline attribute to force inlining. Another approach is to use profiling data to identify inlining candidates, in particular functions where the overhead of argument passing and returning is incurred frequently.

2) Loop optimization

Loops are the most frequently executed sections of code in a program, so most of the execution time is spent in loops. The performance of a loop is limited by memory latency, memory bandwidth, or computing power. Roofline modeling is a good way to evaluate different loops based on the theoretical maximum of the hardware, and TMA analysis is another way to deal with this bottleneck.

  • Low-level optimization

Loop-invariant code motion improves arithmetic intensity by moving expressions that do not change across iterations out of the loop. Loop unrolling increases instruction-level parallelism while reducing loop overhead, but developers are discouraged from unrolling loops by hand, because the compiler is very good at unrolling them optimally and, with out-of-order execution, the processor effectively has a "built-in unroller". Loop strength reduction replaces expensive instructions with cheaper ones; it is applied to expressions and array indices that depend on the loop variable, with the compiler analyzing how each variable's value evolves across iterations. In addition, hoisting loop-invariant conditional checks out of the loop (loop unswitching) also helps performance.

  • High-level optimization

Such optimizations change the loop structure deeply and can affect the overall performance of multiple nested loops. Their fundamental purpose is to improve memory access efficiency and relieve memory bandwidth and latency bottlenecks. Typical strategies include: interchanging the order of nested loops so that elements of multi-dimensional arrays are accessed in a more sequential order, removing bandwidth and latency limitations; splitting the iteration space of multi-dimensional loops into blocks (loop blocking/tiling) so that each block's data fits the CPU cache, optimizing the bandwidth and latency of strided memory accesses; and, where possible, fusing several independent loops into one to reduce loop overhead while improving the temporal locality of memory accesses.

But it should be noted that loop merging does not always improve performance. Sometimes it may be more effective to split the loop into multiple paths, pre-filter the data, sort and reorganize the data, etc. Splitting the loop helps solve the high cache contention problem that occurs in large loops. It also reduces register pressure and enables further individual optimization of small loops with the help of the compiler.

3) Discover opportunities for cycle optimization

When the compiler optimization report shows that a transformation failed, examine the hot assembly code identified by profiling. Optimization should start with the simple options; developers then need to identify the loop's bottleneck and evaluate performance against the hardware's theoretical maximums. The roofline model helps pinpoint the bottlenecks worth analyzing, after which various transformations can be tried.

4) Use loop optimization framework

The polyhedron framework can be used to check the legality of loop conversions and convert loops automatically. Polly is a high-level loop and data locality optimizer and optimization infrastructure based on LLVM, which uses an abstract mathematical representation based on integer polyhedrons to analyze and optimize the memory access pattern of the program. To enable Polly, the user needs to enable it through an explicit compiler option (-mllvm -polly), because the standard pipeline of the LLVM infrastructure does not enable Polly by default.

3. Vectorization

Using SIMD instructions can significantly speed up unvectorized code. One of the key points of performance analysis is to ensure that critical code can be vectorized correctly by the compiler. If the compiler cannot generate the required assembly instructions, you can rewrite the code fragment using compiler built-in functions. Code that uses compiler built-in functions is similar to inline assembly code and is less readable. You can usually guide the compiler to perform automatic vectorization by using compilation annotations. The compiler can perform three vectorization operations: inner loop automatic vectorization, outer loop vectorization and superword vectorization.

1) Compiler automatic vectorization

Compiler automatic vectorization is hindered by several factors, including the inherent semantics of the programming language and limitations of the processor's vector operations. These factors make it difficult for compilers to efficiently convert loops into vectorized code. However, through the three stages of legality check, benefit check and conversion, the code can be gradually optimized and the running speed of the program can be improved. During the legality check phase, the loop vectorization is evaluated to see whether it satisfies a series of conditions to ensure that the generated code is correct and valid. During the benefit checking phase, different vectorization factors are compared and the optimal solution is selected, taking into account the execution cost and efficiency of the code. Finally, during the conversion phase, vectorized execution is enabled by inserting vectorized guard code, and the code is optimized to run faster.

2) Explore opportunities for vectorization

The easiest way to analyze the hot loops in a program and check what optimizations the compiler performed is to look at the compiler's vectorization remarks: when a loop cannot be vectorized, the compiler reports the reason for the failure. Another approach is to examine the program's assembly output, ideally the annotated output of a profiling tool. Although reading assembly is time-consuming, the skill pays off because it reveals suboptimal code, missing vectorization, suboptimal vectorization factors, unnecessary computation and so on.

Vectorization remarks explain what the problem is and why the compiler could not vectorize the code; for example, GCC 10.2 can emit optimization reports (enabled with -fopt-info). Developers should also be aware of the hidden costs of vectorized code, especially with AVX-512, which can cause significant frequency throttling. For loops with small trip counts, the vectorizer can be forced to use a small vectorization factor or unroll count to reduce the number of elements the loop processes per iteration.
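As an illustration, Clang's loop pragmas can explicitly request vectorization; the pragma is only a hint, the legality checks described above still apply, and the width of 8 is an assumption rather than a universally good value:

// Ask Clang to vectorize this loop with 8-wide vectors; __restrict promises
// that x and y do not alias, removing one common obstacle to vectorization.
void axpy(float a, const float* __restrict x, float* __restrict y, int n) {
#pragma clang loop vectorize(enable) vectorize_width(8)
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

Compiling with -Rpass=loop-vectorize (Clang) or -fopt-info-vec (GCC) then shows whether the loop was in fact vectorized.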

Multi-GPU programming

CUDA provides features for multi-GPU programming, including managing multiple devices in one or more processes, direct access to other device memory using unified virtual addressing, GPUDirect, and overlapping multi-device computing communication using streams and asynchronous functions.

1. From one GPU to multiple GPUs

Using multiple GPUs is an effective way to improve computational efficiency and throughput when processing large-scale data sets. Multi-GPU systems achieve efficient inter-GPU communication through different connection methods, such as through PCIe bus or network switches in a cluster. In multi-GPU applications, workload distribution and data exchange patterns are key factors. The most basic mode is to run each problem partition on a separate GPU, while more complex modes need to consider how to optimally move data between devices to avoid copying the data to the host and then to another GPU.

1. Execute on multiple GPUs

CUDA's cudaGetDeviceCount function determines the number of CUDA devices available in the system. In applications that leverage CUDA to collaborate with multiple GPUs, the target GPU must be explicitly specified. The current device can be set using the cudaSetDevice(int id) function. This function sets the device with a specific ID as the current device. There is no synchronization with other devices, so the overhead is low.

If cudaSetDevice is not called explicitly before the first CUDA API call, the current device is automatically set to device 0. Once a current device is selected, all CUDA operations are applied to it, including: device memory allocated from the host thread, host memory allocated through CUDA runtime functions, streams and events created by the host thread, and kernels launched by the host thread.
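A short sketch of device discovery and selection (error handling omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);               // number of visible CUDA devices
    for (int i = 0; i < ngpus; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);    // name, memory, compute capability
        std::printf("GPU %d: %s, %.1f GiB\n", i, prop.name,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    cudaSetDevice(0);                         // make device 0 the current device
    return 0;
}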

Multi-GPU is suitable for the following scenarios: single-node single-thread, single-node multi-thread, single-node multi-process, and multi-node multi-process. The following code shows how to perform kernel and memory copies in the host thread:

for (int i = 0; i < ngpus; i++) {
    cudaSetDevice(i);               // make device i current
    kernel<<<grid, block>>>(...);   // asynchronous kernel launch on device i
    cudaMemcpyAsync(...);           // asynchronous copy issued in the same context
}

Because the kernel launch and data transfer within the loop are asynchronous, control is quickly returned to the host thread after each operation.

2. Point-to-point communication

On devices with compute capability 2.0 or above, kernels in a 64-bit application can directly access the global memory of GPUs attached to the same PCIe root node, using the CUDA peer-to-peer (P2P) API for direct communication between devices; this feature requires CUDA 4.0 or later. Peer-to-peer access and peer-to-peer transfer are the two modes the CUDA P2P API supports. When GPUs hang off different PCIe root nodes, direct peer-to-peer access is not allowed; the CUDA P2P API can still be used for peer-to-peer transfers, but the data is then staged through host memory.

1) Enable peer-to-peer access

Point-to-point access allows the GPU to directly reference data on the memory of other GPU devices connected to the same PCIe root node. Use cudaDeviceCanAccessPeer to check whether the device supports P2P. If the device can directly access the global memory of the peer device, it returns 1, otherwise it returns 0. Between two devices, point-to-point memory access must be explicitly enabled using cudaDeviceEnablePeerAccess. This function allows point-to-point access from the current device to peerDevice. Authorized access is one-way. Peer-to-peer access remains enabled until explicitly disabled by cudaDeviceDisablePeerAccess. 32-bit applications do not support peer-to-peer access.

2) Point-to-point memory copy

After enabling peer access between two devices, you can use the cudaMemcpyPeerAsync function to asynchronously copy data on the device. This function transfers data from the source device srcDev to the destination device dstDev. If srcDev and dstDev are connected on the same PCIe root node, data transfer will be performed along the shortest path of PCIe without relaying through host memory.
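A brief sketch of such a copy (buffer names and sizes are illustrative):

#include <cstddef>
#include <cuda_runtime.h>

// Copy nbytes from d_src (allocated on device srcDev) to d_dst (allocated on
// device dstDev) without staging through the host, provided peer access
// between the two devices has been enabled.
void p2pCopy(void* d_dst, int dstDev, const void* d_src, int srcDev,
             std::size_t nbytes, cudaStream_t stream) {
    cudaMemcpyPeerAsync(d_dst, dstDev, d_src, srcDev, nbytes, stream);
}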

3. Synchronization between multiple GPUs

In multi-GPU applications, streams and events are associated with a single device. Typical workflows include: selecting a GPU set, creating streams and events for each device, allocating device resources, starting tasks through streams, querying and waiting for tasks to complete and clearing resources. Only devices associated with the stream can start the kernel and log events. Memory copies can be made in any stream, regardless of device and current state. You can query or synchronize streams or events even if they are not relevant to the current device.

2. Subdivision calculation between multiple GPUs

1. Allocate memory on multiple devices

Before assigning tasks to multiple devices, you first need to determine the number of GPUs available in your system. Get the number of GPUs through cudaGetDeviceCount and print it.

Next, declare the required memory and streams for each device. Use cudaSetDevice to allocate memory and streams for each device.

For each device, allocate a certain amount of host memory and device memory, and create a stream. For asynchronous data transfer between the device and the host, page-locked memory also needs to be allocated.

Finally, use a loop to do the following for each device:

1) Set the current device

2) Allocate device memory: cudaMalloc

3) Allocate host memory: cudaMallocHost

4) Create stream: cudaStreamCreate

In this way, each device is allocated memory and streams, ready for task distribution and data transfer.
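A sketch of the allocation loop described above (array names and sizes are illustrative):

#include <cstddef>
#include <cuda_runtime.h>

void allocatePerDevice(int ngpus, std::size_t iBytes,
                       float* d_A[], float* h_A[], cudaStream_t streams[]) {
    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);                           // 1) select the device
        cudaMalloc((void**)&d_A[i], iBytes);        // 2) device memory
        cudaMallocHost((void**)&h_A[i], iBytes);    // 3) page-locked host memory
        cudaStreamCreate(&streams[i]);              // 4) per-device stream
    }
}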

2. Single host thread allocation work

// Initialize the state of the host array for each device before distributing operations between devices

for (int i = 0; i < ngpus; i++)

{

    cudaSetDevice(i);

    initial(h_A[i], iSize);

    initial(h_B[i], iSize);

}

// Distribute data and calculations across multiple devices

for (int i = 0; i < ngpus; i++)

{

    cudaSetDevice(i);

    cudaMemcpyAsync(d_A[i], h_A[i], iBytes, cudaMemcpyHostToDevice, streams[i]);

    cudaMemcpyAsync(d_B[i], h_B[i], iBytes, cudaMemcpyHostToDevice, streams[i]);

    iKernel<<<grid, block, 0, streams[i]>>>(d_A[i], d_B[i], d_C[i], iSize);

    cudaMemcpyAsync(gpuRef[i], d_C[i], iBytes, cudaMemcpyDeviceToHost, streams[i]);

}

cudaDeviceSynchronize();

This loop iterates over the GPUs and asynchronously copies the input arrays to each device, then launches the kernel in the same stream to operate on iSize data elements, and finally issues an asynchronous copy that returns the kernel's results to the host. Because all of these operations are asynchronous, control returns to the host thread immediately.

3. Point-to-point communication on multiple GPUs

The following tests three cases: unidirectional memory copy between two GPUs; bidirectional memory copy between two GPUs; and access to peer device memory from inside a kernel.

1. Achieve point-to-point access

First, bidirectional peer-to-peer access must be enabled on all devices. The code is as follows:

// Enable bidirectional point-to-point access

inline void enableP2P(int ngpus)

{

    for (int i = 0; i < ngpus; i++)

    {

        cudaSetDevice(i);

        for (int j = 0; j < ngpus; j++)

        {

            if (i == j)

                continue;

            

            int peer_access_available = 0;

            cudaDeviceCanAccessPeer(&peer_access_available, i, j);

            if (peer_access_available)

            {

                cudaDeviceEnablePeerAccess(j, 0);   // second argument is reserved flags and must be 0

                printf(" > GPU%d enabled direct access to GPU%d\n", i, j);

            }

            else

                printf("(%d, %d)\n", i, j);

        }

    }

}

The function enableP2P traverses all device pairs (i, j), and if point-to-point access is supported, the cudaDeviceEnablePeerAccess function is used to enable bidirectional point-to-point access.

2. Point-to-point memory replication

The most likely reason why peer-to-peer access cannot be enabled is that they are not connected to the same PCIe root node. If peer-to-peer access is not supported between two GPUs, then peer-to-peer memory copies between the two devices will be relayed through host memory, degrading performance.

With peer-to-peer access enabled, the following code performs a ping-pong synchronized memory copy between two devices 100 times.

// ping-pong unidirectional gmem copy

cudaEventRecord(start, 0);

for (int i = 0; i < 100; i++)
{
    if (i % 2 == 0)
        cudaMemcpy(d_src[1], d_src[0], iBytes, cudaMemcpyDeviceToDevice);   // GPU0 -> GPU1
    else
        cudaMemcpy(d_src[0], d_src[1], iBytes, cudaMemcpyDeviceToDevice);   // GPU1 -> GPU0
}

Note that no device is specified before the memory copy, since memory copying across devices does not require explicit setting of the current device. If the device is specified before memory copying, it does not affect its behavior.

To measure the performance of data transfer between devices, start and stop events need to be recorded on the same device and include ping-pong memory replication. Then, use cudaEventElapsedTime to calculate the time elapsed between the two events.

// ping-pong unidirectional gmem copy, timed with CUDA events

cudaSetDevice(0);
cudaEventRecord(start, 0);

for (int i = 0; i < 100; i++)
{
    if (i % 2 == 0)
        cudaMemcpy(d_src[1], d_src[0], iBytes, cudaMemcpyDeviceToDevice);
    else
        cudaMemcpy(d_src[0], d_src[1], iBytes, cudaMemcpyDeviceToDevice);
}

cudaSetDevice(0);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsed_time_ms;
cudaEventElapsedTime(&elapsed_time_ms, start, stop);
elapsed_time_ms /= 100.0f;

printf("Ping-pong unidirectional cudaMemcpy: \t\t %8.2f ms", elapsed_time_ms);
printf("performance: %8.2f GB/s\n", (float)iBytes / (elapsed_time_ms * 1e6f));

Because the PCIe bus supports a full-duplex channel between any two endpoints, asynchronous copy functions can also be used to perform bidirectional and point-to-point memory copies.

// bidirectional asynchronous gmem copy

for (int i = 0; i < 100; i++)
{
    if (i % 2 == 0)
        cudaMemcpyAsync(d_src[1], d_src[0], iBytes, cudaMemcpyDeviceToDevice, stream[0]);   // GPU0 -> GPU1
    else
        cudaMemcpyAsync(d_rcv[0], d_rcv[1], iBytes, cudaMemcpyDeviceToDevice, stream[1]);   // GPU1 -> GPU0, in a different stream so both directions overlap
}

Note that since the PCIe bus is used in both directions at once, the bandwidth obtained is doubled.

Development and bottlenecks of China’s computing power industry

1. Market size: As a carrier of computing power, servers benefit from the increasing demand for cloud computing.

1. Industrial chain: The demand for computing power in various downstream fields drives the development of the server industry

The upstream of the server industry chain is mainly electronic materials and components/assembly. The midstream includes various server products, including cloud servers, smart servers, edge servers, and storage servers. Downstream demand entities include data center service providers, Internet companies, government departments, financial institutions, telecom operators, etc.

Panoramic view of server industry chain

2. Cloud computing: The Internet has the greatest demand for computing power, followed by government, services, etc.

In the field of general computing power, the Internet industry is still the industry with the largest demand for computing power, accounting for 39% of general computing power. The telecommunications industry has increased investment in computing power infrastructure, and its computing power share has surpassed the government industry for the first time, ranking second. The government, services, finance, manufacturing, education, transportation and other industries ranked third to eighth.

In the field of intelligent computing power, the Internet industry's demand for data processing and model training continues to grow, becoming the industry with the largest demand for intelligent computing power, accounting for 53% of intelligent computing power. The service industry is rapidly shifting from traditional models to emerging smart models, and its computing power share ranks second. The government, telecommunications, manufacturing, education, finance, transportation and other industries ranked third to eighth respectively.

 

2. Cloud computing: China’s market is growing faster than the world and is expected to exceed one trillion yuan in 2025

According to Gartner, the global cloud computing market reached US$491 billion in 2022, up 19% year on year, though the growth rate was 13.5 percentage points lower than the year before. According to statistics from the China Academy of Information and Communications Technology, China's cloud computing market reached 455 billion yuan in 2022, up 40.91% year on year.

Global cloud computing market size and growth rate

Cloud computing remains an important driving force for the integration of new technologies and the development of new business formats. Stimulated by demand for large models and computing power, the market is expected to keep growing steadily, with the global cloud computing market exceeding one trillion US dollars by 2026. Compared with the global growth rate of 19%, China's cloud computing market is still in a stage of rapid development and has remained resilient despite the economic downturn; China's overall cloud computing market is expected to exceed one trillion yuan by 2025.

China’s cloud computing market size and growth rate

3. Server: The sales end is concentrated, and the purchasing end is dominated by technology giants.

According to data previously released by IDC, the main suppliers in China's server market in 2022 include Inspur Information, H3C, Super Fusion (xFusion), Ningchang and ZTE.

China’s server market share in 2022

The domestic AI server industry adopts a CPU + accelerator chip architecture, which has efficiency advantages in model training and inference. Inspur Information holds the largest domestic market share, followed by H3C, Ningchang, Anqing and others.

 China’s AI server market share in 2022

With the rise of cloud computing, mobile Internet, the Internet of Things, big data, artificial intelligence and other technologies, Internet giants have gradually replaced government and banks as the main purchasers of servers. Before 2012, server customers were mainly governments, banks and other financial institutions, telecom operators and other large enterprises; today they are mainly technology giants. Cloud computing companies such as Amazon, Microsoft and Google overseas, and Alibaba, Tencent and others in China, have gradually become the main purchasing customers in the server market.

Server scale of major cloud vendors in China in 2022

According to IDC forecast data, the global server market size will be almost the same year-on-year in 2023, while the server market will maintain a growth rate of 8-11% in 2024 and beyond, and the market size is expected to reach US$178 billion by 2027.

 Global server market size from 2022 to 2027 (USD billion)

China's server market size will be approximately US$27.34 billion in 2022, a year-on-year increase of 9%, and the growth rate has slowed down. According to data from the Huajing Industrial Research Institute, the market size will reach US$30.8 billion in 2023, with a growth rate of 13%. With the advancement of the Eastern Data and Western Computing Project, the rapid growth of massive data computing and storage requirements and other factors, China's overall server procurement demand will further increase. IDC predicts that China's AI server market will reach US$16.4 billion by 2027.

 

China's server market size from 2016 to 2023E (USD billion)

2. The key to the bottom layer: CPU is the brain of the server, and there is broad room for domestic substitution.

1. Key role: CPU is the brain of the server, and GPU has strong parallel computing capabilities

The CPU is the server's control center, responsible for planning strategy, issuing instructions and controlling execution; its structure includes arithmetic units, control units, registers, caches and the buses that connect them. Because workloads such as graphics rendering, numerical analysis and AI inference decompose into huge numbers of small mathematical operations, the GPU uses its many streaming multiprocessors to break a large task into small operations and process them in parallel. CPU and GPU are two different types of processors: the CPU is a general-purpose processor that executes program-controlled, largely sequential work, while the GPU is a special-purpose processor for specific domains and is driven by the CPU. In many terminal devices, the CPU and GPU are integrated into a single chip that provides both kinds of processing capability.

GPU invests more transistors for data processing and has strong parallel computing capabilities

2. The key to value: CPU and GPU account for the bulk of the hardware cost in various types of servers

In terms of server hardware cost structure, the CPU and chipset, memory and external storage are the main components: in an ordinary server, the CPU and chipset account for about 32% of cost, memory about 27%, external storage about 18%, and other hardware about 23%. In an AI server, the GPU's share of cost is far higher than that of any other part and can approach 70% of the total. When upgrading from ordinary servers to AI training servers, the other components whose per-server value increases most, including memory, SSDs, PCBs and power supplies, grow severalfold.

 

Internal disassembly diagram of the server

3. Processor: CPU dominates, GPU grows rapidly

According to a report by Yole Intelligence, the revenue of the processor market is expected to reach US$242 billion by 2028, with a compound annual growth rate of 8%. The dominant position in the CPU market will be consolidated, and the market size will reach US$97 billion in 2028, with a compound annual growth rate of 6.9%. The GPU market will also achieve significant growth, with the market size reaching US$55 billion in 2028, with a compound annual growth rate of 16.5%. In the processor market, giants such as Intel, AMD, NVIDIA, and UNISOC dominate the market. In terms of processors used in servers at home and abroad, Intel, AMD, NVIDIA, Loongson, Zhaoxin, Kunpeng, Haiguang, Feiteng, Shenwei, Shengteng, etc. dominate.

 

Processor Revenue Forecast by Processor Type, 2022-2028

4. China continues its scientific and technological research under multiple rounds of U.S. sanctions

From May 2019 to September 2020, the U.S. government imposed multiple rounds of sanctions on Huawei, cutting off its supply of 5G mobile phone chips and causing a sharp decline in Huawei's phone sales. Since then, the United States has continued to escalate its restrictions on China's semiconductor sector. Nevertheless, Huawei's latest flagship phone uses the 7nm-process Kirin 9000s chip, a milestone for China in chip design and manufacturing.

On October 17, 2023, the U.S. Department of Commerce's Bureau of Industry and Security announced new cutting-edge chip export control rules totaling nearly 500 pages, comprehensively restricting U.S. chip giants such as Nvidia and Intel from exporting "special edition" chips to China and to more than 40 other countries. In addition, the "long-arm jurisdiction" over semiconductor equipment and technology was updated, expanding the range of models that the Dutch lithography company ASML may not export to China and extending the restrictions to more than 20 countries beyond China. At the same time, 13 Chinese entities, including Biren Technology and Moore Threads, were added to the U.S. control list, restricting Chinese companies from producing advanced chips through foundries.

On October 17, the U.S. Department of Commerce’s Bureau of Industry and Security announced new regulations

3. Wherever the bottleneck of computing power is, there will be opportunities.

Computing power is the core of modern computer technology, and its bottlenecks mainly exist in data transmission and storage. At present, computers generally adopt the von Neumann architecture, and data storage and data calculation are separated. The computing power is easily stuck in data transmission rather than real calculation. Computing power is divided into four layers, and each layer needs to solve the problem of how to make data connections faster.

1. Inside the GPU

Data transmission between the GPU's compute units and its memory is a bottleneck for performance improvement, and collaborative computing across multiple GPUs is likewise limited by data transfer speed. Traditional GPUs usually use GDDR memory placed flat on the board, so data transfer speed cannot keep up with the GPU's compute speed. The upgraded solution is HBM, which stacks memory vertically in the package and provides far greater bandwidth for feeding data to the GPU's compute units; for example, HBM2 delivers up to 256GB/s of bandwidth, more than ten times that of traditional GDDR memory.

2. AI server

Each AI server contains multiple GPUs (4, 8 or even more) that must compute collaboratively, and the data transfer speed between them becomes a performance bottleneck. Nvidia's GPU interconnect technology is the most advanced here: its NVLink protocol reaches transfer speeds of up to 50GB per second. Huawei has its own HCCS protocol, with a solid 30GB per second of bandwidth, not far behind NVIDIA. Other traditional servers, however, use only the standard PCIe 5.0 interface, whose transfer speed is only about 4GB/s per lane, less than one-tenth of Nvidia's. More advanced technologies and protocols are therefore needed to raise data transfer speeds and remove this bottleneck.

3. Data center

A data center consists of hundreds or even thousands of AI servers forming a computing cluster, which requires fast data connections between servers. Nvidia uses a dedicated InfiniBand network, while other manufacturers use RoCE-based high-speed Ethernet. Both networks use optical fiber at the physical layer and therefore depend on optical modules: whether sending or receiving, on the server side or the switch side, optical modules are required. This year, optical module technology has been upgraded from 400G to 800G. Because Chinese manufacturers hold a large share of optical module manufacturing, this is an area where domestic performance gains can actually be realized, which is why optical modules have become the most hyped segment of the computing power field.
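
As a rough illustration of why optical module demand scales with cluster size, the sketch below counts modules for a simplified two-tier, non-blocking leaf-spine fabric. The assumptions (one 400G port per GPU, every link optical, two module ends per link, one additional link per GPU per tier) are deliberately crude; real topologies such as three-tier fat trees need proportionally more modules.

```python
# Rough count of optical modules for a two-tier (leaf-spine), non-blocking GPU fabric.
# Assumptions (illustrative, not a vendor design): one 400G NIC port per GPU,
# every link is optical, and every optical link needs a module at BOTH ends.

def optical_modules(servers, gpus_per_server=8, tiers=2):
    gpus = servers * gpus_per_server
    modules_per_tier = 2 * gpus          # one link per GPU per tier -> two module ends
    return gpus, tiers * modules_per_tier

for servers in (125, 1250):              # roughly 1k-GPU and 10k-GPU clusters
    gpus, modules = optical_modules(servers)
    print(f"{servers:5d} servers / {gpus:6d} GPUs -> ~{modules} optical modules")
```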

4. Data network

Data centers in different locations and cities can be linked into a huge computing power network; through scheduling and coordination, end users can easily use the fastest and cheapest computing resources. The current trend in computing power networks is a cloud-edge-end architecture aimed at reducing data transmission. Edge computing is one of the most popular pieces of this: it refers not just to mobile phones and smart vehicles, but to an additional layer of compute placed closer to the terminal, outside the traditional cloud computing center, to save the cost and time of moving data. The future trend is therefore to combine cloud AI computing power, edge AI computing power, and on-device AI computing power to jointly advance artificial intelligence technology.
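
The toy placement sketch below illustrates the cloud-edge-end trade-off: for each candidate site it adds transfer time to compute time and picks the smallest total. All bandwidth, latency and throughput numbers are invented for illustration; with these particular values the nearby edge node wins, because uploading the data to the cloud costs more time than the cloud's faster compute saves.

```python
# Toy cloud-edge-end placement: run a job where (transfer time + compute time) is smallest.
# All bandwidth / latency / throughput numbers below are illustrative assumptions.

SITES = {
    #            uplink bytes/s, round-trip s, effective FLOP/s
    "device":  (None,            0.0,          2e12),    # run locally, no transfer
    "edge":    (100e6,           0.005,        50e12),   # nearby edge node
    "cloud":   (20e6,            0.040,        500e12),  # remote data center
}

def total_latency(input_bytes, flops, site):
    bw, rtt, speed = SITES[site]
    transfer = 0.0 if bw is None else input_bytes / bw + rtt
    return transfer + flops / speed

job = {"input_bytes": 2e6, "flops": 5e12}   # e.g. one camera frame through a small model
best = min(SITES, key=lambda s: total_latency(job["input_bytes"], job["flops"], s))
for s in SITES:
    print(f"{s:7s}: {total_latency(job['input_bytes'], job['flops'], s)*1e3:7.1f} ms")
print("run on:", best)
```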

Blue Ocean Brain Deep Learning Big Data Platform

The Blue Ocean Brain deep learning big data platform is a processing platform for multi-source spatial data. It integrates storage, computing and data processing software, and offers significant advantages such as high efficiency, ease of operation, low cost, multi-level expansion and rapid deployment. It is widely used in basic surveying and mapping, agriculture, forestry, water conservancy, environmental protection and other fields, greatly improving image processing capabilities, protecting investments, responding efficiently to big data challenges, and accelerating business breakthroughs and transformation.

1. Main technical indicators

  • Reliability: mean time between failures (MTBF) ≥ 15,000 h

  • Working temperature: 5~40 °C

  • Working humidity: 35%~80%

  • Storage temperature: -40~55 °C

  • Storage humidity: 20%~90%

  • Acoustic noise: ≤ 35 dB

2. Features and advantages:

  • Based on a unified overall architecture, the platform adopts advanced, mature and reliable technologies and software/hardware platforms, ensuring that the basic data platform is easy to expand, upgrade, operate and maintain. Built on the industry's popular and leading Spark technology, it quickly improves the platform's overall computing performance.

  • Supports extensibility of basic data models, application analysis models and front-end applications, as well as scalability of servers, storage, I/O devices and other components within a unified system architecture.

  • Provides high-availability plans, an operation management and monitoring system, an operation and maintenance system, and fault handling plans for the basic data platform, ensuring system reliability in complex multi-user, multi-node environments.

  • Efficiency: completes data writing within the specified time while minimizing the impact of writes on data analysis, and improves the speed of data queries and statistical analysis to meet planned requirements.

  • Data quality runs through every stage of building the basic data platform, which ensures data quality through sound data quality management solutions.

  • Implements data security management in accordance with national standards, industry standards and security regulations.

  • A unified management platform performs performance management and log monitoring of the system.

  • Flexible and diverse presentation methods in the human-machine interface let end users pick up new analysis tools with only modest training, reducing the workload of IT staff and improving the timeliness of cluster supervision.

  • Offers powerful image processing capabilities, handling up to 500 scene pairs (panchromatic and multispectral) of Gaofen-1 image data per day (24 hours).

  • Widely used in basic surveying and mapping, agriculture, forestry, water conservancy, environmental protection and other fields; suitable for routine product production as well as rapid image generation in emergency mode.

To address the problems of the original big data technology stack, the Blue Ocean Brain big data platform has carried out a series of technical developments on top of Apache Hadoop from the perspective of enterprise applications, forming a one-stop big data platform for enterprise-level use that meets the following requirements (a minimal Spark job sketch follows the list below):

  • Distributed storage of very large data sets and real-time computation over streaming data

  • High-concurrency, low-latency query requests over big data

  • Automatic service switching when a distributed application component fails abnormally

  • Linear expansion of the system without additional development work, enabling cost-free scale-out
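
As a minimal illustration of the kind of Spark batch job such a platform runs, the sketch below aggregates hypothetical satellite-scene metadata stored on HDFS. The file path and column names (scene_id, sensor, acquired_date, cloud_pct) are invented for this example and are not part of the actual platform.

```python
# Minimal PySpark sketch of a batch job on a Spark/Hadoop-based platform:
# aggregating (hypothetical) satellite-scene metadata stored on HDFS.
# File paths and column names are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("scene-metadata-stats")
         .getOrCreate())

# Hypothetical CSV with one row per ingested scene: scene_id, sensor, acquired_date, cloud_pct
scenes = spark.read.csv("hdfs:///data/scenes/*.csv", header=True, inferSchema=True)

daily = (scenes
         .filter(F.col("cloud_pct") < 20)              # keep mostly cloud-free scenes
         .groupBy("sensor", "acquired_date")
         .agg(F.count("*").alias("scene_count")))

daily.orderBy("acquired_date").show(20, truncate=False)
spark.stop()
```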

3. Commonly used configuration recommendations

1. CPU:

- Intel Xeon Gold 8358P 32C/64T 2.6GHz 48MB, DDR4 3200, Turbo, HT, 240W

- Intel Xeon Platinum 8458P 28C/56T 2.7GHz 38.5MB, DDR4 2933, Turbo, HT, 205W

- Intel Xeon Platinum 8468 48C/96T 2.1GHz 105MB cache, 350W

- AMD EPYC™ 7742 64C/128T, 2.25GHz (up to 3.4GHz boost), 256MB, DDR4 3200MT/s, 225W

- AMD EPYC™ 9654 96C/192T, 2.4GHz (3.55GHz all-core boost, up to 3.7GHz), 384MB, DDR5 4800MT/s, 360W

- Intel Xeon Platinum 8350C 32C/64T 2.6GHz 48MB, DDR4 3200, Turbo, HT, 240W

- Intel Xeon Gold 6240R 24C/48T, 2.4GHz, 35.75MB, DDR4 2933, Turbo, HT, 165W, 1TB

- Intel Xeon Gold 6258R 28C/56T, 2.7GHz, 38.5MB, DDR4 2933, Turbo, HT, 205W, 1TB

- Intel Xeon W-3265 24C/48T 2.7GHz 33MB, DDR4 2933, 205W, 1TB

- Intel Xeon Platinum 8280 28C/56T 2.7GHz 38.5MB, DDR4 2933, Turbo, HT, 205W, 1TB

- Intel Xeon Platinum 9242 48C/96T 3.8GHz 71.5MB, DDR4 3200, HT, 350W, 1TB

- Intel Xeon Platinum 9282 56C/112T 3.8GHz 71.5MB, DDR4 3200, HT, 400W, 1TB

2. GPU:

- NVIDIA A100, NVIDIA GV100

- NVIDIA L40S GPU 48GB

- NVIDIA NVLink-A100-SXM640GB

- NVIDIA HGX A800 80GB

- NVIDIA Tesla H800 80GB HBM2

- NVIDIA A800-80GB-400Wx8-NvlinkSW

- NVIDIA RTX 3090, NVIDIA RTX 3090 Ti

- NVIDIA RTX 8000, NVIDIA RTX A6000

- NVIDIA Quadro P2000, NVIDIA Quadro P2200
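
Before sizing a training job against one of the configurations above, a quick runtime check of which GPUs are actually visible can be useful. The sketch below uses PyTorch (assumed to be installed) to list each visible device, its memory and its compute capability.

```python
# Quick sanity check of the GPUs actually visible to a training job (requires PyTorch).
# Useful when matching a workload to one of the configurations listed above.

import torch

if not torch.cuda.is_available():
    print("No CUDA-capable GPU visible.")
else:
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        mem_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {mem_gb:.0f} GB, "
              f"compute capability {props.major}.{props.minor}")
```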
