Does AIGC Rely on GPUs or CPUs? The Evolution of Two High-Performance Computing Directions

The AI industry in 2023 can only be described as surging. The launch of ChatGPT made generative AI famous worldwide almost overnight, and many ordinary people who had never paid attention to artificial intelligence began to take a strong interest in large models. Media outlets and research institutions published long-form features on which industries large text and image models such as ChatGPT, Stable Diffusion, and Midjourney would disrupt; many employees and companies even began using these models to boost productivity in their daily work, or to replace human roles outright. There is no doubt that 2023 marks a turning point in the explosion of large-model technology, and a far-reaching technological revolution is slowly getting under way.

In the AI industry, although OpenAI temporarily holds the lead with ChatGPT, the huge market opportunity has drawn a large number of enterprises and research institutions onto the large-model battlefield. Google, Meta, Baidu, Alibaba, ByteDance, Tencent, JD.com, iFLYTEK, Pangu... a long list of Internet giants, startups, and universities have released their own large-model services or plans. ChatGPT has set off an AI arms race, and every Internet company with the resources has joined it, actively or passively, hoping to seize this rare historical opportunity.

The sudden boom in large models has also sent the industry's demand for hardware infrastructure soaring. Ultra-large models with hundreds of billions or even trillions of parameters require enormous computing power, and operating a typical large-model service generally takes thousands of multi-GPU servers. Such demand for compute places a heavy burden on enterprises, and the difficulty of obtaining the core hardware makes the situation worse.

On the other hand, the practical value of ultra-large general-purpose models like ChatGPT in industry has also been questioned. Many argue that in vertical industries, small and medium-sized models optimized for domain knowledge may perform better. These models cost significantly less to train than general-purpose large models, and they do not depend heavily on expensive, hard-to-obtain GPUs: they can run on a new generation of CPUs with built-in AI acceleration hardware, or on dedicated AI accelerator chips, making them a better fit for industry-specific and small-to-medium-business use.

AI Productivity: GPUs Are Not the Only Option

In AI, GPUs are often seen as the only viable computing hardware. With massive parallel computing resources, GPUs can rapidly process the matrix operations at the heart of deep learning, greatly accelerating both model training and inference.

However, high prices, limited memory capacity, supply-chain constraints, and limited scalability have led enterprises and developers to realize that CPU-based solutions can deliver better cost-effectiveness in some AI workloads. For example, Hugging Face's lead AI evangelist Julien Simon recently demonstrated Q8-Chat, a 7-billion-parameter language model running on a 32-core 4th Gen Intel® Xeon® Scalable processor and responding noticeably faster than ChatGPT. Q8-Chat is based on MosaicML's open-source MPT-7B language model and makes full use of the AI acceleration engine in the 4th Gen Intel® Xeon® Scalable processor. And because CPUs have strong serial computing capability, they often outperform GPUs in AI tasks that lean heavily on serial or mixed serial/parallel computing.
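The "Q8" in Q8-Chat refers to 8-bit (INT8) quantization, which shrinks model weights so they fit in CPU caches and map onto INT8 acceleration hardware. Below is a minimal sketch of symmetric per-tensor INT8 weight quantization, the general technique; it is not Q8-Chat's actual implementation, and the helper names and weight values are made up for illustration.

```python
def quantize_int8(weights):
    """Map float weights to int8 values with one shared (symmetric) scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.05, 0.33]
q, scale = quantize_int8(weights)       # q = [82, -127, 5, 33], scale = 0.01
restored = dequantize_int8(q, scale)

# The rounding error is bounded by one quantization step.
assert all(abs(w - r) <= scale for w, r in zip(weights, restored))
```

Each weight now occupies 1 byte instead of 4, at the cost of a small, bounded rounding error; production frameworks refine this idea with per-channel scales and calibration.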

Additionally, while CPUs are usually slower than GPUs for model training, they can deliver comparable performance for inference. At the same time, the CPU's easily expandable memory, broad software compatibility, and excellent scalability give companies greater flexibility when choosing the software stack for an AI inference system. For these reasons, leading Internet companies including Meituan, Alibaba Cloud, and Meta are exploring ways to use CPUs to improve AI inference and training performance in some scenarios, reduce AI hardware procurement costs, and lessen their dependence on specific AI software stacks. The CPU's importance in the AI industry is growing by the day.

From Recommender Systems to Visual Inference: How CPUs Are Taking Off in AI

When it comes to AI hardware, the CPU has long played a supporting role. Developers generally cared only about how many GPU compute cards a CPU could feed and whether the system could run stably over long periods; hardly anyone considered using the CPU itself to carry AI workloads. The reason was simple: compared with a GPU, the CPU's parallel computing power is far too low.

But that situation has now changed. At the end of 2022, the 4th Gen Intel® Xeon® Scalable processors launched with AMX (Advanced Matrix Extensions) acceleration technology, and for the first time CPUs could achieve AI performance comparable to high-end GPUs in many application scenarios. AMX can be thought of as an acceleration module built into the CPU core specifically for AI computation. It is optimized for INT8 and BF16 arithmetic and, compared with the traditional AVX instruction sets, delivers an order of magnitude more instruction throughput per cycle. With AMX, the AI computing capability of the 4th Gen Intel® Xeon® Scalable processor has improved dramatically, achieving better cost-effectiveness than GPUs in some fields.
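BF16 keeps FP32's 8-bit exponent (and therefore its full numeric range) but truncates the mantissa from 23 bits to 7, which is why hardware like AMX can process it so much faster. A small pure-Python simulation of that truncation, keeping only the top 16 bits of an FP32 value:

```python
import struct

def to_bf16(x: float) -> float:
    """Simulate BF16 rounding-toward-zero: keep the top 16 bits of FP32.

    BF16 = 1 sign bit + 8 exponent bits + 7 mantissa bits, i.e. exactly
    the upper half of an IEEE 754 single-precision value.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

print(to_bf16(3.14159))  # 3.140625 — same range as FP32, reduced precision
```

Deep-learning workloads tolerate this precision loss well, which is why BF16 mixed precision (accumulating in FP32) has become the standard CPU training and inference format.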

Recommender Systems

The recommender system is a very important and common artificial intelligence application. It typically includes components such as a knowledge base, topic models, user/video profiles, real-time feedback and statistics, and a recommendation engine. By analyzing massive amounts of data, it delivers personalized content and services to users, helping increase user value.
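At the core of the recommendation engine described above is a scoring step: rank candidate items by how well they match a user's profile. A minimal sketch of that step, scoring items by the dot product between a user embedding and item embeddings (all vectors here are made-up illustrative values, not from any real system):

```python
def dot(u, v):
    """Dot product between two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def recommend(user_vec, item_vecs, k=2):
    """Return the k item ids whose embeddings best match the user embedding."""
    scored = sorted(item_vecs.items(),
                    key=lambda kv: dot(user_vec, kv[1]),
                    reverse=True)
    return [item_id for item_id, _ in scored[:k]]

user = [0.9, 0.1, 0.4]
items = {
    "sneakers": [0.8, 0.0, 0.3],  # score 0.84
    "novel":    [0.1, 0.9, 0.2],  # score 0.26
    "headset":  [0.7, 0.2, 0.5],  # score 0.85
}
print(recommend(user, items))  # ['headset', 'sneakers']
```

In production the candidate set holds millions of items, so this inner loop becomes a huge batched matrix multiply — exactly the INT8/BF16 workload that AMX accelerates.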

Modern recommender systems place heavy demands on AI compute. As the world's largest e-commerce giant, Alibaba runs a core recommendation system that must process, in real time, hundreds of millions of requests per second from Tmall's and Taobao's huge global customer base. The system must keep AI inference time within a strict latency threshold to protect the user experience, while also maintaining sufficient inference accuracy to protect recommendation quality. To balance performance and cost, Alibaba recently began using CPUs in its recommendation system for workloads such as AI inference, and chose 4th Gen Intel® Xeon® Scalable processors for performance optimization.

Working with Intel, Alibaba applied the AMX acceleration engine across the full stack of its core recommendation model using the Intel oneAPI Deep Neural Network Library. Combining AMX, BF16 mixed precision, 8-channel DDR5 memory, larger caches, more cores, efficient inter-core communication, and software optimizations, the mainstream 48-core 4th Gen Intel® Xeon® Scalable processor increased the throughput of the proxy model by nearly 3x over the mainstream 32-core 3rd Gen Intel® Xeon® Scalable processor, while keeping latency strictly below 15 ms. This performance is already comparable to the high-end GPU solution Alibaba had been using, with stronger advantages in cost and flexibility. The solution is in production and has withstood peak loads such as the Double Eleven shopping festival.
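A strict latency threshold like the 15 ms above is usually enforced on a tail percentile (e.g. p99), not the average. The sketch below shows how such a budget check might be measured; `fake_infer()` is a stand-in for a real model call, and every number here is invented for illustration.

```python
import random
import time

LATENCY_BUDGET_MS = 15.0  # strict per-request inference budget

def fake_infer():
    """Simulated inference call taking roughly 0.5-2 ms."""
    time.sleep(random.uniform(0.0005, 0.002))

def p99_latency_ms(n=200):
    """Measure the 99th-percentile latency over n inference calls."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fake_infer()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[int(0.99 * n) - 1]

print(f"p99 latency: {p99_latency_ms():.2f} ms (budget {LATENCY_BUDGET_MS} ms)")
```

Measuring the tail rather than the mean matters because a recommendation page fans out into many model calls, so even rare slow requests are visible to users.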


Origin blog.csdn.net/YDM6211/article/details/131434167