From GFS to GPT, 20 years of excitement in AI Infra

Introduction

Recently, wave after wave of AIGC and LLM breakthroughs has arrived, as if every promise the AI industry has made over the past decade were about to be fulfilled overnight. AI Infra (the infrastructure required to build AI) has also become one of the focal points of discussion. The public's attention to AI Infra usually centers on compute: the export restrictions on A100/H100 chips, or Musk buying another 10,000 GPUs, and so on.

Computing power is undoubtedly a crucial part of the AI wave, but AI Infra is not only about compute. Just as GPT was not an overnight success, the AI Infra industry has gone through a long period of accumulation and iteration. I have recently been discussing the latest AI developments with colleagues and friends, and whenever AI Infra comes up I have a thousand things to say but struggle to put them into words, so today I decided to write it all down.

As the title suggests, the development of AI is inseparable from big data, and big data began with Google's three classic papers: Google File System, MapReduce, and BigTable. The GFS paper was published in 2003, exactly 20 years ago. These 20 years have also been 20 years of rapid development for big data, AI, and the Internet.

This article attempts to sort out the milestone events of AI Infra over the past 20 years. When we are in the middle of things, it is often hard to distinguish hype from substance, or to see clearly the structural difference between a local lead and an ultimate victory. Only when we look back at history and observe long-term change do certain patterns emerge. Without further ado, let's get started!

Contents

【2003/2004】【Framework】: Google File System & MapReduce

【2005】【Data】: Amazon Mechanical Turk

【2007】【Computing Power】: CUDA 1.0

【2012/2014】【R&D Tools】: Conda/Jupyter

【Summary】

【2012】【Framework】: Spark

【2013/2015/2016】【Framework】: Caffe/TensorFlow/PyTorch

【2014】【Framework/Computing Power/R&D Tools】: Parameter Server & Production-Level Deep Learning

【2017】【Computing Power】: TVM/XLA

【2020】【Data/Computing Power】: Tesla FSD

【2022】【Data】: Unreal Engine 5

【2022】【Data/R&D Tools】: Hugging Face raises US$100 million

【Current】What AI Infra does OpenAI have?

【Conclusion】


【2003/2004】【Framework】: Google File System & MapReduce

The GFS paper Google published in 2003 can be said to have kicked off this 20-year drama, announcing that human society had officially entered the era of Internet-scale big data. A small side note: although Google published the papers, it never released an open-source implementation. As a result, Apache Hadoop later came to dominate the open-source ecosystem with performance that can politely be described as "hard to describe" (which also paved the way for Spark's later rise), and the open-source community grew explosively. This surely also shaped Google's subsequent attitude toward open-sourcing its systems.

GFS and MapReduce opened the era of distributed computing. At the same time, beyond traditional single-machine operating systems, compilers, and databases, the word "Infrastructure" gradually became more and more common. I will not say much about GFS itself here; instead I want to focus on the "problems and shortcomings" of MapReduce. I wonder whether anyone else, after learning the MapReduce programming model for the first time, was as puzzled as I was: what is so special about Map and Reduce? Why these two interfaces and not others? Why must we program in this paradigm? Does building an inverted index really require MR? Even after reading the paper carefully, I still could not answer all of these questions.

Later I discovered I was not the only one complaining. In 2008, database guru Michael Stonebraker, who had not yet won the Turing Award, wrote a scathing piece, "MapReduce: A major step backwards," and called out a certain West Coast school by name: "Berkeley has gone so far as to plan on teaching their freshman how to program using the MapReduce framework." Stonebraker's main criticism was that MR lacked many features of traditional databases, especially schemas and a high-level SQL language, indexing for query acceleration, and so on. Colleagues at Alibaba reading this today would surely smile: "Hey, those features you kept asking for, our MaxCompute now has them all: integrated lakehouse, SQL queries, automatic acceleration. MR can be great too."

But that is the present day. Let us go back to 2004 and ask why Google launched MapReduce without these advanced features, and in doing so defined the shape of the entire open-source big data ecosystem for years to come. What I want to say here is: "Only by understanding the shortcomings of a successful architecture can we truly understand how much value its advantages deliver, enough value for all those shortcomings to be forgiven." MapReduce is not necessarily a good programming paradigm (later developments produced plenty of better ones): it makes algorithm implementation convoluted and dogmatic, only a subset of algorithms can be expressed in it at all, and its performance is often far from that of an optimal implementation of the underlying problem. But in 2004, it made large-scale distributed computing easy for ordinary programmers! You did not need to understand MPI or the principles of distributed communication and synchronization; once you had written a Mapper and a Reducer, your program could run on a cluster of thousands of servers, and, crucially, you did not have to worry about machine failures and other faults.
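To make the paradigm concrete, here is a minimal, self-contained Python sketch of the classic word count, simulating in one process what the framework does across thousands of machines (the documents and the in-memory "shuffle" are, of course, stand-ins):

from collections import defaultdict

# Word count expressed as a Mapper and a Reducer: the only two things the user writes.
def mapper(line):
    for word in line.split():
        yield word, 1                      # emit (key, value) pairs

def reducer(word, counts):
    return word, sum(counts)               # combine all values seen for one key

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# The framework's job (simulated here in-process): run mappers, shuffle by key, run reducers.
shuffled = defaultdict(list)
for line in documents:
    for key, value in mapper(line):
        shuffled[key].append(value)

results = dict(reducer(k, v) for k, v in shuffled.items())
print(results)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}

Everything outside mapper and reducer (partitioning input, shuffling by key, retrying failed machines) is exactly what GFS and MR took off the programmer's plate.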

Ultimately, MapReduce is a compromise

MR sacrifices flexibility and performance in exchange for giving users stable, reliable distributed computing. Such "compromises" became the recurring theme of the generations of AI Infra that followed. Happily, with the progress of modern engineering, we now see many systems that score highly on all three axes of flexibility, performance, and stability. Of course, new trade-off points will always appear, which is one of the reasons the field of AI Infra, or large-scale computer systems in general, is so fascinating.

There is one last thing to say about GFS and MR: "workload-oriented design." Google made clear in the papers that the design of the whole big data system was closely tied to their search engine business: the file system only needs appends, never deletes; reads are mostly sequential rather than random; and the jobs that need MR are mainly scanning the corpus and building indexes. Traditional databases and file systems, precisely because they support other general-purpose needs, inevitably fall short of being the optimal solution for this kind of big data processing.

Okay, at this point some readers may ask: you have talked at length about GFS from 20 years ago, but where is the GPT I care about? How do I build a GPT? Don't worry: there is nothing new under the sun, and the design thinking behind a 20-year-old framework may not differ fundamentally from the newest AI Infra.

【2005】【Data】: Amazon Mechanical Turk

Time moves on to 2005. Let us step away from systems for a moment and see what kind of surprise Amazon Mechanical Turk (AMT) brought to the world. When Web 1.0 first took off, during the dot-com bubble, the whole of society was probably in a frenzy not unlike what we feel now. I do not know who at Amazon had the sudden idea of building an Internet-based crowdsourcing platform, but for university labs that had relied on students and manually recruited subjects to label data, it changed everything. Soon afterwards, Fei-Fei Li's team at Stanford used AMT to annotate the largest image-classification dataset in the history of computer vision, ImageNet, and began holding competitions on it in 2010. Finally, in 2012, AlexNet shocked everyone and set off the first deep learning revolution.

Here are three points to make about AMT and ImageNet:

  1. Looking back at the successive revolutions in "data" with hindsight, the pattern is obvious: each time, either the cost of obtaining labeled data drops dramatically or the scale of available data grows dramatically. It was AMT, and the Internet behind it, that first let humans obtain labeled data easily and at scale for AI research. By the time of the 2023 LLMs, everyone had thought this through: "It turns out no crowdsourcing platform is needed at all. Everyone who has ever posted on the Internet, together with the ancients whose books ended up online, has been labeling data for AI all along."

  2. Many students do not know why ImageNet has "Net" in its name, or assume that the Net in ImageNet and the Net in AlexNet both refer to neural networks. They do not. As the original ImageNet paper explains, the name comes from an earlier project, WordNet, a work somewhat like a knowledge graph or a giant dictionary in which categories and concepts are recorded and linked into a network. ImageNet selected roughly 1,000 object categories from WordNet and built a visual classification task on top of them. From a modern point of view this would be called image-text multimodality, but it is in fact a paradigm that existed very early on: "borrowing a taxonomy from NLP to define a classification task for CV."

  3. Fei-Fei Li has many very interesting CV papers whose citation counts are often not that high, because their angles of attack tend to be unconventional. Her student Andrej Karpathy needs no introduction. Even if you cannot recall any of AK's papers (you may not even have noticed that he is on the ImageNet author list), his blog and GitHub are enormously influential, from the early "Hacker's guide to Neural Networks" all the way to the recent nanoGPT. And the title of AK's PhD thesis is: "Connecting Images and Natural Language."

【2007】【Computing Power】: CUDA 1.0

In 2007, while gamers were still agonizing over which graphics card could run Crysis, NVIDIA quietly released the first version of CUDA. I say "quietly" because it probably made no splash at all at the time. A few years later, the comments I heard about CUDA from seniors working in image processing were, without exception: "It is really hard to use." Fair enough: after being spoiled by compilers and high-level languages for so many years, suddenly being told that to write a program you must think about how the GPU hardware actually executes, manage the cache by hand, and watch your program crawl the moment you get something wrong, who could like that? Even worse was CUDA's floating-point precision. The first time I used CUDA, I excitedly wrote a matrix multiplication, compared the results, and thought: huh? Why are the results so different? Is something broken? After a long, fruitless round of debugging (after all, "on the CPU, when the result is wrong it is always my fault, never the hardware's"), a classmate finally pointed out that CUDA's floating-point numbers were simply not precise enough, and that I needed Kahan summation, the code below:

#include <vector>
using std::vector;

// Kahan (compensated) summation: keep a running error term `c` so that the
// low-order bits lost when adding a small number to a large sum are carried
// over into the next iteration instead of being thrown away.
float kahanSum(const vector<float>& nums) {
    float sum = 0.0f;
    float c = 0.0f;            // running compensation for lost low-order bits
    for (float num : nums) {
        float y = num - c;     // apply the compensation to the incoming value
        float t = sum + y;     // big + small: low-order bits of y may be lost here
        c = (t - sum) - y;     // recover what was lost (algebraically zero, numerically not)
        sum = t;
    }
    return sum;
}

After adding Kahan summation, the result was magically correct. Students who use V100/A100 every day and complain that some NPU/PPU is awful and impossible to adapt to may not realize that when CUDA was first being promoted, it was not much better. This was especially true in high-performance computing, where the major customers were research institutions solving all kinds of differential equations, running ancient Fortran codes written by scientists, on hardware that had always been CPUs with double-precision floating point, for safety. So for quite a long time CUDA was not even considered, and Intel remained the absolute ruler of the HPC market.
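To see concretely why double precision felt "safe" while float32 made people nervous, here is a small illustrative Python experiment (not from the original post): naively accumulating a million float32 values drifts noticeably, while a Kahan-style compensated sum, mirroring the C++ code above, stays close to the exact answer.

import numpy as np

def naive_sum32(values):
    s = np.float32(0.0)
    for v in values:
        s = np.float32(s + v)            # plain accumulation in single precision
    return s

def kahan_sum32(values):
    s = np.float32(0.0)
    c = np.float32(0.0)                  # compensation term, as in the C++ version
    for v in values:
        y = np.float32(v - c)
        t = np.float32(s + y)
        c = np.float32((t - s) - y)
        s = t
    return s

vals = [np.float32(0.1)] * 1_000_000     # the exact sum would be 100000
print(naive_sum32(vals))                 # noticeably off from 100000
print(kahan_sum32(vals))                 # very close to 100000 (the Python loop takes a few seconds)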

As an aside, let me introduce a protagonist on which Intel once pinned high hopes: Xeon Phi. First released in 2010, Xeon Phi was a many-core architecture developed to compete with CUDA. When I took part in the ASC supercomputing competition in 2013, Intel sponsored a large number of Phi cards for free and set a dedicated problem so that everyone would try them. The experience was certainly convenient: the compiler did everything, the original high-performance distributed code did not need a single line changed, and it just ran on the many-core architecture. This has always been Intel's approach with CISC CPUs and compilers: "implement a complex instruction set at the bottom, and let the compiler do all the translation and optimization." Upper-level users notice nothing and simply pay every year to enjoy the dividends of Moore's Law (it is worth mentioning that Intel's ICC high-performance compiler and the MKL library required separate payment). Unfortunately, although Phi's goals and vision were admirable, its compiler and many-core architecture never lived up to the promise of "switch with one click and get a big speedup." The Phi project never accumulated a large user base and was finally discontinued in 2020.

CUDA, on the other hand, went from victory to victory: people found that for SIMD-style high-performance programs, CUDA was actually nicer than the CPU, precisely because "the compiler does less." With CUDA, what you write is what you get. Unlike high-performance CPU code, you no longer need to read the generated assembly to check whether vectorization kicked in or whether the loop was unrolled correctly; and since the CPU cache cannot be managed directly, you could only rely on experience and measurement to guess how it was behaving. This raises a question: "must compilers and language designs serve everyone?" Probably not. Finding the language's real users (here, high-performance engineers) may be the key.

What matters most for this article is that CUDA found a magical customer: AI. Magical, because AI algorithms make "Numerical Analysis" professors gasp and "Convex Optimization" professors cough blood. Why? Here is a numerical computing application of enormous scale that declares "precision does not matter" and "it is all the basic matrix operations CUDA is best at." Machine learning does not need double precision; single precision is fine, and at inference time even single precision is more than needed: half precision, int8, even int4 will do. From the optimization perspective it also breaks every traditional assumption of convex optimization: it is a non-convex problem to which the guarantees demanded by classical algorithms simply do not apply. Moreover, optimizing over the full dataset is not even desirable: mini-batch SGD introduces noise, but that noise turns out to be beneficial to training. As a bonus, another weakness of the GPU, limited video memory, became much less of a problem under mini-batch algorithms.
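The mini-batch pattern is easy to show in a few lines. Below is a toy, purely illustrative sketch (synthetic data, made-up hyperparameters) of mini-batch SGD on a linear model, kept in single precision throughout; each step sees only a small random slice of the data, so gradients are noisy but memory stays bounded.

import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data, deliberately kept in float32: for ML workloads
# single precision is usually enough, as discussed above.
X = rng.normal(size=(10_000, 32)).astype(np.float32)
true_w = rng.normal(size=32).astype(np.float32)
y = X @ true_w + np.float32(0.01) * rng.normal(size=10_000).astype(np.float32)

w = np.zeros(32, dtype=np.float32)
lr, batch_size = 0.1, 256

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)    # sample a mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2.0 * xb.T @ (xb @ w - yb) / batch_size    # noisy gradient estimate
    w -= lr * grad.astype(np.float32)

print(float(np.abs(w - true_w).max()))   # small despite the noisy, low-precision updates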

In short, CUDA and the GPU seem to have been born for AI: their shortcomings eventually turned into features, turning Jensen Huang into the meme-worthy "kitchen overlord" and "nuclear bomb king." And those of us now pouring nation-scale effort into self-developed chips should not forget that in the 16 years since CUDA's release, it was not only the chips but the software toolchain, user habits, and user ecosystem that evolved from 0 to 1, step by step. Will the GPU market remain a one-company show? Will TPU/NPU/PPU overtake on the bend? Let's wait and see.

【2012/2014】【R&D Tools】: Conda/Jupyter

Having covered frameworks, data, and compute, let us look at AI R&D tools. The question that has to be asked here is: why is Python the mainstream language of AI? In fact it is not just AI; Python's popularity has been rising year after year. The open-source community discovered long ago that once a project provides a Python interface, usage increases significantly, and people prefer the Python interface. The reason is simple: the appeal of a dynamic scripting language that needs no compilation is just that great. Not much more needs to be said; after all, everyone knows:

Life is short, I use Python

The Python ecosystem itself also kept improving. Pip-based package management was already convenient, and after Conda launched in 2012, "virtual environment management" became trivially easy. For a field that constantly needs to reuse open-source packages, this is nothing short of a killer feature.

Beyond package management, Python's other major breakthrough was Jupyter, built on IPython. It raised Python's already excellent interactivity to a new standard and gave us the Jupyter Notebook that everyone loves. Is the Notebook AI Infra? Just look at Google's Colab, the quick-start tutorials of today's AI open-source projects, or our own PAI-DSW, and it is clear that the Notebook has become an indispensable link in AI R&D and knowledge sharing. Its web-based development experience, which hides the back-end cluster, lets users drive massive compute resources from one place, with no more Vim over SSH or remote code syncing.

For me, the first choice for writing data-related experimental Python code is no longer an IDE but a Jupyter Notebook. The reason is simple: processing images, DataFrames, or JSON requires constantly "iterating over different algorithmic strategies." At that point, "how the next piece of code should be written depends on the shape of the data and the result of the previous step," and the concrete form of the data and the intermediate results can only be seen at runtime. "Writing code while running it" is therefore everyday practice for data-processing and AI algorithm engineers. Infra engineers who do not understand this inevitably design frameworks or tools that are awkward to use. We will also see later how TensorFlow, which went backwards in interactivity and dynamism, lost users bit by bit.
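As a trivial illustration of that "run a cell, look at the result, then write the next cell" loop (the file name and columns below are made up), a notebook session often goes like this:

# Cell 1: load the data and just look at it first.
import pandas as pd
df = pd.read_csv("clicks.csv")     # hypothetical log file
df.head()                          # inspect columns and sample rows before deciding anything

# Cell 2: only after seeing the columns do we know what to compute next.
df["ctr"] = df["clicks"] / df["impressions"]
df.groupby("hour")["ctr"].mean().describe()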

【Summary】

From the representative works of these first four areas, it is not hard to see the outline of the overall picture of AI Infra:

  1. Compute: powerful CPUs/GPUs are needed to supply the raw power for all kinds of numerical workloads, and the compiler must offer a good programming interface to high-performance engineers.

  2. Framework: for a given workload, abstract a programming paradigm that is both general and suitably constrained; the execution engine then provides distributed computing, fault tolerance and disaster recovery, and the usual operations and monitoring capabilities, all in one place.

  3. R&D tools: AI and data algorithm development expects real-time, interactive feedback while writing code; the open-source community requires that code and other artifacts be easy to package, release, integrate, and version.

  4. Data: tools or platforms are needed to supply the massive data required for AI training.

With these ideas in mind, it becomes easy to see the basic thread running through the subsequent development of AI Infra. Let us continue.

【2012】【Framework】: Spark

Still in 2012, Matei Zaharia of Berkeley published the famous Resilient Distributed Datasets paper and open-sourced the Spark framework. Spark thoroughly fixed the "slow" and "hard to use" problems of the Hadoop ecosystem, and with the rise of Scala and PySpark/Spark SQL, it brought many of the latest ideas from programming languages into the open-source big data community. In hindsight, whether RDDs live in memory may not even be the most important thing; after all, most jobs are not iterative. But the Spark shell, built on Scala's interactive shell, was disruptive in itself for Hadoop users: you could spin up a task in minutes (imagine telling a classmate who writes Java MR jobs every day that they can now do industrial-scale big data computing from a Python prompt), to say nothing of Scala's syntactic sugar and its rich set of operators.

All in all: Spark used Scala, Python, and SQL to deliver an excellent interactive experience, outclassing cumbersome Java while also providing better system performance. People saw that as long as the "user experience" is good enough, even a mature open-source ecosystem can be overturned. The open-source big data ecosystem thus entered a period when a hundred flowers bloomed.
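For example, the classic word count, which takes pages of boilerplate in Java MapReduce, is a few interactive lines in PySpark (the input path below is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# Classic word count, written interactively at a Python prompt.
counts = (sc.textFile("hdfs:///logs/access.log")      # hypothetical input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
spark.stop()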

【2013/2015/2016】【Framework】: Caffe/TensorFlow/PyTorch

In 2013, the AI Infra work closest to what the public imagines arrived: Jia Yangqing open-sourced Caffe. From then on, the barrier to deep learning dropped dramatically: with a model configuration file you could define a network, train it, and harness GPU compute. For a while, model innovation entered an era of explosion. Open-source frameworks of the same period also included Theano and the Lua-based Torch, though they were used rather differently. The big companies then piled in one after another: Google and FB released TensorFlow and PyTorch in 2015 and 2016 respectively, and together with MXNet, which Amazon later backed, and Baidu's PaddlePaddle, machine learning frameworks entered an era of contention among a hundred schools of thought. There are far too many things one could say about ML frameworks, and plenty of public material already exists; here I will discuss only two points.

The first is "Symbolic vs. Imperative" from the perspective of framework design. The discussion goes back to MXNet's technical blog "Deep Learning Programming Paradigm." MXNet was also the first framework to support both modes, and the blog pointed out: imperative is more flexible and easier to use, while symbolic offers better performance. The early versions of the other frameworks each focused on one paradigm: TensorFlow was symbolic, PyTorch imperative. Everyone knows what happened next: PyTorch fully inherited the strengths of Python and has always been known for its flexibility and suitability for research, while TF was friendlier for production deployment but sacrificed interactivity. After long iteration the two paradigms have largely converged: TF added Eager mode and later launched a whole new framework, JAX, while PyTorch can export and manipulate symbolic graphs via TorchScript and torch.fx. Just as MapReduce was a compromise, every ML framework makes its own compromise between "ease of use" and "performance." Overall, though, PyTorch, which put imperative first and stays consistent with ordinary Python habits, has been steadily winning on user numbers. Of course it is far too early to call a winner; the evolution of ML frameworks is far from over.
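A small sketch of the "write imperative code, export a symbolic graph when needed" workflow mentioned above (the module here is a made-up toy):

import torch
import torch.fx
import torch.nn as nn

# Plain imperative/eager PyTorch: runs line by line like ordinary Python.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

net = TinyNet()
y = net(torch.randn(2, 8))                    # eager execution, easy to print and debug

# The same module can be traced into a symbolic graph for optimization or export.
graph_module = torch.fx.symbolic_trace(net)
print(graph_module.graph)                     # the captured dataflow graph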

The other point worth discussing is the relationship between "framework evolution and algorithm evolution." I raise it because many algorithm teams and engineering/framework teams are used to a client-vendor working model, in which framework development and upgrades are understood as: an algorithm scientist wants to implement some model or idea, finds the existing framework cannot support it, and hands the engineering team a requirement for a new operator implementation, a new distributed computing pattern, or a performance optimization. This model has many drawbacks: it supports only small, local innovations, the innovation cycle can be very long, and it often produces the complaint familiar to every engineering team: "the last requirement isn't finished and the algorithm side has already switched to a new idea." How to refine the collaboration model between algorithm and engineering teams is therefore an important topic. For example, in a Co-Design approach both sides must put themselves in the other's shoes and anticipate the technical path in advance; the engineering team must not let its daily work degenerate into writing feature code for scientists, but should instead provide a flexible upper-layer interface for scientists to explore on their own, while the framework layer concentrates on the genuinely hard engineering problems. Most important of all, both sides must recognize that "the current model structures and framework implementations may be mere accidents of history," and that "model design and framework implementation will keep shaping each other's evolution."


The reason is simple. At the frontier of model innovation there is a chicken-and-egg problem: algorithm scientists can only implement and verify the ideas that existing frameworks can express, while the features a framework supports are usually either architectures that have already succeeded or ones for which scientists have stated clear requirements. So how does true system-level innovation happen? Perhaps it comes back to that old Alibaba saying:

Because I believe, I see

The symbiotic relationship between algorithms and frameworks has also sparked a lot of discussion in recent years: for instance the recent debate over why LLMs are decoder-only architectures, or the article "The Hardware Lottery," which argues that "a research idea wins because it is suited to the available software and hardware."

In short, for machine learning frameworks the meaning of "framework" has long outgrown the scope of big data frameworks like MapReduce/Spark, which help engineers implement assorted data ETL. Because the form of the algorithms and models themselves keeps changing, a framework that constrains too much will constrain the iteration and innovation of the algorithms.

【2014】【Framework/Computing Power/R&D Tools】: Parameter Server & Production-Level Deep Learning

The open-source frameworks set off a new wave of AI, and inside the search, recommendation, and advertising businesses of the big Internet companies, everyone began to wonder: can the success of deep learning be reproduced on traditional CTR models? The answer was yes, and essentially every major company started related R&D. Allow me a small advertisement here, using Alimama's display advertising business, which I know well, as the example: from MLR in 2013, to the later large-scale distributed training framework XDL, to DIN and STAR; students working in search/recommendation/ads will know these well. Open-source frameworks did not support large-scale embedding tables or sufficiently reliable distributed training, which left room for self-developed Parameter Server-style frameworks to grow, and large-scale distributed training frameworks became the main engine of algorithm iteration in this space in recent years. As argued above, under high-frequency model iteration and the routine pressure of big promotions and efficiency drives, algorithm innovation and framework evolution form a complex symbiosis. I also recommend the overview of the development of advertising recommendation technology written by Huai Ren, which traces the evolution of the entire algorithm architecture.
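To make "Parameter Server + large embedding table" concrete, here is a deliberately simplified, single-process Python sketch of the pull/update/push pattern (all names are invented; real systems shard the table across servers, run asynchronously, and handle fault tolerance):

import numpy as np

EMB_DIM = 8

class ParameterServer:
    """Holds the sparse embedding table; workers pull rows and push gradients."""
    def __init__(self, dim):
        self.dim = dim
        self.table = {}                      # feature id -> embedding vector

    def pull(self, ids):
        # Rows are created lazily, so the full table never has to fit on one worker.
        return {i: self.table.setdefault(i, np.zeros(self.dim, dtype=np.float32))
                for i in ids}

    def push(self, grads, lr=0.1):
        for i, g in grads.items():           # apply sparse gradient updates
            self.table[i] -= lr * g

ps = ParameterServer(EMB_DIM)

# One "worker" step: a mini-batch touches only a handful of feature ids.
batch_ids = [3, 17, 42]
embs = ps.pull(batch_ids)                    # pull only the rows this batch needs
grads = {i: np.full(EMB_DIM, 0.01, dtype=np.float32) for i in batch_ids}  # fake gradients
ps.push(grads)                               # push sparse updates back to the server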

On the other hand, the training engine is only the tip of the iceberg in productionizing a search/recommendation/ads system: the model inference engine, real-time data pipelines, the A/B testing platform, the container scheduling platform, and more all require a complete set of infrastructure. The most thorough treatment here is of course Wufu's overview of the "AI OS." In the figure below I have also roughly organized some of the common problems faced by industrial-scale machine learning applications.

[Figure: common problems in industrial-scale machine learning applications]

It has to be said that the CTR models behind search/recommendation/ads have always been the technological high ground at the big companies, because they are so tightly coupled to Internet business and revenue. Polished continuously by countless people over the years, every detail of the y = f(x) supervised-learning paradigm has been pushed to the extreme; each small box in the figure above deserves ten-plus technical deep dives of its own. In the GPT era, with LLMs, semi-supervised learning paradigms, and AI applications of broad promise, Alibaba's accumulation in this area can certainly be migrated, reused, and continue to shine.

【2017】【Computing Power】: TVM/XLA

By 2017 both TVM and XLA had been released, and the topic of AI compilers deserves its own discussion. Unlike machine learning frameworks, which mainly solve the problem of usability, AI compilers focus on performance optimization and on adapting models to the target compute chip: typically they improve performance by generating the low-level code for individual operators, or by rewriting and fusing the computation graph. In an era of chip embargoes and a boom in home-grown chips, AI compilers have become one of the fastest-growing areas of AI Infra. Yang Jun from Alibaba's PAI team has also written a survey of AI compilers.

Since this is a compiler, the questions raised earlier apply again: who are the compiler's users, and what is the interface contract? There is also the question of general-purpose versus domain-specific compilation. In the search/recommendation/ads business, for example, the peculiar structure of the models means teams often build their own specialized compilation and optimization passes, distilling specific optimization patterns to keep up with the massive inference compute demand brought by model iteration, whereas a general-purpose optimizer finds it hard to fold those specific patterns into its rules.

On the other hand, the graph-optimization passes of an AI compiler are often unfriendly to ordinary algorithm engineers: a slight change to the model can cause a previously matched optimization rule to stop firing, and the reason for the miss is rarely reported. This brings us back to the earlier point about high-performance CPU compilers: however powerful and general a compiler looks, and however well it hides hardware details, the users who can actually write high-performance code generally need a solid mental model of the hardware and of what the compiler is doing underneath, so that they can inspect and verify the result.

So is the AI compiler something like torch.compile, giving novice users a one-click speedup, or is it an automation tool that boosts the productivity of performance engineers who already understand the lower layers? At the moment, both exist. For example, OpenAI released Triton in 2021, which lets you do CUDA-style GPU programming far more conveniently, in Python syntax. Work like Triton still requires a general understanding of the GPU's multithreaded execution model, but it dramatically lowers the barrier to entry. TVM keeps being upgraded as well; see for example Tianqi Chen's "New Generation Deep Learning Compilation Technology: Changes and Prospects." Let us wait and see how AI compilers develop!
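As a taste of "CUDA-style GPU programming in Python syntax," here is a minimal vector-add kernel in the style of Triton's own tutorials (a sketch that assumes a CUDA-capable GPU and the triton package):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                   # one program per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
print(torch.allclose(add(x, y), x + y))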

【2020】【Data/Computing Power】: Tesla FSD

Time moves into the third decade of the 21st century. By now the public's perception of AI had gone a little stale: the RL revolution promised by AlphaGo had not yet paid off much in real-world scenarios, L4 autonomous driving had hit a bottleneck, and the other pies AI had drawn were still on paper. Meanwhile the engineering architecture of search/recommendation/ads marched on from 3.0 to 4.0 and then 5.0, 6.0, 7.0...

Just as everyone was wondering what AI should do next, Tesla, with Andrej Karpathy leading its AI team, made a dramatic move that year and shipped the vision-only Full Self-Driving (FSD) solution. Tesla AI Day laid out the complete technical stack: BEV perception, a closed-loop Data Engine, the in-car FSD chip, the Dojo ultra-large-scale training system in the cloud, and more. With one stroke Tesla changed the industry's perception; its shadow can be seen in the PR material of most domestic autonomous driving companies since.

[Figures from Tesla AI Day]

It is fair to say Tesla took the engineering architecture of supervised learning to a new level: a large-scale semi-automatic labeling engine, large-scale active collection of hard cases, large-scale distributed training and model validation, with the underlying AI Infra supporting the continuous iteration of dozens of perception and planning models.

【2022】【Data】: Unreal Engine 5

April 2022: ChatGPT would arrive on the scene eight months later, and this month UE5 was officially released. Anyone who was paying attention knows how stunning it looked: Nanite's real-time rendering of ultra-dense triangle meshes and Lumen's dynamic global illumination. The official demo "The Matrix Awakens" shows just what level real-time rendering has reached today.

[Figure from the Unreal Engine official website]

So is UE5 AI Infra? The answer, again, is yes. First, UE4-based open-source simulators such as AirSim and CARLA have long been used at scale to generate training data for drones and autonomous vehicles; training self-driving agents in GTA or teaching little figures to run in MuJoCo (acquired by DeepMind in 2021) is nothing new. A revolutionary update like UE5, together with the maturing pipelines for materials and 3D asset production, will inevitably bring real-time rendered simulation ever closer to the real physical world.

So, will DeepMind + MuJoCo + UE5 catch fire some day in the future? Let us wait and see.

【2022】【Data/R&D Tools】: Hugging Face raises US$100 million

[Figure: the Hugging Face logo]

Anyone following AI and GPT must have been seeing this smiley face a lot lately. But what exactly does Hugging Face do, and why has it become a key piece of AI Infra, successfully raising US$100 million in 2022? If you know projects like Common Crawl, The Pile, BigScience, BigCode, and PubMed, then you are probably studying LLM training data yourself, and you will be pleasantly surprised to find that a great deal of that corpus data has already been cleaned up and hosted on Hugging Face. They even built a Python package for it, called Datasets!
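For example, pulling a public corpus from the Hub takes a couple of lines with the Datasets package (the dataset named below is just one public example, not a claim about what any particular LLM was trained on):

from datasets import load_dataset

# Fetch a public text corpus hosted on the Hugging Face Hub.
ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
print(ds)              # number of rows, column names, etc.
print(ds[0]["text"])   # peek at the first record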

Almost without anyone noticing, Hugging Face has become the GitHub of data and models for AI (at least for NLP). At this point some readers will ask: companies that have done AI for years, in face recognition, search/recommendation/ads, or autonomous driving, have always said data is their strongest moat; who ever heard of open-sourcing their most precious data and models? But with LLMs and GPT something fundamental changed: the data used by today's large multimodal models exists naturally on the Internet, open and easy to obtain (copyright questions aside). So the model has become one of everyone chipping in to collect and curate data, ultimately producing large volumes of high-quality source corpora (the founder of the LAION organization, for example, is a high school teacher).

In fact, for LLMs and AGI, the future landscape may well look like this: of the three classic ingredients of AI, data, compute, and algorithms, data may no longer be the decisive moat thanks to open source; among the big players who have the chips, the real competition will be over algorithms and the iteration speed their AI Infra enables.

【Current】What AI Infra does OpenAI have?

So how does AI Infra help build a GPT? Judging from what OpenAI has disclosed about its architecture, essentially every aspect discussed above is involved. Under the two topics of Compute and Software Engineering you can also find a large number of AI Infra blog posts published by OpenAI itself, many of them in the direction of compute-algorithm co-design. For example, at the beginning of 2021 OpenAI's managed Kubernetes cluster reached 7,500 nodes (up from 2,500 nodes a few years earlier). In July 2021 it open-sourced the aforementioned Triton, a compiler for writing high-performance GPU code in Python syntax. In 2022 it also devoted considerable space to describing its techniques for large-scale distributed training.

It is not hard to see that maximizing the use of massive compute for algorithm development is the number-one goal of OpenAI's Infra. On the other hand, as the two posts "AI and Compute" and "AI and Efficiency" show, OpenAI has spent a lot of energy analyzing how the compute required by the strongest models grows over time, and how algorithmic improvements change compute efficiency. The same kind of analysis shows up in GPT-4's "predictable scaling": in other words, given a training algorithm, the compute it will consume and the capability it will reach can be predicted in advance. This kind of "compute-algorithm co-design" metric is an excellent guide to the pace and direction of algorithm development versus engineering architecture upgrades.
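To illustrate the flavor of "predictable scaling" with a toy example (synthetic numbers, not OpenAI's method or data), one can fit a power law to loss-versus-compute measurements from small runs and extrapolate to a larger run:

import numpy as np

# Synthetic (compute, loss) pairs from hypothetical small-scale training runs.
compute = np.array([1e18, 1e19, 1e20, 1e21])     # training FLOPs (made-up values)
loss    = np.array([3.10, 2.72, 2.41, 2.15])     # eval loss (made-up values)

# Assume loss ≈ a * compute**b with b < 0; fit a straight line in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)

# Extrapolate to a run 100x larger than the biggest measured one.
target = 1e23
predicted_loss = a * target ** b
print(f"fitted exponent b = {b:.3f}, predicted loss at 1e23 FLOPs ~ {predicted_loss:.2f}")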

Beyond compute, the AI open-source community is also advancing rapidly, and much of that work has surely contributed to the emergence of GPT. Besides Hugging Face, many admirable AI startups are appearing; I have not had time to analyze the work and significance of each one here. But change is already happening, and new things are arriving on a weekly cadence.

【Conclusion】

The pace of AI in recent months has truly exceeded what I thought possible. There is no doubt that the era of AI 2.0 has arrived, and the previous paradigm built on pure supervised learning is no longer sufficient. The pies AI once drew are finally being eaten, and they taste wonderful. As an AI practitioner, I have found the past few months exhilarating. Reading this article will not, of course, teach you how to build a GPT, but you will have seen 20 years of AI Infra development. Wherever AI algorithms go next, the compute layer and the systems underneath will remain the cornerstone of algorithm development.

Looking back over the past 20 years: from 2003 to 2013 was the Web 1.0 era, when I was still a child; from 2013 to 2023 I witnessed the wave of AI 1.0 and Web 2.0, though mostly as a bystander. The next ten years will naturally be the revolutionary decade of AI 2.0 and Web 3.0. I cannot imagine what the world will look like ten years from now, but one thing is certain: this time I can finally take part fully, and work with like-minded friends on things that can move the industry!

That said, how could we end without an advertisement? We are the training engineering platform team of the AutoNavi (Amap) Vision Technology Center, responsible for supporting algorithm engineering needs such as the data closed loop, large-scale training, and algorithm servitization, striving to build a technically differentiated, device-cloud collaborative AI Infra for the AI 2.0 era. On one hand we reuse a large amount of middleware from the group and Alibaba Cloud; on the other we build many dedicated AI toolchains. Amap Vision has become one of the largest vision algorithm teams in the group, supporting businesses such as high-precision maps, lane-level navigation, and smart travel, across technology stacks including perception and recognition, visual localization, 3D reconstruction, and rendering.

Our current openings are listed below; there are also many needs that are hard to capture in a JD and are likewise waiting for the right people.

Machine learning platform MLOps R&D engineer: https://talent.alibaba.com/off-campus-position/980607

Algorithm Engineering Service R&D Engineer: https://talent.alibaba.com/off-campus-position/980608

Distributed training optimization expert: https://talent.alibaba.com/off-campus-position/980705

If the above resonates with you, if you also love programming, if you are also the kind of person who "studies C++ on weekends," then there is definitely a position here for you! Welcome to join the Amap Vision family, and feel free to forward this to anyone who might be interested.

Friendly referrals

Amap urgently recruits 3D reconstruction/generative AI algorithm engineers: https://talent.alibaba.com/off-campus-position/979029

Amap urgently recruits SLAM algorithm/perception algorithm experts: https://talent.alibaba.com/off-campus-position/973417

https://talent.alibaba.com/off-campus-position/991613


Follow "Amap Technology" to learn more
