Tencent's Andymhuang (Huang Ming) on uniting the Dao and the tools: the self-cultivation of an excellent machine learning platform, as told through Angel

Summary

In June 2017, Tencent officially open-sourced Angel, its third-generation high-performance machine learning platform, which has attracted much attention on GitHub. On October 19, 2017, Tencent T4 expert Andymhuang (Huang Ming) will bring the audience at QCon Shanghai a talk on Spark on Angel. As Angel's lead developer and team leader, and as an early researcher and evangelist of Spark, his career has tracked the transition from general-purpose big data platforms to dedicated machine learning platforms. Ahead of the conference, InfoQ interviewed Huang Ming. He shares the evolution of big data platforms in the era of artificial intelligence, draws on Angel's development experience to discuss how to build an excellent machine learning platform, and brings the latest news and future plans for Angel since it was open-sourced.

What will artificial intelligence bring to enterprises? Change, opportunity, or an even bigger challenge?

In the big data era, enterprises began to realize the importance of data and to build their own big data platforms. Big data became the industry's focus, and frameworks, components, and platforms such as Hadoop and Spark emerged one after another. With the arrival of the artificial intelligence era, big data platforms are changing again and facing higher requirements. Platforms like Spark were mostly designed for general-purpose data processing, not specifically for machine learning tasks. How can enterprises make better use of machine learning and deep learning to efficiently mine valuable information from real-time data?

In the past two years (2015-2017), with the revolutionary success of machine learning, and especially deep learning, in many fields, dedicated machine learning platforms have emerged and flourished. Angel is one of them, and for that reason we interviewed Huang Ming.

The evolution of big data platforms in the age of artificial intelligence

InfoQ: You are not only Angel's lead developer and team leader, but also an early researcher and evangelist for Spark, and you have long worked on distributed computing and machine learning. Based on your experience, can you describe the evolution from general-purpose big data platforms to dedicated machine learning platforms? What is driving this shift? Do you think most tasks in data centers will become machine learning tasks in the future?

**Huang Ming:** What drives this change is, in essence, people's pursuit of a higher level: from understanding the past to predicting the future; from exhausting a finite space to exploring an infinite one; from supervised training to unsupervised self-learning. Whether corporate executives or product users, everyone hopes for smarter services, and only the products and companies that deliver services at that level can win in the brutal competition of the Internet.

In 2010, big data was just emerging. Many popular projects then were statistical: they could tell you what was most popular yesterday. The underlying frameworks were Hadoop and Hive, and the main output of many platforms was reports of all kinds: daily reports, monthly reports, and so on. At that level, you know what has happened.

In 2012 there were two major directions of development: faster SQL and machine learning, and many open source projects emerged. Spark won because it struck a balance between the two and showed the potential of machine learning. Matei Zaharia and others wrote in the RDD paper at NSDI that Spark's goal was to solve two classes of problems: iterative algorithms and interactive data mining tools. In hindsight, that judgment still holds. Spark became popular, and many companies still regard it as their preferred general-purpose data processing and machine learning platform. At that level, people want to know what is about to happen.

In 2014, Mu Li and others presented a better approach to distributed machine learning in the Parameter Server paper at OSDI, and then Petuum and DMLC's ps-lite came out. Spark's follow-up at the time was not strong: the RDD concept itself conflicts with the PS model. We submitted a PR to Spark back then; it was not accepted, but it led to Glint. To this day, official Spark still implements machine learning algorithms around RDDs, which is a big constraint and obstacle.
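
A toy Python contrast of the two models (conceptual pseudocode, not any real Spark or PS API) shows where the conflict lies: RDDs are immutable values rebuilt every iteration, while a parameter server holds the model as mutable shared state:

```python
import numpy as np

def compute_grad(X, y, w):
    """Least-squares gradient on one data partition."""
    return X.T @ (X @ w - y) / len(y)

# RDD style: the model is an immutable value; every iteration the driver
# builds a *new* model and re-broadcasts it, because RDDs cannot be mutated.
def rdd_style_train(partitions, w, steps=10, lr=0.1):
    for _ in range(steps):
        grads = [compute_grad(X, y, w) for X, y in partitions]  # "map" phase
        w = w - lr * np.mean(grads, axis=0)                     # new value each round
    return w

# PS style: the model is mutable state sharded across servers; each worker
# pulls fresh parameters and pushes deltas, often asynchronously.
def ps_style_worker(ps, X, y, steps=10, lr=0.1):
    for _ in range(steps):
        w = ps.pull("w")                           # read shared parameters
        ps.push("w", -lr * compute_grad(X, y, w))  # additive in-place update
```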

But in 2015, the development of PS was itself impacted by deep learning. With the emergence of TensorFlow, everyone turned to building deep learning frameworks: Microsoft's DMTK gave way to CNTK, DMLC's ps-lite to MXNet, and so on. Yet many companies still have large numbers of CPU machines in their data centers, and many non-deep-learning algorithms still require distributed training on large-scale data sets. That field was left vacant, and deep learning cannot fill it.

Tencent started researching and developing Angel in 2015 precisely to fill that vacancy. Angel went into internal use in 2016 and was finally open-sourced in 2017; the whole open-sourcing process was not easy (for details, see InfoQ's earlier report). I hope Angel can fill this gap, become the dedicated distributed machine learning platform of choice, serve more products inside and outside the company, and support people's pursuit of a higher level.

Finally, I believe future data centers will still run many data processing tasks, because whatever the model or algorithm, its premise is clean data; without a complete data preprocessing pipeline, talking about machine learning and artificial intelligence is unrealistic. But the final destination of most data tasks will be machine learning and artificial intelligence tasks, because without that endpoint the preceding processing is meaningless, and whoever controls the endpoint will be the ultimate winner.

InfoQ: In the early days of big data platforms, offline batch processing was the mainstay, supplemented by real-time computing, but more and more application scenarios now demand low-latency data processing. Tencent's big data platform has gone through three stages of development: offline computing, real-time computing, and machine learning. What roles will batch computing and real-time streaming computing play in the infrastructure of enterprises building AI platforms?

**Huang Ming:** For a technology enterprise, real-time computing capability and machine learning capability are the foundations of AI capability, and both are indeed necessary. The training and inference stages of machine learning will grow in importance side by side, and the advantages of real-time computing will radiate further into inference scenarios. But that does not make offline batch computing unimportant; in the training phase in particular, offline batch remains the main scenario, for three reasons:

  1. A good model requires large amounts of data and repeated iteration, and must reach a certain accuracy before it can be launched. A good deep learning model in particular usually needs multiple GPU cards and a long training time, so a high-performance distributed machine learning platform is essential here.

  2. Many algorithms and scenarios do not support real-time updates. They have inherent constraints: either the mathematical proof does not hold, or the model does not support incremental streaming updates, so it must still be trained offline and then pushed to mobile phones or other terminal devices.

  3. In online learning, continuously optimizing and updating the model matters, but a base model is always needed first, and the quality of that base model strongly constrains how much subsequent improvement is possible.

Taking these three points together, offline batch processing will remain a core, irreplaceable scenario. But real-time streaming computing is developing rapidly, especially in the inference phase, mainly because in the deep learning era:

  1. Models are more complex than before, having shifted from shallow to deep, and their inference is no longer simple algebraic computation.

  2. The data transmitted is larger than before: the input may be images, audio, or text, which demands high throughput, while inference must still complete within milliseconds. This places much higher demands on inference performance.

So I believe that in the next one to two years, many excellent startups will appear in this space, from hardware to software.

How an excellent machine learning platform is made

InfoQ: Computing is the foundation of a machine learning platform, but not the whole of it. In your opinion, what characteristics should a good machine learning platform have?

**Huang Ming:** In machine learning, some people like to compare parameter tuning and training to alchemy, elevating it to the level of the Dao. As for uniting the Dao with its tools: in my view, alchemy needs a good furnace, and that furnace is an excellent machine learning platform. It must supply the right temperature for the refining, that is, the best running environment for innovative models and algorithms. For a machine learning platform to succeed, it should ideally have the following five characteristics:

  1. Brilliant core abstraction

    A machine learning platform must have a soul, and that soul is its core abstraction. When the core abstraction matches the models and algorithms it has to serve, the platform is halfway to success. If it is wrong from the start, say, if SQL is chosen as the platform's core abstraction, the constraints on later development will be obvious; it is like climbing a tree to catch fish: no matter how hard you try, you will not succeed.

    Spark's core abstraction, the RDD, solves the general problem of distributed big data very well; TensorFlow's three core abstractions, Tensor, Mutable Variables, and Dataflow Graphs, neatly distill the key elements of deep learning. Angel's current core abstraction, PSModel, focuses on the central problems of distributed machine learning: model partitioning, data and model parallelism, and asynchronous modes. It can basically meet the needs of most non-deep-learning machine learning tasks. (A conceptual sketch of the pull/push pattern behind a PS model follows below.)
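
To make the parameter-server abstraction concrete, here is a toy Python sketch of the pull/push pattern. The class and method names are hypothetical illustrations, not Angel's actual API (Angel's PSModel interface is JVM-based):

```python
import numpy as np

class ToyPSModel:
    """A toy PS model: one named matrix, sharded by rows across
    (simulated) server nodes. Illustrates sparse pull / additive push."""

    def __init__(self, name, rows, cols, num_servers=4):
        self.name = name
        # Model partitioning: each server holds a horizontal slice.
        self.shards = np.array_split(np.zeros((rows, cols)), num_servers)

    def pull(self, row_ids):
        """Workers fetch only the rows their mini-batch touches."""
        full = np.concatenate(self.shards)
        return full[row_ids]

    def increment(self, row_ids, deltas):
        """Workers push additive deltas; servers apply them in place.
        In-place mutation is what makes asynchronous data parallelism cheap."""
        full = np.concatenate(self.shards)
        np.add.at(full, row_ids, deltas)
        self.shards = np.array_split(full, len(self.shards))

# A worker pulls the rows it needs, computes a local gradient, and
# pushes back a delta, with no global barrier between workers.
model = ToyPSModel("w", rows=1000, cols=10)
rows = [3, 7, 42]
w_rows = model.pull(rows)
model.increment(rows, -0.1 * np.ones_like(w_rows))  # stand-in gradient step
```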

  2. Full performance optimization

    With the core abstraction right, performance is what determines speed, and that depends on how deeply the platform layer understands, tunes, and encapsulates the hardware layer. Last year we won the TeraSort competition with 500 high-performance machines; that is a demonstration of this optimization capability, and we have grafted it onto Angel.

    This is no longer the MapReduce era of huge fleets of low-end machines. CPU and GPU machines alike are moving toward stronger and faster hardware. In last year's competition we used single machines with top industry configurations: IBM PowerPC processors, 512 GB of memory, multiple NVMe SSDs, and a 100 Gb RDMA network.

    Stacking hardware alone is not enough, though; the platform must exploit it fully. For non-deep learning on the Java side, that means JVM tuning: using memory well, avoiding full GC, keeping computation from spilling to disk, pipelining data prefetch, and so on; all of this tests the platform design. For deep learning, it means squeezing performance out of CUDA and OpenCL, managing copies between GPU memory and host memory, choosing between floating-point and fixed-point arithmetic, handling multi-card communication within one machine, and even adopting "black technology" such as XLA.

    Since this is a distributed machine learning platform, a distributed topology is unavoidable. The mature distributed topologies today are still MR, MPI, and PS. In machine learning, MR is basically out of the game; MPI has made a comeback with deep learning and competes with PS; there is also the practice of using PS globally and MPI locally, which is not a bad idea. Once the network topology is fixed, network acceleration must be considered, and RDMA and NVLink are the two key technologies to watch and the direction of the future. After all, whether data lands directly in GPU memory or takes an extra hop through host memory makes an obvious difference, and bypassing the CPU removes that overhead as well, with a considerable effect on performance.

    All of these optimizations are ultimately exposed to platform users, and the simpler the exposure the better: the platform should automatically pick the best-performing path from a few simple parameters. That is what means the most to algorithm engineers and data scientists.

  3. Strong fault tolerance

    Speaking of fault tolerance, I have to mention MPI and MR again. In the Hadoop era, the doctrine of massive low-end machines left MPI thoroughly suppressed by MR. But in the deep learning era, people have found that clusters of high-end machines are not so different from HPC: with hundreds or thousands of such machines, reliability is quite strong and failures are rare, so performance matters more, and the MPI model has come back to life.

    Looking at the whole picture, once the scale grows to large data centers, high-end GPU machines, and terabyte-scale training data, some balance of fault tolerance is still required, and the PS mode remains the best fit. Its overall architecture, including its network communication, is the most flexible and robust, and many disaster-recovery measures can be implemented at low cost, with results far better than simple periodic checkpoints alone. (A rough sketch of the idea follows below.)
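
As a rough illustration of why PS-side recovery can beat plain periodic checkpointing, here is a conceptual Python sketch with hypothetical interfaces (not Angel's API): a server shard that keeps a durable delta log can recover as "last checkpoint plus replayed log" instead of losing everything since the last full dump.

```python
import numpy as np

class ResilientShard:
    """One PS shard: periodic full checkpoints plus a delta log.
    Assume the checkpoint and log live on reliable storage (or are
    replicated to peer servers), so they survive this node's crash."""

    def __init__(self, size, checkpoint_every=1000):
        self.w = np.zeros(size)
        self.checkpoint = self.w.copy()   # last full dump
        self.log = []                     # deltas since the last dump
        self.checkpoint_every = checkpoint_every

    def push(self, delta):
        """Apply a worker's additive update and log it cheaply."""
        self.w += delta
        self.log.append(delta.copy())
        if len(self.log) >= self.checkpoint_every:
            self.checkpoint = self.w.copy()   # periodic full checkpoint
            self.log.clear()

    def recover(self):
        """After a crash: restore the checkpoint, then replay the log.
        Plain periodic checkpointing alone would lose the logged updates."""
        self.w = self.checkpoint.copy()
        for delta in self.log:
            self.w += delta
```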

  4. Flexible interface design

    As everyone knows, in 2017 Python became the number-one programming language, helped along by artificial intelligence. Part of the credit of course goes to giants like TensorFlow and PyTorch, but there are inevitable reasons behind the trend. Python's advantages are its simple syntax, low barrier to entry, and abundant resources: it has rich libraries for data, visualization, and machine learning, and a very healthy ecosystem, while integrating seamlessly with C and, through py4j, with Java. For these reasons Python can provide a friendly interface layer over a powerful backend platform, simple without being simplistic; no wonder it stands out. (A tiny py4j example appears at the end of this point.)

    However, Python is only the visible face of the interface; what really decides everything is the backend interface design, and here the architect's overall design ability is critical. The encapsulation and extension of core concepts, the integration of multiple machine learning ideas, the layering and decoupling of subsystems, and consistency across them are all ultimately reflected in the interface, and they determine how hard it is for users to write algorithms against it.
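
For instance, this is roughly how py4j bridges a Python front end to a JVM backend. The py4j calls below are the library's real entry points; the trainer object and its methods are hypothetical stand-ins for whatever the platform exposes:

```python
from py4j.java_gateway import JavaGateway

# Assumes a JVM process is already running a py4j GatewayServer, e.g.
#   new GatewayServer(new TrainerEntryPoint(), 25333).start();   // Java side
gateway = JavaGateway()                # connects to localhost:25333 by default
trainer = gateway.entry_point          # the Java object the server exposes

trainer.setLearningRate(0.05)                  # hypothetical Java methods
loss = trainer.runEpoch("hdfs://data/train")   # heavy lifting stays in the JVM
print("epoch loss:", loss)
```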

  5. A complete peripheral system

    When TensorFlow was first open-sourced, one of its eye-catching components was TensorBoard, which was astonishingly ahead of contemporary products; at the time, people even wondered whether Google would hold that module back and open-source only part of the project. A good machine learning platform needs to invest in its surrounding systems. If users can quickly debug and locate bugs on your platform, it greatly strengthens their confidence, which creates a strong attraction for users and ultimately contributes to a better ecosystem.

InfoQ: In your opinion, how can one efficiently build an excellent machine learning platform?

**Huang Ming:** Let me start with an episode most people know. TensorFlow's predecessor was DistBelief, which was not very popular in the deep learning community at the time; most people doing deep learning used Caffe or Torch and basically ignored DistBelief, yet when TensorFlow launched it became enormously popular. There is a telling timeline here: DistBelief development started in 2011, Hinton joined Google in 2013, TensorFlow was released in 2015, and Jeff Dean led the project at Google from start to finish. As outsiders we cannot know exactly what Hinton contributed to TensorFlow, but the effect is obvious. DistBelief had been too engineering-driven, not friendly enough to the models and algorithms at the heart of deep learning; after Hinton joined, the second-generation TensorFlow far surpassed the first-generation DistBelief. The design of the whole system, from top to bottom, from the naming to the periphery, reveals an intimate understanding of deep learning engineers, and that is why TensorFlow succeeded.

So, in my view, designing and building an excellent machine learning platform requires meeting three conditions:

**First, build a team with strong engineering and strong algorithm/model capability.** As a whole, the team needs complementary skills, with algorithm engineers and system architects cooperating closely. The mathematical grounding and expressive ability of the algorithm engineers matter greatly, as do the comprehension and rapid implementation ability of the system architects. Ideally, academic innovation is combined with engineering delivery, so the system is both innovative and reliable. From the very beginning, Tencent's Angel was a project jointly led by doctoral students from Peking University and engineers from Tencent. Although it is far from the level of masters like Hinton and Jeff Dean, the model is similar, and that is a critical element.

**Second, development must be driven by big data.** We studied Petuum earlier and found that some of its concepts were very good, but its stability was poor: it was hard to run through at large data volumes and hard to build on. So throughout Angel's development we insisted on being driven by big data: every trick and design decision had to survive the final stress test, and we leaned closely on internal businesses to validate effectiveness in real scenarios, which ensures the design is sound and usable. This is not too difficult for large companies, given the right mindset and cooperation, but it is much harder for small ones. It is one of the advantages that open source frameworks from large companies such as BAT have over frameworks produced by laboratories or startups.

**Finally, maintain a very fast rate of evolution.** TensorFlow is often criticized for changing its interfaces too quickly, and in fact Angel's interfaces have also changed a great deal recently, some in non-backward-compatible ways. The reason is simple. One is that deep learning in industry is moving too fast, with new algorithms, models, and techniques emerging constantly. The other is that many developers work in parallel internally; even though Angel currently has relatively few stars, it is hard to guarantee every module is reasonable, and regular refactoring is the only way to eliminate the unreasonable parts. On the whole, as long as a refactoring is reasonable and improves the system, it shows the project is still in a period of rapid growth, which is a good thing.

InfoQ: Mr. Wang Yonggang of Innovation Works wrote in "Why AI Engineers Need to Know a Little Architecture" that research requires more than understanding algorithms: implementing an algorithm does not mean solving the problem, solving the problem does not mean solving it in production, and architecture knowledge is the common language for efficient teamwork among engineers. Can you share your perspective on architectural capability?

**Huang Ming:** The "a little" in Mr. Wang Yonggang's phrase deserves careful reading; to me it means two things:

  1. You really do need to understand it; you cannot know nothing. Algorithm engineers and data scientists in a company must have hands-on ability. They cannot spend all day doing research, writing papers, and running Matlab or single-machine Python experiments, happily playing with one GPU machine, and then, the moment something must go to production and they have to talk to the engineering team, find the conversation breaks down. That is a very bad state to be in. Such AI engineers, unless exceptionally strong or prominent in some specific area, will find it very hard to survive in a company.

  2. But you cannot expect to know too much, either. Algorithms and engineering have different concerns, different mental circuits, and different methodologies. People proficient in both exist, but they are hard to find. This is precisely Tencent's original intent with Angel: to let algorithm engineers easily write efficient, distributed, production-grade code without knowing too much about the underlying framework's optimizations, by shielding the common system and architecture details, so that the company's productivity improves greatly.

The most successful frameworks today, including Spark and TensorFlow, succeed because they properly shield the underlying architectural details while still letting data engineers and AI engineers write efficient algorithm code.

New Changes and Prospects of the Angel Platform

InfoQ: From your earlier posts, everyone has a good understanding of the refactoring and upgrades done before Angel was open-sourced. There must have been many new changes since then. What optimizations has the platform made?

**Huang Ming:** Since going open source, Angel has released two minor versions, 1.1.0 and 1.2.0, mainly adding new algorithms and optimization methods, strengthening stability, and refining and improving the functionality of the previous release. The optimization over these three months has focused on stability and performance under large data volumes, because Angel is positioned as an industrial-grade platform and cares greatly about both; the algorithms we publish are all production-proven. At the same time, we have repeatedly refactored the Spark on Angel interface to bring it as close as possible to Angel's own interface and to reuse it; that work will be highlighted at this QCon.

In addition, based on user feedback, the Angel team is developing two major, as-yet-unreleased features:

  1. Python interface: interface optimization and refactoring to improve ease of use. In our earlier promotion, the first question from many users was whether there is a Python interface, so we have to make satisfying it the first priority.

  2. Spark Streaming on Angel: support for online learning, adding the FTRL algorithm. As I said before, real-time capability is essential for machine learning. Angel itself does not do real-time computation, but it supports Spark on Angel, so plugging in real-time training through Spark Streaming is a natural step at very low cost; however, Angel's HA and memory management need further optimization for it. (A minimal sketch of the FTRL update follows this list.)
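
For readers unfamiliar with FTRL: it is a per-coordinate online optimizer widely used for sparse, CTR-style logistic regression, prized because its L1 term keeps the model sparse as data streams in. Below is a minimal NumPy sketch of the FTRL-Proximal update, following McMahan et al.; it is a generic illustration, not Angel's implementation:

```python
import numpy as np

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse data."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = np.zeros(dim)   # accumulated adjusted gradients
        self.n = np.zeros(dim)   # accumulated squared gradients

    def weights(self, idx):
        """Lazy weights: exactly zero where |z| <= l1, which gives sparsity."""
        z, n = self.z[idx], self.n[idx]
        w = np.zeros_like(z)
        active = np.abs(z) > self.l1
        w[active] = -(z[active] - np.sign(z[active]) * self.l1) / (
            (self.beta + np.sqrt(n[active])) / self.alpha + self.l2)
        return w

    def update(self, idx, x, y):
        """One online step on a sparse example: idx lists the nonzero
        feature ids, x their values, y the 0/1 label."""
        w = self.weights(idx)
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # predicted probability
        g = (p - y) * x                           # logistic-loss gradient
        sigma = (np.sqrt(self.n[idx] + g * g) - np.sqrt(self.n[idx])) / self.alpha
        self.z[idx] += g - sigma * w
        self.n[idx] += g * g
        return p

# Streaming usage: one update per arriving example.
opt = FTRLProximal(dim=2**20)
p = opt.update(idx=np.array([3, 77, 1024]), x=np.ones(3), y=1.0)
```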

These two new features should meet users within the next two releases. As for deep learning support, it is in progress, but it is somewhat difficult and will land later.

InfoQ: How has the promotion of Angel gone in the period since it was open-sourced? Is there any feedback that particularly impressed you?

**Huang Ming:** Since Angel was open-sourced, we have not promoted it very deliberately. Even on the first public day on GitHub (June 16) we did not plan any PR, though thanks to our earlier influence the major media reported on it anyway. Tencent TOSA (the Open Source Committee) has been very supportive of open source projects over the past year and very open in attitude, so we have mainly relied on Tencent's open source channels and published several articles. The star count is now close to 2.5k, and we are especially pleased that the fork-to-star ratio is relatively high, which shows many people are genuinely interested in the project. Overall, we are keeping to the rhythm we set, developing new functions and versions in small, fast steps.

As far as we know from our contacts, some companies (such as Xiaomi and Sina Weibo) are currently trying Angel, and there are many contributors. A few impressed me:

  1. An engineer from Huawei submitted a fairly large PR shortly after the project was released, helping upgrade the Netty version, which was great. Later he wanted to integrate GraphX, but I felt that direction was not right, so I rejected it, which was rather embarrassing.

  2. One of the developers of Microsoft's LightGBM opened an issue and went back and forth about ten times with the Angel team's GBDT developers, discussing in detail whether MPI or PS has the lower network communication cost in machine learning tasks. It was an interesting academic exchange.

  3. An overseas user volunteered to help translate Angel's documentation. To prepare for open-sourcing, the team had spent nearly a month writing documents and fixing bugs: about 100 documents in total, a huge translation workload. By now almost all of them have been translated.

All of this has made us appreciate the power and benefits of open source. Once a platform is open-sourced, it receives attention from all over the world. As long as you manage it carefully and keep its functionality and performance good enough to help users, users will take the initiative to do many things for you, and your vision widens. The needs of many external users are very objective, and it is they who drive us forward.

InfoQ: Looking at Angel three months after open-sourcing, what advantages does Angel have over other machine learning platforms (such as Spark, Petuum, GraphLab, TensorFlow)? Which of Angel's features most attract machine learning developers?

**Huang Ming:** First of all, Petuum and GraphLab are not open source at present, so there is no basis for comparison. Angel borrowed some ideas from Petuum early in its development, but our later experiments showed that Petuum could not reach an industrially usable level of reliability and stability, so we basically scrapped that and redid it.

Compared with Spark: Spark's current focus is still SparkSQL, as the PR counts of each release show, and MLlib's share is very small, which is partly due to the inherent limitations of Spark's RDDs. Angel, by contrast, focuses on machine learning algorithms, and its PSModel-based programming model lets various machine learning optimizations and tricks be implemented easily, which is very friendly to algorithm engineers. Once the Python interface ships, this advantage will become even more obvious.

TensorFlow's position in deep learning remains far ahead, as its roughly 70k stars show. However, TensorFlow's PS does not perform well across multiple machines and multiple cards, and the recently released version is still trying the MPI route; this is one of the industry's hard problems. Angel will not build a separate new deep learning framework to compete with TensorFlow; instead it will play to its own strengths and polish PS-Service to accelerate parallel training, so the two complement each other.

As for the life cycle of traditional machine learning algorithms, I don't think you need to worry too much. The key point is that, compared with deep learning, traditional machine learning algorithms are closer to problem solving than to simulating intelligence. Deep networks imitate the structure of the brain, so they beat traditional algorithms in the areas of intelligence humans are good at: vision, hearing, and various kinds of understanding of external signals. But there are also non-intelligent areas where the human brain shows cognitive deficits, such as judging whether a pattern is truly random, perceiving probability, and assessing risk. In these areas traditional machine learning methods remain more effective and reach good conclusions without massive brute force. That may change eventually, but for now traditional machine learning is still necessary; in many situations it is simple and useful, including in many Tencent scenarios, so Angel must keep serving it well.

Overall, Angel is currently one of the more mature open source parameter-server frameworks in the industry, at home or abroad: a platform for developing and running general-purpose machine learning algorithms at terabyte-scale sample sizes. It is also worth mentioning that developing algorithms on Angel is not particularly difficult: one open source contributor easily implemented Alibaba's MLR algorithm for CTR estimation on Angel and contributed it to the community. That is exactly what the Angel team hopes to see.

InfoQ: What problems and difficulties do you think machine learning platforms still face? What are the priorities for future improvement?

**Huang Ming:** Machine learning platforms still face three major problems: horizontal scaling of computing power, efficient model compression, and fast inference.

  1. The biggest challenge for a machine learning platform is computing power. No matter how strong the single-machine performance, how easy the interface, or how good the low-level optimization, single-machine capability eventually hits its limit, and the platform must scale horizontally. Yet a large-scale, general-purpose, multi-machine multi-card distributed solution for deep learning is still a hard problem; even TensorFlow does not handle it well. This is why Tencent is committed to the Angel system: we hope to provide high-performance distributed machine learning solutions on both CPU and GPU.

  2. Second, the fine-grained models trained over long periods on huge clusters are generally large: models trained by Angel often reach hundreds of gigabytes, and most deep learning models are at the gigabyte level. Models this large must be compressed before terminals can use them, and maximizing compression while minimizing accuracy loss is something the platform must consider (see the quantization sketch after this list).

  3. The last is fast inference. Whether for terminal inference (mobile devices, driverless cars...) or server-side inference (advertising, recommendation...), the requirements are the same: as fast as possible, with high throughput. How to design and exploit a streaming real-time system so that data reaches inference quickly is another point the platform must consider.
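
On the compression point, post-training linear quantization is one common technique: weights are stored as 8-bit integers plus a scale and offset, cutting size roughly 4x versus float32 at some accuracy cost. A minimal NumPy sketch of the generic technique (not Angel's compression scheme):

```python
import numpy as np

def quantize(w, bits=8):
    """Affine quantization of a float32 tensor to unsigned integers."""
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax or 1.0   # guard against constant tensors
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale + lo

w = np.random.randn(1000).astype(np.float32)
q, scale, lo = quantize(w)
w_hat = dequantize(q, scale, lo)
print("max abs error:", np.abs(w - w_hat).max())    # bounded by ~scale/2
print("size ratio: %.1fx" % (w.nbytes / q.nbytes))  # ~4x smaller than float32
```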

In the coming year, I believe most machine learning frameworks, including deep learning frameworks, will focus on these issues. They are also the challenges and opportunities Angel must face.

About the interviewee

Andymhuang (Huang Ming), Tencent T4 expert, is an early Spark researcher and evangelist with deep experience and research in distributed computing and machine learning. He currently leads the massive computing group in Tencent's data platform department, responsible for building large-scale distributed computing and machine learning platforms and helping Tencent's data and machine learning businesses develop rapidly.
