The evolution of the programming model of the PAI distributed machine learning platform

Abstract: At the fifth session of the Yunqi Computing Journey, the Big Data and Artificial Intelligence Conference, Jiufeng of the Alibaba Cloud Big Data Division gave a talk titled "The Evolution of the PAI Distributed Machine Learning Platform Programming Model". He described how, when applying machine learning to big data problems within the group, business pain points were gradually resolved through the evolution of the programming model.

At the fifth session of the Yunqi Computing Journey, the Big Data and Artificial Intelligence Conference, Jiufeng of the Alibaba Cloud Big Data Division gave a talk titled "The Evolution of the PAI Distributed Machine Learning Platform Programming Model". He described how, when applying machine learning to big data problems within the group, business pain points were gradually resolved through the evolution of the programming model, focusing on the MapReduce, MPI, and Parameter Server programming models.



The following content is based on the conference video.

What is PAI?
The full name of PAI is Platform of Artificial Intelligence. It provides a complete pipeline for training models with machine learning, and machine learning is offered as a service: users can access advanced machine learning algorithms on the platform. Multiple deep learning frameworks are integrated into PAI and packaged into easier-to-use components on top of them. With heterogeneous computing service capability, CPU and GPU computing resources are scheduled uniformly, and the heterogeneity is transparent: users can focus on which machine learning algorithms help the business, not on the application for and allocation of underlying resources. PAI also supports online prediction services, with one-click model deployment.

Challenges of large-scale distributed machine learning
The scale of the data and the number of features keep growing, which drives up model complexity; earlier models can no longer handle such complexity. As features increase, the model becomes larger and larger, until it is difficult to load on a single machine, so the model must be sharded when it is stored. Conventional machine learning pays more attention to understanding the business's data and features, but now attention shifts to the model itself: more consideration goes into adjusting the model's structure to achieve better prediction results.

Programming Model Evolution
The MapReduce programming model


The core idea of MapReduce is divide and conquer: split the data into many pieces and let each node process a small piece. Building a distributed system raises many problems; for example, one hopes that computing tasks can be divided and scheduled at the framework level. MapReduce does exactly that, greatly lowering the difficulty and the threshold of migrating tasks to a distributed computing system. For distributed storage and partitioning of data, the data can be spread across thousands of machines with corresponding replicas, so there is no need to worry about data loss; the underlying distributed storage handles it uniformly. Synchronization of computing tasks and fault tolerance and recovery of computing nodes are common concerns when building large clusters out of commodity machines, and with MapReduce you do not need to care about them. The figure on the right shows the MapReduce programming model, which was originally used for problems such as SQL processing.

In machine learning, some algorithms are implemented on the MapReduce programming model. TF-IDF evaluates how well the words in a document represent the document's topic. First, compute the frequency of each word within the document, filtering out stop words such as prepositions and interjections and keeping only words that are really meaningful. IDF then reflects how often the word appears across all documents; the final score is computed by combining the word's in-document frequency with its frequency across the corpus. How is this process implemented with MapReduce? Each Mapper iterates over the articles assigned to it, counting the frequency of each word during the iteration. The statistical results are passed to the Reducers, which perform the aggregation and produce the final TF-IDF table.
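The Mapper/Reducer flow above can be sketched in a few lines. This is a single-process illustration, not the platform's implementation: the tiny corpus, the function names, and the simple `count * log(N/df)` scoring are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math

# Hypothetical in-memory corpus standing in for distributed input splits.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "the cats and the dogs",
}

def mapper(doc_id, text):
    # Map phase: emit ((word, doc_id), 1) for every word occurrence.
    for word in text.split():
        yield (word, doc_id), 1

def reducer(pairs):
    # Reduce phase: aggregate term counts and document frequencies,
    # then combine TF and IDF into the final score table.
    tf = Counter()
    df = defaultdict(set)
    for (word, doc_id), count in pairs:
        tf[(word, doc_id)] += count
        df[word].add(doc_id)
    n_docs = len({d for _, d in tf})
    return {
        (word, doc_id): count * math.log(n_docs / len(df[word]))
        for (word, doc_id), count in tf.items()
    }

pairs = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
tfidf = reducer(pairs)
# "the" appears in every document, so its IDF (hence TF-IDF) is 0,
# while a word unique to one document gets a positive score.
```

In a real MapReduce job the shuffle phase would group the emitted pairs by key across machines; here the list comprehension plays that role.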

The MapReduce programming model has two characteristics: computing tasks are independent of one another, with each Mapper and Reducer working only on its own data, so data parallelism is high; and it is suitable for machine learning algorithms that require no communication between different nodes.

MPI programming model

The logistic regression algorithm requires communication between nodes, and it appears frequently in personalized recommendation. A personalized recommendation algorithm classifies each incoming click, judges whether the user is interested in certain products, and then makes recommendations. The model function is the formula shown in the figure above, together with a defined loss function; the smaller the loss value, the better the model fits. Gradient descent is used to find the minimum of the loss function.

In the early days, many logistic regression implementations were based on the MPI programming model. MPI (Message Passing Interface) defines interfaces such as Send, Receive, Broadcast, and AllReduce. It supports multiple instances on a single machine as well as across machines, is highly flexible, and is widely used in scientific computing.



There are also many restrictions when using MPI. First, you must know in advance on which computing nodes the tasks will run. In a large-scale computing cluster, all resource allocation is dynamic: before a task executes, it is not known which nodes it will be scheduled to. However, many early algorithms had to be implemented on MPI, so the network topology beneath MPI was rebuilt and a lot of refactoring was done to let MPI programs be scheduled by the distributed scheduling system.

The implementation of logistic regression is shown in the figure above. There are n computing nodes. Each node first loads its training samples, computes gradients, and accumulates them locally; finally the AllReduce interface is called to obtain the current state of the model. MPI itself also has shortcomings: the number of workers has an upper limit, and performance degrades when more nodes are needed.
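The load-shard / compute-gradient / AllReduce loop can be illustrated with a single-process simulation. A real deployment would use an MPI library (e.g. `MPI_Allreduce` with a sum operation); here `allreduce_sum` is a hypothetical stand-in that gives every simulated worker the same global gradient, and the data, shard count, and learning rate are assumptions for the example.

```python
import numpy as np

# Single-process simulation of the AllReduce pattern described above:
# each of 4 workers computes a local gradient on its data shard, and
# "AllReduce" sums the gradients so every worker sees the global sum.
def local_gradient(X, y, w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y)

def allreduce_sum(values):
    # Stand-in for an MPI sum-AllReduce: every node receives the total.
    total = sum(values)
    return [total.copy() for _ in values]

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)
shards = np.array_split(np.arange(120), 4)  # 4 simulated workers

w = np.zeros(3)
for _ in range(100):
    local = [local_gradient(X[idx], y[idx], w) for idx in shards]
    summed = allreduce_sum(local)   # all workers now hold the same sum
    w -= 0.01 * summed[0]           # identical update on every worker

acc = np.mean(((X @ w) > 0) == (y == 1))
```

Because the sum of the per-shard gradients equals the full-batch gradient, every worker applies the same update and the model replicas stay in sync without a central server.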

Parameter Server


Compared with MPI, Parameter Server defines the programming model and interfaces at a higher level. There are three roles in Parameter Server. Server nodes store the model; each computing node loads part of the model along with its training data, computes a gradient in every iteration, and communicates it to the servers; the coordinator decides whether training has finished. In addition, Parameter Server supports an asynchronous communication interface, so different computing nodes do not need to synchronize with one another.
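The three roles can be sketched in miniature. This is a single-process illustration of the pull/push interface, not PAI's implementation: the `Server` class, `worker_step` function, and the fixed step budget standing in for the coordinator are all assumptions made for the example.

```python
import numpy as np

# Minimal single-process sketch of the Parameter Server roles described
# above: a server holding the model, workers pushing gradients, and a
# coordinator (here just a fixed step budget) deciding when to stop.
class Server:
    def __init__(self, dim):
        self.w = np.zeros(dim)       # the model lives on server nodes

    def pull(self):
        return self.w.copy()         # workers fetch current parameters

    def push(self, grad, lr=0.1):
        # Workers push gradients; the server applies them. With no
        # barrier between workers, this corresponds to async training.
        self.w -= lr * grad

def worker_step(server, X, y):
    w = server.pull()
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    server.push(X.T @ (p - y) / len(y))  # send the local gradient

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
shards = np.array_split(np.arange(200), 4)   # 4 workers' data shards

server = Server(dim=2)
for step in range(300):                       # coordinator: step budget
    idx = shards[step % 4]                    # workers take turns here
    worker_step(server, X[idx], y[idx])

acc = np.mean(((X @ server.pull()) > 0) == (y == 1))
```

In a real deployment the model is sharded across many servers and the workers run concurrently; the key contrast with AllReduce is that workers exchange gradients with servers rather than with each other.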



Alibaba independently developed the PAI Parameter Server computing model in the second half of 2014, and it has been used at scale within the group; the specific work is shown in the figure above. One shortcoming of MPI is that it does not support fault tolerance, yet clusters of tens of thousands of machines see failures of all kinds every day. PAI Parameter Server provides node fault tolerance for large-scale clusters, and it integrates many algorithms, such as logistic regression.

Deep learning


Deep learning is an extension of artificial neural networks that, by comparison, supports much deeper networks. In the figure above, AlexNet is a convolutional neural network with eight layers in total. To achieve better results with deep learning, you must build deeper neural networks; as the network gets deeper it requires more parameters and the model grows larger, and multi-machine training demands correspondingly more communication traffic.

TensorFlow

TensorFlow is Google's second-generation deep learning framework. It supports various neural networks, including CNN, RNN, and LSTM, with high flexibility and a rich community ecosystem.


The TensorFlow example in the figure above is a two-layer neural network for image classification. The upper part defines the training images and test data through the API, then defines the model (a softmax multi-class model) and a cross-entropy loss function, and finally selects an optimizer to find the best point. The lower part feeds the training data to the model through the API and computes the model's current accuracy. The example shows that the API is very flexible and, being based on Python, very convenient.
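The pieces the figure's TensorFlow code wires together (softmax model, cross-entropy loss, gradient-based optimization, accuracy check) can also be written framework-free. The sketch below uses NumPy on synthetic data rather than the figure's image data; the shapes, learning rate, and label construction are assumptions for the example.

```python
import numpy as np

# Framework-free sketch of the softmax-regression setup the TensorFlow
# example describes: softmax model, cross-entropy loss, gradient steps.
rng = np.random.default_rng(3)
n, d, k = 300, 4, 3                      # samples, features, classes
true_W = rng.normal(size=(d, k))
X = rng.normal(size=(n, d))
y = np.argmax(X @ true_W, axis=1)        # synthetic 3-class labels
Y = np.eye(k)[y]                         # one-hot targets

W = np.zeros((d, k))
b = np.zeros(k)
for _ in range(300):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)            # softmax probabilities
    # Cross-entropy loss = -mean(sum(Y * log P)); its gradient w.r.t.
    # the logits is (P - Y) / n, which backpropagates to W and b.
    W -= 0.5 * X.T @ (P - Y) / n
    b -= 0.5 * (P - Y).mean(axis=0)

acc = np.mean(np.argmax(X @ W + b, axis=1) == y)
```

A framework like TensorFlow derives the same gradients automatically from the loss definition, which is what makes the API in the figure so compact.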

PAI TensorFlow
After migrating TensorFlow to PAI, TensorFlow jobs become a service. Starting a TensorFlow job requires no resource application and no training-data migration; distributed scheduling (both single-machine and multi-machine) needs only the model-training Python file to be submitted; GPU cards are mapped automatically; multiple data sources are supported, both structured and unstructured; hyperparameters are supported, so values such as the learning rate tuned during training are stored as hyperparameters and need not be re-adjusted each time; and for online model prediction, a trained model can be deployed to the online prediction service and called through an API to obtain the prediction result.

PAI Pluto (multi-machine multi-card Caffe)


Caffe predates TensorFlow and can be regarded as a first-generation deep learning framework. When using Caffe, you configure the deep convolutional neural network through a configuration file. Many image applications today are built on Caffe with CNN networks, such as ID card recognition and driver's license recognition. Its drawback is that it is single-machine, so training takes a very long time when there are many training samples. By grafting Caffe's bottom layer onto the OpenMPI communication framework, multi-machine Caffe is supported and a near-linear speedup is achieved.

Summary


The sections above covered the various programming models supported on PAI. We hope to launch machine learning as a service on the public cloud, covering data upload, data cleaning, feature engineering, model training, and model evaluation, so that one-stop model training and prediction can be done on PAI.

