Perform machine learning based on the MaxCompute platform and display the results

Abstract: This experiment collects users' operation behavior data, samples part of it for manual annotation (1 for satisfied, 0 for dissatisfied), trains a model on the labeled behavior data with the machine learning platform, and then uses the model to predict from user behavior whether a user is satisfied with the current result.

The MaxCompute big data computing service provides a machine learning platform with which users can put their own data to work and discover value in it. This article walks through the entire process: collecting user behavior data, computing on it in the machine learning platform, and synchronizing the results to an RDS database so that they can be displayed. In the experiment, user operation behavior data is collected and a sample of it is labeled manually as satisfied (1) or dissatisfied (0); the machine learning platform then trains a model on the labeled data, and the model predicts from user behavior whether a user is satisfied with the current result. What is new in this practice is that DataHub, RDS, and task operation and maintenance are used to connect the MaxCompute platform with a local project, so that model prediction runs automatically.

1 User behavior data collection

This experiment uses Alibaba Cloud's DataHub to collect user behavior data. DataHub collects user-generated behavior in real time and synchronizes it to the MaxCompute platform in real time. The main steps are as follows (the console address is datahub.console.aliyun.com):

1.1 Create a project

Click the Create Project button to open the following window:

[Figure: Create Project window]

Enter a custom project name and click Create.

After creation, click to view the project's basic information and configurable items.

1.2 Create Topic

Click Create Topic to configure which MaxCompute project and table the data will be synchronized to. The following window opens:

[Figure: Create Topic window]

There are two ways to create a Topic: create it directly, or import a MaxCompute table structure. The second is recommended because it reuses the table structure already defined in MaxCompute, so you do not have to define the fields yourself. Follow the prompts and fill in the form to finish creating the Topic (note: every field must be filled in, including the remarks at the end; otherwise the Topic cannot be created).

1.3 Configuration complete

Once the Topic is created, click View to open the page shown below, where data has already been collected. Click Data Sampling to inspect the collected records. Under Connectors, you can view information about the connection to MaxCompute.

[Figure: Topic page with collected data]
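Once behavior data has been collected, part of it is sampled for manual annotation. The following is a minimal local sketch of that sampling step, not the platform's own mechanism. The record fields (`user_id`, `clicks`, `dwell_seconds`) and the helper `sample_for_labeling` are hypothetical; the real fields depend on the Topic schema.

```python
import random

def sample_for_labeling(records, fraction, seed=0):
    """Pick a reproducible random subset of records for manual annotation."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)  # sampling without replacement

# Hypothetical behavior records; real field names come from the Topic schema.
records = [{"user_id": i, "clicks": i % 5, "dwell_seconds": 3 * i} for i in range(100)]
to_label = sample_for_labeling(records, fraction=0.1)

# An annotator then fills in 1 (satisfied) or 0 (dissatisfied) for each sample.
labels = {record["user_id"]: None for record in to_label}
```

Fixing the seed keeps the sample stable between runs, which helps when several annotators need to work on the same subset.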

2 Machine Learning Platform

This experiment uses the logistic regression algorithm to classify the behavior data. Once DataHub has collected the data and synchronized it to MaxCompute, the data can be processed and computed on.

2.1 Data preprocessing

The user behavior data collected by DataHub often does not meet the input requirements of the machine learning platform, so it has to be preprocessed first. The machine learning platform provides a range of components for data preprocessing:

[Figure: data preprocessing components of the machine learning platform]
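To make the idea of preprocessing concrete, here is a small local sketch of the kind of cleaning such components perform: dropping incomplete records and casting numeric fields. It is an illustration only and does not use the platform's components; the field names and the `preprocess` helper are assumptions.

```python
def preprocess(rows, required=("user_id", "clicks", "dwell_seconds")):
    """Drop rows with missing fields and cast the numeric fields to float."""
    cleaned = []
    for row in rows:
        if any(row.get(name) in (None, "") for name in required):
            continue  # skip incomplete records
        try:
            cleaned.append({
                "user_id": str(row["user_id"]),
                "clicks": float(row["clicks"]),
                "dwell_seconds": float(row["dwell_seconds"]),
            })
        except (TypeError, ValueError):
            continue  # skip records whose numeric fields cannot be parsed
    return cleaned

# Hypothetical raw records as they might arrive from collection.
raw = [
    {"user_id": "u1", "clicks": "3", "dwell_seconds": "12.5"},
    {"user_id": "u2", "clicks": "", "dwell_seconds": "4"},      # missing value
    {"user_id": "u3", "clicks": "many", "dwell_seconds": "7"},  # not numeric
]
clean = preprocess(raw)
```

Only the first record survives here; the other two are rejected for a missing and a non-numeric value respectively.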

To preprocess the collected data automatically, create a new task in the big data development kit and set up its scheduling. The process is as follows:

[Figure: task list in the big data development kit]

Click to create a new task:

[Figure: New Task window]

Several task types are available. For a machine learning task, after creating the task you select an experiment that already exists on the machine learning platform and then set the scheduling configuration; the experiment then runs automatically on schedule.

2.2 Model training

After data preprocessing is completed, write the data into the data table marked 1 in the figure below, and then read it with the machine learning platform's data-reading component. The following figure shows the training process:

[Figure: model training experiment; 1 marks the preprocessed data table]

The data is first discretized. One sample of the data is then used to train the logistic regression model and another sample is used to test it; a confusion matrix summarizes the test results.
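The steps above (discretize, split, train with logistic regression, evaluate with a confusion matrix) can be sketched locally in plain Python. This is an illustration under stated assumptions, not the platform's implementation: the data is synthetic, the bin edges are arbitrary, the discretized feature is one-hot encoded (an assumption; the source does not say how bins are fed to the model), and a fixed every-fifth-record split stands in for random sampling to keep the example reproducible.

```python
import math

def discretize(value, edges):
    """Return the index of the bin that value falls into."""
    return sum(value >= edge for edge in edges)

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def score(weights, features):
    """Predicted probability of label 1; weights[0] is the bias term."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression weights by batch gradient descent."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, label in zip(X, y):
            err = score(weights, features) - label
            grad[0] += err
            for j, x in enumerate(features):
                grad[j + 1] += err * x
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights

def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        key = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        counts[key] += 1
    return counts

# Synthetic data: dwell time in seconds, labeled satisfied (1) from 30 s upward.
EDGES = [10, 20, 30, 40, 50]  # hypothetical bin edges
dwell = list(range(60))
labels = [1 if d >= 30 else 0 for d in dwell]
X = [one_hot(discretize(d, EDGES), len(EDGES) + 1) for d in dwell]

# Every fifth record is held out for testing; the rest is used for training.
test_idx = [i for i in range(len(dwell)) if i % 5 == 0]
train_idx = [i for i in range(len(dwell)) if i % 5 != 0]

weights = train_logistic([X[i] for i in train_idx], [labels[i] for i in train_idx])
predictions = [1 if score(weights, X[i]) >= 0.5 else 0 for i in test_idx]
matrix = confusion_matrix([labels[i] for i in test_idx], predictions)
print(matrix)
```

Because the synthetic labels depend only on the bin, the model separates the two classes cleanly; real behavior data will usually produce non-zero false positives and false negatives in the matrix.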

2.3 Model prediction

[Figure: model prediction experiment with tables marked 1, 2, and 3]

After model training is completed, the model can be used for prediction. In the figure above, 1 is the model training data table, 2 is the table of data to be predicted, and 3 is the table that stores the prediction results.
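As a rough local analogue of this step, the sketch below applies a logistic regression model to rows standing in for the table to be predicted and collects the output in a list standing in for the result table. The model weights, field names, and the `predict_satisfied` helper are all hypothetical; on the platform the trained model and tables are wired together in the experiment.

```python
import math

# Hypothetical trained model: a bias plus one weight per feature.
MODEL = {"bias": -1.5, "weights": {"clicks": 0.4, "dwell_seconds": 0.05}}

def predict_satisfied(row, model=MODEL, threshold=0.5):
    """Return the row's id, predicted probability, and 0/1 prediction."""
    z = model["bias"] + sum(w * row[name] for name, w in model["weights"].items())
    probability = 1.0 / (1.0 + math.exp(-z))
    return {
        "user_id": row["user_id"],
        "probability": probability,
        "prediction": int(probability >= threshold),  # 1 = satisfied, 0 = dissatisfied
    }

# Stand-in for the table of data to be predicted.
to_predict = [
    {"user_id": "u1", "clicks": 5, "dwell_seconds": 40},
    {"user_id": "u2", "clicks": 0, "dwell_seconds": 2},
]

# Stand-in for the prediction result storage table.
results = [predict_satisfied(row) for row in to_predict]
```

Keeping the probability alongside the 0/1 prediction lets the displaying project choose a different threshold later without rerunning the model.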

3 Synchronize the prediction results to the RDS database

3.1 Create a data source

Data sources are created in the data integration module of the big data development kit. A data source links the current MaxCompute project to an external data store such as the RDS database:

[Figure: data sources in data integration]

Click to add a new data source:

[Figure: new data source form]

Fill in the fields as required to create the data source.

3.2 Synchronizing data to the RDS database

In the data development module of the big data development kit, click New Task and create a task of the data synchronization type:

[Figure: creating a data synchronization task]

After the task is created, configure it as follows:

[Figure: data synchronization task configuration]

Click Save, then click Submit and Test Run. The data is synchronized to the RDS database immediately.

Once the data is in the RDS database, the project can read the prediction results from the database and display them. This is how the project is combined with the MaxCompute platform.
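On the project side, reading and displaying the results comes down to an ordinary database query. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for RDS, so that it runs without a server; against a real RDS instance you would connect with the driver for its database engine instead. The table name `prediction_result` and its columns are hypothetical.

```python
import sqlite3

def satisfaction_summary(conn):
    """Count predicted satisfied (1) and dissatisfied (0) users."""
    rows = conn.execute(
        "SELECT prediction, COUNT(*) FROM prediction_result GROUP BY prediction"
    ).fetchall()
    counts = {0: 0, 1: 0}
    counts.update(dict(rows))
    return counts

# In-memory stand-in for the RDS table that the synchronization task fills.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prediction_result (user_id TEXT PRIMARY KEY, prediction INTEGER)"
)
conn.executemany(
    "INSERT INTO prediction_result VALUES (?, ?)",
    [("u1", 1), ("u2", 0), ("u3", 1)],
)

summary = satisfaction_summary(conn)
print(f"satisfied: {summary[1]}, dissatisfied: {summary[0]}")
```

The query itself is standard SQL, so only the connection setup should need to change when pointing it at the real database.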

4 Task operation and maintenance

In task operation and maintenance, you can set the scheduling of the tasks created in MaxCompute, that is, configure them as scheduled tasks. Components of the MaxCompute platform, including experiments on the machine learning platform, can all be created as tasks, so task operation and maintenance can be used to run machine learning experiments automatically.

5 Summary

This article has walked through the entire process: DataHub collects the data, the MaxCompute platform processes it, and the results are synchronized to the RDS database for display. Behavior data from the project is processed by MaxCompute and synchronized to the RDS database, from which the project reads the results. In this way a local project is combined with the MaxCompute platform.

