[Machine Learning Notes] (Reprint) Workflow of a complete machine learning project

Workflow of a complete machine learning project

Original blog post: https://ask.julyedu.com/question/7013

1 Abstract the task into a mathematical problem

Identifying the problem is the first step in machine learning. Training a machine learning model is usually very time-consuming, and the cost of blind trial and error is high.

Abstracting the task into a mathematical problem means clarifying what kind of data we can obtain, and whether the goal is a classification, regression, or clustering problem; if it is none of these, how to reduce it to one of these problem types.

2 Get data

The data determines the upper limit of what machine learning can achieve; algorithms can only approach that upper limit.

The data must be representative of the real distribution; otherwise the model will inevitably overfit.

Moreover, for classification problems, class imbalance must not be too severe: the number of samples in different classes should not differ by several orders of magnitude.

There should also be an assessment of the scale of the data: from the number of samples and the number of features, estimate the memory consumption and judge whether the data will fit in memory during training. If it will not fit, consider improving the algorithm or applying dimensionality-reduction techniques. If the volume of data is too large, consider a distributed approach.
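The memory estimate above is simple arithmetic: samples × features × bytes per value. A minimal back-of-the-envelope sketch (the function name and chosen sizes are illustrative, not from the original post):

```python
import numpy as np

def estimate_memory_gb(n_samples: int, n_features: int, dtype=np.float64) -> float:
    """Estimate the in-memory size of a dense feature matrix in gigabytes."""
    bytes_per_value = np.dtype(dtype).itemsize
    return n_samples * n_features * bytes_per_value / 1024**3

# 10 million samples with 100 float64 features: ~7.45 GB
size = estimate_memory_gb(10_000_000, 100)
print(f"{size:.2f} GB")
```

If such an estimate exceeds available RAM, that is the signal to switch to a smaller dtype (e.g. `float32`), sparse matrices, dimensionality reduction, or distributed training.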

3 Feature preprocessing and feature selection

Good data is only truly effective when good features can be extracted from it.

Feature preprocessing and data cleaning are crucial steps that can often significantly improve both the effectiveness and the performance of an algorithm. Normalization, discretization, factorization, missing-value handling, removing collinearity, and so on consume a large share of the time in data mining. These tasks are simple and reproducible, with stable and predictable returns, and are a basic, essential step of machine learning.

Screening out salient features and discarding non-significant ones requires machine learning engineers to understand the business deeply and repeatedly. This has a decisive impact on the results: once good features are selected, even very simple algorithms can produce good, stable results. It calls for feature-validity analysis techniques such as correlation coefficients, chi-square tests, average mutual information, conditional entropy, posterior probabilities, and logistic regression weights.
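One of the techniques listed above, the chi-square test, is available directly in scikit-learn for univariate feature selection. A minimal sketch on the Iris dataset (the choice of dataset and `k=2` are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the strongest chi-square association with the label
# (chi2 requires non-negative feature values, which Iris satisfies)
selector = SelectKBest(chi2, k=2)
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)  # (150, 2)
```

The per-feature scores in `selector.scores_` give the validity ranking the article refers to; mutual information (`mutual_info_classif`) can be swapped in the same way.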

4 Training models and tuning

Only at this step are the algorithms mentioned above actually used for training. Many algorithms can now be used as packaged black boxes, but the real test is tuning their (hyper)parameters to make the results better, and that requires a deep understanding of the algorithm's principles. The deeper the understanding, the easier it is to find the crux of a problem and propose a good tuning plan.
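Hyperparameter tuning of a black-box algorithm is commonly organized as a cross-validated grid search. A minimal sketch with an SVM on Iris (the grid values are illustrative, not prescriptive):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination of C and gamma, scoring each by 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Understanding what `C` and `gamma` control (margin softness and kernel width) is exactly the "deep understanding of the principles" the article insists on: it tells you which region of the grid is worth refining.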

5 Model diagnosis

How do we determine the direction and approach of model tuning? This requires techniques for diagnosing the model.

Judging between overfitting and underfitting is a crucial step in model diagnosis. Common methods include cross-validation and plotting learning curves. The basic remedies for overfitting are to increase the amount of data and reduce the model's complexity; the basic remedies for underfitting are to increase the number and quality of features and increase the model's complexity.
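The learning-curve diagnosis mentioned above compares training and validation scores as the training set grows. A minimal sketch (dataset and model choices are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# A large, persistent gap between train and validation scores suggests
# overfitting; low scores on both suggest underfitting.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n, g in zip(train_sizes, gap):
    print(f"n={n:5d}  train-val gap={g:.3f}")
```

If the gap shrinks as `n` grows, more data is likely to help; if both curves plateau at a low score, the remedy is richer features or a more complex model, matching the two tuning directions above.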

Error analysis is also a crucial step in machine learning. By examining the misclassified samples, fully analyze the cause of each error: is it a parameter problem or an algorithm-selection problem, a feature problem or a problem with the data itself ...

After diagnosis the model is tuned, and the newly tuned model must then be diagnosed again. This is a process of repeated iteration and continuous approximation, requiring constant experimentation to approach the optimal state.

6 Model fusion

Generally speaking, fusing models improves the results, and it works well in practice.

In engineering, the main ways to improve an algorithm's accuracy are to work on the front end of the model (feature cleaning and preprocessing, different sampling schemes) and on the back end (model fusion), because these are more standard and reproducible and their effects are more stable. Comparatively little effort goes into direct parameter tuning: with large amounts of data, training is too slow and the payoff is hard to guarantee.
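A simple form of the back-end model fusion described above is voting over heterogeneous models. A minimal sketch combining a linear model with a random forest (the component models and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Soft voting averages the predicted class probabilities of the members
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
], voting="soft")

score = cross_val_score(ensemble, X, y, cv=5).mean()
print(round(score, 3))
```

Stacking and blending follow the same idea with a learned combiner instead of a fixed vote; all of them are "back-end" work in the article's sense because the member models stay untouched.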

7 Go online

This part mainly concerns project deployment. Engineering is result-oriented, and how the model performs online directly determines its success or failure. This includes not only its accuracy and error, but also its running speed (time complexity), resource consumption (space complexity), and whether its stability is acceptable.

This workflow mainly summarizes experience from engineering practice. Not every project follows the complete process; what is presented here is only a guide. Only with more practice and more accumulated project experience will you develop a deeper understanding.

For this reason, every ML algorithm class at July Online has added courses on feature engineering, model tuning, and related topics.



Origin blog.csdn.net/seagal890/article/details/105260007