The complete process of a machine learning project

1 Abstract the task into a mathematical problem

Defining the problem clearly is the first step in machine learning. Training a model is usually very time-consuming, so the time cost of trying things at random is very high.

Abstracting into a mathematical problem here means being clear about what kind of data we can obtain, and whether the target problem is a regression, classification, or clustering problem.

2 Get the data

Data determines the upper limit of what machine learning can achieve, and the algorithm merely approaches this limit as closely as possible.

The data must be representative; otherwise the model will inevitably overfit.

For classification problems, the class skew must not be too severe: the number of samples in different classes should not differ by orders of magnitude.

You should also assess the scale of the data. The number of samples and the number of features let you estimate memory consumption and judge whether the data will fit in memory during training. If it will not fit, consider improving the algorithm or using dimensionality reduction techniques. If the amount of data is too large, you have to consider a distributed approach.
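The order-of-magnitude rule above can be checked before any training starts. The following is a minimal sketch using only the standard library; the function name `imbalance_ratio` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (count of the largest class) / (count of the smallest class)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy label column: 5 positive samples against 995 negative ones.
labels = ["spam"] * 5 + ["ham"] * 995
ratio = imbalance_ratio(labels)

# A ratio of 10 or more means at least one order of magnitude of difference.
print(f"imbalance ratio: {ratio:.0f}, severe skew: {ratio >= 10}")
```

If the ratio is large, resampling or class weighting are common remedies to look into.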
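As a rough illustration of the memory estimate described above, the sketch below assumes a dense matrix of 8-byte floats. The `overhead` factor is a hypothetical safety margin for copies made during training, not a measured value.

```python
def dense_matrix_bytes(n_samples, n_features, bytes_per_value=8):
    """Memory needed for a dense feature matrix (8-byte floats by default)."""
    return n_samples * n_features * bytes_per_value

def fits_in_memory(n_samples, n_features, available_bytes, overhead=3.0):
    # `overhead` is an assumed multiplier for temporary copies during training.
    return dense_matrix_bytes(n_samples, n_features) * overhead <= available_bytes

GIB = 1024 ** 3
needed = dense_matrix_bytes(10_000_000, 200)
print(f"raw matrix: {needed / GIB:.1f} GiB")
print("fits in 32 GiB:", fits_in_memory(10_000_000, 200, available_bytes=32 * GIB))
```

When the check fails, the options named above apply: a more memory-friendly algorithm, dimensionality reduction, or distributed training.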

3 Feature preprocessing and feature selection

Good data only plays its role if good features can be extracted from it.

Feature preprocessing and data cleaning are critical steps, and they often significantly improve the effect and performance of the algorithm. Normalization, discretization, factorization, handling missing values, removing collinearity and so on take up a large share of the time in the data mining process. This work is simple and reproducible, its gains are stable and predictable, and it is an essential basic step in machine learning.

Screening out significant features and discarding insignificant ones requires the machine learning engineer to understand the business thoroughly. This has a decisive influence on many results. With good feature selection, even a very simple algorithm can produce good and stable results. This requires techniques for analyzing feature validity, such as the correlation coefficient, chi-square test, average mutual information, conditional entropy, posterior probability, and logistic regression weights.
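Two of the preprocessing steps named above, missing-value handling and normalization, can be sketched in a few lines. This is a minimal illustration that assumes mean imputation and min-max scaling; the helper names and the `ages` column are made up for the example.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

ages = [20, None, 40, 60]
clean = min_max_scale(impute_mean(ages))
print(clean)  # [0.0, 0.5, 0.5, 1.0]
```

In a real project the mean, minimum, and maximum would be computed on the training set only and then reused on validation and test data, so that no information leaks from them.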
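Of the techniques listed above, the correlation coefficient is the simplest to show. The sketch below ranks features by the absolute Pearson correlation with the target; the feature names and values are invented for illustration, and a linear correlation will miss nonlinear relationships.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)

def rank_features(features, target):
    """Sort feature names by |correlation with target|, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

features = {
    "size": [1.0, 2.0, 3.0, 4.0],      # perfectly linear in the target
    "noise": [2.0, 1.0, 2.0, 1.0],     # weakly related
    "constant": [5.0, 5.0, 5.0, 5.0],  # carries no information
}
target = [10.0, 20.0, 30.0, 40.0]

ranking = rank_features(features, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```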

4 Train and tune the model

Only at this step do we use the algorithms mentioned above for training. Many algorithms are now packaged as black boxes for people to use. But the real test of skill is tuning the (hyper)parameters of these algorithms so that the results improve. This requires a deep understanding of the principles behind the algorithm. The deeper the understanding, the easier it is to find the crux of the problem and to propose a good tuning plan.
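Hyperparameter tuning can be illustrated with a tiny grid search. The sketch below assumes a one-dimensional k-nearest-neighbour classifier with binary labels and a held-out validation set; the data, the candidate values of `k`, and all function names are invented for the example.

```python
def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def accuracy(train, validation, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

def grid_search(train, validation, candidates):
    """Score every candidate k and return (best_k, scores); ties keep the first."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(candidates, key=lambda k: scores[k])
    return best_k, scores

# The true boundary is near x = 5; (3.0, 1) and (7.0, 0) are noisy labels.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0),
         (6.0, 1), (7.0, 0), (8.0, 1), (9.0, 1)]
validation = [(1.5, 0), (2.9, 0), (3.2, 0), (6.8, 1), (7.3, 1), (8.5, 1)]

# Odd k values avoid tied votes with two classes.
best_k, scores = grid_search(train, validation, candidates=[1, 3, 5])
print("best k:", best_k, "scores:", scores)
```

Here k = 1 copies the noisy labels, while larger k smooths them out. Knowing why that happens is the kind of understanding of the algorithm the paragraph above refers to.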

5 Diagnose the model

How do we decide the direction and approach for tuning the model? This requires techniques for diagnosing the model.

Judging whether the model is overfitting or underfitting is a crucial step in model diagnosis. Common methods include cross-validation and plotting learning curves. The basic idea for fixing overfitting is to increase the amount of data and reduce the complexity of the model. The basic idea for fixing underfitting is to increase the quantity and quality of the features and increase the complexity of the model.

Error analysis is also a critical step in machine learning. By observing the samples the model gets wrong, analyze the causes of the errors comprehensively: is it a problem with the parameters or with the choice of algorithm, a problem with the features or with the data itself ...

After diagnosis the model needs to be tuned, and the newly tuned model needs to be diagnosed again. This is an iterative process of continual approximation that requires repeated attempts to reach the optimal state.
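The diagnosis described above can be sketched as two small helpers: one that produces cross-validation splits, and one that reads training and validation error side by side. The thresholds `target_error` and `max_gap` are illustrative assumptions that would have to be chosen per project.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one.

    Indices are not shuffled here, so shuffle the data first if it is ordered.
    """
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def diagnose(train_error, val_error, target_error=0.10, max_gap=0.05):
    """Rough reading of the two errors; both thresholds are assumptions."""
    if train_error > target_error:
        return "underfitting"  # the model cannot even fit the training data
    if val_error - train_error > max_gap:
        return "overfitting"   # fits training data but does not generalize
    return "ok"

ADVICE = {
    "overfitting": "add data, reduce model complexity",
    "underfitting": "add or improve features, increase model complexity",
    "ok": "no obvious fit problem",
}

for train_err, val_err in [(0.30, 0.32), (0.02, 0.25), (0.05, 0.07)]:
    verdict = diagnose(train_err, val_err)
    print(train_err, val_err, verdict, "->", ADVICE[verdict])
```

Averaging the errors over the folds from `k_fold_indices` gives a more stable input to `diagnose` than a single split.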
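A simple starting point for error analysis is to collect the misclassified samples and count which confusions occur most often. The sketch below uses invented labels and sample identifiers; `error_report` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def error_report(samples, y_true, y_pred):
    """Return (confusion counts, list of wrong samples) for manual inspection."""
    confusions = Counter()
    wrong = []
    for sample, truth, pred in zip(samples, y_true, y_pred):
        if truth != pred:
            confusions[(truth, pred)] += 1
            wrong.append((sample, truth, pred))
    return confusions, wrong

samples = ["img_a", "img_b", "img_c", "img_d", "img_e"]
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "cat", "cat"]

confusions, wrong = error_report(samples, y_true, y_pred)
print("most common confusion:", confusions.most_common(1))
for sample, truth, pred in wrong:
    print(f"{sample}: expected {truth}, got {pred}")
```

Looking at the listed samples by hand is what reveals whether the cause lies in the features, the data, or the model.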

6 Model fusion

In general, fusing models improves the results to some degree, and it works well.

In engineering, the main ways to improve the accuracy of an algorithm are to put effort into the front end of the model (feature cleaning and preprocessing, different sampling modes) and the back end (model fusion). Because these steps are relatively standard and can be replicated, their effect is relatively stable. Direct parameter tuning is done less often: training on large amounts of data is slow, and the effect is hard to guarantee.
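One of the simplest forms of model fusion is majority voting over the predictions of several classifiers. The sketch below uses three made-up prediction lists; it is only one fusion strategy among several, and the gain it shows depends on the models making different mistakes.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-model label lists (all the same length) by majority vote."""
    fused = []
    for votes in zip(*predictions):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

truth = [1, 0, 1, 1]
model_a = [0, 0, 1, 1]  # wrong on the first sample
model_b = [1, 1, 1, 1]  # wrong on the second sample
model_c = [1, 0, 0, 1]  # wrong on the third sample

fused = majority_vote([model_a, model_b, model_c])
print("fused:", fused, "matches truth:", fused == truth)
```

An odd number of models avoids tied votes for binary labels.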

7 Run online

This part is closely tied to the engineering implementation. Engineering is results-oriented, and the performance of the model running online directly determines its success or failure. This includes not only accuracy and error, but also running speed (time complexity), resource consumption (space complexity), and whether stability is acceptable.

The workflow above is mainly a summary of experience from engineering practice. Not every project contains the complete process. It is only a guide: only by practicing more and accumulating project experience will you develop a deeper understanding of your own.
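Running speed can be measured before the model goes online. The sketch below times repeated calls with the standard library; `dummy_predict` is a hypothetical stand-in for a real model's prediction function, and the percentile calculation is deliberately crude.

```python
import statistics
import time

def measure_latency(predict, inputs, repeats=5):
    """Time each call to `predict` and return median and 95th-percentile seconds."""
    timings = []
    for x in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            predict(x)
            timings.append(time.perf_counter() - start)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "calls": len(timings),
        "median_s": statistics.median(timings),
        "p95_s": timings[p95_index],
    }

def dummy_predict(x):
    # Stand-in for a real model: some arithmetic proportional to the input size.
    return sum(value * value for value in x)

stats = measure_latency(dummy_predict, inputs=[[1.0] * 1000, [2.0] * 1000, [3.0] * 1000])
print(stats)
```

Memory use and stability under load need their own measurements; the standard library's `tracemalloc` module is one place to start for the former.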

Origin www.cnblogs.com/xin-qing3/p/11410288.html