Seven common mistakes to avoid in machine learning

Reprinted from: http://blog.csdn.net/mmc2015/article/details/47322121

In machine learning, there are usually dozens of ways to model any given problem, and each model rests on different assumptions that are hard to judge as reasonable just by looking. Faced with this, most practitioners simply reach for the modeling algorithm they know best. The author argues that an algorithm's assumptions do not necessarily fit the data at hand; to get the best-performing model, it is essential to choose the algorithm that actually suits the dataset, especially for "big data".

The full text follows:

Statistical modeling is a lot like engineering.

In engineering there are many ways to build a key-value store, and each design makes different assumptions about how it will be used. In statistical modeling there are likewise many algorithms for building a classifier, and each algorithm makes its own set of assumptions about the data.

When the data is small, experiments are cheap, so we can try as many algorithms as we like and pick the one that performs best. But with "big data", it pays off to analyze the data up front and then design a matching modeling "pipeline" (pre-processing, modeling, optimization, evaluation, productization).

As mentioned in my previous post, there are dozens of ways to model any given problem. Each model makes different assumptions, and it is not obvious which of them are reasonable. In industry, most practitioners pick the algorithm they are most familiar with rather than the one best suited to the data. In this post, I will share some common mistakes (what to avoid). In a follow-up post, I will cover some best practices (what to do).

1. Taking the default loss function for granted

Many practitioners train and compare models using the default loss function (for example, squared error). In practice, the default loss rarely matches the business objective. Take fraud detection: when flagging fraudulent transactions, the business goal is to minimize the money lost to fraud, yet the default loss of an off-the-shelf binary classifier treats false positives and false negatives as equally harmful. To match the business goal, the loss should not only penalize false negatives more than false positives, it should penalize each missed fraud in proportion to the amount of money involved. Moreover, fraud datasets usually have extremely imbalanced positive and negative classes, so the loss should also favor the rare class (for example, via up/down-sampling).
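As a minimal sketch (assuming scikit-learn and a purely synthetic stand-in for transaction data), one way to move beyond the default symmetric loss is to weight each fraud case by its transaction amount, so that missing a large fraudulent transaction costs the model more than missing a small one:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, highly imbalanced data standing in for transactions (hypothetical).
X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=0)
amount = np.random.default_rng(0).lognormal(mean=4.0, sigma=1.0, size=len(y))

# Per-sample weights: errors on fraud cases cost in proportion to the amount.
sample_weight = np.where(y == 1, amount, 1.0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=sample_weight)
```

This is only one way to encode the business cost; up/down-sampling the rare class, as mentioned above, is another.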

2. Using a plain linear model for non-linear interactions

When asked to build a binary classifier, many people immediately reach for logistic regression because it is simple. But they forget that logistic regression is a linear model, so non-linear interactions between features have to be encoded by hand. Returning to the fraud detection example, getting good results requires high-order interaction features such as "billing address = shipping address AND transaction amount < $50". So when the problem involves such interactions, prefer a non-linear model, such as an SVM with a non-linear kernel or a tree-based classifier.
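A minimal sketch (assuming scikit-learn, on a synthetic XOR-style dataset) of how a linear classifier struggles with an interaction that kernel and tree-based models capture without manual feature engineering:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
# XOR-style interaction: positive class when the two features share a sign.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

for model in (LogisticRegression(), SVC(kernel="rbf"), GradientBoostingClassifier()):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))
```

The linear model hovers near chance on this data, while the kernel SVM and the boosted trees pick up the interaction on their own.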

3. Forgetting about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be ignored entirely. Take revenue forecasting: if we observe unusual spikes in revenue, it is probably worth paying extra attention to them and investigating what caused those peaks. But if the outliers are due to mechanical error, measurement error, or any other factor that will not generalize, it is best to filter them out of the training data beforehand.

Some modeling algorithms are far more sensitive to outliers than others. For instance, AdaBoost pays "serious attention" to outliers, assigning them very large weights, whereas a decision tree simply treats each outlier as one misclassification. If the dataset contains a considerable number of outliers, it is important either to use a modeling algorithm that is robust to them or to filter the outliers out before training.
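As a minimal sketch (assuming scikit-learn, with purely synthetic numbers), here is a comparison of ordinary least squares against a robust alternative, HuberRegressor, on data where a few points are corrupted by a non-generalizable error:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=200)

# Corrupt a handful of points at the high end, e.g. a measurement glitch.
idx = np.argsort(X.ravel())[-5:]
y[idx] += 100.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)
print("OLS slope:  ", ols.coef_[0])    # pulled upward by the corrupted points
print("Huber slope:", huber.coef_[0])  # stays much closer to the true slope of 3
```

Filtering the corrupted rows out before fitting would serve equally well; the point is simply not to let a handful of bad points dominate a sensitive model.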

4. Using a high-variance model when the number of samples is much smaller than the number of features

SVM is one of the most popular modeling algorithms, and one of its strengths is the ability to fit the model with different kernels. A kernel can be viewed as a way of spontaneously combining existing features into a higher-dimensional feature space. Because this power seems to come almost for free, most people simply use the default kernel when training an SVM. However, when the number of training samples is much smaller than the number of features (n << p), which is common in medical data, the risk of overfitting in the high-dimensional feature space rises sharply. In fact, in this situation high-variance models should be avoided altogether.
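A minimal sketch (assuming scikit-learn and a synthetic dataset) of the n << p situation, comparing a flexible RBF-kernel SVM with a heavily regularized linear SVM; under the argument above, the lower-variance linear model is the safer choice here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# 50 samples, 5000 features: only a handful of features are actually informative.
X, y = make_classification(n_samples=50, n_features=5000, n_informative=10,
                           random_state=0)

rbf_svm = SVC(kernel="rbf")
linear_svm = LinearSVC(C=0.01, max_iter=10_000)

print("RBF SVM:   ", cross_val_score(rbf_svm, X, y, cv=5).mean())
print("Linear SVM:", cross_val_score(linear_svm, X, y, cv=5).mean())
```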

5. L1/L2 regularization without standardization

Applying L1 or L2 regularization to penalize large coefficients is a common way to regularize linear or logistic regression. However, many people who use these regularization methods are unaware of the importance of standardizing the features first.

Returning to fraud detection, imagine a linear regression model with transaction amount as a feature. Without regularization, if the amount is expressed in dollars, the fitted coefficient will be roughly 100 times larger than if it were expressed in cents. Because L1/L2 regularization penalizes large coefficients more, the transaction-amount feature will be penalized more heavily when its unit is dollars. The regularization is therefore not even-handed: it tends to penalize features whose values lie on smaller scales. To mitigate this, standardize all features during pre-processing so they are on an equal footing.
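A minimal sketch (assuming scikit-learn and synthetic data) of the unit-dependence described above: the same ridge penalty produces very different coefficients for dollars versus cents unless the feature is standardized first:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
amount_dollars = rng.uniform(1, 500, size=(1000, 1))
y = 0.05 * amount_dollars.ravel() + rng.normal(scale=1.0, size=1000)

for unit, X in [("dollars", amount_dollars), ("cents", amount_dollars * 100)]:
    raw_coef = Ridge(alpha=10.0).fit(X, y).coef_[0]
    scaled = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y)
    print(unit, "raw coef:", raw_coef, "| standardized coef:", scaled[-1].coef_[0])
```

The raw coefficients differ by a factor of about 100 between the two unit choices, while the standardized coefficients agree, so the penalty is applied consistently.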

6. Using a linear model without considering collinear features

Imagine building a linear model with two variables, X1 and X2, where the true model is Y = X1 + X2. Ideally, if the data contains only a small amount of noise, linear regression will recover the true model. However, if X1 and X2 are collinear, then for most optimization algorithms Y = 2*X1, Y = 3*X1 - X2, and Y = 100*X1 - 99*X2 all fit the data equally well. This may not bias the predictions, so it can look harmless, but it makes the problem ill-conditioned and the coefficients impossible to interpret.
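As a minimal sketch (assuming scikit-learn/NumPy and synthetic data), here X1 and X2 are almost perfectly collinear, so the fitted coefficients swing wildly from one sample of the data to the next even though the predictions remain fine:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

for seed in range(3):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=500)
    x2 = x1 + rng.normal(scale=1e-3, size=500)      # nearly identical to x1
    y = x1 + x2 + rng.normal(scale=0.1, size=500)   # true model: Y = X1 + X2
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    print("coefficients:", model.coef_)             # unstable across resamples
```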

7. Interpreting the absolute value of coefficients from a linear or logistic regression as feature importance

Because many off-the-shelf linear regression packages return a p-value for each coefficient, many people come to believe that the larger a coefficient's absolute value, the more important the corresponding feature. This is rarely true, because (a) rescaling a variable changes the absolute value of its coefficient, and (b) if features are collinear, weight can shift from one feature's coefficient to another's. Moreover, the more features a dataset contains, the more likely it is that some of them are collinear, and the less reliable coefficient magnitudes become as a measure of feature importance.
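A minimal sketch (assuming scikit-learn and synthetic data) of point (a): rescaling a feature changes the absolute value of its fitted coefficient without changing what the model has learned, so magnitude alone is not importance:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Large C effectively disables regularization so only scaling matters here.
clf_a = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0           # same information, different units
clf_b = LogisticRegression(C=1e6, max_iter=10_000).fit(X_rescaled, y)

print("original coefs:", clf_a.coef_[0])
print("rescaled coefs:", clf_b.coef_[0])   # first coefficient shrinks ~1000x
```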

These are seven common mistakes in practical machine learning. The list is not exhaustive; it is only meant to get readers thinking about the fact that a modeling algorithm's assumptions do not necessarily fit the data at hand. To get the best-performing model, it is important to choose the algorithm that suits the data, not the one you happen to know best.

Original link: http://ml.posthaven.com/machine-learning-done-wrong
