Statistical modeling is a lot like engineering.

　　In engineering, there are various ways to build a key-value store, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.

　　When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.

　　As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model assumes something different, and it’s not obvious how to navigate them and identify which assumptions are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than the one that best suits the data. In this post, I would like to share some common mistakes (the don'ts). I’ll save some of the best practices (the dos) for a future post.

1. Take the default loss function for granted

　　Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, an off-the-shelf loss function rarely aligns with the business objective. Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. The off-the-shelf loss function of a binary classifier weighs false positives and false negatives equally. To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount. Also, data sets in fraud detection usually contain highly imbalanced labels. In these cases, bias the loss function in favor of the rare class (e.g., through up/down sampling).
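
　　As a minimal sketch of the idea, the log loss below weights each missed fraud by its dollar amount and each false alarm by a flat cost. The function name, the `fp_cost` parameter, and the sample amounts are assumptions made for illustration, not part of any particular library.

```python
import math

def cost_weighted_log_loss(y_true, p_pred, amounts, fp_cost=1.0):
    """Average log loss where each fraud case (y=1) is weighted by its
    dollar amount and each legitimate case (y=0) by a flat fp_cost."""
    eps = 1e-12  # keep log() away from zero
    total = 0.0
    for y, p, amount in zip(y_true, p_pred, amounts):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += amount * -math.log(p)          # missed fraud scales with the amount
        else:
            total += fp_cost * -math.log(1.0 - p)   # false alarm costs a flat fee
    return total / len(y_true)

# The same weak score (p = 0.1) on a $5,000 fraud and on a $5 fraud.
loss_big = cost_weighted_log_loss([1], [0.1], [5000.0])
loss_small = cost_weighted_log_loss([1], [0.1], [5.0])
```

　　With this weighting, missing the $5,000 fraud is penalized 1,000 times more than missing the $5 one, which is closer to how the business would count the loss.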

2. Use plain linear models for non-linear interactions

　　When building a binary classifier, many practitioners immediately jump to logistic regression because it’s simple. But many also forget that logistic regression is a linear model, so non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance. So one should prefer non-linear models, such as an SVM with a kernel or tree-based classifiers, that bake in higher-order interaction features.
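
　　A toy sketch of why the interaction has to be encoded by hand. The data and the XOR-style label are made up for illustration, and ordinary least squares stands in for logistic regression to keep the example short; the point is only that a model using main effects alone cannot express a pattern that lives in the product of two predictors.

```python
import numpy as np

# Toy data: two binary predictors and a label that depends only on their
# interaction (an XOR-style pattern, chosen for illustration).
a = np.array([0.0, 0.0, 1.0, 1.0])
b = np.array([0.0, 1.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 1.0, 0.0])

def fit_predict(features, target):
    """Least-squares fit with an intercept; returns in-sample predictions."""
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X @ coef

pred_linear = fit_predict([a, b], y)         # main effects only
pred_inter = fit_predict([a, b, a * b], y)   # plus a hand-coded interaction
```

　　With main effects only, every prediction collapses to 0.5; once the product term is added, the fit reproduces the labels exactly.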

3. Forget about outliers

　　Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.

　　Some models are more sensitive to outliers than others. For instance, AdaBoost might treat those outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one false classification. If the data set contains a fair number of outliers, it's important to either use a modeling algorithm that is robust against outliers or filter the outliers out.
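
　　One possible way to filter outliers, sketched under the assumption that the data is a single numeric column: drop points whose modified z-score (based on the median and the median absolute deviation) exceeds a cutoff. The function name, the 3.5 cutoff, and the sample readings are illustrative choices, and whether a flagged point should really be dropped still depends on the context discussed above.

```python
import statistics

def filter_outliers_mad(values, threshold=3.5):
    """Keep points whose modified z-score is at most `threshold`."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # no spread to measure against; keep everything
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data is roughly normal.
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0]  # 55.0 plays the faulty reading
cleaned = filter_outliers_mad(readings)
```

　　The median and MAD are used instead of the mean and standard deviation because the outliers themselves would otherwise inflate the yardstick used to detect them.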

4. Use a high-variance model when n<<p

　　SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when the data has n<<p (number of samples << number of features) -- common in domains such as medical data -- the richer feature space implies a much higher risk of overfitting the data. In fact, high-variance models should be avoided entirely when n<<p.
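
　　A small simulation of the overfitting risk, under simplifying assumptions: the features and labels are random noise, and a plain least-squares fit stands in for a flexible model (it is not an SVM). With far more features than samples, even this linear fit can reproduce the training labels exactly while learning nothing that carries over to new data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                      # far fewer samples than features

X_train = rng.normal(size=(n, p))
y_train = rng.choice([-1.0, 1.0], size=n)   # labels are pure noise

# With p > n the minimum-norm least-squares solution interpolates the labels.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_acc = np.mean(np.sign(X_train @ w) == y_train)

# Fresh noise drawn the same way: there is nothing to generalize to.
X_test = rng.normal(size=(1000, p))
y_test = rng.choice([-1.0, 1.0], size=1000)
test_acc = np.mean(np.sign(X_test @ w) == y_test)
```

　　Training accuracy is perfect even though the labels carry no signal, while accuracy on the fresh sample stays near chance. A kernel that enlarges the feature space further only makes such a perfect but meaningless fit easier to reach.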

5. L1/L2/... regularization without standardization

　　Applying an L1 or L2 penalty to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying such regularization.

　　Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the transaction amount is measured in dollars, the fitted coefficient is going to be around 100 times larger than the fitted coefficient if the amount were measured in cents. With regularization, since L1 and L2 penalize larger coefficients more, the transaction amount will get penalized more if the unit is dollars. Hence, the regularization is biased and tends to penalize features on smaller scales. To mitigate the problem, standardize all the features as a preprocessing step so that they are on an equal footing.
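
　　The effect can be sketched with a one-feature ridge (L2) regression in closed form. The simulated amounts, the target, and the penalty strength `lam` are all assumptions picked for the demonstration; in practice the penalty would be tuned, and it would be re-tuned after standardizing.

```python
import numpy as np

def ridge_slope(x, y, lam):
    """Slope of a one-feature ridge regression fitted on centered data."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc + lam)

def standardize(x):
    """Rescale to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
dollars = rng.uniform(1.0, 500.0, size=200)   # hypothetical amounts
y = 0.02 * dollars + rng.normal(size=200)     # hypothetical target
cents = 100.0 * dollars

lam = 1e6  # one penalty strength, chosen arbitrarily for the demo

# Fraction of the unpenalized slope that survives the penalty.
kept_dollars = ridge_slope(dollars, y, lam) / ridge_slope(dollars, y, 0.0)
kept_cents = ridge_slope(cents, y, lam) / ridge_slope(cents, y, 0.0)

# After standardization the unit no longer matters.
slope_std_dollars = ridge_slope(standardize(dollars), y, lam)
slope_std_cents = ridge_slope(standardize(cents), y, lam)
```

　　The same penalty removes a noticeable share of the slope when the amount is in dollars but leaves it almost untouched when the amount is in cents. Once the feature is standardized, both versions receive exactly the same treatment.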

6. Use a linear model without considering multi-collinear predictors

　　Imagine building a linear model with two variables X1 and X2, and suppose the ground truth model is Y=X1+X2. Ideally, if the data is observed with a small amount of noise, the linear regression solution would recover the ground truth. However, if X1 and X2 are collinear (for instance, X2 is almost identical to X1), then as far as most optimization algorithms are concerned, Y=2X1, Y=3X1-X2 or Y=100X1-99X2 are all equally good. The problem might not be detrimental, as it doesn't bias the predictions. However, it does make the problem ill-conditioned and makes the coefficient weights uninterpretable.

7. Interpret the absolute value of coefficients from linear or logistic regression as feature importance

　　Because many off-the-shelf linear regressors return a p-value for each coefficient, many practitioners believe that for linear models, the bigger the absolute value of the coefficient, the more important the corresponding feature is. This is rarely true, as (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multi-collinear, coefficients can shift from one feature to another. Also, the more features the data set has, the more likely the features are to be multi-collinear, and the less reliable it is to interpret feature importance from coefficients.
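
　　Point (a) can be illustrated with a small simulated regression. The two features, their names, and the true coefficients are hypothetical; the only thing that changes between the two fits is the unit of the amount.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
amount_dollars = rng.uniform(1.0, 500.0, size=n)   # hypothetical feature
age_years = rng.uniform(0.0, 10.0, size=n)         # hypothetical feature
y = 5.0 * amount_dollars + 3.0 * age_years + rng.normal(size=n)

def ols_coefs(*features):
    """Ordinary least squares with an intercept; returns the feature slopes."""
    X = np.column_stack([np.ones(n), *features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

coef_dollars = ols_coefs(amount_dollars, age_years)          # amount in dollars
coef_cents = ols_coefs(amount_dollars * 100.0, age_years)    # amount in cents
```

　　With the amount in dollars, its coefficient is the larger of the two; with the amount in cents, the coefficient shrinks by a factor of 100 and the ranking by absolute value flips, even though the model and its predictions are the same.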

　　So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.

7 Common Mistakes in Machine Learning Practice