8 Worst Predictive Modeling Techniques

Most of the techniques below have been around for a long time, and over the past ten years most of their shortcomings have been addressed, so the updated versions perform far better than the originals. Yet the flawed original techniques are still widely used.

1. Linear regression
  Relies on normality, homoscedasticity, and other assumptions, and cannot capture highly nonlinear, chaotic patterns. It is prone to overfitting, its parameters are difficult to interpret, and it becomes severely unstable when the independent variables are highly correlated. Fixes include reducing the number of variables, transforming the variables, and using constrained regression (e.g., Ridge or Lasso regression).
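
To make the fix concrete, here is a minimal sketch, assuming scikit-learn and synthetic data (the alpha values are arbitrary), showing how Ridge and Lasso stabilize the coefficients when two predictors are almost perfectly correlated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data with two almost perfectly correlated predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Ordinary least squares: coefficients can explode in opposite directions.
print("OLS   ", LinearRegression().fit(X, y).coef_)

# Ridge (L2) and Lasso (L1) shrink the coefficients and stabilize the fit.
print("Ridge ", Ridge(alpha=1.0).fit(X, y).coef_)
print("Lasso ", Lasso(alpha=0.1).fit(X, y).coef_)
```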

2. Traditional decision trees
  A single large tree is unstable, hard to interpret, and prone to overfitting. The fix is to combine many small decision trees into an ensemble instead of growing one large tree.
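
A minimal sketch of the fix, assuming scikit-learn and a synthetic dataset (the depth limit and tree count are illustrative choices): compare one fully grown tree with an ensemble of many shallow trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One large, fully grown tree: flexible but unstable and easy to overfit.
big_tree = DecisionTreeClassifier(random_state=0)

# Many small trees combined: each is weak, but the ensemble is more stable.
forest = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)

print("single tree:", cross_val_score(big_tree, X, y, cv=5).mean())
print("forest     :", cross_val_score(forest, X, y, cv=5).mean())
```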

3. Linear discriminant analysis
  Used for supervised clustering. It is a poor technique because it assumes that the clusters do not overlap and are well separated by hyperplanes, which is practically never the case. Density estimation techniques should be used instead.
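
As an illustration of the density-based alternative, here is a hedged sketch of a classifier that fits one kernel density estimate per class and applies Bayes' rule, instead of assuming hyperplane-separated classes (scikit-learn, synthetic two-class data; the bandwidth of 0.3 is an arbitrary assumption):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KernelDensity

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Fit one kernel density estimate per class, plus the class priors.
kdes, priors = {}, {}
for c in np.unique(y):
    kdes[c] = KernelDensity(bandwidth=0.3).fit(X[y == c])
    priors[c] = np.mean(y == c)

def predict(points):
    # Bayes rule: pick the class with the largest prior * density.
    log_post = np.column_stack(
        [np.log(priors[c]) + kdes[c].score_samples(points) for c in sorted(kdes)]
    )
    return np.argmax(log_post, axis=1)

print("training accuracy:", np.mean(predict(X) == y))
```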

4. K-means clustering
  Tends to produce circular (spherical) clusters and does not work well with data points that are not drawn from a mixture of Gaussian distributions.
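
The article does not name a specific fix here, but one common alternative is a Gaussian mixture model with full covariance matrices, which can recover elongated clusters that K-means splits into circular pieces. A quick sketch on synthetic data (the cluster shapes and parameters are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Two elongated, tilted clusters: not the circular blobs K-means prefers.
rng = np.random.default_rng(0)
cov = [[4.0, 3.5], [3.5, 4.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 300),
               rng.multivariate_normal([6, 0], cov, 300)])
labels = np.array([0] * 300 + [1] * 300)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X).predict(X)

print("k-means ARI:", adjusted_rand_score(labels, km))
print("GMM ARI    :", adjusted_rand_score(labels, gmm))
```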

5. Neural networks
  Difficult to interpret, unstable, and prone to overfitting.

6. Maximum likelihood estimation
  Requires your data to fit a pre-specified probability distribution. It is not data-driven, and in many cases the pre-specified Gaussian distribution is a poor fit for the data.
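
A small illustration of the point, assuming scipy and synthetic skewed data: fit a Gaussian by maximum likelihood (scipy's norm.fit) and check it with a Kolmogorov-Smirnov test, which flags the mismatch.

```python
import numpy as np
from scipy import stats

# Skewed (log-normal) data: a Gaussian is the wrong pre-specified model.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=0.8, size=1000)

# Maximum likelihood fit of a Gaussian (norm.fit gives the MLE of loc/scale).
loc, scale = stats.norm.fit(data)

# Goodness of fit: a tiny p-value signals the Gaussian does not fit.
ks = stats.kstest(data, "norm", args=(loc, scale))
print("fitted loc/scale:", loc, scale)
print("KS statistic:", ks.statistic, "p-value:", ks.pvalue)
```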

7. High-dimensional density estimation
  Suffers from the curse of dimensionality. One fix is to use nonparametric kernel density estimators with adaptive bandwidths.
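
Neither scipy nor scikit-learn provides an adaptive-bandwidth estimator directly, so the following is a rough one-dimensional sketch of a sample-point adaptive kernel density estimate (Abramson-style), where points in low-density regions get wider kernels; the pilot estimator and the alpha exponent are conventional choices, not prescriptions from the article:

```python
import numpy as np
from scipy.stats import gaussian_kde

def adaptive_kde(samples, grid, alpha=0.5):
    """Sample-point adaptive KDE: wider kernels where the pilot density is low."""
    pilot = gaussian_kde(samples)                  # fixed-bandwidth pilot estimate
    f_pilot = pilot(samples)
    g = np.exp(np.mean(np.log(f_pilot)))           # geometric mean of pilot values
    h = pilot.factor * samples.std(ddof=1)         # global bandwidth of the pilot
    h_i = h * (f_pilot / g) ** (-alpha)            # per-point local bandwidths

    # Average one Gaussian kernel per sample, each with its own bandwidth.
    diff = (grid[:, None] - samples[None, :]) / h_i[None, :]
    kernels = np.exp(-0.5 * diff**2) / (np.sqrt(2 * np.pi) * h_i[None, :])
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0, 0.2, 500), rng.normal(5, 2.0, 500)])
grid = np.linspace(-2, 12, 200)
print(adaptive_kde(samples, grid)[:5])
```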

8. Naive Bayes
  Used, for example, in fraud detection, spam detection, and scoring. It assumes the variables are independent and fails miserably when they are not. In fraud and spam detection, the variables (sometimes called rules) are highly correlated. One fix is to group the variables into independent clusters, each containing highly correlated variables, and then apply Naive Bayes to the clusters, or to use data reduction techniques. Bad text mining techniques (e.g., basic "word" rules in spam detection) combined with Naive Bayes produce dire results, with many false positives and false negatives.
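
A sketch of the grouping/data-reduction fix, assuming scikit-learn: agglomerate correlated features into a small number of clusters, pool each cluster into a single feature, and run Naive Bayes on the pooled features (the cluster count and mean pooling are illustrative assumptions):

```python
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic data with many redundant (hence highly correlated) features.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=5,
                           n_redundant=25, random_state=0)

plain_nb = GaussianNB()

# Group correlated features into 8 clusters, average each cluster,
# then apply Naive Bayes to the pooled cluster features.
clustered_nb = make_pipeline(FeatureAgglomeration(n_clusters=8), GaussianNB())

print("plain Naive Bayes    :", cross_val_score(plain_nb, X, y, cv=5).mean())
print("clustered Naive Bayes:", cross_val_score(clustered_nb, X, y, cv=5).mean())
```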

  The reasons why these flawed models are still so widely used are discussed in the full article: http://click.aliyun.com/m/23305/
