[Data analysis] An important step: how to handle missing values

Reprinted Source: https://blog.csdn.net/Q2605894893/article/details/81327027

Table of Contents

1 Reasons for missing data

2 Types of missing data

3 Methods for handling missing data

1. Delete records

2. Fill in data

3. No treatment

4 Summary


 

1 Reasons for missing data

First, we should ask: why is the data missing? Missing data cannot be avoided, and it has many possible causes. The author groups them into the following three categories:

  • Unintentional: the information was lost by accident, for example because staff overlooked or forgot to record it, or because data acquisition failed, for example through equipment faults, or because tight real-time requirements left the machine too little time to make a judgment or decision;
  • Intentional: some data sets specify in their feature descriptions that a missing value counts as a feature value in its own right, so the missing value can be treated as a special feature value;
  • Does not exist: some feature attributes simply do not exist for a record. For example, the spouse's name cannot be filled in for an unmarried person, and income cannot be filled in for a child;

In short, we need to be clear about why a value is missing: was it an unintentional omission, was it deliberate, or does the value simply not exist? Only when we know the cause can we choose the appropriate treatment.

Missing values call for case-by-case analysis. Why? Because a missing attribute does not always mean that data is missing. The missingness itself carries information, so the value should be filled sensibly according to the information a missing value may carry in each application scenario. The following examples illustrate this kind of analysis. Opinions will differ, so they are for reference only:

  1. "Annual income": fill with the average in a product recommendation scenario; fill with the minimum in a loan amount scenario;

  2. "Time of behavior": fill with the mode;

  3. "Price": fill with the minimum in a product recommendation scenario; fill with the average in a product matching scenario;

  4. "Human lifespan": fill with the maximum in an insurance cost estimation scenario; fill with the average in a population estimation scenario;

  5. "Driving experience": a user who left this blank may not own a car, so filling with 0 is reasonable;

  6. "Graduation date": a user who left this blank may not have attended university, so filling with positive infinity is reasonable;

  7. "Marital status": a user who left this blank may be sensitive about privacy, so the blank should be its own category, for example married 1, unmarried 0, not filled -1.
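As a minimal sketch of examples 5 to 7, the scenario-specific fill values can be passed to pandas as a per-column dictionary. The table, the column names, and the numeric coding below are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical user table; column names and values are made up for illustration.
users = pd.DataFrame({
    "driving_years": [3.0, np.nan, 10.0],
    "graduation_year": [2015.0, np.nan, 2009.0],
    "marital_status": [1.0, 0.0, np.nan],   # 1 = married, 0 = unmarried
})

# Each column gets the fill value that matches what "missing" most likely means.
fill_values = {
    "driving_years": 0,          # no answer: probably no car
    "graduation_year": np.inf,   # no answer: probably did not attend university
    "marital_status": -1,        # no answer: kept as its own category
}
filled = users.fillna(value=fill_values)
print(filled)
```

The point of the dictionary is that no single global fill value is used: each feature is treated according to its own meaning.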

 

2 Types of missing data

Before processing missing data, it is essential to understand the mechanisms and forms of missingness. A variable in the data set that has no missing values is called a complete variable; a variable that has missing values is called an incomplete variable. By the distribution of the missingness, missing data can be divided into missing completely at random, missing at random, and missing not at random.

  • Missing completely at random (MCAR): the missingness is completely random. It does not depend on any incomplete or complete variable and does not bias the sample, for example a missing home address;
  • Missing at random (MAR): the missingness is not completely random. It depends on other, complete variables, for example missing financial data being related to the size of the company;
  • Missing not at random (MNAR): the missingness depends on the value of the incomplete variable itself, for example high-income people being unwilling to report their household income;

For data that is missing at random or missing not at random, deleting records is not appropriate, for the reasons given above. Values missing at random can be estimated from the known variables, whereas there is not yet a good solution for values missing not at random.
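To make the three mechanisms concrete, the sketch below simulates each one on synthetic data and compares the mean of the values that remain. The variables `size` and `revenue`, the 30% missing rate, and the thresholds are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
size = rng.normal(size=n)                                # fully observed variable
revenue = 50 + 10 * size + rng.normal(scale=5, size=n)   # variable that will lose values

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3
# MAR: missingness depends only on the observed variable `size`.
mar_mask = size < np.quantile(size, 0.3)
# MNAR: missingness depends on the value that goes missing.
mnar_mask = revenue > np.quantile(revenue, 0.7)

def observed_mean(values, mask):
    """Mean of the values that remain after the masked ones go missing."""
    return values[~mask].mean()

full_mean = revenue.mean()
mcar_mean = observed_mean(revenue, mcar_mask)
mar_mean = observed_mean(revenue, mar_mask)
mnar_mean = observed_mean(revenue, mnar_mask)
print(full_mean, mcar_mean, mar_mean, mnar_mean)
```

Under MCAR the observed mean stays close to the full mean, while under MAR and MNAR simply dropping the incomplete records shifts it. In the MAR case the shift could be corrected using `size`; in the MNAR case the information needed for the correction is exactly what is missing.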

 

3 Methods for handling missing data

There are three ways to deal with missing values: deleting records, filling in data, and leaving the values untreated.

1. Delete records

Advantages:

  • The simplest and most direct approach;

Disadvantages:

  • Completeness is bought by shrinking the historical data, which may throw away a lot of important information hidden in the deleted records;
  • When the proportion of missing data is large, and especially when the missing data is not randomly distributed, deleting records can distort the data, for example turning a distribution that was normal into a non-normal one;
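A minimal sketch of record deletion with pandas, on a made-up table. It is worth checking how much data survives before committing to this approach.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with gaps in every column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan, 52],
    "income": [3000, 4200, np.nan, 8000, 5100, 9500],
    "city": ["A", "B", "B", None, "A", "C"],
})

missing_ratio = df.isna().mean()               # share of missing values per column
complete_rows = df.dropna()                    # drop every row with any missing value
income_rows = df.dropna(subset=["income"])     # drop only rows missing `income`
kept_fraction = len(complete_rows) / len(df)

print(missing_ratio)
print(f"kept {kept_fraction:.0%} of the rows")
```

Although no column here is missing more than a third of its values, only two of the six rows are complete, which shows how quickly deletion eats into the data.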

 

2. Fill in data

Imputation of missing values can be roughly divided into three types: replacing missing values, fitting missing values, and dummy variables. Replacement fills a missing value from similar non-missing data; the core idea is to find what members of the same group have in common. Fitting fills the value by building a model on the other features. Dummy variables derive a new variable to stand in for the missing values.

Replacing missing values

  • Mean imputation:

For categorical (nominal) data: fill with the mode. For example, if a school has 500 boys and 50 girls, the remaining missing values are filled with the more frequent value, boy.

For quantitative (ratio) data: fill with the mean or the median. For example, for the height feature of a class of students, a missing height can be filled with the mean or median height of the whole class. If the feature is roughly normally distributed, the mean works better; when outliers make the distribution non-normal, the median works better.

Note: this method is simple but not accurate. It may introduce noise or change the original distribution of the feature.
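The following sketch applies mode, mean, and median filling with pandas. The student table is invented, and it includes one deliberate outlier to show why the median can be the safer choice.

```python
import numpy as np
import pandas as pd

# Hypothetical class roster; 250 cm is a deliberate outlier.
students = pd.DataFrame({
    "gender": ["M", "M", None, "F", "M", None],
    "height_cm": [170.0, 165.0, np.nan, 160.0, 250.0, np.nan],
})

# Categorical feature: fill with the mode (most frequent value).
gender_mode = students["gender"].mode()[0]
students["gender_filled"] = students["gender"].fillna(gender_mode)

# Quantitative feature: the outlier pulls the mean upward, the median much less.
height_mean = students["height_cm"].mean()
height_median = students["height_cm"].median()
students["height_mean_filled"] = students["height_cm"].fillna(height_mean)
students["height_median_filled"] = students["height_cm"].fillna(height_median)

print(students)
```

Here the mean of the observed heights is 186.25 cm and the median is 167.5 cm, so mean filling would assign an implausibly tall value to both students with missing heights.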

 

  • Hot deck imputation:

Hot deck imputation finds, among the complete records, the object most similar to the one with the missing value, and fills the gap with that object's value. Usually more than one similar object is found and none of them is clearly the best match, so one is picked at random to supply the fill value. The key difficulty is that different problems may need different criteria for judging similarity, and those criteria have to be defined. The method is conceptually simple and uses the relationships in the data to estimate the missing value, but its drawback is that a similarity criterion is hard to define and involves many subjective factors.
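A minimal sketch of the idea follows. The similarity criterion used here (Euclidean distance on the other columns, with `k` nearest donors) is only one possible choice, and the function name `hot_deck_impute` is hypothetical. The sketch also assumes that the other columns contain no missing values.

```python
import numpy as np

def hot_deck_impute(X, target_col, k=3, seed=0):
    """Fill NaNs in X[:, target_col] with the value of a randomly chosen donor
    among the k most similar complete rows (Euclidean distance on other columns)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    other_cols = [c for c in range(X.shape[1]) if c != target_col]
    missing = np.isnan(X[:, target_col])
    donors = X[~missing]                      # rows where the target is observed
    for i in np.flatnonzero(missing):
        dist = np.linalg.norm(donors[:, other_cols] - X[i, other_cols], axis=1)
        nearest = np.argsort(dist)[:k]        # indices of the k closest donors
        X[i, target_col] = donors[rng.choice(nearest), target_col]
    return X

data = [
    [1.0, 10.0],
    [1.1, 11.0],
    [0.9, 9.0],
    [5.0, 50.0],
    [5.2, 52.0],
    [1.05, np.nan],   # record to fill; its neighbours are the first three rows
]
filled = hot_deck_impute(data, target_col=1, k=3)
print(filled[5])
```

The filled value is copied from one of the three nearest donors, so it is always a value that actually occurs in the data, unlike a mean.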

  • K-means clustering:

Another approach uses clustering, an unsupervised machine learning method. K-means groups all samples into clusters, and each missing value is then filled with the mean of the cluster its sample belongs to. In essence, this still fills missing values by looking for similar records.

Note: the accuracy of the filled values depends on the quality of the clustering, and clustering results are highly variable, usually depending on the choice of initial points. As the figure in the original post shows, individual feature values within a cluster can also differ widely, so this method should be used with caution.
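A sketch of cluster-mean filling using scikit-learn's `KMeans` on synthetic data. It assumes the clustering can be done on columns that are fully observed, and that the number of clusters (two here) is known in advance; neither assumption comes from the original text.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic groups; columns 0-1 are complete, column 2 has gaps.
group_a = rng.normal(loc=[0, 0, 10], scale=0.5, size=(50, 3))
group_b = rng.normal(loc=[5, 5, 20], scale=0.5, size=(50, 3))
X = np.vstack([group_a, group_b])
X[[3, 60], 2] = np.nan                      # knock out one value in each group

# Cluster on the complete columns only (an assumption of this sketch).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, :2])

X_filled = X.copy()
for cluster in np.unique(labels):
    in_cluster = labels == cluster
    cluster_mean = np.nanmean(X[in_cluster, 2])   # mean of observed values in the cluster
    gaps = in_cluster & np.isnan(X[:, 2])
    X_filled[gaps, 2] = cluster_mean

print(X_filled[[3, 60], 2])
```

With groups this clearly separated the fill values land near 10 and 20. On real data the clusters are rarely this clean, which is the reason for the caution in the note above.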

 

Fitting missing values

Fitting uses the other variables as inputs to a model that predicts the missing variable. It is the same as ordinary modeling, except that the target variable is the one with missing values.

Note: if the other variables are unrelated to the missing variable, the predictions are meaningless. If the predictions are very accurate, the variable did not need to be predicted at all, because it must duplicate information already in the other features. In general, results are best somewhere between these two extremes. If forcing the imputation introduces autocorrelation, it will create obstacles for later analysis.

There are many ways to predict a missing variable with a model; only a few are described briefly here.

  • Regression prediction:

As in the house-price prediction project covered earlier (Data analysis in practice: Beijing second-hand housing price analysis (modeling part)), a regression equation is built on the complete data. For a record with a missing feature value, the known feature values are plugged into the model to estimate the unknown one, and the estimate is used as the fill value (see the example figure in the original post). There are of course many regression methods, which are not described in detail here.

When the missing values are continuous, that is, quantitative, regression can be used to predict them.
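A minimal sketch of regression imputation with ordinary least squares in NumPy. The features `area` and `price` and the linear relationship between them are synthetic assumptions, not data from the housing project mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
area = rng.uniform(40, 150, size=n)                       # fully observed feature
price = 3.0 * area + 20 + rng.normal(scale=5, size=n)     # feature that will have gaps
price_true = price.copy()
missing = rng.random(n) < 0.2
missing[0] = True                                         # ensure at least one gap
price[missing] = np.nan

# Fit price ~ intercept + slope * area on the complete rows only.
design = np.column_stack([np.ones(n), area])
coef, *_ = np.linalg.lstsq(design[~missing], price[~missing], rcond=None)

# Plug the known feature into the fitted equation to fill the gaps.
price_filled = price.copy()
price_filled[missing] = design[missing] @ coef

mean_abs_error = np.abs(price_filled[missing] - price_true[missing]).mean()
print(coef, mean_abs_error)
```

Because every filled value sits exactly on the regression line, the filled column has less variance than the real one. That is one of the shortcomings of single imputation.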
  • Maximum likelihood estimation:

When the data is missing at random and the model is assumed to be correct for the complete sample, the unknown parameters can be estimated by maximum likelihood from the marginal distribution of the observed data (Little and Rubin). This approach is also called maximum likelihood estimation ignoring the missing values. In practice, the maximum likelihood parameters are often computed with expectation maximization (EM). The method is more attractive than case deletion and single imputation, but it has an important precondition: it is suited to large samples. A sufficient number of valid samples ensures that the ML estimates are asymptotically unbiased and normally distributed. However, the method may get stuck in a local extremum, it converges slowly, its computation is complex, and it is limited to linear models.
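As an illustration, here is a small EM sketch for the simplest case: a bivariate normal where `x` is fully observed and `y` is missing at random given `x`. The function name `em_bivariate_normal`, the data, and the fixed iteration count are assumptions of this sketch, not a general implementation.

```python
import numpy as np

def em_bivariate_normal(x, y, n_iter=200):
    """EM estimates of the mean and covariance of (x, y) when y contains NaNs.
    Assumes x is fully observed and y is missing at random given x."""
    miss = np.isnan(y)
    # Start from the statistics of the complete cases.
    mu = np.array([x.mean(), y[~miss].mean()])
    cov = np.cov(x[~miss], y[~miss])
    for _ in range(n_iter):
        # E-step: expected y and y^2 for the missing entries, given x.
        slope = cov[0, 1] / cov[0, 0]
        resid_var = cov[1, 1] - slope * cov[0, 1]
        ey = np.where(miss, mu[1] + slope * (x - mu[0]), y)
        ey2 = np.where(miss, ey**2 + resid_var, y**2)
        # M-step: re-estimate parameters from the expected sufficient statistics.
        mu = np.array([x.mean(), ey.mean()])
        sxx = np.mean(x**2) - mu[0]**2
        sxy = np.mean(x * ey) - mu[0] * mu[1]
        syy = np.mean(ey2) - mu[1]**2
        cov = np.array([[sxx, sxy], [sxy, syy]])
    return mu, cov

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 2 + 1.5 * x + rng.normal(scale=0.5, size=n)   # true mean of y is 2
y_obs = y.copy()
y_obs[x > 0.5] = np.nan                           # missingness depends only on x (MAR)

mu, cov = em_bivariate_normal(x, y_obs)
complete_case_mean = np.nanmean(y_obs)
print(mu[1], complete_case_mean)
```

The complete-case mean of `y` is clearly too low because the records with large `x` were removed, while the EM estimate recovers a value close to the true mean without filling in any individual value.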

  • Multiple imputation:

The idea of multiple imputation comes from Bayesian estimation: the value to be imputed is treated as random, and its value is drawn from the observed values. In practice, the value to be imputed is usually estimated first, and different noise is then added to form several candidate sets of imputed values. The most appropriate imputed value is then chosen according to some selection criterion.

The fitting and replacement methods described above are all single imputation methods. Multiple imputation makes up for the shortcomings of single imputation: it does not try to estimate each missing value with one simulated value, but instead proposes a random sample of values for the missing data (these samples may combine the results of different fitted models). Carried out properly, this procedure reflects the uncertainty caused by the missing values, which keeps the statistical inference valid. Multiple imputation inference can be divided into the following three steps:

  • Generate a set of possible imputed values for each missing value; these values reflect the uncertainty of the nonresponse model;
  • Analyze each imputed data set with the statistical methods used for complete data sets;
  • Select among the results from the imputed data sets according to a scoring function to produce the final imputed value;
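The three steps above can be sketched as follows, under several simplifying assumptions. The imputation model is a plain linear regression with added residual noise, the complete-data analysis is just the mean of `y`, and the final step averages the m results (the usual pooling approach) rather than selecting one with a scoring function. A fuller implementation would also draw the regression coefficients at random for each imputation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 20                                   # sample size, number of imputations
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)           # true mean of y is 1
missing = x > 0.8                                # y is missing at random given x
obs = ~missing
y_obs = y.copy()
y_obs[missing] = np.nan

# Imputation model: regression of y on x, fitted on the complete cases.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design[obs], y_obs[obs], rcond=None)
resid_sd = np.std(y_obs[obs] - design[obs] @ coef, ddof=2)

estimates = []
for _ in range(m):
    # Step 1: impute prediction plus fresh noise, so each data set differs.
    y_imp = y_obs.copy()
    noise = rng.normal(scale=resid_sd, size=missing.sum())
    y_imp[missing] = design[missing] @ coef + noise
    # Step 2: run the complete-data analysis (here: the mean of y).
    estimates.append(y_imp.mean())

# Step 3: combine the m results (averaged in this sketch).
pooled_mean = np.mean(estimates)
between_var = np.var(estimates, ddof=1)          # spread caused by the missing values
print(pooled_mean, between_var)
```

The spread between the m estimates is what single imputation cannot provide: it indicates how much of the uncertainty in the result comes from the missing values.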

Depending on the missing-data mechanism, the missingness pattern, and the variable type, different methods can be used for the filling: regression, predictive mean matching (PMM), propensity score (PS), logistic regression, discriminant analysis, and Markov chain Monte Carlo (MCMC).

Suppose a data set contains three variables Y1, Y2, Y3 whose joint distribution is normal. The data is split into three groups: group A keeps the original data, group B is missing only Y3, and group C is missing Y1 and Y2. In multiple imputation, group A is left untouched; for group B a set of estimates of Y3 is generated (by regressing Y3 on Y1 and Y2); for group C a set of paired estimates of Y1 and Y2 is generated (by regressing Y1 and Y2 on Y3).

In multiple imputation, group A is not processed. For groups B and C, the complete samples are randomly drawn to form m groups (m being the chosen number of imputation sets); each group only needs enough cases to estimate the parameters effectively. The distribution of the attribute with missing values is estimated, and then, based on these m groups of observations, m sets of parameter estimates are produced, one per group, together with the corresponding predictions. The estimation method used here is maximum likelihood, and the concrete algorithm used on a computer is expectation maximization (EM). A set of Y3 values is estimated for group B; for group C, a set of (Y1, Y2) values is estimated under the premise that the joint distribution of Y1, Y2, Y3 is normal.

The example above assumes that the joint distribution of Y1, Y2, Y3 is normal. This assumption is artificial, but it has been verified (Graham and Schafer, 1999) that even when the variables are not jointly normal, estimates made under this assumption can still come very close to the true values.

NOTE: multiple imputation requires the data to be missing at random. The imputation is typically repeated 20 to 50 times to reach high accuracy, but the procedure is complex and computationally expensive.

  • Random Forests:

Another commonly used fitting method is the random forest, which top Kaggle competitors often use. The implementation is the same as ordinary modeling, except that the variable with missing values becomes the target variable.
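A sketch with scikit-learn's `RandomForestRegressor` on synthetic data. The features, the nonlinear target, and the hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
features = rng.uniform(0, 1, size=(n, 3))                  # fully observed columns
target = (np.sin(4 * features[:, 0]) + features[:, 1] ** 2
          + rng.normal(scale=0.05, size=n))                # column that will have gaps
target_true = target.copy()
missing = rng.random(n) < 0.2
missing[0] = True                                          # ensure at least one gap
target[missing] = np.nan

# Train on rows where the column is observed, predict the rows where it is not.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features[~missing], target[~missing])
target_filled = target.copy()
target_filled[missing] = model.predict(features[missing])

mean_abs_error = np.abs(target_filled[missing] - target_true[missing]).mean()
print(mean_abs_error)
```

Unlike the linear regression approach, the forest can pick up the nonlinear relationship here without it being specified in advance.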

 

Dummy variables

A dummy variable here is a variable derived from the missing values. A new binary feature is defined according to whether the feature value is missing. For example, if feature A contains missing values, we derive a new feature B: where A is missing, B is 1; where A is not missing, B is 0.
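In pandas this derivation is a one-liner; the column names A and B follow the example above and the values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.5, np.nan, 3.0, np.nan, 2.2]})

# B is 1 where A is missing and 0 where A is present.
df["B"] = df["A"].isna().astype(int)
print(df)
```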

 

3. No treatment

Imputation only fills unknown values with our subjective estimates, which may not match the objective facts. While filling in incomplete information, we also change the original information system to some degree. Moreover, filling null values incorrectly often introduces new noise into the data and causes mining tasks to produce wrong results. Therefore, in many cases we would still like to process the information system while keeping the original information unchanged.

In practice, some models cannot cope with missing values, so the missing values must be handled first. Other models can handle missing values themselves, and then the data needs no such processing, for example Xgboost, rfr and other advanced models.

 

4 Summary

In summary, most data mining preprocessing uses the more convenient ways of handling missing values, such as mean filling, but the results are not necessarily good. A suitable method has to be chosen for each situation, and there is no universal method that solves every problem. The choice of method needs to take several aspects into account:

  • The reasons the data is missing;
  • The type of missing data;
  • The sample size;


Origin blog.csdn.net/YYIverson/article/details/103388295