Missing data, abnormal data, and data standardization methods

Missing data

I. Causes of missing values

There are many causes of missing values, and they can be divided into mechanical causes and human causes. Mechanical causes are failures of data collection or storage equipment, such as a storage failure or damaged memory that prevents data from being collected over a period of time (in the case of time-series data collection). Human causes are missing values produced by subjective mistakes, historical limitations, or deliberate concealment — for example, respondents in a market survey who decline to answer a question, invalid answers, or data-entry staff who mistakenly omit records.

II. Types of missing values

In terms of the distribution of the missingness, missing values can be classified as missing completely at random, missing at random, and missing not at random. Missing completely at random (MCAR) means the missingness is purely random: it does not depend on any complete variable or on the incomplete variable itself. Missing at random (MAR) means the missingness is not completely random: it depends on other, completely observed variables. Missing not at random (MNAR) means the missingness depends on the value of the incomplete variable itself.
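To make the three mechanisms concrete, here is a minimal simulation sketch (the variable names and probabilities are illustrative, not from the source) in which the missingness of y is respectively independent of everything (MCAR), driven by the observed x (MAR), and driven by the unobserved y itself (MNAR):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)            # fully observed covariate
y = 2 * x + rng.normal(size=n)    # variable that will receive missing values

df = pd.DataFrame({"x": x, "y": y})

# MCAR: missingness is independent of everything (a fixed 20% chance).
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "y"] = np.nan

# MAR: missingness depends only on the observed variable x.
mar = df.copy()
mar.loc[rng.random(n) < 1 / (1 + np.exp(-x)), "y"] = np.nan

# MNAR: missingness depends on the (unobserved) value of y itself.
mnar = df.copy()
mnar.loc[rng.random(n) < 1 / (1 + np.exp(-y)), "y"] = np.nan

print(mcar["y"].isna().mean(), mar["y"].isna().mean(), mnar["y"].isna().mean())
```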

From the perspective of the attributes involved, if all missing values belong to the same attribute, the missingness is called single-value missing; if the missing values belong to different attributes, it is called arbitrary missing. In addition, for time-series data there may be values missing over a stretch of time, which is called monotone missing.

III. Handling missing values

Overall, methods for handling missing values fall into two groups: deleting the cases that contain missing values, and imputing the missing values. For subjective data, imputation affects the authenticity of the data: other samples cannot guarantee the true value of the missing attribute, so imputation that depends on those attribute values is unreliable, and imputation is therefore generally not recommended for subjective data. Imputation is mainly used for objective data, where its reliability is better assured.

1. Deleting cases that contain missing values

The main options are simple deletion and weighting. Simple deletion is the most primitive way to handle missing values: it deletes every case that contains a missing value. If the analysis goal can be achieved by deleting only a small share of the sample, this method is the most effective. When the missingness is not completely at random, the bias introduced by deletion can be reduced by weighting the complete cases. After the incomplete cases have been marked, the complete cases are given different weights, which can be obtained from a logistic or probit regression. If the explanatory variables that determine the weights are identified correctly, this method can effectively reduce the bias; if the explanatory variables are unrelated to the weights, it does not reduce the bias. When several attributes are missing, different weights have to be assigned for the different combinations of missing attributes, which greatly increases the computational difficulty and reduces the accuracy of the estimates; in that case the weighting method is not ideal. A minimal sketch of both ideas follows below.
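Assuming made-up column names and data, this sketch combines listwise deletion with inverse-probability weights obtained from a logistic regression of "case is complete" on the observed variable (one common way to realize the weighting idea):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy data: 'income' has missing values, 'age' is fully observed
df = pd.DataFrame({
    "age": [23, 35, 41, 52, 60, 28, 47, 33],
    "income": [30, np.nan, 55, 80, np.nan, 35, 62, np.nan],
})

# Simple deletion: drop every case with any missing value (listwise deletion).
complete = df.dropna()

# Weighting: model the probability that a case is complete from the observed
# explanatory variable, then weight complete cases by the inverse of that
# probability to reduce the bias of listwise deletion.
df["complete"] = df["income"].notna().astype(int)
propensity = LogisticRegression().fit(df[["age"]], df["complete"])
p_complete = propensity.predict_proba(df[["age"]])[:, 1]
weights = 1.0 / p_complete[df["complete"] == 1]

weighted_mean_income = np.average(complete["income"], weights=weights)
print(weighted_mean_income)
```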

2. Imputing missing values with possible values

The idea is that imputing missing values with the most likely value loses less information than deleting all the incomplete samples. In data mining one usually faces a large database whose attributes may number in the dozens or even hundreds; discarding many other attribute values because a single attribute value is missing is an enormous waste of information. This is why methods that impute missing values with possible values have been developed. Several methods are commonly used.

(1) Mean imputation. Attributes are divided into interval-scaled (numeric) and non-interval-scaled types. If the missing value belongs to an interval-scaled attribute, it is imputed with the mean of the existing values of that attribute; if it belongs to a non-interval-scaled attribute, then, following statistical principles, it is imputed with the mode of that attribute (the value with the highest frequency of occurrence).
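A minimal pandas sketch of mean and mode imputation (the column names and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [170, np.nan, 165, 180, np.nan],   # interval-scaled attribute
    "city":   ["BJ", "SH", np.nan, "BJ", "BJ"],  # non-interval (categorical) attribute
})

# Interval-scaled attribute: fill with the mean of the observed values.
df["height"] = df["height"].fillna(df["height"].mean())

# Categorical attribute: fill with the mode (most frequent observed value).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```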

(2) Same-class mean imputation. This method belongs to the same family as single mean imputation, except that it first predicts the class of the variable with missing values using a hierarchical clustering model and then imputes with the mean of that class. Suppose X = (X1, X2, ..., Xp) are the complete variables and Y is the variable with missing values. X, or a subset of X, is first clustered, and each missing value is then imputed with the mean of the class it belongs to. If the explanatory variables X and Y are later introduced into the same statistical analysis, this imputation method will introduce autocorrelation into the model and hamper the analysis.
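A simplified sketch of the same-class mean idea, assuming the class labels already exist as a grouping column (in the text they would come from clustering the complete variables X):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],   # stand-in for the cluster label
    "income": [30, np.nan, 34, 70, 75, np.nan],
})

# Class-mean imputation: each missing value is replaced by the mean of its own
# class rather than by the global mean.
group_mean = df.groupby("group")["income"].transform("mean")
df["income"] = df["income"].fillna(group_mean)

print(df)
```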

(3) Maximum likelihood estimation (Maximum Likelihood, ML). Under the condition that the data are missing at random and that the model assumed for the complete data is correct, the unknown parameters can be estimated by maximum likelihood from the marginal distribution of the observed data (Little and Rubin). This approach is also called maximum likelihood estimation that ignores the missing values; in practice the parameters are usually computed with the expectation-maximization (EM) algorithm. The method is more attractive than deletion and than single-value imputation, but it has an important premise: it requires a large sample. A sufficient sample size ensures that the ML estimates are unbiased and asymptotically normally distributed. However, the method may get stuck in a local optimum, its convergence may be slow, and the computation is complex.
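As a toy illustration of the ML/EM idea (not the source's implementation), the following sketch runs EM for a bivariate normal in which the second variable has missing values: the E-step computes expected sufficient statistics from the conditional distribution given the current parameters, and the M-step re-estimates the mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
y1 = rng.normal(5, 2, n)
y2 = 1.0 + 0.8 * y1 + rng.normal(0, 1, n)
y2[rng.random(n) < 0.3] = np.nan            # MAR-style missingness in y2
obs = ~np.isnan(y2)

# initial estimates from complete cases
mu = np.array([y1.mean(), y2[obs].mean()])
cov = np.cov(y1[obs], y2[obs])

for _ in range(100):
    # E-step: expected y2 and y2^2 for missing cases, given y1 and current parameters
    beta = cov[0, 1] / cov[0, 0]
    cond_var = cov[1, 1] - cov[0, 1] ** 2 / cov[0, 0]
    e_y2 = np.where(obs, y2, mu[1] + beta * (y1 - mu[0]))
    e_y2sq = np.where(obs, y2 ** 2, e_y2 ** 2 + cond_var)

    # M-step: update mean and covariance from the expected sufficient statistics
    mu = np.array([y1.mean(), e_y2.mean()])
    s11 = np.mean((y1 - mu[0]) ** 2)
    s12 = np.mean(y1 * e_y2) - mu[0] * mu[1]
    s22 = np.mean(e_y2sq) - mu[1] ** 2
    cov = np.array([[s11, s12], [s12, s22]])

print(mu)
print(cov)
```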

(4) Multiple imputation (Multiple Imputation, MI). The idea of multiple imputation comes from Bayesian estimation: the value to be imputed is regarded as random, and it is drawn from the values that have already been observed. In practice, the values to be imputed are usually estimated first, and different noise is then added to form several candidate sets of imputed values; on the basis of some criterion, the most appropriate imputation is selected.

Multiple imputation proceeds in three steps: ① For each missing value, generate a set of possible imputed values that reflect the uncertainty of the non-response model; impute the missing values with each set of values, producing several complete data sets. ② Analyze each imputed data set with the statistical method intended for complete data. ③ Combine the results from the different imputed data sets and, according to a scoring function, produce the final imputed values.
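One hedged way to realize these steps is scikit-learn's IterativeImputer with sample_posterior=True, which draws stochastic imputations in the spirit of multiple imputation (MICE); the data below are made up, and the pooling shown is only the simplest average of the per-data-set analyses:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X[:, 2] += 0.5 * X[:, 0] - 0.3 * X[:, 1]
X[rng.random(200) < 0.2, 2] = np.nan         # introduce missing values in the third column

m = 5                                        # number of imputed data sets
estimates = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = imputer.fit_transform(X)     # one stochastic completed data set
    estimates.append(completed[:, 2].mean()) # analyze each completed data set

# Pool the m analyses (here simply their mean and spread).
print(np.mean(estimates), np.std(estimates))
```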

Suppose a data set contains three variables Y1, Y2, Y3 whose joint distribution is normal, and that it is split into three groups: group A keeps the original data, group B is missing only Y3, and group C is missing both Y1 and Y2. With multiple imputation, group A receives no treatment; for group B a set of estimated values of Y3 is produced (by regressing Y3 on Y1 and Y2); for group C a set of estimated values of Y1 and Y2 is produced (by regressing Y1 and Y2 on Y3).

During multiple imputation, group A is not processed. For groups B and C, m random sub-samples are drawn from the complete cases (m being the chosen number of imputations); each sub-sample only needs enough cases to estimate the parameters effectively. The distribution of the attribute with missing values is estimated, and from these m sets of observations m sets of parameter estimates are produced, each giving the corresponding predictions. The estimation method used here is maximum likelihood, implemented on the computer with the expectation-maximization (EM) algorithm. For group B a set of estimates of Y3 is produced; for group C, using the premise that the joint distribution of Y1, Y2, Y3 is normal, a set of estimates of (Y1, Y2) is produced.

This example assumes that the joint distribution of Y1, Y2, Y3 is normal. The assumption is artificial, but it has been verified (Graham and Schafer, 1999) that even when the joint distribution of the variables is non-normal, estimates made under this assumption can still come very close to the true values.

Multiple imputation and Bayesian estimation share the same underlying idea, but multiple imputation makes up for several shortcomings of Bayesian estimation.

(1) Bayesian estimation relies on maximum likelihood, which requires the form of the model to be specified correctly; if the parametric form is wrong, the conclusions will be wrong, i.e. the prior distribution will affect the accuracy of the posterior distribution. Multiple imputation, by contrast, rests on the large-sample asymptotic theory for complete data; since the amount of data in data mining is large, the prior distribution has very little influence on the results.

(2) Bayesian estimation only requires the prior distribution of the unknown parameters to be known and does not exploit the relationships among parameters. Multiple imputation estimates the joint distribution of the parameters and makes use of the relationships among them.

The four imputation methods above all work reasonably well for numeric missing data. The two mean imputation methods are the easiest to implement and were widely used in the past, but they distort the sample distribution considerably; in particular, when imputed values are used as explanatory variables in a regression, the parameter estimates can deviate greatly from the true values. By comparison, maximum likelihood estimation and multiple imputation are the two better methods; relative to multiple imputation, maximum likelihood lacks an element of uncertainty, so more and more people tend to use multiple imputation.
3. Duplicate value detection

IV. Summary

Imputing unknown values only produces our subjective estimates, which may not fully accord with the objective facts. The analysis above is theoretical: because missing values themselves cannot be observed, there is no way to know their missingness type, and hence no way to measure the true effect of any imputation method. In addition, these methods are generic to all fields; being universal, their results may not be ideal for a particular professional domain. For this very reason, many professional data miners impute missing values manually based on their understanding of the industry, and the effect may well be better than these methods. Imputing missing values is a human intervention used during data mining so as not to throw away a large amount of information; whichever method is chosen, it affects the relationships among the variables, and in filling in the incomplete information we more or less change the original information in the data, with potential impact on later analysis, so missing values must be handled with care.

Data anomalies

In data analysis, the raw data we face are often dirty data, and outliers are one kind of dirty data. So when we do data analysis we must handle the abnormal values in the data — but do you know how to clean outliers? Here we explain how to deal with outliers in data cleaning.

First we need an understanding of outliers. In general, abnormal values are often called "outliers". There are many commonly used methods for handling outliers: the first is simple statistical analysis, the second uses the 3σ principle, the third is box plot analysis, the fourth is model-based detection, the fifth is distance-based detection, the sixth is density-based detection, and the seventh is clustering-based detection. Below we introduce these methods one by one.

First, simple statistical analysis. When we get the data we can run a simple descriptive statistical analysis on it; for example, the minimum and maximum can be used to judge whether a variable's values exceed a reasonable range, and unreasonable values are treated as outliers.
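A minimal sketch of such a range check with pandas (the variable and the limits are illustrative):

```python
import pandas as pd

ages = pd.Series([23, 35, 41, 250, 60, -3, 47])   # made-up data with impossible values

print(ages.describe())                            # min, max, mean and quartiles at a glance

# flag values outside a plausible range for this variable
print(ages[(ages < 0) | (ages > 120)])
```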

The second is the 3σ principle. Under the 3σ principle, if the data follow a normal distribution, an abnormal value is a value whose deviation from the mean exceeds three standard deviations. For normally distributed data, the probability of a value lying more than 3σ from the mean is P(|x − μ| > 3σ) ≤ 0.003, so such values are very rare small-probability events. If the data do not follow a normal distribution, outliers can still be described by how many standard deviations they lie from the mean.
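A minimal sketch of the 3σ rule on made-up data:

```python
import numpy as np

x = np.random.default_rng(3).normal(50, 5, 1000)
x[:3] = [100, -20, 95]                     # inject a few extreme values

mu, sigma = x.mean(), x.std()
outliers = x[np.abs(x - mu) > 3 * sigma]   # 3-sigma rule: |x - mu| > 3*sigma
print(outliers)
```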

The third is box plot analysis. In general, a box plot provides a criterion for identifying outliers: a value is called an outlier if it is smaller than QL − 1.5·IQR or larger than QU + 1.5·IQR. QL is the lower quartile, meaning that one quarter of all observations are smaller than it; QU is the upper quartile, meaning that one quarter of all observations are larger than it; IQR is the interquartile range, the difference between the upper quartile QU and the lower quartile QL, which contains half of all observations. The box plot's judgement of outliers is based on the quartiles and the interquartile range, which are robust: up to 25% of the data can be moved arbitrarily far without disturbing the quartiles, so outliers cannot exert influence on this criterion. Accordingly, the box plot identifies outliers relatively objectively and has certain advantages in identifying them.
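A minimal sketch of the box plot (IQR) rule on made-up data:

```python
import numpy as np

x = np.random.default_rng(4).normal(0, 1, 1000)
x[:3] = [8, -7, 9]                                  # inject a few extreme values

q1, q3 = np.percentile(x, [25, 75])                 # lower (QL) and upper (QU) quartiles
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)
```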

In this article we have introduced some related methods of data cleansing. Through this introduction we find that these methods are all classic ones; for reasons of space we stop here, and in later articles we will continue to introduce more data-cleansing methods.

Author: CDA Data Analyst Training. Link: https://www.jianshu.com/p/8692df30766e. Source: Jianshu.
Copyright belongs to the author. For commercial reprints please contact the author for authorization; for non-commercial reprints please cite the source.

Below is a brief summary of missing data, abnormal data, data conflicts, excessive data dimensionality, and data standardization methods.

Missing data:
Missing data arise mainly from human factors or equipment failures during data collection, from private data that are not disclosed, and furthermore from inapplicability (NA). In terms of the distribution of the missingness, missing values can be classified as missing completely at random, missing at random, and missing not at random. Handling missing values falls, overall, into deleting the cases with missing values and imputing the missing values. Deletion mainly covers the simple deletion method and the weighting method. Imputation is aimed mainly at objective data, whose reliability is better assured. Commonly used methods include: 1) mean imputation; 2) same-class mean imputation; 3) maximum likelihood estimation (Maximum Likelihood, ML); 4) multiple imputation (Multiple Imputation, MI).
Abnormal data:
In general, abnormal values are usually called "outliers". There are many commonly used methods for handling outliers: the first is simple statistical analysis; the second uses the 3σ principle; the third is box plot analysis; the fourth is model-based detection; the fifth is distance-based detection (the LOF method computes a relative, density-based distance — the larger the value, the greater the probability that the point is an outlier); the sixth is density-based detection; the seventh is clustering-based detection; the eighth is based on association rules: association rules with high confidence and support define another kind of pattern.
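As a sketch of the distance/density-based LOF idea mentioned above, scikit-learn's LocalOutlierFactor can be used as follows (the data are made up):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (200, 2)),            # dense cluster
               np.array([[6.0, 6.0], [7.0, -5.0]])])  # two isolated points

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 marks points flagged as outliers
scores = -lof.negative_outlier_factor_   # larger score -> more outlying

print(X[labels == -1])
print(scores[labels == -1])
```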
Duplicate value detection:
This must be analyzed concretely for the specific data of different fields and environments. To eliminate duplicate records in two data sets or in a merged data set, it is first necessary to detect the duplicate records that identify the same real-world entity, i.e. the matching process. Duplicate record detection algorithms include: the basic field matching algorithm, the recursive field matching algorithm, the Smith-Waterman algorithm, and the cosine similarity function. A small sketch using cosine similarity follows.
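A minimal sketch of the cosine-similarity idea for flagging candidate duplicate records, using character n-gram TF-IDF vectors (the records and the threshold are illustrative choices):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

records = [
    "ACME Corporation, 12 Main Street, Springfield",
    "ACME Corp., 12 Main St, Springfield",
    "Globex Industries, 99 Ocean Avenue, Shelbyville",
]

# Character n-gram TF-IDF vectors make the cosine similarity tolerant to
# small spelling/abbreviation differences between records.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(records)
sim = cosine_similarity(vectors)

threshold = 0.7                     # tuning choice; pairs above it are candidate duplicates
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if sim[i, j] > threshold:
            print(f"possible duplicate pair ({i}, {j}): similarity {sim[i, j]:.2f}")
```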
Data conflicts:
(1) For columns that contain a large number of null values, compute the proportion of nulls in each column and use it as the basis for deciding whether the column should be removed (see the sketch after this list).
(2) For columns that are almost single-valued, compute the number of distinct values in each column and use this information to decide whether to remove columns that appear to carry no information.
(3) For columns containing records outside the normal range (outliers), count the number of out-of-range values in each column, mark the rows where they occur, and decide how to handle them.
(4) Rows that do not satisfy a certain format can be converted to the correct format by format conversion; ideally the formats are unified in the data warehouse.
(5) For columns of the same record that are meaningless to compare directly, compute the proportions of the relations (greater than, less than, equal to) between the columns across rows, examine the relations with the smaller counts, and then decide how to handle them according to the meaning of the columns.
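A minimal profiling sketch covering checks (1)–(3), with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "id":     [1, 2, 3, 4, 5],
    "email":  [None, None, None, "a@x.com", None],
    "status": ["active"] * 5,
    "age":    [25, 31, 29, 430, 27],
})

# (1) share of null values per column
print(df.isna().mean())

# (2) number of distinct values per column (single-valued columns carry no information)
print(df.nunique())

# (3) count of values outside a plausible range for a numeric column
print(((df["age"] < 0) | (df["age"] > 120)).sum())
```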
Excessive data dimensionality:
Solution: dimensionality reduction (see the sketch below).
(1) Principal component analysis
(2) Random forest
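A minimal sketch of both reduction approaches on a standard example data set: a PCA projection, and random forest feature importances used to rank features.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# (1) PCA: project the data onto the first two principal components
X_pca = PCA(n_components=2).fit_transform(X)
print(X_pca.shape)

# (2) Random forest: rank features by importance and keep only the strongest ones
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)
```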
Data standardization:
Min-max normalization, also called deviation standardization, is a linear transformation of the original data that maps the result into the [0, 1] range: x' = (x − min) / (max − min).
z-score standardization rescales the data using the mean and standard deviation of the raw data: an original value x of attribute A is standardized to x' = (x − mean) / standard deviation. z-score standardization is suitable when the minimum and maximum of attribute A are unknown, or when there are outliers beyond the observed range. The default standardization method in SPSS is z-score standardization.
Commonly listed methods therefore include: 1) normalization; 2) min-max normalization; 3) z-score (zero-mean) normalization; 4) decimal scaling normalization. A minimal scaling sketch follows.
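A minimal scaling sketch with scikit-learn (made-up values):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])

# min-max normalization: x' = (x - min) / (max - min), mapped into [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# z-score standardization: x' = (x - mean) / std
print(StandardScaler().fit_transform(X).ravel())
```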


Source: blog.csdn.net/w47478/article/details/104874580