Data Mining Practice (1): Data Quality Check

There is a classic saying in the data industry: "Garbage in, garbage out" (GIGO). If the underlying data is flawed, any output based on it is worthless. For data analysis and mining, only high-quality base data can yield correct and useful conclusions. This article introduces the basic ideas and methods of data quality checking and implements them in Python.
Data quality checking is also an important topic in data governance and covers a wide range of content; given the author's limited experience, this article does not go into data governance and only addresses quality checks for analysis and mining.

The data quality check is carried out after the wide table has been developed, and mainly covers four aspects: duplicate value check, missing value check, data skew check, and outlier check.

1. Duplicate value check

1.1 What is a duplicate value

To check for duplicate values, the first thing is to be clear about what counts as a duplicate. For a dataset in the form of a two-dimensional table, duplicates can be defined at two levels:
① duplicate records on key fields, such as the main index field;
② duplicate records on all fields.
Whether the first level counts as duplication must be judged from the business meaning of the data. For example, if from a business perspective a user should have only one record in a table, then a user with more than one record is a duplicate. Records that are identical on all fields are always duplicates.

1.2 Reasons for Duplicate Values

Duplicate values arise mainly for two reasons: they come from the upstream source data, or they are introduced by joins in the data preparation script. From the data preparation side, first check whether the source tables used by the script contain duplicate records, and also check the correctness and rigor of the join statements, for example whether the join conditions are reasonable and whether the data period is restricted.
For example, SQL to check whether a source table contains duplicate keys per month (if COUNT(*) exceeds COUNT(DISTINCT USER_ID), the key field is duplicated):

SELECT MON_ID,COUNT(*),COUNT(DISTINCT USER_ID)
FROM TABLE_NAME
GROUP BY MON_ID;

If the upstream source data is duplicated, it should be reported upstream for correction in time; if the duplicates are caused by joins in the script, modify the script and regenerate the data.
In another case, the dataset is standalone rather than developed in the data warehouse: there is neither upstream source data nor a data preparation script, for example a public dataset. How should its duplicate values be handled? The usual approach is simply to delete them.

import pandas as pd
dataset = pd.read_excel("/labcenter/python/dataset.xlsx")
#check for duplicate records
dataset.duplicated()    #a record is a duplicate only if all fields repeat
dataset.duplicated(['col2'])    #a record is a duplicate if col2 repeats
#delete duplicate records
dataset.drop_duplicates()     #a record is a duplicate only if all fields repeat
dataset.drop_duplicates(['col2'])   #a record is a duplicate if col2 repeats
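
Before dropping anything, it can be helpful to count how many records are affected under each definition; a minimal sketch using the same dataset (col2 as the key field follows the example above):

#count how many records would be treated as duplicates under each definition
print(dataset.duplicated().sum())          #duplicates across all fields
print(dataset.duplicated(['col2']).sum())  #duplicates on the key field col2
#drop_duplicates keeps the first occurrence of each duplicated record by default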

Attachment: The dataset of this article: dataset.xlsx


2. Missing value checking

Missing values refer to records in the dataset in which some fields have no information.

2.1 Reasons for missing values

There are three main reasons for missing values:
① the upstream source system cannot fully capture the information for technical or cost reasons, for example a user's mobile APP browsing records;
② from a business perspective, the information does not exist in the first place, for example the income of a student, or the spouse's name of an unmarried person;
③ errors in the data preparation script.
The first cause cannot be solved in the short term; the second is not an error, since the data genuinely does not exist, and cannot be avoided; the third only requires verifying and fixing the script.
Missing values not only represent the loss of some information; they also affect the reliability and stability of the mining and analysis conclusions, so they must be dealt with.
If more than 50% of a field's records are missing, the field should be removed from the dataset directly, and an alternative field should be sought from the business if possible;
if no more than 50% are missing, first check whether the business offers an alternative field: if so, drop this field directly; if not, the missing values must be handled.

#check which fields contain missing values
dataset.isnull().any()    #fields containing NaN
#count missing (True) and non-missing (False) values per field
dataset.isnull().apply(pd.value_counts)
#drop every field that contains missing values
nan_col = dataset.isnull().any()
dataset.drop(nan_col[nan_col].index,axis=1)
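
The 50% rule above can be applied with a short sketch; the 0.5 threshold and the dataset variable follow the text above, so this is only one reasonable way to express it:

#share of missing records per field
missing_ratio = dataset.isnull().mean()
#drop only the fields whose missing share exceeds 50%, per the rule above
dataset_kept = dataset.drop(missing_ratio[missing_ratio > 0.5].index, axis=1)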

2.2 Handling of missing values

There are two main ways to deal with missing values: filtering and filling.

(1) Filtering of missing values

Directly deleting records with missing values reduces the sample size. This method is not recommended if too many samples would be deleted or if the dataset is small to begin with.

#drop records that contain missing values
dataset.dropna()
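
Before filtering, it is worth checking how many records would be lost; a quick sketch under the same assumptions:

#how many records would dropna() remove?
n_before = len(dataset)
n_after = len(dataset.dropna())
print(n_before - n_after, 'records contain missing values')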

(2) Filling of missing values

There are three main methods for filling missing values:
① Method 1: fill with a specific value
Fill with a statistic of the field, such as its mean, median, or mode.
Advantages: simple and fast;
Disadvantages: prone to aggravating data skew.
② Method 2: fill with a predicted value
Treat the field with missing values as the dependent variable and the fields without missing values as independent variables, then use a prediction algorithm such as a decision tree, random forest, KNN, or regression to predict the missing values and fill them with the predictions (a sketch is given after the code blocks below).
Advantages: relatively accurate;
Disadvantages: less efficient, and if the field with missing values is only weakly correlated with the other fields, the prediction is poor.
③ Method 3: treat the missing values as a separate group and fill with a designated value
Choose a value that does not occur in the business to fill with, so that the missing values are distinguished from the other values as a group of their own and do not interfere with the algorithm.
Advantages: simple and practical;
Disadvantages: less efficient.

#fill with specific values using pandas
dataset.fillna(0)   #fill missing values in every field with 0
dataset.fillna({'col2':20,'col5':0})    #use a different fill value for each field
dataset.fillna(dataset.mean())   #fill each field with its own mean
dataset.fillna(dataset.median())     #fill each field with its own median

#fill missing values with sklearn preprocessing (only applicable to continuous fields)
#note: Imputer is the legacy API; recent scikit-learn versions provide sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import Imputer
dataset2 = dataset.drop(['col4'],axis=1)    #drop the discrete field col4
colsets = dataset2.columns
nan_rule1 = Imputer(missing_values='NaN',strategy='mean',axis=0)    #create fill rule (mean fill)
pd.DataFrame(nan_rule1.fit_transform(dataset2),columns=colsets)    #apply the rule
nan_rule2 = Imputer(missing_values='NaN',strategy='median',axis=0)    #create fill rule (median fill)
pd.DataFrame(nan_rule2.fit_transform(dataset2),columns=colsets)    #apply the rule
nan_rule3 = Imputer(missing_values='NaN',strategy='most_frequent',axis=0)    #create fill rule (mode fill)
pd.DataFrame(nan_rule3.fit_transform(dataset2),columns=colsets)    #apply the rule
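
Method 2 (predictive filling) has no example above; the following is a minimal sketch using KNN regression, one of the algorithms mentioned earlier. It reuses dataset2 from above and assumes col3 is the field with missing values and that the other numeric fields are complete; the column names are only illustrative.

from sklearn.neighbors import KNeighborsRegressor

df = dataset2.copy()                       #numeric fields only, as above
target = 'col3'                            #assumed: the field with missing values
features = [c for c in df.columns if c != target]   #assumed: these fields are complete

known = df[df[target].notnull()]           #records used to train the model
unknown = df[df[target].isnull()]          #records whose target value is missing

if len(unknown) > 0:
    model = KNeighborsRegressor(n_neighbors=3)
    model.fit(known[features], known[target])
    df.loc[df[target].isnull(), target] = model.predict(unknown[features])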


3. Data skew problem

Data skew means that the value distribution of a field is mainly concentrated in a specific category or a specific interval.

3.1 Reasons for the data skew problem

There are three main causes of this problem:
① there is a problem with the upstream source data;
② there is a problem with the data preparation script;
③ the data is simply distributed that way.
If a field shows data skew, the first two causes must be checked first. If there is no problem, or the causes cannot be checked (for example in a standalone dataset), then consider whether the field still has value for subsequent analysis and modeling. Generally speaking, a field with severe data skew has weak power to distinguish the target variable and little value for analysis and modeling, so it should be removed directly.

3.2 How to measure the skewness of data

Data skew is mainly measured with frequency analysis, but the approach differs by data type:
① for a continuous field, first discretize it with equal-width binning, then compute the distribution of record counts across the bins;
② for a discrete field, compute the distribution of record counts across the categories directly.
Generally speaking, if more than 90% of a field's records fall in one particular category or interval, the field has a serious data skew problem.

#equal-width binning for a continuous field
pd.value_counts(pd.cut(dataset['col3'],5))  #split into 5 bins
#frequency counts for a discrete field
pd.value_counts(dataset['col4'])
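
As a rough check against the 90% rule above, a small sketch (using the same dataset and column names) computes the share of the largest bin or category for each field:

#share of the most frequent bin/category; fields above 0.9 are candidates for removal
cont_share = pd.value_counts(pd.cut(dataset['col3'],5), normalize=True).max()
disc_share = pd.value_counts(dataset['col4'], normalize=True).max()
print('col3 largest bin share:', cont_share)
print('col4 largest category share:', disc_share)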


4. Outlier checking

Outliers are data points that fall outside a particular distribution, range, or trend. Such data are also referred to as anomalies, outlier points, or noise.

4.1 Reasons for outliers

There are two main causes of outliers:
① errors in the process of data collection, generation, or transmission;
② special situations in the business process.
Outliers produced by the first cause are statistical anomalies: they are data problems caused by errors and need to be fixed. Outliers produced by the second cause are business anomalies: they reflect special results of the business process, are not errors, and deserve in-depth study. A typical data mining application is anomaly detection, for example credit card fraud, network intrusion detection, and customer behavior identification.

4.2 Identification method of outliers

There are mainly the following methods for identifying outliers:

(1) Extreme value check

It mainly checks whether the value of the field is beyond the reasonable range.
① Method 1: maximum and minimum values
Judge by the maximum and minimum values of the field. For example, a maximum customer age of 199, or a minimum bill amount of -20, is obviously abnormal.
② Method 2: the 3σ rule
If the data follows a normal distribution, the 3σ rule defines an outlier as a value that deviates from the mean by more than three standard deviations. Under the normality assumption, the probability of a value falling more than three standard deviations from the mean is less than 0.003, an extremely small probability event.
③ Method 3: boxplot analysis
A boxplot provides a criterion for identifying anomalies: an outlier is a value below the lower quartile minus 1.5 times the interquartile range, or above the upper quartile plus 1.5 times the interquartile range.
Boxplot analysis does not require the data to follow any particular distribution, so its identification of outliers is more objective.

#compute the relevant statistics
statDF = dataset2.describe()  #descriptive statistics
statDF.loc['mean+3std'] = statDF.loc['mean'] + 3 * statDF.loc['std']  #mean + 3 * std
statDF.loc['mean-3std'] = statDF.loc['mean'] - 3 * statDF.loc['std']  #mean - 3 * std
statDF.loc['75%+1.5dist'] = statDF.loc['75%'] + 1.5 * (statDF.loc['75%'] - statDF.loc['25%'])  #upper quartile + 1.5 * IQR
statDF.loc['25%-1.5dist'] = statDF.loc['25%'] - 1.5 * (statDF.loc['75%'] - statDF.loc['25%'])  #lower quartile - 1.5 * IQR
#maximum and minimum of each field
statDF.loc[['max','min']]
#check whether values exceed mean + 3 * std
dataset3 = dataset2 - statDF.loc['mean+3std']
dataset3[dataset3>0]
#check whether values fall below mean - 3 * std
dataset4 = dataset2 - statDF.loc['mean-3std']
dataset4[dataset4<0]
#check whether values exceed upper quartile + 1.5 * IQR
dataset5 = dataset2 - statDF.loc['75%+1.5dist']
dataset5[dataset5>0]
#check whether values fall below lower quartile - 1.5 * IQR
dataset6 = dataset2 - statDF.loc['25%-1.5dist']
dataset6[dataset6<0]

(2) Check the distribution of the number of records

This mainly checks whether the distribution of record counts for a field falls outside a reasonable range, using three indicators: the number of zero-value records, the number of positive-value records, and the number of negative-value records. A simple sketch follows.
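
A minimal sketch of these three counts, computed on the numeric dataset2 used earlier:

#count zero, positive and negative records per numeric field
pd.DataFrame({'zero': (dataset2 == 0).sum(),
              'positive': (dataset2 > 0).sum(),
              'negative': (dataset2 < 0).sum()})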

(3) Fluctuation check

The fluctuation check mainly applies to supervised data: it checks whether the dependent variable fluctuates significantly as the independent variables change.
The identification methods above are mainly for continuous fields. For discrete fields, anomaly identification mainly means checking whether the categories contain values outside the reasonable value set, for example the value "P20" appearing in a field of Apple handset models. A minimal sketch of such a check is given below.
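
The allowed value set in this sketch is purely illustrative and would come from the business in practice:

#check a discrete field against its reasonable value set (assumed values)
allowed_values = {'A','B','C'}    #assumed: the legitimate categories for col4
dataset[~dataset['col4'].isin(allowed_values)]    #records whose col4 falls outside the allowed set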

4.3 Handling of outliers

Statistical outliers are handled in two main ways: elimination or replacement. Elimination means deleting the records flagged as outliers from the dataset directly; replacement means replacing the outlier with a non-outlying value, such as a boundary value or, in a supervised setting, a value from records with similar target-variable behavior.
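
A minimal sketch of both approaches for a single continuous field, reusing the 3σ boundaries computed in statDF above (the column name col3 is only illustrative):

#elimination: drop records whose col3 lies outside mean ± 3 * std
upper = statDF.loc['mean+3std','col3']
lower = statDF.loc['mean-3std','col3']
kept = dataset2[(dataset2['col3'] >= lower) & (dataset2['col3'] <= upper)]
#replacement: clip col3 to the boundary values instead of deleting the records
replaced = dataset2.copy()
replaced['col3'] = replaced['col3'].clip(lower, upper)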
For the processing of business outliers, the principle is to conduct in-depth exploration and analysis to find the root cause of this special situation.






