[Data Mining] Data Preprocessing - Missing Values


Data pre-processing architecture: (figure omitted)
ETL: a data pipeline responsible for extracting heterogeneous, distributed data, cleansing and transforming it according to certain business rules, and finally loading the processed data into the destination - the data warehouse.

Data Extraction

Data is extracted from sources such as Oracle, SQL Server, flat files, and Teradata. During extraction, check data types to ensure data integrity, remove duplicate and dirty data, and make sure the exported data attributes and metadata stay consistent.

  • Update extraction:

When new data is added or existing data is updated in the source system, the system issues a reminder/alert.

  • Full extraction:

When new data is added or existing data is updated in the source, the system issues no notification. Similar to data migration or replication, the data in the source table or view is extracted from the database in its entirety and converted into a format the ETL tool can recognise. It is generally used only when the system is initialised; after the one-time full extraction, incremental extraction is needed on a daily basis.

  • Incremental extraction:

When new data is added or existing data is updated in the source, the system issues no notification, but the updated data can be identified; incremental extraction pulls only the rows added or modified in the source table since the last extraction.
stream processing: for small data sets and real-time data acquisition
batch extraction: suitable for large data sets
Situations suited to incremental extraction:
• the amount of source data is small
• the source data does not change easily
• the source data changes in a regular pattern
• the amount of target data is huge

Capture method

  • Trigger capture (snapshot):
    Triggers are added to the source table to capture data changes; changed rows are written into a temporary table, the target system reads the data from the temporary table, and the rows are marked or deleted after they have been taken.
    Advantage: a high degree of automation.
    Drawback: it has some impact on the performance of the source system and is not recommended for frequent use.
  • Timestamp:

An update-timestamp field is added to the source table; its value changes whenever the row's data changes, and the extraction process decides whether a record has already been extracted by comparing timestamp values (a sketch follows this list). The timestamp can be maintained automatically or updated manually.
Advantages: good performance and a clear extraction/cleaning approach.
Disadvantage: relatively intrusive for the business system.

  • Full delete-and-insert: the target table is emptied during extraction and the entire source table is re-imported.
    Advantage: a simple extraction rule.
    Disadvantage: not suitable for dimension tables referenced by foreign keys.
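
As a rough illustration of the timestamp capture method, the sketch below keeps a watermark of the last extraction time and pulls only the rows whose update timestamp is newer. The table and the `last_updated` column are hypothetical examples, not part of any particular system.

```python
import pandas as pd

# Hypothetical source table with an update-timestamp column named "last_updated"
source_df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": [10, 20, 30],
    "last_updated": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-10"]),
})

# Watermark: the time of the previous extraction run
last_extract_time = pd.Timestamp("2024-01-03")

# Incremental extraction: take only the rows changed since the last run
delta = source_df[source_df["last_updated"] > last_extract_time]
print(delta)

# Advance the watermark for the next run
last_extract_time = source_df["last_updated"].max()
```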

Data Conversion

  • Data conversion = cleaning + transformation
    Cleaning deals with duplicate, incomplete, and erroneous data;
    conversion rules are designed from the specific business rules so that the data reaches its required form (a small example follows this list):
    o Cleaning: e.g. turning null values into "0" or "NULL", converting "Male" to "M".
    o Deduplication: removing duplicate records.
    o Threshold validation: checking values against real-world constraints, e.g. an age cannot exceed 3 digits.
    o Transpose: transposing rows and columns.
    o Filtering: selecting only the required data to load into the destination.
    o Joining: combining data from multiple sources (merge, lookup).
    o Splitting: splitting one column of data into several columns.
    o Integration: merging several columns of data into one.
    High-quality data may pass straight through without requiring conversion.
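
The following pandas sketch shows a few of the conversion rules listed above (null handling, Male to M, deduplication, and a threshold check); the table and column names are invented purely for illustration.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "name": ["Tom", "Tom", "Ann", "Bob"],
    "gender": ["Male", "Male", "Female", None],
    "age": [25, 25, 999, 31],            # 999 is an implausible value
    "income": [5000, 5000, np.nan, 7000],
})

cleaned = (
    raw
    .drop_duplicates()                                # deduplication
    .assign(
        gender=lambda d: d["gender"].fillna("NULL")   # null value -> "NULL"
                          .replace({"Male": "M", "Female": "F"}),
        income=lambda d: d["income"].fillna(0),       # null value -> 0
    )
)

# Threshold validation: keep only plausible ages
cleaned = cleaned[cleaned["age"].between(0, 120)]
print(cleaned)
```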

Loading

After conversion, the data is loaded into the specified data warehouse, preparing it for subsequent data analysis and data mining. During loading:
• detect missing values and nulls
• check the consistency of the target data with the metadata
• verify whether the converted data matches expectations, i.e. test the data load

  • Full load: empty the target table before the data is loaded, then import the whole source table; simpler than incremental load. But when the source data volume is large and real-time traffic is high, a large batch of data cannot be loaded successfully in a short time, so full load needs to be combined with incremental load (a sketch of both follows this list).
  • Incremental load:
    o Only the data that has changed in the source table is used to update the target table.
    o The difficulty of incremental load lies in locating the changed data: clear rules must be designed to extract the changed data from the data source, and after conversion these changes are applied to the data destination according to the corresponding logic.
    Ways of capturing changes for incremental load:
    o Log analysis:
    Changed data is identified by analysing the database's own logs. Relational database systems store all DML operations in log files for the database's backup and restore functions. The ETL incremental extraction process analyses the database logs, extracts the information about DML operations on the source tables after a given point in time, and thereby derives the changes made to the tables since the last extraction to drive the incremental extraction.
    o Triggers:
    Direct data load: create a temporary table with a structure similar to the source table, then create three triggers for insert, update, and delete. Whenever data in the source table changes, the triggers write the changed rows into the temporary table. During the ETL process, the corresponding rows in the target table are modified by maintaining the temporary table; after the ETL process finishes, the temporary table is emptied.
    A delta log table can also be used for the incremental load.
    o Timestamp.
    o Whole-table comparison.
    The incremental data is loaded directly or after conversion.
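
To make the difference between full load and incremental load concrete, here is a minimal sketch using pandas with a local SQLite file standing in for the data warehouse; the table name `dim_customer` and the data are made up.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///warehouse.db")      # stand-in for the target warehouse

full_df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
delta_df = pd.DataFrame({"id": [4], "name": ["d"]})   # only the rows that changed

# Full load: empty the target table first, then import the whole source table
full_df.to_sql("dim_customer", engine, if_exists="replace", index=False)

# Incremental load: append only the changed rows captured since the last run
delta_df.to_sql("dim_customer", engine, if_exists="append", index=False)
```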

ETL

Extraction - Transformation - Loading
The data is transformed first and then loaded into the data warehouse.

ELT

Extraction - Loading - Transformation
The data is loaded into the data warehouse first and transformed there.
(1) simplifies the ETL architecture
(2) reduces the amount of data extracted and the actual performance overhead
ETL requires extraction, transformation, and loading to be repeated many times, whereas ELT can achieve one extraction and one load followed by multiple transformations.

Data cleaning

  • Missing values

Causes: human negligence or mechanical failure leading to data being lost or hidden;
the data itself does not exist, for example a student's salary;
real-time requirements are too high for the data to be collected in time;
historical limitations lead to incomplete data collection.

  • Impact: data and features determine the upper limit of machine learning, while models and algorithms only approach this limit.
    A data set missing part of its data may reduce the chance of over-fitting, but it also carries the risk of an excessively biased model, because the behaviour of and relationships between variables cannot be properly analysed, leading to erroneous predictions or analysis.

  • Complete variable: a variable in the data set that contains no missing values

  • Incomplete variable: a variable in the data set that contains missing values

  • Missing completely at random (MCAR): the probability of a value being missing is the same for all records and does not depend on either complete or incomplete variables, like flipping a coin (a toy simulation of all three mechanisms follows this list)

  • Missing at random (MAR): the missingness is related to other, complete variables; for example, household income information is collected less completely for women than for men, while gender itself is a complete variable

  • Missing not at random (MNAR): the missingness is related to the incomplete variable itself. MNAR has two cases:
    Missingness depending on an unobserved variable:
    the missing data is not random; it depends on a variable that was not observed. For example, a course offered by a school may have a high drop-out rate related to the quality of the course content, but the data set does not contain a "course quality score" variable.
    Missingness depending on the value itself:
    the probability of a value being missing is directly related to the value itself; for example, people with very high or very low incomes may not want to provide proof of income.
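
The toy simulation below makes the three missingness mechanisms concrete by dropping income values in three different ways; the variables and probabilities are invented for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "gender": rng.choice(["M", "F"], size=n),
    "income": rng.normal(5000, 1500, size=n),
})

# MCAR: every record has the same 10% chance of losing its income value
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "income"] = np.nan

# MAR: the chance of a missing income depends on gender, a fully observed variable
mar = df.copy()
p = np.where(mar["gender"] == "F", 0.30, 0.05)
mar.loc[rng.random(n) < p, "income"] = np.nan

# MNAR: the chance of a missing income depends on the income value itself
mnar = df.copy()
mnar.loc[(mnar["income"] > 7000) & (rng.random(n) < 0.50), "income"] = np.nan
```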

Approach

  • Deletion: suitable when the data volume is large and the missing values make up only a small proportion of the data set; it can be used directly when the data are missing completely at random, following the 80% rule. Disadvantages: deleting original data loses information and undermines the integrity of the historical data; when missing data account for a large proportion, deletion is likely to change the distribution of the raw data; it reduces model accuracy.

  • Filling: manual filling, filling with a special value, mean filling, hot-deck filling, KNN, prediction models. Disadvantages: may change the distribution of the original data; may introduce noise; may reduce model accuracy.
    KNN: using the KNN method with Euclidean distance, select the K samples closest to the sample with missing values, and estimate the missing values by voting or by a (weighted) average of the K neighbours' values.
    The choice of k has a significant impact on the result of the k-nearest-neighbour method; k is generally a relatively small value, and the optimal k is chosen by cross-validation.
    For the KNN algorithm itself, see:
    https://blog.csdn.net/eeeee123456/article/details/79927128

  • No treatment: some algorithms, such as XGBoost, can handle missing values natively, so they can be left as they are.

  • Regression: use the records without missing data as the training set, build a regression model on it, and use the model to predict and fill in the missing values (a sketch follows this list); this applies only when the missing values are continuous.
    Theoretically prediction-based filling works better than simple filling, but if the variable with missing values is not correlated with the other variables, the predicted values are not statistically meaningful.

  • Variable mapping: map the variable into a higher-dimensional space (see the sketch after this list).
    Example: in a customer information table the gender column takes three forms, "male", "female", and null (missing information); the gender attribute can be mapped to three attributes, "is male", "is female", and "is missing".
    Advantages: preserves the integrity of the original data and does not require special handling of missing values.
    Disadvantages: greatly increases the computational overhead;
    the data may become sparse, lowering model quality.
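
A minimal sketch of the regression-based filling described above, assuming a single continuous column with missing values and using scikit-learn's LinearRegression; the columns `height` and `weight` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "height": [160, 165, 170, 175, 180, 185],
    "weight": [55, 60, np.nan, 72, np.nan, 85],
})

known = df[df["weight"].notna()]    # rows without missing values form the training set
unknown = df[df["weight"].isna()]   # rows whose value is to be predicted

model = LinearRegression()
model.fit(known[["height"]], known["weight"])

# Predict the missing values and fill them into the original frame
df.loc[df["weight"].isna(), "weight"] = model.predict(unknown[["height"]])
print(df)
```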
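
The gender example above can be reproduced with pandas one-hot encoding, where `dummy_na=True` adds the "is missing" indicator; this is only a sketch of the mapping idea, with made-up data.

```python
import numpy as np
import pandas as pd

customers = pd.DataFrame({"gender": ["male", "female", np.nan, "male"]})

# Map the single gender column to "is male" / "is female" / "is missing"
mapped = pd.get_dummies(customers["gender"], prefix="gender", dummy_na=True)
print(mapped)
```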

Data preprocessing implemented in Python

Experimental tools: Anaconda and the public iris data set bundled with Python's scikit-learn library

  • Deleting missing values:
    basic format: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
1. axis=0 operates on rows, axis=1 operates on columns.
2. how='any' drops a row or column if it contains any missing value; how='all' drops a row or column only if all of its values are missing.
3. thresh=N, optional: keep only the rows or columns that contain at least N non-missing values.
4. subset=column names, optional: drop the rows that have missing values in the specified columns.
5. inplace=True/False, Boolean, default False. inplace=True modifies the original data set N in place.
6. inplace=False leaves the original data set N unchanged and returns a new data set M with the modifications.

#Import the pandas and numpy libraries
import pandas as pd
import numpy as np
#Load the iris data set
from sklearn.datasets import load_iris
iris=load_iris()
print(iris)

#Data conversion: dropna() works on DataFrame-structured data sets, so the iris data set must first be converted into a DataFrame
dfx=pd.DataFrame(data=iris.data, columns=iris.feature_names)
#Manually inject some missing values
c=pd.DataFrame({"sepal length (cm)":[np.nan,4.9],
                 "sepal width (cm)":[np.nan,3],
                 "petal length (cm)":[np.nan,1.4],
                "petal width (cm)":[0.2,np.nan]})
df=dfx.append(c,ignore_index=True)
#Drop the rows that contain missing values
df.dropna()
#Drop the columns that contain missing values
df.dropna(axis='columns')
#Drop the rows whose attribute values are all NaN
df.dropna(how='all')
#Drop the rows that contain any missing value NaN
df.dropna(how='any')
#Drop the rows whose sepal width (cm) column contains NaN
df.dropna(subset=['sepal width (cm)'])
#Drop the rows containing NaN and return a new data set (the original is unchanged)
df.dropna(inplace=False)
  • Filling
    The data import steps are the same as above.
    basic format:
Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)
fit_transform() fits the data first and then transforms it.
	missing_values, optional, int or NaN, default NaN.
	strategy=mean/median/most_frequent, optional.
	strategy='mean' fills missing values with the mean of the column (or row).
	strategy='median' fills missing values with the median of the column (or row).
	strategy='most_frequent' fills missing values with the mode of the column (or row).
	axis=0/1, default 0, optional. axis=0 imputes along columns (column statistics), axis=1 imputes along rows; in practice axis=0 is normally used.
	verbose, default 0, optional. Controls the verbosity of the Imputer.
	copy=True/False, Boolean, default True, optional. copy=True creates a copy of the data set.

#Import the pandas and numpy libraries
import pandas as pd
import numpy as np
#Load the iris data set
from sklearn.datasets import load_iris
iris=load_iris()
dfx=pd.DataFrame(data=iris.data, columns=iris.feature_names)
c=pd.DataFrame({"sepal length (cm)":[np.nan,4.9],
                 "sepal width (cm)":[np.nan,3],
                 "petal length (cm)":[np.nan,1.4],
                "petal width (cm)":[0.2,np.nan]})

df=dfx.append(c,ignore_index=True)
#Mode filling
#Import the Imputer class
from sklearn.preprocessing import Imputer
#Use Imputer to fill the missing values; strategy='most_frequent' fills with the mode
data = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
#Fit the imputer on df and fill in the missing values
dataMode = data.fit_transform(df)
#Print dataMode
print(dataMode)


#Mean filling
data = Imputer(missing_values='NaN', strategy='mean', axis=0)
dataMode = data.fit_transform(df)

#Median filling
data = Imputer(missing_values='NaN', strategy='median', axis=0)
dataMode = data.fit_transform(df)
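
Note that the Imputer class used above was removed from recent scikit-learn releases (and DataFrame.append was removed in pandas 2.x). If the code above fails to import, a roughly equivalent sketch with the newer SimpleImputer, which has no axis parameter and always works column-wise, is:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# df is the iris DataFrame with injected NaNs; pd.concat replaces the removed append
df = pd.concat([dfx, c], ignore_index=True)

imp = SimpleImputer(missing_values=np.nan, strategy="mean")  # or "median" / "most_frequent"
dataMode = imp.fit_transform(df)
print(dataMode)
```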

  • KNN
    basic format:
 KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
	n_neighbors, optional, the number of nearest neighbours to use, default 5.
	weights, optional, the weighting of the neighbours: 'uniform' gives every neighbour the same weight, 'distance' weights neighbours by the inverse of their distance, and a user-defined function can also be supplied; default 'uniform'.
	algorithm, optional, the method used to compute the nearest neighbours, one of {'auto', 'ball_tree', 'kd_tree', 'brute'}.
	leaf_size, optional, the leaf size used when constructing the tree, default 30. The default is usually fine; values that are too large slow down model building.
	n_jobs, optional, the number of parallel jobs used for the neighbour search; -1 uses all CPU cores; default 1.


#Import the pandas and numpy libraries
import pandas as pd
import numpy as np
#Load the iris data set
from sklearn.datasets import load_iris
iris=load_iris()
# Load the iris data: name the predictor variables X dfx and the class variable Y dfy
dfx=pd.DataFrame(data=iris.data, columns=iris.feature_names)
dfy=pd.DataFrame(data=iris.target)
# Manually build the test set whose class labels are missing and need to be predicted
x_test=pd.DataFrame({"sepal length (cm)":[5.6,4.9],
                 "sepal width (cm)":[2.5,3],
                 "petal length (cm)":[4.5,1.4],
                "petal width (cm)":[0.2,2.1]})
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
# Build the KNN model using the 3 nearest neighbours (n_neighbors=3)
modelKNN = KNeighborsClassifier(n_neighbors=3)
modelKNN.fit(dfx,dfy.values.ravel())
# Predict the classes of the test records
predicted = modelKNN.predict(x_test)
print (predicted)

The output shows the predicted categories (1 and 2) for the two rows of test data.
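
The example above uses KNeighborsClassifier to predict a class label for complete test records. To fill missing feature values directly with the K nearest neighbours, as in the KNN filling method described earlier, newer scikit-learn versions also provide KNNImputer; a minimal sketch, reusing the `df` with injected NaNs from the filling section:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Fill each missing value with the distance-weighted mean of its 3 nearest neighbours
imputer = KNNImputer(n_neighbors=3, weights="distance")
dataKNN = imputer.fit_transform(df)   # df: the iris DataFrame with injected NaNs
print(dataKNN)
```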
