Practical Machine Learning Notes (4): Data Cleaning

Data Cleaning

1. Data Errors

  1. Data often contain errors: mismatches with the ground truth.
  2. Good ML models are resilient to errors; a good model can tolerate a certain amount of error in its data.
  3. Deploying these models online may affect the quality of newly collected data.

2. Types of data errors

  1. Outliers: values that deviate from the distribution of the other values in the column
  2. Rule violations: values that violate integrity constraints such as “Not Null” and “Must be unique”
  3. Pattern violations: values that violate syntactic and semantic constraints such as alignment, formatting, and misspelling
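
As a rough illustration of the two integrity constraints named above, the sketch below flags "Not Null" and "Must be unique" violations with pandas. The table and its column names (`id`, `zip`) are made up for the example and do not come from the house sales data.

```python
import pandas as pd

# Toy table; the column names are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "zip": ["94301", None, "94305", "94306"],
})

# "Not Null" violation: rows where a required column is missing.
null_rows = df[df["zip"].isnull()]

# "Must be unique" violation: every row whose id appears more than once.
dup_rows = df[df["id"].duplicated(keep=False)]

print(null_rows.index.tolist())  # [1]
print(dup_rows.index.tolist())   # [1, 2]
```

Passing `keep=False` marks all copies of a duplicated key rather than only the later ones, which is usually what you want when deciding by hand which row is correct.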

3. Outlier Detection

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
from IPython import display
display.set_matplotlib_formats("svg")
data = pd.read_feather("./data/house_sales.ftr")
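# Count the null values in each column
null_sum = data.isnull().sum()  # .sum() sums over each column by default
# Drop the columns in which more than 30% of the values are null
data.drop(columns=data.columns[null_sum>len(data)*0.3],inplace=True)  # modifies data in place
# Convert the currency columns to floats (a regular expression makes this simple)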
currency = ['Sold Price', 'Listed Price', 'Tax assessed value', 'Annual tax amount']
for c in currency:
    data[c] = data[c].replace(
        r'[$,-]', '', regex=True).replace(
        r'^\s*$', np.nan, regex=True).astype(float)
# Convert the area columns to floats (acres are converted to square feet)
areas = ['Total interior livable area', 'Lot size']
for c in areas:
    acres = data[c].str.contains('Acres') == True
    col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b','', regex=True).astype(float)
    col[acres] *= 43560
    data[c] = col
data["Type"].value_counts()[0:20]
SingleFamily            102040
Condo                    27443
MultiFamily               7346
Townhouse                 7108
VacantLand                6199
Unknown                   5849
MobileManufactured        3605
Apartment                 1922
Single Family              463
Cooperative                176
Residential Lot             76
Single Family Lot           56
MFD-F                       51
Acreage                     48
2 Story                     42
3 Story                     27
Hi-Rise (9+), Luxury        21
Condominium                 19
RESIDENTIAL                 19
Duplex                      19
Name: Type, dtype: int64
types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types])
ax.set_ylim([0, 2000]);
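
The box plot shows outliers visually. One common way to flag them programmatically is the interquartile-range rule that box plot whiskers are usually based on. The sketch below is a minimal version of that rule, not part of the original notebook; the helper name `iqr_outliers`, the factor `k = 1.5`, and the toy prices are assumptions.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy prices per square foot; the last value is an obvious outlier.
prices = pd.Series([300.0, 320.0, 350.0, 400.0, 410.0, 5000.0])
mask = iqr_outliers(prices)
print(prices[mask].tolist())  # [5000.0]
```

On the house sales data, it would make sense to compute the mask separately for each `Type`, since the box plot compares price per square foot within each house type.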

4. Rule-based detection

  1. Functional dependencies: x -> y means that a value of x determines a unique value of y

  2. Denial constraints: a more flexible first-order logic formalism

    • The phone number is not empty if the vendor has an EIN
    • If two captures of the same animal are indicated by the same tag number, then the first one must be marked as original
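
Both kinds of rule can be checked with a few lines of pandas. The sketch below is only an illustration: the helper `fd_violations`, the `Zip -> City` dependency, and both toy tables are hypothetical. The vendor check mirrors the EIN example above.

```python
import pandas as pd

def fd_violations(df: pd.DataFrame, x: str, y: str) -> pd.DataFrame:
    """Return the rows that break the functional dependency x -> y."""
    # For each value of x, count how many distinct y values it maps to.
    distinct_y = df.groupby(x)[y].transform("nunique")
    return df[distinct_y > 1]

# Hypothetical rule: a zip code should determine a single city.
listings = pd.DataFrame({
    "Zip": ["94301", "94301", "94305", "94305"],
    "City": ["Palo Alto", "Palo Alto", "Stanford", "Palo Alto"],
})
print(fd_violations(listings, "Zip", "City").index.tolist())  # [2, 3]

# Denial constraint: it must not happen that a vendor has an EIN but no phone.
vendors = pd.DataFrame({
    "EIN": ["12-3456789", None, "98-7654321"],
    "Phone": ["555-0100", None, None],
})
dc_violations = vendors[vendors["EIN"].notna() & vendors["Phone"].isna()]
print(dc_violations.index.tolist())  # [2]
```

The functional dependency check reports every row in a conflicting group, because the rule alone cannot tell which of the conflicting values is the wrong one.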

5. Pattern-based detection

  1. Syntactic patterns
    • e.g. eng -> English
    • Map a column to the most prominent data type and identify values that do not fit
  2. Semantic patterns
    • Add rules through a knowledge graph
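
The two syntactic checks can be sketched as follows. This is a simplified illustration that only considers the numeric type. The helper `non_numeric_values`, its `min_share` threshold, the `canonical` mapping, and the toy columns are all assumptions made for the example.

```python
import pandas as pd

def non_numeric_values(col: pd.Series, min_share: float = 0.5) -> pd.Series:
    """If most non-null values parse as numbers, return the ones that do not."""
    present = col.dropna()
    parsed = pd.to_numeric(present, errors="coerce")
    if parsed.notna().mean() < min_share:
        # Numeric is not the dominant type, so nothing is flagged.
        return present.iloc[0:0]
    return present[parsed.isna()]

bedrooms = pd.Series(["3", "4", "2", "three", "5", None])
print(non_numeric_values(bedrooms).tolist())  # ['three']

# Normalizing spelling variants such as "eng" -> "English" with a hand-written map.
canonical = {"eng": "English", "en": "English", "english": "English"}
language = pd.Series(["eng", "English", "en"])
cleaned = language.str.lower().map(canonical).fillna(language)
print(cleaned.tolist())  # ['English', 'English', 'English']
```

The final `fillna(language)` keeps any value that is missing from the map unchanged, so unknown spellings are left for manual review instead of being turned into nulls.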

Origin blog.csdn.net/jerry_liufeng/article/details/123430557