Data Cleaning
1. Data Errors
- Data often contain errors: mismatches with the ground truth
- Good ML models are resilient to such errors (they tolerate a certain amount of data error)
- Deploying these models online may affect the quality of newly collected data
2. Types of data errors
- Outliers: values that lie far from the rest of the data
- Rule violations: values that violate integrity constraints such as "Not Null" and "Must be unique"
- Pattern violations: values that violate syntactic or semantic constraints, such as alignment, formatting, and misspelling
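The two integrity constraints named above can be checked directly in pandas. The following is a minimal sketch on a made-up table; the column names `id` and `price` and all values are assumptions chosen for illustration, not part of the housing data used later.

```python
import pandas as pd

# Hypothetical toy table; column names and values are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "price": [100.0, None, 250.0, 300.0],
})

# "Not Null" rule: rows where a required column is missing.
not_null_violations = df[df["price"].isnull()]

# "Must be unique" rule: every row whose id occurs more than once.
unique_violations = df[df["id"].duplicated(keep=False)]

print(not_null_violations.index.tolist())  # [1]
print(unique_violations.index.tolist())    # [1, 2]
```

Passing `keep=False` to `duplicated` flags all copies of a repeated key rather than only the later ones, which is usually what is wanted when the goal is to inspect the conflict.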
3. Outlier Detection
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
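from matplotlib_inline import backend_inline
# Render figures as SVG in the notebook (IPython.display.set_matplotlib_formats is deprecated)
backend_inline.set_matplotlib_formats("svg")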
data = pd.read_feather("./data/house_sales.ftr")
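# Count the missing values in each column
null_sum = data.isnull().sum()  # .sum() adds up along each column by default
# Drop the columns in which more than 30% of the values are missing
data.drop(columns=data.columns[null_sum > len(data) * 0.3], inplace=True)  # modifies data in place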
# Convert the currency columns to float (a regular expression makes this easy)
currency = ['Sold Price', 'Listed Price', 'Tax assessed value', 'Annual tax amount']
for c in currency:
    data[c] = data[c].replace(
        r'[$,-]', '', regex=True).replace(
        r'^\s*$', np.nan, regex=True).astype(float)
# Convert the area columns to float; values given in acres are converted to square feet
areas = ['Total interior livable area', 'Lot size']
for c in areas:
    acres = data[c].str.contains('Acres') == True
    col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b', '', regex=True).astype(float)
    col[acres] *= 43560  # 1 acre = 43,560 sqft
    data[c] = col
data["Type"].value_counts()[0:20]
SingleFamily            102040
Condo                    27443
MultiFamily               7346
Townhouse                 7108
VacantLand                6199
Unknown                   5849
MobileManufactured        3605
Apartment                 1922
Single Family              463
Cooperative                176
Residential Lot             76
Single Family Lot           56
MFD-F                       51
Acreage                     48
2 Story                     42
3 Story                     27
Hi-Rise (9+), Luxury        21
Condominium                 19
RESIDENTIAL                 19
Duplex                      19
Name: Type, dtype: int64
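# Keep only the four most common property types
types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
# Compare the price per sqft across types; the points drawn beyond the whiskers are outlier candidates
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types])
ax.set_ylim([0, 2000]);  # cap the y-axis so that extreme values do not flatten the boxes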
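A box plot only shows outliers visually. To obtain them as data, one option is the interquartile-range (IQR) rule, which matches the whiskers seaborn draws by default (1.5 times the IQR). The sketch below is an assumption about how one might do this, not the method used in the original notes; the helper `iqr_outliers` and the sample prices are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical prices per sqft; one value is clearly off.
prices = pd.Series([300.0, 320.0, 310.0, 290.0, 305.0, 5000.0])
mask = iqr_outliers(prices)
print(prices[mask].tolist())  # [5000.0]
```

On the housing data, the same helper could be applied separately to each property type, since a price per sqft that is normal for one type may be extreme for another.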
4. Rule-based detection
- Functional dependencies: x -> y means that a value of x determines a unique value of y
- Denial constraints: a more flexible formalism based on first-order logic
  - The phone number is not empty if the vendor has an EIN
  - If two captures of the same animal are indicated by the same tag number, the first one must be marked as original
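Both kinds of rule can be checked with a few lines of pandas. The sketch below tests a functional dependency (zip -> city) and the vendor constraint from the list above on a made-up table; the table, its column names, and the choice of zip -> city as the dependency are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy table; column names and values are illustrative only.
vendors = pd.DataFrame({
    "zip":   ["94301", "94301", "10001", "10001"],
    "city":  ["Palo Alto", "Palo Alto", "New York", "Newark"],
    "ein":   ["12-345", None, "98-765", "55-555"],
    "phone": ["650-1111", None, None, "212-2222"],
})

# Functional dependency zip -> city: each zip should map to exactly one city.
cities_per_zip = vendors.groupby("zip")["city"].nunique()
fd_violating_zips = cities_per_zip[cities_per_zip > 1].index.tolist()

# Denial constraint: no row may have an EIN together with an empty phone number.
dc_violations = vendors[vendors["ein"].notnull() & vendors["phone"].isnull()]

print(fd_violating_zips)             # ['10001']
print(dc_violations.index.tolist())  # [2]
```

A functional-dependency check of this kind reports which keys are inconsistent but not which of the conflicting values is wrong; deciding that usually needs extra evidence such as value frequency or an external reference.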
5. Pattern-based detection
- Syntactic patterns
  - eng -> English
  - Map a column to its most prominent data type and identify the values that do not fit
- Semantic patterns
  - Add rules through a knowledge graph
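The two syntactic checks can be sketched as follows. This is a minimal illustration on made-up values: the sample columns and the `canonical` mapping are assumptions, and the semantic, knowledge-graph-based rules are not covered here.

```python
import pandas as pd

# Hypothetical column with mixed formats; values are illustrative only.
raw = pd.Series(["1200", "950", "n/a", "1,100", "780"])

# Most values parse as numbers, so treat the column as numeric and
# flag the entries that fail to parse.
parsed = pd.to_numeric(raw, errors="coerce")
misfits = raw[parsed.isnull()].tolist()

# Normalization in the spirit of "eng -> English": map known variants to a
# canonical form (the mapping itself is an assumption).
canonical = {"eng": "English", "en": "English", "English": "English"}
langs = pd.Series(["eng", "English", "en", "fra"])
normalized = langs.map(canonical)  # values missing from the mapping become NaN
unknown = langs[normalized.isnull()].tolist()

print(misfits)  # ['n/a', '1,100']
print(unknown)  # ['fra']
```

Flagged values are candidates for review rather than certain errors: "1,100" is a valid number in a different format, which is the kind of case the currency and area conversions in section 3 handle with regular expressions.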