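Handling Outliers
- Outliers are values that deviate from the normal range; they are not the same as erroneous values
- Outliers occur infrequently, but they can still bias the analysis in a real project
- Outliers are usually identified with a box plot (interquartile-range method) or a distribution plot (standard-deviation method)
- Detection can use the range of the mean plus or minus two standard deviations, or the quartile-based rule (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
- Outliers are usually treated by capping (winsorizing) or by discretizing (binning) the data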
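Capping is applied to the Price column later in this walkthrough. Discretization, the other treatment named in the list, can be sketched roughly as follows. This is only a minimal illustration: the `prices` Series, the bin edges, and the labels are made-up assumptions, not values taken from MotorcycleData.csv.

```python
import pandas as pd

# Hypothetical prices with one extreme value (not from the dataset)
prices = pd.Series([500, 2500, 4000, 6000, 8000, 12000, 15000, 100000])

# Binning with hand-picked edges (the edges and labels are assumptions)
edges = [0, 5000, 10000, 20000, float('inf')]
labels = ['low', 'mid', 'high', 'very_high']
price_bin = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)

# Equal-frequency binning: each bin holds roughly the same number of rows
price_qbin = pd.qcut(prices, q=4, labels=['q1', 'q2', 'q3', 'q4'])

print(price_bin.tolist())
print(price_qbin.value_counts().sort_index())
```

After binning, the extreme value simply lands in the top category, so its magnitude no longer dominates statistics computed on the column.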
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据预处理'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
def f(x):
    # Remove '$' signs and thousands separators, then convert to float
    if '$' in str(x):
        x = str(x).strip('$')
        x = str(x).replace(',', '')
    else:
        x = str(x).replace(',', '')
    return float(x)
df['Price'] = df['Price'].apply(f)
df['Mileage'] = df['Mileage'].apply(f)
df.head(5)
| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
| 1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
| 2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
| 3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
| 4 | Used | NaN | 10000.0 | West Bend, Wisconsin, United States | 2012.0 | 17800.0 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |
5 rows × 22 columns
x_bar = df['Price'].mean()
x_std = df['Price'].std()
any(df['Price'] > x_bar + 2 * x_std)
True
any(df['Price'] < x_bar - 2 * x_std)
False
df['Price'].describe()
count 7493.000000
mean 9968.811557
std 8497.326850
min 0.000000
25% 4158.000000
50% 7995.000000
75% 13000.000000
max 100000.000000
Name: Price, dtype: float64
Q1 = df['Price'].quantile(q = 0.25)
Q3 = df['Price'].quantile(q = 0.75)
IQR = Q3 - Q1
any(df['Price'] > Q3 + 1.5 * IQR)
True
any(df['Price'] < Q1 - 1.5 * IQR)
False
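The two `False` results for the lower bounds follow directly from the summary statistics printed by `describe()` above. The sketch below redoes the arithmetic with those printed figures hard-coded; the variable names are illustrative only and are not part of the original notebook.

```python
# Figures copied from the df['Price'].describe() output above
mean, std = 9968.811557, 8497.326850
q1, q3 = 4158.0, 13000.0
price_min, price_max = 0.0, 100000.0

# Standard-deviation rule: mean +/- 2 * std
sigma_lower = mean - 2 * std
sigma_upper = mean + 2 * std

# Box-plot rule: quartiles +/- 1.5 * IQR
iqr = q3 - q1
iqr_lower = q1 - 1.5 * iqr
iqr_upper = q3 + 1.5 * iqr

print(round(sigma_lower, 2), round(sigma_upper, 2))
print(iqr_lower, iqr_upper)

# The maximum exceeds both upper bounds; the minimum is above both lower bounds
print(price_max > sigma_upper, price_min < sigma_lower)
print(price_max > iqr_upper, price_min < iqr_lower)
```

Both lower bounds come out negative (about -7025.84 and -9105), so a price column whose minimum is 0 cannot have low-side outliers under either rule. The upper bounds are about 26963.47 and 26263, well below the maximum of 100000.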
import matplotlib.pyplot as plt
%matplotlib inline
df['Price'].plot(kind='box')
<matplotlib.axes._subplots.AxesSubplot at 0x11ddad20ac8>
plt.style.use('seaborn')  # on Matplotlib 3.6 and later this style is named 'seaborn-v0_8'
df.Price.plot(kind='hist', bins=30, density=True)
df.Price.plot(kind='kde')
plt.show()
P99 = df['Price'].quantile(q=0.99)
P1 = df['Price'].quantile(q=0.01)
P99
39995.32
df['Price_new'] = df['Price']
df.loc[df['Price'] > P99, 'Price_new'] = P99
df.loc[df['Price'] < P1, 'Price_new'] = P1
df[['Price', 'Price_new']].describe()
| | Price | Price_new |
|---|---|---|
| count | 7493.000000 | 7493.000000 |
| mean | 9968.811557 | 9821.220873 |
| std | 8497.326850 | 7737.092537 |
| min | 0.000000 | 100.000000 |
| 25% | 4158.000000 | 4158.000000 |
| 50% | 7995.000000 | 7995.000000 |
| 75% | 13000.000000 | 13000.000000 |
| max | 100000.000000 | 39995.320000 |
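The same capping can also be written with `Series.clip`, which bounds every value to a lower and an upper limit in one call. The sketch below uses a small made-up Series rather than the motorcycle data, so `prices` and its values are assumptions for illustration; the 1st and 99th percentiles mirror the thresholds used above.

```python
import pandas as pd

# Hypothetical prices (not from MotorcycleData.csv)
prices = pd.Series([0, 150, 900, 4200, 8000, 13000, 21000, 100000], dtype=float)

p1 = prices.quantile(q=0.01)
p99 = prices.quantile(q=0.99)

# Capping with boolean masks, in the same style as the walkthrough
capped_loc = prices.copy()
capped_loc[prices > p99] = p99
capped_loc[prices < p1] = p1

# Equivalent one-liner: values outside [p1, p99] are set to the nearest bound
capped_clip = prices.clip(lower=p1, upper=p99)

print(capped_clip.equals(capped_loc))
print(capped_clip.min() == p1, capped_clip.max() == p99)
```

Either form leaves values inside the percentile range untouched, which is why the quartiles in the comparison table are identical for Price and Price_new while the mean, standard deviation, minimum, and maximum change.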