22. Data Preprocessing: Outlier Handling

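Outliers are values that deviate from the normal range; they are not necessarily erroneous values.
Outliers occur infrequently, but they can still bias the analysis in a real project.
Outliers are usually detected with the box-plot method (interquartile range) or the distribution method (standard deviation).
Outliers are usually handled by capping or by discretizing the data.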

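# Normal-distribution (standard-deviation) method
# Detect outliers in the price column
x_bar = df['Price'].mean()  # mean
x_std = df['Price'].std()   # standard deviation


# Each expression returns a single True or False
(df['Price'] > x_bar + 2.5 * x_std).any()
(df['Price'] < x_bar - 2.5 * x_std).any()
# Summary statistics
df['Price'].describe()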
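
The standard-deviation rule above can be sketched as a small helper that returns a boolean mask instead of a single True/False, so the offending rows can be inspected. The function name `std_outlier_mask` and the toy prices are assumptions made for illustration; they are not part of the original dataset.

```python
import pandas as pd

def std_outlier_mask(s: pd.Series, k: float = 2.5) -> pd.Series:
    """Return True where a value lies more than k standard deviations from the mean."""
    mean, std = s.mean(), s.std()
    return (s < mean - k * std) | (s > mean + k * std)

# Hypothetical toy data: nine ordinary prices and one extreme price
prices = pd.Series([100, 102, 98, 101, 99, 103, 97, 100, 102, 500], name="Price")
mask = std_outlier_mask(prices)
print(prices[mask])  # rows flagged as outliers
```

Note that a single extreme value inflates both the mean and the standard deviation, so on small samples this rule can miss outliers that the box-plot rule would catch.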

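# Box-plot (IQR) method
# First quartile
Q1 = df['Price'].quantile(q=0.25)
# Third quartile
Q3 = df['Price'].quantile(q=0.75)
# Interquartile range
IQR = Q3 - Q1
# Is any value above the upper fence?
(df['Price'] > Q3 + 1.5 * IQR).any()
# Is any value below the lower fence?
(df['Price'] < Q1 - 1.5 * IQR).any()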
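
As a minimal sketch of the same box-plot rule, the fences can be computed once and reused to list the outlying values. The helper name `iqr_bounds` and the toy data are hypothetical.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data: nine ordinary prices and one extreme price
prices = pd.Series([100, 102, 98, 101, 99, 103, 97, 100, 102, 500], name="Price")
lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]
print(lower, upper)
print(outliers)
```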

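import matplotlib.pyplot as plt

# Make sure figures render inline in Jupyter Notebook
%matplotlib inline

# Box plot to show outliers
df['Price'].plot(kind='box')

# Histogram to show outliers
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
# kind: plot type; bins: number of bars; density: normalize to a probability density
df['Price'].plot(kind='hist', bins=30, density=True)
df['Price'].plot(kind='kde')
plt.show()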

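# Capping: replace values beyond the 1st and 99th percentiles with those percentiles
P99 = df['Price'].quantile(q=0.99)
P1 = df['Price'].quantile(q=0.01)
# Copy the column into a new variable
df['Price_new'] = df['Price']
df.loc[df['Price'] > P99, 'Price_new'] = P99
df.loc[df['Price'] < P1, 'Price_new'] = P1

df[['Price', 'Price_new']].describe()
# Box plot after capping: the extreme values are gone
df['Price_new'].plot(kind='box')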
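
The capping step can also be written with `Series.clip`, which performs both replacements in one call. This is a sketch under assumed toy data; `cap_series` is a hypothetical helper name, and the 1%/99% cutoffs simply mirror the ones used above.

```python
import pandas as pd

def cap_series(s: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    """Cap values below the lower_q quantile and above the upper_q quantile."""
    lo, hi = s.quantile(lower_q), s.quantile(upper_q)
    return s.clip(lower=lo, upper=hi)

# Hypothetical toy data: 1..99 plus one extreme value
prices = pd.Series(range(1, 101), dtype=float)
prices.iloc[-1] = 10_000
capped = cap_series(prices)
print(pd.DataFrame({"Price": prices, "Price_new": capped}).describe())
```

Capping keeps every row, so the sample size is unchanged; only the tails are pulled in to the chosen percentiles.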

23. Data Preprocessing: Discretization

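Discretizing data means binning it.
The most common binning methods are equal-frequency binning and equal-width binning.
They are usually done with the pd.cut and pd.qcut functions.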

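pandas.cut(x, bins, right=True, labels=None)
x: the data to bin
bins: the number of equal-width bins, or a sequence of bin edges
labels: the labels for the resulting categories
right: whether each interval includes its right edge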
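
To make the effect of `right` concrete, here is a small sketch on made-up values that sit exactly on the bin edges. The variable names and data are assumptions for illustration.

```python
import pandas as pd

values = pd.Series([10, 20, 30])
edges = [0, 10, 20, 30]

# right=True (default): intervals are closed on the right, e.g. (0, 10]
right_closed = pd.cut(values, bins=edges, right=True)
# right=False: intervals are closed on the left, e.g. [10, 20)
left_closed = pd.cut(values, bins=edges, right=False)

print(right_closed)
print(left_closed)  # 30 lies outside [20, 30), so it becomes NaN
```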

# Equal-width binning
df['age_bin'] = pd.cut(df['age_new'], 5, labels=range(5))
# If labels is omitted, each bin is shown as its interval instead
df['Price_bin'] = pd.cut(df['Price_new'], bins=5, labels=range(0, 5))

df['Price_bin']
# Plot as a bar chart
df['Price_bin'].value_counts().plot(kind='bar')
# or df['Price_bin'].hist()

# Custom bin edges
w = [100, 1000, 5000, 10000, 20000, 100000]
df['Price_bin'] = pd.cut(df['Price_new'], bins=w)

df[['Price_bin', 'Price_new']]


df['Price_bin'] = pd.cut(df['Price_new'], bins=w, labels=range(0, 5))

df[['Price_bin', 'Price_new']]

df['Price_bin'].hist()


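# Equal-frequency binning
# w holds the quantile cut points; labels are the bin labels
k = 5
w = [1.0 * i / k for i in range(k + 1)]
w  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
# Split into 5 bins here
df['Price_bin'] = pd.qcut(df['Price_new'], q=w, labels=range(5))

df['Price_bin'].hist()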
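
The difference between equal-width and equal-frequency binning shows up most clearly on skewed data. The following sketch uses made-up values with one large value; all names here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one very large one
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Equal-width: five bins of identical width across the full range
width_bins = pd.cut(s, bins=5, labels=range(5))
# Equal-frequency: five bins that each hold about the same number of rows
freq_bins = pd.qcut(s, q=5, labels=range(5))

print(width_bins.value_counts().sort_index())
print(freq_bins.value_counts().sort_index())
```

With equal-width bins almost every value lands in the first bin, while equal-frequency bins spread the rows evenly. This is one reason to cap extreme values before equal-width binning.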

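# Custom equal-frequency binning: compute the cut points w1 first
k = 5
w1 = df['Price_new'].quantile([1.0 * i / k for i in range(k + 1)])
# Equal-frequency cut points
w1
# The lowest edge must be below the data minimum and the highest edge above the data maximum
# (the scaling below assumes the prices are positive)
w1[0.0] = w1[0.0] * 0.95
w1[1.0] = w1[1.0] * 1.1
df['Price_bin'] = pd.cut(df['Price_new'], bins=w1, labels=range(0, 5))
df['Price_bin'].hist()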
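
The reason the outer edges need adjusting is that `pd.cut` intervals are open on the left by default, so a value equal to the lowest edge falls outside every bin. The sketch below, on assumed toy data, shows the problem and one alternative to widening the edges: the `include_lowest=True` argument of `pd.cut`.

```python
import pandas as pd

# Hypothetical toy data
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0])
k = 5
edges = s.quantile([i / k for i in range(k + 1)]).to_numpy()

# Using the raw quantiles as edges: the minimum is not in (min, ...] and becomes NaN
naive = pd.cut(s, bins=edges, labels=range(k))
# include_lowest=True closes the first interval on the left
fixed = pd.cut(s, bins=edges, labels=range(k), include_lowest=True)

print(naive.isna().sum(), fixed.isna().sum())
```

Either approach works; widening the edges also leaves room for new data slightly outside the observed range, whereas `include_lowest=True` keeps the edges at the exact quantiles.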

DLANDML
