Using the pandas qcut function in Python

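Chapter 7 of the book Python for Data Analysis introduces the pandas qcut function. The book describes qcut as a function closely related to binning: it bins the data based on sample quantiles, so qcut gives bins of roughly equal size, meaning an equal number of data points per bin rather than equal width: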

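import numpy as np
import pandas as pd

data = np.random.randn(1000)  # data is drawn from a standard normal distribution
cats = pd.qcut(data, 4)  # cut data into quartiles (4 bins with equal counts)
cats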
Out: [(0.657, 3.349], (-0.722, -0.0358], (-3.016, -0.722], (-3.016, -0.722], (-3.016, -0.722], ..., (-0.722, -0.0358], (-0.722, -0.0358], (-0.722, -0.0358], (-0.0358, 0.657], (-0.0358, 0.657]]
Length: 1000
Categories (4, interval[float64]): [(-3.016, -0.722] < (-0.722, -0.0358] < (-0.0358, 0.657] < (0.657, 3.349]]
pd.value_counts(cats)  # qcut produced bins with equal counts
Out: (0.657, 3.349]       250
(-0.0358, 0.657]     250
(-0.722, -0.0358]    250
(-3.016, -0.722]     250
dtype: int64

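The elements of data follow a normal distribution, and there are 1000 of them. qcut places the bin edges at the sample quartiles, so each of the 4 bins holds 250 values. The bins are equal in count, not in width.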
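To make the "equal count, not equal width" point concrete, here is a minimal sketch. It is not from the book: the seeded generator and the names `rng`, `edges`, `counts`, and `widths` are assumptions made for illustration, and `retbins=True` is used only to get the numeric bin edges back.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed so the run is reproducible (the book uses unseeded randn).
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# retbins=True also returns the numeric bin edges that qcut computed.
cats, edges = pd.qcut(data, 4, retbins=True)

# cats.codes holds the bin index (0..3) of each value; count the values per bin.
counts = np.bincount(cats.codes, minlength=4)

# Width of each bin: the outer bins are wider because the tails are sparse.
widths = np.diff(edges)

print(counts)
print(widths)
```

The counts should come out equal while the widths differ, which is the opposite of what equal-width binning would give.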
We can also pass custom quantiles (numbers between 0 and 1, inclusive):

pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out: [(1.258, 3.349], (-1.392, -0.0358], (-1.392, -0.0358], (-1.392, -0.0358], (-1.392, -0.0358], ..., (-1.392, -0.0358], (-1.392, -0.0358], (-1.392, -0.0358], (-0.0358, 1.258], (-0.0358, 1.258]]
Length: 1000
Categories (4, interval[float64]): [(-3.016, -1.392] < (-1.392, -0.0358] < (-0.0358, 1.258] < (1.258, 3.349]]
catss = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
pd.value_counts(catss)
Out: (-0.0358, 1.258]     400
(-1.392, -0.0358]    400
(1.258, 3.349]       100
(-3.016, -1.392]     100
dtype: int64

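When the custom quantiles [0, 0.1, 0.5, 0.9, 1.] are passed to qcut, the bin edges are placed at the corresponding sample quantiles, so the fraction of the data in each bin equals the difference between consecutive quantiles. The first bin covers the quantile range (0, 0.1], so it holds 1000 * 0.1 = 100 values, which is why the first interval (-3.016, -1.392] has a count of 100. In the same way, the remaining bins hold 1000 * 0.4 = 400, 1000 * 0.4 = 400, and 1000 * 0.1 = 100 values, splitting the 1000 values into four intervals.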
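In addition, the left edge of the first interval corresponds to the sample minimum (it is displayed slightly below the minimum so that the minimum itself falls inside the right-closed interval), and the right edge of the last interval is the sample maximum. The quantiles determine the edges in between.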
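The arithmetic above can be checked with a short sketch. This is an illustration under assumptions rather than the book's code: the seed, the variable names, and the comparison against `np.quantile` are additions, and the exact counts rely on the 1000 sampled values being distinct.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed for reproducibility; continuous data, so no ties.
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
q = [0, 0.1, 0.5, 0.9, 1.0]

# retbins=True returns the numeric edges alongside the categorical result.
cats, edges = pd.qcut(data, q, retbins=True)

# Expected bin sizes: sample size times the gap between consecutive quantiles.
expected_counts = np.round(len(data) * np.diff(q)).astype(int)

# Actual bin sizes, in bin order, from the integer bin codes.
actual_counts = np.bincount(cats.codes, minlength=len(q) - 1)

print(expected_counts)
print(actual_counts)

# The outermost edges are the sample minimum and maximum;
# the inner edges are the sample quantiles at 0.1, 0.5 and 0.9.
print(edges[0] == data.min(), edges[-1] == data.max())
print(np.allclose(edges, np.quantile(data, q)))
```

Note that the returned `edges` hold the exact minimum as the first edge; only the printed interval label is nudged slightly lower.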

Reposted from blog.csdn.net/Mr_Liuzhongbin/article/details/87710460