1.1 Removing Duplicates
For a variety of reasons, duplicate rows may appear in a DataFrame:
import numpy as np
import pandas as pd
data = pd.DataFrame({'k1':['one','two'] * 3 + ['two'],
'k2':[1,1,2,3,3,4,1]})
print(data)
----------
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 1
The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (i.e., identical to a row observed earlier):
print(data.duplicated())
-----------
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
Relatedly, drop_duplicates returns a DataFrame containing the rows where the duplicated array is False:
print(data.drop_duplicates())
----------
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
Suppose we had an additional column of values and wanted to filter duplicates based only on the 'k1' column:
data['v1'] = range(7)
print(data.drop_duplicates(['k1']))
--------------
k1 k2 v1
0 one 1 0
1 two 1 1
Both duplicated and drop_duplicates keep the first observed value combination by default. Passing keep='last' will return the last one:
print(data.drop_duplicates(['k1','k2'],keep = 'last'))
--------------
k1 k2 v1
0 one 1 0
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 1 6
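As a brief sketch of two related options not shown above, keep=False discards every member of a duplicated group (rather than keeping one representative), and the subset argument restricts the comparison to particular columns:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 1]})

# keep=False discards every row in a duplicated group, so both
# occurrences of ('two', 1) — rows 1 and 6 — are dropped
no_dups_at_all = data.drop_duplicates(keep=False)

# the comparison can be limited to a subset of columns
first_per_k1 = data.drop_duplicates(subset=['k1'])
```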
1.2 Transforming Data Using a Function or Mapping
Consider the following hypothetical data collected about various kinds of meat:
data = pd.DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],
'ounces':[4,3,12,6,7.5,8,3,5,6]})
print(data)
----------------------
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
Suppose you wanted to add a column indicating the type of animal that each food came from. Let's write down a mapping of each distinct meat type to the kind of animal:
meat_to_animal = {
'bacon':'pig',
'pulled pork':'pig',
'pastrami':'cow',
'corned beef':'cow',
'honey ham':'pig',
'nova lox':'salmon'
}
The map method on a Series accepts a function or dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method:
lowercased = data['food'].str.lower()
print(lowercased)
----------------
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
print(data)
------------------------------
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
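Because map accepts a function as well as a dict, the lowercasing and the lookup can also be done in a single step. A small self-contained sketch:

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'Bacon'],
                     'ounces': [4, 3, 8]})
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig'}

# one lambda does both the lowercasing and the dictionary lookup
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
```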
1.3 Replacing Values
Filling in missing data with fillna is a special case of more general value replacement. Consider the following Series:
data = pd.Series([1.,-999.,2.,-999.,-1000.,3.])
print(data)
--------------
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use the replace method, producing a new Series (unless inplace=True is passed):
print(data.replace(-999,np.nan))
--------------
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
To replace multiple values at once, pass a list and then the substitute value:
print(data.replace([-999,-1000],np.nan))
--------------
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
To use a different replacement for each value, pass a list of substitutes:
print(data.replace([-999,-1000],[np.nan,0]))
---------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
The argument passed can also be a dict:
print(data.replace({-999:np.nan,-1000:0}))
--------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
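The replace method also works on a whole DataFrame. As a sketch (the column names and sentinel values here are made up for illustration), a nested dict can target different sentinels in different columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, -999, 3], 'b': [-1000, 5, -999]})

# a nested dict maps column -> {old_value: new_value}, so -999 is
# only treated as a sentinel in column 'a', and -1000 only in 'b'
cleaned = df.replace({'a': {-999: np.nan}, 'b': {-1000: 0}})
```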
1.4 Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form, producing new, differently labeled objects. You can also modify the axes in-place without creating a new data structure. Here's a simple example:
data = pd.DataFrame(np.arange(12).reshape(3,4),index = ['Ohio','Colorado','New York'],
columns = ['one','two','three','four'])
Like a Series, the axis indexes have a map method:
transform = lambda x:x[:4].upper()
print(data.index.map(transform))
-----------------------------------------------
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
You can assign to index, modifying the DataFrame in-place:
data.index = data.index.map(transform)
print(data)
---------------------------
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:
print(data.rename(index = str.title,columns = str.upper))
-------------------------------
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
Notably, rename can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:
print(data.rename(index = {'OHIO':'INDIANA'},columns={'three':'peekaboo'}))
----------------------------------
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
rename saves you from the chore of copying the DataFrame manually and assigning to its index and columns attributes. Should you wish to modify a dataset in-place, pass inplace=True:
data.rename(index = {'OHIO':'INDIANA'},inplace = True)
print(data)
-------------------------------
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11
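As a small aside, the index and columns arguments of rename accept any callable, not just str methods, and the original object is left untouched unless inplace=True is passed. A sketch with a hypothetical prefixing lambda:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['Ohio', 'Texas'],
                    columns=['one', 'two', 'three'])

# any callable works as the renamer; the original is not modified
renamed = data.rename(columns=lambda c: 'col_' + c)
```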
1.5 Discretization and Binning
Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:
ages = [20,22,25,27,21,23,37,31,61,45,41,32]
Let's divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you can use cut, a function in pandas:
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
print(cats)
---------------------------------------------------------------------------------------------------------
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The object pandas returns is a special Categorical object. The output you see describes the bins computed by pandas.cut. You can treat it like an array of strings indicating the bin name; internally it contains a categories array specifying the distinct category names, along with a labeling for the ages data in the codes attribute:
print(cats.codes)
-------------------------
[0 0 0 1 0 0 2 1 3 2 2 1]
print(cats.categories)
-------------------------------------------------------
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
print(pd.value_counts(cats))
--------------
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Here pd.value_counts(cats) gives the counts of values in each bin produced by pandas.cut.
Consistent with mathematical notation for intervals, a parenthesis means the side is open and a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:
print(pd.cut(ages,[18,26,36,61,100],right = False))
---------------------------------------------------------------------------------------------------------
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
You can also pass your own bin names by passing a list or array to the labels option:
group_names = ['Youth','YoungAdult','MiddleAged','Senior']
print(pd.cut(ages,bins,labels=group_names))
-----------------------------------------------------------------------------
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
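If all you need are the integer bin codes rather than a Categorical, passing labels=False is a convenient shortcut. A minimal sketch using the same ages and bins as above:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# labels=False skips the Categorical and returns the integer
# bin codes directly as a plain array
codes = pd.cut(ages, bins, labels=False)
```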
If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:
data = np.random.rand(20)
print(pd.cut(data,4,precision=2))
---------------------------------------------------------------------------------------------------------------------------------------------------
[(0.72, 0.95], (0.25, 0.48], (0.019, 0.25], (0.72, 0.95], (0.25, 0.48], ..., (0.72, 0.95], (0.48, 0.72], (0.72, 0.95], (0.019, 0.25], (0.25, 0.48]]
Length: 20
Categories (4, interval[float64]): [(0.019, 0.25] < (0.25, 0.48] < (0.48, 0.72] < (0.72, 0.95]]
The precision=2 option limits the decimal precision to two digits.
A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, you can obtain roughly equal-size bins:
data = np.random.randn(1000)
cats = pd.qcut(data,4)
print(cats)
---------------------------------------------------------------------------------------------------------------------------------------------------
[(-0.633, 0.0281], (0.732, 2.705], (0.0281, 0.732], (-3.221, -0.633], (-3.221, -0.633], ..., (-0.633, 0.0281], (0.0281, 0.732], (-3.221, -0.633], (-0.633, 0.0281], (-3.221, -0.633]]
Length: 1000
Categories (4, interval[float64]): [(-3.221, -0.633] < (-0.633, 0.0281] < (0.0281, 0.732] <
(0.732, 2.705]]
print(cats.value_counts())
------------------------------------
(-3.221, -0.633]    250
(-0.633, 0.0281]    250
(0.0281, 0.732]     250
(0.732, 2.705]      250
dtype: int64
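qcut also accepts explicit quantile edges (numbers between 0 and 1, inclusive) instead of an integer count. A sketch, with an arbitrary seed used only to make it reproducible:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, for reproducibility only
data = np.random.randn(1000)

# custom quantiles: bottom 10%, middle 40%, next 40%, top 10%
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
counts = cats.value_counts()
```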
1.6 Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
data = pd.DataFrame(np.random.randn(1000,4))
print(data.describe())
---------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.000511 -0.021196 -0.031225 -0.010533
std 0.990803 0.990293 0.947879 1.009659
min -3.577692 -2.771380 -2.759002 -3.021722
25% -0.668135 -0.704517 -0.700162 -0.671600
50% 0.008604 -0.060045 -0.072687 -0.035403
75% 0.693142 0.618066 0.601413 0.693484
max 3.419051 3.714937 3.881218 3.615359
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
col = data[2]
print(col[np.abs(col) > 3])
-----------------------
721 3.052446
Name: 2, dtype: float64
To select all rows having a value exceeding 3 or -3, you can use the any method on a boolean DataFrame (axis=1 tests across the columns of each row):
print(data[(np.abs(data) > 3).any(axis=1)])
-------------------------------------------
0 1 2 3
66 0.442000 -3.667465 0.134274 1.486624
296 0.111119 0.253513 3.058447 -2.098602
535 0.787788 3.190138 -0.741357 0.391135
538 -0.591268 -1.335684 -3.062085 0.679055
979 3.112996 0.512741 -1.307721 -0.389606
Values can be set based on these criteria. The following code caps values outside the interval -3 to 3:
data[np.abs(data) > 3] = np.sign(data) * 3
print(data.describe())
---------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 0.040334 0.011034 -0.005806 -0.041304
std 1.041037 1.016363 0.969064 1.005545
min -3.000000 -3.000000 -3.000000 -3.000000
25% -0.676272 -0.691892 -0.652303 -0.679903
50% 0.035862 0.000163 0.002989 -0.018314
75% 0.761366 0.700559 0.647850 0.646082
max 3.000000 3.000000 3.000000 3.000000
The statement np.sign(data) produces 1 and -1 values based on whether the values in data are positive or negative:
print(np.sign(data).head())
---------------------
0 1 2 3
0 1.0 1.0 -1.0 -1.0
1 -1.0 1.0 1.0 1.0
2 -1.0 1.0 -1.0 1.0
3 1.0 -1.0 1.0 1.0
4 -1.0 -1.0 -1.0 -1.0
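If the goal is only to bound the values, the same capping can be done more directly with the clip method. A sketch (the seed here is arbitrary and used only for repeatability):

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # arbitrary seed for a repeatable sketch
data = pd.DataFrame(np.random.randn(1000, 4))

# clip caps everything outside [-3, 3] in a single call,
# equivalent to the np.sign-based assignment above
capped = data.clip(lower=-3, upper=3)
```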
1.7 Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
df = pd.DataFrame(np.arange(5 * 4).reshape((5,4)))
sampler = np.random.permutation(5)
print(sampler)
-----------
[4 0 2 3 1]
That array can then be used in iloc-based indexing or the equivalent take function:
print(df)
-----------------
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
print(df.take(sampler))
-----------------
0 1 2 3
4 16 17 18 19
0 0 1 2 3
2 8 9 10 11
3 12 13 14 15
1 4 5 6 7
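For random sampling rather than a full permutation, the sample method selects a subset of rows, with or without replacement. A small sketch (random_state is fixed only to make it repeatable):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# sample without replacement: a random subset of distinct rows
subset = df.sample(n=3, random_state=0)

# sample with replacement: rows may repeat (useful for bootstrapping)
boot = df.sample(n=10, replace=True, random_state=0)
```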
1.8 Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing 1s and 0s. pandas has a get_dummies function for doing this:
df = pd.DataFrame({'key':['b','b','a','c','a','b'],
'data1':range(6)})
print(pd.get_dummies(df['key']))
----------
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:
dummies = pd.get_dummies(df['key'],prefix = 'key')
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
-----------------------------
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
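get_dummies can also take a whole DataFrame; the columns argument selects which columns to encode, and each column's name is used as the prefix automatically. A sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# encode only 'key'; 'data1' passes through unchanged and the
# prefix 'key_' is derived from the column name
encoded = pd.get_dummies(df, columns=['key'])
```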
A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
np.random.seed(12345)
values = np.random.randn(10)
print(values)
------------------------------------------------------------------------
[-0.20470766 0.47894334 -0.51943872 -0.5557303 1.96578057 1.39340583
0.09290788 0.28174615 0.76902257 1.24643474]
bins = [0,0.2,0.4,0.6,0.8,1]
print(pd.get_dummies(pd.cut(values,bins)))
-------------------------------------------------------------
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 0 0 0
1 0 0 1 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 0 1 0
9 0 0 0 0 0
We set the random seed with numpy.random.seed to make the example reproducible.