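Copyright notice: please credit the source when reposting: https://blog.csdn.net/sixkery/article/details/83385913
This is the "10 minutes to pandas" tutorial that circulates online; I typed the whole thing out by hand here.
PS: Ten minutes? No way. It took me ages to type all this out, and I feel like crying.
First, import the libraries: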
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
Creating objects
Create a Series by passing a list of values and letting pandas create a default integer index:
s = pd.Series([1,2,3,4,5,np.nan,6])
s
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
6 6.0
dtype: float64
Missing values are usually created with np.nan.
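As a small side sketch of what np.nan does to a Series (the name s_demo and the values are made up for illustration, they are not part of the tutorial): a single NaN makes pandas store the integers as float64, and isna() can count the gaps.

```python
import numpy as np
import pandas as pd

# Illustrative values only; s_demo is not a name used in the tutorial.
s_demo = pd.Series([1, 2, np.nan])

# The NaN forces the integer inputs to be stored as float64.
print(s_demo.dtype)

# isna() marks missing entries, so summing the mask counts them.
missing_count = int(s_demo.isna().sum())
print(missing_count)
```

This is why the Series above prints 1.0, 2.0, ... instead of plain integers.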
Create a DataFrame by passing a NumPy array together with a datetime index and labeled columns:
datas = pd.date_range('20180501',periods=6)
datas
DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
'2018-05-05', '2018-05-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.rand(6,4),index=datas,columns=list('ABCD'))
df
| | A | B | C | D |
---|---|---|---|---|
2018-05-01 | 0.225279 | 0.735710 | 0.009639 | 0.408149 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | 0.181947 |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 | 0.678947 |
2018-05-06 | 0.342860 | 0.053211 | 0.670777 | 0.186546 |
- np.random.randn: draws random numbers from the standard normal distribution
- np.random.rand: draws random numbers uniformly from the interval [0, 1)
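To see the difference between the two generators, here is a rough sketch (the seed and the sample size of 1000 are arbitrary choices of mine, used only so the run is repeatable):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

uniform = np.random.rand(1000)   # uniform samples in [0, 1)
normal = np.random.randn(1000)   # standard normal samples, can be negative

print(uniform.min(), uniform.max())
print(normal.min(), normal.max())
```

The uniform samples never leave [0, 1), while the normal samples spread out on both sides of zero.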
Create a DataFrame from a dict:
df2 = pd.DataFrame({'A':1.,
'B':pd.Timestamp('20180101'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F':'foo'})
df2
| | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 1.0 | 2018-01-01 | 1.0 | 3 | test | foo |
1 | 1.0 | 2018-01-01 | 1.0 | 3 | train | foo |
2 | 1.0 | 2018-01-01 | 1.0 | 3 | test | foo |
3 | 1.0 | 2018-01-01 | 1.0 | 3 | train | foo |
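- pd.Timestamp: a timestamp, the pandas counterpart of Python's datetime
- pd.Series: each column of a DataFrame is a Series.
- np.array: a NumPy array can also be used to build a column.
- pd.Categorical: categorical data
The resulting columns have different dtypes: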
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Viewing data
df.head() # the first five rows of the DataFrame
| | A | B | C | D |
---|---|---|---|---|
2018-05-01 | 0.225279 | 0.735710 | 0.009639 | 0.408149 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | 0.181947 |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 | 0.678947 |
df.tail(3) # the last three rows
| | A | B | C | D |
---|---|---|---|---|
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 | 0.678947 |
2018-05-06 | 0.342860 | 0.053211 | 0.670777 | 0.186546 |
Displaying the index, columns, and values
df.index # show the index
DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
'2018-05-05', '2018-05-06'],
dtype='datetime64[ns]', freq='D')
df.columns # show the column names
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values # show the values
array([[0.22527927, 0.73570966, 0.00963855, 0.40814939],
[0.70500688, 0.90023873, 0.55120699, 0.16547071],
[0.60806789, 0.33234503, 0.51901872, 0.18194742],
[0.92195751, 0.62600349, 0.94582788, 0.35721069],
[0.304423 , 0.83649383, 0.73135067, 0.67894676],
[0.34286019, 0.05321105, 0.67077744, 0.18654591]])
describe() shows a quick statistical summary:
df.describe()
| | A | B | C | D |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.517932 | 0.580667 | 0.571303 | 0.329712 |
std | 0.271382 | 0.326662 | 0.314447 | 0.199088 |
min | 0.225279 | 0.053211 | 0.009639 | 0.165471 |
25% | 0.314032 | 0.405760 | 0.527066 | 0.183097 |
50% | 0.475464 | 0.680857 | 0.610992 | 0.271878 |
75% | 0.680772 | 0.811298 | 0.716207 | 0.395415 |
max | 0.921958 | 0.900239 | 0.945828 | 0.678947 |
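- count: number of non-missing values
- mean: average
- std: standard deviation
- min: minimum
- 25%: lower quartile (25th percentile)
- 50%: median
- 75%: upper quartile (75th percentile)
- max: maximum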
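To connect the labels above with the individual methods, a tiny sketch (the four values are invented for illustration) shows that the 50% row is the median and the 25% row is the 0.25 quantile:

```python
import pandas as pd

# Made-up values, chosen so the percentiles are easy to check by hand.
values = pd.Series([1, 2, 3, 4])
summary = values.describe()

# The '50%' entry matches median(), the '25%' entry matches quantile(0.25).
print(summary['50%'], values.median())
print(summary['25%'], values.quantile(0.25))
```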
Transposing the data:
df.T
| | 2018-05-01 00:00:00 | 2018-05-02 00:00:00 | 2018-05-03 00:00:00 | 2018-05-04 00:00:00 | 2018-05-05 00:00:00 | 2018-05-06 00:00:00 |
---|---|---|---|---|---|---|
A | 0.225279 | 0.705007 | 0.608068 | 0.921958 | 0.304423 | 0.342860 |
B | 0.735710 | 0.900239 | 0.332345 | 0.626003 | 0.836494 | 0.053211 |
C | 0.009639 | 0.551207 | 0.519019 | 0.945828 | 0.731351 | 0.670777 |
D | 0.408149 | 0.165471 | 0.181947 | 0.357211 | 0.678947 | 0.186546 |
Sorting by an axis:
df.sort_index(axis=1,ascending=False)
| | D | C | B | A |
---|---|---|---|---|
2018-05-01 | 0.408149 | 0.009639 | 0.735710 | 0.225279 |
2018-05-02 | 0.165471 | 0.551207 | 0.900239 | 0.705007 |
2018-05-03 | 0.181947 | 0.519019 | 0.332345 | 0.608068 |
2018-05-04 | 0.357211 | 0.945828 | 0.626003 | 0.921958 |
2018-05-05 | 0.678947 | 0.731351 | 0.836494 | 0.304423 |
2018-05-06 | 0.186546 | 0.670777 | 0.053211 | 0.342860 |
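- axis: which axis to sort along; 0 sorts by the row index, 1 sorts by the column labels
- ascending: a boolean; True sorts in ascending order, False in descending order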
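The two parameters are easier to see on a deliberately unsorted frame. This is only a sketch; the frame demo and its labels are my own illustration:

```python
import pandas as pd

# A small frame whose rows and columns both start out unsorted.
demo = pd.DataFrame({'B': [1, 2], 'A': [3, 4]}, index=['y', 'x'])

# axis=0 reorders the rows by their index labels: x, y.
by_rows = demo.sort_index(axis=0)

# axis=1 with ascending=False reorders the columns in descending order: B, A.
by_cols_desc = demo.sort_index(axis=1, ascending=False)

print(by_rows)
print(by_cols_desc)
```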
Sorting by values:
df.sort_values('B')
| | A | B | C | D |
---|---|---|---|---|
2018-05-06 | 0.342860 | 0.053211 | 0.670777 | 0.186546 |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | 0.181947 |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 |
2018-05-01 | 0.225279 | 0.735710 | 0.009639 | 0.408149 |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 | 0.678947 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 |
Slicing
Selecting a single column yields a Series; df['A'] is equivalent to df.A:
df['A']
2018-05-01 0.225279
2018-05-02 0.705007
2018-05-03 0.608068
2018-05-04 0.921958
2018-05-05 0.304423
2018-05-06 0.342860
Freq: D, Name: A, dtype: float64
df.A
2018-05-01 0.225279
2018-05-02 0.705007
2018-05-03 0.608068
2018-05-04 0.921958
2018-05-05 0.304423
2018-05-06 0.342860
Freq: D, Name: A, dtype: float64
df[:3]
| | A | B | C | D |
---|---|---|---|---|
2018-05-01 | 0.225279 | 0.735710 | 0.009639 | 0.408149 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | 0.181947 |
Selection by label
Get a cross section of the data using a label:
df.loc[datas[0]]
A 0.225279
B 0.735710
C 0.009639
D 0.408149
Name: 2018-05-01 00:00:00, dtype: float64
df.loc[datas[2]]
A 0.608068
B 0.332345
C 0.519019
D 0.181947
Name: 2018-05-03 00:00:00, dtype: float64
Selecting on multiple axes by label:
df.loc[:,['A','B']]
| | A | B |
---|---|---|
2018-05-01 | 0.225279 | 0.735710 |
2018-05-02 | 0.705007 | 0.900239 |
2018-05-03 | 0.608068 | 0.332345 |
2018-05-04 | 0.921958 | 0.626003 |
2018-05-05 | 0.304423 | 0.836494 |
2018-05-06 | 0.342860 | 0.053211 |
With a label slice, both endpoints are included:
df.loc['20180502':'20180504',['A','C']]
| | A | C |
---|---|---|
2018-05-02 | 0.705007 | 0.551207 |
2018-05-03 | 0.608068 | 0.519019 |
2018-05-04 | 0.921958 | 0.945828 |
df.loc['20180505',['A','C']]
A 0.304423
C 0.731351
Name: 2018-05-05 00:00:00, dtype: float64
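The inclusive endpoints are worth contrasting with position-based slicing. A minimal sketch, using a made-up single-column frame on the same date range:

```python
import pandas as pd

idx = pd.date_range('20180501', periods=6)
demo = pd.DataFrame({'A': range(6)}, index=idx)  # illustrative data

# Label slice with loc: both '20180502' and '20180504' are included.
by_label = demo.loc['20180502':'20180504']

# Position slice with iloc: the stop position is excluded, like a Python list.
by_position = demo.iloc[1:3]

print(len(by_label), len(by_position))
```

So the label slice returns three rows while the position slice returns two.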
Getting a single value:
df.loc[datas[0],'A']
0.22527926638468565
at can be used as well:
df.at[datas[0],'A']
0.22527926638468565
Selection by position
Select by the integer position passed in:
df.iloc[3]
A 0.921958
B 0.626003
C 0.945828
D 0.357211
Name: 2018-05-04 00:00:00, dtype: float64
By integer slices:
df.iloc[0:6,0:3]
| | A | B | C |
---|---|---|---|
2018-05-01 | 0.225279 | 0.735710 | 0.009639 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 |
2018-05-06 | 0.342860 | 0.053211 | 0.670777 |
By lists of integer positions:
df.iloc[[0,2,4],[1,3]]
| | B | D |
---|---|---|
2018-05-01 | 0.735710 | 0.408149 |
2018-05-03 | 0.332345 | 0.181947 |
2018-05-05 | 0.836494 | 0.678947 |
df.iloc[:,1:3]
| | B | C |
---|---|---|
2018-05-01 | 0.735710 | 0.009639 |
2018-05-02 | 0.900239 | 0.551207 |
2018-05-03 | 0.332345 | 0.519019 |
2018-05-04 | 0.626003 | 0.945828 |
2018-05-05 | 0.836494 | 0.731351 |
2018-05-06 | 0.053211 | 0.670777 |
Getting a single value:
df.iloc[1,1]
0.9002387294615217
Boolean indexing
df[df>0.2]
| | A | B | C | D |
---|---|---|---|---|
2018-05-01 | 0.225279 | 0.735710 | NaN | 0.408149 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | NaN |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | NaN |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 | 0.678947 |
2018-05-06 | 0.342860 | NaN | 0.670777 | NaN |
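The table above masks individual cells. A condition on a single column filters whole rows instead. Here is a sketch of both forms on invented numbers (demo and the 0.3 threshold are my own choices):

```python
import pandas as pd

demo = pd.DataFrame({'A': [0.1, 0.7, 0.4], 'B': [0.9, 0.2, 0.6]})

# A boolean Series built from one column keeps only the matching rows.
rows = demo[demo['A'] > 0.3]

# A boolean DataFrame keeps the shape and puts NaN where the test fails.
masked = demo[demo > 0.3]

print(rows)
print(masked)
```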
Filtering with isin():
df2 = df.copy()
df2['E'] = ['one','two','three','four','five','six']
df2
| | A | B | C | D | E |
---|---|---|---|---|---|
2018-05-01 | 0.225279 | 0.735710 | 0.009639 | 0.408149 | one |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 | two |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | 0.181947 | three |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 | four |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 | 0.678947 | five |
2018-05-06 | 0.342860 | 0.053211 | 0.670777 | 0.186546 | six |
df2['E'].isin(['one','five'])
2018-05-01 True
2018-05-02 False
2018-05-03 False
2018-05-04 False
2018-05-05 True
2018-05-06 False
Freq: D, Name: E, dtype: bool
df2[df2['E'].isin(['one','five'])]
| | A | B | C | D | E |
---|---|---|---|---|---|
2018-05-01 | 0.225279 | 0.735710 | 0.009639 | 0.408149 | one |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 | 0.678947 | five |
Setting values
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20180501',periods=6))
s1
2018-05-01 1
2018-05-02 2
2018-05-03 3
2018-05-04 4
2018-05-05 5
2018-05-06 6
Freq: D, dtype: int64
df['F'] = s1
Setting a value by label:
df.loc[datas[0],'A'] = 0
Setting a value by position:
df.iloc[0,1] = 0
Here is the result of the operations above:
df
| | A | B | C | D | F |
---|---|---|---|---|---|
2018-05-01 | 0.000000 | 0.000000 | 0.009639 | 0.408149 | 1 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 | 2 |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | 0.181947 | 3 |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 | 4 |
2018-05-05 | 0.304423 | 0.836494 | 0.731351 | 0.678947 | 5 |
2018-05-06 | 0.342860 | 0.053211 | 0.670777 | 0.186546 | 6 |
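One detail about assigning a Series as a new column, as done with df['F'] = s1: pandas lines the values up by index label rather than by order. A small sketch with invented labels:

```python
import pandas as pd

demo = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])

# The Series only has labels 'a' and 'c'; row 'b' has no match and gets NaN.
demo['F'] = pd.Series([10, 30], index=['a', 'c'])

print(demo)
```

In the tutorial's case s1 shares the full date index with df, so every row gets a value.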
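Missing data
pandas primarily uses np.nan to represent missing data; by default it is excluded from computations.
Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.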
df1 = df.reindex(index=datas[0:4],columns=list(df.columns) + ['E'])
df1.loc[datas[0]:datas[1],'E'] = 1
df1
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2018-05-01 | 0.000000 | 0.000000 | 0.009639 | 0.408149 | 1 | 1.0 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 | 2 | 1.0 |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | 0.181947 | 3 | NaN |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 | 4 | NaN |
Drop any rows that have missing data:
df1.dropna()
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2018-05-01 | 0.000000 | 0.000000 | 0.009639 | 0.408149 | 1 | 1.0 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 | 2 | 1.0 |
Filling missing data:
df1.fillna(value=5)
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2018-05-01 | 0.000000 | 0.000000 | 0.009639 | 0.408149 | 1 | 1.0 |
2018-05-02 | 0.705007 | 0.900239 | 0.551207 | 0.165471 | 2 | 1.0 |
2018-05-03 | 0.608068 | 0.332345 | 0.519019 | 0.181947 | 3 | 5.0 |
2018-05-04 | 0.921958 | 0.626003 | 0.945828 | 0.357211 | 4 | 5.0 |
Check for missing values; this returns a boolean mask:
df1.isna()
| | A | B | C | D | F | E |
---|---|---|---|---|---|---|
2018-05-01 | False | False | False | False | False | False |
2018-05-02 | False | False | False | False | False | False |
2018-05-03 | False | False | False | False | False | True |
2018-05-04 | False | False | False | False | False | True |
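The remark that missing data is left out of computations by default can be checked directly. A minimal sketch on a three-element Series of my own:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# By default NaN is skipped: the mean is taken over 1.0 and 3.0 only.
default_mean = values.mean()

# With skipna=False the NaN propagates into the result.
strict_mean = values.mean(skipna=False)

# count() reports only the non-missing entries.
present = values.count()

print(default_mean, strict_mean, present)
```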
Some operations
df.mean() # mean of each column
A 0.480386
B 0.458049
C 0.571303
D 0.329712
F 3.500000
dtype: float64
df.mean(axis=1) # mean of each row
2018-05-01 0.283558
2018-05-02 0.864385
2018-05-03 0.928276
2018-05-04 1.370200
2018-05-05 1.510243
2018-05-06 1.450679
Freq: D, dtype: float64
df.apply(np.cumsum)
| | A | B | C | D | F |
---|---|---|---|---|---|
2018-05-01 | 0.000000 | 0.000000 | 0.009639 | 0.408149 | 1 |
2018-05-02 | 0.705007 | 0.900239 | 0.560846 | 0.573620 | 3 |
2018-05-03 | 1.313075 | 1.232584 | 1.079864 | 0.755568 | 6 |
2018-05-04 | 2.235032 | 1.858587 | 2.025692 | 1.112778 | 10 |
2018-05-05 | 2.539455 | 2.695081 | 2.757043 | 1.791725 | 15 |
2018-05-06 | 2.882315 | 2.748292 | 3.427820 | 1.978271 | 21 |
apply runs the function on every column of the DataFrame.
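apply also accepts your own function, and axis=1 switches it from columns to rows. The following is only a sketch with invented data:

```python
import pandas as pd

demo = pd.DataFrame({'A': [1, 4, 2], 'B': [10, 30, 20]})

# Default axis=0: the lambda receives each column as a Series.
col_range = demo.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives each row as a Series.
row_sum = demo.apply(lambda row: row.sum(), axis=1)

print(col_range)
print(row_sum)
```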
Histogramming
s = pd.Series(np.random.randint(1,7,size=9))
s
0 6
1 1
2 1
3 5
4 1
5 3
6 2
7 1
8 1
dtype: int32
s.value_counts()
1 5
6 1
5 1
3 1
2 1
dtype: int64
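If shares are more useful than raw counts, value_counts takes normalize=True. A short sketch on made-up rolls:

```python
import pandas as pd

rolls = pd.Series([1, 1, 1, 2])  # illustrative values

counts = rolls.value_counts()                # occurrences of each value
shares = rolls.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```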
String methods
s = pd.Series(['A','B','C','Al',np.nan,'dOg'])
s.str.lower()
0 a
1 b
2 c
3 al
4 NaN
5 dog
dtype: object
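The .str methods can be chained, and missing values pass through untouched. A small sketch (the input strings are invented):

```python
import numpy as np
import pandas as pd

raw = pd.Series([' A ', 'dOg', np.nan])

# Strip surrounding whitespace, then lowercase; the NaN stays missing.
cleaned = raw.str.strip().str.lower()

print(cleaned)
```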
Concatenating
s1 = pd.Series(['a','b'])
s2 = pd.Series(['c','d'])
pd.concat([s1,s2])
0 a
1 b
0 c
1 d
dtype: object
Setting the ignore_index option to True discards the existing index and resets it in the result.
pd.concat([s1,s2],ignore_index=True)
0 a
1 b
2 c
3 d
dtype: object
Use the keys option to add a hierarchical index at the outermost level of the data:
pd.concat([s1,s2],keys=['s1','s2'])
s1 0 a
1 b
s2 0 c
1 d
dtype: object
pd.concat([s1,s2],keys=['s1','s2'],ignore_index=True)
0 a
1 b
2 c
3 d
dtype: object
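The point of the keys level is that you can get each original piece back afterwards. A sketch reusing the same two Series (the variable name combined is mine):

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])

combined = pd.concat([s1, s2], keys=['s1', 's2'])

# Selecting the outer label returns the piece that came from s2.
from_s2 = combined.loc['s2']

# A tuple addresses one element through both index levels.
single = combined.loc[('s1', 1)]

print(from_s2)
print(single)
```

As the last output above suggests, passing ignore_index=True together with keys gives a plain 0..n-1 index, so the keys no longer appear.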
Merging
left = pd.DataFrame({'key':['foo','bar'],'lval':[1,2]})
right = pd.DataFrame({'key':['foo','bar'],'rval':[3,4]})
left
| | key | lval |
---|---|---|
0 | foo | 1 |
1 | bar | 2 |
right
| | key | rval |
---|---|---|
0 | foo | 3 |
1 | bar | 4 |
pd.merge(left,right)
| | key | lval | rval |
---|---|---|---|
0 | foo | 1 | 3 |
1 | bar | 2 | 4 |
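In the example above every key exists on both sides. When that is not the case, the how parameter decides which rows survive. A sketch with an extra, invented 'baz' row:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# Default how='inner': only keys present in both frames are kept.
inner = pd.merge(left, right, on='key')

# how='left': every row of left is kept; unmatched rval becomes NaN.
left_join = pd.merge(left, right, on='key', how='left')

print(inner)
print(left_join)
```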
Appending
Append rows to a DataFrame:
df = pd.DataFrame(np.random.rand(4,4),columns=['A','B','C','D'])
df
| | A | B | C | D |
---|---|---|---|---|
0 | 0.304266 | 0.558159 | 0.699805 | 0.964887 |
1 | 0.083297 | 0.228968 | 0.825672 | 0.483591 |
2 | 0.497066 | 0.203718 | 0.894997 | 0.830234 |
3 | 0.001110 | 0.323248 | 0.066382 | 0.074556 |
s = df.iloc[3]
s
A 0.001110
B 0.323248
C 0.066382
D 0.074556
Name: 3, dtype: float64
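pd.concat([df, s.to_frame().T], ignore_index=True) # DataFrame.append was removed in pandas 2.0; older versions could use df.append(s, ignore_index=True)
| | A | B | C | D |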
---|---|---|---|---|
0 | 0.304266 | 0.558159 | 0.699805 | 0.964887 |
1 | 0.083297 | 0.228968 | 0.825672 | 0.483591 |
2 | 0.497066 | 0.203718 | 0.894997 | 0.830234 |
3 | 0.001110 | 0.323248 | 0.066382 | 0.074556 |
4 | 0.001110 | 0.323248 | 0.066382 | 0.074556 |
This is a bit like appending an element to a list.
Grouping
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
| | A | B | C | D |
---|---|---|---|---|
0 | foo | one | 0.015713 | -0.276890 |
1 | bar | one | 0.563566 | 0.089973 |
2 | foo | two | -1.203656 | 2.242553 |
3 | bar | three | -0.254199 | -1.358523 |
4 | foo | two | -0.625421 | 0.252078 |
5 | bar | two | 0.461810 | -2.049906 |
6 | foo | one | -1.272169 | 0.447615 |
7 | foo | three | -0.100721 | 0.131472 |
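df.groupby('A')[['C','D']].sum() # select the numeric columns; recent pandas versions would otherwise also concatenate the strings in B
| | C | D |
---|---|---|
A | | |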
bar | 0.771176 | -3.318457 |
foo | -3.186254 | 2.796829 |
df.groupby(['A','B']).sum()
| | | C | D |
---|---|---|---|
A | B | | |
bar | one | 0.563566 | 0.089973 |
| | three | -0.254199 | -1.358523 |
| | two | 0.461810 | -2.049906 |
foo | one | -1.256456 | 0.170726 |
| | three | -0.100721 | 0.131472 |
| | two | -1.829077 | 2.494631 |
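sum() is just one aggregation. agg can compute several per group in one pass. A sketch on a tiny invented frame:

```python
import pandas as pd

demo = pd.DataFrame({'A': ['foo', 'bar', 'foo'],
                     'C': [1.0, 2.0, 3.0]})

# One row per group in A, one column per aggregation of C.
stats = demo.groupby('A')['C'].agg(['sum', 'mean'])

print(stats)
```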
Pivot tables
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
| | A | B | C | D | E |
---|---|---|---|---|---|
0 | one | A | foo | 0.167128 | -0.501473 |
1 | one | B | foo | -1.218322 | -0.875397 |
2 | two | C | foo | -1.327522 | -1.608971 |
3 | three | A | bar | -0.917783 | -0.537453 |
4 | one | B | bar | -0.803415 | -0.129088 |
5 | one | C | bar | -0.686571 | 0.123554 |
6 | two | A | foo | 0.051545 | 1.138850 |
7 | three | B | foo | 0.138666 | 0.396274 |
8 | one | C | foo | 0.840112 | 0.820482 |
9 | one | A | bar | 0.452267 | -1.411540 |
10 | two | B | bar | -1.000297 | 1.037715 |
11 | three | C | bar | 2.481947 | -1.184744 |
pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])
| | C | bar | foo |
---|---|---|---|
A | B | | |
one | A | 0.452267 | 0.167128 |
| | B | -0.803415 | -1.218322 |
| | C | -0.686571 | 0.840112 |
three | A | -0.917783 | NaN |
| | B | NaN | 0.138666 |
| | C | 2.481947 | NaN |
two | A | NaN | 0.051545 |
| | B | -1.000297 | NaN |
| | C | NaN | -1.327522 |
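In the table above each cell holds a single source row, so the aggregation is invisible. When several rows land in the same cell, pivot_table averages them by default, and aggfunc changes that. A sketch with invented data:

```python
import pandas as pd

demo = pd.DataFrame({'A': ['one', 'one', 'two'],
                     'C': ['foo', 'foo', 'bar'],
                     'D': [1.0, 3.0, 5.0]})

# Default aggregation is the mean: the two (one, foo) rows average to 2.0.
mean_table = pd.pivot_table(demo, values='D', index='A', columns='C')

# aggfunc='sum' adds them up instead, giving 4.0.
sum_table = pd.pivot_table(demo, values='D', index='A', columns='C',
                           aggfunc='sum')

print(mean_table)
print(sum_table)
```

Combinations that never occur in the data, such as (one, bar) here, show up as NaN, just like in the table above.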
Reading and writing data
pd.read_csv('foo.csv') # read data from a CSV file ('foo.csv' is a placeholder path)
df.to_excel('foo.xlsx', sheet_name='Sheet1') # write data to an Excel file
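To try the read/write pair without touching the disk, here is a sketch that round-trips a small frame through an in-memory buffer (the frame and the use of io.StringIO are my own illustration; a real file path works the same way):

```python
import io

import pandas as pd

demo = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

# Write the frame as CSV text into an in-memory buffer, without the index.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(loaded)
```

Note that read_csv is a function on the pandas module, while to_csv and to_excel are methods on the DataFrame.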