10 minutes to pandas

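Copyright notice: please credit the source when reposting: https://blog.csdn.net/sixkery/article/details/83385913

This is the "10 minutes to pandas" tutorial that circulates online. I typed the whole thing out by hand here.

P.S. Ten minutes? No way. It took me ages to type all this out, and I want to cry.

First, import the libraries: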

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Creating objects

Create a Series by passing a list of values and letting pandas create a default integer index:

s = pd.Series([1,2,3,4,5,np.nan,6])
s

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
6    6.0
dtype: float64

Missing values are usually created with np.nan.
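A small sketch of why the Series above prints as float64 even though the list holds integers: np.nan is a floating-point value, so pandas upcasts the whole Series. The variable names are my own, just for illustration.

```python
import numpy as np
import pandas as pd

# Without a missing value the integers stay integers.
ints = pd.Series([1, 2, 3])

# np.nan is a float, so the whole Series is upcast to float64.
with_nan = pd.Series([1, 2, np.nan])

print(ints.dtype)             # an integer dtype (int64 on most platforms)
print(with_nan.dtype)         # float64
print(with_nan.isna().sum())  # number of missing entries
```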

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

datas = pd.date_range('20180501',periods=6)
datas
DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
               '2018-05-05', '2018-05-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.rand(6,4),index=datas,columns=list('ABCD'))
df
A B C D
2018-05-01 0.225279 0.735710 0.009639 0.408149
2018-05-02 0.705007 0.900239 0.551207 0.165471
2018-05-03 0.608068 0.332345 0.519019 0.181947
2018-05-04 0.921958 0.626003 0.945828 0.357211
2018-05-05 0.304423 0.836494 0.731351 0.678947
2018-05-06 0.342860 0.053211 0.670777 0.186546
  • np.random.randn: draws random numbers from the standard normal distribution
  • np.random.rand: draws random numbers from the uniform distribution on [0, 1)
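To make the difference between the two generators concrete, here is a rough check of their ranges and averages. The seed and the sample size are my own choices, only there to keep the run repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable (my choice)

uniform = np.random.rand(10000)   # uniform on [0, 1)
normal = np.random.randn(10000)   # standard normal: mean 0, standard deviation 1

print(uniform.min() >= 0.0, uniform.max() < 1.0)  # rand never leaves [0, 1)
print(uniform.mean())                             # close to 0.5
print(normal.mean(), normal.std())                # close to 0 and 1
print(normal.min() < 0.0)                         # randn also produces negative numbers
```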

Create a DataFrame from a dict:

df2 = pd.DataFrame({'A':1.,
                   'B':pd.Timestamp('20180101'),
                   'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                   'D':np.array([3] * 4,dtype='int32'),
                   'E' : pd.Categorical(["test","train","test","train"]),      
                   'F':'foo'})
df2
A B C D E F
0 1.0 2018-01-01 1.0 3 test foo
1 1.0 2018-01-01 1.0 3 train foo
2 1.0 2018-01-01 1.0 3 test foo
3 1.0 2018-01-01 1.0 3 train foo
  • pd.Timestamp: a timestamp, the pandas counterpart of Python's datetime
  • pd.Series: every column of a DataFrame is a Series.
  • np.array: a column can also be built from a NumPy array.
  • pd.Categorical: categorical values

The resulting columns have different dtypes:

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing data

df.head() # the first five rows of the DataFrame
A B C D
2018-05-01 0.225279 0.735710 0.009639 0.408149
2018-05-02 0.705007 0.900239 0.551207 0.165471
2018-05-03 0.608068 0.332345 0.519019 0.181947
2018-05-04 0.921958 0.626003 0.945828 0.357211
2018-05-05 0.304423 0.836494 0.731351 0.678947
df.tail(3) # the last three rows
A B C D
2018-05-04 0.921958 0.626003 0.945828 0.357211
2018-05-05 0.304423 0.836494 0.731351 0.678947
2018-05-06 0.342860 0.053211 0.670777 0.186546

Displaying the index, column names, and values

df.index # show the index
DatetimeIndex(['2018-05-01', '2018-05-02', '2018-05-03', '2018-05-04',
               '2018-05-05', '2018-05-06'],
              dtype='datetime64[ns]', freq='D')
df.columns # show the column names
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values # show the underlying values
array([[0.22527927, 0.73570966, 0.00963855, 0.40814939],
       [0.70500688, 0.90023873, 0.55120699, 0.16547071],
       [0.60806789, 0.33234503, 0.51901872, 0.18194742],
       [0.92195751, 0.62600349, 0.94582788, 0.35721069],
       [0.304423  , 0.83649383, 0.73135067, 0.67894676],
       [0.34286019, 0.05321105, 0.67077744, 0.18654591]])

describe() shows a quick statistical summary:

df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.517932 0.580667 0.571303 0.329712
std 0.271382 0.326662 0.314447 0.199088
min 0.225279 0.053211 0.009639 0.165471
25% 0.314032 0.405760 0.527066 0.183097
50% 0.475464 0.680857 0.610992 0.271878
75% 0.680772 0.811298 0.716207 0.395415
max 0.921958 0.900239 0.945828 0.678947
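  • count: number of non-missing values
  • mean: average
  • std: standard deviation
  • min: minimum
  • 25%: lower quartile (25th percentile)
  • 50%: median
  • 75%: upper quartile (75th percentile)
  • max: maximum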
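Each row of the describe() output can also be computed with its own method. The sketch below lines them up on a tiny made-up Series (the values are mine, not from the tutorial).

```python
import pandas as pd

col = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values for illustration

summary = col.describe()

# The same statistics, computed one at a time.
manual = {
    'count': col.count(),
    'mean': col.mean(),
    'std': col.std(),            # sample standard deviation (ddof=1)
    'min': col.min(),
    '25%': col.quantile(0.25),
    '50%': col.median(),
    '75%': col.quantile(0.75),
    'max': col.max(),
}

for name, value in manual.items():
    print(name, summary[name], value)  # the two numbers on each line agree
```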

Transposing the data:

df.T 
2018-05-01 00:00:00 2018-05-02 00:00:00 2018-05-03 00:00:00 2018-05-04 00:00:00 2018-05-05 00:00:00 2018-05-06 00:00:00
A 0.225279 0.705007 0.608068 0.921958 0.304423 0.342860
B 0.735710 0.900239 0.332345 0.626003 0.836494 0.053211
C 0.009639 0.551207 0.519019 0.945828 0.731351 0.670777
D 0.408149 0.165471 0.181947 0.357211 0.678947 0.186546

Sorting by an axis:

df.sort_index(axis=1,ascending=False)
D C B A
2018-05-01 0.408149 0.009639 0.735710 0.225279
2018-05-02 0.165471 0.551207 0.900239 0.705007
2018-05-03 0.181947 0.519019 0.332345 0.608068
2018-05-04 0.357211 0.945828 0.626003 0.921958
2018-05-05 0.678947 0.731351 0.836494 0.304423
2018-05-06 0.186546 0.670777 0.053211 0.342860
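  • axis: which axis's labels to sort by; 0 sorts by the row index, 1 sorts by the column names
  • ascending: boolean; True sorts in ascending order, False in descending order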
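A minimal sketch of the two parameters on a 2x2 frame whose labels are deliberately out of order. The frame and its labels are invented for the example.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=['y', 'x'],
    columns=['b', 'a'],
)

by_rows = frame.sort_index(axis=0)                      # reorder rows by index label
by_cols = frame.sort_index(axis=1)                      # reorder columns by column name
cols_desc = frame.sort_index(axis=1, ascending=False)   # columns in descending order

print(list(by_rows.index))      # ['x', 'y']
print(list(by_cols.columns))    # ['a', 'b']
print(list(cols_desc.columns))  # ['b', 'a']
```

Only the labels are sorted; the values simply travel along with their row or column.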

Sorting by values:

df.sort_values('B')
A B C D
2018-05-06 0.342860 0.053211 0.670777 0.186546
2018-05-03 0.608068 0.332345 0.519019 0.181947
2018-05-04 0.921958 0.626003 0.945828 0.357211
2018-05-01 0.225279 0.735710 0.009639 0.408149
2018-05-05 0.304423 0.836494 0.731351 0.678947
2018-05-02 0.705007 0.900239 0.551207 0.165471

Slicing

Selecting a single column yields a Series; df['A'] is equivalent to df.A:

df['A']
2018-05-01    0.225279
2018-05-02    0.705007
2018-05-03    0.608068
2018-05-04    0.921958
2018-05-05    0.304423
2018-05-06    0.342860
Freq: D, Name: A, dtype: float64
df.A
2018-05-01    0.225279
2018-05-02    0.705007
2018-05-03    0.608068
2018-05-04    0.921958
2018-05-05    0.304423
2018-05-06    0.342860
Freq: D, Name: A, dtype: float64
df[:3]
A B C D
2018-05-01 0.225279 0.735710 0.009639 0.408149
2018-05-02 0.705007 0.900239 0.551207 0.165471
2018-05-03 0.608068 0.332345 0.519019 0.181947

Selection by label

Get a cross-section of the data using a label:

df.loc[datas[0]]
A    0.225279
B    0.735710
C    0.009639
D    0.408149
Name: 2018-05-01 00:00:00, dtype: float64
df.loc[datas[2]]
A    0.608068
B    0.332345
C    0.519019
D    0.181947
Name: 2018-05-03 00:00:00, dtype: float64

Select on multiple axes by label:

df.loc[:,['A','B']]
A B
2018-05-01 0.225279 0.735710
2018-05-02 0.705007 0.900239
2018-05-03 0.608068 0.332345
2018-05-04 0.921958 0.626003
2018-05-05 0.304423 0.836494
2018-05-06 0.342860 0.053211

When slicing by label, both endpoints are included:

df.loc['20180502':'20180504',['A','C']]
A C
2018-05-02 0.705007 0.551207
2018-05-03 0.608068 0.519019
2018-05-04 0.921958 0.945828
df.loc['20180505',['A','C']]
A    0.304423
C    0.731351
Name: 2018-05-05 00:00:00, dtype: float64
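The inclusive endpoints are easy to trip over, because position-based slicing with iloc follows the usual Python rule and excludes the stop. A short side-by-side sketch (the one-column frame is my own):

```python
import pandas as pd

dates = pd.date_range('20180501', periods=6)
frame = pd.DataFrame({'A': list(range(6))}, index=dates)

by_label = frame.loc['20180502':'20180504']  # label slice: both endpoints included
by_position = frame.iloc[1:3]                # position slice: the stop is excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
```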

Get a single value:

df.loc[datas[0],'A']
0.22527926638468565

You can also use at, which is meant for fast access to a single value:

df.at[datas[0],'A']
0.22527926638468565

Selection by position

Select by passing an integer position:

df.iloc[3]
A    0.921958
B    0.626003
C    0.945828
D    0.357211
Name: 2018-05-04 00:00:00, dtype: float64

Slice by integer position:

df.iloc[0:6,0:3]
A B C
2018-05-01 0.225279 0.735710 0.009639
2018-05-02 0.705007 0.900239 0.551207
2018-05-03 0.608068 0.332345 0.519019
2018-05-04 0.921958 0.626003 0.945828
2018-05-05 0.304423 0.836494 0.731351
2018-05-06 0.342860 0.053211 0.670777

Select by lists of integer positions:

df.iloc[[0,2,4],[1,3]]
B D
2018-05-01 0.735710 0.408149
2018-05-03 0.332345 0.181947
2018-05-05 0.836494 0.678947
df.iloc[:,1:3]
B C
2018-05-01 0.735710 0.009639
2018-05-02 0.900239 0.551207
2018-05-03 0.332345 0.519019
2018-05-04 0.626003 0.945828
2018-05-05 0.836494 0.731351
2018-05-06 0.053211 0.670777

Get a single value:

df.iloc[1,1]
0.9002387294615217

Boolean indexing

df[df>0.2]
A B C D
2018-05-01 0.225279 0.735710 NaN 0.408149
2018-05-02 0.705007 0.900239 0.551207 NaN
2018-05-03 0.608068 0.332345 0.519019 NaN
2018-05-04 0.921958 0.626003 0.945828 0.357211
2018-05-05 0.304423 0.836494 0.731351 0.678947
2018-05-06 0.342860 NaN 0.670777 NaN
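Indexing with a boolean DataFrame, as above, masks cell by cell. Indexing with a boolean Series filters whole rows instead. A small sketch of both on made-up numbers:

```python
import pandas as pd

frame = pd.DataFrame({'A': [0.1, 0.5, 0.9], 'B': [0.8, 0.3, 0.6]})

# A boolean DataFrame masks element by element: cells that fail become NaN.
masked = frame[frame > 0.4]

# A boolean Series keeps or drops entire rows.
rows = frame[frame['A'] > 0.4]

print(masked)
print(rows)
```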

Filter with isin():

df2 = df.copy()
df2['E'] = ['one','two','three','four','five','six']
df2
A B C D E
2018-05-01 0.225279 0.735710 0.009639 0.408149 one
2018-05-02 0.705007 0.900239 0.551207 0.165471 two
2018-05-03 0.608068 0.332345 0.519019 0.181947 three
2018-05-04 0.921958 0.626003 0.945828 0.357211 four
2018-05-05 0.304423 0.836494 0.731351 0.678947 five
2018-05-06 0.342860 0.053211 0.670777 0.186546 six
df2['E'].isin(['one','five'])
2018-05-01     True
2018-05-02    False
2018-05-03    False
2018-05-04    False
2018-05-05     True
2018-05-06    False
Freq: D, Name: E, dtype: bool
df2[df2['E'].isin(['one','five'])]
A B C D E
2018-05-01 0.225279 0.735710 0.009639 0.408149 one
2018-05-05 0.304423 0.836494 0.731351 0.678947 five

Setting values

s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20180501',periods=6))
s1
2018-05-01    1
2018-05-02    2
2018-05-03    3
2018-05-04    4
2018-05-05    5
2018-05-06    6
Freq: D, dtype: int64
df['F'] = s1

Set a value by label:

df.loc[datas[0],'A'] = 0

Set a value by position:

df.iloc[0,1] = 0

Here is the result of the operations above:

df
A B C D F
2018-05-01 0.000000 0.000000 0.009639 0.408149 1
2018-05-02 0.705007 0.900239 0.551207 0.165471 2
2018-05-03 0.608068 0.332345 0.519019 0.181947 3
2018-05-04 0.921958 0.626003 0.945828 0.357211 4
2018-05-05 0.304423 0.836494 0.731351 0.678947 5
2018-05-06 0.342860 0.053211 0.670777 0.186546 6

Missing data

pandas mainly uses np.nan to represent missing data. By default, missing values are excluded from computations.
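What "excluded from computations" means in practice, sketched on a three-element Series of my own:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

print(values.mean())              # 2.0, the NaN is skipped
print(values.mean(skipna=False))  # nan, the NaN propagates when skipping is turned off
print(values.count())             # 2, count() only counts non-missing entries
```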

Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.

df1 = df.reindex(index=datas[0:4],columns=list(df.columns) + ['E'])
df1.loc[datas[0]:datas[1],'E'] = 1
df1
A B C D F E
2018-05-01 0.000000 0.000000 0.009639 0.408149 1 1.0
2018-05-02 0.705007 0.900239 0.551207 0.165471 2 1.0
2018-05-03 0.608068 0.332345 0.519019 0.181947 3 NaN
2018-05-04 0.921958 0.626003 0.945828 0.357211 4 NaN
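A stripped-down sketch of the same idea: reindex() drops labels that are not requested, fills new labels with NaN, and leaves the original untouched. The labels 'w' and 'B' are invented for the example.

```python
import pandas as pd

original = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=['x', 'y', 'z'])

# Keep 'x' and 'y', drop 'z', add a new row label 'w' and a new column 'B'.
reindexed = original.reindex(index=['x', 'y', 'w'], columns=['A', 'B'])

reindexed.loc['x', 'A'] = 100.0  # modify the result only

print(reindexed)
print(original.loc['x', 'A'])  # 1.0, the original is unchanged
```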

Drop any rows that have missing data:

df1.dropna()
A B C D F E
2018-05-01 0.000000 0.000000 0.009639 0.408149 1 1.0
2018-05-02 0.705007 0.900239 0.551207 0.165471 2 1.0

Fill in missing data:

df1.fillna(value=5)
A B C D F E
2018-05-01 0.000000 0.000000 0.009639 0.408149 1 1.0
2018-05-02 0.705007 0.900239 0.551207 0.165471 2 1.0
2018-05-03 0.608068 0.332345 0.519019 0.181947 3 5.0
2018-05-04 0.921958 0.626003 0.945828 0.357211 4 5.0

Test whether values are missing; the result is a boolean mask:

df1.isna()
A B C D F E
2018-05-01 False False False False False False
2018-05-02 False False False False False False
2018-05-03 False False False False False True
2018-05-04 False False False False False True
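The three methods take a few options that come up often. This is a rough sketch on a made-up frame, not a full tour: how= controls when dropna() removes a row, fillna() accepts a per-column dict, and summing the isna() mask counts the gaps.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, np.nan, 6.0],
})

missing_per_column = frame.isna().sum()         # True counts as 1 when summed
any_dropped = frame.dropna(how='any')           # drop rows with at least one NaN (the default)
all_dropped = frame.dropna(how='all')           # drop only rows where every value is NaN
filled = frame.fillna({'A': 0.0, 'B': -1.0})    # a different fill value per column

print(missing_per_column)
print(any_dropped)
print(all_dropped)
print(filled)
```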

Operations with functions

df.mean() # mean of each column
A    0.480386
B    0.458049
C    0.571303
D    0.329712
F    3.500000
dtype: float64
df.mean(axis=1) # mean of each row
2018-05-01    0.283558
2018-05-02    0.864385
2018-05-03    0.928276
2018-05-04    1.370200
2018-05-05    1.510243
2018-05-06    1.450679
Freq: D, dtype: float64
df.apply(np.cumsum)
A B C D F
2018-05-01 0.000000 0.000000 0.009639 0.408149 1
2018-05-02 0.705007 0.900239 0.560846 0.573620 3
2018-05-03 1.313075 1.232584 1.079864 0.755568 6
2018-05-04 2.235032 1.858587 2.025692 1.112778 10
2018-05-05 2.539455 2.695081 2.757043 1.791725 15
2018-05-06 2.882315 2.748292 3.427820 1.978271 21

apply() runs the function on every column of the DataFrame.
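apply() is not limited to NumPy functions. Here is a small sketch with lambdas, on numbers I made up, that also shows the axis=1 variant, where the function receives rows instead of columns.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 4, 2], 'B': [10, 30, 20]})

# By default the function receives one column (a Series) at a time.
col_range = frame.apply(lambda col: col.max() - col.min())

# With axis=1 the function receives one row at a time.
row_sum = frame.apply(lambda row: row.sum(), axis=1)

print(col_range.tolist())  # [3, 20]
print(row_sum.tolist())    # [11, 34, 22]
```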

Histogramming

s = pd.Series(np.random.randint(1,7,size=9))
s
0    6
1    1
2    1
3    5
4    1
5    3
6    2
7    1
8    1
dtype: int32
s.value_counts()
1    5
6    1
5    1
3    1
2    1
dtype: int64
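value_counts() is essentially a tally, so it can be cross-checked against collections.Counter. The sketch reuses the nine values printed above rather than drawing new random ones. normalize=True is a real option that returns shares instead of counts.

```python
from collections import Counter

import pandas as pd

rolls = pd.Series([6, 1, 1, 5, 1, 3, 2, 1, 1])  # the values printed above

counts = rolls.value_counts()                 # sorted by frequency, most common first
shares = rolls.value_counts(normalize=True)   # relative frequencies instead of counts

print(counts.to_dict() == dict(Counter(rolls.tolist())))  # same tallies as a plain Counter
print(counts.idxmax())                                    # 1, the most frequent value
print(shares.sum())                                       # the shares add up to 1
```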

String methods

s = pd.Series(['A','B','C','Al',np.nan,'dOg'])
s.str.lower()
0      a
1      b
2      c
3     al
4    NaN
5    dog
dtype: object
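The .str accessor carries many of Python's string methods, and, as the NaN row above shows, missing entries stay missing instead of raising an error. A short sketch with two more methods (the sample words are a subset of the ones above):

```python
import numpy as np
import pandas as pd

words = pd.Series(['A', 'Al', np.nan, 'dOg'])

upper = words.str.upper()   # element-wise upper-casing
lengths = words.str.len()   # element-wise length; the missing entry stays missing

print(upper.tolist())       # ['A', 'AL', nan, 'DOG']
print(lengths.tolist())
```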

Concatenating

s1 = pd.Series(['a','b'])
s2 = pd.Series(['c','d'])
pd.concat([s1,s2])

0    a
1    b
0    c
1    d
dtype: object

Setting the ignore_index option to True discards the existing index and resets it in the result:

pd.concat([s1,s2],ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object

Use the keys option to add a hierarchical index at the outermost level of the data:

pd.concat([s1,s2],keys=['s1','s2'])
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object
pd.concat([s1,s2],keys=['s1','s2'],ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object
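The point of keys is that the pieces stay addressable after concatenation. As the last output above shows, the keys no longer appear once ignore_index=True is also passed. A minimal sketch of looking things up through the outer level:

```python
import pandas as pd

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])

combined = pd.concat([s1, s2], keys=['s1', 's2'])

print(combined.loc['s1'].tolist())  # ['a', 'b'], everything that came from s1
print(combined.loc[('s2', 1)])      # d, outer key plus the original index
print(combined.index.nlevels)       # 2, the index is hierarchical
```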

Merging

left = pd.DataFrame({'key':['foo','bar'],'lval':[1,2]})
right = pd.DataFrame({'key':['foo','bar'],'rval':[3,4]})
left
key lval
0 foo 1
1 bar 2
right
key rval
0 foo 3
1 bar 4
pd.merge(left,right)
key lval rval
0 foo 1 3
1 bar 2 4
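In the example above every key appears on both sides, so the join type does not matter. The sketch below changes one key (the 'baz' row and the names lhs and rhs are my own) to show what the how parameter does when keys do not line up.

```python
import pandas as pd

lhs = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
rhs = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [3, 4]})  # 'baz' has no match in lhs

inner = pd.merge(lhs, rhs, on='key')                    # default how='inner': matching keys only
outer = pd.merge(lhs, rhs, on='key', how='outer')       # keys from both sides, gaps become NaN
left_join = pd.merge(lhs, rhs, on='key', how='left')    # every row of lhs is kept

print(inner)
print(outer)
print(left_join)
```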

Appending

Append rows to a DataFrame:

df = pd.DataFrame(np.random.rand(4,4),columns=['A','B','C','D'])
df
A B C D
0 0.304266 0.558159 0.699805 0.964887
1 0.083297 0.228968 0.825672 0.483591
2 0.497066 0.203718 0.894997 0.830234
3 0.001110 0.323248 0.066382 0.074556
s = df.iloc[3]
s
A    0.001110
B    0.323248
C    0.066382
D    0.074556
Name: 3, dtype: float64
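pd.concat([df,s.to_frame().T],ignore_index=True) # on pandas < 2.0 this was df.append(s,ignore_index=True); DataFrame.append was removed in 2.0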
A B C D
0 0.304266 0.558159 0.699805 0.964887
1 0.083297 0.228968 0.825672 0.483591
2 0.497066 0.203718 0.894997 0.830234
3 0.001110 0.323248 0.066382 0.074556
4 0.001110 0.323248 0.066382 0.074556

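This is a bit like appending an element to a list, except that a new DataFrame is returned and the original is left unchanged.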
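Because every such call builds a new DataFrame, adding rows one at a time in a loop copies the data over and over. A common pattern, sketched here with invented rows, is to collect the new rows first and concatenate once.

```python
import pandas as pd

base = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Collect the new rows first, then concatenate a single time.
new_rows = [{'A': 5, 'B': 6}, {'A': 7, 'B': 8}]
result = pd.concat([base, pd.DataFrame(new_rows)], ignore_index=True)

print(result)
print(len(base))  # 2, the original frame still has its two rows
```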

Grouping

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
A B C D
0 foo one 0.015713 -0.276890
1 bar one 0.563566 0.089973
2 foo two -1.203656 2.242553
3 bar three -0.254199 -1.358523
4 foo two -0.625421 0.252078
5 bar two 0.461810 -2.049906
6 foo one -1.272169 0.447615
7 foo three -0.100721 0.131472
df.groupby('A')[['C','D']].sum() # select the numeric columns explicitly so the string column B is left out
            C         D
A
bar  0.771176 -3.318457
foo -3.186254  2.796829
df.groupby(['A','B']).sum()
                  C         D
A   B
bar one    0.563566  0.089973
    three -0.254199 -1.358523
    two    0.461810 -2.049906
foo one   -1.256456  0.170726
    three -0.100721  0.131472
    two   -1.829077  2.494631
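Under the hood, grouping means splitting the rows by key, applying a function to each piece, and combining the results. The sketch below spells those steps out by hand on a tiny frame of my own and compares them with the one-line version. agg() is a real method for computing several statistics at once.

```python
import pandas as pd

frame = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Step by step: split the rows by key, apply a function, combine the results.
manual = {key: float(group['C'].sum()) for key, group in frame.groupby('A')}

# groupby(...).sum() does the same thing in a single call.
grouped = frame.groupby('A')['C'].sum()

# agg() computes several statistics per group at once.
stats = frame.groupby('A')['C'].agg(['sum', 'mean'])

print(manual)  # {'bar': 6.0, 'foo': 4.0}
print(grouped)
print(stats)
```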

Pivot tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
A B C D E
0 one A foo 0.167128 -0.501473
1 one B foo -1.218322 -0.875397
2 two C foo -1.327522 -1.608971
3 three A bar -0.917783 -0.537453
4 one B bar -0.803415 -0.129088
5 one C bar -0.686571 0.123554
6 two A foo 0.051545 1.138850
7 three B foo 0.138666 0.396274
8 one C foo 0.840112 0.820482
9 one A bar 0.452267 -1.411540
10 two B bar -1.000297 1.037715
11 three C bar 2.481947 -1.184744
pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])
C             bar       foo
A     B
one   A  0.452267  0.167128
      B -0.803415 -1.218322
      C -0.686571  0.840112
three A -0.917783       NaN
      B       NaN  0.138666
      C  2.481947       NaN
two   A       NaN  0.051545
      B -1.000297       NaN
      C       NaN -1.327522
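A pivot table is a groupby laid out as a grid: by default each cell holds the mean of the values for that row and column combination, and combinations with no data show up as NaN. Here is a small sketch on made-up data, including the aggfunc and fill_value options.

```python
import pandas as pd

frame = pd.DataFrame({
    'A': ['one', 'one', 'two', 'two'],
    'C': ['foo', 'bar', 'foo', 'foo'],
    'D': [1.0, 2.0, 3.0, 5.0],
})

# Cells hold the mean of D by default; ('two', 'bar') has no data, so it is NaN.
pivot = pd.pivot_table(frame, values='D', index='A', columns='C')

# The same grid built from groupby + unstack.
manual = frame.groupby(['A', 'C'])['D'].mean().unstack()

# aggfunc changes the aggregation, fill_value replaces the empty cells.
totals = pd.pivot_table(frame, values='D', index='A', columns='C',
                        aggfunc='sum', fill_value=0)

print(pivot)
print(manual)
print(totals)
```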

Reading and writing data

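pd.read_csv('foo.csv') # read data from a CSV file; read_csv is a pandas function, not a DataFrame method ('foo.csv' is a placeholder path)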

df.to_excel('foo.xlsx', sheet_name='Sheet1') # write the data to an Excel file
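To try the read and write functions without touching the disk, a CSV round trip can go through an in-memory string. This is only a sketch: the frame is made up, and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

# to_csv() with no path returns the CSV text instead of writing a file.
csv_text = frame.to_csv(index=False)

# read_csv() accepts any file-like object, so wrap the text in StringIO.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Passing index=False keeps the row index out of the file, so it does not come back as an extra column when the data is read again.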



Reposted from blog.csdn.net/sixkery/article/details/83385913