Study notes: the pandas data analysis library

import numpy as np
import pandas as pd  
# import the numpy and pandas modules

Series data structure (a one-dimensional labeled array)

# A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.); the axis labels are collectively referred to as the index

s = pd.Series(np.random.rand(5))
print(s)
print(type(s))
# inspect the data and its type

print(s.index,type(s.index))
print(s.values,type(s.values))
# .index returns the Series index; here it is a RangeIndex
# .values returns the Series values; the type is an ndarray

# Key point: compared with an ndarray, a Series is an array that carries its own index → a one-dimensional array plus the matching index
# So if you look only at the values of a Series, you get an ndarray
# A Series is quite similar to an ndarray, and indexing/slicing works in much the same way
# Compared with a dict, a Series is like an ordered dictionary (a plain dict makes no ordering guarantee); the lookup mechanism is similar (one uses keys, the other uses the index)


0    0.467169
1    0.054509
2    0.965586
3    0.507866
4    0.694997
dtype: float64
<class 'pandas.core.series.Series'>
RangeIndex(start=0, stop=5, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
[0.46716932 0.05450883 0.96558627 0.50786619 0.69499677] <class 'numpy.ndarray'>

Series creation method 1: from a dict; the dict keys become the index and the dict values become the values

dic = {'a':1 ,'b':2 , 'c':3, '4':4, '5':5}
s = pd.Series(dic)
print(s)
# Note: here the keys are all strings. What happens if the values are of more than one type? → dic = {'a':1 ,'b':'hello' , 'c':3, '4':4, '5':5} (answered in the sketch after the output below)

4    4
5    5
a    1
b    2
c    3
dtype: int64
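
As a quick sketch of the mixed-value question above: mixing integer and string values makes pandas fall back to the generic object dtype.

dic = {'a':1 ,'b':'hello' , 'c':3, '4':4, '5':5}
s = pd.Series(dic)
print(s.dtype)
# object → mixed value types are stored with dtype object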

Series creation method 2: from an array (one-dimensional)

arr = np.random.randn(5)
s = pd.Series(arr)
print(arr)
print(s)
# the default index is integers starting at 0 with step 1

s = pd.Series(arr, index = ['a','b','c','d','e'], dtype = object)  # np.object is deprecated in newer NumPy; plain object behaves the same
print(s)
# index parameter: sets the index; its length must match the data
# dtype parameter: sets the data type

[-2.56328023  0.87233579  0.47630666  1.91715736 -1.26924024]
0   -2.563280
1    0.872336
2    0.476307
3    1.917157
4   -1.269240
dtype: float64
a    -2.56328
b    0.872336
c    0.476307
d     1.91716
e    -1.26924
dtype: object

Series name attribute: .name

s1 = pd.Series(np.random.randn(5))
print(s1)
print('-----')
s2 = pd.Series(np.random.randn(5),name = 'test')
print(s2)
print(s1.name, s2.name,type(s2.name))
# name is a parameter of pd.Series that gives the array a name
# the .name attribute returns the array's name as a str; if no name was defined it returns None

s3 = s2.rename('hehehe')
print(s3)
print(s3.name, s2.name)
# .rename() gives the Series a new name and returns a new object; the original Series is unchanged

0   -1.285306
1   -0.586416
2   -1.966362
3   -1.507387
4    0.622088
dtype: float64
-----
0   -0.763427
1   -1.588831
2   -1.676116
3    0.453159
4   -0.874990
Name: test, dtype: float64
None test <class 'str'>
0   -0.763427
1   -1.588831
2   -1.676116
3    0.453159
4   -0.874990
Name: hehehe, dtype: float64
hehehe test

DataFrame data structure (a two-dimensional array with row and column labels)

# A DataFrame is a tabular data structure, a "two-dimensional labeled array"
# A DataFrame has an index (row labels) and columns (column labels)

data = {'name':['Jack','Tom','Mary'],
        'age':[18,19,20],
       'gender':['m','m','w']}
frame = pd.DataFrame(data)
print(frame)  
print(type(frame))
print(frame.index,'\n该数据类型为:',type(frame.index))
print(frame.columns,'\n该数据类型为:',type(frame.columns))
print(frame.values,'\n该数据类型为:',type(frame.values))
# inspect the data; its type is DataFrame
# .index returns the row labels
# .columns returns the column labels
# .values returns the values as an ndarray

   age gender  name
0   18      m  Jack
1   19      m   Tom
2   20      w  Mary
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1) 
该数据类型为: <class 'pandas.core.indexes.range.RangeIndex'>
Index(['age', 'gender', 'name'], dtype='object') 
该数据类型为: <class 'pandas.core.indexes.base.Index'>
[[18 'm' 'Jack']
 [19 'm' 'Tom']
 [20 'w' 'Mary']] 
该数据类型为: <class 'numpy.ndarray'>

Three ways to create a DataFrame

  • from a dict of arrays/lists
  • from a dict of Series
  • directly from a two-dimensional array

DataFrame creation method 1: from a dict of arrays/lists

# creation: pandas.DataFrame()

data1 = {'a':[1,2,3],
        'b':[3,4,5],
        'c':[5,6,7]}
data2 = {'one':np.random.rand(3),
        'two':np.random.rand(3)}   # what happens here if you try 'two':np.random.rand(4)? (see the sketch after the output below)
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
# a DataFrame built from a dict of arrays/lists: columns come from the dict keys, the index defaults to numeric labels
# the dict values must all have the same length!

df1 = pd.DataFrame(data1, columns = ['b','c','a','d'])
print(df1)
df1 = pd.DataFrame(data1, columns = ['b','c'])
print(df1)
# columns parameter: reorders the columns (given as a list); if a listed column (e.g. 'd') does not exist in the data, it is filled with NaN
# when columns is respecified, it may list fewer columns than the original data

df2 = pd.DataFrame(data2, index = ['f1','f2','f3'])  # what happens if you try index = ['f1','f2','f3','f4']? (see the sketch after the output below)
print(df2)
# index parameter: redefines the index (given as a list); its length must match the data


{'a': [1, 2, 3], 'b': [3, 4, 5], 'c': [5, 6, 7]}
{'one': array([0.62316529, 0.75502024, 0.07149083]), 'two': array([0.18033469, 0.90637716, 0.6449568 ])}
   a  b  c
0  1  3  5
1  2  4  6
2  3  5  7
        one       two
0  0.623165  0.180335
1  0.755020  0.906377
2  0.071491  0.644957
   b  c  a    d
0  3  5  1  NaN
1  4  6  2  NaN
2  5  7  3  NaN
   b  c
0  3  5
1  4  6
2  5  7
         one       two
f1  0.623165  0.180335
f2  0.755020  0.906377
f3  0.071491  0.644957
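
Answering the two "what happens" questions above: both cases raise a ValueError, because the lengths must agree. A minimal sketch:

try:
    pd.DataFrame({'one':np.random.rand(3), 'two':np.random.rand(4)})
except ValueError as e:
    print('arrays of different lengths →', e)

try:
    pd.DataFrame(data2, index = ['f1','f2','f3','f4'])
except ValueError as e:
    print('index longer than the data →', e)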

DataFrame creation method 2: from a dict of Series

data1 = {'one':pd.Series(np.random.rand(2)),
        'two':pd.Series(np.random.rand(3))}  # Series without an explicit index
data2 = {'one':pd.Series(np.random.rand(2), index = ['a','b']),
        'two':pd.Series(np.random.rand(3),index = ['a','b','c'])}  # Series with an explicit index
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
# a DataFrame built from a dict of Series: columns come from the dict keys, the index comes from the Series labels (default numeric labels if the Series have none)
# the Series may have different lengths; the resulting DataFrame fills the gaps with NaN

{'two': 0    0.331382
1    0.508265
2    0.615997
dtype: float64, 'one': 0    0.857739
1    0.165800
dtype: float64}
{'two': a    0.826446
b    0.983392
c    0.187749
dtype: float64, 'one': a    0.920073
b    0.215178
dtype: float64}
        one       two
0  0.857739  0.331382
1  0.165800  0.508265
2       NaN  0.615997
        one       two
a  0.920073  0.826446
b  0.215178  0.983392
c       NaN  0.187749

DataFrame creation method 3: directly from a two-dimensional array

ar = np.random.rand(9).reshape(3,3)
print(ar)
df1 = pd.DataFrame(ar)
df2 = pd.DataFrame(ar, index = ['a', 'b', 'c'], columns = ['one','two','three'])  # try an index or columns whose length does not match the array (it raises a ValueError)
print(df1)
print(df2)
# creating a DataFrame directly from a 2D array keeps the same shape; if index and columns are not given, both default to numeric labels
# the specified index and columns must match the array's dimensions

[[0.33940056 0.77384698 0.25308293]
 [0.28151251 0.02875986 0.7516066 ]
 [0.34746659 0.25245068 0.68979615]]
          0         1         2
0  0.339401  0.773847  0.253083
1  0.281513  0.028760  0.751607
2  0.347467  0.252451  0.689796
        one       two     three
a  0.339401  0.773847  0.253083
b  0.281513  0.028760  0.751607
c  0.347467  0.252451  0.689796

Indexing and slicing (rows use (i)loc, columns do not; a single column is a string, multiple columns a list)

Series and DataFrame indexing follow the same principles; we focus on DataFrame indexing here.
Column indexing: df['column name'] (a Series has no column index)
Row indexing: df.loc[], df.iloc[]
Selecting columns / selecting rows / slicing / boolean indexing

import numpy as np
import pandas as pd  
# import the numpy and pandas modules

Selecting rows and columns

df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)

data1 = df['a']
data2 = df[['a','c']]
print(data1,type(data1))
print(data2,type(data2))
print('-----')
# select columns by column name: a single column returns a Series, multiple columns return a DataFrame

data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))
# select rows by index label: a single row returns a Series, multiple rows return a DataFrame




              a          b          c          d
one    12.091343   5.285528  50.926279  43.411861
two    49.502460  68.476758  96.639658  13.522464
three   2.368430  72.006476  65.543572  95.790480
one      12.091343
two      49.502460
three     2.368430
Name: a, dtype: float64 <class 'pandas.core.series.Series'>
               a          c
one    12.091343  50.926279
two    49.502460  96.639658
three   2.368430  65.543572 <class 'pandas.core.frame.DataFrame'>
-----
a    12.091343
b     5.285528
c    50.926279
d    43.411861
Name: one, dtype: float64 <class 'pandas.core.series.Series'>
             a          b          c          d
one  12.091343   5.285528  50.926279  43.411861
two  49.502460  68.476758  96.639658  13.522464 <class 'pandas.core.frame.DataFrame'>


# df[] selects columns by default; put column names inside the brackets (which is why columns are usually given explicit names rather than the default numeric labels, so they cannot be confused with the index)
# selecting a single column gives a Series and prints in Series format
# selecting multiple columns gives a DataFrame and prints in DataFrame format

# Key note: df[col] is generally used to select columns; put column names inside the brackets
# Key note: df.loc[label] selects rows by index label, and works with both an explicitly set index and the default numeric index
# Difference between loc and iloc when slicing: loc is label-based and includes the end label (a closed interval), iloc is position-based and excludes the end position
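
A small sketch of that loc/iloc difference, reusing the df defined above:

print(df.loc['one':'two'])   # label slice: the end label 'two' is included
print(df.iloc[0:2])          # position slice: positions 0 and 1, end position 2 excluded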

Boolean indexing

# mostly used to index rows

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

b1 = df < 20
print(b1,type(b1))
print(df[b1])  # 也可以书写为 df[df < 20]
print('------')
# without indexing, the comparison is evaluated element-wise on every value
# indexing with the result keeps the full shape: True positions keep the original value, False positions become NaN

b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2])  # 也可以书写为 df[df['a'] > 50]
print('------')
# condition on a single column
# the result keeps only the rows where the condition is True, including the other columns

b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3])  # 也可以书写为 df[df[['a','b']] > 50]
print('------')
# condition on multiple columns
# the result keeps the full shape: True positions keep the original value, False positions become NaN
# if this raises an error, update pandas → conda update pandas

b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4])  # 也可以书写为 df[df.loc[['one','three']] < 50]
print('------')
# condition on multiple rows
# the result keeps the full shape: True positions keep the original value, False positions become NaN


            a          b          c          d
one     9.970915  10.403036  61.548723  33.807531
two    64.147530  20.837293  47.027831  24.937798
three  79.262523   6.433300   6.151607  96.142251
four   97.723110   8.283644  70.768830   6.523142
------
           a      b      c      d
one     True   True  False  False
two    False  False  False  False
three  False   True   True  False
four   False   True  False   True <class 'pandas.core.frame.DataFrame'>
              a          b         c         d
one    9.970915  10.403036       NaN       NaN
two         NaN        NaN       NaN       NaN
three       NaN   6.433300  6.151607       NaN
four        NaN   8.283644       NaN  6.523142
------
one      False
two       True
three     True
four      True
Name: a, dtype: bool <class 'pandas.core.series.Series'>
               a          b          c          d
two    64.147530  20.837293  47.027831  24.937798
three  79.262523   6.433300   6.151607  96.142251
four   97.723110   8.283644  70.768830   6.523142
------
           a      b
one    False  False
two     True  False
three   True  False
four    True  False <class 'pandas.core.frame.DataFrame'>
               a   b   c   d
one          NaN NaN NaN NaN
two    64.147530 NaN NaN NaN
three  79.262523 NaN NaN NaN
four   97.723110 NaN NaN NaN
------
           a     b      c      d
one     True  True  False   True
three  False  True   True  False <class 'pandas.core.frame.DataFrame'>
              a          b         c          d
one    9.970915  10.403036       NaN  33.807531
two         NaN        NaN       NaN        NaN
three       NaN   6.433300  6.151607        NaN
four        NaN        NaN       NaN        NaN
------

Compound indexing: for example, indexing rows and columns at the same time

# select columns first, then rows: for a dataset, first filter the fields, then select the records

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
print('------')

print(df['a'].loc[['one','three']])   # rows one and three of column a
print(df[['b','c','d']].iloc[::2])   # rows one and three (every second row) of columns b, c, d
print(df[df['a'] < 50].iloc[:2])   # the first two rows of the rows that satisfy the boolean condition


               a          b          c          d
one    98.661560  29.514835  54.308770  85.895547
two    27.937505   7.272639  38.820131  93.830862
three  75.479305  80.195558  16.024623  63.068741
four   48.927145  38.935594  18.076788  48.773935
------
one      98.661560
three    75.479305
Name: a, dtype: float64
               b          c          d
one    29.514835  54.308770  85.895547
three  80.195558  16.024623  63.068741
              a          b          c          d
two   27.937505   7.272639  38.820131  93.830862
four  48.927145  38.935594  18.076788  48.773935
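
The same row-and-column selections can also be written as a single .loc call; a short sketch with the df above:

print(df.loc[['one','three'], 'a'])           # row labels and a column name in one call
print(df.loc[df['a'] < 50, ['b','c','d']])    # boolean row mask plus a list of columns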

Basic techniques
Viewing data and transposing / adding, modifying and deleting values / alignment / sorting

Viewing data and transposing

df = pd.DataFrame(np.random.rand(16).reshape(8,2)*100,
                   columns = ['a','b'])
print(df.head(2))
print(df.tail())
# .head() shows the first rows
# .tail() shows the last rows
# the default is 5 rows

print(df.T)
# .T transposes the DataFrame

           a          b
0  64.231620  24.222954
1   3.004779  92.549576
           a          b
3  54.787062  17.264577
4  13.106864   5.500618
5   8.631310  79.109355
6  22.107241  94.901685
7  29.034599  54.156278
           0          1          2          3          4          5  \
a  64.231620   3.004779  25.002825  54.787062  13.106864   8.631310   
b  24.222954  92.549576  87.818090  17.264577   5.500618  79.109355   

           6          7  
a  22.107241  29.034599  
b  94.901685  54.156278  

Adding and modifying

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df)

df['e'] = 10
df.loc[4] = 20
print(df)
# add a new column/row and assign a value

df['e'] = 20
df[['a','c']] = 100
print(df)
# index first, then assign to modify values directly

          a          b          c          d
0  14.342082  52.604100  26.561995  60.441731
1  20.331108  43.537490   1.020098   7.171418
2  35.226542   9.573718  99.273254   0.867227
3  47.511549  56.783730  47.580639  67.007725
           a          b          c          d   e
0  14.342082  52.604100  26.561995  60.441731  10
1  20.331108  43.537490   1.020098   7.171418  10
2  35.226542   9.573718  99.273254   0.867227  10
3  47.511549  56.783730  47.580639  67.007725  10
4  20.000000  20.000000  20.000000  20.000000  20
     a          b    c          d   e
0  100  52.604100  100  60.441731  20
1  100  43.537490  100   7.171418  20
2  100   9.573718  100   0.867227  20
3  100  56.783730  100  67.007725  20
4  100  20.000000  100  20.000000  20
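
Single cells or subsets can be modified the same way through .loc; a minimal sketch with the df above:

df.loc[0, 'b'] = 999            # set one cell by row label and column name
df.loc[df['d'] < 50, 'e'] = 0   # conditional assignment on a column
print(df)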

Deleting: del for columns / drop() for rows

df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df)

del df['a']
print(df)
print('-----')
# the del statement deletes a column

print(df.drop(0))
print(df.drop([1,2]))
print(df)
print('-----')
# drop() deletes rows; with inplace=False (the default) it returns a new object and leaves the original unchanged

print(df.drop(['d'], axis = 1))
print(df)
# drop() deletes columns when axis = 1 is given; again inplace=False returns a new object and leaves the original unchanged

           a          b          c          d
0  71.238538   6.121303  77.988034  44.047009
1  34.018365  78.192855  50.467246  81.162337
2  86.311980  44.341469  49.789445  35.657665
3  78.073272  31.457479  74.385014  24.655976
           b          c          d
0   6.121303  77.988034  44.047009
1  78.192855  50.467246  81.162337
2  44.341469  49.789445  35.657665
3  31.457479  74.385014  24.655976
-----
           b          c          d
1  78.192855  50.467246  81.162337
2  44.341469  49.789445  35.657665
3  31.457479  74.385014  24.655976
           b          c          d
0   6.121303  77.988034  44.047009
3  31.457479  74.385014  24.655976
           b          c          d
0   6.121303  77.988034  44.047009
1  78.192855  50.467246  81.162337
2  44.341469  49.789445  35.657665
3  31.457479  74.385014  24.655976
-----
           b          c
0   6.121303  77.988034
1  78.192855  50.467246
2  44.341469  49.789445
3  31.457479  74.385014
           b          c          d
0   6.121303  77.988034  44.047009
1  78.192855  50.467246  81.162337
2  44.341469  49.789445  35.657665
3  31.457479  74.385014  24.655976
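
For completeness, a sketch of the inplace variant mentioned above (it modifies the frame and returns None):

df.drop(0, inplace=True)             # drop row 0 in place
df.drop('d', axis=1, inplace=True)   # drop column 'd' in place
print(df)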

Alignment

df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print(df1 + df2)
# arithmetic between DataFrame objects automatically aligns on both the columns and the index (row labels)

          A         B         C   D
0  2.815874 -0.151485 -0.309208 NaN
1  2.294022 -0.355077  1.810512 NaN
2  1.947371 -0.550546  0.782233 NaN
3  0.893228 -0.961349  0.901582 NaN
4 -0.256011 -0.383676 -0.699367 NaN
5  1.343078 -0.797174  1.437726 NaN
6  0.727636  0.894328 -2.924819 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
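
If missing positions should be treated as 0 instead of producing NaN, the arithmetic methods accept a fill_value argument; a sketch:

print(df1.add(df2, fill_value=0))   # a position missing in only one frame uses 0 instead of NaN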

Sorting 1: by values with .sort_values

# also works for Series

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df1.sort_values(['a'], ascending = True))  # ascending
print(df1.sort_values(['a'], ascending = False))  # descending
print('------')
# ascending parameter: sets ascending or descending order, default ascending
# sorting by a single column

df2 = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
                  'b':list(range(8)),
                  'c':list(range(8,0,-1))})
print(df2)
print(df2.sort_values(['a','c']))
# sorting by multiple columns, in the order they are listed
# note the inplace parameter (see the sketch after the output below)

           a          b          c          d
0   3.570340  86.589786  98.001894  50.984826
1  46.210527  83.412502  74.492003  73.251724
2  91.115360   6.083352  34.314697  56.784097
3  24.781722  67.813376  65.004312  31.788198
           a          b          c          d
0   3.570340  86.589786  98.001894  50.984826
3  24.781722  67.813376  65.004312  31.788198
1  46.210527  83.412502  74.492003  73.251724
2  91.115360   6.083352  34.314697  56.784097
           a          b          c          d
2  91.115360   6.083352  34.314697  56.784097
1  46.210527  83.412502  74.492003  73.251724
3  24.781722  67.813376  65.004312  31.788198
0   3.570340  86.589786  98.001894  50.984826
------
   a  b  c
0  1  0  8
1  1  1  7
2  1  2  6
3  1  3  5
4  2  4  4
5  2  5  3
6  2  6  2
7  2  7  1
   a  b  c
3  1  3  5
2  1  2  6
1  1  1  7
0  1  0  8
7  2  7  1
6  2  6  2
5  2  5  3
4  2  4  4
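
ascending also accepts a list giving one direction per sort column, and inplace=True sorts the frame itself; a sketch:

print(df2.sort_values(['a','c'], ascending = [True, False]))   # 'a' ascending, 'c' descending
df2.sort_values('b', inplace = True)                           # sorts df2 in place, returns None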

Sorting 2: by index with .sort_index

df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = [5,4,3,2],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['h','s','x','g'],
                   columns = ['a','b','c','d'])
print(df1)
print(df1.sort_index())
print(df2)
print(df2.sort_index())
# sort by the index
# defaults: ascending=True, inplace=False

           a          b          c          d
5  80.932585  71.991854  64.582943  23.443231
4  82.054030  87.459058  12.108433  83.047490
3  56.329863  14.926822  47.884418  59.880352
2   0.347007  69.794103  74.375345  12.736429
           a          b          c          d
2   0.347007  69.794103  74.375345  12.736429
3  56.329863  14.926822  47.884418  59.880352
4  82.054030  87.459058  12.108433  83.047490
5  80.932585  71.991854  64.582943  23.443231
           a          b          c          d
h  53.041921  93.834097  13.423132  82.702020
s   0.003814  75.721426  73.086606  20.597472
x  32.678307  58.369155  70.487505  24.833117
g  46.232889  19.365147   9.872537  98.246438
           a          b          c          d
g  46.232889  19.365147   9.872537  98.246438
h  53.041921  93.834097  13.423132  82.702020
s   0.003814  75.721426  73.086606  20.597472
x  32.678307  58.369155  70.487505  24.833117

Numerical computation and basic statistics
Common mathematical and statistical methods
Basic parameters: axis (default 0, i.e. by column) and skipna (default True, i.e. NaN values are skipped)

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1':[4,5,3,np.nan,2],
                 'key2':[1,2,np.nan,4,5],
                 'key3':[1,2,3,'j','k']},
                 index = ['a','b','c','d','e'])
print(df)
print(df['key1'].dtype,df['key2'].dtype,df['key3'].dtype)
print('-----')

m1 = df.mean()
print(m1,type(m1))
print('mean of a single column:',df['key2'].mean())
print('-----')
# np.nan: a missing value
# .mean() computes the mean
# only numeric columns are included
# a single column can be computed by indexing it first

m2 = df.mean(axis=1)
print(m2)
print('-----')
# axis parameter: default 0 computes down each column; axis=1 computes across each row, as done here

m3 = df.mean(skipna=False)
print(m3)
print('-----')
# skipna parameter: whether to ignore NaN, default True; with skipna=False, any column containing NaN still gives NaN as its result

  key1  key2 key3
a   4.0   1.0    1
b   5.0   2.0    2
c   3.0   NaN    3
d   NaN   4.0    j
e   2.0   5.0    k
float64 float64 object
-----
key1    3.5
key2    3.0
dtype: float64 <class 'pandas.core.series.Series'>
mean of a single column: 3.0
-----
a    2.5
b    3.5
c    3.0
d    4.0
e    3.5
dtype: float64
-----
key1   NaN
key2   NaN
dtype: float64
-----

Main computation methods, usable on Series and DataFrame (1)

df = pd.DataFrame({'key1':np.arange(10),
                  'key2':np.random.rand(10)*10})
print(df)
print('-----')

print(df.count(),'→ count: number of non-NaN values\n')
print(df.min(),'→ min: minimum\n',df['key2'].max(),'→ max: maximum\n')
print(df.quantile(q=0.75),'→ quantile: the quantile, q gives the position\n')
print(df.sum(),'→ sum: sum\n')
print(df.mean(),'→ mean: average\n')
print(df.median(),'→ median: arithmetic median, the 50% quantile\n')
print(df.std(),'\n',df.var(),'→ std, var: standard deviation and variance\n')
print(df.skew(),'→ skew: sample skewness\n')
print(df.kurt(),'→ kurt: sample kurtosis\n')

 key1      key2
0     0  6.792638
1     1  1.049023
2     2  5.441224
3     3  4.667631
4     4  2.053692
5     5  9.813006
6     6  5.074884
7     7  1.526651
8     8  8.519215
9     9  3.543486
-----
key1    10
key2    10
dtype: int64 → count: number of non-NaN values

key1    0.000000
key2    1.049023
dtype: float64 → min: minimum
 9.81300585173231 → max: maximum

key1    6.750000
key2    6.454785
Name: 0.75, dtype: float64 → quantile: the quantile, q gives the position

key1    45.00000
key2    48.48145
dtype: float64 → sum: sum

key1    4.500000
key2    4.848145
dtype: float64 → mean: average

key1    4.500000
key2    4.871257
dtype: float64 → median: arithmetic median, the 50% quantile

key1    3.027650
key2    2.931062
dtype: float64 
 key1    9.166667
key2    8.591127
dtype: float64 → std, var: standard deviation and variance

key1    0.000000
key2    0.352466
dtype: float64 → skew: sample skewness

key1   -1.20000
key2   -0.79798
dtype: float64 → kurt: sample kurtosis
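
Most of these summary statistics can also be produced in one call; a sketch:

print(df.describe())   # count, mean, std, min, quartiles and max per numeric column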

Main computation methods, usable on Series and DataFrame (2)

df['key1_s'] = df['key1'].cumsum()
df['key2_s'] = df['key2'].cumsum()
print(df,'→ cumsum: cumulative sum\n')

df['key1_p'] = df['key1'].cumprod()
df['key2_p'] = df['key2'].cumprod()
print(df,'→ cumprod: cumulative product\n')

print(df.cummax(),'\n',df.cummin(),'→ cummax, cummin: cumulative maximum and cumulative minimum\n')
# the key1 and key2 columns (and the derived columns) are included in the cumulative calculation as well

  key1      key2  key1_s     key2_s
0     0  6.792638       0   6.792638
1     1  1.049023       1   7.841661
2     2  5.441224       3  13.282885
3     3  4.667631       6  17.950515
4     4  2.053692      10  20.004208
5     5  9.813006      15  29.817213
6     6  5.074884      21  34.892097
7     7  1.526651      28  36.418749
8     8  8.519215      36  44.937963
9     9  3.543486      45  48.481450 → cumsum: cumulative sum

   key1      key2  key1_s     key2_s  key1_p         key2_p
0     0  6.792638       0   6.792638       0       6.792638
1     1  1.049023       1   7.841661       0       7.125633
2     2  5.441224       3  13.282885       0      38.772160
3     3  4.667631       6  17.950515       0     180.974131
4     4  2.053692      10  20.004208       0     371.665151
5     5  9.813006      15  29.817213       0    3647.152301
6     6  5.074884      21  34.892097       0   18508.874743
7     7  1.526651      28  36.418749       0   28256.595196
8     8  8.519215      36  44.937963       0  240724.006055
9     9  3.543486      45  48.481450       0  853002.188425 → cumprod: cumulative product

   key1      key2  key1_s     key2_s  key1_p         key2_p
0   0.0  6.792638     0.0   6.792638     0.0       6.792638
1   1.0  6.792638     1.0   7.841661     0.0       7.125633
2   2.0  6.792638     3.0  13.282885     0.0      38.772160
3   3.0  6.792638     6.0  17.950515     0.0     180.974131
4   4.0  6.792638    10.0  20.004208     0.0     371.665151
5   5.0  9.813006    15.0  29.817213     0.0    3647.152301
6   6.0  9.813006    21.0  34.892097     0.0   18508.874743
7   7.0  9.813006    28.0  36.418749     0.0   28256.595196
8   8.0  9.813006    36.0  44.937963     0.0  240724.006055
9   9.0  9.813006    45.0  48.481450     0.0  853002.188425 
    key1      key2  key1_s    key2_s  key1_p    key2_p
0   0.0  6.792638     0.0  6.792638     0.0  6.792638
1   0.0  1.049023     0.0  6.792638     0.0  6.792638
2   0.0  1.049023     0.0  6.792638     0.0  6.792638
3   0.0  1.049023     0.0  6.792638     0.0  6.792638
4   0.0  1.049023     0.0  6.792638     0.0  6.792638
5   0.0  1.049023     0.0  6.792638     0.0  6.792638
6   0.0  1.049023     0.0  6.792638     0.0  6.792638
7   0.0  1.049023     0.0  6.792638     0.0  6.792638
8   0.0  1.049023     0.0  6.792638     0.0  6.792638
9   0.0  1.049023     0.0  6.792638     0.0  6.792638 → cummax, cummin: cumulative maximum and cumulative minimum

Unique values: .unique()

s = pd.Series(list('asdvasdcfgg'))
sq = s.unique()
print(s)
print(sq,type(sq))
print(pd.Series(sq))
# returns an array of the unique values
# pd.Series() turns it back into a new Series

sq.sort()
print(sq)
# sort the unique-value array in place

0     a
1     s
2     d
3     v
4     a
5     s
6     d
7     c
8     f
9     g
10    g
dtype: object
['a' 's' 'd' 'v' 'c' 'f' 'g'] <class 'numpy.ndarray'>
0    a
1    s
2    d
3    v
4    c
5    f
6    g
dtype: object
['a' 'c' 'd' 'f' 'g' 's' 'v']

Value counts: .value_counts()

sc = s.value_counts(sort = False)  # can also be written as: pd.value_counts(s, sort = False)
print(sc)
# returns a new Series counting how often each distinct value occurs
# sort parameter: whether to sort by count, default True

d    2
a    2
s    2
c    1
f    1
g    2
v    1
dtype: int64
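
To get relative frequencies instead of raw counts, value_counts also accepts normalize=True; a sketch:

print(s.value_counts(normalize = True))   # proportions instead of counts, summing to 1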

Membership: .isin()

s = pd.Series(np.arange(10,15))
df = pd.DataFrame({'key1':list('asdcbvasd'),
                  'key2':np.arange(4,13)})
print(s)
print(df)
print('-----')

print(s.isin([5,14]))
print(df.isin(['a','bc','10',8]))
# the argument is given as a list
# returns a boolean Series or DataFrame

0    10
1    11
2    12
3    13
4    14
dtype: int32
  key1  key2
0    a     4
1    s     5
2    d     6
3    c     7
4    b     8
5    v     9
6    a    10
7    s    11
8    d    12
-----
0    False
1    False
2    False
3    False
4     True
dtype: bool
    key1   key2
0   True  False
1  False  False
2  False  False
3  False  False
4  False   True
5  False  False
6   True  False
7  False  False
8  False  False
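
isin is often combined with boolean row selection; a sketch using the df above:

print(df[df['key1'].isin(['a','s'])])   # keep only the rows whose key1 is 'a' or 's'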

Text data (.str)
Pandas provides a set of string methods that make it easy to operate on each element of an array

Accessed through .str; missing/NA values are excluded automatically

s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                  'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)
print('-----')

print(s.str.count('b'))
print(df['key2'].str.upper())
print('-----')
# string methods are called directly through .str
# works on Series and on DataFrame columns
# NaN values are filtered out automatically

df.columns = df.columns.str.upper()
print(df)
# df.columns is an Index object, so .str works on it too

0          A
1          b
2          C
3    bbhello
4        123
5        NaN
6         hj
dtype: object
  key1  key2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN
-----
0    0.0
1    1.0
2    0.0
3    2.0
4    0.0
5    NaN
6    0.0
dtype: float64
0     HEE
1      FV
2       W
3    HIJA
4     123
5     NaN
Name: key2, dtype: object
-----
  KEY1  KEY2
0    a   hee
1    b    fv
2    c     w
3    d  hija
4    e   123
5    f   NaN
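
A few more of the common vectorized string methods, as a sketch with the same s as above:

print(s.str.lower())            # lower-case each element; NaN stays NaN
print(s.str.len())              # length of each string
print(s.str.startswith('b'))    # boolean Series, NaN for the missing value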

Common string methods (2): strip (remove whitespace)

s = pd.Series([' jack', 'jill ', ' jesse ', 'frank'])
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
print(s)
print(df)
print('-----')

print(s.str.strip())  # strip whitespace on both sides
print(s.str.lstrip())  # strip whitespace on the left
print(s.str.rstrip())  # strip whitespace on the right

df.columns = df.columns.str.strip()
print(df)
# this removes the leading/trailing spaces from the column names but keeps the space in the middle

0       jack
1      jill 
2     jesse 
3      frank
dtype: object
    Column A    Column B 
0   -1.110964   -0.607590
1    2.043887    0.713886
2    0.840672   -0.854777
-----
0     jack
1     jill
2    jesse
3    frank
dtype: object
0      jack
1     jill 
2    jesse 
3     frank
dtype: object
0      jack
1      jill
2     jesse
3     frank
dtype: object
   Column A  Column B
0 -1.110964 -0.607590
1  2.043887  0.713886
2  0.840672 -0.854777

Common string methods (3): replace

df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
df.columns = df.columns.str.replace(' ','-')
print(df)
# replace

df.columns = df.columns.str.replace('-','hehe',n=1)
print(df)
# n: the number of replacements to make

   -Column-A-  -Column-B-
0   -0.148261   -1.821841
1    0.920267   -2.327533
2    0.083585   -0.613041
   heheColumn-A-  heheColumn-B-
0      -0.148261      -1.821841
1       0.920267      -2.327533
2       0.083585      -0.613041

Common string methods (4): split and rsplit (the latter works from the right)

s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])
print(s.str.split(','))
print('-----')
# similar to Python's str.split

print(s.str.split(',')[0])
print('-----')
# indexing the result directly gives a list

print(s.str.split(',').str[0])  # take the first element of each split list
print(s.str.split(',').str.get(1))
print('-----')
# elements of the split lists can be accessed with .get() or []

print(s.str.split(',', expand=True))
print(s.str.split(',', expand=True, n = 1))
print(s.str.rsplit(',', expand=True, n = 1))
print('-----')
# expand=True expands the result into a DataFrame
# the n parameter limits the number of splits
# rsplit works like split but in reverse, i.e. from the end of the string toward the beginning

df = pd.DataFrame({'key1':['a,b,c','1,2,3',[':,., ']],
                  'key2':['a-b-c','1-2-3',[':-.- ']]})
print('-----')
print(df)
print(df['key2'].str.split('-'))
# using split on a DataFrame column

0    [a, b, c]
1    [1, 2, 3]
2          NaN
3          NaN
dtype: object
-----
['a', 'b', 'c']
-----
0      a
1      1
2    NaN
3    NaN
dtype: object
0      b
1      2
2    NaN
3    NaN
dtype: object
-----
     0    1    2
0    a    b    c
1    1    2    3
2  NaN  NaN  NaN
3  NaN  NaN  NaN
     0    1
0    a  b,c
1    1  2,3
2  NaN  NaN
3  NaN  NaN
     0    1
0  a,b    c
1  1,2    3
2  NaN  NaN
3  NaN  NaN
-----
-----
      key1     key2
0    a,b,c    a-b-c
1    1,2,3    1-2-3
2  [:,., ]  [:-.- ]
0    [a, b, c]
1    [1, 2, 3]
2          NaN
Name: key2, dtype: object

String indexing

s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                  'key2':['hee','fv','w','hija','123',np.nan]})
print("-----")
print(s.str[0])  # first character of each string
print("-----")
print(s.str[:2])  # first two characters of each string
print(df['key2'].str[0]) 
# after .str, indexing works the same as on an ordinary string

-----
0      A
1      b
2      C
3      b
4      1
5    NaN
6      h
dtype: object
-----
0      A
1      b
2      C
3     bb
4     12
5    NaN
6     hj
dtype: object
0      h
1      f
2      w
3      h
4      1
5    NaN
Name: key2, dtype: object

Merging, concatenating, deduplicating and replacing
Pandas has full-featured, high-performance, in-memory join operations, very similar to relational databases such as SQL

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
.replace()
.duplicated()

merge → similar to Excel's VLOOKUP

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(df1, df2, on='key'))
print('------')
# left: the first df
# right: the second df
# on: the key column(s) to join on

print(pd.merge(df3, df4, on=['key1','key2']))
# multiple join keys

 A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3
------
    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2

The how parameter → join type

print(pd.merge(df3, df4,on=['key1','key2'], how = 'inner'))  
print('------')
# inner: the default, keeps the intersection of the keys

print(pd.merge(df3, df4, on=['key1','key2'], how = 'outer'))  
print('------')
# outer: keeps the union of the keys; missing data becomes NaN

print(pd.merge(df3, df4, on=['key1','key2'], how = 'left'))  
print('------')
# left: uses df3 as the reference; missing data becomes NaN

print(pd.merge(df3, df4, on=['key1','key2'], how = 'right'))  
# right: uses df4 as the reference; missing data becomes NaN

  A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2
------
     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3
------
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A2  B2   K1   K0   C2   D2
4  A3  B3   K2   K1  NaN  NaN
------
     A    B key1 key2   C   D
0   A0   B0   K0   K0  C0  D0
1   A2   B2   K1   K0  C1  D1
2   A2   B2   K1   K0  C2  D2
3  NaN  NaN   K2   K0  C3  D3

Parameters left_on, right_on, left_index, right_index → when the key is not a shared column, the left and right keys can be set separately

df1 = pd.DataFrame({'lkey':list('bbacaab'),
                   'data1':range(7)})
df2 = pd.DataFrame({'rkey':list('abd'),
                   'date2':range(3)})
print(pd.merge(df1, df2, left_on='lkey', right_on='rkey'))
print('------')
# df1 joins on 'lkey', df2 joins on 'rkey'

df1 = pd.DataFrame({'key':list('abcdfeg'),
                   'data1':range(7)})
df2 = pd.DataFrame({'date2':range(100,105)},
                  index = list('abcde'))
print(pd.merge(df1, df2, left_on='key', right_index=True))
# df1 joins on 'key', df2 joins on its index
# left_index: when True, the first df uses its index as the key; default False
# right_index: when True, the second df uses its index as the key; default False

# so left_on, right_on, left_index and right_index can be combined with each other:
# left_on + right_on, left_on + right_index, left_index + right_on, left_index + right_index

 lkey  data1 rkey  date2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
------
  key  data1  date2
0   a      0    100
1   b      1    101
2   c      2    102
3   d      3    103
5   e      5    104
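
When both frames carry a non-key column with the same name, merge disambiguates the two with the suffixes parameter (default ('_x', '_y')); a sketch with two hypothetical frames:

left = pd.DataFrame({'key':['K0','K1'], 'val':[1,2]})
right = pd.DataFrame({'key':['K0','K1'], 'val':[3,4]})
print(pd.merge(left, right, on='key', suffixes=('_left','_right')))   # columns become val_left and val_right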

Concatenation: concat

s1 = pd.Series([1,2,3])
s2 = pd.Series([2,3,4])
print(pd.concat([s1,s2]))
print('-----')
# default axis=0: rows are stacked on top of each other

s3 = pd.Series([1,2,3],index = ['a','c','h'])
s4 = pd.Series([2,3,4],index = ['b','e','d'])
print(pd.concat([s3,s4]).sort_index())
print(pd.concat([s3,s4], axis=1))
print('-----')
# axis=1: columns are placed side by side, producing a DataFrame

0    1
1    2
2    3
0    2
1    3
2    4
dtype: int64
-----
a    1
b    2
c    2
d    4
e    3
h    3
dtype: int64
     0    1
a  1.0  NaN
b  NaN  2.0
c  2.0  NaN
d  NaN  4.0
e  NaN  3.0
h  3.0  NaN
-----
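
Two other frequently used concat options, as a sketch:

print(pd.concat([s1,s2], ignore_index = True))    # rebuild a fresh 0..n-1 index
print(pd.concat([s1,s2], keys = ['one','two']))   # keys build a hierarchical index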

Deduplication: .duplicated

s = pd.Series([1,1,1,1,2,2,2,3,4,5,5,5,5])
print(s.duplicated())
print(s[s.duplicated() == False])
print('-----')
# flag whether each value is a duplicate
# a boolean filter then keeps the non-duplicated values

s_re = s.drop_duplicates()
print(s_re)
print('-----')
# .drop_duplicates() removes the duplicates
# inplace parameter: whether to modify the original, default False

df = pd.DataFrame({'key1':['a','a',3,4,5],
                  'key2':['a','a','b','b','c']})
print(df.duplicated())
print(df['key2'].duplicated())
# using duplicated on a DataFrame

0     False
1      True
2      True
3      True
4     False
5      True
6      True
7     False
8     False
9     False
10     True
11     True
12     True
dtype: bool
0    1
4    2
7    3
8    4
9    5
dtype: int64
-----
0    1
4    2
7    3
8    4
9    5
dtype: int64
-----
0    False
1     True
2    False
3    False
4    False
dtype: bool
0    False
1     True
2    False
3     True
4    False
Name: key2, dtype: bool
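
drop_duplicates works on DataFrames too, with subset and keep controlling which columns are compared and which occurrence is kept; a sketch with the df above:

print(df.drop_duplicates())                                  # drop rows that are complete duplicates
print(df.drop_duplicates(subset = ['key2'], keep = 'last'))  # compare only key2, keep the last occurrence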

Replacement: .replace

s = pd.Series(list('ascaazsd'))
print(s.replace('a', np.nan))
print(s.replace(['a','s'] ,np.nan))
print(s.replace({'a':'hello world!','s':123}))
# can replace a single value or several values at once
# accepts a list or a dict

0    NaN
1      s
2      c
3    NaN
4    NaN
5      z
6      s
7      d
dtype: object
0    NaN
1    NaN
2      c
3    NaN
4    NaN
5      z
6    NaN
7      d
dtype: object
0    hello world!
1             123
2               c
3    hello world!
4    hello world!
5               z
6             123
7               d
dtype: object

Grouping data
Grouped statistics: the groupby facility

  • Split the data into groups according to some criteria
  • Apply a function to each group independently
  • Combine the results into a single data structure

A DataFrame can be grouped along its rows (axis=0) or its columns (axis=1); a function is applied to each group to produce a new value, and the results are then combined into the final result object.

df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

Grouping

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print(df)
print('------')

print(df.groupby('A'), type(df.groupby('A')))
print('------')
# grouping by itself returns a GroupBy object, an intermediate result; no computation has happened yet

a = df.groupby('A').mean()
b = df.groupby(['A','B']).mean()
c = df.groupby(['A'])['D'].mean()  # group by A and take the mean of D
print(a,type(a),'\n',a.columns)
print(b,type(b),'\n',b.columns)
print(c,type(c))
# computing on the groups yields a new DataFrame
# default axis = 0, i.e. group along the rows
# can group by a single column or by several columns (passed as a list)

     A      B         C         D
0  foo    one  1.135465  0.129953
1  bar    one  0.369776 -0.764419
2  foo    two  0.127738  0.094981
3  bar  three -1.089670  0.840816
4  foo    two  0.093698 -0.559945
5  bar    two -0.616583 -0.363614
6  foo    one  0.670724  0.301891
7  foo  three  1.034713 -0.132406
------
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000219EFD56470> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
------
            C         D
A                      
bar -0.445493 -0.095739
foo  0.612468 -0.033105 <class 'pandas.core.frame.DataFrame'> 
 Index(['C', 'D'], dtype='object')
                  C         D
A   B                        
bar one    0.369776 -0.764419
    three -1.089670  0.840816
    two   -0.616583 -0.363614
foo one    0.903094  0.215922
    three  1.034713 -0.132406
    two    0.110718 -0.232482 <class 'pandas.core.frame.DataFrame'> 
 Index(['C', 'D'], dtype='object')
A
bar   -0.095739
foo   -0.033105
Name: D, dtype: float64 <class 'pandas.core.series.Series'>

Grouping: an iterable object

df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'), type(df.groupby('X')))
print('-----')

print(list(df.groupby('X')), '→ iterable, can be turned directly into a list\n')
print(list(df.groupby('X'))[0], '→ each element is a tuple\n')
for n,g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
print('-----')
# n is the group name, g is the DataFrame for that group

print(df.groupby(['X']).get_group('A'),'\n')
print(df.groupby(['X']).get_group('B'),'\n')
print('-----')
# .get_group() extracts a single group

grouped = df.groupby(['X'])
print(grouped.groups)
print(grouped.groups['A'])  # can also be written as: df.groupby('X').groups['A']
print('-----')
# .groups returns the groups as a dict
# individual groups can then be looked up with ordinary dict indexing

sz = grouped.size()
print(sz,type(sz))
print('-----')
# .size() gives the size of each group

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
grouped = df.groupby(['A','B']).groups
print(df)
print(grouped)
print(grouped[('foo', 'three')])
# grouping by two columns

  X  Y
0  A  1
1  B  4
2  A  3
3  B  2
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000219EFD56B38> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
-----
[('A',    X  Y
0  A  1
2  A  3), ('B',    X  Y
1  B  4
3  B  2)] → iterable, can be turned directly into a list

('A',    X  Y
0  A  1
2  A  3) → each element is a tuple

A
   X  Y
0  A  1
2  A  3
###
B
   X  Y
1  B  4
3  B  2
###
-----
   X  Y
0  A  1
2  A  3 

   X  Y
1  B  4
3  B  2 

-----
{'A': Int64Index([0, 2], dtype='int64'), 'B': Int64Index([1, 3], dtype='int64')}
Int64Index([0, 2], dtype='int64')
-----
X
A    2
B    2
dtype: int64 <class 'pandas.core.series.Series'>
-----
     A      B         C         D
0  foo    one -0.304245  0.660739
1  bar    one  1.264029 -1.135984
2  foo    two  0.305469  0.865855
3  bar  three  0.003346 -0.804999
4  foo    two  0.584652  0.289705
5  bar    two  0.903023 -1.586324
6  foo    one  0.260300  0.994691
7  foo  three  0.293223  0.864264
{('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('bar', 'two'): Int64Index([5], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64')}
Int64Index([7], dtype='int64')

Computation methods on groups

s = pd.Series([1, 2, 3, 10, 20, 30], index = [1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0)  # with an index, .groupby(level=0) puts rows with the same index label into one group
print(grouped)
print(grouped.first(),'→ first: the first non-NaN value in each group\n')
print(grouped.last(),'→ last: the last non-NaN value\n')
print(grouped.sum(),'→ sum: sum of the non-NaN values\n')
print(grouped.mean(),'→ mean: average of the non-NaN values\n')
print(grouped.median(),'→ median: arithmetic median of the non-NaN values\n')
print(grouped.count(),'→ count: number of non-NaN values\n')
print(grouped.min(),'→ min, max: minimum and maximum of the non-NaN values\n')
print(grouped.std(),'→ std, var: standard deviation and variance of the non-NaN values\n')
print(grouped.prod(),'→ prod: product of the non-NaN values\n')

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000002B38F39B4E0>
1    1
2    2
3    3
dtype: int64 → first: the first non-NaN value in each group

1    10
2    20
3    30
dtype: int64 → last: the last non-NaN value

1    11
2    22
3    33
dtype: int64 → sum: sum of the non-NaN values

1     5.5
2    11.0
3    16.5
dtype: float64 → mean: average of the non-NaN values

1     5.5
2    11.0
3    16.5
dtype: float64 → median: arithmetic median of the non-NaN values

1    2
2    2
3    2
dtype: int64 → count: number of non-NaN values

1    1
2    2
3    3
dtype: int64 → min, max: minimum and maximum of the non-NaN values

1     6.363961
2    12.727922
3    19.091883
dtype: float64 → std, var: standard deviation and variance of the non-NaN values

1    10
2    40
3    90
dtype: int64 → prod: product of the non-NaN values

Computing with multiple functions at once: agg()

df = pd.DataFrame({'a':[1,1,2,2],
                  'b':np.random.rand(4),
                  'c':np.random.rand(4),
                  'd':np.random.rand(4),})
print(df)
print(df.groupby('a').agg(['mean',np.sum]))
print(df.groupby('a')['b'].agg({'result1':np.mean,
                               'result2':np.sum}))
# functions can be given as strings or as np. functions
# they can be passed as a list or as a dict; with a dict, the keys become the output column names → newer pandas versions warn about this usage
# prefer passing a list (or the named aggregation shown after the output below)

  a         b         c         d
0  1  0.758848  0.375900  0.962917
1  1  0.430484  0.322437  0.402809
2  2  0.285699  0.230663  0.525483
3  2  0.676740  0.191693  0.874899
          b                   c                   d          
       mean       sum      mean       sum      mean       sum
a                                                            
1  0.594666  1.189331  0.349169  0.698337  0.682863  1.365727
2  0.481219  0.962438  0.211178  0.422356  0.700191  1.400382
    result1   result2
a                    
1  0.594666  1.189331
2  0.481219  0.962438
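
In pandas 0.25 and later, the dict form that triggers the warning can be replaced by named aggregation, where the keyword names become the output columns; a sketch assuming a recent pandas version:

print(df.groupby('a')['b'].agg(result1 = 'mean', result2 = 'sum'))   # named aggregation instead of a dict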

Reading data
Reading plain delimited data: read_table

# can read txt and csv files

import os
os.chdir('C:/Users/数据分析库pandas')

data1 = pd.read_table('data1.txt', delimiter=',',header = 0, index_col=2)
print(data1)
# delimiter: the character to split on; sep can be used instead: sep = ','
# header: the row number to use as the column names, default 0 (the first row)
# index_col: the column to use as the row index; otherwise an automatic 0, 1, ... index is used

# read_table is mainly for reading simple txt/csv data

 va1  va2  va4
va3               
3      1    2    4
4      2    3    5
5      3    4    6
6      4    5    7

Reading csv data: read_csv

# first get familiar with how Excel exports to csv

data2 = pd.read_csv('data2.csv',encoding = 'utf-8')
print(data2.head())
# encoding: the character encoding, usually 'utf-8'

# in most cases the Excel file is first exported to csv and then read

 省级政区代码 省级政区名称  地市级政区代码 地市级政区名称    年份 党委书记姓名  出生年份  出生月份  籍贯省份代码 籍贯省份名称  \
0  130000    河北省   130100    石家庄市  2000    陈来立   NaN   NaN     NaN    NaN   
1  130000    河北省   130100    石家庄市  2001    吴振华   NaN   NaN     NaN    NaN   
2  130000    河北省   130100    石家庄市  2002    吴振华   NaN   NaN     NaN    NaN   
3  130000    河北省   130100    石家庄市  2003    吴振华   NaN   NaN     NaN    NaN   
4  130000    河北省   130100    石家庄市  2004    吴振华   NaN   NaN     NaN    NaN   

   ...   民族  教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科  专业:理工  专业:农科  专业:医科  入党年份  工作年份  
0  ...  NaN  硕士              1.0   NaN   NaN    NaN    NaN    NaN   NaN   NaN  
1  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
2  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
3  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
4  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  

[5 rows x 23 columns]
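
Going the other way, a DataFrame can be written back out with .to_csv; a minimal sketch (hypothetical output file name):

data2.to_csv('data2_out.csv', index = False, encoding = 'utf-8')   # write without the row index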

Reading Excel data: read_excel

data3 = pd.read_excel('地市级党委书记数据库(2000-10).xlsx',sheet_name='中国人民共和国地市级党委书记数据库(2000-10)',header=0)
print(data3.head())
# io: the file path
# sheet_name: use sheet_name=[0,1] to return several sheets, or sheet_name=None to return all sheets → an int/string returns a DataFrame, while None or a list returns a dict of DataFrames
# header: the row to use as the column names, default 0 (the first row)
# index_col: the column to use as the index; column name strings also work

 省级政区代码 省级政区名称  地市级政区代码 地市级政区名称    年份 党委书记姓名  出生年份  出生月份  籍贯省份代码 籍贯省份名称  \
0  130000    河北省   130100    石家庄市  2000    陈来立   NaN   NaN     NaN    NaN   
1  130000    河北省   130100    石家庄市  2001    吴振华   NaN   NaN     NaN    NaN   
2  130000    河北省   130100    石家庄市  2002    吴振华   NaN   NaN     NaN    NaN   
3  130000    河北省   130100    石家庄市  2003    吴振华   NaN   NaN     NaN    NaN   
4  130000    河北省   130100    石家庄市  2004    吴振华   NaN   NaN     NaN    NaN   

   ...   民族  教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科  专业:理工  专业:农科  专业:医科  入党年份  工作年份  
0  ...  NaN  硕士              1.0   NaN   NaN    NaN    NaN    NaN   NaN   NaN  
1  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
2  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
3  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  
4  ...  NaN  本科              0.0   0.0   0.0    1.0    0.0    0.0   NaN   NaN  

[5 rows x 23 columns]


Reposted from blog.csdn.net/qq_43868654/article/details/98358652