Python - Using the pandas Library

Introduction to pandas

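NumPy has a clear advantage in vectorized numerical computation, but it struggles with more complex data such as labeled data. pandas, which is built on top of NumPy, provides high-level data structures and tools that make data analysis simpler.
Because pandas is built on NumPy, vectorized and matrix operations in pandas work much the same as in NumPy.
For pure numerical computation, however, NumPy is generally faster than pandas.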
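
As a small illustration of the points above, the sketch below applies the same vectorized expression to a NumPy array and to a pandas Series, then adds two Series whose labels only partly overlap. The variable names and values are made up for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0])
ser = pd.Series(arr, index=["x", "y", "z"])

# The same vectorized expression works on both objects.
arr_result = arr * 2 + 1
ser_result = ser * 2 + 1   # labels x, y, z are preserved

# Unlike NumPy, pandas aligns operands by label, not by position.
other = pd.Series([10.0, 20.0], index=["y", "z"])
aligned = ser + other      # "x" has no partner, so it becomes NaN

print(ser_result)
print(aligned)
```

Label alignment is the main behavioral difference to keep in mind when moving an expression from NumPy to pandas.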

Importing

import pandas as pd
import numpy as np	# needed by the examples that use np

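About pd.Series

A Series is a one-dimensional array of labeled data.

Creating a Series object

pd.Series(data, index=None, dtype=None)
# data can be a list, a dict, or a NumPy array; index holds the labels and is optional, as is dtype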

Creating from a list

>>> a = pd.Series([1,2,3,4])
>>> a
0    1
1    2
2    3
3    4
dtype: int64
# When the data is a scalar
>>> a = pd.Series(1,index=['a','b'])
>>> a
a    1
b    1
dtype: int64

Creating from a NumPy array

>>> a = pd.Series(np.arange(1, 5),index=['a','b','c','d'])
>>> a
a    1
b    2
c    3
d    4
dtype: int32
>>> a['a']
1

Creating from a dict

>>> d = {'a':1,'b':2,'c':3,'d':4}
>>> pd.Series(d,index=['a','b','c','d','e'])	# If index is given, it is matched against the dict keys: matching labels keep their values, and labels with no matching key get NaN.
a    1.0
b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64

About pd.DataFrame

A DataFrame is a labeled two-dimensional array.

Creating a DataFrame object

pd.DataFrame(data, index=None, columns=None)	# columns holds the column labels; optional

Creating from a Series object

>>> a
a    1
b    2
c    3
d    4
dtype: int64
>>> pd.DataFrame(a, columns = ["num"])
   num
a    1
b    2
c    3
d    4

Creating from a dict

>>> d
{'a': [1, 2, 3], 'b': [4, 5, 6], 'c': 3, 'd': 4}
>>> pd.DataFrame(d)
   a  b  c  d
0  1  4  3  4
1  2  5  3  4
2  3  6  3  4

Creating from a dict of Series

>>> b
a     97
b     98
c     99
d    100
dtype: int64

>>> pd.DataFrame({"num":a, "ascii":b, "isdigit": "not"})
   num  ascii isdigit
a    1     97     not
b    2     98     not
c    3     99     not
d    4    100     not

Creating from a list of dicts

>>> a = [{'a':i, 'b':2*i} for i in range(3)]
>>> a.append({'a':3,'c':1})
>>> a
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}, {'a': 3, 'c': 1}]
>>> pd.DataFrame(a)
   a    b    c
0  0  0.0  NaN
1  1  2.0  NaN
2  2  4.0  NaN
3  3  NaN  1.0

Creating from a two-dimensional NumPy array

>>> pd.DataFrame(np.random.randint(10,size=(3,2)),columns=['a','b'],index=[1,2,3])
   a  b
1  7  0
2  9  6
3  2  3

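Reading external files (.csv, .xls) into a DataFrame

df = pd.read_csv('name.csv', header=0)	# read_csv already returns a DataFrame
# header is the row that holds the column names, row 0 by default; header=None means the file has no header row
df = pd.read_excel('name.xlsx')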
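
To make the effect of header concrete, here is a minimal sketch that reads the same CSV text twice. The CSV content is hypothetical, and io.StringIO stands in for a file on disk so the example runs without any external file.

```python
import io
import pandas as pd

# Hypothetical CSV content used only for this example.
csv_text = "name,score\nalice,90\nbob,85\n"

# Default: the first row is used as the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is data, and columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)  # (2, 2)
print(no_header.shape)    # (3, 2)
```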

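Inspecting DataFrame properties

Attribute (a is a DataFrame)	Description
a.values	Returns the values as a NumPy array, without labels; the result can be sliced
a.index / a.columns	Returns the row index / column index
a.shape	Returns the shape as (rows, columns); the header is not counted
a.size	Returns the number of values
a.dtypes	Returns the data type of each column
a['a'].unique()	Returns the unique values of one column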
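
The attributes in the table can be tried on a small frame. The sketch below uses a made-up DataFrame named df; the values are chosen only so that column c contains a repeated value.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [7, 5, 4], "b": [2, 3, 6], "c": [6, 9, 6]},
    index=[1, 2, 3],
)

print(df.values)            # 2-D NumPy array holding only the values
print(df.index.tolist())    # [1, 2, 3]
print(df.columns.tolist())  # ['a', 'b', 'c']
print(df.shape)             # (3, 3)
print(df.size)              # 9
print(df.dtypes)            # one dtype per column
print(df["c"].unique())     # distinct values of column c, in order of appearance
```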

Accessing DataFrame contents

Row slicing

# List-style slicing (by position)
>>> a[1:2]
   a  b  c
2  5  3  9
# Label-based indexing with a.loc: selects by label
>>> a.loc[1]	# the row labeled 1 (here, the first row)
a    7
b    2
c    6
Name: 1, dtype: int32
>>> a.loc[1:2]	# 1 and 2 are row labels (the index), not array positions; both ends are included
   a  b  c
1  7  2  6
2  5  3  9
# Position-based indexing with a.iloc: selects by order
>>> a.iloc[[1,2]]
   a  b  c
2  5  3  9
3  4  6  0
>>> a.iloc[1:2]
   a  b  c
2  5  3  9
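
The difference between the three slicing styles is easiest to see side by side. This is a small sketch on a frame with the same values as the one used above; the name df is chosen for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [7, 5, 4], "b": [2, 3, 6], "c": [6, 9, 0]},
    index=[1, 2, 3],
)

by_label = df.loc[1:2]      # labels 1 and 2; the end label is included
by_position = df.iloc[1:2]  # position 1 only; the end position is excluded
plain_slice = df[1:2]       # a plain integer slice acts on positions, like iloc

print(by_label.index.tolist())     # [1, 2]
print(by_position.index.tolist())  # [2]
print(plain_slice.index.tolist())  # [2]
```

When the index itself consists of integers, using loc or iloc explicitly avoids any doubt about whether a number means a label or a position.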

Column slicing

>>> a
   a  b  c
1  7  2  6
2  5  3  9
3  4  6  0
# Dict-style
>>> a['b']
1    2
2    3
3    6
Name: b, dtype: int32
>>> a[['b','a']]
   b  a
1  2  7
2  3  5
3  6  4

# Attribute-style
>>> a.a
1    7
2    5
3    4
Name: a, dtype: int32

# Or use loc or iloc, for example
>>> a.iloc[:, 1:]
   b  c
1  2  6
2  3  9
3  6  0

Row and column slicing

>>> a.loc[1:2, ['a','c']] 		# contiguous rows, scattered columns
   a  c
1  7  6
2  5  9

Boolean indexing
Ordinary Boolean comparisons work as described in knowledge point 2 of the NumPy notes.
Here we look at the isin function.

>>> a.isin([2,4])
       a      b      c
1  False   True  False
2  False  False  False
3   True  False  False
>>> a.c.isin([3])
1    False
2    False
3    False
Name: c, dtype: bool
# Used as a mask
>>> a[a.isin([2,4])]
     a    b   c
1  NaN  2.0 NaN
2  NaN  NaN NaN
3  4.0  NaN NaN
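
A common use of isin is to filter whole rows by the values of one column. The sketch below assumes the same values as the frame above and combines isin with an ordinary comparison; the names df, rows_in, and rows_cmp are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [7, 5, 4], "b": [2, 3, 6], "c": [6, 9, 0]},
    index=[1, 2, 3],
)

# A Boolean Series built from one column selects whole rows.
rows_in = df[df["a"].isin([4, 7])]

# Conditions combine with & (and), | (or), ~ (not); each comparison needs parentheses.
rows_cmp = df[(df["b"] > 2) & ~df["c"].isin([0])]

print(rows_in)
print(rows_cmp)
```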

Modifying DataFrame contents

# Insert a new column
# Dict-style assignment
>>> a['d'] = [3,6,6]
>>> a
   a  b  c  d
1  7  2  6  3
2  5  3  9  6
3  4  6  0  6
# Add a new Series as a column
>>> b = pd.Series([3,6,8],index=[1,2,3])
>>> a['e'] = b
>>> a
   a  b  c  d  e
1  7  2  6  3  3
2  5  3  9  6  6
3  4  6  0  6  8
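
Assigning a list and assigning a Series behave differently when the labels do not match exactly. The following sketch, with made-up names and values, shows that a Series is aligned on the index rather than copied by position.

```python
import pandas as pd

df = pd.DataFrame({"a": [7, 5, 4]}, index=[1, 2, 3])

# A list is assigned by position and must match the number of rows.
df["d"] = [3, 6, 6]

# A Series is aligned on the index: label 4 is ignored, label 1 gets NaN.
s = pd.Series([6, 8, 9], index=[2, 3, 4])
df["e"] = s

print(df)
```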

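Common functions

Function (a and b are objects)	Description
a.describe()	Returns most of the summary statistics for a
a.head(n) / a.tail(n)	Shows the first / last n rows; 5 by default
a.sample(n)	Shows n randomly chosen rows; 1 by default
a.info()	Shows information about the object
a.count()	Counts non-null values
a[column].value_counts()	Counts how often each value occurs in one column (or row)
a.sum()	Sums each column (axis=1 sums each row)
a.mean() / a.var() / a.std()	Mean / variance / standard deviation of each column
a.max() / a.min()	Maximum / minimum of each column (axis=1 for each row)
a.idxmax() / a.idxmin()	Index label of the maximum / minimum of each column
a.median() / a.mode() / a.quantile(0.75)	Median / mode / 75% quantile of each column
a.corr() / a.corrwith(b)	Correlation coefficients within a / between a and b
a.sort_values(by="columns", ascending=True)	Sorts the rows in ascending order of the column 'columns'
a.sort_index()	Sorts by row labels
a.sort_index(axis=1)	Sorts by column labels
a + b / a.add(b, fill_value=)	pandas aligns the indexes automatically; an entry missing on one side becomes NaN with +, or is treated as fill_value with add
pd.concat([a, b], axis=)¹	Concatenates a and b vertically; axis=1 concatenates horizontally
pd.merge(a, b)²	Merges a and b on their shared columns (the row order may change)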
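
The last three rows of the table are the ones most often confused, so here is a minimal sketch of each. The frames left and right, the column names, and the values are all invented for the example.

```python
import pandas as pd

left = pd.DataFrame({"key": ["x", "y"], "v1": [1, 2]})
right = pd.DataFrame({"key": ["y", "z"], "v2": [20, 30]})

# Vertical concatenation; ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([left, left], ignore_index=True)

# merge joins on the shared column "key"; the default is an inner join.
inner = pd.merge(left, right)
# how="outer" keeps unmatched keys and fills the gaps with NaN.
outer = pd.merge(left, right, how="outer")

# add() with fill_value treats a label missing from one side as that value.
s1 = pd.Series([1, 2], index=["p", "q"])
s2 = pd.Series([10, 20], index=["q", "r"])
total = s1.add(s2, fill_value=0)

print(stacked, inner, outer, total, sep="\n\n")
```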

The apply(method) function

By default, apply calls method on each column.

>>> a
    a  b  d  c
1   1  2  2  3
2  23  5  3  5
3   4  7  4  6
>>> a.apply(np.cumsum)
    a   b  d   c
1   1   2  2   3
2  24   7  5   8
3  28  14  9  14
>>> a.apply(sum)
a    28
b    14
d     9
c    14
dtype: int64
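# Note: pandas does not support functions that mutate the object passed to apply; the result shown here may differ between versions.
>>> def addcol(a):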
...     a['e'] = [2,4,5]
>>> a.apply(addcol)
a    None
b    None
d    None
c    None
e    None
dtype: object
>>> a
    a  b  d  c  e
1   1  2  2  3  2
2  23  5  3  5  4
3   4  7  4  6  5
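
For comparison, here is a sketch of apply with functions that return a value instead of mutating their argument, applied per column and per row. It also shows assign as one non-mutating way to add a column; the names and values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 23, 4], "b": [2, 5, 7]}, index=[1, 2, 3])

# Default axis=0: the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_total = df.apply(lambda row: row["a"] + row["b"], axis=1)

# assign returns a new DataFrame with the extra column; df itself is unchanged.
with_e = df.assign(e=[2, 4, 5])

print(col_range)
print(row_total)
print(with_e)
```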

Grouping and pivot tables

Grouping

>>> a
   a  b  c
0  2  1  2
1  4  3  5
2  2  5  9
>>> a.groupby('a')	# groups the rows; the computation is deferred (lazy)
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000019BD0AEB630>
>>> a.groupby('a').sum()
   b   c
a
2  6  11
4  3   5
>>> a.groupby('a').mean()
     b    c
a
2  3.0  5.5
4  3.0  5.0
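
Besides sum and mean, a grouped object can apply a different aggregation to each column. The sketch below uses named aggregation on the same values as above; the output column names b_sum and c_max are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 4, 2], "b": [1, 3, 5], "c": [2, 5, 9]})

# Named aggregation: each keyword names an output column as (input column, function).
summary = df.groupby("a").agg(b_sum=("b", "sum"), c_max=("c", "max"))

# Iterating over the groups shows which rows each key collected.
for key, group in df.groupby("a"):
    print(key, group.index.tolist())

print(summary)
```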

Pivot tables

>>> a.pivot_table("column", index = "", columns = "")	# the column to aggregate, the row key, and the column key
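
Since the call above is only a template, a concrete case may help. The following sketch builds a small, invented sales table and pivots it; the column names city, product, and amount are assumptions made for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["A", "A", "B", "B", "A"],
    "product": ["x", "y", "x", "y", "x"],
    "amount":  [10, 20, 30, 40, 50],
})

# Mean of "amount" for every (city, product) pair; aggfunc defaults to the mean.
table = sales.pivot_table("amount", index="city", columns="product")

print(table)
```

The result has one row per city and one column per product, which is the same information that grouping by both columns and taking the mean would give, laid out as a grid.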

Handling missing values


>>> a = pd.DataFrame(np.array([[1,np.nan,2],[np.nan,3,4],[np.nan,5,None]]), columns = list("abc"))
>>> a
     a    b     c
0    1  NaN     2
1  NaN    3     4
2  NaN    5  None
>>> a.dtypes
a    object
b    object
c    object
dtype: object

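Note³

Function	Description
a.isnull() / a.notnull()	Tests for missing values
a.dropna()	Drops rows that contain missing values⁴
a.dropna(axis="columns")	Drops columns that contain missing values⁴
a.fillna(value=)	Fills in missing values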
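
The functions in the table can be tried on a small frame. This sketch uses only np.nan for the missing entries, so the columns stay float64; the frame and its values are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [np.nan, 3, 5],
    "c": [2, 4, np.nan],
})

mask = df.isnull()                     # True where a value is missing
rows_kept = df.dropna()                # drops every row that contains a NaN
rows_all = df.dropna(how="all")        # drops only rows that are entirely NaN
cols_kept = df.dropna(axis="columns")  # drops every column that contains a NaN
filled = df.fillna(value=0)            # replaces each NaN with 0

print(df.dtypes)  # all float64, because np.nan is a float
print(filled)
```

In this frame every row and every column contains at least one NaN, so dropna with the default how="any" removes all of them, while how="all" removes nothing.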

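¹ concat(): with the optional parameter ignore_index=True, the row labels are renumbered.
² merge(): with the optional parameter how="outer", missing entries are filled with NaN; otherwise rows whose key has no match are dropped.
³ np.nan is a special floating-point value. If the data contains None or strings, the dtype of the whole array becomes object, which costs more resources than int or float.
⁴ dropna(): the optional parameter how is "any" or "all". With "any", a row or column is dropped if it contains any missing value; with "all", only if every value in it is missing.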
