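[雪峰磁针石 Blog] Python Module Introduction: Getting Started with pandas, a Powerful Tool for Big Data Processing

Code for this article: https://github.com/xurongzhong/mobile_data/

QQ discussion groups: Python data analysis (pandas, Excel) 630011153; Python test development 144081101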

Getting Started with pandas

Introduction

pandas provides data structures and manipulation tools that make cleaning and analyzing data fast and simple.

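pandas is often used together with numerical computing tools such as NumPy and SciPy, analysis libraries such as statsmodels and scikit-learn, and data visualization libraries such as matplotlib. pandas is built on NumPy arrays, so large amounts of data can often be processed without writing explicit loops.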
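As a minimal sketch of processing data without loops, the example below applies the same arithmetic once with a Python loop and once as a single vectorized expression. The `prices` data and the 10% discount are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration.
prices = pd.Series(np.arange(100_000, dtype="float64"))

# Loop version: multiply element by element in Python.
looped = pd.Series([p * 0.9 for p in prices])

# Vectorized version: one expression applied to the whole Series.
vectorized = prices * 0.9

# Both approaches produce the same values.
print(vectorized.equals(looped))
```

The vectorized form is shorter and is usually much faster, because the arithmetic runs inside NumPy rather than in the Python interpreter.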

pandas is suited to tabular or heterogeneous data, whereas NumPy is suited to large, homogeneous numerical arrays.
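To illustrate this contrast, here is a small sketch in which a pandas table keeps a separate type per column, while a NumPy array has one type for all of its elements. The table contents are invented for illustration, and `DataFrame` is used ahead of its introduction.

```python
import numpy as np
import pandas as pd

# Hypothetical table: each column can have its own dtype.
df = pd.DataFrame({
    "state": ["Ohio", "Texas", "Oregon"],
    "population": [35000, 71000, 16000],
    "coastal": [False, True, True],
})
print(df.dtypes)

# A NumPy array holds a single dtype for every element.
arr = np.array([35000, 71000, 16000])
print(arr.dtype)
```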

This article uses the following import convention:

import pandas as pd

Introduction to pandas Data Structures

The two main data structures are Series and DataFrame.

Series

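A Series is a one-dimensional array-like object made up of a sequence of values (of NumPy-like data types) and an associated set of data labels, called its index. The simplest Series is created from just a sequence of values: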

In [2]: import pandas as pd

In [3]: obj = pd.Series([4, 7, -5, 3])

In [4]: obj
Out[4]: 
0    4
1    7
2   -5
3    3
dtype: int64

In [5]: obj.values
Out[5]: array([ 4,  7, -5,  3])

In [6]: obj.index
Out[6]: RangeIndex(start=0, stop=4, step=1)

Specifying an index:

In [2]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [3]: obj2
Out[3]: 
d    4
b    7
a   -5
c    3
dtype: int64

In [4]: obj2.index
Out[4]: Index(['d', 'b', 'a', 'c'], dtype='object')

In [10]: obj2['a']
Out[10]: -5

In [11]: obj2['d'] = 6

In [12]: obj2[['c', 'a', 'd']]
Out[12]: 
c    3
a   -5
d    6
dtype: int64

As shown, unlike a plain NumPy array, a Series also lets you select values by index label.
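The difference between label-based and position-based selection can be made explicit with `.loc` and `.iloc`. The sketch below rebuilds the Series with its initial values so that it runs on its own.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_label = obj2.loc['a']           # select by index label
by_position = obj2.iloc[2]         # select by integer position
from_numpy = obj2.to_numpy()[2]    # a plain NumPy array only supports positions

print(by_label, by_position, from_numpy)
```

All three expressions return the same element, because the label 'a' sits at position 2.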

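NumPy functions and NumPy-like operations (such as filtering with a boolean array, scalar multiplication, or applying math functions) preserve the link between index and value: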

In [13]: obj2[obj2 > 0]
Out[13]: 
d    6
b    7
c    3
dtype: int64

In [14]: obj2 * 2
Out[14]: 
d    12
b    14
a   -10
c     6
dtype: int64

In [15]: obj2
Out[15]: 
d    6
b    7
a   -5
c    3
dtype: int64

In [17]: import numpy as np

In [18]: np.exp(obj2)
Out[18]: 
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [19]: 'b' in obj2
Out[19]: True

In [20]: 'e' in obj2
Out[20]: False
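The `in` tests above check index labels, not values, which is the same behavior as key lookup in a dict. A minimal sketch, rebuilding the Series so it runs on its own:

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_hit = 'b' in obj2          # True: 'b' is an index label
value_as_label = 7 in obj2       # False: 7 is a value, not a label
value_hit = 7 in obj2.values     # True: search the underlying array instead

print(label_hit, value_as_label, value_hit)
```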

As shown, a Series can be thought of as a fixed-length, ordered dict. A Series can also be created from a dict:

In [21]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [22]: obj3 = pd.Series(sdata)

In [23]: obj3
Out[23]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [24]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [25]: obj4 = pd.Series(sdata, index=states)

In [26]: obj4
Out[26]: 
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
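When an explicit index is passed together with a dict, only the keys listed in the index are kept, and labels with no matching key receive NaN. Because NaN is a floating-point value, the dtype changes from int64 to float64. A short sketch of this behavior:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj3 = pd.Series(sdata)                 # all dict keys become labels
obj4 = pd.Series(sdata, index=states)   # only labels listed in `states` are kept

print(obj3.dtype)            # int64
print(obj4.dtype)            # float64, because NaN requires a float dtype
print('Utah' in obj4)        # False: 'Utah' is not in `states`, so it is dropped
```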

In [27]: pd.isnull(obj4)
Out[27]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]: pd.notnull(obj4)
Out[28]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [29]: obj4.isnull()
Out[29]: 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [32]: obj4.notnull()
Out[32]: 
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
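The boolean results of `isnull` and `notnull` can be used directly for counting and filtering. A minimal sketch, with the Series rebuilt by hand so it runs on its own:

```python
import numpy as np
import pandas as pd

obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])

# True counts as 1 when summed, so this is the number of missing values.
n_missing = obj4.isnull().sum()

# Boolean filtering keeps only the rows that have data.
present = obj4[obj4.notnull()]

print(n_missing)
print(present)
```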

Adding two Series (values are aligned by index label):

In [33]: obj3
Out[33]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [34]: obj4
Out[34]: 
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [35]: obj3 + obj4
Out[35]: 
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
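In the sum above, any label that is missing from either operand yields NaN. If a label that is absent on one side should instead count as zero, one option is `Series.add` with `fill_value`. A sketch:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

total = obj3 + obj4                      # Utah and California become NaN
filled = obj3.add(obj4, fill_value=0)    # a value missing on one side counts as 0

print(filled)
```

With `fill_value=0`, Utah becomes 5000.0 because it is missing on only one side. California stays NaN because it has no value in either Series.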

In [36]: obj4.name = 'population'

In [37]: obj4.index.name = 'state'

In [38]: obj4
Out[38]: 
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
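The `name` attributes set above are more than display labels. For example, they become column names when the Series is converted to a table with `reset_index`. A minimal sketch, with the Series rebuilt by hand:

```python
import numpy as np
import pandas as pd

obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])
obj4.name = 'population'
obj4.index.name = 'state'

# The index becomes a regular column named after index.name,
# and the values become a column named after the Series name.
df = obj4.reset_index()
print(list(df.columns))   # ['state', 'population']
```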

In [40]: obj = pd.Series([4, 7, -5, 3])

In [41]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [42]: obj
Out[42]: 
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
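As the last session shows, the index of a Series can be replaced in place by assignment. The new index must have the same length as the Series; otherwise pandas raises a ValueError and leaves the index unchanged. A sketch of both cases:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']   # same length: accepted

error = None
try:
    obj.index = ['Bob', 'Steve']               # wrong length: rejected
except ValueError as exc:
    error = exc

print(type(error).__name__)
print(list(obj.index))
```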

Latest version of this article: http://t.cn/R8tJ9JH

WeChat: pythontesting