pandas Basic Operations (Part 1)

I. Introduction to pandas Data Structures

1. Series

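A Series is a one-dimensional array-like object that holds a sequence of values together with an associated array of data labels, called its index. The simplest Series is formed from just an array of data: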

import pandas as pd
import numpy as np
obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64

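In an interactive environment, a Series is displayed with the index on the left and the values on the right. Since we did not specify an index for the data, a default index of the integers 0 through N-1 (where N is the length of the data) is created. You can get the values and the index of a Series through its values and index attributes, respectively: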

obj.values
array([ 4,  7, -5,  3])
obj.index
RangeIndex(start=0, stop=4, step=1)

Unlike a NumPy array, a Series lets you use labels from the index to select single values or sets of values:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2['a']
-5
obj2.iloc[1]
7
obj2[['c', 'a', 'd']]
c    3
a   -5
d    4
dtype: int64
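
The two kinds of lookup used above can be spelled out explicitly. The following is a minimal sketch, assuming the same data as obj2; the variable names are illustrative only. loc always selects by label and iloc always selects by integer position, which avoids ambiguity when the index is not the default 0 to N-1 range.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

by_label = s.loc["a"]             # label-based lookup
by_position = s.iloc[1]           # position-based lookup (second element)
subset = s.loc[["c", "a", "d"]]   # a list of labels returns a new Series

print(by_label, by_position)
print(subset.tolist())
```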

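As with NumPy arrays, you can apply NumPy functions and NumPy-style operations to a Series, such as filtering with a boolean array, scalar multiplication, or math functions. The link between each index label and its value is preserved: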

obj2[obj2 > 0]
d    4
b    7
c    3
dtype: int64
obj2 * 2
d     8
b    14
a   -10
c     6
dtype: int64
np.exp(obj2)
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
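
To make the "labels are preserved" point concrete, here is a small sketch under the same assumptions as obj2 (the names mask, positive and squared are illustrative). The boolean mask is itself a Series with the same index, and the result of a NumPy function is again a Series rather than a bare array.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

mask = s > 0              # boolean Series sharing the index of s
positive = s[mask]        # keeps only the rows where the mask is True
squared = np.square(s)    # NumPy ufunc applied element-wise, index kept

print(list(positive.index))
print(squared.tolist())
```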

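A Series can also be thought of as a fixed-length, ordered dictionary, since it maps index labels to data values. It can be used in many places where a dict is expected, and it can be built directly from a dict. (The pandas version used here sorts the dict keys; pandas 0.23 and later keep the dict's insertion order.) If you also pass an index, the values are looked up by label, and labels that have no entry in the dict get NaN: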

'b' in obj2
True
sdata = {'Ohio' : 3500, 'Texas' : 71000, 'Oregon' : 16000, 'Utah' : 5000}
obj3 = pd.Series(sdata)
obj3
Ohio       3500
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64
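
The dictionary analogy can be checked in both directions. This is a short sketch assuming the sdata dict from above; note that the in operator tests the index labels, not the stored values.

```python
import pandas as pd

sdata = {"Ohio": 3500, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
s = pd.Series(sdata)

has_texas = "Texas" in s     # True: membership is tested against the labels
has_71000 = 71000 in s       # False: 71000 is a value, not a label
round_trip = s.to_dict()     # back to a plain Python dict

print(has_texas, has_71000)
```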

pandas uses the isnull and notnull functions to detect missing data. Both are also available as Series methods:

pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
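
Because these checks return boolean Series, they combine naturally with sum and with boolean filtering. A minimal sketch, assuming data shaped like obj4 (the names n_missing and present are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [np.nan, 3500.0, 16000.0, 71000.0],
    index=["California", "Ohio", "Oregon", "Texas"],
)

n_missing = int(s.isnull().sum())   # True counts as 1 when summed
present = s[s.notnull()]            # keep only the non-missing entries

print(n_missing)
print(list(present.index))
```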

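For many applications, a very useful Series feature is that arithmetic operations automatically align on index labels. A label that is missing from either operand, such as Utah here, yields NaN in the result: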

obj3
Ohio       3500
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
obj4
California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64
obj3 + obj4
California         NaN
Ohio            7000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
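
If NaN is not the desired result for labels found in only one operand, the add method accepts a fill_value. The sketch below assumes the same data as obj3 and obj4. A label that is missing on both sides (California is absent from the first Series and NaN in the second) still produces NaN.

```python
import numpy as np
import pandas as pd

a = pd.Series({"Ohio": 3500, "Oregon": 16000, "Texas": 71000, "Utah": 5000})
b = pd.Series({"California": np.nan, "Ohio": 3500.0,
               "Oregon": 16000.0, "Texas": 71000.0})

plain = a + b                     # Utah exists only in a, so the sum is NaN
filled = a.add(b, fill_value=0)   # a value missing on one side is treated as 0

print(plain)
print(filled)
```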

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

obj4.index.name = 'states'
obj4
states
California        NaN
Ohio           3500.0
Oregon        16000.0
Texas         71000.0
dtype: float64
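
One place where the two name attributes show up is when a Series is turned into a DataFrame. In this sketch (the name "population" is an illustrative assumption, not taken from the example above), reset_index uses the index name and the Series name as column labels.

```python
import pandas as pd

s = pd.Series([3500.0, 16000.0], index=["Ohio", "Oregon"])
s.name = "population"     # name of the values
s.index.name = "state"    # name of the index

frame = s.reset_index()   # the index becomes an ordinary column
print(list(frame.columns))
```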

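2. DataFrame

A DataFrame has both a row index and a column index; it can be thought of as a dictionary of Series that all share the same index. Although a DataFrame is two-dimensional, you can use hierarchical indexing to represent higher-dimensional data in it.

There are many ways to construct a DataFrame. One of the most common is from a dict of equal-length lists or NumPy arrays (the pandas version used here sorts the columns alphabetically; pandas 0.23 and later keep the dict's insertion order):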

data = {'state' : ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 'year' : [2000, 2001, 2002, 2001, 2002, 2003], 'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003

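If you specify a sequence of columns, the DataFrame's columns are arranged in that order. A requested column that is not present in the dict, such as debt, appears with missing values: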

pd.DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
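
The dict-of-lists form is only one of the accepted inputs. As a hedged illustration of another, the sketch below builds a DataFrame from a list of row dicts (the records are made-up sample rows). A key missing from a row is filled with NaN, in the same way that the debt column above is.

```python
import pandas as pd

records = [
    {"state": "Ohio", "year": 2000, "pop": 1.5},
    {"state": "Nevada", "year": 2001},          # no "pop" key in this row
]
frame = pd.DataFrame(records, columns=["year", "state", "pop"])

print(frame)
```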

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
frame2.state
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

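frame2[column] works for any column name, but frame2.column works only when the column name is a valid Python variable name and does not clash with an existing DataFrame attribute or method.
Note that the returned Series has the same index as the DataFrame, and its name attribute is set accordingly.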
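
The pop column used in this article is a handy example of such a clash, because DataFrame already has a pop method. A minimal sketch (the "tax rate" column and its values are invented for illustration):

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Nevada"],
    "pop": [1.5, 2.4],
    "tax rate": [0.05, 0.0],   # hypothetical column; its name contains a space
})

by_key = frame["tax rate"]                  # bracket access works for any name
same = frame.state.equals(frame["state"])   # attribute access works here
pop_is_method = callable(frame.pop)         # frame.pop is the method, not the column
pop_column = frame["pop"]                   # the column needs bracket access

print(same, pop_is_method)
```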

Rows can also be retrieved by label with the special loc attribute (iloc is its position-based counterpart):

frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
frame2.loc['three', 'state']
'Ohio'
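
For comparison, the same row can be reached by label or by position. This is a small sketch on a reduced, illustrative version of frame2:

```python
import pandas as pd

frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Ohio"]},
    index=["one", "two", "three"],
)

row_by_label = frame.loc["three"]     # select the row labelled 'three'
row_by_pos = frame.iloc[2]            # select the third row by position
cell = frame.loc["three", "state"]    # row label plus column label gives a scalar

print(cell)
```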

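Columns can be modified by assignment. The debt column can be assigned a scalar or an array of values; a list or array must match the DataFrame's length. When you assign a Series, its labels are aligned to the DataFrame's index, and rows without a matching label get NaN: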

frame2['debt'] = 16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5
frame2['debt'] = np.arange(6)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
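
The difference between assigning a Series and assigning a plain list can be seen in a short sketch (the frame here is a reduced, illustrative one). The Series is aligned by label, while a list of the wrong length is rejected with a ValueError.

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

# A Series is aligned on the index; unmatched rows become NaN.
frame["debt"] = pd.Series([-1.2], index=["two"])

# A list has no labels, so its length must equal the number of rows.
error = None
try:
    frame["bad"] = [1, 2]
except ValueError as exc:
    error = str(exc)

print(frame)
print(error)
```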

The del keyword deletes a column, just as with a dict. First, we add a new boolean column that marks whether the state is Ohio:

frame2['eastern'] = frame2.state == 'Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
del frame2['eastern']
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN

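In the pandas versions this article targets, the column returned by indexing a DataFrame is a view on the underlying data, not a copy, so in-place modifications to the Series are reflected in the DataFrame. To get an independent copy, call the Series's copy method explicitly. (With Copy-on-Write, the default behavior from pandas 3.0, such modifications no longer propagate to the DataFrame.)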
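
Taking an explicit copy gives the same result regardless of pandas version. A minimal sketch with illustrative data:

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7]})

col = frame["pop"].copy()   # independent copy of the column
col.iloc[0] = 99.0          # modify only the copy

print(frame["pop"].tolist())
print(col.tolist())
```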

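If a nested dict of dicts is passed to DataFrame, the outer keys become the columns and the inner keys become the row index. You can then transpose the DataFrame (swap rows and columns) with NumPy-like syntax: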

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6

A dict of Series can also be used to construct a DataFrame:

pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7

If a DataFrame's index and columns have their name attributes set, these names are displayed as well:

frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

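The values attribute returns the data in a DataFrame as a two-dimensional ndarray. If the columns have different dtypes, the dtype of the array is chosen to accommodate all of them (object for frame2):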

frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
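
The dtype rule can be checked directly. In this sketch (both frames are illustrative), an all-float frame yields a float64 array, while mixing integers and strings falls back to object.

```python
import pandas as pd

homog = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})

homog_dtype = homog.values.dtype   # all columns are float64
mixed_dtype = mixed.values.dtype   # no common numeric type, so object

print(homog_dtype, mixed_dtype)
```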

3. Index Objects

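Index objects hold the axis labels and other metadata such as axis names. Any array or other sequence of labels you use when constructing a Series or DataFrame is converted internally to an Index: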

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')

Index objects are immutable, so they cannot be modified by the user:

index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-70-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'

~/.conda/envs/python36/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   1668 
   1669     def __setitem__(self, key, value):
-> 1670         raise TypeError("Index does not support mutable operations")
   1671 
   1672     def __getitem__(self, key):

TypeError: Index does not support mutable operations
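
Immutability applies to the Index object itself, not to the Series that uses it. As a hedged sketch, changing one label in place fails, but assigning a whole new sequence of labels to the index attribute is allowed and leaves the original Index untouched.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])
s = pd.Series(range(3), index=idx)

raised = False
try:
    idx[1] = "d"              # in-place mutation is rejected
except TypeError:
    raised = True

s.index = ["a", "d", "c"]     # rebinding to a new set of labels is fine

print(raised, list(s.index), list(idx))
```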

Immutability makes it safer to share Index objects among data structures:

labels = pd.Index(np.arange(3))
labels
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
0    1.5
1   -2.5
2    0.0
dtype: float64
obj2.index is labels
True

In addition to being array-like, an Index also behaves like a fixed-size set:

frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
frame3.columns
Index(['Nevada', 'Ohio'], dtype='object', name='state')
'Ohio' in frame3.columns
True
2003 in frame3.index
False
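
The set analogy has one limit worth noting: unlike a Python set, an Index may contain duplicate labels. In this sketch (the labels and values are illustrative), selecting a duplicated label returns every matching entry.

```python
import pandas as pd

dup = pd.Index(["foo", "foo", "bar", "bar"])
s = pd.Series([1, 2, 3, 4], index=dup)

unique_flag = dup.is_unique   # False because labels repeat
foo_rows = s.loc["foo"]       # a Series holding both 'foo' entries

print(unique_flag)
print(foo_rows.tolist())
```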

Some Index methods and properties:

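Method        Description
append        Concatenate additional Index objects onto the original, producing a new Index
difference    Compute the set difference of two indexes
intersection  Compute the set intersection of two indexes
union         Compute the set union of two indexes
isin          Compute a boolean array indicating whether each value is contained in the passed collection
delete        Compute a new Index with the element at position i deleted
drop          Compute a new Index by deleting the passed values
insert        Compute a new Index by inserting an element at position i
is_monotonic  Return True if each element is greater than or equal to the previous one (is_monotonic_increasing in pandas 2.0 and later)
is_unique     Return True if the Index has no duplicate values
unique        Compute the array of unique values in the Index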
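
A few of these methods in action, as a minimal sketch on two small illustrative indexes. Each call returns a new object and leaves the original Index unchanged.

```python
import pandas as pd

a = pd.Index(["a", "b", "c"])
b = pd.Index(["b", "c", "d"])

union = a.union(b)                 # labels found in either index
common = a.intersection(b)         # labels found in both
only_a = a.difference(b)           # labels in a but not in b
flags = a.isin(["a", "c"])         # boolean array, one entry per label
inserted = a.insert(1, "z")        # new Index with 'z' at position 1
deleted = a.delete(0)              # new Index without position 0
dropped = a.drop(["b"])            # new Index without the label 'b'
joined = a.append(b)               # concatenation, duplicates kept

print(list(union), list(common), list(only_a))
```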

Reposted from blog.csdn.net/qq_44315987/article/details/104079819