pandas data structures: Series and DataFrame

I. Series

Pandas has three core data structures: Series, DataFrame, and Index. The vast majority of operations revolve around these three structures.

A Series is a one-dimensional array-like object that contains a sequence of values and a corresponding index. A NumPy one-dimensional array accesses its elements through an implicitly defined integer index, whereas a Series associates its elements with an explicitly defined index. The explicit index makes a Series more capable: the index is no longer limited to integers and may be of other types such as strings, it need not be contiguous, and it may even contain duplicates, which gives a great deal of freedom.
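A minimal sketch of what the explicit index allows. The labels and values below are made up for illustration and do not come from the examples in this article:

```python
import pandas as pd

# Labels may be strings and may repeat.
s_dup = pd.Series([10, 20, 30], index=["x", "y", "x"])
both_x = s_dup.loc["x"]      # a Series holding both entries labelled "x"

# Integer labels need not start at 0 or be contiguous.
s_gap = pd.Series([1.5, 2.5], index=[100, 7])
at_seven = s_gap.loc[7]      # looked up by label, not by position
```

Using loc makes it explicit that the lookup is by label, which matters most when the labels are themselves integers.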

The most basic way to create a Series is with its constructor:

import pandas as pd
s = pd.Series([7,-3,4,-2])
s
Out[5]:
0    7
1   -3
2    4
3   -2
dtype: int64

Printed output is aligned automatically: the index is on the left and the corresponding values are on the right. The default index runs from 0 to N-1, where N is the length of the data. The values and the index of a Series are available through its values and index attributes, respectively:

In [5]: s.dtype
Out[5]: dtype('int64')
In [6]: s.values
Out[6]: array([ 7, -3,  4, -2], dtype=int64)
In [7]: s.index
Out[7]: RangeIndex(start=0, stop=4, step=1)

You can specify an index when creating a Series:

In [8]: s2 = pd.Series([7,-3,4,-2], index=['d','b','a','c'])
In [9]: s2
Out[9]:
d    7
b   -3
a    4
c   -2
dtype: int64
In [10]: s2.index
Out[10]: Index(['d', 'b', 'a', 'c'], dtype='object')
In [4]: pd.Series(5, index=list('abcde'))
Out[4]:
a    5
b    5
c    5
d    5
e    5
dtype: int64
In [5]: pd.Series({2:'a',1:'b',3:'c'}, index=[3,2]) # filter the result by index
Out[5]:
3    c
2    a
dtype: object

You can also modify the index afterwards by assigning to it directly:

In [33]: s
Out[33]:
0    7
1   -3
2    4
3   -2
dtype: int64
In [34]: s.index = ['a','b','c','d']
In [35]: s
Out[35]:
a    7
b   -3
c    4
d   -2
dtype: int64
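One detail worth knowing when assigning a new index: the number of labels must match the number of values. The sketch below illustrates this, and also shows rename as a non-mutating alternative; the label "first" is a hypothetical name chosen for the example:

```python
import pandas as pd

s = pd.Series([7, -3, 4, -2])
s.index = ["a", "b", "c", "d"]       # fine: four labels for four values

try:
    s.index = ["a", "b"]             # too few labels
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# rename returns a new Series and leaves s itself unchanged.
renamed = s.rename(index={"a": "first"})
```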

Like Python lists and NumPy arrays, a Series lets you retrieve values by index:

In [11]: s2['a']
Out[11]: 4
In [12]: s2[['c','a','d']]
Out[12]:
c   -2
a    4
d    7
dtype: int64
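The square-bracket lookups above go by label. As a small sketch, the loc and iloc accessors state explicitly whether a lookup is by label or by position:

```python
import pandas as pd

s2 = pd.Series([7, -3, 4, -2], index=["d", "b", "a", "c"])

by_label = s2.loc["a"]           # the value whose label is "a"
by_position = s2.iloc[0]         # the first value, whatever its label
label_slice = s2.loc["b":"a"]    # label slices include both endpoints
```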

You can also apply NumPy-style operations, such as universal functions, to a Series:

In [13]: s2[s2>0]
Out[13]:
d    7
a    4
dtype: int64
In [14]: s2*2
Out[14]:
d    14
b    -6
a     8
c    -4
dtype: int64
In [15]: import numpy as np
In [16]: np.exp(s2)
Out[16]:
d    1096.633158
b       0.049787
a      54.598150
c       0.135335
dtype: float64
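In each of these operations the link between labels and values is preserved. A brief sketch, reusing s2 from above, that makes the intermediate boolean mask visible:

```python
import numpy as np
import pandas as pd

s2 = pd.Series([7, -3, 4, -2], index=["d", "b", "a", "c"])

mask = s2 > 0                # a boolean Series with the same index as s2
positives = s2[mask]         # keeps only the rows where mask is True
roots = np.sqrt(positives)   # the ufunc result is still a labelled Series
```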

Because the index can consist of strings, a Series is in some ways similar to a Python ordered dictionary, so the in operator can be used:

In [17]: 'b' in s2
Out[17]: True
In [18]: 'e' in s2
Out[18]: False
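As with a dictionary, in tests the labels (the keys), not the values. A short sketch of the difference:

```python
import pandas as pd

s2 = pd.Series([7, -3, 4, -2], index=["d", "b", "a", "c"])

label_hit = "b" in s2            # "b" is one of the labels
value_as_label = 4 in s2         # 4 is a value, not a label
value_hit = 4 in s2.values       # search the underlying values instead
```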

Naturally, we can also create a Series from a Python dictionary:

In [19]: dic = {'beijing':35000,'shanghai':71000,'guangzhou':16000,'shenzhen':5000}
In [20]: s3=pd.Series(dic)
In [21]: s3
Out[21]:
beijing      35000
shanghai     71000
guangzhou    16000
shenzhen      5000
dtype: int64
In [14]: s3.keys() # naturally, it has dictionary-like methods
Out[14]: Index(['beijing', 'shanghai', 'guangzhou', 'shenzhen'], dtype='object')
In [15]: s3.items()
Out[15]: <zip at 0x1a5c2d88c88>
In [16]: list(s3.items())
Out[16]:
[('beijing', 35000),
 ('shanghai', 71000),
 ('guangzhou', 16000),
 ('shenzhen', 5000)]
In [18]: s3['changsha'] = 20300

Consider the following example:

In [22]: city = ['nanjing', 'shanghai','guangzhou','beijing']
In [23]: s4=pd.Series(dic, index=city)
In [24]: s4
Out[24]:
nanjing          NaN
shanghai     71000.0
guangzhou    16000.0
beijing      35000.0
dtype: float64

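The city list has an extra 'nanjing' but lacks 'shenzhen'. Pandas looks up each label in city in dic. Because dic has no 'nanjing', that value is missing and is represented by the special marker NaN. Because city does not contain 'shenzhen', s4 has no 'shenzhen' entry either. As you can see, the index is crucial and plays the decisive role here.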
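One way to picture this lookup is as a reindex: build the Series from the dictionary first, then conform it to the wanted labels. The sketch below checks that both routes give the same result for this data:

```python
import pandas as pd

dic = {"beijing": 35000, "shanghai": 71000, "guangzhou": 16000, "shenzhen": 5000}
city = ["nanjing", "shanghai", "guangzhou", "beijing"]

s4 = pd.Series(dic, index=city)              # as in the example above
via_reindex = pd.Series(dic).reindex(city)   # same labels, same NaN for nanjing
same = s4.equals(via_reindex)
```

Note that the NaN forces the values to float64, which is why the integers are printed as 71000.0 and so on.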

In pandas, the isnull and notnull functions can be used to check for missing data:

In [25]: pd.isnull(s4)
Out[25]:
nanjing       True
shanghai     False
guangzhou    False
beijing      False
dtype: bool
In [26]: pd.notnull(s4)
Out[26]:
nanjing      False
shanghai      True
guangzhou     True
beijing       True
dtype: bool
In [27]: s4.isnull()
Out[27]:
nanjing       True
shanghai     False
guangzhou    False
beijing      False
dtype: bool
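The boolean results are usually a means to an end. A small sketch of three common follow-ups (counting, filtering, and filling); the fill value 0 is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

s4 = pd.Series([np.nan, 71000.0, 16000.0, 35000.0],
               index=["nanjing", "shanghai", "guangzhou", "beijing"])

n_missing = int(s4.isnull().sum())   # True counts as 1 when summed
present = s4[s4.notnull()]           # keep only the non-missing entries
filled = s4.fillna(0)                # replace NaN with a chosen value
```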

You can set the name attribute on both a Series and its index, which helps with labelling and identification:

In [29]: s4.name = 'people'
In [30]: s4.index.name= 'city'
In [31]: s4
Out[31]:
city
nanjing          NaN
shanghai     71000.0
guangzhou    16000.0
beijing      35000.0
Name: people, dtype: float64
In [32]: s4.index
Out[32]: Index(['nanjing', 'shanghai', 'guangzhou', 'beijing'], dtype='object', name='city')
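These names carry over when the Series is turned into a DataFrame, as this sketch with to_frame suggests:

```python
import numpy as np
import pandas as pd

s4 = pd.Series(
    [np.nan, 71000.0, 16000.0, 35000.0],
    index=pd.Index(["nanjing", "shanghai", "guangzhou", "beijing"], name="city"),
    name="people",
)

frame = s4.to_frame()   # the Series name becomes the column label
```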

II. DataFrame

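DataFrame is the core data structure of pandas. It represents a two-dimensional table of data, similar in structure to a relational database table, in which each column can hold a different value type such as numbers, strings, or booleans. A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that share the same index.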
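The "dictionary of Series" view can be made concrete with a small sketch. The two Series below are illustrative and share the same labels:

```python
import pandas as pd

pop = pd.Series({"beijing": 1.5, "shanghai": 2.4})
year = pd.Series({"beijing": 2000, "shanghai": 2001})

# Each dictionary key becomes a column; the shared labels become the row index.
frame = pd.DataFrame({"pop": pop, "year": year})
pop_column = frame["pop"]    # selecting a column gives back a Series
```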

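There are many ways to create a DataFrame. The most common is from a dictionary of equal-length lists or NumPy arrays. You can inspect the columns and index attributes of the resulting DataFrame.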

In [37]: data = {'state':['beijing','beijing','beijing','shanghai','shanghai','shanghai'],
    ...: 'year':[2000,2001,2002,2001,2002,2003],
    ...: 'pop':[1.5, 1.7,3.6,2.4,2.9,3.2
    ...: ]}
In [38]: f = pd.DataFrame(data)
In [39]: f
Out[39]:
      state  year  pop
0   beijing  2000  1.5
1   beijing  2001  1.7
2   beijing  2002  3.6
3  shanghai  2001  2.4
4  shanghai  2002  2.9
5  shanghai  2003  3.2
In [61]: f.columns
Out[61]: Index(['state', 'year', 'pop'], dtype='object')
In [62]: f.index
Out[62]: RangeIndex(start=0, stop=6, step=1)
In [10]: f.dtypes
Out[10]:
state     object
year       int64
pop      float64
dtype: object
In [11]: f.values  # the data viewed row by row
Out[11]:
array([['beijing', 2000, 1.5],
       ['beijing', 2001, 1.7],
       ['beijing', 2002, 3.6],
       ['shanghai', 2001, 2.4],
       ['shanghai', 2002, 2.9],
       ['shanghai', 2003, 3.2]], dtype=object)

An index from 0 to 5 was generated automatically above.

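Use the head method to view the first 5 rows of a DataFrame and the tail method to view the last 5. Pass a number, as in head(3) or tail(3), to choose how many rows to show: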

In [40]: f.head()
Out[40]:
      state  year  pop
0   beijing  2000  1.5
1   beijing  2001  1.7
2   beijing  2002  3.6
3  shanghai  2001  2.4
4  shanghai  2002  2.9
In [41]: f.tail()
Out[41]:
      state  year  pop
1   beijing  2001  1.7
2   beijing  2002  3.6
3  shanghai  2001  2.4
4  shanghai  2002  2.9
5  shanghai  2003  3.2
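A short sketch of passing an explicit row count, using the same data as above:

```python
import pandas as pd

data = {"state": ["beijing", "beijing", "beijing", "shanghai", "shanghai", "shanghai"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
f = pd.DataFrame(data)

first_three = f.head(3)   # rows 0, 1, 2
last_two = f.tail(2)      # rows 4, 5
```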

The state/year/pop labels in the DataFrame are its column index. You can fix their order at creation time with the columns parameter:

In [44]: pd.DataFrame(data, columns=['year','state','pop'])
Out[44]:
   year     state  pop
0  2000   beijing  1.5
1  2001   beijing  1.7
2  2002   beijing  3.6
3  2001  shanghai  2.4
4  2002  shanghai  2.9
5  2003  shanghai  3.2

Of course, you can also specify the row index with the index parameter:

In [45]: f2 = pd.DataFrame(data, columns=['year','state','pop'],index=['a','b','c','d','e','f'])
In [47]: f2
Out[47]:
   year     state  pop
a  2000   beijing  1.5
b  2001   beijing  1.7
c  2002   beijing  3.6
d  2001  shanghai  2.4
e  2002  shanghai  2.9
f  2003  shanghai  3.2

You can retrieve a column by its column label:

In [49]: f2['year']
Out[49]:
a    2000
b    2001
c    2002
d    2001
e    2002
f    2003
Name: year, dtype: int64
In [52]: f2.state  # attribute-style access; error-prone, e.g. when the column name is not a valid identifier or clashes with a method name
Out[52]:
a     beijing
b     beijing
c     beijing
d    shanghai
e    shanghai
f    shanghai
Name: state, dtype: object

Retrieving a row, however, does not work with f2['a']; you need to select it with loc instead:

In [53]: f2.loc['a']
Out[53]:
year        2000
state    beijing
pop          1.5
Name: a, dtype: object
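loc also accepts a list of row labels or a row/column pair, and iloc selects by position. A minimal sketch on the same f2:

```python
import pandas as pd

data = {"state": ["beijing", "beijing", "beijing", "shanghai", "shanghai", "shanghai"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
f2 = pd.DataFrame(data, columns=["year", "state", "pop"], index=list("abcdef"))

rows = f2.loc[["a", "d"]]      # several rows: the result is a DataFrame
cell = f2.loc["d", "pop"]      # one row label plus one column label: a scalar
first_row = f2.iloc[0]         # first row by position, returned as a Series
```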

Of course, you can add columns to a DataFrame:

In [54]: f2['debt'] = 12
In [55]: f2
Out[55]:
   year     state  pop  debt
a  2000   beijing  1.5    12
b  2001   beijing  1.7    12
c  2002   beijing  3.6    12
d  2001  shanghai  2.4    12
e  2002  shanghai  2.9    12
f  2003  shanghai  3.2    12
In [56]: f2['debt'] = np.arange(1,7)
In [57]: f2
Out[57]:
   year     state  pop  debt
a  2000   beijing  1.5     1
b  2001   beijing  1.7     2
c  2002   beijing  3.6     3
d  2001  shanghai  2.4     4
e  2002  shanghai  2.9     5
f  2003  shanghai  3.2     6
In [58]: val = pd.Series([1,2,3],index = ['c','d','f'])
In [59]: f2['debt'] = val
In [60]: f2  # missing values are filled with NaN
Out[60]:
   year     state  pop  debt
a  2000   beijing  1.5   NaN
b  2001   beijing  1.7   NaN
c  2002   beijing  3.6   1.0
d  2001  shanghai  2.4   2.0
e  2002  shanghai  2.9   NaN
f  2003  shanghai  3.2   3.0
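The alignment by index shown above applies to Series only. A plain list or array has no labels, so its length must match the number of rows, as this sketch illustrates:

```python
import pandas as pd

data = {"state": ["beijing", "beijing", "beijing", "shanghai", "shanghai", "shanghai"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
f2 = pd.DataFrame(data, columns=["year", "state", "pop"], index=list("abcdef"))

try:
    f2["debt"] = [1, 2, 3]     # three values for six rows
    length_error = False
except ValueError:
    length_error = True

# A Series is matched label by label; unmatched rows become NaN.
f2["debt"] = pd.Series([1, 2, 3], index=["c", "d", "f"])
```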

So how do you append rows to a DataFrame?

>>> data = {'state':['beijing','beijing','beijing','shanghai','shanghai','shanghai'],
...         'year':[2000,2001,2002,2001,2002,2003],
...         'pop':[1.5, 1.7,3.6,2.4,2.9,3.2]}
>>> df = pd.DataFrame(data,index=list('abcdef'))
>>> df
      state  year  pop
a   beijing  2000  1.5
b   beijing  2001  1.7
c   beijing  2002  3.6
d  shanghai  2001  2.4
e  shanghai  2002  2.9
f  shanghai  2003  3.2
>>> df1 = df.loc['a']
>>> df1
state    beijing
year        2000
pop          1.5
Name: a, dtype: object
>>> # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
>>> pd.concat([df, df1.to_frame().T])
      state  year  pop
a   beijing  2000  1.5
b   beijing  2001  1.7
c   beijing  2002  3.6
d  shanghai  2001  2.4
e  shanghai  2002  2.9
f  shanghai  2003  3.2
a   beijing  2000  1.5
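For adding a single new row, assigning to a label that does not exist yet through loc is another option. This is a minimal sketch; the label 'g' and the values in the new row are hypothetical and only serve the illustration:

```python
import pandas as pd

data = {"state": ["beijing", "beijing", "beijing", "shanghai", "shanghai", "shanghai"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data, index=list("abcdef"))

# Hypothetical new row; one value per column, in column order.
df.loc["g"] = ["nanjing", 2004, 4.1]
```

Unlike combining objects into a new DataFrame, this modifies df in place.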

You can use the del statement to delete a specified column:

In [63]: f2['new'] = f2.state=='beijing'
In [64]: f2
Out[64]:
   year     state  pop  debt    new
a  2000   beijing  1.5   NaN   True
b  2001   beijing  1.7   NaN   True
c  2002   beijing  3.6   1.0   True
d  2001  shanghai  2.4   2.0  False
e  2002  shanghai  2.9   NaN  False
f  2003  shanghai  3.2   3.0  False
In [65]: del f2.new   # note: attribute-style access cannot always be used to operate on a column; this is a pitfall
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-65-03e4ec812cdb> in <module>()
----> 1 del f2.new
AttributeError: new
In [66]: del f2['new']
In [67]: f2.columns
Out[67]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
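Besides del, the drop and pop methods also remove columns. The sketch below uses a shortened, two-row version of the data to keep it brief:

```python
import pandas as pd

f2 = pd.DataFrame({"year": [2000, 2001],
                   "state": ["beijing", "shanghai"],
                   "pop": [1.5, 2.4]})
f2["new"] = f2["state"] == "beijing"

without_new = f2.drop(columns="new")   # returns a new DataFrame; f2 is untouched
popped = f2.pop("new")                 # removes the column from f2 and returns it
```

drop is convenient when the original should stay intact, while pop is the closest counterpart to del when the removed values are still needed.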

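Note: in the pandas versions this article was written against, a column selected from a DataFrame could be a view on the data rather than a copy, so modifying the selected column could be reflected in the DataFrame. This behavior depends on the pandas version (with Copy-on-Write, the default from pandas 3.0, a selection behaves like a copy), so use the copy method whenever you need an independent copy.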
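A minimal sketch of taking an explicit copy of a column, so that later changes cannot reach the DataFrame regardless of the pandas version; the value 99.0 is arbitrary:

```python
import pandas as pd

f = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

pop_copy = f["pop"].copy()   # independent copy of the column
pop_copy.iloc[0] = 99.0      # changes the copy only
```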

As with NumPy, the T attribute transposes a DataFrame:

In [68]: f2.T
Out[68]:
             a        b        c         d         e         f
year      2000     2001     2002      2001      2002      2003
state  beijing  beijing  beijing  shanghai  shanghai  shanghai
pop        1.5      1.7      3.6       2.4       2.9       3.2
debt       NaN      NaN        1         2       NaN         3

A DataFrame likewise has names for its columns and its index, and you can inspect its values:

In [70]: f2.index.name = 'order';f2.columns.name='key'
In [71]: f2
Out[71]:
key    year     state  pop  debt
order
a      2000   beijing  1.5   NaN
b      2001   beijing  1.7   NaN
c      2002   beijing  3.6   1.0
d      2001  shanghai  2.4   2.0
e      2002  shanghai  2.9   NaN
f      2003  shanghai  3.2   3.0
In [72]: f2.values
Out[72]:
array([[2000, 'beijing', 1.5, nan],
       [2001, 'beijing', 1.7, nan],
       [2002, 'beijing', 3.6, 1.0],
       [2001, 'shanghai', 2.4, 2.0],
       [2002, 'shanghai', 2.9, nan],
       [2003, 'shanghai', 3.2, 3.0]], dtype=object)
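The dtype=object above is a consequence of mixing strings and numbers in one array. As a sketch, selecting only the numeric columns first yields a numeric array; to_numpy is used here as the method counterpart of the values attribute:

```python
import pandas as pd

data = {"state": ["beijing", "beijing", "beijing", "shanghai", "shanghai", "shanghai"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
f2 = pd.DataFrame(data, columns=["year", "state", "pop"], index=list("abcdef"))

mixed = f2.to_numpy()                       # strings and numbers together
numeric = f2[["year", "pop"]].to_numpy()    # int64 and float64 columns only
```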

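Finally, DataFrame has an info method that gives an overview of the DataFrame as a whole. (In older pandas versions Series had no such method; Series.info was added in pandas 1.4.)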

In [73]: f.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
state    6 non-null object
year     6 non-null int64
pop      6 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 224.0+ bytes
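info prints its report instead of returning it. If the text is needed as a string, for example for logging, one option is to pass a buffer through the buf parameter, as in this sketch:

```python
import io

import pandas as pd

data = {"state": ["beijing", "beijing", "beijing", "shanghai", "shanghai", "shanghai"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
f = pd.DataFrame(data)

buf = io.StringIO()
f.info(buf=buf)              # write the report into the buffer instead of stdout
summary = buf.getvalue()

missing_per_column = f.isnull().sum()   # the non-null counts, seen from the other side
```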

Origin www.cnblogs.com/lavender1221/p/12664641.html