The Road to a High-Paying Python Job: pandas

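pandas is one of the most important libraries in Python. Its main strengths:

It is built on top of NumPy;
It provides a large number of functions and methods for processing data quickly and conveniently;
It was originally developed as a tool for financial data analysis;
It has built-in time series functionality;
It handles missing values flexibly;
and more.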

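pandas offers two main ways to structure data: Series and DataFrame. A Series holds one-dimensional data together with its index, while a DataFrame holds two-dimensional, tabular data.

1. Series

1) Constructing a Series

Constructor as listed in the API (the exact signature varies between pandas versions): class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)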

In [142]: b = pd.Series(['付靖玲','23','male'],index=['name','age','sex'])
In [143]: b
Out[143]:
name     付靖玲
age       23
sex     male
dtype: object

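The code above uses pandas' Series to build a Series named b. The first list holds the data and the second list holds the index label for each value. The index is optional; if it is omitted, it defaults to 0, 1, 2, ...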
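
To make the default index concrete, here is a minimal sketch that builds the same kind of data twice, once without and once with an explicit index. The variable names default_idx and labeled are my own, chosen only for illustration.

```python
import pandas as pd

# Without an explicit index, pandas assigns the default labels 0, 1, 2, ...
default_idx = pd.Series(['fjl', '23', 'male'])
print(list(default_idx.index))   # [0, 1, 2]

# With an explicit index, each value gets the label at the same position.
labeled = pd.Series(['fjl', '23', 'male'], index=['name', 'age', 'sex'])
print(list(labeled.index))       # ['name', 'age', 'sex']

# The data and the index can also be inspected separately.
print(list(labeled.values))      # ['fjl', '23', 'male']
```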

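2) Reading Series data: values can be read either by index label or by position. Note that in recent pandas versions, positional access written as b[1] on a Series with non-integer labels is deprecated; b.iloc[1] is the explicit positional form.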

In [155]: b[1]

Out[155]: '23'
In [156]: b['age']
Out[156]: '23'
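
The explicit accessors .loc (by label) and .iloc (by position) make the intent unambiguous. The sketch below rebuilds a Series like b with a placeholder name value; by_label and by_position are illustrative names of my own.

```python
import pandas as pd

b = pd.Series(['fjl', '23', 'male'], index=['name', 'age', 'sex'])

by_label = b.loc['age']      # label-based lookup
by_position = b.iloc[1]      # position-based lookup (0-based)
print(by_label, by_position)       # 23 23

# Label slices include the end label; position slices exclude the end.
print(list(b.loc['name':'age']))   # ['fjl', '23']
print(list(b.iloc[0:1]))           # ['fjl']
```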

3) How a Series differs from a dict

A dict stores key-value pairs; a value can be retrieved only by its key, not by its position

In [158]: a = dict(name='fjl',sex='male')
In [159]: a
Out[159]: {'name': 'fjl', 'sex': 'male'}
In [160]: a[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-160-5ccf417d7af1> in <module>()
----> 1 a[0]
KeyError: 0

The code above tried to read the dict by position and raised a KeyError. Below, the value is retrieved by its key:

In [162]: a['name']
Out[162]: 'fjl'

However, a dict can also be converted to a Series:

In [166]: c = pd.Series(a)
In [167]: c
Out[167]:
name     fjl
sex     male
dtype: object
In [168]: c[0]
Out[168]: 'fjl'
In [169]: c['name']
Out[169]: 'fjl'

It is also possible to convert only the entries for specified index labels:

In [170]: d = pd.Series(a, index=['sex'])
In [171]: d
Out[171]:
sex    male
dtype: object
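
One detail worth knowing is what happens when the index passed in contains a label that is not a key of the dict. A small sketch, assuming the same dict a as above; the Series name e and the extra label 'age' are my own additions for illustration:

```python
import pandas as pd

a = dict(name='fjl', sex='male')

# Labels listed in `index` are looked up in the dict. 'age' is not a key,
# so its value becomes a missing value (NaN).
e = pd.Series(a, index=['sex', 'age'])
print(e['sex'])                # male
print(e.isna().tolist())       # [False, True]

# A Series can be converted back to a dict as well.
print(pd.Series(a).to_dict())  # {'name': 'fjl', 'sex': 'male'}
```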

2. DataFrame

A DataFrame holds tabular data

1) Building a DataFrame from a dict

First, build a dict:

In [172]: student = {'name':['Bob','Tom','Fjl','Zzf','ZS'],
   .....: 'age':[23,22,23,24,25],
   .....: 'ID':['s1','s2','s3','s4','s5'],
   .....: 'sex':['male','male','famale','male','famale']}

In [173]: student
Out[173]:
{'ID': ['s1', 's2', 's3', 's4', 's5'],
 'age': [23, 22, 23, 24, 25],
 'name': ['Bob', 'Tom', 'Fjl', 'Zzf', 'ZS'],
 'sex': ['male', 'male', 'famale', 'male', 'famale']}

# Build a DataFrame from the dict
In [174]: student_DF=pd.DataFrame(student)

In [175]: student_DF
Out[175]:
   ID  age name     sex
0  s1   23  Bob    male
1  s2   22  Tom    male
2  s3   23  Fjl  famale
3  s4   24  Zzf    male
4  s5   25   ZS  famale

If not all of the data in the dict is needed for the DataFrame, columns=[] selects which columns to build:

In [176]: student_DF0 = pd.DataFrame(student,columns=['name'])

In [177]: student_DF0
Out[177]:
  name
0  Bob
1  Tom
2  Fjl
3  Zzf
4   ZS

The DataFrames built above all use the default index 0, 1, 2, ... If I would rather have my own labels, I can pass index=[] when constructing:

In [178]: student_DF1 = pd.DataFrame(student,index=['a','b','c','d','f'])

In [179]: student_DF1
Out[179]:
   ID  age name     sex
a  s1   23  Bob    male
b  s2   22  Tom    male
c  s3   23  Fjl  famale
d  s4   24  Zzf    male
f  s5   25   ZS  famale
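
The columns= and index= arguments can be combined in one call. Here is a short sketch on a smaller version of the student data; the column name 'score' is hypothetical and deliberately absent from the dict, to show that a requested column with no data is filled with NaN.

```python
import pandas as pd

student = {'name': ['Bob', 'Tom', 'Fjl'],
           'age': [23, 22, 23],
           'ID': ['s1', 's2', 's3']}

# Pick and order columns with `columns`, and label rows with `index`.
# 'score' is not a key of the dict, so that column is entirely NaN.
df = pd.DataFrame(student, columns=['ID', 'name', 'score'],
                  index=['a', 'b', 'c'])

print(list(df.columns))          # ['ID', 'name', 'score']
print(df['score'].isna().all())  # True
print(df.loc['b', 'name'])       # Tom
```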

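2) Adding a column to an existing DataFrame: sometimes the data I have built is incomplete, so how do I add a column to a DataFrame that already exists?

Here I add a column named "school" to every row of student_DF1, filled uniformly with 'BNU'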

In [183]: student_DF1['school']='BNU'

In [184]: student_DF1
Out[184]:
   ID  age name     sex school
a  s1   23  Bob    male    BNU
b  s2   22  Tom    male    BNU
c  s3   23  Fjl  famale    BNU
d  s4   24  Zzf    male    BNU
f  s5   25   ZS  famale    BNU

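However, the values to fill in are not always uniform, and for some rows the value is unknown and should be left empty. In that case, assign a Series; its values are matched to the DataFrame rows by index label: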

In [186]: hometown = pd.Series(['New York','YunNan','HaiNan'],index=['a','c','d'])

In [187]: hometown
Out[187]:
a    New York
c      YunNan
d      HaiNan
dtype: object

In [188]: student_DF1['hometown'] = hometown

In [189]: student_DF1
Out[189]:
   ID  age name     sex school  hometown
a  s1   23  Bob    male    BNU  New York
b  s2   22  Tom    male    BNU       NaN
c  s3   23  Fjl  famale    BNU    YunNan
d  s4   24  Zzf    male    BNU    HaiNan
f  s5   25   ZS  famale    BNU       NaN

In the example above, I do not know the hometowns of Tom and ZS, so their entries were filled with NaN (missing value) by default
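
The three kinds of column assignment behave differently, so a compact sketch may help: a scalar is broadcast to every row, a Series is aligned by index label, and a plain list is assigned by position. The small frame below and the placeholder 'unknown' are my own choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Tom', 'Fjl']}, index=['a', 'b', 'c'])

# A scalar is broadcast to every row.
df['school'] = 'BNU'

# A Series is aligned by label: the order here differs and 'b' is missing,
# so row 'b' receives NaN.
hometown = pd.Series(['YunNan', 'New York'], index=['c', 'a'])
df['hometown'] = hometown
print(df.loc['a', 'hometown'])          # New York
print(df['hometown'].isna().tolist())   # [False, True, False]

# A plain list is assigned by position and must match the number of rows.
df['age'] = [23, 22, 23]

# fillna returns a new column with the missing entries replaced.
filled = df['hometown'].fillna('unknown')
print(filled.tolist())                  # ['New York', 'unknown', 'YunNan']
```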

3) Sorting a DataFrame

In data analysis, data often needs to be sorted to make the work more efficient. Here are a few common ways to sort:

[1] sort_index() sorts the data by its index

In [192]: student_DF1.sort_index()
Out[192]:
   ID  age name     sex school  hometown
a  s1   23  Bob    male    BNU  New York
b  s2   22  Tom    male    BNU       NaN
c  s3   23  Fjl  famale    BNU    YunNan
d  s4   24  Zzf    male    BNU    HaiNan
f  s5   25   ZS  famale    BNU       NaN

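Because the index of my original data was already in order, calling this method changes nothing

The default order is ascending; descending order can be requested as well: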

sort_index(ascending=False)

Note: the sorted result is returned as a new copy; the original data is unchanged

In [194]: student_DF1.sort_index(ascending=False)
Out[194]:
   ID  age name     sex school  hometown
f  s5   25   ZS  famale    BNU       NaN
d  s4   24  Zzf    male    BNU    HaiNan
c  s3   23  Fjl  famale    BNU    YunNan
b  s2   22  Tom    male    BNU       NaN
a  s1   23  Bob    male    BNU  New York

In [195]: student_DF1
Out[195]:
   ID  age name     sex school  hometown
a  s1   23  Bob    male    BNU  New York
b  s2   22  Tom    male    BNU       NaN
c  s3   23  Fjl  famale    BNU    YunNan
d  s4   24  Zzf    male    BNU    HaiNan
f  s5   25   ZS  famale    BNU       NaN

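[2] sort_values(by='') sorts by the values of the specified column. String comparison is case-sensitive, which is why 'ZS' sorts before 'Zzf' in the output below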

In [197]: student_DF1.sort_values(by='name')
Out[197]:
   ID  age name     sex school  hometown
a  s1   23  Bob    male    BNU  New York
c  s3   23  Fjl  famale    BNU    YunNan
b  s2   22  Tom    male    BNU       NaN
f  s5   25   ZS  famale    BNU       NaN
d  s4   24  Zzf    male    BNU    HaiNan
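
sort_values also accepts several columns at once, each with its own direction. A minimal sketch on a reduced version of the student table; by_age is an illustrative name of my own:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Tom', 'Fjl', 'Zzf'],
                   'age': [23, 22, 23, 24]},
                  index=['a', 'b', 'c', 'd'])

# Sort by age descending, then break ties by name ascending.
by_age = df.sort_values(by=['age', 'name'], ascending=[False, True])
print(list(by_age.index))   # ['d', 'a', 'c', 'b']

# The original frame keeps its order; reassign to keep the sorted result.
print(list(df.index))       # ['a', 'b', 'c', 'd']
```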

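4) Deleting from a DataFrame

drop('index label') deletes a row;

drop('column name', axis=1) deletes a column

Note: the result of the deletion is returned as a new copy; the original data is unchanged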

In [199]: student_DF1.drop('a')
Out[199]:
   ID  age name     sex school hometown
b  s2   22  Tom    male    BNU      NaN
c  s3   23  Fjl  famale    BNU   YunNan
d  s4   24  Zzf    male    BNU   HaiNan
f  s5   25   ZS  famale    BNU      NaN

As a result, the row with index 'a' has been removed

In [200]: student_DF1
Out[200]:
   ID  age name     sex school  hometown
a  s1   23  Bob    male    BNU  New York
b  s2   22  Tom    male    BNU       NaN
c  s3   23  Fjl  famale    BNU    YunNan
d  s4   24  Zzf    male    BNU    HaiNan
f  s5   25   ZS  famale    BNU       NaN

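The output above confirms that the result is returned as a copy and the original data is not modified, which keeps the original data safe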

In [201]: student_DF1.drop(['school'],axis=1)
Out[201]:
   ID  age name     sex  hometown
a  s1   23  Bob    male  New York
b  s2   22  Tom    male       NaN
c  s3   23  Fjl  famale    YunNan
d  s4   24  Zzf    male    HaiNan
f  s5   25   ZS  famale       NaN

The column named 'school' has been removed
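
As an alternative to axis=1, drop also takes the keyword arguments index= and columns=, which I find easier to read. A short sketch on a reduced table; no_rows and no_col are names I made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'Tom', 'Fjl'],
                   'age': [23, 22, 23],
                   'school': ['BNU', 'BNU', 'BNU']},
                  index=['a', 'b', 'c'])

no_rows = df.drop(index=['a', 'c'])     # delete rows by label
no_col = df.drop(columns=['school'])    # delete a column by name
print(list(no_rows.index))    # ['b']
print(list(no_col.columns))   # ['name', 'age']
print(df.shape)               # (3, 3), the original is untouched

# To keep the result, assign it back to the same name.
df = df.drop(columns=['school'])
print(list(df.columns))       # ['name', 'age']
```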

5) Adding a new column to a DataFrame

df = pd.DataFrame(data = {
    'age':[22,23,24],
    'name':['FJl','zzf','zdz']
    }, index = ['Fitst',1,3])

>>> df
       age name
Fitst   22  FJl
1       23  zzf
3       24  zdz

>>> df['sex'] = ['f','m','m']
>>> df
       age name sex
Fitst   22  FJl   f
1       23  zzf   m
3       24  zdz   m

Adding a row to a DataFrame

>>> df.loc[len(df)+1] = [25,'sh','f']
>>> df
       age name sex
Fitst   22  FJl   f
1       23  zzf   m
3       24  zdz   m
4       25   sh   f

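Note that the new row's index label is simply whatever is passed to loc. Here len(df)+1 evaluates to 4, which happens to be one more than the last label; loc itself does not auto-increment from the last index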
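
To show that the label really is up to the caller, the sketch below appends a row under an arbitrary label and then joins on another small frame with pd.concat. The label 'new', the frame extra, and its values are all made up for illustration, and the frame is rebuilt without the sex column to keep the example short.

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 23, 24], 'name': ['FJl', 'zzf', 'zdz']},
                  index=['First', 1, 3])

# Assigning to a label that does not exist yet appends a row with that label.
df.loc['new'] = [25, 'sh']
print(list(df.index))           # ['First', 1, 3, 'new']

# For several rows at once, build another DataFrame and concatenate.
extra = pd.DataFrame({'age': [26], 'name': ['ls']}, index=[7])
combined = pd.concat([df, extra])
print(len(combined))            # 5
print(combined.loc[7, 'name'])  # ls
```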

loc[index label] selects any row of the DataFrame by its label, and the values of that row can then be changed

>>> df.loc[3] = [25,'changed value','f']
>>> df
       age           name sex
Fitst   22            FJl   f
1       23            zzf   m
3       25  changed value   f
4       25             sh   f
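
Because this index mixes strings and integers, it is easy to confuse a label with a position. The sketch below contrasts loc (labels), iloc (0-based positions), and at (a single cell by label); the replacement values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 23, 24], 'name': ['FJl', 'zzf', 'zdz']},
                  index=['First', 1, 3])

# loc uses labels: 3 is the label of the last row, not the fourth position.
df.loc[3, 'age'] = 25
print(df.loc[3, 'age'])     # 25

# iloc uses 0-based positions: position 1 is the row labeled 1 here.
df.iloc[1, df.columns.get_loc('name')] = 'renamed'
print(df.loc[1, 'name'])    # renamed

# at reads or writes a single cell by label.
df.at['First', 'age'] = 30
print(df['age'].tolist())   # [30, 23, 25]
```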
