(2) Using the Pandas Library in Python


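pandas has two main data structures: Series and DataFrame.

1. Series

A Series is a one-dimensional, array-like object. It consists of a set of data (of any NumPy data type) and an associated set of data labels (the index), that is, two parts: index and values. A single value or a group of values can be selected from a Series through the index.

1.1 Creating a Series

pd.Series(list, index=[ ]): the second argument is the index for the data in the Series and can be omitted.

  • The first argument can be a list
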
import pandas as pd
import numpy as np
from pandas import Series

# Create a Series object
obj = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
print(obj)

Output:

a    1
b    2
c    3
d    4
dtype: int64
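
As noted above, the index argument can be omitted. The following is a minimal sketch (the variable names are illustrative and not from the original) showing that pandas then assigns a default integer index starting at 0, and that a dict can also be passed as the first argument, in which case its keys become the index labels.

```python
import pandas as pd

# Without an explicit index, pandas assigns the default integer index 0..n-1
s_default = pd.Series([1, 2, 3, 4])
print(s_default.index.tolist())    # [0, 1, 2, 3]

# A dict can also be passed; its keys become the index labels
s_from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(s_from_dict.index.tolist())  # ['a', 'b', 'c']
```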

1.2 Getting the values of a Series

import pandas as pd
import numpy as np
from pandas import Series

# Create a Series object
obj1 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj1)
# Get the values of the Series
print(obj1.values)
# Get the index of the Series
print(obj1.index)

Output:

a    1
b    2
c    3
d    4
dtype: int64
[1 2 3 4]
Index(['a', 'b', 'c', 'd'], dtype='object')

1.3 Indexing and slicing a Series

A Series supports two ways of accessing data:

  • By position
  • By index label

(1) Getting a single value:

import pandas as pd
import numpy as np
from pandas import Series

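# Create a Series object
obj2 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj2)
# Access by index label
print(obj2['c'])
# Access by position (iloc makes the positional lookup explicit; plain obj2[2] on a label index is deprecated in recent pandas)
print(obj2.iloc[2])

Output: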

a    1
b    2
c    3
d    4
dtype: int64
3
3

(2) Getting specific non-contiguous values

import pandas as pd
import numpy as np
from pandas import Series

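# Create a Series object
obj3 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj3)
# Access by a list of index labels
print(obj3[['c','a']])
# Access by a list of positions (iloc makes the positional lookup explicit)
print(obj3.iloc[[2,1]])

Output: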

a    1
b    2
c    3
d    4
dtype: int64
c    3
a    1
dtype: int64
c    3
b    2
dtype: int64

(3) Getting contiguous values (slicing)

import pandas as pd
import numpy as np
from pandas import Series

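# Create a Series object
obj4 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj4)
# Slice by index label (the end label 'd' is included)
print(obj4['b':'d'])
# Slice by position (the end position 3 is excluded)
print(obj4[1:3])

Output: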

a    1
b    2
c    3
d    4
dtype: int64
b    2
c    3
d    4
dtype: int64
b    2
c    3
dtype: int64
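
The output above shows that a label slice includes its end label while a position slice excludes its end position. A minimal sketch of the same behaviour using the explicit loc and iloc indexers (the variable names here are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b':'d']    # label slice: both 'b' and 'd' are included
by_position = s.iloc[1:3]    # position slice: position 3 is excluded

print(by_label.tolist())     # [2, 3, 4]
print(by_position.tolist())  # [2, 3]
```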

1.4 The index of a Series

(1) Assigning a new index

import pandas as pd
import numpy as np
from pandas import Series

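# Create a Series object
obj5 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj5)
# Assign new labels to the index (modifies obj5 in place)
obj5.index = ['x','y','z','w']
print(obj5)

Output: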

a    1
b    2
c    3
d    4
dtype: int64
x    1
y    2
z    3
w    4
dtype: int64

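(2) Reindexing
The reindex method returns a new Series whose elements are rearranged according to the given index; labels that are missing from the original are filled with NaN.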

import pandas as pd
import numpy as np
from pandas import Series

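# Create a Series object
obj6 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj6)
# Reindex; label 'e' does not exist in obj6, so its value becomes NaN
obj7 = obj6.reindex(['c','a','b','d','e'])
print(obj7)

Output: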

a    1
b    2
c    3
d    4
dtype: int64
c    3.0
a    1.0
b    2.0
d    4.0
e    NaN
dtype: float64
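
In the output above, the missing label is filled with NaN, which also turns the values into floats. If a different default is preferred, reindex accepts a fill_value argument. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Fill labels that are missing from the original with 0 instead of NaN
filled = s.reindex(['c', 'a', 'b', 'd', 'e'], fill_value=0)

print(filled.tolist())   # [3, 1, 2, 4, 0]
print(s.index.tolist())  # ['a', 'b', 'c', 'd'] (the original is unchanged)
```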

1.5 Deleting

The drop method deletes the values with the given index labels and returns a new Series.

import pandas as pd
import numpy as np
from pandas import Series

# Create a Series object
obj8 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj8)
# Delete the values labelled 'c' and 'a'
obj9 = obj8.drop(['c','a'])
print(obj9)

Output:

a    1
b    2
c    3
d    4
dtype: int64
b    2
d    4
dtype: int64
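
Because drop returns a new Series, the original keeps all of its values. The sketch below (variable names are illustrative) checks this, and also shows the errors='ignore' option for the case where a label might not exist; without it, dropping a missing label raises a KeyError.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

dropped = s.drop('b')                    # returns a new Series without 'b'
print(dropped.index.tolist())            # ['a', 'c', 'd']
print(s.index.tolist())                  # ['a', 'b', 'c', 'd'] (original unchanged)

# 'z' is not in the index; errors='ignore' skips it instead of raising KeyError
safe = s.drop(['a', 'z'], errors='ignore')
print(safe.index.tolist())               # ['b', 'c', 'd']
```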

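2. DataFrame

A DataFrame (data table) is a two-dimensional data structure in tabular form, divided into rows and columns.
A DataFrame supports many common operations: selecting and replacing row/column data, reshaping the table, modifying the index, filtering, and so on.
Data in many formats can be converted into a DataFrame object.
The DataFrame constructor takes three main parameters: data (the data), index (the row index), and columns (the column index).

2.1 Creating a DataFrame

(1) From a two-dimensional array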

import pandas as pd
import numpy as np
from pandas import DataFrame

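# Create a DataFrame from a two-dimensional array of random integers in [0, 10)
df1 = DataFrame(np.random.randint(0,10,(4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
print(df1)

Output (the values are random, so they differ on each run):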

   a  b  c  d
1  3  8  3  5
2  0  1  5  9
3  7  4  9  2
4  4  8  9  2

(2) From a dictionary

import pandas as pd
import numpy as np
from pandas import DataFrame

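# Create a DataFrame from a dict: the row index is set by index, the column index by the dict keys
city_dict = {  # named city_dict to avoid shadowing the built-in dict
   "city":['Guangdong','Shanghai','Beijing','Shenzhen'],
   "pop":[10, 20, 30, 40],
   'year':[2017,2015,2016,2018]}
df2 = DataFrame(data=city_dict,index=[1,2,3,4])
print(df2)

Output: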

        city  pop  year
1  Guangdong   10  2017
2   Shanghai   20  2015
3    Beijing   30  2016
4   Shenzhen   40  2018

(3) From a dictionary with the from_dict method

import pandas as pd
import numpy as np
from pandas import DataFrame

dict2 = {'a':[1,2,3],'b':[4,5,6]}
df3 = pd.DataFrame.from_dict(dict2)
print(df3)

Output:

   a  b
0  1  4
1  2  5
2  3  6

(4) When a DataFrame is created from several Series, values are aligned by index label, and missing values are filled with NaN

import pandas as pd
import numpy as np
from pandas import DataFrame

data = {
    "Name":pd.Series(['zhangsan','lisi','wangwu','zhaoliu'],index=['a','b','c','d']),
    "Age":pd.Series([18,20,30],index=['a','b','c']),
    "Country":pd.Series(['CH','JP','US'],index=['a','c','d'])}
df4 = pd.DataFrame(data)
print(df4)

Output:

    Age Country      Name
a  18.0      CH  zhangsan
b  20.0     NaN      lisi
c  30.0      JP    wangwu
d   NaN      US   zhaoliu
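
The Age column above is printed as 18.0, 20.0, 30.0 because the NaN inserted for the missing label 'd' forces the column to a floating-point type. A minimal sketch that checks this (variable names are illustrative):

```python
import pandas as pd

name = pd.Series(['zhangsan', 'lisi', 'wangwu', 'zhaoliu'], index=['a', 'b', 'c', 'd'])
age = pd.Series([18, 20, 30], index=['a', 'b', 'c'])   # no entry for 'd'

aligned = pd.DataFrame({'Name': name, 'Age': age})

print(aligned.index.tolist())         # ['a', 'b', 'c', 'd'] (union of both indexes)
print(aligned['Age'].dtype)           # float64, because of the NaN at 'd'
print(aligned['Age'].isna().tolist()) # [False, False, False, True]
```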

2.2 The shape of a DataFrame

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)

# Get the number of rows and columns (the shape) of the DataFrame
print(df.shape)

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
(4, 3)

2.3 Index

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)

# Get the row index of the DataFrame
print("Row index:",df.index.tolist())
# Get the column index
print("Column index:",df.columns.tolist())

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
Row index: ['a', 'b', 'c', 'd']
Column index: ['age', 'country', 'name']

2.4 Data types

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)

# Get the data type of each column of the DataFrame
print(df.dtypes)

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
age         int64
country    object
name       object
dtype: object

2.5 Dimensions

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)

# Get the number of dimensions of the DataFrame
print(df.ndim)

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
2

2.6 Getting data from a DataFrame

(1) Getting one column of a DataFrame

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)

print(df['name'])
print("-"*30)
print(type(df['name']))

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
a    zhangsan
b        lisi
c      wangwu
d     zhaoliu
Name: name, dtype: object
------------------------------
<class 'pandas.core.series.Series'>

(2) Getting several columns of a DataFrame

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)

print(df[['name','age']])

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
       name  age
a  zhangsan   18
b      lisi   20
c    wangwu   19
d   zhaoliu   14

(3) Getting several rows of a DataFrame

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)

print(df[0:2])

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
   age country      name
a   18   china  zhangsan
b   20      us      lisi

(4) Getting some columns of some rows of a DataFrame

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)

print(df[1:3][['name','country']])

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
     name country
b    lisi      us
c  wangwu   china

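2.7 Getting DataFrame data with the loc and iloc indexers

  • loc: selects data by label
  • iloc: selects data by position (integer offset)

(1) loc: getting one row and one column (a single value)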

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)

# Get a single value by row label and column label
print(df.loc['a','country'])

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
----------------------------------------
china

(2) loc: getting all columns of one row

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)

# Get one row
print(df.loc['b',:])

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
----------------------------------------
age          20
country      us
name       lisi
Name: b, dtype: object

(3) loc: getting some columns of some rows

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)

# Get some columns of several non-adjacent rows
print(df.loc[['a','c'],['name','age']])

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
----------------------------------------
       name  age
a  zhangsan   18
c    wangwu   19

(4) iloc: getting a single value

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'],columns = ['name','age','country'])
print(df)
print('-'*40)

# Get a single value by row position and column position
print(df.iloc[2,1])

Output:

       name  age country
a  zhangsan   18   china
b      lisi   20      us
c    wangwu   19   china
d   zhaoliu   14      us
----------------------------------------
19

(5) iloc: getting one row

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'],columns = ['name','age','country'])
print(df)
print('-'*40)

print(df.iloc[1])

Output:

       name  age country
a  zhangsan   18   china
b      lisi   20      us
c    wangwu   19   china
d   zhaoliu   14      us
----------------------------------------
name       lisi
age          20
country      us
Name: b, dtype: object

(6) iloc: getting several non-adjacent rows

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'],columns = ['name','age','country'])
print(df)
print('-'*40)

print(df.iloc[[0,2],:])

Output:

       name  age country
a  zhangsan   18   china
b      lisi   20      us
c    wangwu   19   china
d   zhaoliu   14      us
----------------------------------------
       name  age country
a  zhangsan   18   china
c    wangwu   19   china
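
Besides labels and lists of labels, loc also accepts a label slice or a boolean condition, which covers the filtering operation mentioned at the start of this section. The following is a minimal sketch; the variable names and the age threshold are illustrative assumptions, not from the original.

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ['zhangsan', 'lisi', 'wangwu', 'zhaoliu'],
     "age": [18, 20, 19, 14],
     "country": ['china', 'us', 'china', 'us']},
    index=['a', 'b', 'c', 'd'])

# Label slice: rows 'a' through 'c' (end label included), one column
first_three = people.loc['a':'c', 'name']
print(first_three.tolist())     # ['zhangsan', 'lisi', 'wangwu']

# Boolean condition: keep only the rows where age is at least 18
adults = people.loc[people['age'] >= 18, ['name', 'age']]
print(adults.index.tolist())    # ['a', 'b', 'c']
```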

2.8 Modifying data in a DataFrame

import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)

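df.loc['a','name'] = 'python'  # set a value by row label and column label
df.iloc[1,0] = 'C++'  # set a value by position: row 1 ('b'), column 0
print(df)

Output (from an older pandas version that sorted dict keys alphabetically, so column 0 is 'age'; newer versions keep the dict order, so column 0 is 'name'):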

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
----------------------------------------
   age country     name
a   18   china   python
b  C++      us     lisi
c   19   china   wangwu
d   14      us  zhaoliu
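
Assignment through loc also works with a boolean condition, which updates every matching row at once. The sketch below uses illustrative variable names and replacement values that are not from the original, and it keeps each column's data type unchanged (strings into a string column, an integer into an integer column).

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ['zhangsan', 'lisi', 'wangwu', 'zhaoliu'],
     "age": [18, 20, 19, 14],
     "country": ['china', 'us', 'china', 'us']},
    index=['a', 'b', 'c', 'd'])

# Update one column for every row that matches a condition
people.loc[people['country'] == 'us', 'country'] = 'usa'

# Update a single cell by row label and column label
people.loc['d', 'age'] = 15

print(people['country'].tolist())  # ['china', 'usa', 'china', 'usa']
print(people.loc['d', 'age'])      # 15
```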

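2.9 Sorting the data in a DataFrame

The sort_values() method sorts a DataFrame by the values of a given column.

df.sort_values(by='age',ascending=False):

  • by: the column to sort by ('age' here)
  • ascending: the sort order; False sorts in descending order, and the default True sorts in ascending order
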
import pandas as pd
from pandas import DataFrame
import numpy as np

df_dict = {
    "name":['zhangsan','lisi','wangwu','zhaoliu'],
    "age":[18,20,19,14],
    'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)

df = df.sort_values(by='age',ascending=False) # sort by the "age" column in descending order
print(df)

Output:

   age country      name
a   18   china  zhangsan
b   20      us      lisi
c   19   china    wangwu
d   14      us   zhaoliu
----------------------------------------
   age country      name
b   20      us      lisi
c   19   china    wangwu
a   18   china  zhangsan
d   14      us   zhaoliu
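
sort_values also accepts a list of columns, with one sort direction per column, and sort_index restores the order of the row labels. A minimal sketch with illustrative variable names:

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ['zhangsan', 'lisi', 'wangwu', 'zhaoliu'],
     "age": [18, 20, 19, 14],
     "country": ['china', 'us', 'china', 'us']},
    index=['a', 'b', 'c', 'd'])

# Sort by country ascending, then by age descending within each country
by_country_age = people.sort_values(by=['country', 'age'], ascending=[True, False])
print(by_country_age.index.tolist())    # ['c', 'a', 'b', 'd']

# Sort by the row labels to get back the original order
restored = by_country_age.sort_index()
print(restored.index.tolist())          # ['a', 'b', 'c', 'd']
```

Both methods return a new DataFrame, which is why the original example assigns the result back to df.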

Reposted from blog.csdn.net/zstu_lihang/article/details/106861378