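pandas has two primary data structures: Series and DataFrame.
1.Series
A Series is a one-dimensional, array-like object consisting of a sequence of values (of any NumPy data type) and an associated set of data labels, called the index. It therefore has two parts, index and values, and a single value or a group of values can be selected by indexing.
1.1 Creating a Series
pd.Series(data, index=[...]): the second argument supplies the index labels for the data and may be omitted.
- The first argument can be a list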
import pandas as pd
import numpy as np
from pandas import Series
# Create a Series object
obj = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
print(obj)
Output:
a 1
b 2
c 3
d 4
dtype: int64
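Since the index argument is optional and the data does not have to be a list, a few other ways of building a Series may be worth seeing side by side. This is a minimal sketch; the variable names are illustrative only and do not come from the original example.

```python
import numpy as np
import pandas as pd

# No index given: pandas assigns the default integer index 0..n-1.
s_default = pd.Series([1, 2, 3, 4])

# From a dict: the keys become the index labels.
s_from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array with explicit labels.
s_from_array = pd.Series(np.array([1.5, 2.5]), index=["x", "y"])

print(s_default.index.tolist())
print(s_from_dict.index.tolist())
print(s_from_array.dtype)
```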
1.2 Getting the values of a Series
import pandas as pd
import numpy as np
from pandas import Series
# Create a Series object
obj1 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj1)
# Get the values of the Series
print(obj1.values)
# Get the index of the Series
print(obj1.index)
Output:
a 1
b 2
c 3
d 4
dtype: int64
[1 2 3 4]
Index(['a', 'b', 'c', 'd'], dtype='object')
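1.3 Indexing and slicing a Series
A Series supports two ways of selecting data:
- by position
- by index label
(1) Selecting a single value:
import pandas as pd
import numpy as np
from pandas import Series
# Create a Series object
obj2 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj2)
# Select by index label
print(obj2['c'])
# Select by position (obj2[2] relies on a positional fallback that is deprecated in recent pandas, so iloc is used)
print(obj2.iloc[2])
Output: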
a 1
b 2
c 3
d 4
dtype: int64
3
3
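(2) Selecting several non-contiguous values
import pandas as pd
import numpy as np
from pandas import Series
# Create a Series object
obj3 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj3)
# Select by index label
print(obj3[['c','a']])
# Select by position (iloc makes the positional lookup explicit)
print(obj3.iloc[[2,1]])
Output: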
a 1
b 2
c 3
d 4
dtype: int64
c 3
a 1
dtype: int64
c 3
b 2
dtype: int64
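(3) Selecting a contiguous range (slicing)
import pandas as pd
import numpy as np
from pandas import Series
# Create a Series object
obj4 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj4)
# Slice by index label (the end label is included)
print(obj4['b':'d'])
# Slice by position (the end position is excluded)
print(obj4[1:3])
Output: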
a 1
b 2
c 3
d 4
dtype: int64
b 2
c 3
d 4
dtype: int64
b 2
c 3
dtype: int64
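The two slices above differ in whether the end point is included. A short sketch using the explicit loc and iloc indexers makes that difference easier to see; the variable names here are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Label slice: both end labels are included.
by_label = s.loc["b":"d"]

# Positional slice: the end position is excluded, as in plain Python.
by_position = s.iloc[1:3]

print(by_label.tolist())
print(by_position.tolist())
```

Writing loc or iloc explicitly also avoids any ambiguity when the index labels are themselves integers.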
1.4 The index of a Series
(1) Assigning a new index
import pandas as pd
import numpy as np
from pandas import Series
# Create a Series object
obj5 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj5)
# Assign new labels to the index
obj5.index = ['x','y','z','w']
print(obj5)
Output:
a 1
b 2
c 3
d 4
dtype: int64
x 1
y 2
z 3
w 4
dtype: int64
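Assigning to index replaces every label at once, and the new list has to be as long as the Series. As a possible alternative that is not part of the original example, Series.rename can change only selected labels and returns a new Series; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Relabel only 'a' and 'd'; the other labels are kept as they are.
renamed = s.rename({"a": "x", "d": "w"})

print(renamed.index.tolist())
print(s.index.tolist())  # the original Series keeps its labels
```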
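(2) Reindexing
The reindex method returns a new Series whose values are rearranged to match the given index; labels that have no corresponding value are filled with NaN.
import pandas as pd
import numpy as np
from pandas import Series
# Create a Series object
obj6 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj6)
# Reindex; the label 'e' has no value in obj6, so it becomes NaN
obj7 = obj6.reindex(['c','a','b','d','e'])
print(obj7)
Output: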
a 1
b 2
c 3
d 4
dtype: int64
c 3.0
a 1.0
b 2.0
d 4.0
e NaN
dtype: float64
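The NaN for the new label also forces the values to float64, as the output shows. If a different placeholder is preferable, reindex accepts a fill_value argument; the sketch below compares the two behaviours, with variable names chosen only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Default behaviour: the new label 'e' gets NaN.
with_nan = s.reindex(["c", "a", "b", "d", "e"])

# With fill_value, the new label gets 0 instead of NaN.
with_zero = s.reindex(["c", "a", "b", "d", "e"], fill_value=0)

print(with_nan.isna().sum())
print(with_zero.tolist())
```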
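1.5 Deleting values
The drop method returns a new Series with the values at the given index labels removed; the original Series is left unchanged.
import pandas as pd
import numpy as np
from pandas import Series
# Create a Series object
obj8 = Series(data=[1,2,3,4],index= ['a','b','c','d']) # data: the values; index: custom labels for the values
# Print the whole Series
print(obj8)
# Drop the values labelled 'c' and 'a'
obj9 = obj8.drop(['c','a'])
print(obj9)
Output: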
a 1
b 2
c 3
d 4
dtype: int64
b 2
d 4
dtype: int64
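Two details of drop are easy to miss: it works on labels rather than positions, and by default it raises a KeyError for a label that does not exist. The following sketch illustrates both, using the errors="ignore" option to skip unknown labels; the names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# drop returns a new Series; s itself still has four values.
trimmed = s.drop(["c", "a"])

# An unknown label normally raises KeyError; errors="ignore" skips it.
unchanged = s.drop(["z"], errors="ignore")

print(trimmed.index.tolist())
print(len(s))
print(unchanged.equals(s))
```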
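2.DataFrame
A DataFrame (data table) is a two-dimensional data structure in tabular form, organised into rows and columns.
A DataFrame supports many common operations: selecting and replacing row or column data, reshaping the table, modifying the index, filtering, and so on.
Data in many different formats can be converted into a DataFrame object.
The DataFrame constructor takes three main parameters: data (the data), index (the row labels) and columns (the column labels).
2.1 Creating a DataFrame
(1) From a two-dimensional array
import pandas as pd
import numpy as np
from pandas import DataFrame
# Create a DataFrame from a two-dimensional array
df1 = DataFrame(np.random.randint(0,10,(4,4)),index=[1,2,3,4],columns=['a','b','c','d'])
print(df1)
Output (the values are random, so they differ from run to run):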
a b c d
1 3 8 3 5
2 0 1 5 9
3 7 4 9 2
4 4 8 9 2
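(2) From a dictionary
import pandas as pd
import numpy as np
from pandas import DataFrame
# Create a DataFrame from a dict: the row labels come from index, the column labels from the dict keys
# (the variable is named city_data so that it does not shadow the built-in dict)
city_data = {
"city":['Guangdong','Shanghai','Beijing','Shenzhen'],
"pop":[10, 20, 30, 40],
'year':[2017,2015,2016,2018]}
df2 = DataFrame(data=city_data,index=[1,2,3,4])
print(df2)
Output: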
city pop year
1 Guangdong 10 2017
2 Shanghai 20 2015
3 Beijing 30 2016
4 Shenzhen 40 2018
(3) From a dictionary with the from_dict method
import pandas as pd
import numpy as np
from pandas import DataFrame
dict2 = {'a':[1,2,3],'b':[4,5,6]}
df3 = pd.DataFrame.from_dict(dict2)
print(df3)
Output:
a b
0 1 4
1 2 5
2 3 6
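By default from_dict treats the dict keys as column labels, which is what the output above shows. It also accepts orient="index" to treat the keys as row labels instead. The sketch below shows both orientations; the column names x, y and z are made up for illustration.

```python
import pandas as pd

data = {"a": [1, 2, 3], "b": [4, 5, 6]}

# Default orient="columns": the keys 'a' and 'b' become columns.
by_columns = pd.DataFrame.from_dict(data)

# orient="index": the keys become row labels; columns names the new columns.
by_rows = pd.DataFrame.from_dict(data, orient="index", columns=["x", "y", "z"])

print(by_columns.shape)
print(by_rows)
```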
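(4) When a DataFrame is built from Series objects, values that share an index label are aligned with each other, and missing values are filled with NaN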
import pandas as pd
import numpy as np
from pandas import DataFrame
data = {
"Name":pd.Series(['zhangsan','lisi','wangwu','zhaoliu'],index=['a','b','c','d']),
"Age":pd.Series([18,20,30],index=['a','b','c']),
"Country":pd.Series(['CH','JP','US'],index=['a','c','d'])}
df4 = pd.DataFrame(data)
print(df4)
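Output (from an older pandas release that ordered the columns alphabetically; recent releases keep the dict's key order, i.e. Name, Age, Country):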
Age Country Name
a 18.0 CH zhangsan
b 20.0 NaN lisi
c 30.0 JP wangwu
d NaN US zhaoliu
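To check how many values were filled in by this alignment, the missing entries can be counted per column. The sketch below rebuilds the same data and uses isna; it also shows that the Age column is stored as float64, because NaN is a floating-point value.

```python
import pandas as pd

data = {
    "Name": pd.Series(["zhangsan", "lisi", "wangwu", "zhaoliu"], index=["a", "b", "c", "d"]),
    "Age": pd.Series([18, 20, 30], index=["a", "b", "c"]),
    "Country": pd.Series(["CH", "JP", "US"], index=["a", "c", "d"]),
}
df = pd.DataFrame(data)

# Count the missing values in each column.
missing = df.isna().sum()
print(missing)

# Age was built from integers but now holds a NaN, so its dtype is float64.
print(df["Age"].dtype)
```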
2.2 The shape of a DataFrame
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
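# Get the number of rows and columns (the shape) of the DataFrame
print(df.shape)
Output (here and in the following sections the columns appear in alphabetical order because the listings come from an older pandas release; recent releases keep the dict's key order, i.e. name, age, country):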
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
(4, 3)
2.3 Index
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
# Get the row index of the DataFrame
print("Row index:",df.index.tolist())
# Get the column index
print("Column index:",df.columns.tolist())
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
Row index: ['a', 'b', 'c', 'd']
Column index: ['age', 'country', 'name']
2.4 Data types
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
# Get the data type of each column of the DataFrame
print(df.dtypes)
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
age int64
country object
name object
dtype: object
2.5 Dimensions
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
# Get the number of dimensions of the DataFrame
print(df.ndim)
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
2
2.6 Selecting data from a DataFrame
(1) Selecting a single column
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print(df['name'])
print("-"*30)
print(type(df['name']))
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
a zhangsan
b lisi
c wangwu
d zhaoliu
Name: name, dtype: object
------------------------------
<class 'pandas.core.series.Series'>
(2) Selecting several columns
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print(df[['name','age']])
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
name age
a zhangsan 18
b lisi 20
c wangwu 19
d zhaoliu 14
(3) Selecting several rows
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print(df[0:2])
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
age country name
a 18 china zhangsan
b 20 us lisi
(4) Selecting some columns from several rows
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print(df[1:3][['name','country']])
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
name country
b lisi us
c wangwu china
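The expression above selects in two chained steps, rows first and then columns. As a sketch of an equivalent single-step selection, the same rows and columns can be requested by label with loc; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["zhangsan", "lisi", "wangwu", "zhaoliu"],
        "age": [18, 20, 19, 14],
        "country": ["china", "us", "china", "us"],
    },
    index=["a", "b", "c", "d"],
)

# Two chained steps: slice the rows by position, then pick the columns.
chained = df[1:3][["name", "country"]]

# One step with labels (the end label 'c' is included).
single_step = df.loc["b":"c", ["name", "country"]]

print(single_step)
print(chained.equals(single_step))
```

The single-step form is usually the safer habit, since assigning through a chained selection may modify a temporary copy rather than the original DataFrame.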
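2.7 Selecting data from a DataFrame with loc and iloc
- loc: selects rows (and columns) by label
- iloc: selects rows (and columns) by integer position
(1) Using loc to select one row and one column (a single value)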
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)
# Get a single value
print(df.loc['a','country'])
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
----------------------------------------
china
(2) Using loc to select all columns of one row
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)
# Get a single row
print(df.loc['b',:])
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
----------------------------------------
age 20
country us
name lisi
Name: b, dtype: object
(3) Using loc to select some columns of some rows
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)
# Select some columns from several non-adjacent rows
print(df.loc[['a','c'],['name','age']])
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
----------------------------------------
name age
a zhangsan 18
c wangwu 19
(4) Using iloc to select a single value
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'],columns = ['name','age','country'])
print(df)
print('-'*40)
# Get a single value (row 2, column 1)
print(df.iloc[2,1])
Output:
name age country
a zhangsan 18 china
b lisi 20 us
c wangwu 19 china
d zhaoliu 14 us
----------------------------------------
19
(5) Using iloc to select one row
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'],columns = ['name','age','country'])
print(df)
print('-'*40)
print(df.iloc[1])
Output:
name age country
a zhangsan 18 china
b lisi 20 us
c wangwu 19 china
d zhaoliu 14 us
----------------------------------------
name lisi
age 20
country us
Name: b, dtype: object
(6) Using iloc to select several non-adjacent rows
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'],columns = ['name','age','country'])
print(df)
print('-'*40)
print(df.iloc[[0,2],:])
Output:
name age country
a zhangsan 18 china
b lisi 20 us
c wangwu 19 china
d zhaoliu 14 us
----------------------------------------
name age country
a zhangsan 18 china
c wangwu 19 china
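iloc accepts slices and lists for both the row and the column position, so a rectangular block can be selected in one call. The sketch below uses the same data with an explicit column order; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["zhangsan", "lisi", "wangwu", "zhaoliu"],
        "age": [18, 20, 19, 14],
        "country": ["china", "us", "china", "us"],
    },
    index=["a", "b", "c", "d"],
    columns=["name", "age", "country"],
)

# Rows at positions 1 and 2, columns at positions 0 and 1 (ends excluded).
block = df.iloc[1:3, 0:2]

# Every row, but only the columns at positions 0 and 2.
two_columns = df.iloc[:, [0, 2]]

# Negative positions count from the end: this is the last row.
last_row = df.iloc[-1]

print(block)
print(two_columns.columns.tolist())
print(last_row["name"])
```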
2.8 Modifying data in a DataFrame
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
# The column order is given explicitly so that the positional assignment below is unambiguous
df = DataFrame(data=df_dict,index=['a','b','c','d'],columns = ['name','age','country'])
print(df)
print('-'*40)
# Modify by label: row 'a', column 'name'
df.loc['a','name'] = 'python'
# Modify by position: row 1, column 0 (the 'name' column)
df.iloc[1,0] = 'C++'
print(df)
Output:
name age country
a zhangsan 18 china
b lisi 20 us
c wangwu 19 china
d zhaoliu 14 us
----------------------------------------
name age country
a python 18 china
b C++ 20 us
c wangwu 19 china
d zhaoliu 14 us
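Besides single cells, loc can update every row that matches a condition, and a whole column can be reassigned at once. The following is a minimal sketch; the age threshold of 18 and the replacement value "unknown" are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["zhangsan", "lisi", "wangwu", "zhaoliu"],
        "age": [18, 20, 19, 14],
        "country": ["china", "us", "china", "us"],
    },
    index=["a", "b", "c", "d"],
)

# Conditional update: change country for every row whose age is below 18.
df.loc[df["age"] < 18, "country"] = "unknown"

# Whole-column update: add one year to every age.
df["age"] = df["age"] + 1

print(df)
```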
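2.9 Sorting the data in a DataFrame
The sort_values() method sorts a DataFrame by the values in one of its columns
df.sort_values(by='age',ascending=False):
- by: the column to sort on (here 'age')
- ascending: the sort order; False sorts in descending order, and the default True sorts in ascending order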
import pandas as pd
from pandas import DataFrame
import numpy as np
df_dict = {
"name":['zhangsan','lisi','wangwu','zhaoliu'],
"age":[18,20,19,14],
'country':["china",'us','china','us']}
df = DataFrame(data=df_dict,index=['a','b','c','d'])
print(df)
print('-'*40)
df = df.sort_values(by='age',ascending=False) # sort by the "age" column in descending order
print(df)
Output:
age country name
a 18 china zhangsan
b 20 us lisi
c 19 china wangwu
d 14 us zhaoliu
----------------------------------------
age country name
b 20 us lisi
c 19 china wangwu
a 18 china zhangsan
d 14 us zhaoliu
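sort_values also accepts a list of columns, with one sort direction per column, and sort_index restores the order of the row labels. The sketch below sorts the same data by country and then by age within each country; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["zhangsan", "lisi", "wangwu", "zhaoliu"],
        "age": [18, 20, 19, 14],
        "country": ["china", "us", "china", "us"],
    },
    index=["a", "b", "c", "d"],
)

# Sort by country ascending, then by age descending within each country.
by_country_then_age = df.sort_values(by=["country", "age"], ascending=[True, False])

# sort_index puts the rows back in order of their labels.
restored = by_country_then_age.sort_index()

print(by_country_then_age.index.tolist())
print(restored.index.tolist())
```

Both methods return a new DataFrame, which is why the original example assigns the result back to df.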