
Contents

Introduction
I. Creating a DataFrame
II. DataFrame attributes
III. Slicing and indexing a DataFrame
IV. DataFrame operations

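Introduction

A DataFrame is a tabular data structure that contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index.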


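I. Creating a DataFrame

1. Creating from a function

Code:

import pandas as pd
import numpy as np

# 3x3 frame of standard-normal random values with explicit row and column labels
frame = pd.DataFrame(np.random.randn(3, 3), index=list('abc'), columns=list('ABC'))
frame

Output: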

		A			B			C
a	-0.391570	0.182729	1.010572
b	0.455405	0.418206	0.134341
c	-0.491456	-0.527641	0.868909

2. Creating directly

Code:

import pandas as pd
import numpy as np

frame = pd.DataFrame([[1, 2, 3],
                      [2, 3, 4],
                      [3, 4, 5]],
                     index=list('abc'), columns=list('ABC'))
frame

# The column index (columns) and the row index (index) can also be assigned separately
frame1 = pd.DataFrame([[1, 2, 3],
                       [2, 3, 4],
                       [3, 4, 5]])
frame1.columns = list('ABC')
frame1.index = list('abc')
frame1

Output:

>>frame
   A  B  C
a  1  2  3
b  2  3  4
c  3  4  5
>>frame1
   A  B  C
a  1  2  3
b  2  3  4
c  3  4  5

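3. Creating from a dictionary

Code:

import pandas as pd

# Each key becomes a column name; each list holds that column's values
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}

frame = pd.DataFrame(data)
frame

Output: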

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
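When creating a frame from a dictionary, the column order and the row labels can also be set at construction time. The sketch below is illustrative only: the shortened data, the row labels 'one' to 'three', and the extra column name 'debt' are assumptions that do not appear in the example above.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Nevada'],
    'year': [2000, 2001, 2001],
    'pop': [1.5, 1.7, 2.4],
}

# `columns` selects and orders the dictionary keys. A name that is not a key
# of the dictionary ('debt', a made-up column) becomes a column of NaN.
# `index` replaces the default 0, 1, 2 row labels.
frame = pd.DataFrame(data,
                     columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three'])
print(frame)
```

Passing `columns` is a convenient way to fix the column order up front instead of reordering the frame afterwards.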

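II. DataFrame attributes

1. Viewing column data types

Use "DataFrame.dtypes" to view the data type of each column. The example uses the random-valued frame from section I.1.

Code:

frame = pd.DataFrame(np.random.randn(3, 3), index=list('abc'), columns=list('ABC'))
frame.dtypes

Output: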

A    float64
B    float64
C    float64
dtype: object

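2. Viewing the first and last rows of a DataFrame

Use "head()" to view the first rows. It returns 5 rows by default, and the number of rows can be passed as an argument.
Use "tail()" to view the last rows. It returns 5 rows by default, and the number of rows can be passed as an argument.

First 5 rows (the default)
Code:

frame = pd.DataFrame(np.arange(36).reshape(6, 6), index=list('abcdef'), columns=list('ABCDEF'))
frame.head()  # first 5 rows by default

Output: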

	A	B	C	D	E	F
a	0	1	2	3	4	5
b	6	7	8	9	10	11
c	12	13	14	15	16	17
d	18	19	20	21	22	23
e	24	25	26	27	28	29

First 2 rows
Code:

frame.head(2)

Output:

	A	B	C	D	E	F
a	0	1	2	3	4	5
b	6	7	8	9	10	11

Last 5 rows (the default)
Code:

frame.tail()

Output:

	A	B	C	D	E	F
b	6	7	8	9	10	11
c	12	13	14	15	16	17
d	18	19	20	21	22	23
e	24	25	26	27	28	29
f	30	31	32	33	34	35

Last 2 rows
Code:

frame.tail(2)

Output:

	A	B	C	D	E	F
e	24	25	26	27	28	29
f	30	31	32	33	34	35

3. Viewing row and column labels

Use "DataFrame.columns" to view the column labels.

Code:

frame.columns  # column labels

Output:

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

Use "DataFrame.index" to view the row labels.

Code:

frame.index  # row labels

Output:

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

4. Viewing data values

Use "values" to view the data held in a DataFrame. It returns a NumPy array.

Code:

frame.values

Output:

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

Viewing all values in one column

Code:

print(frame['B'].values)

Output:

 [ 1  7 13 19 25 31]

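Viewing all values in one row:
- Use iloc to select by integer position (the row number, starting at 0 for the first row);
- Use loc to select by row label.

Code:

frame.iloc[0]    # first row, selected by position
frame.loc['a']   # the same row, selected by label

Output:

Both expressions return the first row (label 'a') as a Series whose index is the column labels A to F.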

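5. Viewing the number of rows and columns

"shape" is a (rows, columns) tuple: index 0 gives the number of rows and index 1 gives the number of columns.

Code:

frame.shape[0]  # number of rows
frame.shape[1]  # number of columns

Output: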

6
6

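III. Slicing and indexing a DataFrame

Here, slicing refers to selecting rows and indexing refers to selecting columns.

Rows

Slice with a colon
Use loc and iloc

Code:

# Slice with a colon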
>> frame['a':'b']
>    	A	B	C	D	E	F
	a	0	1	2	3	4	5
	b	6	7	8	9	10	11

# Use loc and iloc
# loc
>>frame.loc['a':'c','A':'C']  # ':' slices by label; both endpoints are included
>		A	B	C
	a	0	1	2
	b	6	7	8
	c	12	13	14

>>frame.loc[['a','b'],['A','C']] # a list '[]' selects specific rows and columns
>		A	C
	a	0	2
	b	6	8

#iloc
>>frame.iloc[1:]  # row slice: every row from the second one onward
>		A	B	C	D	E	F
	b	6	7	8	9	10	11
	c	12	13	14	15	16	17
	d	18	19	20	21	22	23
	e	24	25	26	27	28	29
	f	30	31	32	33	34	35

>>frame[frame['B']==13].index # labels of the rows where B equals 13
> Index(['c'], dtype='object')

Columns

Select directly by column name.
Use loc/iloc

Code:

>>frame['A'] # the column named 'A'
> 	a     0
  	b     6
  	c    12
    d    18
 	e    24
  	f    30
  	
>>frame.loc[:,'A':'C'] # columns A through C
>		A	B	C
	a	0	1	2
	b	6	7	8
	c	12	13	14
	d	18	19	20
	e	24	25	26
	f	30	31	32
	
>>frame.iloc[:,1] # the second column
>	a     1
	b     7
	c    13
	d    19
	e    25
	f    31

Rows + columns

Code:

>> frame.iloc[1:,-2:] # rows: from the second row onward; columns: the last two
>		E	F
	b	10	11
	c	16	17
	d	22	23
	e	28	29
	f	34	35
	
>> frame[frame['A']>7] # all rows where A is greater than 7
>		A	B	C	D	E	F
	c	12	13	14	15	16	17
	d	18	19	20	21	22	23
	e	24	25	26	27	28	29
	f	30	31	32	33	34	35
	
>> frame['B'][frame['A']>7]   # column 'B' for the rows where A > 7
>	c    13
	d    19
	e    25
	f    31
	Name: B, dtype: int32
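One detail of the slices above is easy to miss: a label slice with loc includes both endpoints, while a position slice with iloc excludes the stop position. The following minimal sketch rebuilds the same 6x6 frame so it runs on its own; the variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

# Label-based slice: rows 'a' through 'c' and columns 'A' through 'C', inclusive.
by_label = frame.loc['a':'c', 'A':'C']

# Position-based slice: positions 0, 1, 2 only; the stop value 3 is excluded.
by_position = frame.iloc[0:3, 0:3]

print(by_label.equals(by_position))

# Selecting column 'B' for the rows where A > 7 in a single loc call.
b_where_a_gt_7 = frame.loc[frame['A'] > 7, 'B']
print(b_where_a_gt_7)
```

A single loc call with a boolean mask and a column label gives the same result as chaining two selections, and it is the form usually preferred when the selection will later be assigned to.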

IV. DataFrame operations

1. Transpose

Use the ".T" attribute.

Code:

frame.T

Output:

	a	b	c	d	e	f
A	0	6	12	18	24	30
B	1	7	13	19	25	31
C	2	8	14	20	26	32
D	3	9	15	21	27	33
E	4	10	16	22	28	34
F	5	11	17	23	29	35

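2. Descriptive statistics

"describe()" computes descriptive statistics for each column. Non-numeric columns are skipped by default. To summarize rows instead, transpose the frame first and then call "describe()".

Code:

frame.describe()

Output: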

			A			B			C			D			E			F
count	6.000000	6.000000	6.000000	6.000000	6.000000	6.000000
mean	15.000000	16.000000	17.000000	18.000000	19.000000	20.000000
std		11.224972	11.224972	11.224972	11.224972	11.224972	11.224972
min		0.000000	1.000000	2.000000	3.000000	4.000000	5.000000
25%		7.500000	8.500000	9.500000	10.500000	11.500000	12.500000
50%		15.000000	16.000000	17.000000	18.000000	19.000000	20.000000
75%		22.500000	23.500000	24.500000	25.500000	26.500000	27.500000
max		30.000000	31.000000	32.000000	33.000000	34.000000	35.000000

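3. Computation

Arithmetic operations

add(other): adds a number to every value

Code:

frame['A'].add(100)

Output:

A new Series with 100 added to each value of column A; frame itself is not modified.

sub(other): the difference between two columns

Code:

frame['A-B'] = frame['A'].sub(frame['B'])
frame

Output: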

	A	B	C	D	E	F	A-B
a	0	1	2	3	4	5	-1
b	6	7	8	9	10	11	-1
c	12	13	14	15	16	17	-1
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1

round(decimals): rounds to a given number of decimal places

Rounding to two decimal places
Code:

frame2 = pd.DataFrame({'col1': [1.234, 2.34, 4.5678], 'col2': [1.0987, 0.9876, 3.45]})
frame2.round(2)

Output:

	col1	col2
0	1.23	1.10
1	2.34	0.99
2	4.57	3.45

Different numbers of decimal places for different columns
Code:

frame2.round({'col1': 1, 'col2': 2})

Output:

	col1 col2
0	1.2	 1.10
1	2.3	 0.99
2	4.6	 3.45

Logical operations

Comparison operators: >, >=, <, <=, ==, !=
Compound logical operators: &, |, ~ (and, or, not)

Testing whether B > 2
Code:

frame['B']>2  # returns a boolean Series

Output:

a    False
b     True
c     True
d     True
e     True
f     True
Name: B, dtype: bool

Using the boolean result as a filter
Code:

frame[frame['B']>2]

Output:

	A	B	C	D	E	F	A-B
b	6	7	8	9	10	11	-1
c	12	13	14	15	16	17	-1
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1

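Combining conditions: rows where B > 8 and C > 10. Each condition needs its own parentheses because & binds more tightly than the comparison operators.
Code:

frame[(frame['B']>8) & (frame['C']>10)]

Output: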

	A	B	C	D	E	F	A-B
c	12	13	14	15	16	17	-1
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1

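Logical functions

DataFrame.query()  # returns the matching rows directly
- query(expr)
  - expr: the query string
DataFrame.B.isin([3,6,4])  # returns a boolean Series, which must then be used as an index to get the rows

query makes "frame[(frame['B']>8)&(frame['C']>10)]" simpler to write.
Code:

frame.query("B>8 & C>10")

Output: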

	A	B	C	D	E	F	A-B
c	12	13	14	15	16	17	-1
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1
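A query string can also refer to a Python variable by prefixing its name with @, and the ~ operator negates a boolean mask. The sketch below rebuilds the 6x6 frame without the extra A-B column so it runs on its own; the variable name threshold is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

threshold = 10  # hypothetical cutoff, referenced inside the query string with @

# query() form: column names are written bare, local variables with @.
result = frame.query("B > 8 and C > @threshold")

# Equivalent boolean-mask form.
mask_result = frame[(frame['B'] > 8) & (frame['C'] > threshold)]

# ~ inverts a mask: the rows where B is NOT greater than 8.
not_b = frame[~(frame['B'] > 8)]

print(result.index.tolist())
print(not_b.index.tolist())
```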

isin(values)

Selecting the rows where C is 20, 26, or 32
Code:

frame[frame['C'].isin([20,26,32])]

Output:

	A	B	C	D	E	F	A-B
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1

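Statistical functions

Each statistical function works column by column by default (axis=0); pass axis=1 to work row by row.

count(): number of non-NA observations.
sum(): sum. Sums each column by default; "sum(axis=1)" sums each row.
mean(): mean.
median(): median.
min(): minimum.
max(): maximum.
mode(): mode (the most frequent value).
abs(): absolute value.
std(): standard deviation.
var(): variance.

Code:

frame.sum()  # sum of each column

Output:

A       90
B       96
C      102
D      108
E      114
F      120
A-B     -6
dtype: int64

Code:

frame.sum(axis=1)  # sum of each row

Output: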

a     14
b     50
c     86
d    122
e    158
f    194
dtype: int64

Code:

frame.count()  # number of non-NA values in each column

Output:

A      6
B      6
C      6
D      6
E      6
F      6
A-B    6
dtype: int64

Code:

frame.count(axis=1)  # number of non-NA values in each row

Output:

a    7
b    7
c    7
d    7
e    7
f    7
dtype: int64
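The frame above has no missing values, so count() returns the same number everywhere. The following small sketch, using a made-up two-column frame that contains one NaN, shows how count() and sum() treat missing data along each axis.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column 'x'.
df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [4.0, 5.0, 6.0]})

col_counts = df.count()                 # non-NA values per column
row_sums = df.sum(axis=1)               # per-row sum; NaN is skipped by default
strict = df.sum(axis=1, skipna=False)   # per-row sum; NaN propagates

print(col_counts)
print(row_sums)
print(strict)
```

With skipna=False, any row that contains a NaN produces a NaN sum, which can be useful when a missing value should not be silently ignored.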

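Cumulative statistical functions

cumsum(): sum of the first 1/2/3/.../n values
cummax(): maximum of the first 1/2/3/.../n values
cummin(): minimum of the first 1/2/3/.../n values
cumprod(): product of the first 1/2/3/.../n values

Code:

frame.cumsum()

Output: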

	A	B	C	D	E	F	A-B
a	0	1	2	3	4	5	-1
b	6	8	10	12	14	16	-2
c	18	21	24	27	30	33	-3
d	36	40	44	48	52	56	-4
e	60	65	70	75	80	85	-5
f	90	96	102	108	114	120	-6

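Custom operations

Use "DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)" to apply a function along an axis.
- func: the function to apply
- axis: 0 passes each column to func, 1 passes each row; the default is 0.

Cumulative sum
Code:

frame.apply(np.cumsum, axis=0, result_type=None)

Output: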

	A	B	C	D	E	F	A-B
a	0	1	2	3	4	5	-1
b	6	8	10	12	14	16	-2
c	18	21	24	27	30	33	-3
d	36	40	44	48	52	56	-4
e	60	65	70	75	80	85	-5
f	90	96	102	108	114	120	-6

A function that computes the maximum minus the minimum of each column
Code:

frame[['A','B']].apply(lambda x: x.max() - x.min())

Output:

A    30
B    30
dtype: int64
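To make the effect of the axis argument concrete, the sketch below applies the same max-minus-min function once per column and once per row. It uses a small made-up 2x3 frame rather than the frame from the text, so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Values: row 'a' = 0, 1, 2 and row 'b' = 3, 4, 5.
small = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['a', 'b'], columns=['A', 'B', 'C'])

def value_range(s):
    """Return max minus min of a Series."""
    return s.max() - s.min()

# axis=0 (default): the function receives each column in turn.
col_range = small.apply(value_range)

# axis=1: the function receives each row in turn.
row_range = small.apply(value_range, axis=1)

print(col_range)
print(row_range)
```

The result is indexed by column labels when axis=0 and by row labels when axis=1.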

Multiplying selected columns
Code:

frame[['A','B']].apply(lambda x: x * 2)

Output:

A new DataFrame in which every value of columns A and B is doubled.

4. Adding

Columns

Adding an empty column

Code:

frame['price'] = ''

Output:

	A	B	C	D	E	F	A-B	price
a	0	1	2	3	4	5	-1	
b	6	7	8	9	10	11	-1	
c	12	13	14	15	16	17	-1	
d	18	19	20	21	22	23	-1	
e	24	25	26	27	28	29	-1	
f	30	31	32	33	34	35	-1

Code:

# Create the column from a Series aligned to the row index, then set every value to 0
frame['price'] = pd.Series(dtype='int', index=['a','b','c','d','e','f'])
frame['price'] = 0

Output:

	A	B	C	D	E	F	A-B	price
a	0	1	2	3	4	5	-1	0
b	6	7	8	9	10	11	-1	0
c	12	13	14	15	16	17	-1	0
d	18	19	20	21	22	23	-1	0
e	24	25	26	27	28	29	-1	0
f	30	31	32	33	34	35	-1	0

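Adding a column / inserting a column at a given position

A column can be added like a dictionary entry, with the column name mapped to a list. The list length must match the length of the index.
Code:

frame['G'] = ['999','999','999','999','999','999']

Output:

	A	B	C	D	E	F	A-B	price	G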
a	0	1	2	3	4	5	-1	0	999
b	6	7	8	9	10	11	-1	0	999
c	12	13	14	15	16	17	-1	0	999
d	18	19	20	21	22	23	-1	0	999
e	24	25	26	27	28	29	-1	0	999
f	30	31	32	33	34	35	-1	0	999

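When the order relative to the index matters, add the column as a Series.
Note: a Series used this way must specify its index. Its default index is 0, 1, 2, ..., and assignment aligns on index labels, so if the DataFrame's index is different, every value in the new column becomes NaN.

Code:

frame['H'] = pd.Series([1,2,3])
frame

Output: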

	A	B	C	D	E	F	A-B	price	G	H
a	0	1	2	3	4	5	-1	0	999	NaN
b	6	7	8	9	10	11	-1	0	999	NaN
c	12	13	14	15	16	17	-1	0	999	NaN
d	18	19	20	21	22	23	-1	0	999	NaN
e	24	25	26	27	28	29	-1	0	999	NaN
f	30	31	32	33	34	35	-1	0	999	NaN

Code:

frame['H'] = pd.Series([1,2,3,4,5,6], index=['a','b','c','d','e','f'])
frame

Output:

	A	B	C	D	E	F	A-B	price	G	H
a	0	1	2	3	4	5	-1	0	999	1
b	6	7	8	9	10	11	-1	0	999	2
c	12	13	14	15	16	17	-1	0	999	3
d	18	19	20	21	22	23	-1	0	999	4
e	24	25	26	27	28	29	-1	0	999	5
f	30	31	32	33	34	35	-1	0	999	6
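The alignment rule behind these two results can be seen in a smaller, self-contained sketch. The three-row frame and the column names I and J are assumptions made for illustration: a Series is matched to rows by label, whereas a plain array or list is matched by position.

```python
import pandas as pd

frame = pd.DataFrame({'A': [0, 6, 12]}, index=list('abc'))

# Default Series index is 0, 1, 2; none of these labels exist in frame, so all NaN.
frame['H'] = pd.Series([1, 2, 3])

# Labels are matched regardless of order: a gets 1, b gets 2, c gets 3.
frame['I'] = pd.Series([3, 1, 2], index=['c', 'a', 'b'])

# Converting to a NumPy array drops the index, so values are assigned by position.
frame['J'] = pd.Series([1, 2, 3]).to_numpy()

print(frame)
```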

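"insert" places the new column at a given position and shifts the remaining columns to the right.

Code:

# Insert a column named "QQ" with the values ['999','999','999','999','999','999'] as the first column; the other columns shift right.
frame.insert(0, 'QQ', ['999','999','999','999','999','999'])

Output: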

	QQ	A	B	C	D	E	F	A-B	price	G
a	999	0	1	2	3	4	5	-1	0	999
b	999	6	7	8	9	10	11	-1	0	999
c	999	12	13	14	15	16	17	-1	0	999
d	999	18	19	20	21	22	23	-1	0	999
e	999	24	25	26	27	28	29	-1	0	999
f	999	30	31	32	33	34	35	-1	0	999

Rows

Adding a row

Assign a new row directly with loc. The list must contain one value per column.
Code:

new_data_list = ['666','999','555','3','4','8','0','0','1']
frame.loc[6] = new_data_list
frame

Output:

	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
6	666	999	555	3	4	8	0	0	1

Assign a new row with loc using a string label
Code:

frame.loc['g'] = ['666','999','555','3','4','8','0','0','1']

Output:

	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
6	666	999	555	3	4	8	0	0	1
g	666	999	555	3	4	8	0	0	1

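5. Modifying

Use DataFrame.rename(mapper=None, *, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')

Renaming a column
Code:

# With inplace=False (the default), rename returns a new DataFrame and leaves frame unchanged
frame.rename(columns={'A': 'key'}, inplace=False)

Output:

	key	B	C	D	E	F	price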
a	0	1	2	3	4	5	0
b	6	7	8	9	10	11	0
c	12	13	14	15	16	17	0
d	18	19	20	21	22	23	0
e	24	25	26	27	28	29	0
f	30	31	32	33	34	35	0

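6. Deleting

The drop() function
- Deletes rows by default and returns a new DataFrame; the original data is not changed.
- Pass axis=1 to delete columns.
- Pass inplace=True to operate directly on the original data.

Code:

frame.drop('c')  # returns a copy without row 'c'; frame itself is unchanged
frame

Output: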

	A	B	C	D	E	F	A-B	price	G	H
a	0	1	2	3	4	5	-1	0	999	1
b	6	7	8	9	10	11	-1	0	999	2
c	12	13	14	15	16	17	-1	0	999	3
d	18	19	20	21	22	23	-1	0	999	4
e	24	25	26	27	28	29	-1	0	999	5
f	30	31	32	33	34	35	-1	0	999	6

Code:

frame.drop('A', axis=1, inplace=True)  # delete column 'A' from frame itself
frame

Output:

	B	C	D	E	F	A-B	price	G	H
a	1	2	3	4	5	-1	0	999	1
b	7	8	9	10	11	-1	0	999	2
c	13	14	15	16	17	-1	0	999	3
d	19	20	21	22	23	-1	0	999	4
e	25	26	27	28	29	-1	0	999	5
f	31	32	33	34	35	-1	0	999	6

del removes the column from the original data directly
Code:

del frame['G']
frame

Output:

	B	C	D	E	F	A-B	price	H
a	1	2	3	4	5	-1	0	1
b	7	8	9	10	11	-1	0	2
c	13	14	15	16	17	-1	0	3
d	19	20	21	22	23	-1	0	4
e	25	26	27	28	29	-1	0	5
f	31	32	33	34	35	-1	0	6
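The three bullet points about drop() can be checked with a small, self-contained sketch. The frame and the label 'zzz' below are made up for illustration; the columns= keyword is an alternative spelling of axis=1, and errors='ignore' suppresses the KeyError that drop() otherwise raises for a label that does not exist.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=list('abc'))

# Default: drops a row label and returns a new frame; `frame` keeps all rows.
without_row = frame.drop('c')

# columns= is equivalent to passing axis=1.
without_col = frame.drop(columns='A')

# A missing label normally raises KeyError; errors='ignore' returns the data unchanged.
untouched = frame.drop('zzz', errors='ignore')

print(without_row)
print(without_col)
print(frame)
```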

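7. Removing duplicates

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
- subset: the columns to consider when identifying duplicates.
- keep: which duplicate to keep, one of {'first', 'last', False}, default 'first'. With False, all duplicated rows are dropped.
- inplace: whether to modify the original DataFrame.

# Original data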
	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
6	666	999	555	3	4	8	0	0	6
g	666	999	555	3	4	8	0	0	6

Removing duplicate rows, keeping the last row of each set of duplicates
Code:

# Drop duplicate rows
frame.drop_duplicates(keep='last')

Output:

	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
g	666	999	555	3	4	8	0	0	6

Removing rows that repeat a value in column 'J'
Code:

frame.drop_duplicates(subset=['J'])

Output:

	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
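The three keep options are easiest to compare side by side. The sketch below uses a reduced, made-up version of the data (two columns, four rows, with the last two rows identical) and also shows duplicated(), which returns the boolean mask that drop_duplicates() acts on.

```python
import pandas as pd

# Rows labelled 6 and 'g' are identical.
df = pd.DataFrame({'B': [1, 7, 666, 666],
                   'J': [1, 2, 6, 6]},
                  index=['a', 'b', 6, 'g'])

first = df.drop_duplicates()              # keep the first of each duplicate set
last = df.drop_duplicates(keep='last')    # keep the last of each duplicate set
none = df.drop_duplicates(keep=False)     # drop every row that has a duplicate

# True marks a row that repeats an earlier row.
flags = df.duplicated()

print(first.index.tolist())
print(last.index.tolist())
print(none.index.tolist())
```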

8. Sorting

sort_values()
- by: the column to sort by
- ascending: whether to sort in ascending order

Code:

frame.sort_values(by='J', ascending=False)

Output:

	B	C	D	E	F	A-B	price	H	J
f	31	32	33	34	35	-1	0	6	6
6	666	999	555	3	4	8	0	0	6
g	666	999	555	3	4	8	0	0	6
e	25	26	27	28	29	-1	0	5	5
d	19	20	21	22	23	-1	0	4	4
c	13	14	15	16	17	-1	0	3	3
b	7	8	9	10	11	-1	0	2	2
a	1	2	3	4	5	-1	0	1	1
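In the output above, several rows share the value 6 in column J. One way to control the order among such ties is to sort by more than one column, with a separate direction for each. The sketch below uses a reduced, made-up frame and also shows sort_index(), which sorts by row label instead of by value.

```python
import pandas as pd

df = pd.DataFrame({'J': [6, 1, 6, 5],
                   'B': [31, 1, 666, 25]},
                  index=['f', 'a', 'g', 'e'])

# Sort by J descending; break ties in J by B ascending.
by_two = df.sort_values(by=['J', 'B'], ascending=[False, True])

# Sort by row label instead of by column values.
by_label = df.sort_index()

print(by_two)
print(by_label)
```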

9. Merging

merge combines two DataFrames based mainly on their common columns.
join combines two DataFrames based mainly on their indexes.
concat concatenates Series or DataFrames row-wise or column-wise.

The merge method

Joining on a single column

Inner join
Code:

import pandas as pd
import numpy as np
# Define df1
df1 = pd.DataFrame({'A': [1,2,2,4,5], 'freatrue1': np.arange(0,1,0.2), 'feature2': ['one','two','three','four','five']})
# Define df2
df2 = pd.DataFrame({'A': [1,1,2,7,8], 'color': ['red','blue','orange','purple','pink'], 'fruits': ['apple','grades','watermelon','pear','mango']})
# Inner join on the shared column A
df3 = pd.merge(df1, df2, how='inner', on='A')

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A  freatrue1 feature2
0  1        0.0      one
1  2        0.2      two
2  2        0.4    three
3  4        0.6     four
4  5        0.8     five
>>df2
   A   color      fruits
0  1     red       apple
1  1    blue      grades
2  2  orange  watermelon
3  7  purple        pear
4  8    pink       mango
>>df3
   A  freatrue1 feature2   color      fruits
0  1        0.0      one     red       apple
1  1        0.0      one    blue      grades
2  2        0.2      two  orange  watermelon
3  2        0.4    three  orange  watermelon

Outer join

Joins on the union of the key values, with how='outer' and on set to the shared column name. Where a key value appears in only one DataFrame, the columns that come from the other DataFrame are filled with NaN.

Code:

import pandas as pd
import numpy as np
# Define df1
df1 = pd.DataFrame({'A': [1,2,2,4,5], 'freatrue1': np.arange(0,1,0.2), 'feature2': ['one','two','three','four','five']})
# Define df2
df2 = pd.DataFrame({'A': [1,1,2,7,8], 'color': ['red','blue','orange','purple','pink'], 'fruits': ['apple','grades','watermelon','pear','mango']})
# Outer join on the shared column A
df3 = pd.merge(df1, df2, how='outer', on='A')

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A  freatrue1 feature2
0  1        0.0      one
1  2        0.2      two
2  2        0.4    three
3  4        0.6     four
4  5        0.8     five
>>df2
   A   color      fruits
0  1     red       apple
1  1    blue      grades
2  2  orange  watermelon
3  7  purple        pear
4  8    pink       mango
>>df3
   A  freatrue1 feature2   color      fruits
0  1        0.0      one     red       apple
1  1        0.0      one    blue      grades
2  2        0.2      two  orange  watermelon
3  2        0.4    three  orange  watermelon
4  4        0.6     four     NaN         NaN
5  5        0.8     five     NaN         NaN
6  7        NaN      NaN  purple        pear
7  8        NaN      NaN    pink       mango

Left join

Keeps every row of the left DataFrame, with how='left' and on set to the shared column name. Where a key value from the left DataFrame has no match in the right one, the columns that come from the right DataFrame are filled with NaN.

Code:

import pandas as pd
import numpy as np
# Define df1
df1 = pd.DataFrame({'A': [1,2,2,4,5], 'freatrue1': np.arange(0,1,0.2), 'feature2': ['one','two','three','four','five']})
# Define df2
df2 = pd.DataFrame({'A': [1,1,2,7,8], 'color': ['red','blue','orange','purple','pink'], 'fruits': ['apple','grades','watermelon','pear','mango']})
# Left join on the shared column A
df3 = pd.merge(df1, df2, how='left', on='A')

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A  freatrue1 feature2
0  1        0.0      one
1  2        0.2      two
2  2        0.4    three
3  4        0.6     four
4  5        0.8     five
>>df2
   A   color      fruits
0  1     red       apple
1  1    blue      grades
2  2  orange  watermelon
3  7  purple        pear
4  8    pink       mango
>>df3
   A  freatrue1 feature2   color      fruits
0  1        0.0      one     red       apple
1  1        0.0      one    blue      grades
2  2        0.2      two  orange  watermelon
3  2        0.4    three  orange  watermelon
4  4        0.6     four     NaN         NaN
5  5        0.8     five     NaN         NaN

Right join

Keeps every row of the right DataFrame, with how='right' and on set to the shared column name. Where a key value from the right DataFrame has no match in the left one, the columns that come from the left DataFrame are filled with NaN.

Code:

import pandas as pd
import numpy as np
# Define df1
df1 = pd.DataFrame({'A': [1,2,2,4,5], 'freatrue1': np.arange(0,1,0.2), 'feature2': ['one','two','three','four','five']})
# Define df2
df2 = pd.DataFrame({'A': [1,1,2,7,8], 'color': ['red','blue','orange','purple','pink'], 'fruits': ['apple','grades','watermelon','pear','mango']})
# Right join on the shared column A
df3 = pd.merge(df1, df2, how='right', on='A')

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A  freatrue1 feature2
0  1        0.0      one
1  2        0.2      two
2  2        0.4    three
3  4        0.6     four
4  5        0.8     five
>>df2
   A   color      fruits
0  1     red       apple
1  1    blue      grades
2  2  orange  watermelon
3  7  purple        pear
4  8    pink       mango
>>df3
   A  freatrue1 feature2   color      fruits
0  1        0.0      one     red       apple
1  1        0.0      one    blue      grades
2  2        0.2      two  orange  watermelon
3  2        0.4    three  orange  watermelon
4  7        NaN      NaN  purple        pear
5  8        NaN      NaN    pink       mango
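To see at a glance how the four join types differ, the row counts can be compared directly, and the indicator=True option of merge adds a _merge column recording where each row came from. The sketch below uses the same key values as the examples above but keeps only one extra column per frame to stay short.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5],
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink']})

# Number of result rows for each join type.
sizes = {how: len(pd.merge(df1, df2, how=how, on='A'))
         for how in ['inner', 'outer', 'left', 'right']}
print(sizes)

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only', or 'right_only'.
merged = pd.merge(df1, df2, how='outer', on='A', indicator=True)
print(merged[['A', '_merge']])
```

Filtering on the _merge column is a simple way to find keys that are present in only one of the two frames.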

Joining on multiple columns

Inner join on multiple columns (intersection)
Code:

df1 = pd.DataFrame({'A': [1,2,2,4,5], 'B': ['a','b','a','d','c'], 'freatrue1': np.arange(0,1,0.2), 'feature2': ['one','two','three','four','five']})
df2 = pd.DataFrame({'A': [1,1,2,7,8], 'B': ['e','g','a','d','c'], 'color': ['red','blue','orange','purple','pink'], 'fruits': ['apple','grades','watermelon','pear','mango']})
# A row matches only when both A and B are equal
df3 = pd.merge(df1, df2, how='inner', on=['A','B'])

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A  B  freatrue1 feature2
0  1  a        0.0      one
1  2  b        0.2      two
2  2  a        0.4    three
3  4  d        0.6     four
4  5  c        0.8     five
>>df2
   A  B   color      fruits
0  1  e     red       apple
1  1  g    blue      grades
2  2  a  orange  watermelon
3  7  d  purple        pear
4  8  c    pink       mango
>>df3
   A  B  freatrue1 feature2   color      fruits
0  2  a        0.4    three  orange  watermelon

Outer join on multiple columns (union)
Code:

df1 = pd.DataFrame({'A': [1,2,2,4,5], 'B': ['a','b','a','d','c'], 'freatrue1': np.arange(0,1,0.2), 'feature2': ['one','two','three','four','five']})
df2 = pd.DataFrame({'A': [1,1,2,7,8], 'B': ['e','g','a','d','c'], 'color': ['red','blue','orange','purple','pink'], 'fruits': ['apple','grades','watermelon','pear','mango']})
df3 = pd.merge(df1, df2, how='outer', on=['A','B'])

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A  B  freatrue1 feature2
0  1  a        0.0      one
1  2  b        0.2      two
2  2  a        0.4    three
3  4  d        0.6     four
4  5  c        0.8     five
>>df2
   A  B   color      fruits
0  1  e     red       apple
1  1  g    blue      grades
2  2  a  orange  watermelon
3  7  d  purple        pear
4  8  c    pink       mango
>>df3
   A  B  freatrue1 feature2   color      fruits
0  1  a        0.0      one     NaN         NaN
1  2  b        0.2      two     NaN         NaN
2  2  a        0.4    three  orange  watermelon
3  4  d        0.6     four     NaN         NaN
4  5  c        0.8     five     NaN         NaN
5  1  e        NaN      NaN     red       apple
6  1  g        NaN      NaN    blue      grades
7  7  d        NaN      NaN  purple        pear
8  8  c        NaN      NaN    pink       mango

Left join on multiple columns
Code:

df1 = pd.DataFrame({'A': [1,2,2,4,5], 'B': ['a','b','a','d','c'], 'freatrue1': np.arange(0,1,0.2), 'feature2': ['one','two','three','four','five']})
df2 = pd.DataFrame({'A': [1,1,2,7,8], 'B': ['e','g','a','d','c'], 'color': ['red','blue','orange','purple','pink'], 'fruits': ['apple','grades','watermelon','pear','mango']})
df3 = pd.merge(df1, df2, how='left', on=['A','B'])

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A  B  freatrue1 feature2
0  1  a        0.0      one
1  2  b        0.2      two
2  2  a        0.4    three
3  4  d        0.6     four
4  5  c        0.8     five
>>df2
   A  B   color      fruits
0  1  e     red       apple
1  1  g    blue      grades
2  2  a  orange  watermelon
3  7  d  purple        pear
4  8  c    pink       mango
>>df3
   A  B  freatrue1 feature2   color      fruits
0  1  a        0.0      one     NaN         NaN
1  2  b        0.2      two     NaN         NaN
2  2  a        0.4    three  orange  watermelon
3  4  d        0.6     four     NaN         NaN
4  5  c        0.8     five     NaN         NaN

Joining on the index

Code:

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'A': [1,2,2,4,5], 'B': ['a','b','a','d','c'], 'freatrue1': np.arange(0,1,0.2), 'feature2': ['one','two','three','four','five']})
df2 = pd.DataFrame({'A': [1,1,2,7,8], 'B': ['e','g','a','d','c'], 'color': ['red','blue','orange','purple','pink'], 'fruits': ['apple','grades','watermelon','pear','mango']}, index=[4,5,6,7,8])
# Join column A of df1 against the index of df2
df3 = pd.merge(df1, df2, how='inner', left_on='A', right_index=True)

# suffixes sets the suffixes added to columns, other than the join key, that share a name
df4 = pd.merge(df1, df2, how='inner', left_on='A', right_index=True, suffixes=('_df1','_df2'))

print(df1)
print(df2)
print(df3)
print(df4)

Output:

>>df1
   A  B  freatrue1 feature2
0  1  a        0.0      one
1  2  b        0.2      two
2  2  a        0.4    three
3  4  d        0.6     four
4  5  c        0.8     five
>>df2
   A  B   color      fruits
4  1  e     red       apple
5  1  g    blue      grades
6  2  a  orange  watermelon
7  7  d  purple        pear
8  8  c    pink       mango
>>df3
   A  A_x B_x  freatrue1 feature2  A_y B_y color  fruits
3  4    4   d        0.6     four    1   e   red   apple
4  5    5   c        0.8     five    1   g  blue  grades
>>df4
   A  A_df1 B_df1  freatrue1 feature2  A_df2 B_df2 color  fruits
3  4      4     d        0.6     four      1     e   red   apple
4  5      5     c        0.8     five      1     g  blue  grades

The join method

join combines DataFrames on their indexes. It supports the same join types as merge: inner, outer, left, and right.

Joining index to index

Code:

df1 = pd.DataFrame({'A': [1,2,3,4,5], 'B': ['red','blue','orange','purple','pink']})
df2 = pd.DataFrame({'A': [1,2,3], 'fruits': ['apple','grades','watermelon']})
# lsuffix and rsuffix set the suffixes added to overlapping column names
df3 = df1.join(df2, lsuffix='_caller', rsuffix='_other', how='inner')

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A       B
0  1     red
1  2    blue
2  3  orange
3  4  purple
4  5    pink
>>df2
   A      fruits
0  1       apple
1  2      grades
2  3  watermelon
>>df3
   A_caller       B  A_other      fruits
0         1     red        1       apple
1         2    blue        2      grades
2         3  orange        3  watermelon

Joining on a column (join)

Code:

df1 = pd.DataFrame({'A': [1,2,3,4,5], 'B': ['red','blue','orange','purple','pink']})
df2 = pd.DataFrame({'A': [1,2,3], 'fruits': ['apple','grades','watermelon']})
# Move column A into the index of both frames, then join on it
df3 = df1.set_index('A').join(df2.set_index('A'), how='inner')

print(df1)
print(df2)
print(df3)

Output:

>>df1
   A       B
0  1     red
1  2    blue
2  3  orange
3  4  purple
4  5    pink
>>df2
   A      fruits
0  1       apple
1  2      grades
2  3  watermelon
>>df3
        B      fruits
A                    
1     red       apple
2    blue      grades
3  orange  watermelon

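The concat method

concat concatenates pandas objects either row-wise or column-wise. Row-wise is the default, and the default join is an outer join (union).

Concatenating Series

Row-wise concatenation

Code:

df1 = pd.Series([1,2,3], index=['a','b','c'])
df2 = pd.Series([4,5,6], index=['b','c','d'])
df3 = pd.concat([df1, df2])
# When the inputs share index labels, keys adds an outer index level that records which input each row came from
df4 = pd.concat([df1, df2], keys=['fea1','fea2'])

print(df1)
print(df2)
print(df3)
print(df4)

Output: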

>>df1
a    1
b    2
c    3
dtype: int64
>>df2
b    4
c    5
d    6
dtype: int64
>>df3
a    1
b    2
c    3
b    4
c    5
d    6
dtype: int64
>>df4
fea1  a    1
      b    2
      c    3
fea2  b    4
      c    5
      d    6
dtype: int64

Column-wise concatenation

Concatenates on the union of the indexes by default

Code:

pd.concat([df1,df2],axis=1)

Output:

	0	1
a	1.0	NaN
b	2.0	4.0
c	3.0	5.0
d	NaN	6.0

Concatenating on the intersection of the indexes
- keys: sets the column names of the result
- join: 'inner' for the intersection

Code:

pd.concat([df1,df2],axis=1,join='inner',keys=['fea1','fea2'])

Output:

	fea1	fea2
b	2	4
c	3	5

Concatenating DataFrames

Row-wise concatenation

Code:

df1 = pd.DataFrame({'A': [1,2,3], 'fea1': ['b','c','d']})
df2 = pd.DataFrame({'A': [4,5,6], 'fea1': ['a','b','c']})

df3 = pd.concat([df1, df2])  # row-wise concatenation
print(df1)
print(df2)
print(df3)

Output:

>>df1
   A fea1
0  1    b
1  2    c
2  3    d
>>df2
   A fea1
0  4    a
1  5    b
2  6    c
>>df3
   A fea1
0  1    b
1  2    c
2  3    d
0  4    a
1  5    b
2  6    c

Column-wise concatenation
Code:

df1 = pd.DataFrame({'A': [1,2,3], 'fea1': ['b','c','d']})
df2 = pd.DataFrame({'A': [4,5,6], 'fea1': ['a','b','c']})

df3 = pd.concat([df1, df2], axis=1)  # column-wise concatenation
print(df1)
print(df2)
print(df3)

Output:

>>df1
   A fea1
0  1    b
1  2    c
2  3    d
>>df2
   A fea1
0  4    a
1  5    b
2  6    c
>>df3
   A fea1  A fea1
0  1    b  4    a
1  2    c  5    b
2  3    d  6    c
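In the row-wise result shown earlier, the index labels 0, 1, 2 appear twice because concat keeps each input's own index. Two common ways to deal with that are sketched below, using the same two frames: ignore_index=True renumbers the rows, and keys adds an outer index level. The key names 'first' and 'second' are chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'fea1': ['b', 'c', 'd']})
df2 = pd.DataFrame({'A': [4, 5, 6], 'fea1': ['a', 'b', 'c']})

# Discard the original row labels and renumber from 0.
stacked = pd.concat([df1, df2], ignore_index=True)

# Keep the original labels, but add an outer level naming the source frame.
keyed = pd.concat([df1, df2], keys=['first', 'second'])

print(stacked)
print(keyed)

# The outer level lets one input be recovered from the combined frame.
print(keyed.loc['second'])
```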

Python data analysis basics: using the pandas DataFrame
