Python data analysis basics: using DataFrame in pandas


Foreword

A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index.
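
As a minimal sketch of this description, the following builds a small DataFrame whose columns hold different value types and then inspects its row and column indexes. The name demo, the column names, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: each column holds a different value type
demo = pd.DataFrame(
    {
        "count": [1, 2, 3],            # integers
        "name": ["x", "y", "z"],       # strings
        "flag": [True, False, True],   # booleans
    },
    index=["r1", "r2", "r3"],          # row index
)

row_labels = list(demo.index)        # labels of the row index
column_labels = list(demo.columns)   # labels of the column index
print(demo.dtypes)                   # one dtype per column
```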


1. DataFrame creation

1. Function creation

The code is as follows:

import pandas as pd 
import numpy as np

frame=pd.DataFrame(np.random.randn(3,3),index=list('abc'),columns=list('ABC'))
frame

Output result:

		A			B			C
a	-0.391570	0.182729	1.010572
b	0.455405	0.418206	0.134341
c	-0.491456	-0.527641	0.868909

2. Create directly

The code is as follows:

import pandas as pd
import numpy as np

frame= pd.DataFrame([[1, 2, 3], 
                    [2, 3, 4],
                    [3, 4, 5]],
                   index=list('abc'), columns=list('ABC'))
frame

# The column index (columns) and the row index (index) can also be set separately
frame1=pd.DataFrame([[1, 2, 3], 
                    [2, 3, 4],
                    [3, 4, 5]])
frame1.columns=list('ABC')  
frame1.index=list('abc') 
frame1      

Output result:

>>frame
   A  B  C
a  1  2  3
b  2  3  4
c  3  4  5
>>frame1
   A  B  C
a  1  2  3
b  2  3  4
c  3  4  5

3. Dictionary creation

The code is as follows:

import pandas as pd
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}

frame=pd.DataFrame(data)
frame

Output result:

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9

2. DataFrame properties

1. View the data type of the column

  • Use "DataFrame.dtypes" to see column data types

The code is as follows (frame here is the 3x3 random DataFrame from "1. Function creation"):

frame.dtypes

Output result:

A    float64
B    float64
C    float64
dtype: object

2. View the first and last few rows of the DataFrame

  • Use "head()" to view the first few rows. The default is the first 5 rows; pass a number to change it.
  • Use "tail()" to view the last few rows. The default is the last 5 rows; pass a number to change it.

The first 5 rows (the default).
The code is as follows:

frame = pd.DataFrame(np.arange(36).reshape(6, 6), index=list('abcdef'), columns=list('ABCDEF'))
frame.head()  # first 5 rows by default

Output result:

	A	B	C	D	E	F
a	0	1	2	3	4	5
b	6	7	8	9	10	11
c	12	13	14	15	16	17
d	18	19	20	21	22	23
e	24	25	26	27	28	29

The first 2 rows.
The code is as follows:

frame.head(2) 

Output result:

	A	B	C	D	E	F
a	0	1	2	3	4	5
b	6	7	8	9	10	11

The last 5 rows (the default).
The code is as follows:

frame.tail() 

Output result:

	A	B	C	D	E	F
b	6	7	8	9	10	11
c	12	13	14	15	16	17
d	18	19	20	21	22	23
e	24	25	26	27	28	29
f	30	31	32	33	34	35

The last 2 rows.
The code is as follows:

frame.tail(2) 

Output result:

	A	B	C	D	E	F
e	24	25	26	27	28	29
f	30	31	32	33	34	35

3. View row and column names

  • Use "DataFrame.columns" to see column names

The code is as follows:

frame.columns  # view the column names

Output result:

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

  • Use "DataFrame.index" to see the row names

The code is as follows:

frame.index  # view the row names

Output result:

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

4. View the data values

  • Use "values" to view the data values in the DataFrame; it returns a NumPy array.

The code is as follows:

frame.values

Output result:

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

  • View all data values in a column

The code is as follows:

print(frame['B'].values)

Output result:

 [ 1  7 13 19 25 31]

  • View all data values of a row
    • Use iloc to select by integer position (the row number, starting at 0 for the first row).
    • Use loc to select by row label.

The code is as follows:

frame.iloc[0]
frame.loc['a']

Output result:

A    0
B    1
C    2
D    3
E    4
F    5
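
The relationship between the two accessors can be checked directly. The sketch below rebuilds the same 6x6 frame and confirms that position 0 and label 'a' refer to the same row. The variable names by_position, by_label, and single_value are only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

by_position = frame.iloc[0]   # first row, selected by integer position
by_label = frame.loc['a']     # the same row, selected by its index label

print(by_position.equals(by_label))   # the two selections hold the same data
single_value = frame.loc['b', 'C']    # a single cell: row 'b', column 'C'
```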

5. View the number of rows and columns

  • Use shape to get the number of rows and columns. It is a tuple (rows, columns): shape[0] is the number of rows and shape[1] is the number of columns.

The code is as follows:

frame.shape[0]
frame.shape[1]

Output result:

6
6

3. DataFrame slicing and indexing

With the [] operator, a slice (frame['a':'b']) selects rows, while a single label (frame['A']) selects a column.

Rows

  • Slicing with colons
  • With loc, iloc

The code is as follows:

# Slice rows with a colon
>> frame['a':'b']
>    	A	B	C	D	E	F
	a	0	1	2	3	4	5
	b	6	7	8	9	10	11

# Using loc and iloc
# loc
>>frame.loc['a':'c','A':'C']  # ':' slices a range of labels (both ends included)
>		A	B	C
	a	0	1	2
	b	6	7	8
	c	12	13	14

>>frame.loc[['a','b'],['A','C']]  # '[]' selects specific rows and columns
>		A	C
	a	0	2
	b	6	8

# iloc
>>frame.iloc[1:]  # row slice: all rows from the 2nd row onward
>		A	B	C	D	E	F
	b	6	7	8	9	10	11
	c	12	13	14	15	16	17
	d	18	19	20	21	22	23
	e	24	25	26	27	28	29
	f	30	31	32	33	34	35

>>frame[frame['B']==13].index  # row labels of the rows where B equals 13
> Index(['c'], dtype='object')

Columns

  • Select directly by column name.
  • Use loc/iloc

The code is as follows:

>>frame['A']  # select the column named 'A'
> 	a     0
  	b     6
  	c    12
    d    18
 	e    24
  	f    30
  	
>>frame.loc[:,'A':'C']  # select columns A through C
>		A	B	C
	a	0	1	2
	b	6	7	8
	c	12	13	14
	d	18	19	20
	e	24	25	26
	f	30	31	32
	
>>frame.iloc[:,1]  # select the second column
>	a     1
	b     7
	c    13
	d    19
	e    25
	f    31

Rows + columns

The code is as follows:

>> frame.iloc[1:,-2:]  # rows: from the 2nd row onward; columns: the last two
>		E	F
	b	10	11
	c	16	17
	d	22	23
	e	28	29
	f	34	35
	
>> frame[frame['A']>7]  # all rows where A is greater than 7
>		A	B	C	D	E	F
	c	12	13	14	15	16	17
	d	18	19	20	21	22	23
	e	24	25	26	27	28	29
	f	30	31	32	33	34	35
	
>> frame['B'][frame['A']>7]  # column 'B' of all rows where A > 7
>	c    13
	d    19
	e    25
	f    31
	Name: B, dtype: int32
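
The last selection above chains two indexing steps. A single loc call that takes a boolean mask and a column label is a common alternative. The sketch below, using illustrative variable names, shows that both forms return the same values on the 6x6 frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

mask = frame['A'] > 7            # boolean Series, one entry per row
chained = frame['B'][mask]       # two steps: pick the column, then filter it
direct = frame.loc[mask, 'B']    # one step: rows by mask, column by label

print(direct)
```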

4. DataFrame operations

1. Transpose

  • Use the ".T" attribute

The code is as follows:

frame.T

Output result:

	a	b	c	d	e	f
A	0	6	12	18	24	30
B	1	7	13	19	25	31
C	2	8	14	20	26	32
D	3	9	15	21	27	33
E	4	10	16	22	28	34
F	5	11	17	23	29	35

2. Descriptive statistics

  • Use "describe()" to compute descriptive statistics for each column. By default, non-numeric columns are skipped when the DataFrame also has numeric columns. To describe rows instead, transpose first and then call "describe()".

The code is as follows:

frame.describe()

Output result:

			A			B			C			D			E			F
count	6.000000	6.000000	6.000000	6.000000	6.000000	6.000000
mean	15.000000	16.000000	17.000000	18.000000	19.000000	20.000000
std		11.224972	11.224972	11.224972	11.224972	11.224972	11.224972
min		0.000000	1.000000	2.000000	3.000000	4.000000	5.000000
25%		7.500000	8.500000	9.500000	10.500000	11.500000	12.500000
50%		15.000000	16.000000	17.000000	18.000000	19.000000	20.000000
75%		22.500000	23.500000	24.500000	25.500000	26.500000	27.500000
max		30.000000	31.000000	32.000000	33.000000	34.000000	35.000000
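
As a small sketch of the row-wise variant mentioned above, the following transposes the 6x6 frame before calling describe(), so that each column of the result corresponds to one original row. The name row_stats is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

# After transposing, the original rows a..f become columns,
# so describe() summarises each original row.
row_stats = frame.T.describe()
print(row_stats.loc['mean'])   # mean of each original row
```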

3. Calculation

Arithmetic operations

  • add(other): add a number (or another Series/DataFrame) element-wise

The code is as follows:

frame['A'].add(100)

Output result:

a    100
b    106
c    112
d    118
e    124
f    130

  • sub(other): subtract element-wise, for example to get the difference between two columns

The code is as follows:

frame['A-B'] = frame['A'].sub(frame['B'])
frame

Output result:

	A	B	C	D	E	F	A-B
a	0	1	2	3	4	5	-1
b	6	7	8	9	10	11	-1
c	12	13	14	15	16	17	-1
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1

  • round(decimals): round to the given number of decimal places

Round to two decimal places.
The code is as follows:

frame2 = pd.DataFrame({'col1': [1.234, 2.34, 4.5678], 'col2': [1.0987, 0.9876, 3.45]})
frame2.round(2)

Output result:

	col1	col2
0	1.23	1.10
1	2.34	0.99
2	4.57	3.45

Different numbers of decimal places can be specified for different columns.
The code is as follows:

frame2.round({'col1': 1, 'col2': 2})

Output result:

	col1 col2
0	1.2	 1.10
1	2.3	 0.99
2	4.6	 3.45

Logical operations

Logical operators: <, >, |, &

  • Comparison operators: >, >=, <, <=, ==, !=
  • Compound logical operators: &, |, ~ (and, or, not)

Filter the rows where B > 2.
The code is as follows:

frame['B']>2  # returns a boolean Series

Output result:

a    False
b     True
c     True
d     True
e     True
f     True
Name: B, dtype: bool

The boolean result can then be used as the filter condition.
The code is as follows:

frame[frame['B']>2]

Output result:

	A	B	C	D	E	F	A-B
b	6	7	8	9	10	11	-1
c	12	13	14	15	16	17	-1
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1

Combine several conditions: select the rows where B > 8 and C > 10.
The code is as follows:

frame[(frame['B']>8)& (frame['C']>10)]

Output result:

	A	B	C	D	E	F	A-B
c	12	13	14	15	16	17	-1
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1

Logical operation functions

  • DataFrame.query(): returns the matching rows directly
    • query(expr)
      • expr: the query string
  • DataFrame.B.isin([3,6,4]): returns a boolean Series, which still has to be used as an index to get the data

query makes "frame[(frame['B']>8)&(frame['C']>10)]" shorter and easier to read.
The code is as follows:

frame.query("B>8 & C>10")

Output result:

	A	B	C	D	E	F	A-B
c	12	13	14	15	16	17	-1
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1

  • isin(values)

Select the rows where C is 20, 26, or 32.
The code is as follows:

frame[frame['C'].isin([20,26,32])]

Output result:

	A	B	C	D	E	F	A-B
d	18	19	20	21	22	23	-1
e	24	25	26	27	28	29	-1
f	30	31	32	33	34	35	-1
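
To tie the filtering approaches together, here is a hedged sketch on the original 6x6 frame (without the extra A-B column). It compares a boolean mask with the equivalent query string and negates isin with ~. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

# The same filter written as a boolean mask and as a query string
by_mask = frame[(frame['B'] > 8) & (frame['C'] > 10)]
by_query = frame.query("B > 8 and C > 10")

# ~ negates a boolean Series: rows whose C is NOT one of the listed values
not_in = frame[~frame['C'].isin([20, 26, 32])]

print(by_query)
print(not_in)
```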

Statistical functions

By default these functions work along axis=0, that is, they return one result per column. To compute along rows instead, specify axis=1.

  • count(): the number of non-NA values.
  • sum(): the sum. By default each column is summed; "sum(axis=1)" sums each row.
  • mean(): the mean.
  • median(): the median.
  • min(): the minimum value.
  • max(): the maximum value.
  • mode(): the most frequent value(s).
  • abs(): the absolute value.
  • std(): the standard deviation.
  • var(): the variance.
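
A short sketch of the axis argument and of how var() relates to std(), using the original 6x6 frame. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

col_means = frame.mean()         # axis=0 (default): one value per column
row_means = frame.mean(axis=1)   # axis=1: one value per row

# var() is the variance, i.e. the square of the standard deviation
var_a = frame['A'].var()
std_a = frame['A'].std()

print(col_means)
print(row_means)
```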

The code is as follows:

frame.sum()  # sum of each column

Output result:

A       90
B       96
C      102
D      108
E      114
F      120
A-B     -6

The code is as follows:

frame.sum(axis=1)  # sum of each row

Output result:

a     14
b     50
c     86
d    122
e    158
f    194
dtype: int64

The code is as follows:

frame.count()  # number of non-null values in each column

Output result:

A      6
B      6
C      6
D      6
E      6
F      6
A-B    6
dtype: int64

The code is as follows:

frame.count(axis=1)  # number of non-null values in each row

Output result:

a    7
b    7
c    7
d    7
e    7
f    7
dtype: int64

Cumulative statistics functions

  • cumsum(): the sum of the first 1/2/3/.../n values
  • cummax(): the maximum of the first 1/2/3/.../n values
  • cummin(): the minimum of the first 1/2/3/.../n values
  • cumprod(): the product of the first 1/2/3/.../n values

The code is as follows:

frame.cumsum()

Output result:

	A	B	C	D	E	F	A-B
a	0	1	2	3	4	5	-1
b	6	8	10	12	14	16	-2
c	18	21	24	27	30	33	-3
d	36	40	44	48	52	56	-4
e	60	65	70	75	80	85	-5
f	90	96	102	108	114	120	-6
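
The other cumulative functions behave the same way. The sketch below applies all four to a short Series with made-up values, which makes the running results easy to check by hand.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])   # illustrative values

running_sum = s.cumsum().tolist()     # sum of the values seen so far
running_max = s.cummax().tolist()     # largest value seen so far
running_min = s.cummin().tolist()     # smallest value seen so far
running_prod = s.cumprod().tolist()   # product of the values seen so far

print(running_sum, running_max, running_min, running_prod)
```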

Custom calculations

  • Use "DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)" to apply a custom function
    • func: the function to apply
    • axis: 0 applies the function to each column, 1 applies it to each row; the default is 0.

Cumulative sum with apply.
The code is as follows:

frame.apply(np.cumsum,axis=0,result_type=None)

Output result:

	A	B	C	D	E	F	A-B
a	0	1	2	3	4	5	-1
b	6	8	10	12	14	16	-2
c	18	21	24	27	30	33	-3
d	36	40	44	48	52	56	-4
e	60	65	70	75	80	85	-5
f	90	96	102	108	114	120	-6

Apply a maximum-minus-minimum function to selected columns.
The code is as follows:

frame[['A','B']].apply(lambda x : x.max()-x.min())

Output result:

A    30
B    30
dtype: int64

Multiply selected columns by 2.
The code is as follows:

frame[['A','B']].apply(lambda x: x*2)

Output result:

	A	B
a	0	2
b	12	14
c	24	26
d	36	38
e	48	50
f	60	62
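
To illustrate the axis argument of apply, the sketch below runs the same max-minus-min function once per column (the default) and once per row (axis=1) on the original 6x6 frame. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

# axis=0 (default): the function receives each column as a Series
col_range = frame[['A', 'B']].apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```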

4. Add

Columns

  • Add an empty column

The code is as follows:

frame['price']=''

Output result:

	A	B	C	D	E	F	A-B	price
a	0	1	2	3	4	5	-1	
b	6	7	8	9	10	11	-1	
c	12	13	14	15	16	17	-1	
d	18	19	20	21	22	23	-1	
e	24	25	26	27	28	29	-1	
f	30	31	32	33	34	35	-1

The code is as follows:

frame['price'] = pd.Series(dtype='int',index=['a','b','c','d','e','f'])
frame['price']=0

Output result:

A	B	C	D	E	F	A-B	price
a	0	1	2	3	4	5	-1	0
b	6	7	8	9	10	11	-1	0
c	12	13	14	15	16	17	-1	0
d	18	19	20	21	22	23	-1	0
e	24	25	26	27	28	29	-1	0
f	30	31	32	33	34	35	-1	0

  • Add a column at the end, or insert one at a specified position

A column can be added the same way as a dictionary entry: assign a list to a new column name. Note that the length of the list must equal the length of the index.
The code is as follows:

frame['G']=['999','999','999','999','999','999']

Output result:

A	B	C	D	E	F	A-B	price	G
a	0	1	2	3	4	5	-1	0	999
b	6	7	8	9	10	11	-1	0	999
c	12	13	14	15	16	17	-1	0	999
d	18	19	20	21	22	23	-1	0	999
e	24	25	26	27	28	29	-1	0	999
f	30	31	32	33	34	35	-1	0	999

  • If the values must line up with particular index labels, add the column as a Series.
    Note: when creating the Series you must specify index, because its default index is 0, 1, 2... Assignment aligns on index labels, so if the DataFrame index is different, every value becomes NaN.

The code is as follows:

frame['H']=pd.Series([1,2,3])
frame

Output result:

	A	B	C	D	E	F	A-B	price	G	H
a	0	1	2	3	4	5	-1	0	999	NaN
b	6	7	8	9	10	11	-1	0	999	NaN
c	12	13	14	15	16	17	-1	0	999	NaN
d	18	19	20	21	22	23	-1	0	999	NaN
e	24	25	26	27	28	29	-1	0	999	NaN
f	30	31	32	33	34	35	-1	0	999	NaN

The code is as follows:

frame['H']=pd.Series([1,2,3,4,5,6],index=['a','b','c','d','e','f'])
frame

Output result:

	A	B	C	D	E	F	A-B	price	G	H
a	0	1	2	3	4	5	-1	0	999	1
b	6	7	8	9	10	11	-1	0	999	2
c	12	13	14	15	16	17	-1	0	999	3
d	18	19	20	21	22	23	-1	0	999	4
e	24	25	26	27	28	29	-1	0	999	5
f	30	31	32	33	34	35	-1	0	999	6
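
The alignment behaviour described in the note can be seen on a small frame. This sketch uses a hypothetical three-row DataFrame (not the article's frame) and made-up column names.

```python
import pandas as pd

small = pd.DataFrame({'A': [0, 6, 12]}, index=list('abc'))

# Default Series index is 0, 1, 2: no label matches 'a', 'b', 'c', so all NaN
small['unaligned'] = pd.Series([1, 2, 3])

# Matching labels: the values are placed on the rows with the same label
small['aligned'] = pd.Series([1, 2, 3], index=list('abc'))

# Alignment is by label, not by position: 'c' -> 3, 'a' -> 1, 'b' -> 2
small['reordered'] = pd.Series([3, 1, 2], index=list('cab'))

print(small)
```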

  • Use insert to place the new column at a given position; the columns after it shift to the right.

The code is as follows:

# Insert a column named 'QQ' with the values ['999','999','999','999','999','999'] as the first column; the other columns shift right.
frame.insert(0, 'QQ', ['999','999','999','999','999','999'])

Output result:

	QQ	A	B	C	D	E	F	A-B	price	G
a	999	0	1	2	3	4	5	-1	0	999
b	999	6	7	8	9	10	11	-1	0	999
c	999	12	13	14	15	16	17	-1	0	999
d	999	18	19	20	21	22	23	-1	0	999
e	999	24	25	26	27	28	29	-1	0	999
f	999	30	31	32	33	34	35	-1	0	999

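Rows

  • Add a row

Use loc to assign a new row directly. The output below comes from a frame whose nine columns are B, C, D, E, F, A-B, price, H and J, so the new row has nine values.
The code is as follows: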

new_data_list=['666','999','555','3','4','8','0','0','1']
frame.loc[6]=new_data_list
frame

Output result:

	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
6	666	999	555	3	4	8	0	0	1

Use loc with a string label to assign a new row directly.
The code is as follows:

frame.loc['g']=['666','999','555','3','4','8','0','0','1']

Output result:

B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
6	666	999	555	3	4	8	0	0	1
g	666	999	555	3	4	8	0	0	1
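
Besides assigning through loc, rows can also be appended with pd.concat. The sketch below uses a hypothetical two-column frame to keep the values short; small, new_rows, and combined are illustrative names.

```python
import pandas as pd

small = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])

# Add one row by label; the list length must match the number of columns
small.loc['c'] = [5, 6]

# Append one or more rows held in another DataFrame with the same columns
new_rows = pd.DataFrame({'A': [7], 'B': [8]}, index=['d'])
combined = pd.concat([small, new_rows])

print(combined)
```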

5. Modify

  • Use DataFrame.rename(mapper=None, *, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')

Rename a column.
The code is as follows:

frame.rename(columns={'A': 'key'}, inplace=False)

Output result:

key	B	C	D	E	F	price
a	0	1	2	3	4	5	0
b	6	7	8	9	10	11	0
c	12	13	14	15	16	17	0
d	18	19	20	21	22	23	0
e	24	25	26	27	28	29	0
f	30	31	32	33	34	35	0
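
Because inplace defaults to False, rename returns a new DataFrame and leaves the original untouched. The sketch below, on a small hypothetical frame, renames one column and one row label and then checks the original.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['a', 'b'], columns=list('ABC'))

# columns= renames column labels, index= renames row labels
renamed = small.rename(columns={'A': 'key'}, index={'a': 'first'})

print(renamed)
print(small.columns.tolist())   # the original labels are unchanged
```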

6. Delete

  • drop() function
    • Rows are dropped by default (axis=0), and the original data is not modified; a new DataFrame is returned.
    • Specify axis=1 to drop columns.
    • Specify inplace=True to modify the original data directly.

The code is as follows:

frame.drop('c')  # returns a copy without row 'c'; frame itself is unchanged
frame

Output result:

	A	B	C	D	E	F	A-B	price	G	H
a	0	1	2	3	4	5	-1	0	999	1
b	6	7	8	9	10	11	-1	0	999	2
c	12	13	14	15	16	17	-1	0	999	3
d	18	19	20	21	22	23	-1	0	999	4
e	24	25	26	27	28	29	-1	0	999	5
f	30	31	32	33	34	35	-1	0	999	6

The code is as follows:

frame.drop('A',axis=1,inplace=True)
frame

Output result:

	B	C	D	E	F	A-B	price	G	H
a	1	2	3	4	5	-1	0	999	1
b	7	8	9	10	11	-1	0	999	2
c	13	14	15	16	17	-1	0	999	3
d	19	20	21	22	23	-1	0	999	4
e	25	26	27	28	29	-1	0	999	5
f	31	32	33	34	35	-1	0	999	6

  • del deletes the column from the original data directly.

The code is as follows:

del frame['G']
frame

Output result:

	B	C	D	E	F	A-B	price	H
a	1	2	3	4	5	-1	0	1
b	7	8	9	10	11	-1	0	2
c	13	14	15	16	17	-1	0	3
d	19	20	21	22	23	-1	0	4
e	25	26	27	28	29	-1	0	5
f	31	32	33	34	35	-1	0	6
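
As an alternative to the axis argument, drop also accepts the index= and columns= keywords, which state directly whether rows or columns are meant. A minimal sketch on the original 6x6 frame, with illustrative variable names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))

without_row = frame.drop(index='c')               # drop a row by label
without_cols = frame.drop(columns=['A', 'B'])     # drop columns by label

print(without_row.index.tolist())
print(without_cols.columns.tolist())
print(frame.shape)   # the original frame keeps all rows and columns
```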

7. Deduplication

  • DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
    • subset: the columns to consider when identifying duplicates; all columns by default.
    • keep: which duplicate to keep, one of {'first', 'last', False}, default 'first'. If False, all duplicated rows are removed.
    • inplace: whether to modify the original DataFrame.

Original data:

	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
6	666	999	555	3	4	8	0	0	6
g	666	999	555	3	4	8	0	0	6

Remove duplicate rows, keeping the last occurrence of each.
The code is as follows:

# Drop duplicate rows
frame.drop_duplicates(keep='last')

Output result:

	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
g	666	999	555	3	4	8	0	0	6

Drop the rows that have duplicate values in column 'J'.
The code is as follows:

frame.drop_duplicates(subset=('J',))

Output result:

	B	C	D	E	F	A-B	price	H	J
a	1	2	3	4	5	-1	0	1	1
b	7	8	9	10	11	-1	0	2	2
c	13	14	15	16	17	-1	0	3	3
d	19	20	21	22	23	-1	0	4	4
e	25	26	27	28	29	-1	0	5	5
f	31	32	33	34	35	-1	0	6	6
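
Before dropping anything, duplicated() can be used to see which rows would count as duplicates. The sketch below uses a small made-up DataFrame and also shows the effect of keep=False.

```python
import pandas as pd

df = pd.DataFrame({'k': [1, 2, 2, 3], 'v': ['a', 'b', 'b', 'c']})

dup_mask = df.duplicated()                   # True for repeats after the first occurrence
first_kept = df.drop_duplicates()            # keep='first' is the default
none_kept = df.drop_duplicates(keep=False)   # remove every row that has a duplicate

print(dup_mask.tolist())
print(first_kept)
print(none_kept)
```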

8. Sorting

  • sort_values()
    • by: the column (or list of columns) to sort by
    • ascending: whether to sort in ascending order; the default is True

The code is as follows:

frame.sort_values(by='J',ascending=False)  

Output result:

	B	C	D	E	F	A-B	price	H	J
f	31	32	33	34	35	-1	0	6	6
6	666	999	555	3	4	8	0	0	6
g	666	999	555	3	4	8	0	0	6
e	25	26	27	28	29	-1	0	5	5
d	19	20	21	22	23	-1	0	4	4
c	13	14	15	16	17	-1	0	3	3
b	7	8	9	10	11	-1	0	2	2
a	1	2	3	4	5	-1	0	1	1
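
When the sort column contains ties, as J does above, a second column can break them. The sketch below uses a small made-up DataFrame to sort by two columns with different directions, and also shows sort_index for sorting by row label.

```python
import pandas as pd

df = pd.DataFrame({'J': [6, 1, 6, 3], 'B': [31, 1, 666, 13]},
                  index=['f', 'a', 'g', 'c'])

# J descending; rows with equal J are ordered by B ascending
sorted_df = df.sort_values(by=['J', 'B'], ascending=[False, True])

# Sort by row label instead of by column values
by_label = df.sort_index()

print(sorted_df)
print(by_label)
```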

9. Merge

  • merge joins two DataFrames on their shared columns
  • join joins two DataFrames on their indexes
  • concat concatenates Series or DataFrames row-wise or column-wise

merge method

Joins based on a single column

Inner join.
The code is as follows:

import pandas as pd
import numpy as np
# Define df1
df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5],
                    'freatrue1': np.arange(0, 1, 0.2),
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
# Define df2
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink'],
                    'fruits': ['apple', 'grades', 'watermelon', 'pear', 'mango']})
# Inner join on the shared column A
df3 = pd.merge(df1, df2, how='inner', on='A')

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A  freatrue1 feature2
0  1        0.0      one
1  2        0.2      two
2  2        0.4    three
3  4        0.6     four
4  5        0.8     five
>>df2
   A   color      fruits
0  1     red       apple
1  1    blue      grades
2  2  orange  watermelon
3  7  purple        pear
4  8    pink       mango
>>df3
   A  freatrue1 feature2   color      fruits
0  1        0.0      one     red       apple
1  1        0.0      one    blue      grades
2  2        0.2      two  orange  watermelon
3  2        0.4    three  orange  watermelon

Outer join

  • Joins on the union of the key values in both DataFrames: how='outer', on=the shared column name. Where a key value exists in only one DataFrame, the columns that come from the other DataFrame are filled with NaN.

The code is as follows:

import pandas as pd
import numpy as np
# Define df1
df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5],
                    'freatrue1': np.arange(0, 1, 0.2),
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
# Define df2
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink'],
                    'fruits': ['apple', 'grades', 'watermelon', 'pear', 'mango']})
# Outer join on the shared column A
df3 = pd.merge(df1, df2, how='outer', on='A')

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A  freatrue1 feature2
0  1        0.0      one
1  2        0.2      two
2  2        0.4    three
3  4        0.6     four
4  5        0.8     five
>>df2
   A   color      fruits
0  1     red       apple
1  1    blue      grades
2  2  orange  watermelon
3  7  purple        pear
4  8    pink       mango
>>df3
   A  freatrue1 feature2   color      fruits
0  1        0.0      one     red       apple
1  1        0.0      one    blue      grades
2  2        0.2      two  orange  watermelon
3  2        0.4    three  orange  watermelon
4  4        0.6     four     NaN         NaN
5  5        0.8     five     NaN         NaN
6  7        NaN      NaN  purple        pear
7  8        NaN      NaN    pink       mango

Left join

  • Keeps every key value of the left DataFrame: how='left', on=the shared column name. Where a left key has no match in the right DataFrame, the right DataFrame's columns are filled with NaN.

The code is as follows:

import pandas as pd
import numpy as np
# Define df1
df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5],
                    'freatrue1': np.arange(0, 1, 0.2),
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
# Define df2
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink'],
                    'fruits': ['apple', 'grades', 'watermelon', 'pear', 'mango']})
# Left join on the shared column A
df3 = pd.merge(df1, df2, how='left', on='A')

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A  freatrue1 feature2
0  1        0.0      one
1  2        0.2      two
2  2        0.4    three
3  4        0.6     four
4  5        0.8     five
>>df2
   A   color      fruits
0  1     red       apple
1  1    blue      grades
2  2  orange  watermelon
3  7  purple        pear
4  8    pink       mango
>>df3
   A  freatrue1 feature2   color      fruits
0  1        0.0      one     red       apple
1  1        0.0      one    blue      grades
2  2        0.2      two  orange  watermelon
3  2        0.4    three  orange  watermelon
4  4        0.6     four     NaN         NaN
5  5        0.8     five     NaN         NaN

Right join

  • Keeps every key value of the right DataFrame: how='right', on=the shared column name. Where a right key has no match in the left DataFrame, the left DataFrame's columns are filled with NaN.

The code is as follows:

import pandas as pd
import numpy as np
# Define df1
df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5],
                    'freatrue1': np.arange(0, 1, 0.2),
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
# Define df2
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink'],
                    'fruits': ['apple', 'grades', 'watermelon', 'pear', 'mango']})
# Right join on the shared column A
df3 = pd.merge(df1, df2, how='right', on='A')

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A  freatrue1 feature2
0  1        0.0      one
1  2        0.2      two
2  2        0.4    three
3  4        0.6     four
4  5        0.8     five
>>df2
   A   color      fruits
0  1     red       apple
1  1    blue      grades
2  2  orange  watermelon
3  7  purple        pear
4  8    pink       mango
>>df3
   A  freatrue1 feature2   color      fruits
0  1        0.0      one     red       apple
1  1        0.0      one    blue      grades
2  2        0.2      two  orange  watermelon
3  2        0.4    three  orange  watermelon
4  7        NaN      NaN  purple        pear
5  8        NaN      NaN    pink       mango
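
To compare the four join types side by side, the sketch below counts the rows each one produces for the same key columns as above, and uses merge's indicator=True option to label where each row of the outer join came from. The reduced column set and the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5],
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink']})

# Number of result rows for each join type
sizes = {how: len(pd.merge(df1, df2, how=how, on='A'))
         for how in ['inner', 'outer', 'left', 'right']}

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(df1, df2, how='outer', on='A', indicator=True)
counts = merged['_merge'].value_counts()

print(sizes)
print(counts)
```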

Joins based on multiple columns

Inner join (intersection) on multiple columns.
The code is as follows:

df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5], 'B': ['a', 'b', 'a', 'd', 'c'],
                    'freatrue1': np.arange(0, 1, 0.2),
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8], 'B': ['e', 'g', 'a', 'd', 'c'],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink'],
                    'fruits': ['apple', 'grades', 'watermelon', 'pear', 'mango']})
df3 = pd.merge(df1, df2, how='inner', on=['A', 'B'])

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A  B  freatrue1 feature2
0  1  a        0.0      one
1  2  b        0.2      two
2  2  a        0.4    three
3  4  d        0.6     four
4  5  c        0.8     five
>>df2
   A  B   color      fruits
0  1  e     red       apple
1  1  g    blue      grades
2  2  a  orange  watermelon
3  7  d  purple        pear
4  8  c    pink       mango
>>df3
   A  B  freatrue1 feature2   color      fruits
0  2  a        0.4    three  orange  watermelon

Outer join (union) on multiple columns.
The code is as follows:

df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5], 'B': ['a', 'b', 'a', 'd', 'c'],
                    'freatrue1': np.arange(0, 1, 0.2),
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8], 'B': ['e', 'g', 'a', 'd', 'c'],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink'],
                    'fruits': ['apple', 'grades', 'watermelon', 'pear', 'mango']})
df3 = pd.merge(df1, df2, how='outer', on=['A', 'B'])

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A  B  freatrue1 feature2
0  1  a        0.0      one
1  2  b        0.2      two
2  2  a        0.4    three
3  4  d        0.6     four
4  5  c        0.8     five
>>df2
   A  B   color      fruits
0  1  e     red       apple
1  1  g    blue      grades
2  2  a  orange  watermelon
3  7  d  purple        pear
4  8  c    pink       mango
>>df3
   A  B  freatrue1 feature2   color      fruits
0  1  a        0.0      one     NaN         NaN
1  2  b        0.2      two     NaN         NaN
2  2  a        0.4    three  orange  watermelon
3  4  d        0.6     four     NaN         NaN
4  5  c        0.8     five     NaN         NaN
5  1  e        NaN      NaN     red       apple
6  1  g        NaN      NaN    blue      grades
7  7  d        NaN      NaN  purple        pear
8  8  c        NaN      NaN    pink       mango

Left join on multiple columns.
The code is as follows:

df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5], 'B': ['a', 'b', 'a', 'd', 'c'],
                    'freatrue1': np.arange(0, 1, 0.2),
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8], 'B': ['e', 'g', 'a', 'd', 'c'],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink'],
                    'fruits': ['apple', 'grades', 'watermelon', 'pear', 'mango']})
df3 = pd.merge(df1, df2, how='left', on=['A', 'B'])

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A  B  freatrue1 feature2
0  1  a        0.0      one
1  2  b        0.2      two
2  2  a        0.4    three
3  4  d        0.6     four
4  5  c        0.8     five
>>df2
   A  B   color      fruits
0  1  e     red       apple
1  1  g    blue      grades
2  2  a  orange  watermelon
3  7  d  purple        pear
4  8  c    pink       mango
>>df3
   A  B  freatrue1 feature2   color      fruits
0  1  a        0.0      one     NaN         NaN
1  2  b        0.2      two     NaN         NaN
2  2  a        0.4    three  orange  watermelon
3  4  d        0.6     four     NaN         NaN
4  5  c        0.8     five     NaN         NaN

Index-based joins

The code is as follows:

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 2, 4, 5], 'B': ['a', 'b', 'a', 'd', 'c'],
                    'freatrue1': np.arange(0, 1, 0.2),
                    'feature2': ['one', 'two', 'three', 'four', 'five']})
df2 = pd.DataFrame({'A': [1, 1, 2, 7, 8], 'B': ['e', 'g', 'a', 'd', 'c'],
                    'color': ['red', 'blue', 'orange', 'purple', 'pink'],
                    'fruits': ['apple', 'grades', 'watermelon', 'pear', 'mango']},
                   index=[4, 5, 6, 7, 8])
# Join df1's column A with df2's index
df3 = pd.merge(df1, df2, how='inner', left_on='A', right_index=True)

# Use suffixes to set the suffixes added to identically named columns other than the join key
df4 = pd.merge(df1, df2, how='inner', left_on='A', right_index=True, suffixes=('_df1', '_df2'))

print(df1)
print(df2)
print(df3)
print(df4)

Output result:

>>df1
   A  B  freatrue1 feature2
0  1  a        0.0      one
1  2  b        0.2      two
2  2  a        0.4    three
3  4  d        0.6     four
4  5  c        0.8     five
>>df2
   A  B   color      fruits
4  1  e     red       apple
5  1  g    blue      grades
6  2  a  orange  watermelon
7  7  d  purple        pear
8  8  c    pink       mango
>>df3
   A  A_x B_x  freatrue1 feature2  A_y B_y color  fruits
3  4    4   d        0.6     four    1   e   red   apple
4  5    5   c        0.8     five    1   g  blue  grades
>>df4
   A  A_df1 B_df1  freatrue1 feature2  A_df2 B_df2 color  fruits
3  4      4     d        0.6     four      1     e   red   apple
4  5      5     c        0.8     five      1     g  blue  grades

join method

  • join connects DataFrames on their indexes. As with merge, inner, outer, left and right joins are supported.

Index-to-index join

The code is as follows:

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['red', 'blue', 'orange', 'purple', 'pink']})
df2 = pd.DataFrame({'A': [1, 2, 3], 'fruits': ['apple', 'grades', 'watermelon']})
# lsuffix and rsuffix set the suffixes for overlapping column names
df3 = df1.join(df2, lsuffix='_caller', rsuffix='_other', how='inner')

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A       B
0  1     red
1  2    blue
2  3  orange
3  4  purple
4  5    pink
>>df2
   A      fruits
0  1       apple
1  2      grades
2  3  watermelon
>>df3
   A_caller       B  A_other      fruits
0         1     red        1       apple
1         2    blue        2      grades
2         3  orange        3  watermelon

Join based on columns

The code is as follows:

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['red', 'blue', 'orange', 'purple', 'pink']})
df2 = pd.DataFrame({'A': [1, 2, 3], 'fruits': ['apple', 'grades', 'watermelon']})
# Join on column A by making it the index of both DataFrames
df3 = df1.set_index('A').join(df2.set_index('A'), how='inner')

print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A       B
0  1     red
1  2    blue
2  3  orange
3  4  purple
4  5    pink
>>df2
   A      fruits
0  1       apple
1  2      grades
2  3  watermelon
>>df3
        B      fruits
A                    
1     red       apple
2    blue      grades
3  orange  watermelon
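
As a possible cross-check, assuming the same df1 and df2 as in the example above, the join on a shared column can also be written with merge. The sketch below builds both results and compares their contents; joined and merged are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                    'B': ['red', 'blue', 'orange', 'purple', 'pink']})
df2 = pd.DataFrame({'A': [1, 2, 3],
                    'fruits': ['apple', 'grades', 'watermelon']})

# join works on indexes, so column A is moved into the index first
joined = df1.set_index('A').join(df2.set_index('A'), how='inner')

# merge works on columns directly; set_index afterwards for comparison
merged = pd.merge(df1, df2, how='inner', on='A').set_index('A')

print(joined)
print(merged)
```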

concat method

  • concat concatenates pandas objects (Series or DataFrame) either row-wise or column-wise. The default is row-wise concatenation, and the default join is outer (union).

Concatenating Series

Row-wise concatenation

The code is as follows:

df1=pd.Series([1,2,3],index=['a','b','c'])
df2=pd.Series([4,5,6],index=['b','c','d'])
df3=pd.concat([df1,df2])
# When the inputs share index labels, keys adds an outer index level that identifies the source of each row
df4=pd.concat([df1,df2],keys=['fea1','fea2'])

print(df1)
print(df2)
print(df3)
print(df4)

Output result:

>>df1
a    1
b    2
c    3
dtype: int64
>>df2
b    4
c    5
d    6
dtype: int64
>>df3
a    1
b    2
c    3
b    4
c    5
d    6
dtype: int64
>>df4
fea1  a    1
      b    2
      c    3
fea2  b    4
      c    5
      d    6
dtype: int64 

Column-wise concatenation

  • By default the indexes are combined by union

The code is as follows:

pd.concat([df1,df2],axis=1)

Output result:

	0	1
a	1.0	NaN
b	2.0	4.0
c	3.0	5.0
d	NaN	6.0

  • Combining by intersection
    • keys: sets the column names of the result
    • join: 'inner' keeps only the intersection of the indexes

The code is as follows:

pd.concat([df1,df2],axis=1,join='inner',keys=['fea1','fea2'])

Output result:

	fea1	fea2
b	2	4
c	3	5

Concatenating DataFrames

Row-wise concatenation

The code is as follows:

df1 = pd.DataFrame({'A': [1, 2, 3], 'fea1': ['b', 'c', 'd']})
df2 = pd.DataFrame({'A': [4, 5, 6], 'fea1': ['a', 'b', 'c']})

df3 = pd.concat([df1, df2])  # row-wise concatenation
print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A fea1
0  1    b
1  2    c
2  3    d
>>df2
   A fea1
0  4    a
1  5    b
2  6    c
>>df3
   A fea1
0  1    b
1  2    c
2  3    d
0  4    a
1  5    b
2  6    c

Column-wise concatenation.
The code is as follows:

df1 = pd.DataFrame({'A': [1, 2, 3], 'fea1': ['b', 'c', 'd']})
df2 = pd.DataFrame({'A': [4, 5, 6], 'fea1': ['a', 'b', 'c']})

df3 = pd.concat([df1, df2], axis=1)  # column-wise concatenation
print(df1)
print(df2)
print(df3)

Output result:

>>df1
   A fea1
0  1    b
1  2    c
2  3    d
>>df2
   A fea1
0  4    a
1  5    b
2  6    c
>>df3
   A fea1  A fea1
0  1    b  4    a
1  2    c  5    b
2  3    d  6    c
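
Both results above carry repeated labels: the row-wise result repeats the index 0, 1, 2, and the column-wise result repeats the column names. The sketch below shows two options that may help, ignore_index=True for a fresh row index and keys for an outer column level. The key names 'left' and 'right' are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'fea1': ['b', 'c', 'd']})
df2 = pd.DataFrame({'A': [4, 5, 6], 'fea1': ['a', 'b', 'c']})

# Row-wise: discard the original indexes and number the rows 0..n-1
stacked = pd.concat([df1, df2], ignore_index=True)

# Column-wise: keys adds an outer column level so repeated names stay distinguishable
side_by_side = pd.concat([df1, df2], axis=1, keys=['left', 'right'])

print(stacked)
print(side_by_side)
```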

Origin blog.csdn.net/sodaloveer/article/details/126061582