Preface
For more details on the code in this article, see the author's personal website: https://www.iwtmbtly.com/
Import the required libraries and files:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('data/table.csv')
>>> df.head()
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
1 S_1 C_1 1102 F street_2 192 73 32.5 B+
2 S_1 C_1 1103 M street_2 186 82 87.2 B+
3 S_1 C_1 1104 F street_2 167 81 80.4 B-
4 S_1 C_1 1105 F street_4 159 64 84.8 B+
One, append and assign
(1) append method
1. Add a row from a Series (the Series must have a name, which becomes the label of the new row)
>>> df_append = df.loc[:3, ['Gender', 'Height']].copy()
>>> df_append
Gender Height
0 M 173
1 F 192
2 M 186
3 F 167
>>> s = pd.Series({'Gender': 'F', 'Height': 188}, name='new_row')
>>> df_append.append(s)
Gender Height
0 M 173
1 F 192
2 M 186
3 F 167
new_row F 188
2. Add rows from a DataFrame
>>> df_temp = pd.DataFrame({'Gender': ['F', 'M'], 'Height': [188, 176]},
...                        index=['new_1', 'new_2'])
>>> df_append.append(df_temp)
Gender Height
0 M 173
1 F 192
2 M 186
3 F 167
new_1 F 188
new_2 M 176
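DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the two calls above only run on older versions. A minimal sketch of the same operations with pd.concat follows. It assumes a small hand-built frame in place of data/table.csv, and the names one_row and many_rows are illustrative only:

```python
import pandas as pd

# Stand-in for df.loc[:3, ['Gender', 'Height']]; values copied from the table above.
df_append = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F'],
                          'Height': [173, 192, 186, 167]})

# One row: turn the named Series into a one-row DataFrame, then concatenate.
s = pd.Series({'Gender': 'F', 'Height': 188}, name='new_row')
one_row = pd.concat([df_append, s.to_frame().T])

# Several rows: concat accepts the DataFrame directly.
df_temp = pd.DataFrame({'Gender': ['F', 'M'], 'Height': [188, 176]},
                       index=['new_1', 'new_2'])
many_rows = pd.concat([df_append, df_temp])

print(one_row)
print(many_rows)
```

Note that s.to_frame().T produces object-dtype columns, so Height in one_row may need an explicit astype(int) if a numeric column is required.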
(2) assign method
This method is mainly used to add columns; the column names are given directly as keyword arguments:
>>> s = pd.Series(list('abcd'), index=range(4))
>>> df_append.assign(Letter=s)
Gender Height Letter
0 M 173 a
1 F 192 b
2 M 186 c
3 F 167 d
Multiple columns can be added at once:
>>> df_append.assign(col1=lambda x:x['Gender']*2,
... col2=s)
Gender Height col1 col2
0 M 173 MM a
1 F 192 FF b
2 M 186 MM c
3 F 167 FF d
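Two properties of assign are easy to miss: it returns a new DataFrame instead of modifying the caller, and its keyword arguments are evaluated in order, so a later lambda can read a column created by an earlier one. A small sketch, where the frame df_small and the column names Height_m and Tall are made up for illustration:

```python
import pandas as pd

df_small = pd.DataFrame({'Gender': ['M', 'F'], 'Height': [173, 192]})

# Height_m is created first, so the second lambda can already read it.
result = df_small.assign(
    Height_m=lambda x: x['Height'] / 100,
    Tall=lambda x: x['Height_m'] > 1.8,
)

print(result)
print(df_small.columns.tolist())  # the original frame is left unchanged
```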
Two, combine and update
(1) combine method
Both combine and update fill one table with values from another table according to a rule
1. Fill the object
The following example shows that combine processes the columns one at a time, in table order, and aligns the indexes automatically; positions that are missing after alignment are NaN. Understanding this is important, because the function receives whole columns (Series) rather than single values. Since print returns None, every cell of the result below is NaN:
>>> df_combine_1 = df.loc[:1,['Gender','Height']].copy()
>>> df_combine_2 = df.loc[10:11,['Gender','Height']].copy()
>>> df_combine_1.combine(df_combine_2,lambda x,y:print(x,y))
0 M
1 F
10 NaN
11 NaN
Name: Gender, dtype: object 0 NaN
1 NaN
10 M
11 F
Name: Gender, dtype: object
0 173.0
1 192.0
10 NaN
11 NaN
Name: Height, dtype: float64 0 NaN
1 NaN
10 161.0
11 175.0
Name: Height, dtype: float64
Gender Height
0 NaN NaN
1 NaN NaN
10 NaN NaN
11 NaN NaN
2. Some examples
(a) For each column, keep the version from the table whose column mean is larger
>>> df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> df2 = pd.DataFrame({'A': [8, 7], 'B': [6, 5]})
>>> df1.combine(df2,lambda x,y:x if x.mean()>y.mean() else y)
A B
0 8 6
1 7 5
(b) Index alignment (by default, rows and columns of df1 that do not appear in the second table are set to NaN)
>>> df2 = pd.DataFrame({'B': [8, 7], 'C': [6, 5]}, index=[1, 2])
>>> df1.combine(df2,lambda x,y:x if x.mean()>y.mean() else y)
A B C
0 NaN NaN NaN
1 NaN 8.0 6.0
2 NaN 7.0 5.0
(c) With overwrite=False, values of df1 in columns that df2 does not have are kept instead of being overwritten
>>> df1.combine(df2,lambda x,y:x if x.mean()>y.mean() else y,overwrite=False)
A B C
0 1.0 NaN NaN
1 2.0 8.0 6.0
2 NaN 7.0 5.0
(d) With fill_value=-1, positions that are missing after alignment are filled with -1 before the function is applied
>>> df1.combine(df2,lambda x,y:x if x.mean()>y.mean() else y,fill_value=-1)
A B C
0 1.0 -1.0 -1.0
1 2.0 8.0 6.0
2 -1.0 7.0 5.0
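To make the column-by-column behavior of combine concrete, the sketch below records what the function receives on each call. The helper pick_larger_mean and the list seen are hypothetical names introduced for this illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [8, 7], 'C': [6, 5]}, index=[1, 2])

seen = []  # (column name, aligned index) for every call of the function

def pick_larger_mean(x, y):
    # x and y are the aligned columns of df1 and df2 with the same name.
    seen.append((x.name, list(x.index)))
    return x if x.mean() > y.mean() else y

result = df1.combine(df2, pick_larger_mean)

print(seen)
print(result)
```

Each call sees the union of both indexes, which is why a comparison against an all-NaN column (such as A, which df2 lacks) falls through to the else branch.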
3. combine_first method
This method uses df2 to fill the missing values of df1. It does less than combine, but in practice it is used more often. Here are two examples:
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
A B
0 1.0 3.0
1 0.0 4.0
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
A B C
0 NaN 4.0 NaN
1 0.0 3.0 1.0
2 NaN 3.0 1.0
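One way to see what combine_first adds is to compare it with fillna, which also fills gaps from another frame but keeps the caller's shape. A sketch under that framing; the names primary and backup are illustrative:

```python
import numpy as np
import pandas as pd

primary = pd.DataFrame({'A': [np.nan, 0.0], 'B': [4.0, np.nan]})
backup = pd.DataFrame({'B': [3.0, 3.0], 'C': [1.0, 1.0]}, index=[1, 2])

# combine_first: values in primary win; the result covers the union
# of both indexes and both sets of columns.
patched = primary.combine_first(backup)

# fillna with a DataFrame: same filling rule, but only within primary's
# own rows and columns.
filled = primary.fillna(backup)

print(patched)
print(filled)
```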
(2) update method
1. Three characteristics
- The result keeps only the index of the calling DataFrame (a left join is used by default; joins are introduced in a later section)
- NaN elements in the second DataFrame have no effect
- There is no return value; the calling DataFrame is modified in place
2. Examples
(a) Operation when the index is fully aligned
>>> df1 = pd.DataFrame({'A': [1, 2, 3],
...                     'B': [400, 500, 600]})
>>> df2 = pd.DataFrame({'B': [4, 5, 6],
...                     'C': [7, 8, 9]})
>>> df1.update(df2)
>>> df1
A B
0 1 4
1 2 5
2 3 6
(b) Partial filling
>>> df1 = pd.DataFrame({'A': ['a', 'b', 'c'],
...                     'B': ['x', 'y', 'z']})
>>> df2 = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2])
>>> df1.update(df2)
>>> df1
A B
0 a x
1 b d
2 c e
(c) Missing values in df2 are not used for filling
>>> df1 = pd.DataFrame({'A': [1, 2, 3],
...                     'B': [400, 500, 600]})
>>> df2 = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df1.update(df2)
>>> df1
A B
0 1 4.0
1 2 500.0
2 3 6.0
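The three characteristics of update can be checked in a single run. The sketch below uses float columns throughout to avoid dtype changes, and shifts the index of df2 so that one of its rows has no counterpart in df1; the variable returned is introduced only to capture the return value:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [400.0, 500.0, 600.0]})
df2 = pd.DataFrame({'B': [4.0, np.nan, 6.0], 'C': [7.0, 8.0, 9.0]},
                   index=[1, 2, 3])

returned = df1.update(df2)  # modifies df1 in place

print(returned)
print(df1)
```

Row 3 and column C of df2 are dropped because they do not exist in df1, and the NaN at label 2 leaves the original 600.0 untouched.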
Three, concat method
The concat method can concatenate along either of the two axes. The default is vertical concatenation (axis=0), and the default join is an outer join
An outer join takes the union of the labels on the axis that is not being concatenated, and 'inner' takes their intersection (with the default vertical concatenation, this is the intersection of the columns)
Here are some examples to illustrate its parameters:
>>> df1 = pd.DataFrame({'A': ['A0', 'A1'],
...                     'B': ['B0', 'B1']},
...                    index=[0, 1])
>>> df2 = pd.DataFrame({'A': ['A2', 'A3'],
...                     'B': ['B2', 'B3']},
...                    index=[2, 3])
>>> df3 = pd.DataFrame({'A': ['A1', 'A3'],
...                     'D': ['D1', 'D3'],
...                     'E': ['E1', 'E3']},
...                    index=[1, 3])
Concatenation with the default settings:
>>> pd.concat([df1,df2])
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
Concatenation along the column direction with axis=1:
>>> pd.concat([df1,df2],axis=1)
A B A B
0 A0 B0 NaN NaN
1 A1 B1 NaN NaN
2 NaN NaN A2 B2
3 NaN NaN A3 B3
join set to inner (since axis=0, the intersection of the columns is taken):
>>> pd.concat([df3,df1],join='inner')
A
1 A1
3 A3
0 A0
1 A1
join set to outer:
>>> pd.concat([df3,df1],join='outer',sort=True) # sort controls whether the columns are sorted; the default is False
A B D E
1 A1 NaN D1 E1
3 A3 NaN D3 E3
0 A0 B0 NaN NaN
1 A1 B1 NaN NaN
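verify_integrity checks whether the labels along the concatenation axis (here the row index) are unique:
>>> pd.concat([df3,df1],verify_integrity=True,sort=True) # raises ValueError: the index label 1 appears twice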
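The error can be caught and inspected, and ignore_index=True is one common way to avoid duplicate labels when the original row labels carry no meaning. A minimal sketch, using reduced copies of df1 and df3 and a flag variable raised that is introduced only for this example:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df3 = pd.DataFrame({'A': ['A1', 'A3'], 'D': ['D1', 'D3']}, index=[1, 3])

try:
    pd.concat([df3, df1], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([df3, df1], ignore_index=True)
print(renumbered.index.tolist())
```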
Similarly, Series can be added:
>>> s = pd.Series(['X0', 'X1'], name='X')
>>> pd.concat([df1,s],axis=1)
A B X
0 A0 B0 X0
1 A1 B1 X1
The keys parameter adds a label for each input DataFrame, which makes later indexing easier:
>>> pd.concat([df1,df2], keys=['x', 'y'])
      A   B
x 0  A0  B0
  1  A1  B1
y 2  A2  B2
  3  A3  B3
>>> pd.concat([df1,df2], keys=['x', 'y']).index
MultiIndex([('x', 0),
            ('x', 1),
            ('y', 2),
            ('y', 3)],
           )
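The labels created by keys form the outer level of a MultiIndex, so they can be used with .loc to recover the rows of one source table. A short sketch; from_y and row are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

combined = pd.concat([df1, df2], keys=['x', 'y'])

# Selecting on the outer level returns the rows that came from df2.
from_y = combined.loc['y']

# A tuple addresses a single row: outer label first, then the original label.
row = combined.loc[('x', 1)]

print(from_y)
print(row)
```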
Four, merge and join
(1) merge function
The merge function joins two pandas objects horizontally on key columns. When a key value is duplicated, the matching rows are combined as a Cartesian product. The default is an inner join; left, outer, and right joins are also available
A left join is based on the keys of the first (left) table: a key that appears only in the right table is not added, while a key that also appears in the left table is added, with its matches combined as a Cartesian product
merge/join differ from concat in the on parameter, which specifies the column(s) to use as the join key
Likewise, here are some examples:
>>> left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
...                      'key2': ['K0', 'K1', 'K0', 'K1'],
...                      'A': ['A0', 'A1', 'A2', 'A3'],
...                      'B': ['B0', 'B1', 'B2', 'B3']})
>>> right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
...                       'key2': ['K0', 'K0', 'K0', 'K0'],
...                       'C': ['C0', 'C1', 'C2', 'C3'],
...                       'D': ['D0', 'D1', 'D2', 'D3']})
>>> right2 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
...                        'key2': ['K0', 'K0', 'K0', 'K0'],
...                        'C': ['C0', 'C1', 'C2', 'C3']})
Join on key1. Non-key columns that share a name in both tables get the default suffixes=('_x','_y'):
>>> pd.merge(left, right, on='key1')
key1 key2_x A B key2_y C D
0 K0 K0 A0 B0 K0 C0 D0
1 K0 K1 A1 B1 K0 C0 D0
2 K1 K0 A2 B2 K0 C1 D1
3 K1 K0 A2 B2 K0 C2 D2
4 K2 K1 A3 B3 K0 C3 D3
Join on multiple keys:
>>> pd.merge(left, right, on=['key1','key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
An inner join is used by default: because merge only joins horizontally, the intersection of the keys is taken. Now look at what happens with how='outer'
Note: how here plays the same role as join in concat
>>> pd.merge(left, right, how='outer', on=['key1','key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
Left join:
>>> pd.merge(left, right, how='left', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
Right join:
>>> pd.merge(left, right, how='right', on=['key1', 'key2'])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
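A quick way to compare the four join types is to count the rows each one produces on the same pair of tables. The sketch below rebuilds left and right from above and collects the counts in a dictionary; row_counts is a name introduced here for illustration:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Number of result rows for each join type on the same key pair.
row_counts = {
    how: len(pd.merge(left, right, how=how, on=['key1', 'key2']))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)
```

The counts agree with the tables shown above: the inner join keeps only matched key pairs, left and right each add their own unmatched rows, and outer adds both.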
If the Cartesian product is still unclear, make sure you understand the following example. Every value of B is 2 in both tables, so each of the 2 rows on the left matches each of the 3 rows on the right, giving 2 × 3 = 6 rows:
>>> left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
>>> right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})
>>> pd.merge(left, right, on='B', how='outer')
A_x B A_y
0 1 2 4
1 1 2 5
2 1 2 6
3 2 2 4
4 2 2 5
5 2 2 6
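The row count of such a merge can be predicted before running it: for each key value, multiply the number of times it occurs on the left by the number of times it occurs on the right, then sum. A sketch of that check for an inner join; the names left_counts, right_counts, and expected_rows are illustrative:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# Occurrences of each key value on both sides.
left_counts = left['B'].value_counts()
right_counts = right['B'].value_counts()

# Keys present on only one side contribute 0 rows to an inner join.
expected_rows = int(left_counts.mul(right_counts, fill_value=0).sum())

merged = pd.merge(left, right, on='B')
print(expected_rows, len(merged))
```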
validate checks which side has duplicate keys: "one_to_one" requires the keys to be unique on both sides, and "one_to_many" requires them to be unique on the left.
>>> left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
>>> right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 3, 4]})
>>> #pd.merge(left, right, on='B', how='outer',validate='one_to_one') # raises MergeError: key 2 is duplicated on the left
>>> left = pd.DataFrame({'A': [1, 2], 'B': [2, 1]})
>>> pd.merge(left, right, on='B', how='outer',validate='one_to_one')
A_x B A_y
0 1.0 2 4.0
1 2.0 1 NaN
2 NaN 3 5.0
3 NaN 4 6.0
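The failing call can be wrapped in try/except to inspect the error, and a looser validate value shows which constraint was actually violated. A minimal sketch; error_message and ok are names introduced for this example:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})        # key 2 appears twice
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 3, 4]})  # keys are unique

try:
    pd.merge(left, right, on='B', how='outer', validate='one_to_one')
    error_message = None
except pd.errors.MergeError as err:
    error_message = str(err)

# many_to_one only requires unique keys on the right, so this passes.
ok = pd.merge(left, right, on='B', how='outer', validate='many_to_one')

print(error_message)
print(len(ok))
```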
The indicator parameter adds a column that records the source of each row after merging
>>> df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
>>> df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
>>> pd.merge(df1, df2, on='col1', how='outer', indicator=True) # indicator='indicator_column' also works
col1 col_left col_right _merge
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
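One practical use of the _merge column is an anti-join: keeping the rows of one table whose key has no match in the other. A sketch of that idea, which is an illustration added here rather than something the original shows; only_in_df1 is a made-up name:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

# A left join keeps every row of df1; _merge says whether a match was found.
merged = pd.merge(df1, df2, on='col1', how='left', indicator=True)

# Keep the unmatched rows and only the columns that df1 had originally.
only_in_df1 = merged.loc[merged['_merge'] == 'left_only', df1.columns]
print(only_in_df1)
```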
(2) join function
The join function joins multiple pandas objects horizontally on their indexes. When duplicate index labels are encountered, the Cartesian product is used. The default is a left join; inner, outer, and right joins are also available
>>> left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
...                      'B': ['B0', 'B1', 'B2']},
...                     index=['K0', 'K1', 'K2'])
>>> right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
...                       'D': ['D0', 'D2', 'D3']},
...                      index=['K0', 'K2', 'K3'])
>>> left.join(right)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
For many_to_one merges, join is often the more convenient choice
You can also specify a key column of the left table, which is matched against the index of the right table:
>>> left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
...                      'B': ['B0', 'B1', 'B2', 'B3'],
...                      'key': ['K0', 'K1', 'K0', 'K1']})
>>> right = pd.DataFrame({'C': ['C0', 'C1'],
...                       'D': ['D0', 'D1']},
...                      index=['K0', 'K1'])
>>> left.join(right, on='key')
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K0 C0 D0
3 A3 B3 K1 C1 D1
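The same result can be written with merge by matching the key column on the left against the index on the right, which may help when deciding between the two functions. A sketch comparing both spellings on the tables above:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']},
                     index=['K0', 'K1'])

joined = left.join(right, on='key')

# merge needs the join type and the index flag spelled out explicitly.
merged = pd.merge(left, right, left_on='key', right_index=True, how='left')

print(joined.equals(merged))
```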
Multi-level key (the key columns of the left table are matched against the levels of the right table's MultiIndex):
>>> left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
...                      'B': ['B0', 'B1', 'B2', 'B3'],
...                      'key1': ['K0', 'K0', 'K1', 'K2'],
...                      'key2': ['K0', 'K1', 'K0', 'K1']})
>>> index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
...                                    ('K2', 'K0'), ('K2', 'K1')],
...                                   names=['key1', 'key2'])
>>> right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
...                       'D': ['D0', 'D1', 'D2', 'D3']},
...                      index=index)
>>> left.join(right, on=['key1','key2'])
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A3 B3 K2 K1 C3 D3