Pandas study notes (5) - Pandas merge

Preface

For more articles and code details, see the author's personal website: https://www.iwtmbtly.com/

Import the required libraries and files:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('data/table.csv')
>>> df.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

One, append and assign

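(1) append method

Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples in this subsection need an older pandas version. On current versions, use pd.concat instead.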

1. Add rows by sequence (name must be specified)

>>> df_append = df.loc[:3, ['Gender', 'Height']].copy()
>>> df_append
  Gender  Height
0      M     173
1      F     192
2      M     186
3      F     167

>>> s = pd.Series({'Gender': 'F', 'Height': 188}, name='new_row')
>>> df_append.append(s)
        Gender  Height
0            M     173
1            F     192
2            M     186
3            F     167
new_row      F     188

2. Add table with DataFrame

>>> df_temp = pd.DataFrame({'Gender': ['F', 'M'], 'Height': [188, 176]},
...                        index=['new_1', 'new_2'])
>>> df_append.append(df_temp)
      Gender  Height
0          M     173
1          F     192
2          M     186
3          F     167
new_1      F     188
new_2      M     176
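
Since append is no longer available in pandas 2.0 and later, the two cases above can be reproduced with pd.concat. The following is a minimal sketch, assuming a small stand-in table with the same columns as df_append rather than the data read from table.csv:

```python
import pandas as pd

# Stand-in for df_append (same columns as in the text, values typed in by hand).
df_append = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F'],
                          'Height': [173, 192, 186, 167]})

# Case 1: a named Series becomes one row.
# to_frame().T turns it into a one-row DataFrame whose index label is the name.
s = pd.Series({'Gender': 'F', 'Height': 188}, name='new_row')
with_row = pd.concat([df_append, s.to_frame().T])

# Case 2: a DataFrame can be concatenated directly.
df_temp = pd.DataFrame({'Gender': ['F', 'M'], 'Height': [188, 176]},
                       index=['new_1', 'new_2'])
with_table = pd.concat([df_append, df_temp])

print(with_row)
print(with_table)
```

One side effect worth knowing: the Series mixes strings and integers, so the transposed row has object dtype, and the Height column of with_row is object as well.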

(2) assign method

This method is mainly used to add columns, and the column names are directly specified by parameters:

>>> s = pd.Series(list('abcd'), index=range(4))
>>> df_append.assign(Letter=s)
  Gender  Height Letter
0      M     173      a
1      F     192      b
2      M     186      c
3      F     167      d

Multiple columns can be added at once:

>>> df_append.assign(col1=lambda x:x['Gender']*2,
...                  col2=s)
  Gender  Height col1 col2
0      M     173   MM    a
1      F     192   FF    b
2      M     186   MM    c
3      F     167   FF    d
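
Two properties of assign are easy to miss: it returns a new DataFrame and leaves the original untouched, and a later keyword can refer to a column created by an earlier keyword in the same call. A small sketch, where the column names Height_m and Tall are made up for illustration:

```python
import pandas as pd

df_append = pd.DataFrame({'Gender': ['M', 'F'], 'Height': [173, 192]})

# Height_m is created first, so the second lambda can already use it.
result = df_append.assign(
    Height_m=lambda x: x['Height'] / 100,
    Tall=lambda x: x['Height_m'] > 1.8,
)

print(result)
print(list(df_append.columns))  # the original table has no new columns
```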

Two, combine and update

(1) combine method

Both combine and update fill one table with values from another table according to a rule.

1. Fill the object

The following example shows that combine iterates over the columns one by one, in the order they appear in the tables, and aligns the indexes automatically; labels missing from one side are filled with NaN. Because print returns None, every element of the result is NaN. It is very important to understand this alignment:

>>> df_combine_1 = df.loc[:1,['Gender','Height']].copy()
>>> df_combine_2 = df.loc[10:11,['Gender','Height']].copy()
>>> df_combine_1.combine(df_combine_2,lambda x,y:print(x,y))
0       M
1       F
10    NaN
11    NaN
Name: Gender, dtype: object 0     NaN
1     NaN
10      M
11      F
Name: Gender, dtype: object
0     173.0
1     192.0
10      NaN
11      NaN
Name: Height, dtype: float64 0       NaN
1       NaN
10    161.0
11    175.0
Name: Height, dtype: float64
   Gender  Height
0     NaN     NaN
1     NaN     NaN
10    NaN     NaN
11    NaN     NaN
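
The all-NaN result comes from print returning None. As a contrast, here is a minimal sketch in which the function returns a value for each pair of aligned columns; the two small tables are invented for the example:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 9], 'B': [5, 2]})
df2 = pd.DataFrame({'A': [4, 3], 'B': [1, 8]})

# combine calls the function once per column with two aligned Series.
# np.maximum returns a Series holding the element-wise larger value.
larger = df1.combine(df2, np.maximum)

print(larger)
```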

2. Some examples

(a) Fill with whichever column has the larger mean

>>> df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> df2 = pd.DataFrame({'A': [8, 7], 'B': [6, 5]})
>>> df1.combine(df2,lambda x,y:x if x.mean()>y.mean() else y)
   A  B
0  8  6
1  7  5

(b) Index alignment (by default, rows and columns that the second table does not have are set to NaN)

>>> df2 = pd.DataFrame({'B': [8, 7], 'C': [6, 5]}, index=[1, 2])
>>> df1.combine(df2,lambda x,y:x if x.mean()>y.mean() else y)
    A    B    C
0 NaN  NaN  NaN
1 NaN  8.0  6.0
2 NaN  7.0  5.0

(c) With overwrite=False, df1 values in columns that df2 does not have are not overwritten

>>> df1.combine(df2,lambda x,y:x if x.mean()>y.mean() else y,overwrite=False)
     A    B    C
0  1.0  NaN  NaN
1  2.0  8.0  6.0
2  NaN  7.0  5.0

(d) With fill_value=-1, positions newly created by aligning with df2 are filled with -1

>>> df1.combine(df2,lambda x,y:x if x.mean()>y.mean() else y,fill_value=-1)
     A    B    C
0  1.0 -1.0 -1.0
1  2.0  8.0  6.0
2 -1.0  7.0  5.0

3. combine_first method

This method uses df2 to fill in the missing values of df1. It is simpler than combine, but it is used more often. Here are two examples:

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
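
It can help to compare combine_first with fillna, which also accepts a DataFrame. The sketch below reuses the second pair of tables (written with np.nan instead of None) and shows the difference under that assumption: fillna keeps the shape of df1, while combine_first returns the union of rows and columns.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [np.nan, 0], 'B': [4, np.nan]})
df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])

# fillna only fills holes at labels that already exist in df1.
filled = df1.fillna(df2)

# combine_first also brings in the extra row 2 and the extra column C.
combined = df1.combine_first(df2)

print(filled)
print(combined)
```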

(2) update method

1. Three characteristics

(a) The result keeps the index of the calling DataFrame (a left join is used by default; joins are introduced in the merge section below)
(b) NaN elements in the second DataFrame have no effect
(c) There is no return value; the operation modifies df directly

2. Examples

(a) Operation when the index is fully aligned

>>> df1 = pd.DataFrame({'A': [1, 2, 3],
...                     'B': [400, 500, 600]})
>>> df2 = pd.DataFrame({'B': [4, 5, 6],
...                     'C': [7, 8, 9]})
>>> df1.update(df2)
>>> df1
   A  B
0  1  4
1  2  5
2  3  6

(b) Partial filling

>>> df1 = pd.DataFrame({'A': ['a', 'b', 'c'],
...                     'B': ['x', 'y', 'z']})
>>> df2 = pd.DataFrame({'B': ['d', 'e']}, index=[1,2])
>>> df1.update(df2)
>>> df1
   A  B
0  a  x
1  b  d
2  c  e

(c) Missing values will not be filled

>>> df1 = pd.DataFrame({'A': [1, 2, 3],
...                     'B': [400, 500, 600]})
>>> df2 = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df1.update(df2)
>>> df1
   A      B
0  1    4.0
1  2  500.0
2  3    6.0
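
Because update works in place and returns None, writing df1 = df1.update(df2) would replace the table with None. A minimal sketch of one way to keep the original intact, by updating a copy:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [400, 500, 600]})
df2 = pd.DataFrame({'B': [4, 5, 6], 'C': [7, 8, 9]})

result = df1.copy()       # work on a copy so df1 stays unchanged
ret = result.update(df2)  # modifies result in place

print(ret)                # None
print(result)             # column B replaced, column C not added
print(df1)                # still the original values
```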

Three, concat method

The concat method can concatenate along either of two dimensions. The default is vertical concatenation (axis=0), and the default join mode is outer.

An outer join takes the union of the labels on the other axis (the one not being concatenated along), and 'inner' takes their intersection (with the default vertical concatenation, this is the intersection of the columns).

Here are some examples to illustrate its parameters:

>>> df1 = pd.DataFrame({'A': ['A0', 'A1'],
...                     'B': ['B0', 'B1']},
...                     index = [0,1])
>>> df2 = pd.DataFrame({'A': ['A2', 'A3'],
...                     'B': ['B2', 'B3']},
...                     index = [2,3])
>>> df3 = pd.DataFrame({'A': ['A1', 'A3'],
...                     'D': ['D1', 'D3'],
...                     'E': ['E1', 'E3']},
...                     index = [1,3])

Concatenation with the default settings:

>>> pd.concat([df1,df2])
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3

Splicing along the column direction when axis=1:

>>> pd.concat([df1,df2],axis=1)
     A    B    A    B
0   A0   B0  NaN  NaN
1   A1   B1  NaN  NaN
2  NaN  NaN   A2   B2
3  NaN  NaN   A3   B3

join is set to inner join (since axis=0, so the columns take intersection):

>>> pd.concat([df3,df1],join='inner')
    A
1  A1
3  A3
0  A0
1  A1

join is set to outer join:

>>> pd.concat([df3,df1],join='outer',sort=True) # sort controls column sorting; the default is False
    A    B    D    E
1  A1  NaN   D1   E1
3  A3  NaN   D3   E3
0  A0   B0  NaN  NaN
1  A1   B1  NaN  NaN

verify_integrity checks whether the concatenated index contains duplicates (here the label 1 appears in both df3 and df1):

>>> pd.concat([df3,df1],verify_integrity=True,sort=True) # raises ValueError
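
A short sketch of what that failure looks like when caught, together with ignore_index=True as one possible way to avoid duplicate labels when the original index carries no meaning. The tables are the same df1 and df3 as above, redefined so the block runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df3 = pd.DataFrame({'A': ['A1', 'A3'], 'D': ['D1', 'D3'], 'E': ['E1', 'E3']},
                   index=[1, 3])

# Index label 1 exists in both tables, so the integrity check fails.
try:
    pd.concat([df3, df1], verify_integrity=True, sort=True)
    raised = False
except ValueError as exc:
    raised = True
    print('concat failed:', exc)

# ignore_index=True discards the old labels and renumbers the rows 0..n-1.
renumbered = pd.concat([df3, df1], ignore_index=True, sort=True)
print(renumbered)
```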

Similarly, Series can be added:

>>> s = pd.Series(['X0', 'X1'], name='X')
>>> pd.concat([df1,s],axis=1)
    A   B   X
0  A0  B0  X0
1  A1  B1  X1

The keys parameter adds a label to each data frame for easy indexing:

>>> pd.concat([df1,df2], keys=['x', 'y'])
      A   B
x 0  A0  B0
  1  A1  B1
y 2  A2  B2
  3  A3  B3
>>> pd.concat([df1,df2], keys=['x', 'y']).index
MultiIndex([('x', 0),
            ('x', 1),
            ('y', 2),
            ('y', 3)],
           )
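
The labels added by keys form the outer level of a MultiIndex, so each source table can be pulled back out with loc. A minimal sketch using the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

stacked = pd.concat([df1, df2], keys=['x', 'y'])

# Select every row that came from the second table.
part_y = stacked.loc['y']

# Select a single cell with a (key, original index) tuple.
cell = stacked.loc[('y', 3), 'A']

print(part_y)
print(cell)
```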

Four, merge and join

(1) merge function

The merge function joins two pandas objects horizontally on one or more keys. When a key value is duplicated, the matching rows are combined as a Cartesian product. The default is an inner join; left, outer, and right joins are optional.

A left join is based on the keys of the first table: keys that appear only in the right table are not added, and keys that also appear in the left table are added in the form of a Cartesian product.

The difference between merge/join and concat is the on parameter, which specifies the column(s) to use as the join key.

Likewise, here are some examples:

>>> left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
...                      'key2': ['K0', 'K1', 'K0', 'K1'],
...                      'A': ['A0', 'A1', 'A2', 'A3'],
...                      'B': ['B0', 'B1', 'B2', 'B3']})
>>> right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
...                       'key2': ['K0', 'K0', 'K0', 'K0'],
...                       'C': ['C0', 'C1', 'C2', 'C3'],
...                       'D': ['D0', 'D1', 'D2', 'D3']})
>>> right2 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
...                        'key2': ['K0', 'K0', 'K0', 'K0'],
...                        'C': ['C0', 'C1', 'C2', 'C3']})

Join on key1. Other columns that share a name in both tables get the default suffixes=('_x','_y'):

>>> pd.merge(left, right, on='key1')
  key1 key2_x   A   B key2_y   C   D
0   K0     K0  A0  B0     K0  C0  D0
1   K0     K1  A1  B1     K0  C0  D0
2   K1     K0  A2  B2     K0  C1  D1
3   K1     K0  A2  B2     K0  C2  D2
4   K2     K1  A3  B3     K0  C3  D3

Join on multiple keys:

>>> pd.merge(left, right, on=['key1','key2'])
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2

An inner join is used by default: because merge only joins horizontally, only the key combinations present in both tables are kept. Now see what happens with how='outer'.

Note: how here plays the same role as join in concat

>>> pd.merge(left, right, how='outer', on=['key1','key2'])
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3

Left join:

>>> pd.merge(left, right, how='left', on=['key1', 'key2'])
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN

Right join:

>>> pd.merge(left, right, how='right', on=['key1', 'key2'])
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3

If the Cartesian product is still unclear, be sure to understand the following example. Every element of B is 2, so each of the 2 left rows matches all 3 right rows, giving 2 × 3 = 6 rows:

>>> left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
>>> right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})
>>> pd.merge(left, right, on='B', how='outer')
   A_x  B  A_y
0    1  2    4
1    1  2    5
2    1  2    6
3    2  2    4
4    2  2    5
5    2  2    6
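
The row count can be predicted before merging: for each key value, multiply the number of matching rows on the left by the number on the right, then add up the products. A minimal sketch of that check for an inner join on the same two tables (the name expected_rows is only for illustration):

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# Count how often each key value occurs on each side.
left_counts = left['B'].value_counts()
right_counts = right['B'].value_counts()

# Multiplying aligns on the key value; keys present on one side only
# become NaN and are dropped, which matches inner-join behaviour.
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = pd.merge(left, right, on='B')
print(expected_rows, len(merged))
```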

The validate parameter checks which side has duplicate keys. With 'one_to_one', the keys must be unique on both sides; with 'one_to_many', the keys must be unique on the left side.

>>> left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
>>> right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 3, 4]})
>>> #pd.merge(left, right, on='B', how='outer',validate='one_to_one') # raises MergeError
>>> left = pd.DataFrame({'A': [1, 2], 'B': [2, 1]})
>>> pd.merge(left, right, on='B', how='outer',validate='one_to_one')
   A_x  B  A_y
0  1.0  2  4.0
1  2.0  1  NaN
2  NaN  3  5.0
3  NaN  4  6.0
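
The commented-out call above can be explored safely by catching the exception. The sketch below uses the first pair of tables, where the left side has a duplicate key, and then shows that 'many_to_one' accepts the same data because only the right side has to be unique:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})      # key 2 appears twice
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 3, 4]})  # keys are unique

try:
    pd.merge(left, right, on='B', how='outer', validate='one_to_one')
    failed = False
except pd.errors.MergeError as exc:
    failed = True
    print('validation failed:', exc)

# Duplicates on the left are allowed as long as the right keys are unique.
ok = pd.merge(left, right, on='B', how='outer', validate='many_to_one')
print(ok)
```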

The indicator parameter adds a column showing which table each merged row comes from:

>>> df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
>>> df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
>>> pd.merge(df1, df2, on='col1', how='outer', indicator=True) # indicator='indicator_column' also works
   col1 col_left  col_right      _merge
0     0        a        NaN   left_only
1     1        b        2.0        both
2     2      NaN        2.0  right_only
3     2      NaN        2.0  right_only
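
One common use of the _merge column is to keep only the rows that have no match on the other side (sometimes called an anti-join). A minimal sketch with the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})

merged = pd.merge(df1, df2, on='col1', how='outer', indicator=True)

# Rows of df1 whose key never appears in df2.
left_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Number of rows that exist only in df2.
n_right_only = int((merged['_merge'] == 'right_only').sum())

print(left_only)
print(n_right_only)
```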

(2) join function

The join function joins multiple pandas objects horizontally on their index. When a duplicate index value is encountered, the Cartesian product is used. The default is a left join; inner, outer, and right joins are optional.

>>> left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
...                      'B': ['B0', 'B1', 'B2']},
...                     index=['K0', 'K1', 'K2'])
>>> right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
...                       'D': ['D0', 'D2', 'D3']},
...                      index=['K0', 'K2', 'K3'])
>>> left.join(right)
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2

For many_to_one merges, join is often more convenient.

You can also specify a key:

>>> left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
...                      'B': ['B0', 'B1', 'B2', 'B3'],
...                      'key': ['K0', 'K1', 'K0', 'K1']})
>>> right = pd.DataFrame({'C': ['C0', 'C1'],
...                       'D': ['D0', 'D1']},
...                      index=['K0', 'K1'])
>>> left.join(right, on='key')
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1
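
For comparison, the same result can be written with merge by matching the left column against the right index. This is only a sketch of the equivalence for this example, with how='left' spelled out because merge defaults to an inner join:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key': ['K0', 'K1', 'K0', 'K1']})
right = pd.DataFrame({'C': ['C0', 'C1'],
                      'D': ['D0', 'D1']},
                     index=['K0', 'K1'])

via_join = left.join(right, on='key')

# left_on names the key column in left; right_index=True uses right's index.
via_merge = pd.merge(left, right, left_on='key', right_index=True, how='left')

print(via_join)
print(via_merge)
```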

Multi-level keys:

>>> left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
...                      'B': ['B0', 'B1', 'B2', 'B3'],
...                      'key1': ['K0', 'K0', 'K1', 'K2'],
...                      'key2': ['K0', 'K1', 'K0', 'K1']})
>>> index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
...                                    ('K2', 'K0'), ('K2', 'K1')],names=['key1','key2'])
>>> right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
...                       'D': ['D0', 'D1', 'D2', 'D3']},
...                      index=index)
>>> left.join(right, on=['key1','key2'])
    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A3  B3   K2   K1   C3   D3