DataFrame Data Merging in Pandas: concat | merge

I recently ran into data merging and joining problems at work, so I have organized my notes here for anyone who needs them.

Reference from: Elephant in Dance: https://blog.csdn.net/gdkyxy2013/article/details/80785361


concat

concat: stack multiple objects together along an axis

The concat method is comparable to UNION ALL in a database: it stacks the inputs and keeps duplicate rows. It also lets you choose how the other axis is aligned (join='outer' or join='inner') and which axis to concatenate along. To get the de-duplicated result of a plain UNION, call drop_duplicates on the output.
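A minimal sketch of the de-duplication point. The frames a and b are hypothetical and are not part of the case data used later:

```python
import pandas as pd

# Two hypothetical frames that share one identical row (id=2, john).
a = pd.DataFrame({"id": [1, 2], "name": ["tom", "john"]})
b = pd.DataFrame({"id": [2, 3], "name": ["john", "alice"]})

# Like UNION ALL: every row is kept, including the duplicate.
union_all = pd.concat([a, b], ignore_index=True)

# Like UNION: drop exact duplicate rows afterwards.
union = union_all.drop_duplicates().reset_index(drop=True)

print(len(union_all))  # 4
print(len(union))      # 3
```

drop_duplicates compares whole rows by default, so rows that differ in any column are all kept.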

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
       keys=None, levels=None, names=None, verify_integrity=False, copy=True)

pd.concat() simply joins the tables together. The key parameter is axis, which selects whether the objects are stacked as rows or placed side by side as columns; it defaults to 0. With axis=0, pd.concat([obj1, obj2]) gives the same result as obj1.append(obj2) (DataFrame.append was removed in pandas 2.0, so concat is the way to do this in current versions). With axis=1, pd.concat([obj1, obj2], axis=1) gives the same result as pd.merge(obj1, obj2, left_index=True, right_index=True, how='outer'), provided the index labels are unique.
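The effect of axis can be sketched with two small hypothetical frames (obj1 and obj2 here are illustrative, with unique index labels):

```python
import pandas as pd

# Hypothetical frames with partly overlapping row labels.
obj1 = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
obj2 = pd.DataFrame({"b": [3, 4]}, index=["y", "z"])

# axis=0 stacks rows; columns are aligned by name.
rows = pd.concat([obj1, obj2])

# axis=1 places the frames side by side; rows are aligned by index label.
cols = pd.concat([obj1, obj2], axis=1)

# The same alignment expressed as an outer merge on both indexes.
merged = pd.merge(obj1, obj2, left_index=True, right_index=True, how="outer")

print(rows.shape)    # (4, 2)
print(cols.shape)    # (3, 2)
print(merged.shape)  # (3, 2)
```

In both the axis=1 result and the merge, labels that exist on only one side get NaN in the other side's columns.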

Parameter introduction:

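  • objs: the collection of objects to concatenate, usually a list or a dictionary;

  • axis: the axis to concatenate along (0 for rows, 1 for columns);

  • join: 'outer' or 'inner', which controls how the other axis is aligned;

  • join_axes=[]: specify a custom index for the other axis (this parameter was removed in pandas 1.0; reindex the result instead);

  • keys=[]: create a hierarchical index that labels each input;

  • ignore_index=True: discard the original index and rebuild it as 0..n-1.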
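A short sketch of keys and ignore_index, using hypothetical frames df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2]})
df2 = pd.DataFrame({"id": [3, 4]})

# keys= labels each input, producing a two-level (hierarchical) row index.
keyed = pd.concat([df1, df2], keys=["first", "second"])
print(keyed.loc["second"])  # only the rows that came from df2

# ignore_index=True discards the original labels and renumbers the rows.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

keys is handy when you need to trace each row back to its source table after stacking.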

Case:

student.csv

   id   name  age    sex
0   1    tom   23    man
1   2   john   33    man
2   3  alice   22  woman
3   4   jack   42    man
4   5   saex   22  woman
5   6   jmas   21    man
6   7  jjban   34    man
7   8  alicn   22  woman

score.csv

   id   name  score
0   1    tom     89
1   2   john     90
2   3  alice     78
3   4   jack     99
4   5   saex     87
5   6   jmas     33
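To follow along without the CSV files, the two tables can be built in memory. This is only a sketch: the variable names student_pd and score_pd match the snippets that follow, and the score rows follow the outputs shown further down, which include a sixth row (id 6, jmas, 33).

```python
import pandas as pd

# Equivalent to pd.read_csv("student.csv") for the table shown above.
student_pd = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["tom", "john", "alice", "jack", "saex", "jmas", "jjban", "alicn"],
    "age": [23, 33, 22, 42, 22, 21, 34, 22],
    "sex": ["man", "man", "woman", "man", "woman", "man", "man", "woman"],
})

# Equivalent to pd.read_csv("score.csv"); six rows, as in the later outputs.
score_pd = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["tom", "john", "alice", "jack", "saex", "jmas"],
    "score": [89, 90, 78, 99, 87, 33],
})

print(student_pd.shape)  # (8, 4)
print(score_pd.shape)    # (6, 3)
```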

Connect the two tables with concat. Note the call form concat([df1, df2]); join can be 'outer' or 'inner'.

contract_pd = pd.concat([student_pd,score_pd],join='outer', ignore_index=True)

    id   name   age    sex  score
0    1    tom  23.0    man    NaN
1    2   john  33.0    man    NaN
2    3  alice  22.0  woman    NaN
3    4   jack  42.0    man    NaN
4    5   saex  22.0  woman    NaN
5    6   jmas  21.0    man    NaN
6    7  jjban  34.0    man    NaN
7    8  alicn  22.0  woman    NaN
8    1    tom   NaN    NaN   89.0
9    2   john   NaN    NaN   90.0
10   3  alice   NaN    NaN   78.0
11   4   jack   NaN    NaN   99.0
12   5   saex   NaN    NaN   87.0
13   6   jmas   NaN    NaN   33.0

contract_pd = pd.concat([student_pd,score_pd],join='inner', ignore_index=True)
    id   name
0    1    tom
1    2   john
2    3  alice
3    4   jack
4    5   saex
5    6   jmas
6    7  jjban
7    8  alicn
8    1    tom
9    2   john
10   3  alice
11   4   jack
12   5   saex
13   6   jmas
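The difference between the two join modes comes down to which columns survive: 'outer' keeps the union of the column names and fills the gaps with NaN, while 'inner' keeps only the columns present in every input. A reduced sketch with hypothetical two-row frames:

```python
import pandas as pd

student = pd.DataFrame({"id": [1, 2], "name": ["tom", "john"], "age": [23, 33]})
score = pd.DataFrame({"id": [1, 2], "name": ["tom", "john"], "score": [89, 90]})

outer = pd.concat([student, score], join="outer", ignore_index=True)
inner = pd.concat([student, score], join="inner", ignore_index=True)

print(sorted(outer.columns))      # ['age', 'id', 'name', 'score']
print(sorted(inner.columns))      # ['id', 'name']
print(outer["age"].isna().sum())  # 2, the rows that came from score
```

Note that concat never matches rows on id here; it only stacks them, which is why each student appears twice.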


merge

merge: join DataFrames on one or more key columns

Like a join in a relational database, merge connects different DataFrames on one or more keys. The typical use case is two tables that hold different fields for the same primary key and need to be combined into one table on that key.

merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=False,
      suffixes=('_x', '_y'), copy=True, indicator=False)

Parameter introduction:

  • left and right: two different DataFrames;

  • how: the join type, one of inner, left, right, or outer; the default is inner;

  • on: the column name(s) to join on, which must exist in both DataFrames. If on is not given and none of left_on, right_on, left_index, or right_index is given, the intersection of the two DataFrames' column names is used as the join key;

  • left_on: the column(s) in the left DataFrame to use as the join key. This is useful when the key columns have different names in the two DataFrames but mean the same thing;

  • right_on: the column(s) in the right DataFrame to use as the join key;

  • left_index: use the row index of the left DataFrame as the join key;

  • right_index: use the row index of the right DataFrame as the join key;

  • sort: defaults to False. When True, the result is sorted by the join keys, which costs extra time;

  • suffixes: a tuple of strings appended to overlapping column names that are not join keys, to tell the left and right columns apart; the default is ('_x', '_y');

  • copy: defaults to True, which always copies the data into the result; setting it to False avoids copies where possible.
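As an illustration of left_on, right_on, and suffixes, here is a sketch in which the key column has a different name on each side. The frames and the column name student_id are hypothetical:

```python
import pandas as pd

left = pd.DataFrame({"student_id": [1, 2, 3], "name": ["tom", "john", "alice"]})
right = pd.DataFrame({"id": [1, 2, 4], "name": ["tom", "john", "jack"],
                      "score": [89, 90, 99]})

# The keys have different names, so name them per side. The non-key
# column "name" exists on both sides and receives the custom suffixes.
merged = pd.merge(left, right, how="inner",
                  left_on="student_id", right_on="id",
                  suffixes=("_student", "_score"))

print(merged.columns.tolist())
# ['student_id', 'name_student', 'id', 'name_score', 'score']
print(len(merged))  # 2
```

Because the key names differ, both key columns are kept in the result; drop one of them afterwards if it is redundant.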

Example:


# 1. By default, the overlapping column names (here both id and name) are used as the join keys
contract_pd = pd.merge(student_pd,score_pd,how="inner",sort=True)
   id   name  age    sex  score
0   1    tom   23    man     89
1   2   john   33    man     90
2   3  alice   22  woman     78
3   4   jack   42    man     99
4   5   saex   22  woman     87
5   6   jmas   21    man     33


# 2. The default is an inner join (the intersection of the keys); left, right, and outer joins are also available via the how parameter
contract_pd = pd.merge(student_pd,score_pd,how="left",on='id',sort=True)
   id name_x  age    sex name_y  score
0   1    tom   23    man    tom   89.0
1   2   john   33    man   john   90.0
2   3  alice   22  woman  alice   78.0
3   4   jack   42    man   jack   99.0
4   5   saex   22  woman   saex   87.0
5   6   jmas   21    man   jmas   33.0
6   7  jjban   34    man    NaN    NaN
7   8  alicn   22  woman    NaN    NaN

# 3. When on is given, left_on and right_on cannot be given as well; use one form or the other
contract_pd = pd.merge(student_pd, score_pd, how="left", left_on='id', right_on='id', sort=True)
   id name_x  age    sex name_y  score
0   1    tom   23    man    tom   89.0
1   2   john   33    man   john   90.0
2   3  alice   22  woman  alice   78.0
3   4   jack   42    man   jack   99.0
4   5   saex   22  woman   saex   87.0
5   6   jmas   21    man   jmas   33.0
6   7  jjban   34    man    NaN    NaN
7   8  alicn   22  woman    NaN    NaN
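The restriction in point 3 can be checked directly. In this sketch (hypothetical two-row frames), passing on together with left_on and right_on is rejected with pandas.errors.MergeError:

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"id": [1, 2], "age": [23, 33]})
right = pd.DataFrame({"id": [1, 2], "score": [89, 90]})

rejected = False
try:
    # Mixing on= with left_on=/right_on= is not allowed.
    pd.merge(left, right, on="id", left_on="id", right_on="id")
except MergeError as exc:
    rejected = True
    print("merge rejected:", exc)

# Either form on its own works and gives the same columns here.
by_on = pd.merge(left, right, on="id")
by_sides = pd.merge(left, right, left_on="id", right_on="id")
print(by_on.columns.tolist() == by_sides.columns.tolist())  # True
```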

Filter the merge result by condition:

# Select the records whose score is NaN
allsed = contract_pd.loc[contract_pd.score.isna()]
   id name_x  age    sex name_y  score
6   7  jjban   34    man    NaN    NaN
7   8  alicn   22  woman    NaN    NaN

# Select the records whose score is not NaN
allsed = contract_pd.loc[~contract_pd.score.isna()]
   id name_x  age    sex name_y  score
0   1    tom   23    man    tom   89.0
1   2   john   33    man   john   90.0
2   3  alice   22  woman  alice   78.0
3   4   jack   42    man   jack   99.0
4   5   saex   22  woman   saex   87.0
5   6   jmas   21    man   jmas   33.0

# De-duplicate the result (drop_duplicates returns a new DataFrame)
allsed = allsed.drop_duplicates()
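Testing score for NaN works here because every matched row has a score. An alternative that does not depend on the data is the indicator parameter of merge, which records where each row came from. A sketch with hypothetical frames:

```python
import pandas as pd

student = pd.DataFrame({"id": [1, 2, 3], "name": ["tom", "john", "alice"]})
score = pd.DataFrame({"id": [1, 2], "score": [89, 90]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
flagged = pd.merge(student, score, how="left", on="id", indicator=True)

unmatched = flagged.loc[flagged["_merge"] == "left_only"]
matched = flagged.loc[flagged["_merge"] == "both"]

print(unmatched["id"].tolist())  # [3]
print(matched["id"].tolist())    # [1, 2]
```

This stays correct even when the right table itself contains NaN scores for some matched ids.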

Using a lambda function with apply:

tag_pd = pd.read_csv("tags.csv")
   id                           tags
0   1  1234|2345|3456|2348|7865|1357
1   2  1234|2345|3456|2348|7865|1357
2   3  1234|2345|3456|2348|7865|1357
3   4  1234|2345|3456|2348|7865|1357
4   5  1234|2345|3456|2348|7865|1357
5   6  1234|2345|3456|2348|7865|1357


tag_pd['idss'] = tag_pd.tags.apply(lambda x:x.split('|'))
   id                           tags                                  idss
0   1  1234|2345|3456|2348|7865|1357  [1234, 2345, 3456, 2348, 7865, 1357]
1   2  1234|2345|3456|2348|7865|1357  [1234, 2345, 3456, 2348, 7865, 1357]
2   3  1234|2345|3456|2348|7865|1357  [1234, 2345, 3456, 2348, 7865, 1357]
3   4  1234|2345|3456|2348|7865|1357  [1234, 2345, 3456, 2348, 7865, 1357]
4   5  1234|2345|3456|2348|7865|1357  [1234, 2345, 3456, 2348, 7865, 1357]
5   6  1234|2345|3456|2348|7865|1357  [1234, 2345, 3456, 2348, 7865, 1357]
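The same split can also be written with the vectorised string accessor, and the resulting lists can be expanded to one row per tag with explode. This is a sketch on a hypothetical two-row table, not the tags.csv file above:

```python
import pandas as pd

tag_pd = pd.DataFrame({"id": [1, 2], "tags": ["1234|2345", "3456|2348|7865"]})

# Vectorised equivalent of apply(lambda x: x.split('|')).
tag_pd["idss"] = tag_pd["tags"].str.split("|")

# One row per tag; the id is repeated for each element of the list.
long_pd = tag_pd.explode("idss")[["id", "idss"]].reset_index(drop=True)

print(long_pd["id"].tolist())    # [1, 1, 2, 2, 2]
print(long_pd["idss"].tolist())  # ['1234', '2345', '3456', '2348', '7865']
```

The long form is convenient when the tags then need to be merged with another table on the tag id. The split values are strings, so convert them if the other table stores integers.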

To sum up:

  1. concat is similar to the UNION ALL operation in relational databases
  2. merge is similar to the inner join, left join, right join, and outer join operations in relational databases

Origin blog.csdn.net/qq_43081842/article/details/110354985