While working recently I ran into several data merging and joining problems, so I have organized my notes below for anyone who needs them.
Reference from: Elephant in Dance: https://blog.csdn.net/gdkyxy2013/article/details/80785361
concat
concat: stack multiple objects together along an axis
The concat method is comparable to UNION ALL in a database. You can choose both the axis to concatenate along and how the other axis is combined (join='outer' or join='inner'). Like UNION ALL, and unlike a plain UNION, it does not remove duplicate rows; call drop_duplicates on the result if you need de-duplication.
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False, copy=True):
pd.concat() simply glues the tables together. The key parameter is axis, which selects whether the objects are stacked as rows or placed side by side as columns; it defaults to 0. With axis=0, pd.concat([obj1, obj2]) gives the same result as obj1.append(obj2) (DataFrame.append was removed in pandas 2.0, so use concat in current versions). With axis=1 and unique row indexes, pd.concat([obj1, obj2], axis=1) gives the same result as pd.merge(obj1, obj2, left_index=True, right_index=True, how='outer').
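The effect of axis can be seen on two tiny frames. This is a minimal sketch; the frames left and right and their values are made up for illustration and are not part of the case data used later.

```python
import pandas as pd

# Hypothetical frames: they share the row label "y" but have different columns.
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [3, 4]}, index=["y", "z"])

# axis=0: the rows of `right` are stacked under `left`; columns are unioned.
stacked = pd.concat([left, right], axis=0)

# axis=1: the frames are placed side by side and aligned on the row index.
side_by_side = pd.concat([left, right], axis=1)

# The same alignment written as an outer merge on the index.
merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (3, 2)
print(side_by_side)
print(merged)
```

Printing side_by_side and merged next to each other makes it easy to compare the two results row by row.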
Parameter introduction:
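- objs: the collection of objects to concatenate, usually a list or a dict;
- axis: the axis to concatenate along (0 for rows, 1 for columns);
- join: how the other axis is combined, either 'outer' (union) or 'inner' (intersection);
- join_axes=[]: specify a custom index for the other axis (removed in pandas 1.0; reindex the result instead);
- keys=[]: create a hierarchical index, with one key per input object;
- ignore_index=True: discard the existing index and rebuild it as 0, 1, 2, ...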
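The parameters listed above can be tried out on small inputs. The following is a sketch under assumed data: df1 and df2 are hypothetical frames invented for this illustration.

```python
import pandas as pd

# Hypothetical inputs that share only the "id" column.
df1 = pd.DataFrame({"id": [1, 2], "name": ["tom", "john"]})
df2 = pd.DataFrame({"id": [3], "score": [78]})

# join='inner' keeps only the columns present in every input.
inner = pd.concat([df1, df2], join="inner")

# ignore_index=True drops the original row labels and renumbers from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

# keys=[...] adds an outer index level recording which input each row came from.
labelled = pd.concat([df1, df2], keys=["first", "second"])

print(list(inner.columns))     # ['id']
print(list(renumbered.index))  # [0, 1, 2]
print(labelled.loc["second"])  # only the rows that came from df2
```

Using keys is handy when you later need to know which source table a row belonged to, since that information is otherwise lost after concatenation.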
Case:
student.csv
id name age sex
0 1 tom 23 man
1 2 john 33 man
2 3 alice 22 woman
3 4 jack 42 man
4 5 saex 22 woman
5 6 jmas 21 man
6 7 jjban 34 man
7 8 alicn 22 woman
score.csv
id name score
0 1 tom 89
1 2 john 90
2 3 alice 78
3 4 jack 99
4 5 saex 87
5 6 jmas 33
Use concat to join them. Note the call form concat([df1, df2]); join can be 'outer' or 'inner'.
import pandas as pd
student_pd = pd.read_csv("student.csv")
score_pd = pd.read_csv("score.csv")
contract_pd = pd.concat([student_pd, score_pd], join='outer', ignore_index=True)
id name age sex score
0 1 tom 23.0 man NaN
1 2 john 33.0 man NaN
2 3 alice 22.0 woman NaN
3 4 jack 42.0 man NaN
4 5 saex 22.0 woman NaN
5 6 jmas 21.0 man NaN
6 7 jjban 34.0 man NaN
7 8 alicn 22.0 woman NaN
8 1 tom NaN NaN 89.0
9 2 john NaN NaN 90.0
10 3 alice NaN NaN 78.0
11 4 jack NaN NaN 99.0
12 5 saex NaN NaN 87.0
13 6 jmas NaN NaN 33.0
contract_pd = pd.concat([student_pd,score_pd],join='inner', ignore_index=True)
id name
0 1 tom
1 2 john
2 3 alice
3 4 jack
4 5 saex
5 6 jmas
6 7 jjban
7 8 alicn
8 1 tom
9 2 john
10 3 alice
11 4 jack
12 5 saex
13 6 jmas
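As the inner-join output shows, concat keeps repeated rows, just like UNION ALL. A plain UNION can be approximated by chaining drop_duplicates. This is a minimal sketch with made-up frames a and b, not the CSV data above.

```python
import pandas as pd

# Hypothetical frames with one row (id=2, john) in common.
a = pd.DataFrame({"id": [1, 2], "name": ["tom", "john"]})
b = pd.DataFrame({"id": [2, 3], "name": ["john", "alice"]})

# UNION ALL: every row from both inputs, duplicates included.
union_all = pd.concat([a, b], ignore_index=True)

# UNION: drop fully identical rows, then renumber the index.
union = union_all.drop_duplicates().reset_index(drop=True)

print(len(union_all))  # 4
print(len(union))      # 3
```

drop_duplicates compares all columns by default; pass subset=[...] if only some columns should decide whether two rows count as duplicates.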
merge
merge: join DataFrames on key columns
Similar to a join in a relational database, merge connects the rows of different DataFrames according to one or more keys. The typical use case is two tables that share the same primary key but hold different fields, which you want to combine into one table on that key.
merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False)
Parameter introduction:
- left and right: the two DataFrames to merge;
- how: the join type, one of inner, left, right, outer; the default is inner;
- on: the column name(s) to join on, which must exist in both DataFrames. If on is not given and no other key parameter is given, the intersection of the two DataFrames' column names is used as the join key;
- left_on: the column name(s) in the left DataFrame to use as the join key. This is useful when the left and right column names differ but mean the same thing;
- right_on: the column name(s) in the right DataFrame to use as the join key;
- left_index: use the row index of the left DataFrame as the join key;
- right_index: use the row index of the right DataFrame as the join key;
- sort: whether to sort the merged data by the join keys; current pandas versions default to False, and leaving it off can improve performance;
- suffixes: a tuple of strings appended to column names that appear in both DataFrames but are not join keys; the default is ('_x', '_y');
- copy: defaults to True, which always copies the data into the result; setting it to False can improve performance;
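To make left_on, right_on, and suffixes concrete, here is a small sketch. The frames students and scores, the column name student_id, and the suffix strings are all assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical tables: the key is called "id" on the left and "student_id" on the right.
students = pd.DataFrame({"id": [1, 2, 3], "name": ["tom", "john", "alice"]})
scores = pd.DataFrame(
    {"student_id": [1, 2], "name": ["tom", "john"], "score": [89, 90]}
)

merged = pd.merge(
    students,
    scores,
    how="left",                        # keep every student, matched or not
    left_on="id",                      # key column in the left frame
    right_on="student_id",             # key column in the right frame
    suffixes=("_student", "_score"),   # disambiguate the shared "name" column
)

print(list(merged.columns))
# ['id', 'name_student', 'student_id', 'name_score', 'score']
print(merged)
```

Because the key columns have different names, both are kept in the result; drop one of them afterwards if it is redundant.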
Example:
# 1. By default the overlapping column names (here id and name) are used as the join keys
contract_pd = pd.merge(student_pd, score_pd, how="inner", sort=True)
id name age sex score
0 1 tom 23 man 89
1 2 john 33 man 90
2 3 alice 22 woman 78
3 4 jack 42 man 99
4 5 saex 22 woman 87
5 6 jmas 21 man 33
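How the choice of key affects the result columns can be checked directly. The sketch below uses small in-memory stand-ins for student.csv and score.csv (an assumption, so that it runs without the files).

```python
import pandas as pd

# In-memory stand-ins for the CSV files (trimmed to a few rows).
student_pd = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["tom", "john", "alice"], "age": [23, 33, 22]}
)
score_pd = pd.DataFrame(
    {"id": [1, 2], "name": ["tom", "john"], "score": [89, 90]}
)

# No `on`: every shared column (id and name) becomes part of the key,
# so each key column appears once and no suffix is needed.
by_default = pd.merge(student_pd, score_pd)

# on='id': name is no longer a key, so both copies are kept with suffixes.
by_id = pd.merge(student_pd, score_pd, on="id")

print(list(by_default.columns))  # ['id', 'name', 'age', 'score']
print(list(by_id.columns))       # ['id', 'name_x', 'age', 'name_y', 'score']
```

Naming the key explicitly with on is usually safer, because an unexpected shared column would otherwise silently become part of the join key.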
# 2. The default is an inner join (the intersection of the keys); the other join types are left, right, and outer. Choose one with the how='' parameter
contract_pd = pd.merge(student_pd,score_pd,how="left",on='id',sort=True)
id name_x age sex name_y score
0 1 tom 23 man tom 89.0
1 2 john 33 man john 90.0
2 3 alice 22 woman alice 78.0
3 4 jack 42 man jack 99.0
4 5 saex 22 woman saex 87.0
5 6 jmas 21 man jmas 33.0
6 7 jjban 34 man NaN NaN
7 8 alicn 22 woman NaN NaN
# 3. When on is given, left_on and right_on cannot also be specified
contract_pd = pd.merge(student_pd, score_pd, how="left", left_on='id', right_on='id', sort=True)
id name_x age sex name_y score
0 1 tom 23 man tom 89.0
1 2 john 33 man john 90.0
2 3 alice 22 woman alice 78.0
3 4 jack 42 man jack 99.0
4 5 saex 22 woman saex 87.0
5 6 jmas 21 man jmas 33.0
6 7 jjban 34 man NaN NaN
7 8 alicn 22 woman NaN NaN
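The restriction in example 3 can be observed by passing on together with left_on and right_on. This is a minimal sketch on made-up frames; it assumes the error type is pd.errors.MergeError, which is what recent pandas versions raise here.

```python
import pandas as pd

# Hypothetical frames sharing the key column "id".
left = pd.DataFrame({"id": [1, 2], "age": [23, 33]})
right = pd.DataFrame({"id": [1, 2], "score": [89, 90]})

try:
    # Mixing `on` with `left_on`/`right_on` is rejected by pandas.
    pd.merge(left, right, on="id", left_on="id", right_on="id")
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print("MergeError:", exc)

# Either form alone is fine and gives the same rows.
ok_on = pd.merge(left, right, on="id")
ok_left_right = pd.merge(left, right, left_on="id", right_on="id")
```

Use on when the key has the same name on both sides, and left_on with right_on only when the names differ.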
Filter the merge result by condition:
# Select the records whose score is NaN
allsed = contract_pd.loc[contract_pd.score.isna()]
id name_x age sex name_y score
6 7 jjban 34 man NaN NaN
7 8 alicn 22 woman NaN NaN
# Select the records whose score is not NaN
allsed = contract_pd.loc[~contract_pd.score.isna()]
id name_x age sex name_y score
0 1 tom 23 man tom 89.0
1 2 john 33 man john 90.0
2 3 alice 22 woman alice 78.0
3 4 jack 42 man jack 99.0
4 5 saex 22 woman saex 87.0
5 6 jmas 21 man jmas 33.0
# De-duplicate the result
allsed.drop_duplicates()
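Checking score for NaN works here, but it cannot tell an unmatched row from a matched row whose score is genuinely missing. The indicator parameter from the merge signature offers an alternative. The sketch below uses small in-memory stand-ins for the two CSV files, which is an assumption for the sake of a runnable example.

```python
import pandas as pd

# In-memory stand-ins for student.csv and score.csv (trimmed).
student_pd = pd.DataFrame({"id": [1, 2, 3], "name": ["tom", "john", "alice"]})
score_pd = pd.DataFrame({"id": [1, 2], "score": [89, 90]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
merged = pd.merge(student_pd, score_pd, how="left", on="id", indicator=True)

# Rows of the left table that found no partner in the right table.
unmatched = merged.loc[merged["_merge"] == "left_only"]

# Equivalent NaN-based filter for the matched rows, using notna().
matched = merged.loc[merged["score"].notna()]

print(unmatched["id"].tolist())  # [3]
print(matched["id"].tolist())    # [1, 2]
```

notna() is the direct counterpart of ~isna() and reads a little more clearly in filters.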
Using a lambda function with apply:
tag_pd = pd.read_csv("tags.csv")
id tags
0 1 1234|2345|3456|2348|7865|1357
1 2 1234|2345|3456|2348|7865|1357
2 3 1234|2345|3456|2348|7865|1357
3 4 1234|2345|3456|2348|7865|1357
4 5 1234|2345|3456|2348|7865|1357
5 6 1234|2345|3456|2348|7865|1357
tag_pd['idss'] = tag_pd.tags.apply(lambda x:x.split('|'))
id tags idss
0 1 1234|2345|3456|2348|7865|1357 [1234, 2345, 3456, 2348, 7865, 1357]
1 2 1234|2345|3456|2348|7865|1357 [1234, 2345, 3456, 2348, 7865, 1357]
2 3 1234|2345|3456|2348|7865|1357 [1234, 2345, 3456, 2348, 7865, 1357]
3 4 1234|2345|3456|2348|7865|1357 [1234, 2345, 3456, 2348, 7865, 1357]
4 5 1234|2345|3456|2348|7865|1357 [1234, 2345, 3456, 2348, 7865, 1357]
5 6 1234|2345|3456|2348|7865|1357 [1234, 2345, 3456, 2348, 7865, 1357]
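The same split can also be written with the vectorized string accessor, and the resulting lists can be expanded to one row per tag. This is a sketch on a shortened, made-up version of the tags table; explode requires pandas 0.25 or later.

```python
import pandas as pd

# Shortened, hypothetical version of tags.csv.
tag_pd = pd.DataFrame({"id": [1, 2], "tags": ["1234|2345", "3456"]})

# Vectorized equivalent of apply(lambda x: x.split('|')).
tag_pd["idss"] = tag_pd["tags"].str.split("|")

# One row per (id, tag) pair, which is often easier to merge with other tables.
long_form = tag_pd.explode("idss")[["id", "idss"]].reset_index(drop=True)

print(tag_pd["idss"].tolist())     # [['1234', '2345'], ['3456']]
print(long_form["id"].tolist())    # [1, 1, 2]
print(long_form["idss"].tolist())  # ['1234', '2345', '3456']
```

Note that the split values are strings; convert them with astype(int) if numeric tags are needed.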
To sum up:
- concat is similar to the UNION ALL operation in relational databases
- merge is similar to the inner join, left join, right join, and outer join operations in relational databases