A few days ago, a friend in a group chat mentioned that he had interviewed at Ali (Alibaba) and was asked about pandas. One of the questions was: name 5 ways to combine data in pandas.
Taking this opportunity, today I will take stock of 5 pandas functions for merging data. I don't intend to explain each function in detail here; for specific usage, refer to the official pandas documentation.
- join is mainly used for horizontal merging based on the index;
- merge is mainly used for horizontal merging based on specified columns;
- concat can be used for both horizontal and vertical merging;
- append is mainly used for vertical appending (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, where concat takes its place);
- combine merges two DataFrames column by column, using a function you supply.
join
join splices two DataFrames horizontally based on the index. If the indexes are the same, the columns are simply placed side by side. If the indexes differ, join keeps the index of the calling DataFrame by default (how='left') and fills the cells that have no match with NaN.
Consistent indexes
import pandas as pd

x = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']},
    index=[0, 1, 2])
y = pd.DataFrame({
    'C': ['C0', 'C2', 'C3'],
    'D': ['D0', 'D2', 'D3']},
    index=[0, 1, 2])
x.join(y)
The result is as follows:
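As a self-contained check, the sketch below rebuilds the two frames from the example and prints the joined table. The variable name `joined` is only an illustration; the expected output is shown as comments.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                  'B': ['B0', 'B1', 'B2']},
                 index=[0, 1, 2])
y = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                  'D': ['D0', 'D2', 'D3']},
                 index=[0, 1, 2])

# Both frames share the index 0, 1, 2, so every row finds a match.
joined = x.join(y)
print(joined)
#     A   B   C   D
# 0  A0  B0  C0  D0
# 1  A1  B1  C2  D2
# 2  A2  B2  C3  D3
```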
Inconsistent indexes
x = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']},
    index=[0, 1, 2])
y = pd.DataFrame({
    'C': ['C0', 'C2', 'C3'],
    'D': ['D0', 'D2', 'D3']},
    index=[1, 2, 3])
x.join(y)  # keeps x's index; label 3 from y is dropped
The result is as follows:
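To make the effect of the mismatched indexes concrete, here is a minimal sketch that runs the same join and, as an extra illustration not in the original example, repeats it with `how='outer'` and `how='inner'`. The names `left`, `outer`, and `inner` are chosen only for this sketch.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                  'B': ['B0', 'B1', 'B2']},
                 index=[0, 1, 2])
y = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                  'D': ['D0', 'D2', 'D3']},
                 index=[1, 2, 3])

# Default how='left': keep x's index; label 0 has no partner in y.
left = x.join(y)
print(left)
#     A   B    C    D
# 0  A0  B0  NaN  NaN
# 1  A1  B1   C0   D0
# 2  A2  B2   C2   D2

# how='outer' keeps the union of both indexes (0, 1, 2, 3).
outer = x.join(y, how='outer')

# how='inner' keeps only the labels present in both frames (1, 2).
inner = x.join(y, how='inner')
```

With the default left join, the row labelled 3 in y never appears in the result; an outer join is the option to reach for when rows from both sides must survive.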
merge
merge splices DataFrames horizontally based on specified columns. It works much like a join in a relational database: it can connect different DataFrames on one or more keys. The typical scenario is two tables that hold different fields for the same primary key and need to be integrated into one table by that key.
- The how parameter selects the join type: inner, left, right, or outer. The default is inner.
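x = pd.DataFrame({
    'name': ['Zhang San', 'Li Si', 'Wang Wu'],
    'class': ['Class 1', 'Class 2', 'Class 3']})
y = pd.DataFrame({
    'major': ['Statistics', 'Computer Science', 'Painting'],
    'class': ['Class 1', 'Class 3', 'Class 4']})
pd.merge(x, y, how="left")  # merges on the shared column 'class'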
The result is as follows:
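The sketch below rebuilds the same two tables with English labels (`name`, `class`, `major`) and runs all four join types so their row counts can be compared. Passing `on='class'` explicitly is a choice made for this sketch; the original call relies on pandas detecting the shared column.

```python
import pandas as pd

x = pd.DataFrame({'name': ['Zhang San', 'Li Si', 'Wang Wu'],
                  'class': ['Class 1', 'Class 2', 'Class 3']})
y = pd.DataFrame({'major': ['Statistics', 'Computer Science', 'Painting'],
                  'class': ['Class 1', 'Class 3', 'Class 4']})

# left: every row of x is kept; Class 2 has no major, so it gets NaN.
left = pd.merge(x, y, how='left', on='class')

# inner: only classes present in both tables (Class 1 and Class 3).
inner = pd.merge(x, y, how='inner', on='class')

# right: every row of y is kept; Class 4 has no name, so it gets NaN.
right = pd.merge(x, y, how='right', on='class')

# outer: the union of the keys (Class 1 to Class 4).
outer = pd.merge(x, y, how='outer', on='class')

for label, frame in [('left', left), ('inner', inner),
                     ('right', right), ('outer', outer)]:
    print(label, len(frame))
# left 3
# inner 2
# right 3
# outer 4
```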
concat
The concat function can stitch DataFrames vertically (axis=0, the default) or horizontally (axis=1).
Vertical stitching
x = pd.DataFrame([['Jack','M',40],['Tony','M',20]], columns=['name','gender','age'])
y = pd.DataFrame([['Mary','F',30],['Bob','M',25]], columns=['name','gender','age'])
z = pd.concat([x,y],axis=0)
z
The result is as follows:
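One detail worth seeing in code: vertical concat keeps each frame's original index labels, so the result here carries the labels 0, 1, 0, 1. A minimal sketch, assuming a fresh 0..n-1 index is wanted, uses the `ignore_index=True` option of `pd.concat`; the name `z_renumbered` is only illustrative.

```python
import pandas as pd

x = pd.DataFrame([['Jack', 'M', 40], ['Tony', 'M', 20]],
                 columns=['name', 'gender', 'age'])
y = pd.DataFrame([['Mary', 'F', 30], ['Bob', 'M', 25]],
                 columns=['name', 'gender', 'age'])

# Rows of y are stacked under rows of x; index labels are kept as-is.
z = pd.concat([x, y], axis=0)
print(list(z.index))             # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers the rows.
z_renumbered = pd.concat([x, y], axis=0, ignore_index=True)
print(list(z_renumbered.index))  # [0, 1, 2, 3]
```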
Horizontal stitching
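x = pd.DataFrame({
    'name': ['Zhang San', 'Li Si', 'Wang Wu'],
    'class': ['Class 1', 'Class 2', 'Class 3']})
y = pd.DataFrame({
    'major': ['Statistics', 'Computer Science', 'Painting'],
    'class': ['Class 1', 'Class 3', 'Class 4']})
z = pd.concat([x, y], axis=1)  # rows are paired by index, not by 'class'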
z
The result is as follows:
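Horizontal concat lines rows up by index label and does not look at the values in any column, which is the main difference from merge. The sketch below, using English labels for the same two tables, shows that the `class` column simply appears twice and that rows with different classes end up side by side.

```python
import pandas as pd

x = pd.DataFrame({'name': ['Zhang San', 'Li Si', 'Wang Wu'],
                  'class': ['Class 1', 'Class 2', 'Class 3']})
y = pd.DataFrame({'major': ['Statistics', 'Computer Science', 'Painting'],
                  'class': ['Class 1', 'Class 3', 'Class 4']})

z = pd.concat([x, y], axis=1)

# Both 'class' columns are kept; nothing is matched on their values.
print(list(z.columns))   # ['name', 'class', 'major', 'class']

# Row 1 pairs Li Si (Class 2) with Computer Science (Class 3),
# purely because both sit at index label 1.
print(z.iloc[1].tolist())
# ['Li Si', 'Class 2', 'Computer Science', 'Class 3']
```

When the goal is to match rows on the value of `class`, merge is the better fit; concat with axis=1 suits frames that already share a meaningful index.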
append
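append is mainly used to append data vertically. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call below only runs on older versions; pd.concat([x, y]) gives the same result.
x = pd.DataFrame([['Jack','M',40],['Tony','M',20]], columns=['name','gender','age'])
y = pd.DataFrame([['Mary','F',30],['Bob','M',25]], columns=['name','gender','age'])
x.append(y)  # pandas < 2.0 only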
The result is as follows:
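Since `DataFrame.append` no longer exists in pandas 2.0 and later, here is a minimal sketch of the same operation written with `pd.concat`. The extra row `new_row` is a hypothetical record invented for illustration; it shows one way to append a single dict-like row.

```python
import pandas as pd

x = pd.DataFrame([['Jack', 'M', 40], ['Tony', 'M', 20]],
                 columns=['name', 'gender', 'age'])
y = pd.DataFrame([['Mary', 'F', 30], ['Bob', 'M', 25]],
                 columns=['name', 'gender', 'age'])

# Equivalent of the old x.append(y): y's rows go under x's rows.
appended = pd.concat([x, y])
print(list(appended['name']))   # ['Jack', 'Tony', 'Mary', 'Bob']

# Appending one row: wrap the (hypothetical) record in a one-row frame.
new_row = {'name': 'Ann', 'gender': 'F', 'age': 35}
with_row = pd.concat([x, pd.DataFrame([new_row])], ignore_index=True)
print(with_row.shape)           # (3, 3)
```

When many pieces are added in a loop, collecting them in a list and calling `pd.concat` once at the end avoids copying the growing frame on every iteration.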
combine
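combine merges two DataFrames column by column: for each pair of matching columns it calls the function you supply and uses the return value as the new column.
import numpy as np

x = pd.DataFrame({
    "A": [3, 4], "B": [1, 4]})
y = pd.DataFrame({
    "A": [1, 2], "B": [5, 6]})
# For each column pair (a from x, b from y), keep the larger value element-wise.
x.combine(y, lambda a, b: np.where(a > b, a, b))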
The results are as follows:
Note: The function above returns the larger of the two values at each position.
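The example can be checked with a short sketch. It runs the same `np.where` lambda and, as an alternative chosen for this sketch, passes NumPy's `np.maximum` directly, which computes the same element-wise maximum. The names `by_where` and `by_max` are illustrative only.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({"A": [3, 4], "B": [1, 4]})
y = pd.DataFrame({"A": [1, 2], "B": [5, 6]})

# The function receives one column from each frame at a time.
by_where = x.combine(y, lambda a, b: np.where(a > b, a, b))

# np.maximum takes two arrays and returns their element-wise maximum.
by_max = x.combine(y, np.maximum)

print(by_where.values.tolist())  # [[3, 5], [4, 6]]
print(by_max.values.tolist())    # [[3, 5], [4, 6]]
```

Column A keeps x's values (3 and 4 beat 1 and 2), while column B takes y's values (5 and 6 beat 1 and 4).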