A detailed explanation of the parameters of the pandas concat and merge functions (with code examples)

Create a demo DataFrame

import pandas as pd

# First demo table: name, age ("old") and weight
data = {
    'name': ['user1', 'user2', 'user3', 'user4', 'user5'],
    'old': [21, 18, 22, 28, 26],
    'weight': [124, 124, 102, 107, 121],
}
test_DataFrame1 = pd.DataFrame(data)
test_DataFrame1

# Second demo table: name, age ("old") and height ("high")
data = {
    'name': ['user1', 'user3', 'user5', 'user6', 'user7'],
    'old': [21, 22, 26, 30, 31],
    'high': [171, 165, 180, 175, 159],
}
test_DataFrame2 = pd.DataFrame(data)
test_DataFrame2

Merge function

1. concat

Other languages also have functions named concat, but there they usually mean string concatenation, as in C or SQL, so don't confuse them. In pandas, concat stacks multiple objects along one axis, which is comparable to UNION ALL in a database. Like UNION ALL (and unlike UNION), it does not remove duplicates, but the drop_duplicates method can be applied to the result to deduplicate it.
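As a minimal sketch of the deduplication remark above, the following uses two small frames (a and b are illustrative names, not part of the original demo data) that share one identical row:

```python
import pandas as pd

# Two hypothetical frames that share the row ('user2', 18)
a = pd.DataFrame({'name': ['user1', 'user2'], 'old': [21, 18]})
b = pd.DataFrame({'name': ['user2', 'user3'], 'old': [18, 22]})

stacked = pd.concat([a, b])           # keeps both copies of the user2 row
deduped = stacked.drop_duplicates()   # compares column values, not index labels

print(len(stacked))  # 4
print(len(deduped))  # 3
```

drop_duplicates keeps the first occurrence by default, so the index of the result still carries the original labels.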

The syntax is as follows (this is the signature from older pandas releases; join_axes was removed in pandas 1.0):

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
       keys=None, levels=None, names=None, verify_integrity=False, copy=True)

Official website: pandas.concat

Parameter explanation:

  • objs: A sequence or mapping of Series or DataFrame objects (older versions also accepted Panel), that is, the objects to be concatenated. There can be more than two.
  • axis: The axis to concatenate along. The default is 0. 0 stacks the objects row-wise (one below the other), 1 places them column-wise (side by side).
  • join: {'inner', 'outer'}, default 'outer'. How to handle the labels on the other axis. outer takes the union and inner takes the intersection.
  • ignore_index: boolean, default False. If True, the index values along the concatenation axis are not used and the resulting axis is labeled 0, ..., n-1. Labels on the other axis are still respected in the join.
  • join_axes: List of Index objects. Use these specific indexes for the other n-1 axes instead of the inner/outer set logic. Removed in pandas 1.0.
  • keys: sequence, default None. Build a hierarchical index using the passed keys as the outermost level. If multiple levels are passed, the keys should be tuples.
  • levels: list of sequences, default None. The specific levels (unique values) used to build the MultiIndex. Otherwise, they are inferred from the keys.
  • names: list, default None. The names of the levels in the resulting hierarchical index.
  • verify_integrity: boolean, default False. Check whether the new concatenated axis contains duplicates, and raise an error if it does.
  • copy: boolean, default True. If False, data is not copied unnecessarily.
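The keys, names and verify_integrity parameters are not exercised by the examples in this article, so here is a small sketch of how they behave. The frames a and b and the labels 'first', 'second', 'source' and 'row' are made up for illustration:

```python
import pandas as pd

# Two hypothetical frames, both with the default index 0, 1
a = pd.DataFrame({'name': ['user1', 'user2'], 'old': [21, 18]})
b = pd.DataFrame({'name': ['user3', 'user4'], 'old': [22, 28]})

# keys adds an outer index level recording which input each row came from;
# names labels the two levels of the resulting MultiIndex
labeled = pd.concat([a, b], keys=['first', 'second'], names=['source', 'row'])
from_b = labeled.loc['second']   # only the rows that came from b
print(from_b)

# verify_integrity=True raises ValueError because the labels 0 and 1
# appear in both inputs
try:
    pd.concat([a, b], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)  # True
```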

Calling concat with only the list of objects, leaving every other parameter at its default:

pd.concat([test_DataFrame1, test_DataFrame2])

Here axis defaults to 0, so the rows of the two tables are stacked. With axis=1 the tables are placed side by side instead:

pd.concat([test_DataFrame1, test_DataFrame2], axis=1)

The number of rows does not change and the number of columns increases. Changing join to 'inner' (intersection) keeps only the columns that both tables have and drops the others:

pd.concat([test_DataFrame1, test_DataFrame2], axis=0, join='inner')

The same applies when axis is 1, except that the row index is compared instead: only rows whose index labels appear in both tables are kept.

pd.concat([test_DataFrame1, test_DataFrame2], axis=1, join='inner')
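To make the effect of axis and join concrete, the sketch below compares result shapes. Because the two demo tables share the same default index 0-4, join='inner' with axis=1 drops nothing for them, so a copy of the second table with a shifted index (the name shifted is an assumption added for illustration) is used for the column-wise case:

```python
import pandas as pd

test_DataFrame1 = pd.DataFrame({
    'name': ['user1', 'user2', 'user3', 'user4', 'user5'],
    'old': [21, 18, 22, 28, 26],
    'weight': [124, 124, 102, 107, 121],
})
test_DataFrame2 = pd.DataFrame({
    'name': ['user1', 'user3', 'user5', 'user6', 'user7'],
    'old': [21, 22, 26, 30, 31],
    'high': [171, 165, 180, 175, 159],
})

# Row-wise: outer keeps all four column names, inner keeps only name and old
rows_outer = pd.concat([test_DataFrame1, test_DataFrame2], axis=0)
rows_inner = pd.concat([test_DataFrame1, test_DataFrame2], axis=0, join='inner')
print(rows_outer.shape)  # (10, 4)
print(rows_inner.shape)  # (10, 2)

# Column-wise: give the second table index labels 3..7 so that only
# labels 3 and 4 are shared with the first table
shifted = test_DataFrame2.copy()
shifted.index = [3, 4, 5, 6, 7]
cols_outer = pd.concat([test_DataFrame1, shifted], axis=1)
cols_inner = pd.concat([test_DataFrame1, shifted], axis=1, join='inner')
print(cols_outer.shape)  # (8, 6)
print(cols_inner.shape)  # (2, 6)
```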

Newer versions of pandas have removed join_axes. Its effect can be reproduced by reindexing the result, or the task can be handled with merge instead.

ignore_index discards the original index labels and renumbers the result from 0 to n-1:

pd.concat([test_DataFrame1, test_DataFrame2], axis=0, join='outer', ignore_index=True)
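A quick way to see what ignore_index changes is to compare the index of the result with and without it. This sketch uses two small made-up frames (a and b) rather than the full demo tables:

```python
import pandas as pd

# Two hypothetical frames, both with the default index 0, 1
a = pd.DataFrame({'name': ['user1', 'user2'], 'old': [21, 18]})
b = pd.DataFrame({'name': ['user6', 'user7'], 'old': [30, 31]})

plain = pd.concat([a, b])                          # original labels are kept
renumbered = pd.concat([a, b], ignore_index=True)  # labels are rebuilt

print(plain.index.tolist())       # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Duplicate labels such as those in plain make label-based selection return several rows, which is a common reason to pass ignore_index=True.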

2. merge

merge works more like a join in a SQL relational database: it connects two tables according to matching values in their key columns.

The syntax is as follows:

merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=False,
      suffixes=('_x', '_y'), copy=True, indicator=False)

Official website: pandas.merge

Parameter Description:

  • left: The left DataFrame participating in the merge
  • right: The right DataFrame participating in the merge
  • how: {'inner', 'outer', 'left', 'right'}, default 'inner' (the intersection of the keys). Newer pandas versions also accept 'cross'.
  • on: Column name(s) to join on. They must exist in both tables, like a shared key. If on is not specified and no other join keys are given, the intersection of the column names of left and right is used as the join key.
  • left_on: The column(s) in the left DataFrame to use as the join key
  • right_on: The column(s) in the right DataFrame to use as the join key
  • left_index: Use the row index of the left DataFrame as the join key
  • right_index: Use the row index of the right DataFrame as the join key
  • sort: Sort the merged data by the join keys. The default is False in current pandas; leaving it at False is usually faster.
  • suffixes: A tuple of strings appended to overlapping column names from the left and right DataFrames, default ('_x', '_y'). If both DataFrame objects have a non-key column "Data", then "Data_x" and "Data_y" appear in the result.
  • copy: Defaults to True, which always copies the data into the resulting data structure; setting it to False can improve performance in some cases.
  • indicator: If True, adds a column named _merge that shows whether each row comes from the left table only, the right table only, or both.
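The indicator parameter is not used in the examples of this article, so here is a short sketch of what it produces on the two demo tables. The variable names marked, both, left_only and right_only are chosen for illustration:

```python
import pandas as pd

test_DataFrame1 = pd.DataFrame({
    'name': ['user1', 'user2', 'user3', 'user4', 'user5'],
    'old': [21, 18, 22, 28, 26],
    'weight': [124, 124, 102, 107, 121],
})
test_DataFrame2 = pd.DataFrame({
    'name': ['user1', 'user3', 'user5', 'user6', 'user7'],
    'old': [21, 22, 26, 30, 31],
    'high': [171, 165, 180, 175, 159],
})

# indicator=True adds a _merge column with the values
# 'left_only', 'right_only' or 'both'
marked = pd.merge(test_DataFrame1, test_DataFrame2,
                  on='name', how='outer', indicator=True)

both = (marked['_merge'] == 'both').sum()              # user1, user3, user5
left_only = (marked['_merge'] == 'left_only').sum()    # user2, user4
right_only = (marked['_merge'] == 'right_only').sum()  # user6, user7
print(both, left_only, right_only)  # 3 2 2
```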

With the same two DataFrames as before, the default is an inner join on the columns the two tables have in common (name and old):

pd.merge(test_DataFrame1, test_DataFrame2)

With how='outer', the union of the keys is kept instead. Where one table has no matching row, the missing values are automatically filled with NaN:

pd.merge(test_DataFrame1, test_DataFrame2, how='outer')
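The following sketch checks the two calls above on the demo tables: the default merge behaves like an explicit on=['name', 'old'], and the outer merge has more rows with NaN where one side is missing. The variable names inner, explicit and outer are illustrative:

```python
import pandas as pd

test_DataFrame1 = pd.DataFrame({
    'name': ['user1', 'user2', 'user3', 'user4', 'user5'],
    'old': [21, 18, 22, 28, 26],
    'weight': [124, 124, 102, 107, 121],
})
test_DataFrame2 = pd.DataFrame({
    'name': ['user1', 'user3', 'user5', 'user6', 'user7'],
    'old': [21, 22, 26, 30, 31],
    'high': [171, 165, 180, 175, 159],
})

inner = pd.merge(test_DataFrame1, test_DataFrame2)
explicit = pd.merge(test_DataFrame1, test_DataFrame2, on=['name', 'old'])
outer = pd.merge(test_DataFrame1, test_DataFrame2, how='outer')

print(inner.equals(explicit))        # True
print(len(inner), len(outer))        # 3 7
print(outer['weight'].isna().sum())  # 2 (user6 and user7 have no weight)
print(outer['high'].isna().sum())    # 2 (user2 and user4 have no high)
```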

To join on one specific column, use the on parameter:

pd.merge(test_DataFrame1, test_DataFrame2, how='outer', on='name')

Because both DataFrame objects have an "old" column that is no longer part of the join key, "old_x" and "old_y" appear in the result.
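The default suffixes can be replaced with more descriptive ones through the suffixes parameter. In this sketch the suffix strings '_left' and '_right' are arbitrary choices for illustration:

```python
import pandas as pd

test_DataFrame1 = pd.DataFrame({
    'name': ['user1', 'user2', 'user3', 'user4', 'user5'],
    'old': [21, 18, 22, 28, 26],
    'weight': [124, 124, 102, 107, 121],
})
test_DataFrame2 = pd.DataFrame({
    'name': ['user1', 'user3', 'user5', 'user6', 'user7'],
    'old': [21, 22, 26, 30, 31],
    'high': [171, 165, 180, 175, 159],
})

default_names = pd.merge(test_DataFrame1, test_DataFrame2,
                         how='outer', on='name')
custom_names = pd.merge(test_DataFrame1, test_DataFrame2,
                        how='outer', on='name',
                        suffixes=('_left', '_right'))

print(list(default_names.columns))  # includes old_x and old_y
print(list(custom_names.columns))   # includes old_left and old_right
```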

If the key columns have different names in the two tables, for example a test_DataFrame3 whose key column is named user rather than name, use left_on and right_on to merge it with test_DataFrame1:

pd.merge(test_DataFrame1, test_DataFrame3, how='outer', left_on=['name'], right_on=['user'])
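The definition of test_DataFrame3 is not shown in this article. The sketch below invents one purely for illustration (its rows and its high column are assumptions; only the key column name user comes from the call above) to show what left_on and right_on produce:

```python
import pandas as pd

test_DataFrame1 = pd.DataFrame({
    'name': ['user1', 'user2', 'user3', 'user4', 'user5'],
    'old': [21, 18, 22, 28, 26],
    'weight': [124, 124, 102, 107, 121],
})

# Hypothetical third table: same kind of key, but stored in a column named user
test_DataFrame3 = pd.DataFrame({
    'user': ['user1', 'user3', 'user8'],
    'high': [171, 165, 168],
})

merged = pd.merge(test_DataFrame1, test_DataFrame3, how='outer',
                  left_on=['name'], right_on=['user'])

# Both key columns are kept; each holds NaN where its side had no match
print(len(merged))                 # 6
print(merged['name'].isna().sum())  # 1 (user8 exists only on the right)
print(merged['user'].isna().sum())  # 3 (user2, user4, user5 exist only on the left)
```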

The sort parameter sorts the rows of the result by the join keys:

pd.merge(test_DataFrame1, test_DataFrame2, how='outer', sort=True)
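The effect of sort is easiest to see with a left join whose left table is not already in key order. The frames left and right below are made up for this sketch:

```python
import pandas as pd

# Hypothetical tables; the left one is deliberately out of key order
left = pd.DataFrame({'name': ['user3', 'user1', 'user2'],
                     'weight': [102, 124, 124]})
right = pd.DataFrame({'name': ['user1', 'user2', 'user3'],
                      'high': [171, 160, 165]})

unsorted_result = pd.merge(left, right, on='name', how='left')
sorted_result = pd.merge(left, right, on='name', how='left', sort=True)

print(unsorted_result['name'].tolist())  # ['user3', 'user1', 'user2']
print(sorted_result['name'].tolist())    # ['user1', 'user2', 'user3']
```

Without sort=True a left join keeps the key order of the left table, so sorting is only worth its cost when the ordered result is actually needed.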

Origin blog.csdn.net/m0_59596937/article/details/127235933