Pandas data-combining operations: merge, join, concat

When doing data processing you often need to combine several data sets. Since the data is usually read with Pandas, the combining operations are generally performed on DataFrame objects.

1. merge

Joins the rows of two data sets on one or more keys, similar to a SQL JOIN. A typical use case: two tables share the same primary key but contain different fields, and we want to combine them into one table. In this scenario the number of rows in the result does not increase, while the number of columns equals the sum of the columns of the two source tables minus the number of join keys.

df.merge(right, how='inner', on=None, left_on=None,
         right_on=None, left_index=False, right_index=False, 
         sort=False, suffixes=('_x', '_y'), copy=True)

Parameter explanations:

right: the DataFrame or named Series to join with

how: the join type, analogous to SQL (left, right, inner, outer); the default is 'inner'

on: the column or index level name(s) to join on, i.e. which column or index the two objects should be connected through; the name(s) must exist in both objects

left_on: the column or index level name(s) in the left DataFrame to use as join keys

right_on: the column or index level name(s) in the right DataFrame to use as join keys

left_index: use the index of the left DataFrame as its join key

right_index: use the index of the right DataFrame as its join key

sort: sort the join keys lexicographically in the result; the default is False, in which case the key order depends on the join type

suffixes: suffixes appended to overlapping column names after the join; the default is ('_x', '_y')

copy: copy the data by default; if set to False, avoid copying as much as possible

a. By default, columns with the same name in both DataFrames are used as the join key (see the sketch below).
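For illustration only, here is a minimal sketch of this default behaviour; the frames df1 and df2 are made-up examples, not taken from the original post.

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [4, 5, 6]})

# With no 'on' argument, merge uses the overlapping column 'key' as the join key.
# The default how='inner' keeps only keys present in both frames ('a' and 'b').
print(df1.merge(df2))

# Naming the key explicitly and switching to an outer join keeps every key
# and fills the missing values with NaN.
print(df1.merge(df2, on='key', how='outer'))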

 

2. join 

Joins columns, mainly for merging on the index; it provides a convenient way to combine two DataFrames with different columns into a single DataFrame based on their indexes.

df.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Parameter explanations:

other: the DataFrame, named Series, or list of DataFrames to join with

on: the column or index level name to join on, i.e. which column or index of the calling DataFrame should be matched against the index of other

how: the join type, analogous to SQL (left, right, inner, outer); here the default is 'left'

lsuffix: suffix to append to the left DataFrame's overlapping column names

rsuffix: suffix to append to the right DataFrame's overlapping column names

sort: sort the result lexicographically by the join key; the default is False, in which case the row order of the left (calling) DataFrame is preserved
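
As a rough sketch of these parameters (the frames left, right, left2 and right2 are invented for illustration):

import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'B': [4, 5]}, index=['a', 'b'])

# join aligns on the index; the default how='left' keeps every row of `left`
# and fills the values missing from `right` with NaN.
print(left.join(right))

# When the two frames share a column name, lsuffix/rsuffix keep the names apart.
left2 = pd.DataFrame({'val': [1, 2]}, index=['a', 'b'])
right2 = pd.DataFrame({'val': [3, 4]}, index=['a', 'b'])
print(left2.join(right2, lsuffix='_l', rsuffix='_r'))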

3. concat

Stacks multiple objects together along an axis.

The concat method is the equivalent of a full union (UNION ALL) in a database: you can specify the axis to concatenate along, and you can specify the join method (only outer and inner are available). Unlike a database, concat does not remove duplicates; to deduplicate the result, use the drop_duplicates method.

pandas.concat(objs, axis=0, join='outer',
              ignore_index=False, keys=None, levels=None,
              names=None, verify_integrity=False, copy=True)

Parameter explanations:

objs: the collection of objects to concatenate, usually a list or dict

axis: the axis to concatenate along; 0 means along the index (rows), 1 means along the columns; the default is 0

join: how to handle the other axis; the default is 'outer', and 'inner' is the only other choice

ignore_index: the default is False; if True, the original indexes are ignored and a new default index from 0 to n-1 is assigned. This is most useful when the original indexes are meaningless after concatenation and you want to rebuild the index.

keys: used to build a hierarchical index on the concatenation axis

levels: the specific levels (unique values) to use when constructing the hierarchical index

names: names for the levels of the resulting hierarchical index

verify_integrity: check whether the newly concatenated axis contains duplicate values; this check can be very resource-intensive

copy: whether to copy the data; copies by default
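
To tie the parameters together, a small sketch follows; df_a and df_b are invented example frames, not from the original post.

import pandas as pd

df_a = pd.DataFrame({'k': ['a', 'b'], 'v': [1, 2]})
df_b = pd.DataFrame({'k': ['b', 'c'], 'v': [2, 3]})

# axis=0 stacks the rows, like UNION ALL; the duplicated row ('b', 2) is kept.
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(stacked)

# concat does not deduplicate; drop_duplicates removes the repeated row.
print(stacked.drop_duplicates())

# keys builds a hierarchical index recording which frame each row came from.
print(pd.concat([df_a, df_b], keys=['first', 'second']))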


Origin: blog.csdn.net/qq_27575895/article/details/88789147