Summary of commonly used functions in pandas data processing

1. Read files and view data sheet information

Read and save files

1. pd.read_csv()/pd.read_excel()

Read a csv file/excel file.
The first parameter is the file path; for read_excel() you can also pass sheet_name, which defaults to 0 (the first sheet).
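
For example (a minimal sketch; 'data.csv' and 'data.xlsx' are hypothetical file names):

import pandas as pd

df = pd.read_csv('data.csv')                    # read a csv file
df2 = pd.read_excel('data.xlsx', sheet_name=0)  # read the first sheet of an excel file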

2. df.to_csv('file_path')/df.to_excel()

Save as a csv file/excel file.
The parameter is the file path.
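
For example (a sketch; the file names are hypothetical, and index=False simply drops the row index from the output):

df.to_csv('out.csv', index=False)  # save as csv without the index column
df.to_excel('out.xlsx')            # save as excel (requires an engine such as openpyxl)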

Common functions

1. df.shape
View the data dimensions: shape is an attribute, not a method, so there are no parentheses. The return value is a tuple and can be indexed, e.g. df.shape[0] is the number of rows.

import numpy as np
import pandas as pd

df=pd.DataFrame({'id':[1,np.nan,3,4],'name':['asx',np.nan,'wes','asd'],'score':[78,90,np.nan,88]},index=list('abcd'))
print(df.shape[0])
df

Output result
2. df.info()
Basic information about the data table (dimensions, column names, data types, memory usage, etc.):

df=pd.DataFrame({'id':[1,np.nan,3,4],'name':['asx',np.nan,'wes','asd'],'score':[78,90,np.nan,88]},index=list('abcd'))
df.info()

info result
3. df.dtypes
The data type of each column (dtypes is an attribute, so no parentheses).

4. df.head(n)/df.tail(n)
Return the first n rows/the last n rows of data. n can be omitted and defaults to 5.

5. df.describe()
Summary statistics for the numeric columns: count, mean, standard deviation, min, quartiles (including the median) and max.

6. df.idxmax()/df.max()
Find the index label of the maximum value/the maximum value itself. idxmax() only works on numeric data.
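
For example, on the df defined above (NaN is skipped by default):

print(df['score'].max())     # 90.0, the maximum value
print(df['score'].idxmax())  # 'b', the index label where the maximum occurs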

7. df.isnull()
Returns a boolean DataFrame of the same shape: True where a cell is NaN, False otherwise.
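
A common pattern is to chain sum() after it to count the missing values per column:

print(df.isnull())        # boolean DataFrame, True where a cell is NaN
print(df.isnull().sum())  # number of missing values in each column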

2. Data processing

0. df['col_name'].value_counts()

Count the number of occurrences of each value in the column.
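
For example, on a hypothetical column of city codes:

s = pd.Series(['bj', 'sh', 'bj', 'gz', 'bj'])
print(s.value_counts())  # bj appears 3 times, sh and gz once each; NaN is excluded by default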

1. df.sum()
Sums each column; with axis=1 it sums each row instead.

2. df.dropna()
Drops rows that contain any null value and returns the remaining rows. With how='all', only rows whose values are all null are dropped.
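
For example, on the df defined above (rows 'b' and 'c' contain NaN, but no row is entirely NaN):

print(df.dropna())           # keeps only rows 'a' and 'd'
print(df.dropna(how='all'))  # drops nothing here, since no row is all null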

3. df.fillna(value=0)
Missing-value fill; here the number 0 is used to fill empty cells.
Usage: df.fillna(0) fills every null value with 0.

Per-column filling uses a dictionary, e.g. dict={'col1':0,'col2':'Unknown',...}:
df.fillna(dict)/df.fillna(value=dict)

You can also use the method parameter: method='bfill' (backfill) fills each null with the next value that appears after it, while method='ffill' fills it with the value before it.
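
A sketch of the three variants on the df defined above ('Unknown' and the mean fill are arbitrary choices; newer pandas versions prefer df.ffill()/df.bfill() over the method parameter):

df.fillna(0)                                                          # every NaN becomes 0
df.fillna({'id': 0, 'name': 'Unknown', 'score': df['score'].mean()})  # per-column fill values
df.fillna(method='ffill')                                             # propagate the previous value forward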

4. DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Removes duplicate rows. The meaning of each parameter:
subset: column label or sequence of labels, optional. Specifies which columns to check for duplicates; pass a single column name, or a list such as ['a','b'] for several columns (a row is dropped only when those columns are duplicated together). Defaults to all columns.
keep: {'first','last', False}, default 'first'. Which duplicate to keep: 'first' keeps the first occurrence, 'last' keeps the last, and False drops every duplicated row.
inplace: boolean, default False. Whether to modify the original data directly or return a modified copy.

df3=pd.DataFrame({'a':[1,1,2,3],'b':[22,22,34,12],'c':[1,1,1,1]})
df3.drop_duplicates('a')

Deduplicated result
5. df.drop(labels=['col_name'], axis=1, inplace=True)
Delete one or more columns.

6. df.pop('col_name')
Pop a column: removes it from the DataFrame and returns it as a Series.
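
A sketch of both deletions on a hypothetical frame:

df4 = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df4 = df4.drop(labels=['c'], axis=1)  # drop column 'c' (or pass inplace=True)
b = df4.pop('b')                      # remove column 'b' and get it back as a Series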

7. df.insert(0, 'col_name', df['col'])

Insert a column: the first parameter is the position, the second the new column name, and the third the Series to insert. Note that the index of the inserted Series must match the DataFrame's index.
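
For example, copying column 'a' of the df3 defined above to the front under a hypothetical new name:

df3.insert(0, 'a_copy', df3['a'])  # position 0, new column name, Series to insert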

8. df.set_index('col')/df.reset_index()

set_index('col') makes the column the index; missing values in it become NaN index labels.
reset_index() is usually called without parameters: it turns the existing index back into a normal column and restores the default integer index.
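
A minimal sketch on the df3 defined above:

df3 = df3.set_index('a')  # column 'a' becomes the index
df3 = df3.reset_index()   # 'a' becomes a normal column again, with a default integer index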

9. pd.to_datetime()
Converts a column to a time series. The column must be in a recognizable time format; otherwise plain numbers are interpreted as an offset (nanoseconds by default) from January 1, 1970.

df3=pd.DataFrame({'a':[1,1,2,3],'b':[22,22,34,12],'c':[1,1,1,1]})
df3['a']=pd.to_datetime(df3['a'])
df3=df3.set_index('c')

Time series results
10. df['col'].apply(func) / df.apply(lambda x: x.sum()/3, axis=1)

The apply function processes the selected data with the function in parentheses. On a single column it is applied element-wise (Series.apply takes no axis argument). On a whole DataFrame, axis=1 means each row is passed to the function; the default (axis=0) processes each column.
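
A sketch of both forms, re-creating the small numeric frame from above:

df3 = pd.DataFrame({'a': [1, 1, 2, 3], 'b': [22, 22, 34, 12], 'c': [1, 1, 1, 1]})
df3['a'].apply(lambda x: x * 2)               # element-wise on one column; no axis argument
df3.apply(lambda row: row.sum() / 3, axis=1)  # axis=1: each row is passed to the function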

11. df['col'].replace('text1', 'text2')

Replace text1 with text2.
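
For example, on a hypothetical column:

df = pd.DataFrame({'city': ['BJ', 'SH', 'BJ']})
df['city'] = df['city'].replace('BJ', 'Beijing')  # every exact 'BJ' becomes 'Beijing'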

12. df['col'].str.contains(text_list, regex=True, na=False)

Query whether each cell contains some content. text_list is a plain-text string or a regular expression; regex=True means it is treated as a regular expression. Returns True for cells that match and False otherwise; na=False makes null cells return False instead of NaN.
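
Continuing the hypothetical city column from above:

mask = df['city'].str.contains('Bei|Shang', regex=True, na=False)  # regex match; null cells give False
print(df[mask])  # only the rows whose city matches the pattern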

13. df3['col'].round(decimals=2)

Round the values to a given number of decimal places, here two.

14. df.map(str.strip)

Remove whitespace: applies str.strip element-wise to every cell in the data table, stripping leading and trailing spaces and similar characters (in older pandas versions this element-wise method is called applymap).

In detail, str.strip removes spaces and line breaks by default, working inward from both ends of the string; as soon as it meets a character that is not in the strip set, it stops.
More usage: with str1 = 'hiahia ohoh haha ihih',
str1.strip('hai') gives ' ohoh haha ' (the surrounding spaces remain, because a space is not in the set 'hai').
Strings also have two similar methods, lstrip() and rstrip(): the first strips only the head, the second only the tail. The usage is the same.
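
A short demonstration of that behaviour:

s = 'hiahia ohoh haha ihih'
print(s.strip('hai'))   # ' ohoh haha ' - stripping stops at the spaces, which are not in the set
print(s.lstrip('hai'))  # ' ohoh haha ihih' - head only
print(s.rstrip('hai'))  # 'hiahia ohoh haha ' - tail only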

3. Text splitting and splicing

1. df['col'].str.split('-')

Column character separation.
For example: df['counter name'].str.split('-')
splits on a given symbol, here "-", and with expand=True forms a new DataFrame.

str.split() takes three parameters: the first is the separator to split on, the second (n) is the maximum number of splits, and the third (expand) controls whether a DataFrame is returned. To split from the right instead of the left, use rsplit(); its usage is otherwise the same as split().
The returned DataFrame uses the same index as the original data and can be merged back directly.
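
For example, on a hypothetical 'counter name' column:

df = pd.DataFrame({'counter name': ['a-1', 'b-2', 'c-3']})
parts = df['counter name'].str.split('-', expand=True)  # a new DataFrame with columns 0 and 1
df = df.join(parts)  # same index as the original, so it can be merged back directly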

2. df['col'].str.cat()

Character splicing.
df['col_name'].str.cat(df['col_name2'], sep='-') joins the two columns with '-'.
Multi-column splicing: df['col_name'].str.cat([df['col_name2'], df['col_name3']], sep='-')
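
For example, on hypothetical name/id/dept columns (all must be string dtype):

df = pd.DataFrame({'name': ['tom', 'amy'], 'id': ['01', '02'], 'dept': ['a', 'b']})
print(df['name'].str.cat(df['id'], sep='-'))                # 'tom-01', 'amy-02'
print(df['name'].str.cat([df['id'], df['dept']], sep='-'))  # 'tom-01-a', 'amy-02-b'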

4. Data table merge and splicing

1. pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)

The combination of DataFrames. The meaning of each parameter:

left: the left DataFrame to join
right: the right DataFrame to join
how: one of 'left', 'right', 'outer', 'inner'. Default 'inner'. 'inner' takes the intersection of keys, 'outer' the union. For example, with left keys ['A','B','C'] and right keys ['A','C','D']: under 'inner', each A in left is matched and spliced with each A in right, while B is dropped because it has no match in right; under 'outer', matching keys are paired the same way, and keys that appear on only one side are kept, with missing values filled in for the other side's columns.
'left' keeps the left frame's keys; right-hand columns with no match are set to null. 'right' is the same with the sides swapped.

sort: Sort the resulting DataFrame lexicographically by the join keys. The default is True; setting it to False will significantly improve performance in many situations.

suffixes: A tuple of string suffixes applied to overlapping column names. The default is ('_x', '_y').

copy: Always copy data from the passed DataFrame objects (default True), even when reindexing is not needed.

indicator: Adds a column named _merge to the output DataFrame with information about the source of each row. _merge is categorical: rows whose merge key appears only in the left DataFrame get the value left_only, rows whose key appears only in the right DataFrame get right_only, and rows whose key is found in both get both.
on: The name of the join column or index level. Must be found in both the left and right DataFrame objects. If it is not passed, and left_index and right_index are False, the intersection of the columns in the two DataFrames is inferred as the join key.
left_on: The column or index level in the left DataFrame to use as the key. It can be a column name, an index level name, or an array whose length equals the length of the DataFrame.
right_on: The column or index level in the right DataFrame to use as the key. Same options as left_on.
left_index: If True, use the index (row labels) of the left DataFrame as its join key. For a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys in the right DataFrame.
right_index: Same as left_index, for the right DataFrame.
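
A minimal sketch of the key parameters, using two small hypothetical frames:

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'v1': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'C', 'D'], 'v2': [4, 5, 6]})
print(pd.merge(left, right, how='inner', on='key'))                  # keys A and C only
print(pd.merge(left, right, how='outer', on='key', indicator=True))  # A, B, C, D plus a _merge column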

2. pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False)

The second merge method, pd.concat(),
merges data: objs=[df1, df2, ...]
axis=0 stacks the frames vertically (row-wise); axis=1 places them side by side (column-wise), aligned on the index. If ignore_index=True, the index is reset to the default integer index.
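
Re-using the two small frames from the merge sketch above:

print(pd.concat([left, right], axis=0, ignore_index=True))  # stacked vertically, fresh 0..n index
print(pd.concat([left, right], axis=1))                     # aligned on the index, side by side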

Origin: blog.csdn.net/hu_hao/article/details/100568820