[Repost] Pandas quick reference manual (Chinese version)

Pandas practical manual

In this quick reference manual, we use the following abbreviations:

df: any Pandas DataFrame object
s: any Pandas Series object

We also assume the following imports:

import pandas as pd
import numpy as np

Import Data

  • pd.read_csv(filename): Import data from CSV file
  • pd.read_table(filename): Import data from a delimited text file
  • pd.read_excel(filename): Import data from Excel file
  • pd.read_sql(query, connection_object): Import data from SQL table/library
  • pd.read_json(json_string): Import data from a string in JSON format
  • pd.read_html(url): parse URL, string or HTML file, and extract the tables in it
  • pd.read_clipboard(): Get the content from your clipboard and pass it to read_table()
  • pd.DataFrame(dict): Import data from a dictionary object, Key is the column name, Value is the data
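As a minimal sketch (with made-up CSV content): pd.read_csv accepts any file-like object, not just a filename, so it can be demonstrated without touching disk:

```python
import io

import pandas as pd

# read_csv accepts any file-like object, so we wrap CSV text in StringIO
csv_text = "name,age\nAlice,30\nBob,25\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (2, 2)
```

The same pattern works for pd.read_json and pd.read_table.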


 

Export data

  • df.to_csv(filename): Export data to CSV file
  • df.to_excel(filename): Export data to Excel file
  • df.to_sql(table_name, connection_object): Export data to SQL table
  • df.to_json(filename): Export data to text file in Json format
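A small sketch of the export functions on a made-up frame; note that to_csv with no path returns the CSV text rather than writing a file, which is handy for inspecting the output:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
json_text = df.to_json()  # column-oriented JSON string by default
```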


 

Create test objects

  • pd.DataFrame(np.random.rand(20,5)): Create a DataFrame object composed of 20 rows and 5 columns of random numbers
  • pd.Series(my_list): Create a Series object from the iterable object my_list
  • df.index = pd.date_range('1900/1/30', periods=df.shape[0]): add a date index
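Putting the three together, seeding NumPy so the random data is reproducible:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fix the seed so the random data is reproducible
df = pd.DataFrame(np.random.rand(20, 5))

# Replace the default integer index with a daily date index
df.index = pd.date_range("1900/1/30", periods=df.shape[0])

s = pd.Series([10, 20, 30])  # a Series from an ordinary Python list
```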


 

View and check data

  • df.head(n): View the first n rows of the DataFrame object
  • df.tail(n): View the last n rows of the DataFrame object
  • df.shape: View the number of rows and columns (an attribute, not a method)
  • df.info(): View index, data type and memory information
  • df.describe(): View summary statistics of numeric columns
  • s.value_counts(dropna=False): View the counts of unique values in the Series, including missing values
  • df.apply(pd.Series.value_counts): View the unique value and count of each column in the DataFrame object
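A short illustration of the inspection calls on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "A", None],
                   "score": [90, 80, 95, 70]})

first_two = df.head(2)                 # first 2 rows
n_rows, n_cols = df.shape              # shape is an attribute, not a method
counts = df["grade"].value_counts(dropna=False)  # NaN is counted too
```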


 

Data selection

  • df[col]: select a single column by name and return it as a Series
  • df[[col1, col2]]: return multiple columns in the form of a DataFrame
  • s.iloc[0]: select data by location
  • s.loc['index_one']: select data by index
  • df.iloc[0,:]: return the first row
  • df.iloc[0,0]: returns the first element of the first column
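The selection idioms above, on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]},
                  index=["a", "b", "c"])

col = df["x"]                # single column as a Series
sub = df[["x", "y"]]         # multiple columns as a DataFrame
first_row = df.iloc[0, :]    # position-based: the first row
first_cell = df.iloc[0, 0]   # first element of the first column
by_label = df.loc["b", "y"]  # label-based selection
```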


 

Data cleaning

  • df.columns = ['a','b','c']: Rename column names
  • pd.isnull(obj): Check for null values and return a Boolean array
  • pd.notnull(obj): Check for non-null values and return a Boolean array
  • df.dropna(): delete all rows containing null values
  • df.dropna(axis=1): delete all columns containing null values
  • df.dropna(thresh=n): delete rows that have fewer than n non-null values
  • df.fillna(x): Replace all null values in the DataFrame with x
  • s.astype(float): Change the data type of the Series to float
  • s.replace(1, 'one'): Replace all values equal to 1 with 'one'
  • s.replace([1,3], ['one','three']): Replace 1 with 'one' and 3 with 'three'
  • df.rename(columns=lambda x: x + 1): rename column names in bulk
  • df.rename(columns={'old_name': 'new_name'}): rename selected columns
  • df.set_index('column_one'): change the index column
  • df.rename(index=lambda x: x + 1): rename indexes in bulk
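A sketch of the cleaning functions on a made-up frame containing NaNs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0]})

filled = df.fillna(0)         # replace every NaN with 0
complete = df.dropna()        # keep only rows with no NaN at all
enough = df.dropna(thresh=2)  # keep rows with at least 2 non-null values
renamed = df.rename(columns={"a": "alpha"})
s = pd.Series([1, 3]).replace([1, 3], ["one", "three"])
```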


 

Data processing: Filter, Sort and GroupBy

  • df[df[col] > 0.5]: select the rows whose col value is greater than 0.5
  • df.sort_values(col1): Sort the data by column col1, ascending by default
  • df.sort_values(col2, ascending=False): Sort the data in descending order by column col2
  • df.sort_values([col1,col2], ascending=[True,False]): sort ascending by column col1, then descending by col2
  • df.groupby(col): returns a Groupby object grouped by column col
  • df.groupby([col1,col2]): returns a Groupby object grouped by multiple columns
  • df.groupby(col1)[col2].mean(): returns the mean value of column col2 after grouping by column col1 (an aggregation such as .mean() is needed to get values)
  • df.pivot_table(index=col1, values=[col2,col3], aggfunc=max): Create a pivot table that groups by column col1 and calculates the maximum value of col2 and col3
  • df.groupby(col1).agg(np.mean): returns the mean value of all columns grouped by column col1
  • data.apply(np.mean): apply the function np.mean to each column in the DataFrame
  • data.apply(np.max, axis=1): apply the function np.max to each row in the DataFrame
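The filtering, sorting and grouping idioms on a made-up frame; note that groupby(col1)[col2] needs an aggregation such as .mean() to produce values:

```python
import pandas as pd

df = pd.DataFrame({"team": ["x", "x", "y"],
                   "score": [0.9, 0.3, 0.7]})

high = df[df["score"] > 0.5]                        # boolean-mask filter
ordered = df.sort_values("score", ascending=False)  # descending sort
means = df.groupby("team")["score"].mean()          # note the .mean() call
best = df.pivot_table(index="team", values="score", aggfunc="max")
```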


 

Data consolidation

  • pd.concat([df1, df2]): append the rows of df2 to the end of df1 (DataFrame.append was removed in pandas 2.0)
  • pd.concat([df1, df2], axis=1): append the columns of df2 to the end of df1
  • df1.join(df2, on=col1, how='inner'): SQL-style join, matching column col1 of df1 against the index of df2
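A sketch with made-up frames using pd.concat and merge; merge is shown here because DataFrame.join matches against the other frame's index, while merge joins column against column:

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2], "a": ["p", "q"]})
df2 = pd.DataFrame({"key": [2, 3], "b": ["r", "s"]})

stacked = pd.concat([df1, df2], ignore_index=True)  # rows of df2 under df1
side = pd.concat([df1, df2], axis=1)                # columns side by side
joined = df1.merge(df2, on="key", how="inner")      # SQL-style inner join
```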


 

Statistics

  • df.describe(): View summary statistics of data value columns
  • df.mean(): Returns the mean value of all columns
  • df.corr(): Returns the correlation coefficient between columns
  • df.count(): returns the number of non-null values in each column
  • df.max(): returns the maximum value of each column
  • df.min(): returns the minimum value of each column
  • df.median(): returns the median of each column
  • df.std(): Returns the standard deviation of each column
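The summary statistics on a tiny made-up frame (column b is exactly twice column a, so their correlation is 1):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})

means = df.mean()   # column means
corr = df.corr()    # pairwise correlation between columns
spread = df.std()   # sample standard deviation per column
```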


Origin: blog.csdn.net/weixin_52071682/article/details/113446762