In this quick reference manual, we use the following abbreviations:
df: any Pandas DataFrame object
s: any Pandas Series object
Throughout, we assume the following imports:
import pandas as pd
import numpy as np
Import Data
- pd.read_csv(filename): Import data from CSV file
- pd.read_table(filename): Import data from a delimited text file
- pd.read_excel(filename): Import data from Excel file
- pd.read_sql(query, connection_object): Import data from SQL table/library
- pd.read_json(json_string): Import data from a string in JSON format
- pd.read_html(url): parse URL, string or HTML file, and extract the tables in it
- pd.read_clipboard(): Get the content from your clipboard and pass it to read_table()
- pd.DataFrame(dict): Import data from a dictionary object, Key is the column name, Value is the data
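As a minimal sketch of the readers above: pd.read_csv accepts any file-like object, so an in-memory buffer can stand in for a file on disk (no real filename is assumed here).

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "name,score\nalice,90\nbob,85\n"
df = pd.read_csv(io.StringIO(csv_text))  # same call as pd.read_csv(filename)
print(df.shape)     # (2, 2)
print(list(df.columns))  # ['name', 'score']
```

pd.DataFrame(dict) works the same way: each key becomes a column name and each value becomes that column's data.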
Export data
- df.to_csv(filename): Export data to CSV file
- df.to_excel(filename): Export data to Excel file
- df.to_sql(table_name, connection_object): Export data to SQL table
- df.to_json(filename): Export data to a text file in JSON format
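A quick round-trip shows that the export functions mirror the import ones; again an in-memory buffer is used instead of a real file, so the example needs no disk access.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
buf = io.StringIO()
df.to_csv(buf, index=False)   # export without the index column
buf.seek(0)
df2 = pd.read_csv(buf)        # read it back
print(df.equals(df2))         # True
```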
Create test object
- pd.DataFrame(np.random.rand(20,5)): Create a DataFrame object composed of 20 rows and 5 columns of random numbers
- pd.Series(my_list): Create a Series object from the iterable object my_list
- df.index = pd.date_range('1900/1/30', periods=df.shape[0]): add a date index
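Putting the three bullets above together, a disposable test object can be built in a couple of lines:

```python
import numpy as np

import pandas as pd

# 20 rows x 5 columns of random numbers
df = pd.DataFrame(np.random.rand(20, 5))
# add a daily date index starting 1900-01-30
df.index = pd.date_range('1900/1/30', periods=df.shape[0])
s = pd.Series([1, 2, 3])  # Series from an iterable
print(df.shape)  # (20, 5)
```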
View and check data
- df.head(n): View the first n rows of the DataFrame object
- df.tail(n): View the last n rows of the DataFrame object
- df.shape: View the number of rows and columns (an attribute, not a method)
- df.info(): View index, data type and memory information
- df.describe(): View summary statistics of numeric columns
- s.value_counts(dropna=False): View the unique values and counts of the Series object
- df.apply(pd.Series.value_counts): View the unique values and counts of each column in the DataFrame object
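A small sketch of the inspection calls, on a toy frame (the column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 2, None], "y": list("abba")})
print(df.head(2))    # first 2 rows
print(df.shape)      # (4, 2) -- note: shape is an attribute
# unique values and counts, including NaN
counts = df["y"].value_counts(dropna=False)
print(counts["a"])   # 2
```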
Data selection
- df[col]: Select a single column by name and return it as a Series
- df[[col1, col2]]: return multiple columns in the form of a DataFrame
- s.iloc[0]: select data by position
- s.loc['index_one']: select data by index label
- df.iloc[0,:]: return the first row
- df.iloc[0,0]: returns the first element of the first column
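The label-based (loc) and position-based (iloc) selectors can be contrasted on a toy frame (index labels "r1"/"r2" are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20], "col2": [30, 40]}, index=["r1", "r2"])
s = df["col1"]              # single column as a Series
first_row = df.iloc[0, :]   # first row, by position
print(s.loc["r1"])          # 10 -- by index label
print(df.iloc[0, 0])        # 10 -- first element of the first column
```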
Data cleaning
- df.columns = ['a','b','c']: Rename column names
- pd.isnull(df): Check for null values in the DataFrame object and return a Boolean array
- pd.notnull(df): Check for non-null values in the DataFrame object and return a Boolean array
- df.dropna(): delete all rows containing null values
- df.dropna(axis=1): delete all columns that contain null values
- df.dropna(thresh=n): delete all rows with fewer than n non-null values
- df.fillna(x): Replace all null values in the DataFrame object with x
- s.astype(float): Change the data type in the Series to float
- s.replace(1,'one'): Replace all values equal to 1 with 'one'
- s.replace([1,3],['one','three']): Replace 1 with 'one' and 3 with 'three'
- df.rename(columns=lambda x: x + 1): change column names in bulk
- df.rename(columns={'old_name':'new_name'}): Selectively rename columns
- df.set_index('column_one'): change the index column
- df.rename(index=lambda x: x + 1): Rename indexes in batch
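A short sketch of the most common cleaning calls; the column names are hypothetical:

```python
import numpy as np

import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
filled = df.fillna(0)                        # replace all NaN with 0
dropped = df.dropna()                        # keep only complete rows
renamed = df.rename(columns={"a": "alpha"})  # selective rename
print(filled.isnull().values.any())  # False
print(len(dropped))                  # 1
```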
Data processing: Filter, Sort and GroupBy
- df[df[col] > 0.5]: select the rows where the value in column col is greater than 0.5
- df.sort_values(col1): Sort the data by column col1, ascending by default
- df.sort_values(col2, ascending=False): Sort the data in descending order by column col2
- df.sort_values([col1,col2], ascending=[True,False]): first sort the data in ascending order by column col1, and then sort the data in descending order by col2
- df.groupby(col): returns a Groupby object grouped by column col
- df.groupby([col1,col2]): returns a Groupby object grouped by multiple columns
- df.groupby(col1)[col2].mean(): returns the mean value of column col2 after grouping by column col1
- df.pivot_table(index=col1, values=[col2,col3], aggfunc=max): Create a pivot table that groups by column col1 and calculates the maximum value of col2 and col3
- df.groupby(col1).agg(np.mean): returns the mean value of all columns grouped by column col1
- df.apply(np.mean): apply the function np.mean to each column in the DataFrame
- df.apply(np.max, axis=1): apply the function np.max to each row in the DataFrame
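Filtering, sorting and grouping can be sketched together on a toy frame (the column names col1/col2 follow the convention above):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["x", "x", "y"], "col2": [1, 3, 5]})
# boolean filtering: rows where col2 > 2
big = df[df["col2"] > 2]
# mean of col2 per group of col1
means = df.groupby("col1")["col2"].mean()
print(means["x"])  # 2.0
sorted_df = df.sort_values("col2", ascending=False)
print(sorted_df.iloc[0]["col2"])  # 5
```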
Data consolidation
- df1.append(df2): add the rows in df2 to the end of df1 (deprecated in pandas 1.4 and removed in 2.0; use pd.concat([df1, df2]) instead)
- pd.concat([df1, df2], axis=1): add the columns in df2 to the end of df1
- df1.join(df2, on=col1, how='inner'): SQL-style join of column col1 in df1 against the index of df2 (for column-to-column joins, use df1.merge)
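The three combination patterns can be sketched as follows; the key column "k" and the suffixes are invented for the example:

```python
import pandas as pd

df1 = pd.DataFrame({"k": [1, 2], "v": ["a", "b"]})
df2 = pd.DataFrame({"k": [2, 3], "v": ["c", "d"]})
stacked = pd.concat([df1, df2])       # append rows (replacement for df1.append)
side = pd.concat([df1, df2], axis=1)  # place columns side by side
# SQL-style inner join on column k; suffixes disambiguate the shared column v
merged = df1.merge(df2, on="k", how="inner", suffixes=("_l", "_r"))
print(len(stacked), side.shape, len(merged))  # 4 (2, 4) 1
```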
Statistics
- df.describe(): View summary statistics of data value columns
- df.mean(): Returns the mean value of all columns
- df.corr(): Returns the correlation coefficient between columns
- df.count(): returns the number of non-null values in each column
- df.max(): returns the maximum value of each column
- df.min(): returns the minimum value of each column
- df.median(): returns the median of each column
- df.std(): Returns the standard deviation of each column
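The summary statistics above all reduce column-wise by default; a small example (with made-up data where column b is exactly twice column a, so the correlation is 1.0):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]})
print(df.mean()["a"])         # 2.0 -- column-wise mean
print(df["a"].corr(df["b"]))  # 1.0 -- perfectly correlated
print(df.count()["a"])        # 3  -- non-null values per column
```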