Pandas Common Operations Quick Reference

A personal collection of common operations gathered online, with notes added along the way.
Environment: Anaconda, Python 3.7


In this quick reference guide, we use the following abbreviations:

df: an arbitrary Pandas DataFrame object

s: an arbitrary Pandas Series object

The examples also assume the following imports:

import pandas as pd

import numpy as np

Import Data

  • pd.read_csv(filename): import data from a CSV file
  • pd.read_table(filename): import data from a delimited text file (tab-separated by default)
  • pd.read_excel(filename): import data from an Excel file
  • pd.read_sql(query, connection_object): import data from a SQL table/database
  • pd.read_json(json_string): import data from a JSON-formatted string
  • pd.read_html(url): parse a URL, HTML file, or string and extract the tables it contains
  • pd.read_clipboard(): read the clipboard contents and pass them to read_table()
  • pd.DataFrame(dict): build a DataFrame from a dict, where keys are the column names and values are the data
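A minimal sketch tying a few of these together — building a DataFrame from a dict and round-tripping it through an in-memory CSV buffer (the column names and data here are made up for illustration):

```python
import io

import pandas as pd

# Build a DataFrame from a dict: keys become column names, values the data.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 25]})

# Write to an in-memory CSV buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Read it back; read_csv accepts a filename or any file-like object.
df2 = pd.read_csv(buf)
```

The same pattern works with a real filename in place of the buffer.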

Export Data

  • df.to_csv(filename): export data to a CSV file
  • df.to_excel(filename): export data to an Excel file
  • df.to_sql(table_name, connection_object): export data to a SQL table
  • df.to_json(filename): export data to a text file in JSON format
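A small round-trip sketch for the JSON path (the frame contents are made up; with no argument, to_json returns the JSON string instead of writing a file):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

# to_json accepts a filename; called with no argument it returns the string.
json_str = df.to_json()

# read_json takes a path or file-like object, so wrap the string in StringIO.
df2 = pd.read_json(io.StringIO(json_str))
```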

Create a test object

  • pd.DataFrame(np.random.rand(20, 5)): create a DataFrame of 20 rows and 5 columns of random numbers
  • pd.Series(my_list): create a Series object from the iterable my_list
  • df.index = pd.date_range('1900/1/30', periods=df.shape[0]): add a date index
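The three bullets above can be combined into one small, reproducible test object (the seed is added here so the "random" data is deterministic):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fix the seed so the random test object is reproducible
df = pd.DataFrame(np.random.rand(20, 5))  # 20 rows x 5 columns of random numbers
s = pd.Series([10, 20, 30])               # a Series from a plain list

# Replace the default integer index with a date index of matching length.
df.index = pd.date_range("1900/1/30", periods=df.shape[0])
```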

Inspect Data

  • df.head(n): view the first n rows of a DataFrame
  • df.tail(n): view the last n rows of a DataFrame
  • df.shape: view the number of rows and columns (an attribute, not a method — calling it with parentheses raises an error)
  • df.info(): view the index, data types, and memory information
  • df.columns: view the column names
  • df.index: view the index
  • df.describe(): view summary statistics for numeric columns (count, mean, std, min, max, quartiles)
  • s.value_counts(dropna=False): view the unique values of a Series and their counts
  • df.apply(pd.Series.value_counts): view the unique values and counts for every column of a DataFrame
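A quick sketch of the inspection calls on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 2, 3], "y": ["a", "b", "b", "b"]})

first_two = df.head(2)   # first 2 rows
last_one = df.tail(1)    # last row
shape = df.shape         # attribute access: no parentheses
counts = df["y"].value_counts(dropna=False)  # unique values and their counts
```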

Data Selection

  • df[col]: select a column by name, returned as a Series
  • df[[col1, col2]]: select multiple columns, returned as a DataFrame
  • s.iloc[0]: select data by integer position (supports indexing and slicing)
  • s.loc['index_one']: select data by index label
  • df.iloc[0, :]: return the first row (the colon means "from start to end"; a slice length can also be specified)
  • df.iloc[0, 0]: return the first element of the first column
  • df.iloc[:, 0]: return the first column
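The position-based (.iloc) and label-based (.loc) selectors side by side, on an invented frame with string row labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["r1", "r2", "r3"])

col = df["a"]                 # one column as a Series
sub = df[["a", "b"]]          # several columns as a DataFrame
first_row = df.iloc[0, :]     # first row, by integer position
cell = df.iloc[0, 0]          # first element of the first column
by_label = df.loc["r2", "b"]  # .loc selects by index label, not position
```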

Data Cleansing

  • df.columns = ['a', 'b', 'c']: rename the columns
  • df.isnull().any(): check a DataFrame for null values; returns a Boolean value per column
  • df.notnull().any(): check a DataFrame for non-null values; returns a Boolean value per column
  • df[df.notnull().all(axis=1)]: keep only the rows with no null values (equivalent to df.dropna())
  • df[df[col].notnull()]: keep only the rows where column col is not null
  • df.dropna(): drop all rows that contain null values
  • df.dropna(axis=1): drop all columns that contain null values
  • df.dropna(axis=1, thresh=n): drop all columns with fewer than n non-null values
  • df.fillna(x): replace all null values in a DataFrame with x
  • s.astype(float): change the data type of a Series to float
  • s.replace(1, 'one'): replace all values equal to 1 with 'one' (note: in testing, replacing values in an int column can change the whole column's dtype)
  • s.replace([1, 3], ['one', 'three']): replace 1 with 'one' and 3 with 'three'
  • df.rename(columns=lambda x: x + 1): rename columns in bulk
  • df.rename(columns={'old_name': 'new_name'}): selectively rename a column
  • df.set_index('column_one'): change the index column
  • df.rename(index=lambda x: x + 1): rename the index in bulk
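The cleansing calls above, sketched on a small frame with deliberately missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

has_null = df.isnull().any()        # per-column: does it contain nulls?
non_null_a = df[df["a"].notnull()]  # keep rows where column "a" is set
filled = df.fillna(0)               # replace every null with 0
dropped = df.dropna()               # drop every row containing a null
renamed = df.rename(columns={"a": "alpha"})  # selectively rename a column
```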

Data processing: Filter, Sort and GroupBy

  • df[df[col] > 0.5]: select the rows where the values in column col are greater than 0.5
  • df.sort_values(col1): sort the data by column col1, in ascending order by default
  • df.sort_values(col2, ascending=False): sort the data by column col2 in descending order
  • df.sort_values([col1, col2], ascending=[True, False]): sort by column col1 in ascending order, then by column col2 in descending order
  • df.groupby(col): return a GroupBy object grouped by column col (printing it shows only the object, not the data)
  • df.groupby([col1, col2]): return a GroupBy object grouped by multiple columns
  • df.groupby(col1)[col2]: select column col2 after grouping by col1; chain an aggregation such as .mean() to get values (otherwise a lazy GroupBy object is returned)
  • df.pivot_table(index=col1, values=[col2, col3], aggfunc=max): create a pivot table grouped by col1 that computes the maximum of col2 and col3
    • customer_data.pivot_table(index='REFER', values='Age', aggfunc=[max, min]): maximum and minimum age for each channel
  • df.groupby(col1).agg(np.mean): return the mean of all columns, grouped by column col1
    • Often used to show a per-group average, e.g. the average age within each channel (rather than the max/min over the whole dataset)
  • data.apply(np.mean): apply the function np.mean to each column of a DataFrame
  • data.apply(np.max, axis=1): apply the function np.max to each row of a DataFrame
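A sketch of filtering, sorting, grouping, and pivoting on an invented channel/age frame (mirroring the customer_data example above):

```python
import pandas as pd

df = pd.DataFrame({
    "channel": ["A", "A", "B", "B"],
    "age": [20, 30, 40, 50],
})

# Per-group mean: group by channel, select the age column, then aggregate.
mean_age = df.groupby("channel")["age"].mean()

# Pivot table computing several aggregates per group at once.
pivot = df.pivot_table(index="channel", values="age", aggfunc=["max", "min"])

# Filtering and sorting.
adults = df[df["age"] > 25]
by_age_desc = df.sort_values("age", ascending=False)
```

Note that `pivot` has a two-level column index here, e.g. `("max", "age")`.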

Data processing: Add a new column

  • Add a new column computed from existing columns:
    • frame['test'] = frame.apply(lambda x: function(x.city, x.year), axis = 1)
    • function is a user-defined function written beforehand
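A runnable version of the bullet above; the frame contents and the helper make_label are hypothetical stand-ins for the user-defined function:

```python
import pandas as pd

frame = pd.DataFrame({"city": ["NY", "LA"], "year": [2001, 2002]})

def make_label(city, year):
    # Hypothetical user-defined function combining two existing columns.
    return f"{city}-{year}"

# axis=1 passes each row to the lambda, so x.city and x.year are row fields.
frame["test"] = frame.apply(lambda x: make_label(x.city, x.year), axis=1)
```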

Data Merge

  • df1.append(df2): append the rows of df2 to the end of df1 (deprecated in recent pandas; use pd.concat([df1, df2]))
  • pd.concat([df1, df2], axis=1): append the columns of df2 to the end of df1
  • df1.join(df2, on=col1, how='inner'): perform a SQL-style join on the columns of df1 and df2
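A sketch of the three merge styles on two invented frames; merge is used for the column-on-column join (join itself matches against the other frame's index by default):

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
df2 = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

rows = pd.concat([df1, df2])          # stack rows (append-style)
cols = pd.concat([df1, df2], axis=1)  # place the frames side by side
inner = df1.merge(df2, on="key", how="inner")  # SQL-style inner join on a column
```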

Statistics

    • df.describe(): view summary statistics for the numeric columns
    • df.mean(): return the mean of each column
    • df.corr(): return the correlation coefficients between columns
    • df.count(): return the number of non-null values in each column
    • df.max(): return the maximum of each column
    • df.min(): return the minimum value of each column
    • df.median(): return the median of each column
    • df.std(): return the standard deviation of each column
      • A larger value indicates more dispersed data
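The statistics calls on a tiny made-up frame whose columns are perfectly correlated, so the expected results are easy to verify by hand:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})

means = df.mean()      # per-column mean
maxima = df.max()      # per-column maximum
medians = df.median()  # per-column median
counts = df.count()    # non-null values per column
corr = df.corr()       # correlation coefficients between columns
std = df.std()         # standard deviation: larger means more dispersed
```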


Origin www.cnblogs.com/valorchang/p/11387005.html