[Python] Pandas data cleaning operations and a summary of commonly used functions

No.  Content
1    [Python] Introduction to Pandas: the Series and DataFrame data structures, CSV file processing, JSON file processing
2    [Python] Pandas data cleaning operations and a summary of commonly used functions

1. Pandas data cleaning

Data cleaning is the process of fixing or removing data that cannot be used as it is.

Many data sets contain missing values, incorrectly formatted values, wrong values, or duplicate records. To make data analysis more accurate, these problem records need to be handled first.


Sample data is as follows:

[Image: sample data table]

The above table contains four types of empty data:

  • n/a
  • NA
  • -
  • na
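Not all of these markers are treated as missing when the file is loaded. The sketch below uses a small hypothetical inline CSV (not the sample file itself) to show that pandas recognizes n/a and NA by default, while other markers can be declared through the na_values parameter of read_csv:

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for the sample data.
csv_text = "ID,BEDROOMS\n1,3\n2,n/a\n3,NA\n4,-\n5,na\n"

# Default parsing: "n/a" and "NA" become NaN, "-" and "na" stay as strings.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default["BEDROOMS"].isnull().tolist())
# [False, True, True, False, False]

# Extra markers listed in na_values are treated as missing as well.
df_custom = pd.read_csv(io.StringIO(csv_text), na_values=["-", "na"])
print(df_custom["BEDROOMS"].isnull().tolist())
# [False, True, True, True, True]
```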

The dropna() method deletes rows (or columns) that contain empty fields. The syntax is as follows:

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameter description:

  • axis: The default is 0, which drops rows that contain missing values. With axis=1, columns that contain missing values are dropped instead.
  • how: The default is 'any', which drops a row (or column) if it contains any NA value. With how='all', a row (or column) is dropped only if all of its values are NA.
  • thresh: The minimum number of non-NA values a row (or column) must contain in order to be kept.
  • subset: The columns to check for missing values. To check several columns, pass a list of column names.
  • inplace: If set to True, the source DataFrame is modified directly and None is returned.
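As a minimal sketch of how these parameters change the result, the following uses a small made-up DataFrame (the column names a, b, c are illustrative only):

```python
import numpy as np
import pandas as pd

# Illustrative data: row 0 is complete, row 2 is entirely NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, np.nan, np.nan, np.nan],
})

any_rows = df.dropna()               # drop rows with at least one NaN
all_rows = df.dropna(how="all")      # drop only rows that are entirely NaN
two_plus = df.dropna(thresh=2)       # keep rows with at least 2 non-NaN values
by_b = df.dropna(subset=["b"])       # only check column "b"

print(list(any_rows.index))   # [0]
print(list(all_rows.index))   # [0, 1, 3]
print(list(two_plus.index))   # [0, 3]
print(list(by_b.index))       # [0, 1, 3]
```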

Note: By default, the dropna() method returns a new DataFrame and does not modify the source data.

To modify the source DataFrame itself, pass the inplace=True parameter.
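A short sketch of the in-place variant, using a made-up one-column DataFrame rather than the sample file:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# With inplace=True the call changes df itself and returns None.
result = df.dropna(inplace=True)

print(result)           # None
print(list(df.index))   # [0, 2]
```

Because the return value is None, assigning it back (df = df.dropna(inplace=True)) would discard the data.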


The isnull() method checks whether each cell is empty and returns True or False for each one:

import pandas as pd
df = pd.read_csv('property-data.csv')
print(df['NUM_BEDROOMS'])
# True marks the cells that pandas parsed as missing (NaN)
print(df['NUM_BEDROOMS'].isnull())


The fillna() method replaces empty fields with a specified value:

import pandas as pd
df = pd.read_csv('property-data.csv')
df.fillna(12345, inplace = True)
print(df.to_string())
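A single fill value is rarely right for every column. As a sketch, fillna() also accepts a dictionary that maps column names to fill values; the DataFrame and column names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: one numeric and one text column, each with a gap.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 2.0],
    "street": ["PUTNAM", None, "BERKELEY"],
})

# Each column gets its own replacement; df itself is left unchanged.
filled = df.fillna({"rooms": 0, "street": "UNKNOWN"})

print(filled["rooms"].tolist())    # [3.0, 0.0, 2.0]
print(filled["street"].tolist())   # ['PUTNAM', 'UNKNOWN', 'BERKELEY']
```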

The mean(), median(), and mode() methods calculate the mean (the sum of all values divided by their count), the median (the middle value once the values are sorted), and the mode (the most frequent value) of a column. A common way to fill empty cells is to replace them with one of these values. The example below uses the mean:

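import pandas as pd
df = pd.read_csv('property-data.csv')
# Mean of the column; missing values are skipped
x = df["ST_NUM"].mean()
# Assign the result back: calling fillna(inplace=True) on a selected
# column may not update df in recent pandas versions
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())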
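The median and the mode can be used in the same way. A minimal sketch on a made-up Series (note that mode() returns a Series, because several values can tie, so the first entry is taken here):

```python
import numpy as np
import pandas as pd

# Illustrative values with one gap.
s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

mean_filled = s.fillna(s.mean())        # mean of 1, 2, 2, 10 is 3.75
median_filled = s.fillna(s.median())    # median is 2.0
mode_filled = s.fillna(s.mode()[0])     # most frequent value is 2.0

print(mean_filled[3], median_filled[3], mode_filled[3])
# 3.75 2.0 2.0
```

The outlier 10.0 pulls the mean upward, which is one reason the median is often preferred for skewed columns.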


The duplicated() and drop_duplicates() methods help clean duplicate data. duplicated() returns True for every row that repeats an earlier row, and False otherwise.

import pandas as pd
person = {
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)
print(df.duplicated())

The result of running the program is:

0    False
1    False
2     True
3    False
dtype: bool

The drop_duplicates() method removes duplicate rows:

import pandas as pd
persons = {
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 40, 23]
}
df = pd.DataFrame(persons)
df.drop_duplicates(inplace = True)
print(df)

The result of running the program is:

     name  age
0  Google   50
1  Runoob   40
3  Taobao   23
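Both methods compare whole rows by default. As a sketch, the subset and keep parameters control which columns are compared and which occurrence survives; the data below is a variation of the example above in which the two Runoob rows differ in age:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Runoob", "Taobao"],
    "age": [50, 40, 41, 23],
})

# No two rows are fully identical, so nothing is flagged.
print(df.duplicated().tolist())                  # [False, False, False, False]

# Compare only the "name" column.
print(df.duplicated(subset=["name"]).tolist())   # [False, False, True, False]

# keep="last" keeps the last occurrence; keep=False drops every duplicate.
last = df.drop_duplicates(subset=["name"], keep="last")
none_kept = df.drop_duplicates(subset=["name"], keep=False)
print(list(last.index))        # [0, 2, 3]
print(list(none_kept.index))   # [0, 3]
```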

2. Commonly used functions in Pandas

1. Read data

No.  Function                               Description
1    pd.read_csv(filename)                  Read a CSV file
2    pd.read_excel(filename)                Read an Excel file
3    pd.read_sql(query, connection_object)  Read data from a SQL database
4    pd.read_json(json_string)              Read data from a JSON string (newer pandas versions expect it wrapped in io.StringIO)
5    pd.read_html(url)                      Read the tables in an HTML page (returns a list of DataFrames)
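A minimal sketch of the CSV and JSON readers, using in-memory text through io.StringIO so that it runs without any files (the data is made up):

```python
import io
import pandas as pd

# Illustrative in-memory data instead of real files.
csv_text = "name,age\nGoogle,50\nRunoob,40\n"
json_text = '[{"name": "Google", "age": 50}, {"name": "Runoob", "age": 40}]'

# Both readers accept a path, a URL, or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```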

2. View data

No.  Function       Description
1    df.head(n)     Display the first n rows
2    df.tail(n)     Display the last n rows
3    df.info()      Display a summary of the data: column names, data types, non-null counts, and memory usage
4    df.describe()  Display basic statistics: count, mean, standard deviation, minimum, quartiles, and maximum
5    df.shape       The number of rows and columns, as a (rows, columns) tuple
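These calls can be tried on any DataFrame. The sketch below reuses the small name/age data from the duplicate example:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Runoob", "Taobao"],
    "age": [50, 40, 40, 23],
})

print(df.head(2))      # first two rows
print(df.tail(1))      # last row
df.info()              # prints column names, dtypes, and non-null counts
print(df.describe())   # statistics for the numeric column "age"
print(df.shape)        # (4, 2); shape is an attribute, not a method
```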

3. Data cleaning

No.  Function                          Description
1    df.dropna()                       Remove rows or columns that contain missing values
2    df.fillna(value)                  Replace missing values with a specified value
3    df.replace(old_value, new_value)  Replace a specified value with a new value
4    df.duplicated()                   Mark rows that duplicate an earlier row
5    df.drop_duplicates()              Remove duplicate rows
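replace() is the only function in this table not demonstrated earlier. One possible use, sketched on made-up data, is turning a placeholder such as "-" into a real missing value after loading, so that the column can be converted to numbers:

```python
import numpy as np
import pandas as pd

# Illustrative column in which "-" was used to mean "no value".
df = pd.DataFrame({"sq_ft": ["1000", "-", "850"]})

# Swap the placeholder for NaN, then convert the column to numbers.
cleaned = df.replace("-", np.nan)
cleaned["sq_ft"] = pd.to_numeric(cleaned["sq_ft"])

print(cleaned["sq_ft"].isnull().tolist())   # [False, True, False]
```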

4. Data selection and slicing

No.  Function                                       Description
1    df[column_name]                                Select the specified column
2    df.loc[row_index, column_name]                 Select data by label
3    df.iloc[row_index, column_index]               Select data by integer position
4    df.ix[row_index, column_name]                  Select data by label or position (removed in pandas 1.0; use df.loc or df.iloc instead)
5    df.filter(items=[column_name1, column_name2])  Select the specified columns
6    df.filter(regex='regex')                       Select columns whose names match a regular expression
7    df.sample(n)                                   Randomly select n rows
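The difference between label-based and position-based selection is easiest to see with a non-numeric index. A sketch on made-up data (the index labels g, r, t and the age_group column are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Google", "Runoob", "Taobao"],
        "age": [50, 40, 23],
        "age_group": ["b", "a", "a"],
    },
    index=["g", "r", "t"],
)

print(df["name"].tolist())    # ['Google', 'Runoob', 'Taobao']
print(df.loc["r", "age"])     # 40, row label "r", column label "age"
print(df.iloc[0, 1])          # 50, first row, second column

# filter() selects columns by name or by pattern.
print(list(df.filter(items=["name"]).columns))   # ['name']
print(list(df.filter(regex="^age").columns))     # ['age', 'age_group']

# random_state makes the random selection repeatable.
picked = df.sample(n=2, random_state=0)
print(len(picked))            # 2
```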

5. Data sorting

No.  Function                                                               Description
1    df.sort_values(column_name)                                            Sort by the values of the specified column
2    df.sort_values([column_name1, column_name2], ascending=[True, False])  Sort by the values of multiple columns
3    df.sort_index()                                                        Sort by index
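A short sketch of the three calls on made-up data. In the multi-column case, age is sorted ascending and ties are broken by name in descending order:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Taobao", "Google", "Runoob", "Baidu"],
    "age": [23, 50, 40, 40],
})

by_age = df.sort_values("age")
print(by_age["age"].tolist())     # [23, 40, 40, 50]

multi = df.sort_values(["age", "name"], ascending=[True, False])
print(multi["name"].tolist())     # ['Taobao', 'Runoob', 'Baidu', 'Google']

# sort_index() restores the original row order here.
restored = by_age.sort_index()
print(list(restored.index))       # [0, 1, 2, 3]
```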

6. Data grouping and aggregation

No.  Function                                         Description
1    df.groupby(column_name)                          Group by the specified column
2    df.aggregate(function_name)                      Aggregate with one or more functions; for grouped data, use df.groupby(column_name).agg(function_name)
3    df.pivot_table(values, index, columns, aggfunc)  Generate a pivot table
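Grouping and aggregation are normally chained. A minimal sketch on made-up sales data (the column names city, year, and sales are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [10, 20, 30, 50],
})

# One aggregate per group.
totals = df.groupby("city")["sales"].sum()
print(totals["A"], totals["B"])    # 30 80

# Several aggregates at once.
stats = df.groupby("city")["sales"].agg(["mean", "max"])
print(stats.loc["B", "mean"])      # 40.0

# Pivot table: one row per city, one column per year.
pivot = df.pivot_table(values="sales", index="city", columns="year", aggfunc="sum")
print(pivot.loc["B", 2023])        # 50
```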

7. Data merging

No.  Function                            Description
1    pd.concat([df1, df2])               Combine multiple DataFrames by row or by column
2    pd.merge(df1, df2, on=column_name)  Merge two DataFrames on the specified column
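As a sketch of the difference: concat() stacks DataFrames, while merge() matches rows by key, much like a SQL join. The two DataFrames below are made up:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Google", "Runoob", "Taobao"]})
right = pd.DataFrame({"id": [2, 3, 4], "age": [40, 23, 30]})

# Stack rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([left, left], ignore_index=True)
print(len(stacked))              # 6

# Inner join by default: only ids present in both frames remain.
merged = pd.merge(left, right, on="id")
print(merged["id"].tolist())     # [2, 3]
print(list(merged.columns))      # ['id', 'name', 'age']
```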

8. Data selection and filtering

No.  Function                              Description
1    df.loc[row_indexer, column_indexer]   Select rows and columns by label
2    df.iloc[row_indexer, column_indexer]  Select rows and columns by position
3    df[df['column_name'] > value]         Select the rows whose column value meets a condition
4    df.query('column_name > value')       Select the rows that meet a condition written as a string expression
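The boolean-mask form and the query() form select the same rows. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob", "Taobao"],
    "age": [50, 40, 23],
})

mask_result = df[df["age"] > 30]
query_result = df.query("age > 30")
print(mask_result["name"].tolist())    # ['Google', 'Runoob']
print(query_result["name"].tolist())   # ['Google', 'Runoob']

# Combine conditions with & and |, wrapping each one in parentheses.
combined = df[(df["age"] > 30) & (df["name"] != "Google")]
print(combined["name"].tolist())       # ['Runoob']
```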

9. Data statistics and description

No.  Function       Description
1    df.describe()  Calculate basic statistics such as mean, standard deviation, minimum, and maximum
2    df.mean()      Calculate the mean of each column
3    df.median()    Calculate the median of each column
4    df.mode()      Calculate the mode of each column
5    df.count()     Count the non-missing values in each column
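A brief sketch of these methods on a made-up numeric DataFrame. Missing values are skipped by default, and mode() returns a DataFrame because a column can have more than one mode:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, np.nan],
    "b": [10, 20, 30, 40],
})

print(df.count().tolist())     # [3, 4]; the NaN in "a" is not counted
print(df.mean()["b"])          # 25.0
print(df.median()["a"])        # 2.0
print(df.mode().loc[0, "a"])   # 2.0
```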



Origin blog.csdn.net/weixin_36815313/article/details/132138467