Python Study Notes Day 57 (Pandas Data Cleaning)

Pandas data cleaning

Data cleaning is the process of fixing or removing data that is unusable for analysis.

Many data sets contain missing values, incorrectly formatted values, wrong values, or duplicate rows. To make data analysis more accurate, these problems need to be handled first.

In this tutorial, we will utilize the Pandas package for data cleaning.
The test data property-data.csv used in this article is as follows:

[Image: table showing the contents of property-data.csv]

The above table contains four types of empty data:

  • n/a
  • NA
  • --
  • na

Pandas cleaning null values

If we want to delete rows containing empty fields, we can use the dropna() method with the following syntax:

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameter Description:

  • axis: The default is 0, which removes rows that contain empty values. With axis=1, columns that contain empty values are removed instead.
  • how: The default is 'any', which removes a row (or column) if any of its values is NA. With how='all', a row (or column) is removed only if all of its values are NA.
  • thresh: The minimum number of non-NA values a row (or column) must have in order to be kept.
  • subset: The columns to check for empty values. To check several columns, pass a list of column names.
  • inplace: If set to True, the source DataFrame is modified directly and None is returned.
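
The parameters above can be tried on a small in-memory table. The following is a minimal sketch; the DataFrame demo and its column names are made up for illustration and are not part of property-data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row table, used only to illustrate the parameters
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
})

any_rows = demo.dropna()                 # how='any': keep rows with no NaN at all
all_rows = demo.dropna(how="all")        # drop only rows where every value is NaN
thresh_rows = demo.dropna(thresh=2)      # keep rows with at least 2 non-NaN values
subset_rows = demo.dropna(subset=["c"])  # only column "c" is checked

print(any_rows.index.tolist())     # [0]
print(all_rows.index.tolist())     # [0, 1, 3]
print(thresh_rows.index.tolist())  # [0, 3]
print(subset_rows.index.tolist())  # [0, 1, 3]
```

Each call returns a new DataFrame, so demo itself keeps all four rows.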

isnull()

We can use isnull() to determine whether each cell is empty.

# Example 1
import pandas as pd
df = pd.read_csv('property-data.csv')
# Print the column, then a Boolean Series that is True where a cell is empty
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
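
Because isnull() returns Boolean values, the result can also be summed to count the empty cells. A minimal sketch, assuming a small made-up DataFrame in place of property-data.csv (the column OTHER is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the CSV file
demo = pd.DataFrame({
    "NUM_BEDROOMS": [3, np.nan, 2, np.nan],
    "OTHER": [1.0, 1.5, np.nan, 2.0],
})

mask = demo["NUM_BEDROOMS"].isnull()      # True where the cell is empty
missing_per_column = demo.isnull().sum()  # True counts as 1 when summed

print(mask.tolist())  # [False, True, False, True]
print(missing_per_column)
```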

In the above example, Pandas treats n/a and NA as empty data, but na is not treated as empty data, which does not meet our requirements. We can specify which strings count as empty data with the na_values parameter:

# Example 2
import pandas as pd
# Extra strings that should be read as empty data
missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values = missing_values)
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
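
The effect of na_values can be checked without the CSV file. The sketch below reads a short, made-up CSV string through io.StringIO; the rows are hypothetical and only mimic the kinds of empty markers discussed above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "PID,NUM_BEDROOMS\n1,3\n2,n/a\n3,na\n4,--\n5,NA\n"

default_df = pd.read_csv(io.StringIO(csv_text))
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

# By default only "n/a" and "NA" are recognized as empty
print(default_df["NUM_BEDROOMS"].isnull().tolist())  # [False, True, False, False, True]
# With na_values, "na" and "--" are recognized as well
print(custom_df["NUM_BEDROOMS"].isnull().tolist())   # [False, True, True, True, True]
```

The strings passed in na_values are added to the default list of empty markers rather than replacing it.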

The following example reads property-data.csv into the variable df with pd.read_csv, calls df.dropna() to remove the rows that contain empty data, and then converts the resulting DataFrame to a string and prints it.

# Example 3
import pandas as pd
df = pd.read_csv('property-data.csv')
# dropna() returns a new DataFrame; df itself is not changed
new_df = df.dropna()
print(new_df.to_string())

Note: By default, the dropna() method returns a new DataFrame and does not modify the source data.
If some rows in property-data.csv contain empty data (one or more columns have null values), those rows are deleted and the new DataFrame (new_df) does not contain them.
By default, dropna() drops every row that contains at least one NaN value, so only rows with no missing values are kept. To drop only the rows in which every value is NaN, use dropna(how='all').
Additionally, the axis parameter controls whether rows or columns are dropped. For example, df.dropna(axis=1) drops the columns that contain null data.
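
The difference between dropping rows and dropping columns, and the fact that the source data is left alone, can be seen in a minimal sketch. The DataFrame demo below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a single empty cell in column "y"
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [np.nan, 5.0, 6.0],
})

rows_dropped = demo.dropna()        # axis=0 (default): row 0 is removed
cols_dropped = demo.dropna(axis=1)  # axis=1: column "y" is removed

print(rows_dropped.shape)             # (2, 2)
print(cols_dropped.columns.tolist())  # ['x']
print(demo.shape)                     # (3, 2), the original is unchanged
```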

If you want to modify the source DataFrame itself, you can use the inplace=True parameter:

# Example 4
import pandas as pd
df = pd.read_csv('property-data.csv')
# Rows with empty data are removed from df directly; the call returns None
df.dropna(inplace = True)
print(df.to_string())

You can also remove rows that have null values in specified columns:

# Example 5
import pandas as pd
df = pd.read_csv('property-data.csv')
# Remove the rows whose ST_NUM field is empty
df.dropna(subset=['ST_NUM'], inplace = True)
print(df.to_string())

You can also use the fillna() method to replace empty fields with a value:

# Example 6
import pandas as pd
df = pd.read_csv('property-data.csv')
# Replace every empty field with 12345
df.fillna(12345, inplace = True)
print(df.to_string())

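You can also replace empty data in a single column:

# Example 7
import pandas as pd
df = pd.read_csv('property-data.csv')
# Replace empty PID values with 12345.
# The result is assigned back, because fillna(inplace=True) on a selected
# column may not update df in recent pandas versions.
df['PID'] = df['PID'].fillna(12345)
print(df.to_string())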
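
When several columns need different replacement values, fillna() also accepts a dictionary that maps column names to values. A minimal sketch on made-up data (the column NOTE is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical table with one empty cell in each column
demo = pd.DataFrame({
    "PID": [100.0, np.nan, 102.0],
    "NOTE": ["a", None, "c"],
})

# Each column gets its own replacement value
filled = demo.fillna({"PID": 12345, "NOTE": "unknown"})
print(filled.to_string())
```

As with dropna(), the call returns a new DataFrame and leaves demo unchanged.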

Pandas replace cells

A common way to replace empty cells is to fill them with the mean, median, or mode of the column.

Pandas provides the mean(), median(), and mode() methods to calculate the mean (the average of all values), the median (the middle value after sorting), and the mode (the value that occurs most often) of a column.
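
The three statistics can be compared on a small made-up Series. This is only an illustration; the numbers are not taken from property-data.csv.

```python
import pandas as pd

# Hypothetical values: 2 occurs twice, 10 is an outlier
values = pd.Series([1, 2, 2, 3, 10])

print(values.mean())           # 3.6, (1 + 2 + 2 + 3 + 10) / 5
print(values.median())         # 2.0, the middle value of the sorted data
print(values.mode().tolist())  # [2], the most frequent value
```

Note that mode() returns a Series rather than a single number, because several values can be tied for the highest frequency.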

mean()

Use the mean() method to calculate the mean of a column and replace empty cells with it:

# Example 8
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].mean()
# Assign the filled column back so that df is updated
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())

median()

Use the median() method to calculate the median of a column and replace empty cells with it:

# Example 9
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].median()
# Assign the filled column back so that df is updated
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())

mode()

Use the mode() method to calculate the mode of a column and replace empty cells with it:

# Example 10
import pandas as pd
df = pd.read_csv('property-data.csv')
# mode() returns a Series (there can be ties), so take its first value
x = df["ST_NUM"].mode()[0]
df["ST_NUM"] = df["ST_NUM"].fillna(x)
print(df.to_string())

Pandas cleans malformed data

Cells with incorrectly formatted data can make data analysis difficult or even impossible.

We can either remove the rows that contain badly formatted cells, or convert all cells in a column to the same format.

The following example formats a date:

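# Example 11
import pandas as pd
# The third date is written in a different format from the first two
data = {
  "Date": ['2020/12/01', '2020/12/02', '20201226'],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
# format='mixed' (pandas 2.0 and later) infers the format of each value separately;
# without it, recent pandas versions raise an error on the third date
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
print(df.to_string())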
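
The other option, removing rows whose cells do not match the expected format, can be sketched with errors='coerce'. This is one possible approach, assuming '%Y/%m/%d' is the format we want to enforce; the variable names are made up.

```python
import pandas as pd

demo = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", "20201226"],
    "duration": [50, 40, 45],
})

# Values that do not match the format become NaT instead of raising an error
demo["Date"] = pd.to_datetime(demo["Date"], format="%Y/%m/%d", errors="coerce")

# NaT counts as missing, so dropna() removes the badly formatted row
cleaned = demo.dropna(subset=["Date"])
print(cleaned.to_string())
```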

Pandas cleans erroneous data

Data errors are also common, and we can replace or remove erroneous data.

The following example replaces an incorrect age value:

# Example 12
import pandas as pd
person = {
  "name": ['Google', 'Baidu', 'Taobao'],
  "age": [50, 40, 12345]    # the age 12345 is wrong
}
df = pd.DataFrame(person)
df.loc[2, 'age'] = 30  # correct the value in row 2
print(df.to_string())

You can also use a condition to set every age greater than 120 to 120:

# Example 13
import pandas as pd
person = {
  "name": ['Google', 'Baidu', 'Taobao'],
  "age": [50, 200, 12345]
}
df = pd.DataFrame(person)
# Check each row and cap the age at 120
for x in df.index:
  if df.loc[x, "age"] > 120:
    df.loc[x, "age"] = 120
print(df.to_string())
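
The same cap can be applied to the whole column at once instead of looping over the rows. A minimal sketch using Series.clip(), shown as an alternative to the loop rather than a replacement for it:

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["Google", "Baidu", "Taobao"],
    "age": [50, 200, 12345],
})

# clip(upper=120) replaces every value above 120 with 120
demo["age"] = demo["age"].clip(upper=120)
print(demo["age"].tolist())  # [50, 120, 120]
```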

You can also delete the rows that contain incorrect data, for example the rows with an age greater than 120:

# Example 14
import pandas as pd
person = {
  "name": ['Google', 'Baidu', 'Taobao'],
  "age": [50, 40, 12345]    # the age 12345 is wrong
}
df = pd.DataFrame(person)
# Drop each row whose age is greater than 120
for x in df.index:
  if df.loc[x, "age"] > 120:
    df.drop(x, inplace = True)
print(df.to_string())
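
Rows can also be filtered with a Boolean condition, which avoids the explicit loop. A minimal sketch of that alternative:

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["Google", "Baidu", "Taobao"],
    "age": [50, 40, 12345],
})

# Keep only the rows whose age is at most 120
valid = demo[demo["age"] <= 120]
print(valid.to_string())
```

Here valid is a new DataFrame with two rows, and demo still contains all three.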

Pandas cleans duplicate data

If we want to clean duplicate data, we can use the duplicated() and drop_duplicates() methods.

duplicated()

duplicated() returns True for each row that repeats an earlier row, and False otherwise.

# Example 15
import pandas as pd
person = {
  "name": ['Google', 'Baidu', 'Baidu', 'Taobao'],
  "age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)
# The third row repeats the second, so only it is marked True
print(df.duplicated())
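
duplicated() also takes the keep and subset parameters, which control which of the repeated rows are marked and which columns are compared. A minimal sketch on the same data:

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["Google", "Baidu", "Baidu", "Taobao"],
    "age": [50, 40, 40, 23],
})

first_kept = demo.duplicated()           # default keep='first'
all_marked = demo.duplicated(keep=False) # mark every member of a duplicate group
by_name_last = demo.duplicated(subset=["name"], keep="last")  # compare "name" only

print(first_kept.tolist())    # [False, False, True, False]
print(all_marked.tolist())    # [False, True, True, False]
print(by_name_last.tolist())  # [False, True, False, False]
```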

drop_duplicates()

To delete duplicate data, you can directly use the drop_duplicates() method.

# Example 16
import pandas as pd

persons = {
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 40, 23]
}

df = pd.DataFrame(persons)

# Remove the repeated row from df directly
df.drop_duplicates(inplace = True)
print(df)
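
drop_duplicates() accepts the same subset and keep parameters as duplicated(). The sketch below uses slightly different, made-up ages so that the two Runoob rows are no longer exact duplicates:

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["Google", "Runoob", "Runoob", "Taobao"],
    "age": [50, 40, 41, 23],
})

# No row is an exact copy of another, so nothing is removed
full = demo.drop_duplicates()

# Compare only "name", keep the last occurrence, and renumber the index
by_name = demo.drop_duplicates(subset=["name"], keep="last", ignore_index=True)

print(len(full))               # 4
print(by_name["age"].tolist()) # [50, 41, 23]
```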

Postscript

Today's topic was data cleaning with Python Pandas. A summary of what was covered:

  1. Pandas data cleaning
  2. Pandas cleaning null values
  3. Pandas replace cells
  4. Pandas cleans malformed data
  5. Pandas cleans erroneous data
  6. Pandas cleans duplicate data

Origin blog.csdn.net/qq_54129105/article/details/132262160