Data cleaning is a critical step in the data analysis process. It involves identifying missing values, duplicate rows, outliers, and incorrect data types. Clean, reliable data is essential for accurate analysis and modeling.
This article will introduce the following 6 frequently used data cleaning operations:
Checking for missing values, checking for duplicate rows, handling outliers, checking the data types of all columns, deleting unnecessary columns, and handling data inconsistencies.
As a first step, let's import the library and dataset.
# Import libraries
import pandas as pd
# Read data from a CSV file
df = pd.read_csv('filename.csv')
Check for missing values
isnull()
This method can be used to detect missing values in a DataFrame or column.
# Check for missing values in the dataframe
df.isnull()
# Check the number of missing values in the dataframe
df.isnull().sum().sort_values(ascending=False)
# Check for missing values in the 'Customer Zipcode' column
df['Customer Zipcode'].isnull().sum()
# Check what percentage of the DataFrame these 3 missing values represent
print(f"3 missing values represents {(df['Customer Zipcode'].isnull().sum() / df.shape[0] * 100).round(4)}% of the rows in our DataFrame.")
There are 3 missing values in the Customer Zipcode column.
dropna()
Any row or column that contains at least one missing value can be dropped.
# Drop all the rows where at least one element is missing
df = df.dropna()
# or df.dropna(axis=0)  (axis=0 for rows, axis=1 for columns)
# Note: inplace=True modifies the DataFrame rather than creating a new one
df.dropna(inplace=True)
# Drop all the columns where at least one element is missing
df.dropna(axis=1, inplace=True)
# Drop rows with missing values in specific columns
df.dropna(subset=['Additional Order items', 'Customer Zipcode'], inplace=True)
fillna()
You can also replace missing values with more appropriate values, such as mean, median, or custom values.
# Fill missing values in the dataset with a specific value
df = df.fillna(0)
# Replace missing values in the dataset with median
df = df.fillna(df.median(numeric_only=True))
# Replace missing values in Order Quantity column with the mean of Order Quantities
df['Order Quantity'] = df['Order Quantity'].fillna(df['Order Quantity'].mean())
Check for duplicate rows
duplicated()
This method can be used to view duplicate rows.
# Check duplicate rows
df.duplicated()
# Check the number of duplicate rows
df.duplicated().sum()
drop_duplicates()
You can use this method to remove duplicate rows.
# Drop duplicate rows (but only keep the first row)
df = df.drop_duplicates(keep='first') #keep='first' / keep='last' / keep=False
# Note: inplace=True modifies the DataFrame rather than creating a new one
df.drop_duplicates(keep='first', inplace=True)
Dealing with outliers
Outliers are extreme values that can significantly affect the analysis. They can be dealt with by removing them or converting them to more suitable values.
describe()
Summary statistics such as the maximum and mean can help us find outliers.
# Get a statistics summary of the dataset
df["Product Price"].describe()
The "max" value is 1999. None of the other values are close to 1999, and the average is 146, so we can conclude that 1999 is an outlier that needs to be dealt with.
Or you can draw a histogram to see the distribution of the data.
# Plot a histogram of prices (requires matplotlib)
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
df["Product Price"].hist(bins=100)
In the histogram, it can be seen that most of the price data are between 0 and 500.
Boxplots are also useful in detecting outliers.
plt.figure(figsize=(6, 4))
df.boxplot(column=['Product Price'])
You can see that the Product Price column has several outlier data points (values higher than about 400).
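The plots above only detect outliers; they do not remove them. As a minimal sketch of the two common follow-up options, dropping or capping values with the IQR rule, assume a hypothetical price column (the numbers below are invented to mirror the 1999 outlier discussed above):

```python
import pandas as pd

# Hypothetical prices including one extreme value
df = pd.DataFrame({"Product Price": [100, 120, 146, 150, 200, 400, 1999]})

# Compute the interquartile range (IQR) bounds
q1 = df["Product Price"].quantile(0.25)
q3 = df["Product Price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop rows that fall outside the bounds
filtered = df[df["Product Price"].between(lower, upper)]

# Option 2: cap values at the bounds instead of dropping rows
capped = df["Product Price"].clip(lower=lower, upper=upper)
```

Whether to drop or cap depends on the analysis: dropping loses whole rows, while capping keeps them but distorts the extreme values.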
Check the data types of the columns
info()
You can view the data types of the columns in the dataset.
# Provide a summary of dataset
df.info()
to_datetime()
method converts the column to a datetime data type.
# Convert data type of Order Date column to date
df["Order Date"] = pd.to_datetime(df["Order Date"])
to_numeric()
Columns can be converted to numeric data types (for example, integer or floating point).
# Convert data type of Order Quantity column to numeric data type
df["Order Quantity"] = pd.to_numeric(df["Order Quantity"])
to_timedelta()
This method converts a column to the timedelta data type; use it when the values represent durations.
# Convert data type of Duration column to timedelta type
df["Duration "] = pd.to_timedelta(df["Duration"])
Remove unnecessary columns
drop()
method is used to delete the specified row or column from the data frame.
# Drop Order Region column
# (axis=0 for rows and axis=1 for columns)
df = df.drop('Order Region', axis=1)
# Drop Order Region column without having to reassign df (using inplace=True)
df.drop('Order Region', axis=1, inplace=True)
# Drop by column number instead of by column label
df = df.drop(df.columns[[0, 1, 3]], axis=1) # df.columns is zero-based
Data Inconsistency Handling
Data inconsistencies may be due to different formats or units. Pandas provides string methods to handle inconsistent data.
str.lower() & str.upper()
These two functions are used to convert all characters in a string to lowercase or uppercase. It helps normalize the case of strings in DataFrame columns.
# Rename column names to lowercase
df.columns = df.columns.str.lower()
# Rename values in Customer Fname column to uppercase
df["Customer Fname"] = df["Customer Fname"].str.upper()
str.strip()
function to remove any extra spaces that may appear at the beginning or end of a string value.
# In Customer Segment column, convert names to lowercase and remove leading/trailing spaces
df['Customer Segment'] = df['Customer Segment'].str.lower().str.strip()
replace()
This function replaces specific values in a DataFrame or column with new values.
# Replace values in dataset
df = df.replace({"CA": "California", "TX": "Texas"})
# Replace values in a specific column
df["Customer Country"] = df["Customer Country"].replace({"United States": "USA", "Puerto Rico": "PR"})
Mapping with a dictionary
You can create a dictionary that maps inconsistent values to their normalized counterparts, then pass it to the replace() function to perform the substitution.
# Replace specific values using mapping
mapping = {'CA': 'California', 'TX': 'Texas'}
df['Customer State'] = df['Customer State'].replace(mapping)
rename()
This function renames columns or index labels of a DataFrame.
# Rename some columns
df.rename(columns={'Customer City': 'Customer_City', 'Customer Fname' : 'Customer_Fname'}, inplace=True)
# Define the new names in a dictionary first, then pass it to rename()
new_names = {'Customer Fname': 'Customer_Firstname'}
df.rename(columns=new_names, inplace=True)
df.head()
Summary
Python pandas includes a rich set of functions and methods to handle missing data, remove duplicate data, and perform other data cleaning operations efficiently.
Using pandas functions, data scientists and data analysts can simplify data cleaning workflows and ensure the quality and integrity of datasets.
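The individual steps above can be chained into one small cleaning pipeline. A sketch using a tiny synthetic DataFrame (the column names mirror the article's examples, but the values are invented for illustration):

```python
import pandas as pd

# Small synthetic dataset standing in for the CSV used in the article
df = pd.DataFrame({
    "Customer Fname": [" mary ", "JOHN", "JOHN", None],
    "Order Quantity": [2, 5, 5, 3],
    "Customer State": ["CA", "TX", "TX", "CA"],
})

# 1. Drop rows with missing values in a key column
df = df.dropna(subset=["Customer Fname"])
# 2. Remove duplicate rows
df = df.drop_duplicates(keep="first")
# 3. Normalize inconsistent strings (trim spaces, unify case)
df["Customer Fname"] = df["Customer Fname"].str.strip().str.upper()
# 4. Map abbreviations to their normalized counterparts
df["Customer State"] = df["Customer State"].replace({"CA": "California", "TX": "Texas"})
```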
https://avoid.overfit.cn/post/d594591441dd47b2b1a6264c1c71368a
Author: Python Fundamentals