An introductory example of data cleaning with Pandas

Data cleaning is a critical step in the data analysis process, which involves identifying missing values, duplicate rows, outliers, and incorrect data types. Access to clean and reliable data is important for accurate analysis and modeling.

This article will introduce the following 6 frequently used data cleaning operations:

1. Check for missing values
2. Check for duplicate rows
3. Handle outliers
4. Check the data types of all columns
5. Delete unnecessary columns
6. Handle data inconsistencies

As a first step, let's import the library and dataset.

 # Import libraries
 import pandas as pd
 
 # Read data from a CSV file
 df = pd.read_csv('filename.csv')
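
If you don't have a CSV at hand, a small hand-built DataFrame lets you follow along with every snippet below. The column names and values here are hypothetical, chosen to match the examples used later in the article:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for filename.csv
df = pd.DataFrame({
    "Order Quantity": [2, 5, np.nan, 3],
    "Customer Zipcode": ["10001", None, "94105", "60601"],
})

# Two missing values in total, one per column
print(df.isnull().sum())
```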

Check for missing values

isnull()

This method can be used to see missing values in a DataFrame or column.

 # Check for missing values in the dataframe
 df.isnull()
 
 # Check the number of missing values in the dataframe
 df.isnull().sum().sort_values(ascending=False)

 # Check for missing values in the 'Customer Zipcode' column
 df['Customer Zipcode'].isnull().sum()
 
 # Check what percentage of the DataFrame these 3 missing values represent
 print(f"3 missing values represent {(df['Customer Zipcode'].isnull().sum() / df.shape[0] * 100).round(4)}% of the rows in our DataFrame.")

There are 3 missing values in the 'Customer Zipcode' column.

dropna()

Any row or column that contains at least one missing value can be dropped.

 # Drop all the rows where at least one element is missing
 df = df.dropna()    
 # or df.dropna(axis=0)  (axis=0 for rows, axis=1 for columns)
 
 # Note: inplace=True modifies the DataFrame rather than creating a new one
 df.dropna(inplace=True)
 
 # Drop all the columns where at least one element is missing
 df.dropna(axis=1, inplace=True)
 
 # Drop rows with missing values in specific columns
 df.dropna(subset=['Additional Order items', 'Customer Zipcode'], inplace=True)

fillna()

You can also replace missing values with more appropriate values, such as the mean, the median, or a custom value.

 # Fill missing values in the dataset with a specific value
 df = df.fillna(0)
 
 # Replace missing values in the dataset with median
 df = df.fillna(df.median(numeric_only=True))  # numeric_only avoids errors on string columns
 
 # Replace missing values in Order Quantity column with the mean of Order Quantities
 df['Order Quantity'] = df['Order Quantity'].fillna(df['Order Quantity'].mean())  # note: mean() needs parentheses
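
A minimal sketch of the mean-fill, using a hypothetical Order Quantity series with one gap:

```python
import numpy as np
import pandas as pd

# Toy data: the mean of the known values [2, 5, 3] is 10/3
qty = pd.Series([2.0, 5.0, np.nan, 3.0], name="Order Quantity")

filled = qty.fillna(qty.mean())  # the NaN is replaced by 10/3
```

Assigning the result back (rather than using inplace=True on the column) avoids pandas' chained-assignment warnings.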

Check for duplicate rows

duplicated()

This method shows which rows are duplicates.

 # Check duplicate rows
 df.duplicated()
 
 # Check the number of duplicate rows
 df.duplicated().sum()

drop_duplicates()

You can use this method to remove duplicate rows.

 # Drop duplicate rows (but only keep the first row)
 df = df.drop_duplicates(keep='first') #keep='first' / keep='last' / keep=False
 
 # Note: inplace=True modifies the DataFrame rather than creating a new one
 df.drop_duplicates(keep='first', inplace=True)
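
The effect of the keep options is easiest to see on a tiny hypothetical frame with one repeated row:

```python
import pandas as pd

# Row 2 is an exact copy of row 0
toy = pd.DataFrame({"city": ["CA", "TX", "CA"], "qty": [1, 2, 1]})

print(toy.duplicated().sum())                    # one duplicate detected

first_kept = toy.drop_duplicates(keep="first")   # keeps rows 0 and 1
all_dropped = toy.drop_duplicates(keep=False)    # drops both copies; only row 1 survives
```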

Dealing with outliers

Outliers are extreme values that can significantly affect the analysis. They can be handled by removing them or by replacing them with more suitable values.

describe()

Summary statistics such as the maximum and the mean can help us spot outliers.

 # Get a statistical summary of the Product Price column
 df["Product Price"].describe()

The "max" value is 1999. None of the other values are close to it, and the mean is 146, so 1999 can be treated as an outlier that needs to be dealt with.

Or you can draw a histogram to see the distribution of the data.

 # matplotlib is needed for plotting
 import matplotlib.pyplot as plt

 plt.figure(figsize=(8, 6))
 df["Product Price"].hist(bins=100)

In the histogram, it can be seen that most of the price data are between 0 and 500.

Boxplots are also useful in detecting outliers.

 plt.figure(figsize=(6, 4))
 df.boxplot(column=['Product Price'])

You can see that the price column has several outlier data points (values higher than 400).
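
Once detected, outliers can be dropped or capped. One common (but not the only) approach is the 1.5×IQR rule; a sketch on hypothetical prices that include the extreme 1999 seen above:

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([100, 120, 146, 150, 180, 1999], name="Product Price")

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop the outliers entirely
kept = prices[(prices >= low) & (prices <= high)]

# Option 2: cap (winsorize) them at the fence instead of dropping
capped = prices.clip(lower=low, upper=high)
```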

Check the data types of the columns

info()

This method lets you view the data types of the columns in the dataset.

 # Provide a summary of dataset
 df.info()

to_datetime()

This method converts a column to the datetime data type.

 # Convert data type of Order Date column to date
 df["Order Date"] = pd.to_datetime(df["Order Date"])

to_numeric()

Columns can be converted to numeric data types (for example, integer or floating point).

 # Convert data type of Order Quantity column to numeric data type
 df["Order Quantity"] = pd.to_numeric(df["Order Quantity"])
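
Real columns often contain entries that cannot be parsed as numbers. Passing errors="coerce" turns those into NaN instead of raising an exception; a sketch on a hypothetical raw column:

```python
import pandas as pd

raw_qty = pd.Series(["3", "7", "n/a", "12"])

# "n/a" becomes NaN rather than raising a ValueError
parsed = pd.to_numeric(raw_qty, errors="coerce")
```

The resulting NaN values can then be handled with the fillna()/dropna() techniques shown earlier.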

to_timedelta()

If a column's values represent durations, this method converts it to the timedelta data type.

 # Convert data type of Duration column to timedelta type
 df["Duration"] = pd.to_timedelta(df["Duration"])

Remove unnecessary columns

drop()

This method deletes the specified rows or columns from the DataFrame.

 # Drop Order Region column
 # (axis=0 for rows and axis=1 for columns)
 df = df.drop('Order Region', axis=1)
 
 # Drop Order Region column without having to reassign df (using inplace=True)
 df.drop('Order Region', axis=1, inplace=True)
 
 # Drop by column number instead of by column label
 df = df.drop(df.columns[[0, 1, 3]], axis=1)  # df.columns is zero-based

Data Inconsistency Handling

Data inconsistencies may be due to different formats or units. Pandas provides string methods to handle inconsistent data.

str.lower() & str.upper()

These two methods convert all characters in a string to lowercase or uppercase, which helps normalize the case of string columns in a DataFrame.

 # Rename column names to lowercase
 df.columns = df.columns.str.lower()

 # Convert values in Customer Fname column to uppercase
 df["Customer Fname"] = df["Customer Fname"].str.upper()

str.strip()

This method removes any extra spaces at the beginning or end of a string value.

 # In Customer Segment column, convert names to lowercase and remove leading/trailing spaces
 df['Customer Segment'] = df['Customer Segment'].str.lower().str.strip()

replace()

This method replaces specific values in a DataFrame or column with new values.

 # Replace values in dataset
 df = df.replace({"CA": "California", "TX": "Texas"})

 # Replace values in a specific column
 df["Customer Country"] = df["Customer Country"].replace({"United States": "USA", "Puerto Rico": "PR"})

Mapping with replace()

There is no separate mapping() method; instead, you build a dictionary that maps inconsistent values to their normalized counterparts and pass it to replace() to perform the substitution.

 # Replace specific values using mapping
 mapping = {'CA': 'California', 'TX': 'Texas'}
 df['Customer State'] = df['Customer State'].replace(mapping)

rename()

This method renames columns or index labels of a DataFrame.

 # Rename some columns
 df.rename(columns={'Customer City': 'Customer_City', 'Customer Fname' : 'Customer_Fname'}, inplace=True)
 # Equivalently, build the mapping dictionary first
 new_names = {'Customer City': 'Customer_City', 'Customer Fname': 'Customer_Fname'}
 df.rename(columns=new_names, inplace=True)
 df.head()

Summary

Python pandas includes a rich set of functions and methods to handle missing data, remove duplicate data, and perform other data cleaning operations efficiently.

Using pandas functions, data scientists and data analysts can simplify data cleaning workflows and ensure the quality and integrity of datasets.
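
As a closing sketch, the operations above can be chained into a small pipeline. The data and column names here are hypothetical, combining several of the problems covered in this article (inconsistent casing, stray whitespace, a duplicate row, and an unparseable quantity):

```python
import pandas as pd

raw = pd.DataFrame({
    "Customer State": ["CA", "TX", "CA", "tx "],
    "Order Quantity": ["2", "5", "2", "bad"],
})

clean = (
    raw
    .assign(**{
        # Normalize case and strip whitespace; coerce bad quantities to NaN
        "Customer State": raw["Customer State"].str.strip().str.upper(),
        "Order Quantity": pd.to_numeric(raw["Order Quantity"], errors="coerce"),
    })
    .drop_duplicates()                        # row 2 duplicates row 0
    .dropna(subset=["Order Quantity"])        # drops the "bad" row
    .replace({"CA": "California", "TX": "Texas"})
    .rename(columns={"Customer State": "Customer_State"})
)
```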

https://avoid.overfit.cn/post/d594591441dd47b2b1a6264c1c71368a

Author: Python Fundamentals

Origin blog.csdn.net/m0_46510245/article/details/132291769