Plain and Simple: The Art and Practice of Data Cleaning

What is data cleaning?

Data cleaning, also known as data cleansing, is the process of detecting and correcting (or removing) dirty data, i.e., errors, in a data set. Dirty data can be incomplete, incorrect, inaccurate, or data that fails to conform to predefined rules.

Why is data cleaning needed?

In machine learning and data science, there is an oft-quoted rule: "garbage in, garbage out". Even if we use state-of-the-art algorithms, if the input data is of poor quality, the results will not be very good. In fact, many data scientists consider data cleaning to be the most important step in the entire data processing pipeline.

Now, let us explore the process of data cleaning in detail through the following key steps.

1. Deduplication

Duplicate data can cause our understanding of the data to deviate from reality, especially when doing descriptive statistics or data modeling. In Python, we can use pandas' duplicated() and drop_duplicates() methods to check for and remove duplicate rows.

import pandas as pd

# Suppose we have a DataFrame named df
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7], 
                   'B': ['a', 'b', 'b', 'c', 'd', 'e', 'e', 'e', 'f', 'g', 'g']})

# Check for duplicate rows
print(df.duplicated())

# Drop duplicate rows
df = df.drop_duplicates()
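By default, drop_duplicates() compares all columns and keeps the first occurrence. As a small sketch with made-up data, the `subset` and `keep` parameters let us control both behaviors:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3],
                   'B': ['a', 'b', 'x', 'c']})

# keep=False marks every member of a duplicate group, not just the later ones
all_dupes = df.duplicated(subset=['A'], keep=False)

# Keep the last occurrence within each duplicate group of column A
df_last = df.drop_duplicates(subset=['A'], keep='last')
```

Restricting the comparison to a subset of columns is useful when, for example, an ID column should be unique even though the other columns differ.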

2. Handling missing values

Missing values in data can arise for various reasons, such as errors during data collection or observations that were simply never recorded. There are many ways to deal with them, such as deleting rows or columns that contain missing values, or imputing them. Which method to choose depends on the specific situation: how many values are missing, why they are missing, and so on.

In Python, we can check for missing values with pandas' isnull() method, drop rows or columns containing them with dropna(), or impute them with fillna().

import numpy as np

# Suppose we have a DataFrame df that contains missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5, np.nan, 7, 8], 
                   'B': ['a', 'b', np.nan, 'd', 'e', 'f', 'g', np.nan]})

# Check for missing values
print(df.isnull())

# Drop rows that contain missing values
df_dropna = df.dropna()

# Fill missing values with a constant, e.g. 0
df_fillna = df.fillna(0)

# Fill missing values in numeric columns with the column mean
# (taking the mean of a string column like B would raise an error)
for column in df.select_dtypes(include='number').columns:
    df[column] = df[column].fillna(df[column].mean())
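Before choosing a strategy, it helps to count how many values are missing in each column. A minimal sketch with made-up data, filling numeric and text columns with different strategies:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': ['a', np.nan, np.nan, 'd']})

# Count missing values per column to decide on a strategy
missing_counts = df.isnull().sum()
print(missing_counts)

# Fill each column differently: the mean for numeric, a placeholder for text
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna('missing')
```

If a column is mostly missing, imputation may do more harm than good; dropping the column entirely can be the better choice.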

3. Detect and deal with outliers

Outliers are values that lie far from the other observations. They can arise for various reasons, such as data entry errors or measurement errors, and they can distort the results of our analysis, so they need to be dealt with.

When dealing with outliers, we first need to decide when a value should be considered an outlier. This usually requires some domain knowledge, or it can be determined through exploratory analysis of the data. A common approach is to use boxplots (or the interquartile range) to identify outliers.

import matplotlib.pyplot as plt

# Suppose we have a DataFrame df with a single numeric column A
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]})

# Use a boxplot to spot outliers visually
plt.boxplot(df['A'])
plt.show()

# Compute the interquartile range
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1

# Define outliers as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
outliers = df[(df['A'] < Q1 - 1.5*IQR) | (df['A'] > Q3 + 1.5*IQR)]

After finding outliers, we can deal with them as the situation warrants, for example by correcting or deleting them.
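As a sketch of two common options, using the same IQR fences computed above: we can either drop the offending rows or clip (winsorize) values to the fence boundaries:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]})

Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Option 1: drop the outlier rows entirely
df_dropped = df[(df['A'] >= lower) & (df['A'] <= upper)]

# Option 2: clip (winsorize) values to the fence boundaries
df_clipped = df['A'].clip(lower, upper)
```

Clipping keeps the row count unchanged, which matters when the data is paired with other columns we do not want to lose.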

4. Data type conversion

Another important task of data cleaning is to ensure that the data is of the correct data type. For example, categorical variables may be misidentified as numbers, dates and times may be stored as strings, etc. In Python, we can use the astype() function of pandas to convert data types.

# Suppose we have a DataFrame df with a string column A
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['a', 'b', 'c']})

# Convert column A to integer type
df['A'] = df['A'].astype(int)
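Note that astype(int) raises an error if any value cannot be converted. For messier columns, pandas' to_numeric() and to_datetime() with errors='coerce' turn unparseable entries into NaN/NaT instead; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', 'oops'],
                   'date': ['2023-01-01', '2023-02-15', 'not a date']})

# astype(int) would raise on 'oops'; to_numeric can coerce bad values to NaN
df['A'] = pd.to_numeric(df['A'], errors='coerce')

# Parse date strings, turning unparseable entries into NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce')
```

The coerced NaN/NaT entries can then be handled with the missing-value techniques from step 2.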

Overall, data cleaning is a complex task that requires a comprehensive understanding and exploration of the data. Although it may seem tedious at times, good data cleaning can greatly improve the performance of our models and the accuracy of our analysis results.

5. Working with text and string data

Text data often requires special preprocessing steps. For example, we may need to convert text to lowercase, remove punctuation or other non-alphabetic characters, remove stop words (words such as "the", "a", and "is" that carry little meaning in most contexts), and apply stemming or lemmatization.

In Python, we can use the string methods of the standard library, or we can use more specialized libraries such as NLTK, spaCy, etc. for text processing.

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# The stopword list must be downloaded once before first use
nltk.download('stopwords')

# Suppose we have a text string s
s = "The quick brown fox jumps over the lazy dog."

# Convert to lowercase
s = s.lower()

# Remove punctuation
s = s.translate(str.maketrans('', '', string.punctuation))

# Tokenize
tokens = s.split()

# Remove stop words
tokens = [token for token in tokens if token not in stopwords.words('english')]

# Stemming
stemmer = PorterStemmer()
tokens = [stemmer.stem(token) for token in tokens]

# Result
print(tokens)
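When text lives in a DataFrame column rather than a single string, pandas' vectorized .str methods can apply the same kind of cleaning to every row at once. A minimal sketch (the regular expression here simply strips punctuation):

```python
import pandas as pd

s = pd.Series([' Hello, World! ', 'DATA  Cleaning', None])

# Vectorized string cleaning: strip whitespace, lowercase, drop punctuation
cleaned = (s.str.strip()
            .str.lower()
            .str.replace(r'[^\w\s]', '', regex=True))
```

The .str methods pass missing values (None/NaN) through untouched, so they compose safely with the missing-value handling from step 2.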

Conclusion

Data cleaning is a critical step in data analysis and is crucial to the success of the entire project. Although it may take a great deal of time and effort, clean, well-organized data greatly improves the efficiency of subsequent analysis and the accuracy of the results. I hope this article helps you understand the importance of data cleaning and how to do basic data cleaning in Python.


Origin blog.csdn.net/a871923942/article/details/131418198