[Python Treasure Box] The art of data cleaning: Python libraries that help you polish data to perfection

Data cleaning and preprocessing: Python libraries revealed

Preface

In the field of data science, data cleaning and preprocessing are critical steps in building reliable models. This article takes an in-depth look at a series of powerful Python libraries that play an important role in handling duplicate data, string matching, data wrangling, and dimensionality reduction. By learning these libraries, data scientists can improve data quality more efficiently and lay a solid foundation for in-depth analysis and modeling.

1. Dedupe library

1.1 Deduplication method

The Dedupe library is a Python library for identifying and removing duplicate records. It computes the similarity between records and uses a clustering step to group similar records together, achieving data deduplication. Note that a full Dedupe workflow also involves labeling a few candidate pairs and training the model. Here is a simple example:

import dedupe

# Sample data: Dedupe expects a mapping of record IDs to records
data = {
    0: {'name': 'John Doe', 'email': 'john@example.com'},
    1: {'name': 'Jane Doe', 'email': 'jane@example.com'},
    2: {'name': 'John Doe', 'email': 'john@example.com'},
}

# Field definitions describing how records should be compared
fields = [{'field': 'name', 'type': 'String'},
          {'field': 'email', 'type': 'String'}]

# Initialize Dedupe
deduper = dedupe.Dedupe(fields)

# Sample candidate record pairs for training
# (method names follow the Dedupe 1.x API: sample / consoleLabel / match)
deduper.sample(data)

# Label a few candidate pairs interactively, then train the model
dedupe.consoleLabel(deduper)
deduper.train()

# Find similar records and group them into clusters
clustered_data = deduper.match(data, threshold=0.5)

print(clustered_data)

1.2 Data cleaning technology

In addition to deduplication itself, a Dedupe workflow usually includes a data cleaning step: field values are normalized with a preprocessing function before the records are handed to Dedupe, which makes the similarity comparisons more reliable. Here's a simple example:

import dedupe

# Sample data with inconsistent formatting
data = {
    0: {'name': ' John  Doe ', 'email': 'John@Example.com'},
    1: {'name': 'Jane Doe', 'email': 'jane@example.com'},
    2: {'name': 'john doe', 'email': 'john@example.com'},
}

# Field definitions describing how records should be compared
fields = [{'field': 'name', 'type': 'String'},
          {'field': 'email', 'type': 'String'}]

# A simple cleaning function that lowercases values and collapses whitespace
def pre_process(value):
    return ' '.join(value.strip().lower().split())

# Apply the cleaning step to every field of every record
cleaned_data = {
    record_id: {field: pre_process(value) for field, value in record.items()}
    for record_id, record in data.items()
}

# Initialize Dedupe and run the usual workflow on the cleaned data
deduper = dedupe.Dedupe(fields)
deduper.sample(cleaned_data)
dedupe.consoleLabel(deduper)
deduper.train()

# Find similar records and deduplicate
clustered_data = deduper.match(cleaned_data, threshold=0.5)

print(clustered_data)

To further explore the capabilities of the Dedupe library, we can learn more about its clustering configuration and how it handles large amounts of data.

1.3 Clustering configuration

In Dedupe, clustering behavior can be tuned to suit different kinds of data and requirements, for example by setting the similarity threshold used when matching and the number of CPU cores used for comparisons. The following sample code shows how to configure these parameters:

import dedupe

# Sample data
data = {
    0: {'name': 'John Doe', 'email': 'john@example.com'},
    1: {'name': 'Jane Doe', 'email': 'jane@example.com'},
    2: {'name': 'John Doe', 'email': 'john@example.com'},
}

# Field definitions describing how records should be compared
fields = [{'field': 'name', 'type': 'String'},
          {'field': 'email', 'type': 'String'}]

# Initialize Dedupe; num_cores sets how many cores are used for comparisons
deduper = dedupe.Dedupe(fields, num_cores=2)

# Sample pairs, label them, and train, as in section 1.1
deduper.sample(data)
dedupe.consoleLabel(deduper)
deduper.train()

# The threshold passed to match() decides how similar records must be
# before they are clustered together as duplicates
clustered_data = deduper.match(data, threshold=0.5)

print(clustered_data)

In this example, the threshold argument passed to match() defines the similarity level above which records are considered duplicates and clustered together. The num_cores parameter of the Dedupe constructor specifies how many cores to use for parallel processing, which helps speed up work on large-scale data.

1.4 Processing large-scale data

When processing large-scale data, Dedupe relies on blocking: rather than comparing every possible pair of records, it only compares records that share a blocking key learned during training. Combined with sampling for training and multi-core comparisons, this keeps deduplication tractable on large datasets. Here is an example:

import dedupe

# Simulate a larger dataset as a mapping of record IDs to records
data = {
    i: {'name': f'John Doe {i}', 'email': f'john{i}@example.com'}
    for i in range(100000)
}

# Field definitions describing how records should be compared
fields = [{'field': 'name', 'type': 'String'},
          {'field': 'email', 'type': 'String'}]

# Initialize Dedupe with several cores for parallel comparisons
deduper = dedupe.Dedupe(fields, num_cores=4)

# Sample a limited number of candidate pairs instead of enumerating all pairs
deduper.sample(data, 15000)

# Label pairs and train as in section 1.1; training also learns blocking rules
dedupe.consoleLabel(deduper)
deduper.train()

# Blocking keeps match() from comparing every possible pair of records
clustered_data = deduper.match(data, threshold=0.5)

print(clustered_data)

This example builds the simulated dataset with a dict comprehension and relies on Dedupe's pair sampling and learned blocking rules to keep the number of comparisons manageable, while num_cores parallelizes the comparisons across CPU cores. For datasets that do not fit in memory, the Dedupe project also documents database-backed workflows.

The sections above take a deeper look at the Dedupe library, from clustering configuration to large-scale data processing, both important aspects of data deduplication and cleaning.

2. FuzzyWuzzy library

2.1 Fuzzy matching algorithm

The FuzzyWuzzy library provides a variety of fuzzy matching algorithms, the most commonly used of which are ratio and token_sort_ratio. Here's a simple demonstration:

from fuzzywuzzy import fuzz

# Sample data
string1 = "Hello World"
string2 = "Halo Wold"

# Compute similarity with ratio
ratio = fuzz.ratio(string1, string2)
print(f"Fuzzy Ratio: {ratio}")

# Compute similarity with token_sort_ratio (word order is ignored)
token_ratio = fuzz.token_sort_ratio(string1, string2)
print(f"Token Sort Ratio: {token_ratio}")

2.2 String similarity calculation

In addition to basic similarity calculation, FuzzyWuzzy provides other methods, such as partial_ratio and partial_token_sort_ratio, for more flexible string similarity calculation.

from fuzzywuzzy import fuzz

# Sample data
string1 = "Hello World"
string2 = "Hello Python World"

# Compute similarity with partial_ratio (best partial match)
partial_ratio = fuzz.partial_ratio(string1, string2)
print(f"Partial Ratio: {partial_ratio}")

# Compute similarity with partial_token_sort_ratio
partial_token_ratio = fuzz.partial_token_sort_ratio(string1, string2)
print(f"Partial Token Sort Ratio: {partial_token_ratio}")

2.3 Application cases

FuzzyWuzzy's fuzzy matching is widely used for text matching, string similarity comparison, and similar scenarios. Here is a simple case that finds the best match for a query string within a list of candidates:

from fuzzywuzzy import process

# Sample data
choices = ['apple', 'banana', 'orange', 'kiwi']
query = 'kiwi fruit'

# Use the process module for fuzzy matching
best_match = process.extractOne(query, choices)

print(f"Best Match: {best_match}")

These FuzzyWuzzy functions are powerful tools for handling string similarity and fuzzy matching. The next subsections dive into more advanced uses, such as processing large amounts of data and optimizing match results.

2.4 Large-scale data processing

When large-scale data needs to be processed, matching can be sped up by splitting the candidate list into chunks and scoring each chunk in a separate process. FuzzyWuzzy itself does not parallelize extractOne(), but it combines well with Python's multiprocessing. Here is an example:

from fuzzywuzzy import fuzz, process
from multiprocessing import Pool

# Sample data (simulating large-scale data)
choices = [f'Target String {i}' for i in range(100000)]

# Sample query
query = 'Target String 5678'

# Find the best match inside one chunk of candidates
def best_in_chunk(chunk):
    return process.extractOne(query, chunk, scorer=fuzz.ratio)

if __name__ == '__main__':
    # Split the candidates into chunks and score each chunk in a separate process
    chunks = [choices[i:i + 10000] for i in range(0, len(choices), 10000)]

    with Pool() as pool:
        chunk_results = pool.map(best_in_chunk, chunks)

    # Pick the overall best match across all chunks
    best_match = max(chunk_results, key=lambda match: match[1])

    print(f"Best Match: {best_match}")

This example demonstrates how to use Python's multiprocessing.Pool to parallelize the fuzzy matching of large-scale data by splitting the candidates into chunks.

2.5 Result optimization and threshold setting

FuzzyWuzzy matching results can be optimized by setting a threshold so that only matches above a certain similarity are retained. Here's an example showing how to filter matching results based on a threshold:

from fuzzywuzzy import fuzz, process

# Sample data
choices = ['apple', 'banana', 'orange', 'kiwi']
query = 'kiwi fruit'

# Set a threshold
threshold = 60

# Run fuzzy matching, then keep only results at or above the threshold
matches = process.extract(query, choices, scorer=fuzz.ratio)
filtered_matches = [match for match in matches if match[1] >= threshold]

print(f"Filtered Matches: {filtered_matches}")

In this example, a threshold of 60 is used so that only matches whose similarity score reaches the threshold are kept.
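
As an alternative to filtering after the fact, a minimal sketch below uses extractOne's score_cutoff parameter, which (to the best of my knowledge) makes the call return None when no candidate reaches the cutoff:

from fuzzywuzzy import fuzz, process

choices = ['apple', 'banana', 'orange', 'kiwi']
query = 'kiwi fruit'

# score_cutoff returns None unless some candidate scores at least 60
best_match = process.extractOne(query, choices, scorer=fuzz.ratio, score_cutoff=60)

print(f"Best Match above cutoff: {best_match}")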

These methods allow you to more deeply utilize the FuzzyWuzzy library to handle different needs, process large-scale data, optimize matching results, and set thresholds to filter matches as needed.

3. PyJanitor library

3.1 Data sorting and cleaning tools

The PyJanitor library provides a series of tools for data organization and cleaning, making data processing easier. Here is an example of column name normalization using PyJanitor:

import pandas as pd
import janitor

# Sample data
data = {'Column 1': [1, 2, 3], 'Column 2': [4, 5, 6]}

# Create a DataFrame
df = pd.DataFrame(data)

# Standardize column names with PyJanitor
df_cleaned = df.clean_names()

print(df_cleaned)

3.2 Column name standardization

PyJanitor's clean_names method standardizes DataFrame column names, lowercasing them and replacing spaces with underscores, which improves the consistency of downstream processing.

import pandas as pd
import janitor

# Sample data
data = {'First Name': ['John', 'Jane', 'Jim'], 'Last Name': ['Doe', 'Smith', 'Brown']}

# Create a DataFrame
df = pd.DataFrame(data)

# Standardize column names with PyJanitor
df_cleaned = df.clean_names()

print(df_cleaned)

3.3 Data format conversion technology

PyJanitor also provides methods for data format conversion, such as convert_excel_date for converting Excel serial dates into Python datetime objects.

import pandas as pd
import janitor

# Sample data (Excel serial date numbers)
data = {'Date': [44271, 44272, 44273], 'Value': [10, 15, 20]}

# Create a DataFrame
df = pd.DataFrame(data)

# Convert the Excel serial dates with PyJanitor
df_converted = df.convert_excel_date('Date')

print(df_converted)

PyJanitor's features make data processing noticeably more efficient. The next subsections explore a few more data cleaning techniques and how they are applied.

3.4 Missing value processing

PyJanitor provides convenient methods for handling missing values, such as the fill_empty method, which fills null values in the specified columns.

import pandas as pd
import janitor

# Sample data
data = {'A': [1, None, 3], 'B': [4, 5, None]}

# Create a DataFrame
df = pd.DataFrame(data)

# Fill null values with PyJanitor
# (older pyjanitor versions name this parameter `columns` instead of `column_names`)
df_filled = df.fill_empty(column_names=['A', 'B'], value=0)

print(df_filled)

This example shows how fill_empty fills the null values of the specified columns in a DataFrame with a given value.

3.5 Multi-table join and merge

Because PyJanitor methods chain directly on pandas DataFrames, multi-table joins and merges fit naturally into the same cleaning workflow, for example via pandas' merge method.

import pandas as pd
import janitor

# Sample data
data1 = {'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']}
data2 = {'ID': [2, 3, 4], 'Age': [25, 30, 35]}

# Create DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Join the tables on the ID column with an inner join
df_merged = df1.merge(df2, on='ID', how='inner')

print(df_merged)

This example joins the two DataFrames on the ID column, keeping only the rows present in both tables (an inner join).

These tips can help you apply PyJanitor more flexibly, handle missing values, perform table joins and other operations, making data cleaning and organization more efficient.

4. Pandas library

4.1 Data frame processing

Pandas is a powerful data analysis library that provides DataFrame objects for processing and analyzing data. Here is a simple Pandas example demonstrating how to use a DataFrame to process data:

import pandas as pd

# Sample data
data = {'Name': ['John', 'Jane', 'Jim'],
        'Age': [25, 30, 22],
        'Salary': [50000, 60000, 45000]}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print("Original data:")
print(df)

# Access a column
ages = df['Age']
print("\nAge column:")
print(ages)

# Compute the average salary
average_salary = df['Salary'].mean()
print("\nAverage salary:", average_salary)

4.2 Missing value processing

Pandas provides methods for handling missing values, such as dropna() for dropping rows that contain missing values and fillna() for filling them in. Here's a simple demonstration:

import pandas as pd

# Sample data
data = {'Name': ['John', 'Jane', None],
        'Age': [25, None, 22],
        'Salary': [50000, 60000, 45000]}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print("Original data:")
print(df)

# Drop rows containing missing values
df_cleaned = df.dropna()

print("\nAfter dropping missing values:")
print(df_cleaned)

# Fill missing values
df_filled = df.fillna(value={'Name': 'Unknown', 'Age': df['Age'].mean()})

print("\nAfter filling missing values:")
print(df_filled)

4.3 Data merging and connection technology

Pandas provides a variety of data merging and joining methods, such as merge() for joining on columns and concat() for concatenating DataFrames along an axis (a concat() sketch follows the merge example). Here's a simple example:

import pandas as pd

# Sample data
data1 = {'ID': [1, 2, 3], 'Name': ['John', 'Jane', 'Jim']}
data2 = {'ID': [2, 3, 4], 'Salary': [60000, 45000, 70000]}

# Create two DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Column-based merge with an inner join
merged_df = pd.merge(df1, df2, on='ID', how='inner')

print("Merged data:")
print(merged_df)
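
The paragraph above also mentions concat(); here is a minimal sketch of it on the same two DataFrames. Unlike merge(), concat() simply stacks data along an axis, aligning rows on the index when concatenating column-wise:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['John', 'Jane', 'Jim']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 45000, 70000]})

# Stack the two DataFrames vertically (row-wise concatenation)
stacked = pd.concat([df1, df2], ignore_index=True)

# Place them side by side (column-wise concatenation, aligned on the index)
side_by_side = pd.concat([df1, df2], axis=1)

print("Row-wise concat:")
print(stacked)
print("\nColumn-wise concat:")
print(side_by_side)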

4.4 Data grouping and aggregation operations

Pandas' grouping and aggregation operations make statistics and analysis of data more convenient. Here's a simple demonstration:

import pandas as pd

# Sample data
data = {'Department': ['HR', 'IT', 'HR', 'IT', 'IT'],
        'Salary': [50000, 60000, 55000, 65000, 70000]}

# Create a DataFrame
df = pd.DataFrame(data)

# Group by department and compute the average salary
average_salary_by_department = df.groupby('Department')['Salary'].mean()

print("Average salary by department:")
print(average_salary_by_department)
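
Building on the example above, groupby can also apply several aggregations at once via agg(); a small sketch on the same data:

import pandas as pd

data = {'Department': ['HR', 'IT', 'HR', 'IT', 'IT'],
        'Salary': [50000, 60000, 55000, 65000, 70000]}
df = pd.DataFrame(data)

# Compute several statistics per department in one call
salary_stats = df.groupby('Department')['Salary'].agg(['mean', 'max', 'count'])

print(salary_stats)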

Pandas is indeed a very powerful data processing tool. The next subsections explore a few more advanced features and broader application scenarios.

4.5 Time series processing

Pandas is very convenient for working with time series data. For example, to_datetime() converts strings into datetime objects, and resample() resamples a time-indexed series to a new frequency. Here is a simple time series example:

import pandas as pd

# Sample time series data
dates = ['2023-01-01', '2023-01-02', '2023-01-03']
values = [100, 120, 90]

# Create a time series DataFrame
time_series = pd.DataFrame({'Date': dates, 'Value': values})

# Convert the date column to datetime objects
time_series['Date'] = pd.to_datetime(time_series['Date'])

# Use the date column as the index
time_series.set_index('Date', inplace=True)

# Resample to daily frequency and compute the mean
daily_mean = time_series.resample('D').mean()

print("Daily mean:")
print(daily_mean)

4.6 Pivot tables and crosstabs

Pandas can easily create pivot tables and crosstabs for data analysis and summary. Here's a simple example:

import pandas as pd

# Sample data
data = {'Department': ['HR', 'IT', 'HR', 'IT', 'IT'],
        'Gender': ['M', 'F', 'M', 'M', 'F'],
        'Salary': [50000, 60000, 55000, 65000, 70000]}

# Create a DataFrame
df = pd.DataFrame(data)

# Create a pivot table
pivot_table = pd.pivot_table(df, values='Salary', index='Department',
                             columns='Gender', aggfunc='mean')

print("Pivot table:")
print(pivot_table)

This example shows how pivot_table() creates a simple pivot table summarizing average salary by department and gender.
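
The heading also mentions crosstabs; a minimal sketch using pd.crosstab to count the Department/Gender combinations in the same data:

import pandas as pd

data = {'Department': ['HR', 'IT', 'HR', 'IT', 'IT'],
        'Gender': ['M', 'F', 'M', 'M', 'F'],
        'Salary': [50000, 60000, 55000, 65000, 70000]}
df = pd.DataFrame(data)

# Count how many rows fall into each Department/Gender combination
cross_tab = pd.crosstab(df['Department'], df['Gender'])

print(cross_tab)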

These advanced functions and broader application scenarios can help you more comprehensively understand the power of the Pandas library, from time series processing to pivot table applications, and expand your understanding of the multiple possibilities of data analysis and processing.

5. NumPy library

5.1 Array operations and processing

NumPy is a basic library for scientific computing, providing powerful array operation functions. Here is a simple NumPy example:

import numpy as np

# Create a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Array arithmetic
arr_squared = arr ** 2

print("Original array:", arr)
print("Squared array:", arr_squared)

5.2 Mathematical functions and statistical methods

NumPy contains a wealth of mathematical functions and statistical methods, such as mean() for the average and std() for the standard deviation. Here's a simple demonstration:

import numpy as np

# Create a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Compute the mean and standard deviation
average_value = np.mean(arr)
std_deviation = np.std(arr)

print("Array:", arr)
print("Mean:", average_value)
print("Standard deviation:", std_deviation)

5.3 Linear algebra operations

NumPy provides a rich set of linear algebra operations, such as dot() for matrix multiplication. Here's a simple example:

import numpy as np

# Create two matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Matrix multiplication
result_matrix = np.dot(matrix1, matrix2)

print("Matrix 1:")
print(matrix1)
print("\nMatrix 2:")
print(matrix2)
print("\nMatrix product:")
print(result_matrix)
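
Beyond dot(), NumPy's linalg submodule covers many common linear-algebra routines; a small sketch solving a linear system with np.linalg.solve:

import numpy as np

# Solve the linear system A x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)

print("Solution x:", x)
print("Check A @ x:", A @ x)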

5.4 Data type conversion technology

NumPy allows data type conversion, such as converting an array of integers to an array of floating point numbers. Here's a simple demonstration:

import numpy as np

# Create an integer array
int_array = np.array([1, 2, 3, 4, 5])

# Convert the integer array to a float array
float_array = int_array.astype(float)

print("Integer array:", int_array)
print("Float array:", float_array)

These basic NumPy functions are very useful for scientific computing and data processing. The next subsections look at a few more advanced features and practical scenarios.

5.5 Random number generation

NumPy has rich built-in random number generation, such as np.random.rand() for generating arrays of uniformly distributed values. Here's a simple example:

import numpy as np

# Generate a random array
random_array = np.random.rand(5)

print("Random array:", random_array)

5.6 Data slicing and indexing techniques

NumPy allows flexible data slicing and indexing operations to obtain specific parts of the data. Here's a simple demonstration:

import numpy as np

# Create a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Slice the array
slice_arr = arr[2:4]

print("Original array:", arr)
print("Slice:", slice_arr)

5.7 Array operations and broadcasting

NumPy's broadcasting mechanism performs calculations on arrays of different shapes by stretching the smaller array across the larger one, making operations more flexible. Here's a simple example:

import numpy as np

# Create NumPy arrays with different shapes
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([10, 20, 30])

# Broadcasting applies arr2 to every row of arr1
result = arr1 * arr2

print("Array 1:")
print(arr1)
print("\nArray 2:")
print(arr2)
print("\nBroadcast result:")
print(result)

These advanced functions and practical scenarios, from random number generation to slicing and broadcasting, help you understand and use the NumPy library more comprehensively and expand what you can do in scientific computing and data processing.

6. Scikit-learn library

6.1 Feature scaling and standardization

Scikit-learn provides methods for feature scaling and normalization, such as MinMaxScaler and StandardScaler. Here's a simple demonstration:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])

# Feature scaling with MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaled_data = minmax_scaler.fit_transform(data)

# Standardization with StandardScaler
standard_scaler = StandardScaler()
standard_scaled_data = standard_scaler.fit_transform(data)

print("Original data:")
print(data)

print("\nMinMax-scaled data:")
print(minmax_scaled_data)

print("\nStandardized data:")
print(standard_scaled_data)

6.2 Outlier detection

Scikit-learn's IsolationForest model can be used to detect outliers in the data. Here's a simple demonstration:

from sklearn.ensemble import IsolationForest
import numpy as np

# Sample data containing one outlier
data = np.array([[1], [2], [3], [100]])

# Create an IsolationForest model
isolation_forest = IsolationForest(contamination=0.25)

# Fit the model and predict outliers (1 = inlier, -1 = outlier)
outliers = isolation_forest.fit_predict(data)

print("Original data:")
print(data)

print("\nOutlier predictions:")
print(outliers)
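
A common follow-up to the prediction above is to drop the flagged rows; a small sketch that keeps only the inliers, using the 1/-1 labels returned by fit_predict:

from sklearn.ensemble import IsolationForest
import numpy as np

data = np.array([[1], [2], [3], [100]])

isolation_forest = IsolationForest(contamination=0.25, random_state=42)
labels = isolation_forest.fit_predict(data)

# Keep only the rows labeled as inliers (1)
inliers = data[labels == 1]

print("Inliers:")
print(inliers)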

6.3 Data dimensionality reduction technology

Scikit-learn provides a variety of data dimensionality reduction methods, such as principal component analysis (PCA). Here is a simple PCA demonstration:

from sklearn.decomposition import PCA
import numpy as np

# Sample data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Create a PCA model that reduces the data to 2 dimensions
pca = PCA(n_components=2)

# Fit the model and transform the data
reduced_data = pca.fit_transform(data)

print("Original data:")
print(data)

print("\nReduced data:")
print(reduced_data)

These Scikit-learn utilities cover feature scaling, outlier detection, and dimensionality reduction. Together with Dedupe, FuzzyWuzzy, PyJanitor, Pandas, and NumPy, the sample code above spans many aspects of data cleaning and preprocessing, giving data scientists a rich toolbox for preparing high-quality data for analysis and modeling. The remaining subsections turn to model training, evaluation, and hyperparameter tuning.

6.4 Model training and evaluation

Scikit-learn provides a variety of machine learning models, such as LinearRegression and DecisionTreeClassifier, as well as utilities for model evaluation, such as cross_val_score (demonstrated after the example below). Here is a simple linear regression example:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on the test set
predictions = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)

print("Predictions:", predictions)
print("\nMean squared error:", mse)

6.5 Hyperparameter tuning

Scikit-learn can help you tune the hyperparameters of your model through grid search (GridSearchCV) or random search (RandomizedSearchCV). Here's a simple demonstration:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Create a random forest regressor
model = RandomForestRegressor()

# Define the hyperparameter grid
param_grid = {'n_estimators': [10, 50, 100],
              'max_depth': [None, 5, 10]}

# Create the grid search object
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)

# Fit the model
grid_search.fit(X, y)

# Retrieve the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best parameters:", best_params)
print("\nBest score:", best_score)

These techniques help you train models, evaluate their performance, and improve them through hyperparameter tuning, an important step in any machine learning workflow.

Summary

In the journey of data science, data cleaning and preprocessing are key steps toward efficient models and accurate analysis. We took a deep dive into Python libraries such as Dedupe, FuzzyWuzzy, PyJanitor, Pandas, NumPy, and Scikit-learn, which give data scientists powerful tools for cleaning, organizing, and analyzing data. By mastering these libraries, you will be able to face complex data challenges with ease and contribute to the development of the field of data science.
