Kaggle Data Cleaning Challenge Day 5 - Handling Inconsistent Data Entry

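Today is day five of the Kaggle Data Cleaning Challenge, and just like that it is the last day! This time the task is to handle data with inconsistent spelling. For example, the state of Connecticut might be recorded as "Connecticut", "Coon." or "Conecticutt". These all stand for the same value, but a machine treats them as different objects. Today we use a simple method to tidy up such inconsistently spelled data, in three parts: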

Get our environment set up
Do some preliminary text pre-processing
Use fuzzy matching to correct inconsistent data entry

1. Get our environment set up

First, import the libraries we need:

# modules we'll use
import pandas as pd
import numpy as np

# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import chardet

# set seed for reproducibility
np.random.seed(0)

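When I first read in the file 'PakistanSuicideAttacks Ver 11 (30-November-2017).csv', I got an encoding error, so I used the method from yesterday's post, "Kaggle Data Cleaning Challenge Day 4 - Character Encoding", to quickly check the file's encoding: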

# look at the first hundred thousand bytes to guess the character encoding
with open("../input/PakistanSuicideAttacks Ver 11 (30-November-2017).csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))

# check what the character encoding might be
print(result)

Then read the file using the Windows-1252 encoding:

# read in our data
suicide_attacks = pd.read_csv("../input/PakistanSuicideAttacks Ver 11 (30-November-2017).csv", 
                              encoding='Windows-1252')

No error this time.
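To see why the first read failed, here is a minimal stdlib sketch. It assumes the file contains at least one byte that is valid in Windows-1252 but not in UTF-8 (the default encoding of pandas.read_csv). The sample bytes are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample: in Windows-1252 the byte 0x92 is a right single quotation mark.
raw = b"Peshawar\x92s"

# Decoding as UTF-8 fails, because 0x92 cannot start a UTF-8 character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec that actually wrote the bytes succeeds.
decoded = raw.decode("cp1252")  # "cp1252" is Python's name for Windows-1252

print(utf8_ok)
print(decoded)
```

Passing encoding='Windows-1252' to read_csv applies the same decoding to the whole file.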

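2. Do some preliminary text pre-processing

Let's look at the values under "City". There are more efficient approaches, but let's start by checking and tidying by hand to get a feel for the process: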

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

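The output shows a lot of inconsistent entries, such as 'ATTOCK' and 'Attock', or 'D.G Khan' and 'D.G Khan ' (with a trailing space). So the first step is to convert every letter to lower case and then remove any white space at the start and end of each string. Capitalization and stray white space are the most common problems, so fixing these two gets roughly 80% of the work done.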

# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove leading and trailing white spaces
suicide_attacks['City'] = suicide_attacks['City'].str.strip()
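As a quick illustration of what these two steps achieve, the sketch below applies the same normalization to a plain Python list (no pandas needed). The four sample values are the ones mentioned above; the variable names are only for this example.

```python
# The four inconsistent spellings mentioned in the text.
raw_cities = ["ATTOCK", "Attock", "D.G Khan", "D.G Khan "]

# Same idea as .str.lower() followed by .str.strip() on a pandas column.
normalized = [city.lower().strip() for city in raw_cities]

unique_before = sorted(set(raw_cities))
unique_after = sorted(set(normalized))

print(len(unique_before), unique_before)
print(len(unique_after), unique_after)
```

Four distinct raw values collapse into two, 'attock' and 'd.g khan'.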

3. Use fuzzy matching to correct inconsistent data entry

Let's keep looking at the 'City' column and see whether anything else needs fixing:

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

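There are still some problems. For example, 'd. i khan' and 'd.i khan' should be the same value, but 'd.g khan' is a different city and must not be mixed up with them.

We will try using fuzzywuzzy to identify strings that are similar to each other. This dataset is small enough to fix by hand, but a very large dataset might contain thousands of mismatched entries, so we need an automated approach. So what is "fuzzy matching"?

Fuzzy matching is the automated process of finding strings in text that are similar to a target string. In general, the fewer characters you need to change to turn one string into another, the closer the two are considered to be. For example, "apple" and "snapple" are two changes apart (add "s" and "n"). We can't rely on fuzzy matching 100%, but it usually saves at least some time.

For a pair of strings, fuzzywuzzy returns a ratio. The more similar the strings, the closer the ratio is to 100. Next, let's get the 10 strings in the list of cities that are closest to "d.i khan":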
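The "number of characters to change" idea is commonly formalized as the Levenshtein edit distance. The function below is a minimal sketch of that distance for illustration. It is not fuzzywuzzy's own implementation, and the function name is just one chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("apple", "snapple"))      # 2
print(levenshtein("d.i khan", "d.g khan"))  # 1
```

Note that a single changed character separates the two genuinely different cities, which is why a similarity threshold has to be chosen with care.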

# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

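Among the closest matches, the top two are "d. i khan" and "d.i khan", both with a score of 100. The other city, "d.g khan", scores 88 and must not be replaced, so we will replace every record with a score of at least 90 with "d.i khan".

To carry out the replacement, let's write a function that we can call repeatedly: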
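To make these scores less mysterious, here is a rough stdlib approximation of a token sort ratio: lowercase, treat punctuation as white space, sort the tokens, then compare the results with difflib. This is a sketch under those assumptions, not fuzzywuzzy's actual code, and the real library can differ in details (for example when python-Levenshtein is installed).

```python
import re
from difflib import SequenceMatcher

def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Approximate a token-sort similarity score between 0 and 100."""
    def normalize(s: str) -> str:
        # Lowercase, turn every non-alphanumeric run into a space, sort the tokens.
        tokens = re.sub(r"[^a-z0-9]+", " ", s.lower()).split()
        return " ".join(sorted(tokens))

    a, b = normalize(s1), normalize(s2)
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

for candidate in ["d. i khan", "d.i khan", "d.g khan"]:
    print(candidate, token_sort_ratio_sketch("d.i khan", candidate))
```

Because punctuation and spacing are normalized away, "d. i khan" and "d.i khan" both become "d i khan" and score 100, while "d.g khan" differs by one letter out of eight and scores 88.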

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
    # get a list of unique strings
    strings = df[column].unique()
    
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    # here each match is a tuple where match[0] is the name of the city and 
    # match[1] is how close that city is to string_to_match.

    # The list comprehension below builds a list of city names: a city is
    # included only if its match ratio is at least the threshold min_ratio.
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)
    
    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match
    
    # let us know the function's done
    print("All done!")

Now use this function to replace the entries that are similar to "d.i khan":

# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")

Now let's look at all the "City" values again:

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

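Everything looks fine now.

That's it for the last day. Working through all five days was well worth it, and I hope Kaggle runs more events like this in the future.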


王大鱼
