FuzzyWuzzy: a Python library with two extremely useful modules

In daily development work, we often run into this problem: we need to match a field in our data, but the field may differ slightly between sources. For example, in the same set of job-posting data, the province column may read "Guangxi" in some rows, "Guangxi Zhuang Autonomous Region" in others, and even "Guangxi Province" in a few... Handling all these variants used to require a lot of extra code.

Today I would like to share FuzzyWuzzy, a simple and easy-to-use fuzzy string matching toolkit that lets you solve these troublesome matching problems with ease!

01 Preface

When processing data, you will inevitably run into a scenario like the following: the data you have contains a simplified version of a field, but the data you want to compare against or merge with contains the full version (sometimes the reverse is true).

The most common example is geographic visualization: the data you collected yourself only keeps abbreviations such as Beijing, Guangxi, Xinjiang, and Tibet, while the field to be matched uses the full names Beijing City, Guangxi Zhuang Autonomous Region, Xinjiang Uygur Autonomous Region, Tibet Autonomous Region, and so on. We therefore need a quick and convenient way to match the corresponding fields directly and generate a separate column of results, and that is exactly what the FuzzyWuzzy library provides.

02 Introduction to FuzzyWuzzy library

FuzzyWuzzy is a simple and easy-to-use fuzzy string matching toolkit. It calculates the difference between two sequences based on the Levenshtein Distance algorithm.

The Levenshtein Distance algorithm, also called the Edit Distance algorithm, measures the minimum number of editing operations required to convert one string into the other. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. Generally speaking, the smaller the edit distance, the greater the similarity between the two strings.
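The core idea can be sketched in a few lines of plain Python (a minimal illustration of the algorithm, not FuzzyWuzzy's actual implementation, which delegates to optimized C code when python-Levenshtein is installed):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("河南", "河南省"))  # → 1: one insertion turns "河南" into "河南省"
```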

The programming environment used here is Jupyter Notebook under Anaconda, so enter the following command in the Anaconda command line to install the third-party library.

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy

1. fuzz module

This module mainly provides four functions (methods): simple matching (Ratio), incomplete matching (Partial Ratio), ignore-order matching (Token Sort Ratio), and deduplicated subset matching (Token Set Ratio).

Note: if you import this module directly, the system will print a warning. This is not an error; the program can still run, just with a slower default algorithm. You can install the python-Levenshtein library as the prompt suggests, which speeds up the calculation.

pip install python-Levenshtein

1.1 Simple matching (Ratio)

This one is worth only a brief look: it is not very precise and not commonly used.

from fuzzywuzzy import fuzz

# match
fuzz.ratio("河南省", "河南省")
# output
100
# match
fuzz.ratio("河南", "河南省")
# output
80
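Under the hood, this score is essentially the classic 2*M/T similarity (M = number of matched characters, T = total length of both strings) scaled to 0-100. A minimal stand-in using only the standard library's difflib (scores can differ by a point or two from the optimized python-Levenshtein implementation) reproduces the example above:

```python
from difflib import SequenceMatcher

def simple_ratio(s1: str, s2: str) -> int:
    # ratio() returns 2*M/T, where M is the number of matched
    # characters and T is the total length of both strings
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

print(simple_ratio("河南", "河南省"))  # → 80, i.e. 2*2/(2+3) = 0.8
```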
1.2 Incomplete matching (Partial Ratio)

This uses non-exact (substring) matching and achieves higher accuracy.

# match
fuzz.partial_ratio("河南省", "河南省")
# output
100
# match
fuzz.partial_ratio("河南", "河南省")
# output
100
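The idea behind partial matching can be sketched as sliding the shorter string across the longer one and keeping the best window score. Below is a simplified sketch using difflib, not FuzzyWuzzy's actual partial_ratio implementation (which is cleverer about choosing alignments):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1: str, s2: str) -> int:
    shorter, longer = sorted((s1, s2), key=len)
    n = len(shorter)
    # compare the shorter string against every same-length window of the longer one
    best = max(SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
               for i in range(len(longer) - n + 1))
    return int(round(100 * best))

print(partial_ratio_sketch("河南", "河南省"))  # → 100: "河南" matches the window "河南"
```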
1.3 Ignore order matching (Token Sort Ratio)

The principle: the strings are split into tokens using spaces as delimiters, all letters are lowercased, punctuation other than spaces is ignored, and the tokens are sorted before comparison.

1) Using simple matching (Ratio)

# match
fuzz.ratio("西藏 自治区", "自治区 西藏")
# output
50
# match
fuzz.ratio('I love YOU', 'YOU LOVE I')
# output
30

2) Using ignore-order matching (Token Sort Ratio)

# match
fuzz.token_sort_ratio("西藏 自治区", "自治区 西藏")
# output
100
# match
fuzz.token_sort_ratio('I love YOU', 'YOU LOVE I')
# output
100
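Conceptually, ignore-order matching just normalizes both strings (lowercase, split into tokens, sort, rejoin) before running the ordinary ratio comparison. A sketch of that normalization, with difflib standing in for FuzzyWuzzy's scorer:

```python
from difflib import SequenceMatcher

def token_sort_sketch(s1: str, s2: str) -> int:
    def normalize(s: str) -> str:
        # lowercase, split on whitespace, sort the tokens, rejoin with spaces
        return " ".join(sorted(s.lower().split()))
    return int(round(100 * SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()))

print(token_sort_sketch("I love YOU", "YOU LOVE I"))  # → 100: both normalize to "i love you"
```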
1.4 Deduplicated subset matching (Token Set Ratio)

This is equivalent to deduplicating the tokens as a set before comparison. Pay attention to the last two results: this method can be understood as token_sort_ratio plus set deduplication. All three matches below compare against the same reversed-order string.

# 1. simple matching
fuzz.ratio("西藏 西藏 自治区", "自治区 西藏")
# output
40
# 2. ignore-order matching
fuzz.token_sort_ratio("西藏 西藏 自治区", "自治区 西藏")
# output
80
# 3. deduplicated subset matching
fuzz.token_set_ratio("西藏 西藏 自治区", "自治区 西藏")
# output
100
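The deduplication step can be sketched by adding a set() to the same normalization. This is a simplification: the real token_set_ratio also compares token intersections and differences separately, but it shows why the duplicate "西藏" no longer hurts the score:

```python
from difflib import SequenceMatcher

def token_set_sketch(s1: str, s2: str) -> int:
    def normalize(s: str) -> str:
        # set() removes duplicate tokens before sorting and rejoining
        return " ".join(sorted(set(s.lower().split())))
    return int(round(100 * SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()))

print(token_set_sketch("西藏 西藏 自治区", "自治区 西藏"))  # → 100 once the duplicate "西藏" is dropped
```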

The ratio() family of functions (methods) in fuzz all return numbers. If you need the string with the highest matching score, you still have to choose the appropriate function for your data type and then extract the result yourself. In other words, these methods quantify how well two pieces of text match, but they are not convenient for extracting the matching results, and that is what the process module is for.

2. process module

It handles situations where there is a limited list of candidate answers, returning the fuzzy-matched strings together with their similarity scores.

2.1 extract: extract multiple results

from fuzzywuzzy import process

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
process.extract("郑州", choices, limit=2)

# output
[('郑州市', 90), ('河南省', 0)]
"""
In the choices list, the string "郑州市" is the most similar to the target "郑州", with a similarity of 90; next comes "河南省", with a similarity of 0.
"""

The result of extract is a list; even with limit=1 it is still a list. Note the difference from extractOne below.
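To see why the return type is always a list, here is a minimal pure-Python sketch of what extract does, with difflib standing in for FuzzyWuzzy's default WRatio scorer (so the scores differ from the 90 shown above):

```python
from difflib import SequenceMatcher

def extract_sketch(query, choices, limit=2):
    # score every candidate, then return the top `limit` pairs, highest first
    scored = [(c, int(round(100 * SequenceMatcher(None, query, c).ratio())))
              for c in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
print(extract_sketch("郑州", choices, limit=1))  # → [('郑州市', 80)] — still a list
```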

2.2 extractOne: extract a single result

If you only want the single result with the highest matching score, use extractOne. Note that it returns a tuple. Also note that the best match is not necessarily the data we actually want, as you can see from the following examples and the two practical applications.

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
process.extractOne("郑州", choices)
# output
('郑州市', 90)

process.extractOne("北京", choices)
# output
('湖北省', 45)

3. Practical application: Fuzzy matching of company name fields

The data to be matched looks like this: two datasets, one containing the full company names and the other containing the company abbreviations.

The code is wrapped directly in a function, mainly to make future calls convenient. The parameters are documented in detail:

from fuzzywuzzy import process

origin = [
    {'id': 1, 'name': '张娟', 'province': '香港特别行政区',
     'address': '辽宁省瑞市崇文西安路w座 972299', 'company': '联软网络有限公司'},
    {'id': 2, 'name': '常桂英', 'province': '澳门特别行政区',
     'address': '辽宁省昆明市合川张路B座 931970', 'company': '创联世纪信息有限公司'},
]  # other records omitted
target = [
    {'序号': 1, '姓名': '张娟', '身份': '香港',
     '地址': '辽宁省瑞市崇文西安路w座 972299', '公司': '联软网络'},
    {'序号': 2, '姓名': '常桂英', '身份': '澳门',
     '地址': '辽宁省昆明市合川张路B座 931970', '公司': '创联世纪信息'},
]  # other records omitted


# fuzzy matching
def fuzzy_merge(my_data, my_data_key, other_data, other_data_key, threshold=90, limit=2):
    """
    :param my_data: my own data
    :param my_data_key: the key in my own data to compare on
    :param other_data: the data to match against
    :param other_data_key: the key in the other data to compare on
    :param threshold: similarity threshold; candidates scoring below it are discarded (default 90)
    :param limit: number of matches returned per record, sorted from high to low (default 2)
    :return: my_data with a 'matches' field added to each record
    """
    # values of the comparison key in my own data
    _my_data = [row[my_data_key] for row in my_data]
    # values of the comparison key in the data to match against
    _other_data = [row[other_data_key] for row in other_data]

    for i, value in enumerate(_my_data):
        candidates = process.extract(value, _other_data, limit=limit)
        # keep the best candidate above the threshold, or '' if there is none
        passed = [c[0] for c in candidates if c[1] >= threshold]
        my_data[i]['matches'] = passed[0] if passed else ''

    return my_data


res = fuzzy_merge(my_data=origin, my_data_key="company", other_data=target, other_data_key="公司")

After execution, each record in res carries a 'matches' field holding the matched company abbreviation, or an empty string if no candidate passed the threshold.


Source: blog.csdn.net/FloraCHY/article/details/131578325