Two easy-to-use Python modules worth bookmarking!


Reposted from: python column

In day-to-day development work, we often run into this problem: we need to match a certain field in the data, but the field may be written with slight variations. For example, in the same set of job-posting data, the province column may contain "Guangxi" in some rows, "Guangxi Zhuang Autonomous Region" in others, and even "Guangxi Province". A lot of extra code ends up being written just to handle these cases.

Today I would like to share FuzzyWuzzy, a simple and easy-to-use fuzzy string matching toolkit that lets you solve these annoying matching problems with ease!


 1. Foreword

During data processing, you will inevitably run into scenarios like the following: the data in hand uses an abbreviated version of a field, but the data you want to compare or merge against uses the full version (or sometimes the other way around).

One of the most common examples is geographic visualization: the data you collected yourself keeps only abbreviations such as Beijing, Guangxi, Xinjiang, and Tibet, while the field to be matched contains Beijing, Guangxi Zhuang Autonomous Region, Xinjiang Uygur Autonomous Region, Tibet Autonomous Region, and so on. We therefore need a quick and convenient way to match the corresponding fields directly and write the results into a separate column, and that is exactly what the FuzzyWuzzy library is for.

 2. Introducing the FuzzyWuzzy Library

FuzzyWuzzy is an easy-to-use fuzzy string matching toolkit. It measures the difference between two sequences using the Levenshtein Distance algorithm.

The Levenshtein Distance algorithm, also known as the Edit Distance algorithm, computes the minimum number of edit operations required to transform one string into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings.
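As a quick illustration of the idea (a plain-Python sketch, not part of FuzzyWuzzy itself), the edit distance can be computed with a small dynamic-programming table:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j] (previous row)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("河南", "河南省"))      # 1: one insertion
print(levenshtein("kitten", "sitting"))  # 3
```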

The examples here use the Jupyter notebook environment under Anaconda, so the third-party library is installed by entering the following command on the Anaconda command line:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy


2.1 fuzz module

This module mainly provides four functions (methods): simple matching (Ratio), partial matching (Partial Ratio), order-insensitive matching (Token Sort Ratio), and deduplicated-subset matching (Token Set Ratio).

Note: importing this module directly triggers a warning. This is not an error, and the program still runs (it falls back to a default algorithm that executes more slowly). Following the system prompt and installing the python-Levenshtein library will speed up the computation.


2.1.1 Simple matching (Ratio)

ratio() performs a straightforward whole-string comparison. It is not very precise and not commonly used on its own.

fuzz.ratio("河南省", "河南省")

output

100
fuzz.ratio("河南", "河南省")

output

80


2.1.2 Partial Ratio

partial_ratio() matches against the best-aligned substring, so non-exact matches score higher.

fuzz.partial_ratio("河南省", "河南省")

output

100
fuzz.partial_ratio("河南", "河南省")

output

100


2.1.3 Ignore order matching (Token Sort Ratio)

The principle: the strings are split on spaces, all letters are lowercased, punctuation other than spaces is ignored, and the tokens are sorted before comparison.

fuzz.ratio("西藏 自治区", "自治区 西藏")

output

50
fuzz.ratio('I love YOU','YOU LOVE I')

output

30
fuzz.token_sort_ratio("西藏 自治区", "自治区 西藏")

output

100
fuzz.token_sort_ratio('I love YOU','YOU LOVE I')

output

100


2.1.4 Token Set Ratio

token_set_ratio() performs a set-deduplication step before comparison; it can be understood as token_sort_ratio() with deduplication added on top. The following three calls compare the same pair of strings, which differ in token order and contain a duplicated token; pay attention to the difference between the last two results.

fuzz.ratio("西藏 西藏 自治区", "自治区 西藏")

output

40
fuzz.token_sort_ratio("西藏 西藏 自治区", "自治区 西藏")

output

80
fuzz.token_set_ratio("西藏 西藏 自治区", "自治区 西藏")

output

100

These fuzz ratio() functions (methods) all return a number. If you want the best-matching string itself, you have to choose the right function for your data type and then extract the result yourself. Quantifying the similarity of text data this way is useful, but not very convenient for actually extracting the matched value, which is what the process module is for.


2.2 process module

It handles the case where the candidate answers are limited, and returns the fuzzy-matched string together with its similarity score.

2.2.1 extract to extract multiple pieces of data

Similar to select() in a web-scraping parser, it returns a list, which can contain multiple matching results.

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
process.extract("郑州", choices, limit=2)

output

[('郑州市', 90), ('河南省', 0)]

The data type returned by extract is a list. Even with limit=1 the result is still a list; note the difference from extractOne below.

2.2.2 extractOne extracts a piece of data

If you only want the single result with the highest matching score, use extractOne. Note that it returns a tuple, and that the highest-scoring result is not necessarily the data we want, as the examples below and the two practical applications later will show.

process.extractOne("郑州", choices)

output

('郑州市', 90)
process.extractOne("北京", choices)

output

('湖北省', 45)


 3. Practical application

Here are two small practical examples: the first is fuzzy matching on a company-name field, and the second is fuzzy matching on a province field.

3.1 Fuzzy matching of company name field

The two datasets to be matched look as follows: the company-name field in the self-collected data is very concise rather than the company's full name, so the two fields need to be merged via fuzzy matching. (The original screenshot of the data is not reproduced here.)

The code is encapsulated directly as a function, mainly for convenient reuse later. The parameters are set in some detail and are explained below. (The original screenshot of the execution results is not reproduced here.)

3.1.1 Explanation of parameters:

  • The first parameter df_1 is the left data to be merged obtained by oneself (here is the data variable);

  • The second parameter df_2 is the right data to be merged to be matched (here is the company variable);

  • The third parameter key1 is the field name to be processed in df_1 (here is the 'company name' field in the data variable)

  • The fourth parameter key2 is the name of the field to be matched in df_2 (here is the 'company name' field in the company variable)

  • The fifth parameter threshold sets the acceptance standard for the extracted match. Note that this is the improvement over plain extractOne: since the highest-scoring result is not necessarily the one we want, we set a threshold to judge it. Here the value is 90, so a match is accepted only if its score is greater than or equal to 90;

  • The sixth parameter limit defaults to returning only two matching results;

  • Return value: the new DataFrame data after adding the 'matches' field to df_1

3.1.2 Core code explanation

The first part of the code is as follows. Refer to the explanation of the process.extract method above, which is used directly here. The returned result m is therefore a list of tuples, in the style [('郑州市', 90), ('河南省', 0)], so the data first written into the 'matches' field is also in this format.

Note: the first element of each tuple is the successfully matched string, and the second is the numeric score that gets compared against the threshold parameter.

s = df_2[key2].tolist()
m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
df_1['matches'] = m

The core code of the second part is as follows. With the above analysis, the data type in the 'matches' field is clear, and the next step is to extract from it. Two points need attention:

  • extract the successfully matched string, and fill in an empty value for rows whose score is below the threshold;

  • finally write the result back into the 'matches' field.

m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold][0] if len([i[0] for i in x if i[1] >= threshold]) > 0 else '')
# Once you understand the data type the 'matches' field held after the first step,
# this line is easy to follow; compare with the format: [('郑州市', 90), ('河南省', 0)]
df_1['matches'] = m2
return df_1

3.2 Province field fuzzy matching

The self-collected data and the data to be matched were shown in the background introduction above, and the fuzzy-matching function has already been encapsulated, so here you can simply call it with the corresponding parameters. (The original screenshot of the code and execution results is not reproduced here.)

Once the data processing is complete, the encapsulated function can be placed in a custom module file, so that in the future you only need to import the function name. You can refer to the earlier method of packaging commonly used custom functions into directly callable modules.


 4. Full function code

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum similarity score (0-100) a match must reach to be accepted
    :param limit: the number of matches that will get returned, sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold][0] if len([i[0] for i in x if i[1] >= threshold]) > 0 else '')
    df_1['matches'] = m2

    return df_1

df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
df



Origin blog.csdn.net/cainiao_python/article/details/131078664