A very useful Python fuzzy-matching library (hands-on project)

1. Introduction

When processing data, you will inevitably run into scenarios like the following: the data in hand uses an abbreviated version of a field, while the data to be aligned or merged uses the full version (or sometimes the other way around).

The most common example is geographic visualization: the data you collect yourself keeps only abbreviations such as Beijing, Guangxi, Xinjiang, and Tibet, while the field data to be matched contains the full names Beijing, Guangxi Zhuang Autonomous Region, Xinjiang Uygur Autonomous Region, Tibet Autonomous Region, and so on. We therefore need a way to match the corresponding fields quickly and conveniently and write the result into a separate column. This is where the FuzzyWuzzy library comes in.

2. FuzzyWuzzy library introduction

FuzzyWuzzy is an easy-to-use fuzzy string matching toolkit. It measures the difference between two sequences using the Levenshtein distance algorithm.

The Levenshtein Distance algorithm, also known as the Edit Distance algorithm, refers to the minimum number of editing operations required to convert two strings from one to the other. Permitted editing operations include replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity between the two strings.
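To make the definition concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the classic dynamic-programming recurrence. This is for illustration only; FuzzyWuzzy relies on an optimized implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # Row for the empty prefix of a: distance equals the prefix length of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("河南", "河南省"))      # 1 (one character inserted)
```

As the last example shows, "河南" and "河南省" are only one edit apart, which is why they score as highly similar below.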

The Jupyter Notebook environment under Anaconda is used here, so enter the following command in the Anaconda command line to install the third-party library:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple FuzzyWuzzy

2.1 fuzz module

This module mainly provides four functions (methods): simple matching (ratio), partial matching (partial_ratio), out-of-order matching (token_sort_ratio), and deduplicated subset matching (token_set_ratio).

Note: importing this module directly triggers a warning. This is not an error, and the program still runs (it falls back to the default pure-Python algorithm, which is slower). Installing the python-Levenshtein library, as the warning suggests, speeds up the calculation:

pip install python-Levenshtein

2.1.1 Simple matching (Ratio)

A brief look is enough here; plain ratio is not very forgiving and is not commonly used on its own:

fuzz.ratio("河南省", "河南省") >>> 100
fuzz.ratio("河南", "河南省") >>> 80

2.1.2 Partial matching (Partial Ratio)

For inexact matches, this method gives better results:

fuzz.partial_ratio("河南省", "河南省")>>> 100
fuzz.partial_ratio("河南", "河南省")>>> 100
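The idea behind partial matching can be sketched with the standard library: compare the shorter string against every equal-length substring of the longer one and keep the best score. This is a simplified illustration, not FuzzyWuzzy's actual implementation (which uses matching blocks), so scores may differ on other inputs.

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best similarity between the shorter string and any
    equal-length substring of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    n = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
        for i in range(len(longer) - n + 1)
    )
    return round(best * 100)

print(partial_ratio_sketch("河南", "河南省"))  # 100: "河南" is contained in "河南省"
```

Because "河南" appears verbatim inside "河南省", the best window matches perfectly, mirroring the fuzz.partial_ratio result above.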

2.1.3 Ignore order matching (Token Sort Ratio)

The principle: split on spaces, lowercase all letters, ignore punctuation other than spaces, then sort the tokens before comparing.

fuzz.ratio("西藏 自治区", "自治区 西藏") >>> 50
fuzz.ratio('I love YOU', 'YOU LOVE I') >>> 30
fuzz.token_sort_ratio("西藏 自治区", "自治区 西藏") >>> 100
fuzz.token_sort_ratio('I love YOU', 'YOU LOVE I') >>> 100
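The normalization step above can be sketched in a few lines of stdlib Python. This simplified version skips the punctuation stripping that the real token_sort_ratio also performs, but it shows why reordered tokens score 100:

```python
from difflib import SequenceMatcher

def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Lowercase, split on whitespace, sort the tokens, then compare."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return round(SequenceMatcher(None, normalize(s1), normalize(s2)).ratio() * 100)

print(token_sort_ratio_sketch("I love YOU", "YOU LOVE I"))      # 100
print(token_sort_ratio_sketch("西藏 自治区", "自治区 西藏"))      # 100
```

After sorting, both inputs normalize to the same token sequence, so the comparison is a perfect match.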

2.1.4 Deduplicated subset matching (Token Set Ratio)

This performs a set-style deduplication before the comparison. Pay attention to the last two calls: this method can be understood as token_sort_ratio plus deduplication. In the three comparisons below, one string repeats a token and the token order is reversed:

fuzz.ratio("西藏 西藏 自治区", "自治区 西藏")>>> 40
fuzz.token_sort_ratio("西藏 西藏 自治区", "自治区 西藏")>>> 80
fuzz.token_set_ratio("西藏 西藏 自治区", "自治区 西藏")>>> 100
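Adding one set() call to the previous sketch captures the deduplication step. Note this only mirrors the real token_set_ratio for simple cases like the one above; the actual implementation compares token-set intersections and differences and takes the maximum score:

```python
from difflib import SequenceMatcher

def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Deduplicate tokens with a set before the sorted comparison."""
    def normalize(s: str) -> str:
        return " ".join(sorted(set(s.lower().split())))
    return round(SequenceMatcher(None, normalize(s1), normalize(s2)).ratio() * 100)

print(token_set_ratio_sketch("西藏 西藏 自治区", "自治区 西藏"))  # 100
```

The duplicate "西藏" disappears in the set, so both sides normalize identically and the score reaches 100.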

All of these fuzz ratio functions (methods) return a number. If you need the best-matching string itself, you still have to pick the right function for your data type and then extract the result yourself. These methods quantify how similar two pieces of text are, but they are not convenient for extracting the matched results, which is why there is a process module.

2.2 process module

Used to handle a limited set of candidate answers; it returns the fuzzily matched string together with its similarity score.

2.2.1 extract: extract multiple results

Similar to select in a web scraper, it returns a list that can contain many matched results:

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
process.extract("郑州", choices, limit=2)
>>> [('郑州市', 90), ('河南省', 0)]
# extract returns a list; even with limit=1 the result is still a list.
# Note the difference from extractOne below.

2.2.2 extractOne: extract a single result

If you only want the single best match, use extractOne. Note that it returns a tuple, and the highest-scoring result is not necessarily the data we want. You can get a feel for this from the examples below and the two practical applications later.

process.extractOne("郑州", choices)>>> ('郑州市', 90)
process.extractOne("北京", choices)>>> ('湖北省', 45)
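The behavior of extract and extractOne can be sketched with the standard library: score every candidate, sort descending, and slice. The absolute scores below differ from fuzzywuzzy's (process applies its own weighted scorer), but the selection logic, and the '北京'-style pitfall of a meaningless "best" match, is the same:

```python
from difflib import SequenceMatcher

def score(query: str, choice: str) -> int:
    return round(SequenceMatcher(None, query, choice).ratio() * 100)

def extract_sketch(query, choices, limit=2):
    """Score every choice and return the top `limit` as (choice, score) tuples."""
    scored = sorted(((c, score(query, c)) for c in choices),
                    key=lambda t: t[1], reverse=True)
    return scored[:limit]

def extract_one_sketch(query, choices):
    """Return only the single best (choice, score) tuple."""
    return extract_sketch(query, choices, limit=1)[0]

choices = ["河南省", "郑州市", "湖北省", "武汉市"]
print(extract_sketch("郑州", choices, limit=2))  # [('郑州市', 80), ('河南省', 0)]
print(extract_one_sketch("郑州", choices))       # ('郑州市', 80)
```

Even here, a query with no real counterpart would still return *some* top tuple, which is exactly why the threshold check in section 3 is needed.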

3. Practical application

Here are two small practical examples: the first is fuzzy matching on a company-name field, and the second is fuzzy matching on a province field.

3.1 Fuzzy match of company name field

The data in hand and the data to be matched are shown below. The company-name field in our own data is very concise rather than the full company name, so the two fields need to be merged via fuzzy matching.

[Screenshot: the original data and the company data to be matched]
The code is encapsulated directly as a function, mainly for convenient reuse later. The parameters are set out in detail below, followed by the execution results.

3.1.1 Parameter explanation:

① The first parameter, df_1, is the left-hand data you want to merge (here, the data variable);

② The second parameter, df_2, is the right-hand data to be matched against (here, the company variable);

③ The third parameter, key1, is the field name to process in df_1 (here, the '公司名称' field in the data variable);

④ The fourth parameter, key2, is the field name to match against in df_2 (here, the '公司名称' field in the company variable);

⑤ The fifth parameter, threshold, sets the minimum acceptable match score. This is the improvement over the raw extractOne method: since the highest-scoring result is not necessarily the one we need, a threshold is used to decide whether a match is acceptable;

⑥ The sixth parameter, limit, defaults to returning only the top two matching results;

⑦ Return value: a new DataFrame, namely df_1 with a 'matches' field added.

3.1.2 Core code explanation

The first part of the code is as follows. Refer to the process.extract method explained above, which is used directly here. The returned result m is a list of nested tuples, in the style of [('郑州市', 90), ('河南省', 0)], so the data first written to the 'matches' field is in this format.

Note: the first element of each tuple is the successfully matched string; the second is the score that gets compared against the threshold parameter.

s = df_2[key2].tolist()
m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m

The core code of the second part is as follows. With the analysis above, the data type held in the 'matches' field is clear, and the next step is to extract the data. Two things need to be handled:

① Extract the successfully matched strings, filling entries whose score falls below the threshold (90 here) with empty values;

② Write the result back into the 'matches' field.

m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold][0] if len([i[0] for i in x if i[1] >= threshold]) > 0 else '')
# Once you understand the data format the first 'matches' field holds,
# this line is easy to follow. Recall the format: [('郑州市', 90), ('河南省', 0)]
df_1['matches'] = m2
return df_1
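The threshold logic in this second step can be checked in isolation, without pandas, on a single 'matches' cell. This helper is just the lambda above unrolled for readability, applied to the tuple format shown earlier:

```python
def best_above_threshold(matches, threshold):
    """Keep the first matched string whose score clears the threshold;
    otherwise return an empty string (same logic as the lambda above)."""
    good = [name for name, score in matches if score >= threshold]
    return good[0] if good else ''

print(best_above_threshold([('郑州市', 90), ('河南省', 0)], 90))   # 郑州市
print(repr(best_above_threshold([('湖北省', 45)], 90)))            # ''
```

The second call shows the safety net in action: a spurious "best" match like ('湖北省', 45) for 北京 is discarded rather than merged.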

3.2 Fuzzy matching of province field

The data in hand and the data to be matched were pictured in the background introduction above, and the fuzzy-matching function has already been encapsulated, so here we can call it directly with the corresponding parameters. The code and execution results are as follows:

[Screenshot: code and execution results of the province-field match]

Once the data processing is done, the encapsulated function can be placed in a file under your own custom module name, so that in the future you can simply import the function by name. For reference, see how to package commonly used custom functions into directly callable module methods.

4. All function code

# Fuzzy matching
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold][0] if len([i[0] for i in x if i[1] >= threshold]) > 0 else '')
    df_1['matches'] = m2
    return df_1

df = fuzzy_merge(data, company, '公司名称', '公司名称', threshold=90)
df

Origin: blog.csdn.net/BYGFJ/article/details/123651866