Automatic field-name mapping in data development with short-text similarity matching

Introduction: This script maps field names automatically. When a model is built from raw data, part of the workload is mapping fields from the source tables to the model, and doing that by eye across long field lists is tedious and error-prone.
Principle: use the Levenshtein distance (via fuzzywuzzy) for short-text similarity matching on the field comments, so that the "most" suitable source field is found automatically for each model field. A final manual review of the result is still essential.
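
To make the principle concrete, here is a minimal pure-Python sketch of the Levenshtein (edit) distance and a 0 to 100 similarity score derived from it. The function names `levenshtein` and `similarity` are illustrative and not part of fuzzywuzzy, and the normalization shown here is an assumption: `fuzz.ratio` may produce different numbers for the same pair of strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Illustrative 0-100 score: 100 means identical strings.
    Assumption: normalized by the longer length, unlike fuzz.ratio."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("患者姓名", "病人姓名"))
```

Short field comments differ by only a few characters when they describe the same thing, which is why an edit-distance score works reasonably well for this task.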

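# Summary: automatic field-name mapping. When a model is developed from raw data, part of the
#          workload is mapping fields from the data to the model; this script spares you from
#          scanning long field lists by eye.
# Principle: use the Levenshtein distance (fuzzywuzzy) for short-text similarity matching to find
#            the "most" suitable field automatically; a final manual review is still essential.
# Author: 王振东
# Date: 2021-01-07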

"""
Dependencies:
pip install python-Levenshtein
pip install fuzzywuzzy
Constraint:
Every field must have a comment, and no two fields may share the same comment
"""
from fuzzywuzzy import fuzz

"""
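Prepare the field information files:
1. The model's field file (model_fields_f); each line has the format: field_name field_comment
2. The source table's field file (orig_fields_f); each line has the format: field_name field_comment
"""
# Field files; each line has the format: field_name comment (separated by whitespace)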
# Read each file and build a {comment: field_name} dict.
# Blank or malformed lines (fewer than two columns) are skipped instead of raising IndexError.
with open('model_fields', 'r') as model_fields_f:
    model_list = [model_str.split() for model_str in model_fields_f]
model_dict = {i[1]: i[0] for i in model_list if len(i) >= 2}

with open('b_basy_xy', 'r') as orig_fields_f:
    orig_list = [orig_str.split() for orig_str in orig_fields_f]
orig_dict = {i[1]: i[0] for i in orig_list if len(i) >= 2}

# For each model comment, loop over all source comments and keep the one with the highest similarity
result = dict()
for k in model_dict.keys():
    max_match = (0, '')
    for ok in orig_dict.keys():
        r = fuzz.ratio(k, ok)
        if r > max_match[0]:
            max_match = (r, ok)
    result[k] = max_match

# Generate the field mapping as a SELECT statement (every column line ends with a comma,
# so remove the last one by hand before running the SQL)
print('Mapping result:')
print('SELECT ')
for k in result.keys():
    match = result[k]
    if match[1] == '':
        print('-- no match for', k)
        continue
    orig_field_name = orig_dict[match[1]]
    print('\t%s, -- %s, model field (%s %s), match score (%d)' % (orig_field_name, match[1], model_dict[k], k, match[0]))

print('FROM ORIG_TABLE_NAME \nWHERE PARTITION_CONDITION;')
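
Since a manual review of the result is still needed, it can help to separate confident matches from doubtful ones. The following is a rough sketch of that idea, not part of the original script: the function `match_fields`, the `min_score` threshold of 60 and the stand-in scorer `difflib_ratio` (built on the standard library's `difflib.SequenceMatcher`) are all assumptions made for illustration. To use fuzzywuzzy instead, `fuzz.ratio` can be passed as `scorer`.

```python
from difflib import SequenceMatcher


def difflib_ratio(a: str, b: str) -> int:
    """Stand-in scorer (assumption): 0-100 similarity from the standard library."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def match_fields(model_comments, orig_comments, scorer=difflib_ratio, min_score=60):
    """Pair each model comment with its best-scoring source comment.

    Returns two dicts of {model_comment: (best_source_comment, score)}:
    matches scoring at least min_score, and matches that need manual review.
    """
    confident, needs_review = {}, {}
    for mc in model_comments:
        # Best candidate by score; None when there are no source comments at all
        best = max(orig_comments, key=lambda oc: scorer(mc, oc), default=None)
        score = scorer(mc, best) if best is not None else 0
        target = confident if score >= min_score else needs_review
        target[mc] = (best, score)
    return confident, needs_review


if __name__ == "__main__":
    ok, review = match_fields(
        ["patient name", "birth date", "zzz"],
        ["patient name", "date of birth"],
    )
    print("confident:", ok)
    print("needs review:", review)
```

The threshold is a judgment call: a lower value leaves fewer fields for manual checking but lets more wrong matches through.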

Origin blog.csdn.net/ManWZD/article/details/112427386