NLP Basics: Accurate Word Segmentation (Application)

Disclaimer: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/qq_29027865/article/details/90813795

For installing and configuring the relevant NLP packages, see:
NLP toolkit installation and configuration
For the principles of word segmentation, see:
Natural Language Processing (NLP) – Hidden Markov Models


1. Loading a custom dictionary to ensure accurate segmentation

For some technical terms, the default dictionary may not segment the text well. For example, in medical text classification there are specialized drug terms such as 奥沙利铂 (oxaliplatin) and 氟尿嘧啶单药 (fluorouracil monotherapy).

Loading a custom dictionary in jieba

By adding the words that are not segmented correctly to a custom dictionary, you can correct how they are split.

In jieba, a custom dictionary is loaded with jieba.load_userdict:

jieba.load_userdict("dict.txt")
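For reference, dict.txt has one entry per line; in the simplest form each line is just the word (jieba also allows an optional frequency and part-of-speech tag after it). The entries below are only an illustration:

奥沙利铂
氟尿嘧啶单药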

Example (before loading the custom dictionary):

import jieba
seg_list = jieba.cut("辅助治疗中的氟尿嘧啶单药或联合奥沙利铂,并未直接纳入TNM分期系统", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  

(Screenshot of the segmentation output before loading the dictionary.)
After loading the dictionary:

import jieba
jieba.load_userdict("dict.txt")
seg_list = jieba.cut("辅助治疗中的氟尿嘧啶单药或联合奥沙利铂,并未直接纳入TNM分期系统", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # 精确模式

(Screenshot of the segmentation output after loading the dictionary.)

Loading a custom dictionary in HanLP

The HanLP custom dictionary directory is E:\NLP\hanlp\data\dictionary\custom.
As can be seen from the hanlp.properties file, the file loaded as the custom dictionary is resume_nouns.txt.
(Screenshot of hanlp.properties showing resume_nouns.txt configured as the custom dictionary.)
Before modifying the dictionary, first delete the CustomDictionary.txt.bin cache file; otherwise HanLP will keep loading the cached .txt.bin file instead of the updated .txt file.
(Screenshot of the custom dictionary directory containing the generated .txt.bin cache file.)
The difference between the HanLP dictionary and the jieba dictionary is that each HanLP entry consists of the word, its part of speech, and its frequency, whereas the jieba dictionary here contains only the words.
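As an illustration of that difference, a HanLP custom dictionary entry typically looks like the following, with a part-of-speech tag and a frequency after the word (the values here are made up), whereas the jieba entries shown earlier are just the words:

奥沙利铂 nz 1024
氟尿嘧啶单药 nz 1024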

2. Segmenting words with the help of regular expression matching

Patterns such as numbers followed by a percent sign cannot be enumerated one by one as dictionary entries, but they can be segmented accurately with regular expressions:

The basic idea is to first replace each regex match with a placeholder flag, segment the text, and then substitute the original matched words back in place of the flags.

cut_data.py:

import jieba
import re
from tokenizer import cut_hanlp

def merge_two_list(a, b):
    # Interleave the two lists, then append whatever remains of the longer one
    c=[]
    len_a, len_b = len(a), len(b)
    minlen = min(len_a, len_b)
    for i in range(minlen):
        c.append(a[i])
        c.append(b[i])

    if len_a > len_b:
        for i in range(minlen, len_a):
            c.append(a[i])
    else:
        for i in range(minlen, len_b):
            c.append(b[i])
    return c

if __name__=="__main__":
    # Step 1: open the file to be segmented
    fp=open("text.txt","r",encoding="utf8")
    # Step 2: create the file that will hold the segmentation results
    fout=open("result_cut.txt","w",encoding="utf8")
    # Step 3: use regular expressions to pre-process certain patterns
    # Match 1-5 non-Chinese characters (excluding the listed symbols) followed by 期
    regex1=u'(?:[^\u4e00-\u9fa5()*&……%¥$,,。.@! !]){1,5}期'
    # This regex matches percentages: [0-9]{1,3} is 1 to 3 digits,
    # [.]? means the decimal point occurs zero or one time
    regex2=r'(?:[0-9]{1,3}[.]?[0-9]{1,3})%'
    p1=re.compile(regex1)
    p2=re.compile(regex2)
    # Process the file line by line
    for line in fp.readlines():
        result1=p1.findall(line)
        # Replace the regex matches with placeholder flags before segmentation
        if result1:
            line=p1.sub("FLAG1",line)
        result2=p2.findall(line)
        if result2:
            line=p2.sub("FLAG2",line)
        words=jieba.cut(line)    # jieba segmentation (computed here but not written out)
        words1=cut_hanlp(line)   # HanLP segmentation; cut_hanlp returns a space-separated string
        result=words1
        # Substitute the originally matched words back in place of the flags
        if "FLAG1" in result:
            # Split on the flag to get the surrounding pieces
            result=result.split("FLAG1")
            # Interleave the pieces with the matched words
            result=merge_two_list(result,result1)
            result="".join(result)
        if "FLAG2" in result:
            result=result.split("FLAG2")
            result=merge_two_list(result,result2)
            result="".join(result)
        fout.write(result)
    fp.close()
    fout.close()
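To make the flag round trip concrete, here is a small self-contained sketch of the same idea, using regex2 from above and a simplified version of merge_two_list; the sample sentence is made up:

import re

def merge_two_list(a, b):
    # Interleave a and b, then append whatever is left of the longer list
    c = []
    for x, y in zip(a, b):
        c.extend([x, y])
    c.extend(a[len(b):] if len(a) > len(b) else b[len(a):])
    return c

regex2 = r'(?:[0-9]{1,3}[.]?[0-9]{1,3})%'
p2 = re.compile(regex2)

line = "总有效率为85.3%,安全性良好"
matches = p2.findall(line)        # ['85.3%']
masked = p2.sub("FLAG2", line)    # '总有效率为FLAG2,安全性良好'

# The FLAG2 token survives segmentation untouched; splitting on it and
# interleaving with the saved matches restores the original figures.
parts = masked.split("FLAG2")     # ['总有效率为', ',安全性良好']
print("".join(merge_two_list(parts, matches)))   # 总有效率为85.3%,安全性良好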

tokenizer.py:

import os,gc,re,sys

from jpype import *

# Start the JVM and put the HanLP jar and config directory on the classpath
startJVM(getDefaultJVMPath(),r"-Djava.class.path=E:\NLP\hanlp\hanlp-1.5.0.jar;E:\NLP\hanlp",
         "-Xms1g",
         "-Xmx1g")

Tokenizer = JClass('com.hankcs.hanlp.tokenizer.StandardTokenizer')

# POS tags to filter out in seg_sentences; the original post does not define this,
# so an empty set is used here as a placeholder
drop_pos_set = set()

def to_string(sentence,return_generator=False):
    # Each item returned by Tokenizer.segment prints as "word/pos"
    if return_generator:
        return (word_pos_item.toString().split('/') for word_pos_item in Tokenizer.segment(sentence))
    else:
        return " ".join([word_pos_item.toString().split('/')[0] for word_pos_item in Tokenizer.segment(sentence)])

def seg_sentences(sentence,with_filter=True,return_generator=False):
    # Use the generator form so that segs yields (word, pos) pairs
    segs=to_string(sentence,return_generator=True)
    if with_filter:
        g = [word_pos_pair[0] for word_pos_pair in segs if len(word_pos_pair)==2 and word_pos_pair[0]!=' ' and word_pos_pair[1] not in drop_pos_set]
    else:
        g = [word_pos_pair[0] for word_pos_pair in segs if len(word_pos_pair)==2 and word_pos_pair[0]!=' ']
    return iter(g) if return_generator else g

def cut_hanlp(raw_sentence,return_list=True):
    # Returns a space-separated string of the segmented words
    if len(raw_sentence.strip())>0:
        return to_string(raw_sentence) if return_list else iter(to_string(raw_sentence))

3. Dynamically adjusting word frequency in the dictionary

Sometimes, even after the dictionary has been loaded, a word is still not kept together as expected. In that case the word frequency needs to be adjusted dynamically.

Adjusting word frequency in jieba

jieba.suggest_freq('台中', tune=True)

Example:

import jieba
jieba.load_userdict("dict.txt")
jieba.suggest_freq('台中', tune=True)

if __name__ == "__main__":
    string = "台中正确应该不会被切开"
    words = jieba.cut(string, HMM=False)
    result = " ".join(words)
    print(result)

In practice, however, it is not feasible to adjust words one at a time by hand. Instead, you can open the dictionary file and traverse its words one by one:

fp = open("dict.txt", 'r', encoding='utf8')
for line in fp:
    line = line.strip()
    jieba.suggest_freq(line, tune=True)

To make this more concise and efficient, the for loop can be written as a list comprehension:

[jieba.suggest_freq(line.strip(), tune=True) for line in open("dict.txt",'r',encoding='utf8')]

Adjusting word frequency in HanLP

For HanLP segmentation, each dictionary entry records the word, its part of speech, and its frequency. If two candidate words have the same frequency, how does HanLP choose the segmentation? For example:
(Screenshot of dictionary entries that share the same frequency.)
Here we would like the longest-match principle to apply by default, i.e. the longer word takes priority in segmentation. However, when frequencies are equal, HanLP cuts words in the order in which they were loaded from the dictionary: whichever word was loaded first gets segmented out. So we need to sort the words in the dictionary by length, longest first.
sort_dict_by_len.py:

import os
# Step 1: open the original custom dictionary
dict_file=open(r"E:\NLP\hanlp\data\dictionary\custom"+os.sep+"resume_nouns.txt",'r',encoding='utf8')
d={}
# Step 2: map each dictionary line to the length of its word (the first field)
[d.update({line:len(line.split(" ")[0])}) for line in dict_file]
# Step 3: sort the entries by word length, longest first
f=sorted(d.items(), key=lambda x:x[1], reverse=True)
# Step 4: write the sorted entries to a new dictionary file
dict_file=open(r"E:\NLP\hanlp\data\dictionary\custom"+os.sep+"resume_nouns1.txt",'w',encoding='utf8')
[dict_file.write(item[0]) for item in f]
dict_file.close()
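For example (made-up entries), if the dictionary originally contained

氟尿嘧啶 nz 5
氟尿嘧啶单药 nz 5

the sorted file lists 氟尿嘧啶单药 before 氟尿嘧啶, so the longer term is tried first when both entries have the same frequency.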

With the sorted dictionary in place, HanLP's segmentation result is as follows:

from tokenizer import cut_hanlp
if __name__=="__main__":
    string="台中正确应该不会被切开。"
    words=cut_hanlp(string)
    print(words)

(Screenshot of the printed segmentation result.)
