NLP: Chinese word segmentation

Word segmentation is the basis of Chinese natural language processing. The most commonly used word segmentation algorithms are:

1. Dr. Zhang Huaping's NShort Chinese word segmentation algorithm.

2. Chinese word segmentation algorithm based on conditional random field (CRF).

Representative toolkits for these two algorithms are the jieba segmentation system and the LTP (Language Technology Platform) from Harbin Institute of Technology. The following shows how to use each of these tools.

The jieba package provides two segmentation functions, cut and cut_for_search; the latter is designed for search engines and produces finer-grained segments. jieba.cut(sentence, cut_all=False, HMM=True) accepts three parameters: the string to be segmented; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used.

The segmentation module of the pyltp package has a single segmentation function, Segmentor.segment(line), which takes one parameter: the string to be segmented.

#coding:utf-8
import jieba
from pyltp import Segmentor

text='奎金斯距离祭台很近,加拉塔“掉落”的部分事物就在他的面前,他下意识就捡起了其中一块金属碎屑般的事物和小片黑色固体。'

segs1=jieba.cut(text)
print('|'.join(segs1))
segs1=jieba.cut_for_search(text)
print('|'.join(segs1))

segmentor=Segmentor()  # instantiate the segmentation module
segmentor.load("D:\\ltp_data\\cws.model")
segs2=segmentor.segment(text)
print('|'.join(segs2))
segmentor.release()    # release the model

The result of word segmentation is as follows:

Quiggins |near |altar|very close|,|plus|latta|"|fall|"|| Picked up |the|of which|pieces of |metal|debris|like |things| and |small pieces|black|solid|.
Quiggins|distance|altar|closer|,|plus|latta|"|fall|"|| Just |picked up| of which |a piece of |metal|debris|like |thing| and |small piece|black|solid|.
Quiggins|distance|altar|very|close|close|,|Galata|"|fall|"|| Just |pick up|pick up|of the|one|piece|piece|metal|debris|like|thing| and |small|piece|black|solid|.

It can be seen that the default segmentation still has some flaws. In the jieba result, "Galata" (加拉塔) is split into two tokens, and in the pyltp result, "like" (般) is split off on its own. For better segmentation, both toolkits provide functions to adjust and add dictionaries.

The function to adjust the dictionary in jieba is jieba.add_word(word, freq=None, tag=None), which accepts three parameters: the new word, its frequency, and its part of speech. jieba can also load a custom dictionary with jieba.load_userdict(f), where f is a UTF-8-encoded text file. The dictionary format is one word per line; each line has up to three fields separated by spaces: the word, its frequency (optional), and its part of speech (optional), and the order of the fields cannot be reversed.

pyltp can load a custom dictionary at the same time as the model, via Segmentor.load_with_lexicon(model_path, user_dict): the first parameter is the path to the model file, and the second is the custom dictionary. The dictionary format is one entry per line; the first column is the word, and the second through nth columns are the word's candidate parts of speech.

In this example, the custom dictionaries for jieba and pyltp contain only the words themselves; the other fields are omitted. The dictionary contents are as follows:

下意识
加拉塔

The word segmentation code after adjusting the dictionary is as follows:

#coding:utf-8
import jieba
from pyltp import Segmentor
from pyltp import CustomizedSegmentor

text='奎金斯距离祭台很近,加拉塔“掉落”的部分事物就在他的面前,他下意识就捡起了其中一块金属碎屑般的事物和小片黑色固体。'

jieba.add_word('奎金斯')
jieba.add_word('加拉塔')
segs1=jieba.cut(text)
print('|'.join(segs1))

jieba.load_userdict('userdict_jieba.txt')
segs1=jieba.cut(text)
print('|'.join(segs1))

segmentor=Segmentor()
cws_model="D:\\ltp_data\\cws.model"
user_dict="userdict_ltp.txt"
segmentor.load_with_lexicon(cws_model,user_dict)
segs2=segmentor.segment(text)
print('|'.join(segs2))
segmentor.release()

Word segmentation result:

Quiggins|distance|altar|very close|, |Galata|"|fall|drop|"|| Rise|made|into|pieces|metal|debris|like|things| and |small pieces|black|solid|.
Quiggins|distance|altar|very close|, |Galata|"|fall|drop|"|| Rise|made|into|pieces|metal|debris|like|things| and |small pieces|black|solid|.
[INFO] 2018-04-21 17:49:06 loaded 2 lexicon e
quiggins|distance|altar|very|close|,|galata|"|drop|"|of|parts|things|just |in front of |his|'s|, |he|subconsciously| picked up|picked up|of |a|piece|piece|metal|debris|like|thing|and |small|piece|black|solid| .

In addition to custom dictionaries, pyltp also supports personalized segmentation. Personalized segmentation addresses the problem of applying a segmenter, trained on news text, to other domains such as novels or finance. When switching to a new domain, the user only needs to annotate a small amount of data; the personalized model is then trained incrementally on top of the original news-domain model. This exploits the rich data of the news domain while still accounting for the specifics of the target domain.

pyltp supports loading user-trained personalized models. Training such a model requires LTP itself; for details and training methods, see http://ltp.readthedocs.org/zh_CN/latest/theory.html#customized-cws-reference-label.
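A minimal sketch of loading a personalized model with pyltp's CustomizedSegmentor, based on the pyltp documentation. The model paths here are assumptions; CustomizedSegmentor.load takes the base cws.model plus the incrementally trained personalized model produced by LTP.

```python
def segment_with_personal_model(text, base_model, custom_model):
    """Segment text with a personalized model (a sketch; model paths are assumed).

    CustomizedSegmentor.load() takes two paths: the base segmentation
    model (cws.model) and the incrementally trained personalized model.
    """
    from pyltp import CustomizedSegmentor  # imported lazily: requires pyltp
    segmentor = CustomizedSegmentor()
    segmentor.load(base_model, custom_model)
    words = list(segmentor.segment(text))
    segmentor.release()  # release the models
    return words

# Usage, assuming the model files exist on disk:
# words = segment_with_personal_model(text,
#                                     'D:\\ltp_data\\cws.model',
#                                     'customized.model')
```

Apart from the two-model load call, the segment/release workflow is the same as for the plain Segmentor shown earlier.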
