Chinese word segmentation for Python

Chinese word segmentation

Dictionary-Based Word Segmentation Method

Also called the mechanical word segmentation method, or the string-matching word segmentation method.

Following a chosen strategy, the Chinese character string to be segmented is matched against entries in a sufficiently large machine-readable dictionary.
Three elements: (1) the segmentation dictionary, (2) the text scanning order, (3) the matching principle

According to the scanning order, methods can be divided into forward scanning, reverse scanning, and bidirectional scanning.

The matching principles mainly include maximum match, minimum match, word-by-word match, and best match.

Maximum matching algorithm (forward/reverse):

  1. Let n be the length of the longest word in the dictionary;
  2. Take a string of length n from the sentence to be segmented and match it against the dictionary;
  3. If the match succeeds, segment the string off as a word;
  4. If the match fails, remove one Chinese character from the string and match again; repeat until the whole sentence has been segmented.

The forward maximum matching algorithm removes the last character each time, with an error rate of about 0.6%. The reverse maximum matching algorithm removes the first character each time, with an error rate of about 0.4%.
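
As an illustration (not part of the original post), here is a minimal sketch of forward maximum matching; word_dict and max_len are made-up example values:

def forward_max_match(sentence, word_dict, max_len):
    # Greedily take the longest dictionary word from the left end of the
    # remaining sentence; fall back to a single character if nothing matches.
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in word_dict:
                words.append(piece)
                i += size
                break
    return words

word_dict = {"我", "爱", "北京", "天安门"}  # toy dictionary
print(forward_max_match("我爱北京天安门", word_dict, max_len=3))
# ['我', '爱', '北京', '天安门']

Reverse maximum matching works the same way from the right end of the sentence, dropping the first character of the candidate string on a failed match.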

Preprocessing optimization

  1. For sentences: set up segmentation marks (see the sketch after this list)

    1. Natural segmentation marks: non-Chinese-character symbols such as punctuation marks
    2. Non-natural segmentation marks: affixes and non-word characters (e.g., particles)
  2. For dictionaries: order the dictionary entries by word frequency.
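
A small sketch of the first optimization, assuming we simply cut the input at punctuation marks before dictionary matching (the regular expression below is only illustrative):

import re

def split_at_punctuation(text):
    # Natural segmentation marks: break the text at punctuation so that
    # maximum matching only has to handle short fragments.
    return [frag for frag in re.split(r"[，。！？、；：,.!?;:\s]+", text) if frag]

print(split_at_punctuation("我爱北京，天安门。"))
# ['我爱北京', '天安门']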

Disadvantages: cannot resolve segmentation ambiguity; cannot handle unregistered (out-of-vocabulary) words.

Statistical word segmentation method

A word is a stable combination of characters: the more often adjacent characters appear together, the more likely they are to form a word.

Algorithm outline: count how often adjacent characters co-occur in the text and compute their mutual information; if it exceeds a certain threshold, the character pair may constitute a word.

Main models used: the n-gram model, the HMM model, and the maximum entropy model.

Mutual information: the degree of mutual dependence (correlation, influence) of two discrete random variables X and Y.
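
As a sketch of the statistical idea (not from the original post), the mutual information of an adjacent character pair can be estimated as MI(x, y) = log2(P(x, y) / (P(x)P(y))); pairs scoring above a chosen threshold become word candidates. The toy corpus below is made up:

import math
from collections import Counter

def adjacent_pmi(corpus):
    # Count single characters and adjacent character pairs, then estimate
    # MI(x, y) = log2(P(x, y) / (P(x) * P(y))) for every observed pair.
    char_counts, pair_counts = Counter(), Counter()
    for sent in corpus:
        char_counts.update(sent)
        pair_counts.update(zip(sent, sent[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    return {
        a + b: math.log2((n / total_pairs) /
                         ((char_counts[a] / total_chars) * (char_counts[b] / total_chars)))
        for (a, b), n in pair_counts.items()
    }

corpus = ["我爱北京", "北京欢迎你", "我在北京"]  # toy corpus
scores = adjacent_pmi(corpus)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])  # strongest word candidates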

In practical applications, statistical-based methods are generally combined with dictionary-based methods.

Word segmentation method based on semantics and understanding

The main metrics for evaluating a word segmentation system are precision, recall, and the F value.

N: the number of words in the gold-standard segmentation; e: the number of words incorrectly segmented by the tokenizer; c: the number of words correctly segmented by the tokenizer

Precision (P) indicates how accurate the tokenizer's segmentation is: P = c/(c + e)

Recall (R) indicates how completely the tokenizer finds the correct words: R = c/N

The F value combines precision and recall into a single score: F = 2PR/(P + R)

The error rate indicates how error-prone the tokenizer is: ER = e/N

Larger P, R, and F are better; a smaller ER is better. For a perfect tokenizer, P, R, and F are all 1 and ER is 0.
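
A tiny sketch of the formulas above, with made-up counts for c, e, and N:

def segmentation_metrics(c, e, N):
    # c: correctly segmented words, e: wrongly segmented words,
    # N: words in the gold-standard segmentation.
    P = c / (c + e)
    R = c / N
    F = 2 * P * R / (P + R)
    ER = e / N
    return P, R, F, ER

# Example: the gold standard has 100 words; the tokenizer gets 95 right, 10 wrong.
P, R, F, ER = segmentation_metrics(c=95, e=10, N=100)
print(f"P={P:.3f} R={R:.3f} F={F:.3f} ER={ER:.3f}")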

Word segmentation tools

Jieba algorithm:

Basic principle: if the words to be segmented are in the dictionary, they are read from the dictionary and segmented directly; words not in the dictionary are segmented using the Viterbi algorithm.

Jieba's three word segmentation modes: precise mode, which splits the sentence as accurately as possible; full mode, which scans out all the words in the sentence that can form words; and search engine mode, which further splits long words on top of precise mode.

The jieba.cut() function is the main function for segmenting Chinese sentences. It is called as follows:

import jieba
jieba.cut(sentence, cut_all=False, HMM=True)
# sentence: the string to be segmented
# cut_all: segmentation mode; True for full mode, False for precise mode
# HMM: whether to use the HMM model

example:

import jieba
sentence = "我爱北京天安门"  # the string to segment
s = jieba.cut(sentence)
print(list(s))
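
To compare the three modes side by side, the same sentence can be passed to precise, full, and search engine mode (jieba.cut_for_search is the search engine mode entry point):

import jieba

sentence = "我爱北京天安门"
print(list(jieba.cut(sentence, cut_all=False)))  # precise mode
print(list(jieba.cut(sentence, cut_all=True)))   # full mode
print(list(jieba.cut_for_search(sentence)))      # search engine mode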

THULAC word segmentation toolkit

The thulac() function in the THULAC toolkit initializes a model:

thulac(user_dict=None, model_path=None, T2S=False, seg_only=False, filt=False)  # initialize the model with custom settings
# user_dict: path to a user dictionary

How to call the THULAC model:

  1. cut() segments a sentence
cut(sentence, text=False)  # text: whether to return the result as text; default False
  2. cut_f() segments a file
cut_f(input_file, output_file)  # input file and output file

example:

import thulac
thu1 = thulac.thulac()  # default mode
text = thu1.cut("我爱北京天安门", text=True)  # segment a single sentence
# result: 我_r 爱_v 北京_ns 天安门_ns
print(text)
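
And a minimal sketch of file segmentation with cut_f, assuming a file named input.txt exists in the working directory (the file names here are placeholders):

import thulac

thu2 = thulac.thulac(seg_only=True)  # segmentation only, no POS tags
# Read input.txt, segment it, and write the segmented text to output.txt.
thu2.cut_f("input.txt", "output.txt")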

Practice

Segment the following sentences:

import thulac 
thu1 = thulac.thulac()  # default mode

# result: 他_r 用_v 了_u 两_m 个_q 半天_m 写_v 完_v 了_u 这_r 篇_q 文章_n 。_w
text1 = thu1.cut("他用了两个半天写完了这篇文章。", text=True) 
print(text1)

# result: 我等_r 她_r 等_v 了_u 半天_m 。_w
text2 = thu1.cut("我等她等了半天。", text=True) 
print(text2)
import jieba

# result: ['人们', '朝向', '不同', '的', '出口', '。']
s1 = jieba.cut("人们朝向不同的出口。", cut_all=False, HMM=True)
print(list(s1))


# result: ['我们', '出发', '的', '时间', '不同', '。']
s2 = jieba.cut("我们出发的时间不同。", cut_all=False, HMM=True)
print(list(s2))
