The most complete introduction to word segmentation algorithms and tools

Tokenization, also known as word segmentation, means identifying the individual words in a sentence and separating them, so that the representation of a text is upgraded from a "character sequence" to a "word sequence". Segmentation is not only for Chinese; it applies to English, Japanese, Korean, and other languages as well.

Although English has natural word separators (spaces), words often stick to punctuation. In "Hey, how are you.", for example, "Hey" and "you" still need to be separated from the punctuation that follows them.

Table of Contents

  • Why segment words?
  • Can we skip word segmentation?
  • What makes Chinese word segmentation hard?
  • Algorithms: from dictionaries to pre-training
  • Tools

Why segment words?

For Chinese, if we do not segment, the neural network processes and learns directly from the raw character sequence. However, the meaning of a character can differ wildly depending on the word it appears in: the "哈" in "哈哈" ("haha") and the "哈" in "哈士奇" ("husky") mean very different things. If the model never saw "哈士奇" during training, then at prediction time it may well decide that a sentence containing "哈士奇" expresses a cheerful mood ╮(¯▽¯"")╭

Obviously, word segmentation solves this problem: after segmentation, the smallest input unit is no longer the character but the word, so "哈哈" and "哈士奇" become two different tokens and the issue above naturally disappears. In other words, word segmentation eases the problem of character-level ambiguity.

In addition, from the perspective of NLP tasks and features, characters are more primitive, lower-level features than words and usually correlate less with the task objective; at the word level, strong correlations with the objective start to appear. For example, in sentiment classification, take a sentence like "我今天走狗屎运了" ("I got dog-shit lucky today"): none of the individual characters relates to positive emotion, and "狗" (dog) and "屎" (shit) usually co-occur with negative sentiment, yet the word "狗屎运" as a whole expresses "lucky", "happy", "surprised" — positive affect. Segmentation can therefore be seen as handing the model higher-level, more direct features, which naturally makes good performance easier to obtain.

Can we skip word segmentation?

The answer is: of course. From the purpose of segmentation described above, as long as the model can itself learn the ambiguity of characters and learn from characters how words are formed, it has effectively built an implicit word segmenter inside the model. This built-in segmenter is then trained end-to-end together with the network's actual objective, so it may even deliver better performance.

However, as that description suggests, this condition is rarely satisfied. It requires a very rich training corpus and a sufficiently large model (with spare capacity to build the implicit segmentation model) before a character-level model can beat the "word segmentation + word-level model" setup. This is why BERT and other large pre-trained models usually work at the character level for Chinese and do not use a word segmenter at all.

Moreover, word segmentation is not harmless. Once segmentation accuracy is insufficient, or the corpus itself is noisy (typos, messy sentences, all kinds of non-standard language), forced segmentation will actually make learning harder for the model. For example, suppose the model has learned "哈士奇" as a word, but someone mistypes it as "蛤士奇"; the segmenter fails to recognize it and splits it into the three characters "蛤" (clam), "士" (scholar) and "奇" (strange), so our trained "word-level model" never sees a "哈士奇" at all (after all, "哈士奇" was a basic unit during training).

What makes Chinese word segmentation hard?

1 Ambiguity

First, as mentioned above, word segmentation eases character-level ambiguity, but segmentation itself faces the problem of segmentation ambiguity. Take, for example, splitting the title "无线电法国别研究" ("Country-specific studies of radio law"):

 

[Figure: the two candidate segmentations of the example title]

Cutting it as "无线电 / 法国 / 别 / 研究" ("radio / France / don't / research") does not look wrong in isolation, but given that this is a title, the intended reading is clearly "无线电法 / 国别 / 研究" ("radio law / by country / studies") — getting this right is just too hard (. ︿ .). And if nobody tells you it is a title, both segmentations look perfectly fine.

2 Unknown words

In addition, the Chinese lexicon keeps moving with the times: internet coinages such as "累觉不爱" ("too tired to love") were not words ten years ago. Training data will likewise keep running into words it does not know in the near future (so-called "unknown words", out-of-vocabulary words, OOV), and at that point segmentation errors arise simply because the segmenter is "out of date".

3 No unified standard

Finally, there has never been an established norm for where word boundaries lie. Although the national standard "Contemporary Chinese Word Segmentation Specification for Information Processing" was issued in 1992, the specification is easily influenced by subjective factors and inevitably falls somewhat short in real-world scenarios.

Algorithms: from dictionaries to pre-training

1 Dictionary-based methods

For Chinese word segmentation, the simplest algorithm is greedy matching against a dictionary.

For example, we can look up words starting from the beginning of the sentence, find the longest dictionary word that starts there, and take that as the first segmented word. For the sentence "夕小瑶正在讲NLP" ("Xi Xiaoyao is talking about NLP"), the longest dictionary word found at the start is "夕小瑶", giving

夕小瑶 / 正在讲NLP

Matching then resumes from the next character, where the longest dictionary word found is "正在", so

夕小瑶 / 正在 / 讲NLP

And so on, finally yielding

夕小瑶 / 正在 / 讲 / NLP

This simple algorithm is called forward maximum matching (FMM).

The approach is dead simple, but the name sounds rather fancy ╮(╯▽╰)╭

However, because the key information in a Chinese sentence tends to appear toward the end, matching from back to front tends to give a higher rate of correct segmentations than matching front to back, hence the reversed variant, backward maximum matching (BMM). Of course, both FMM and BMM are bound to make plenty of segmentation errors, so a more considered approach is bi-directional maximum matching.

Bi-directional maximum matching segments the sentence with FMM and BMM separately, then gives further treatment to the ambiguous sentences where the two results do not coincide. A common strategy is to compare the number of words produced by the two methods and act according to whether the counts agree, so as to reduce the error rate on ambiguous sentences.
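To make the mechanics concrete, here is a minimal Python sketch of forward and backward maximum matching. The toy dictionary and the helper names (fmm, bmm) are made up for illustration; a real segmenter would load a large dictionary.

# A toy illustration of FMM / BMM (dictionary and function names are made up)
DICT = {"夕小瑶", "正在", "讲", "NLP", "没关系", "除夕", "小瑶", "在家", "做饭"}
MAX_LEN = max(len(w) for w in DICT)

def fmm(sentence, dictionary=DICT, max_len=MAX_LEN):
    # forward maximum matching: greedily take the longest dictionary word from the left
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + j] in dictionary or j == 1:
                words.append(sentence[i:i + j])
                i += j
                break
    return words

def bmm(sentence, dictionary=DICT, max_len=MAX_LEN):
    # backward maximum matching: same idea, scanning from the right end
    words, i = [], len(sentence)
    while i > 0:
        for j in range(min(max_len, i), 0, -1):
            if sentence[i - j:i] in dictionary or j == 1:
                words.insert(0, sentence[i - j:i])
                i -= j
                break
    return words

print(fmm("夕小瑶正在讲NLP"))   # ['夕小瑶', '正在', '讲', 'NLP']

Bi-directional matching would simply run both functions and compare the two outputs when they disagree.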

2 Statistics-based methods

2.1 Segmentation with a language model

Dictionary-based methods are simple, but they are clearly not! very! smart! Take a slightly more complex sentence such as "没关系，除夕小瑶在家做饭。" ("It's fine, on New Year's Eve Xiao Yao cooks at home."). Maximum matching cuts it into "没关系 / ， / 除 / 夕小瑶 / 在家 / 做饭 / 。", which is obviously, unforgivably wrong.

The root cause of such mistakes is that dictionary-based methods do not consider the context around each cut and do not look for a globally optimal solution. In fact, the sentence above is really just torn between two candidate segmentations:

a. 没关系 / ， / 除 / 夕小瑶 / 在家 / 做饭 / 。
b. 没关系 / ， / 除夕 / 小瑶 / 在家 / 做饭 / 。

In everyday speech, hardly anyone ever says a sentence shaped like segmentation a, whereas sentences shaped like b occur very frequently — the "小瑶" inside could just as well be "我", "老王", and so on. Clearly, for a given sentence the possible segmentations form a finite set of combinations; if only there were something that could score how plausible each combination is, finding the best segmentation would simply be a matter of picking the best-scoring combination!

So the essence of this approach is to find the most plausible word combination among all the candidate cuts, and this process can be viewed as finding the maximum-probability path in a word segmentation graph:

 

[Figure: the word segmentation graph of the example sentence]

And the thing that can score how plausible a word sequence is is called a language model. Using a language model to evaluate the candidate segmentations — isn't that much smarter ╮(╯▽╰)╭. Given the word sequence {w_1, w_2, \ldots, w_m} produced by one segmentation of the sentence, the language model computes the probability that this sentence (i.e. this word sequence) exists:

P(w_1, w_2, \ldots, w_m)

This expression can be expanded by the chain rule:

 

P(w_1, w_2, \ldots, w_m) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, w_2, \ldots, w_{m-1})

Clearly, when m is even slightly large, the later factors in this product become very hard to estimate (estimating those conditional probabilities requires a very large corpus to keep the estimation error acceptable). What to do when something is hard to compute? Make a reasonable simplifying assumption, of course. For instance, we can assume that which word appears at the current position depends only on the n positions immediately before it, i.e.

 

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n}, \ldots, w_{i-1})

A language model simplified this way is called an n-gram language model. Each factor in the product can then be estimated by counting on a manually segmented corpus. In practice, smoothing techniques are also introduced to compensate for estimation errors caused by the limited size of the corpus, but we will not go into that here.
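As an illustration of "finding the maximum-probability path in the word graph", here is a minimal sketch that uses a unigram model (each word scored independently of its neighbours) with made-up probabilities; a real system would estimate them from a segmented corpus and typically use higher-order n-grams.

import math

# toy unigram probabilities "estimated" from a segmented corpus (the values are made up)
UNIGRAM = {"没关系": 1e-4, "，": 1e-2, "除夕": 5e-5, "小瑶": 1e-6, "除": 1e-4,
           "夕小瑶": 1e-7, "在家": 1e-4, "做饭": 1e-4, "。": 1e-2}
OOV_LOGP = math.log(1e-8)   # fallback score for unknown single characters
MAX_LEN = 3

def best_segmentation(sentence):
    # dynamic programming over the word graph:
    # best[i] = highest log-probability of any segmentation of sentence[:i]
    n = len(sentence)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)            # back[i]: start index of the last word ending at i
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            word = sentence[j:i]
            if word in UNIGRAM:
                logp = math.log(UNIGRAM[word])
            elif len(word) == 1:
                logp = OOV_LOGP
            else:
                continue            # unknown multi-character strings are not valid words here
            if best[j] + logp > best[i]:
                best[i], back[i] = best[j] + logp, j
    words, i = [], n                # recover the best path by walking the back-pointers
    while i > 0:
        words.insert(0, sentence[back[i]:i])
        i = back[i]
    return words

print(best_segmentation("没关系，除夕小瑶在家做饭。"))
# -> ['没关系', '，', '除夕', '小瑶', '在家', '做饭', '。'] with these toy probabilities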

2.2 Statistical machine learning

NLP is a discipline tightly coupled with machine learning, and word segmentation is no exception. Chinese word segmentation can be modeled as a sequence labeling problem, i.e. classifying each character while taking its context into account. We can therefore first train a sequence labeling model on a word-segmented (annotated) corpus, and then use that model to segment unlabeled text.

Sample labels

Usually the four classes {B: begin, M: middle, E: end, S: single} are used to describe the class of each character in a sample. They denote the character's position within its word: B marks the first character of a word, M a character in the middle of a word, E the last character of a word, and S a character that forms a word by itself.

A sample is shown below:

人/B 们/E 常/S 说/S 生/B 活/E 是/S 一/S 部/S 教/B 科/M 书/E ("People often say that life is a textbook")
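Such training labels can be generated automatically from an already-segmented corpus line; a quick sketch (the helper name words_to_bmes is mine):

def words_to_bmes(words):
    # convert an already-segmented word list into per-character {B, M, E, S} tags
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append((w, "S"))
        else:
            tags.append((w[0], "B"))
            tags.extend((c, "M") for c in w[1:-1])
            tags.append((w[-1], "E"))
    return tags

print(words_to_bmes(["人们", "常", "说", "生活", "是", "一", "部", "教科书"]))
# [('人','B'), ('们','E'), ('常','S'), ('说','S'), ('生','B'), ('活','E'),
#  ('是','S'), ('一','S'), ('部','S'), ('教','B'), ('科','M'), ('书','E')]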

We can then directly apply statistical machine learning models to train a word segmenter. Representative sequence labeling models are the hidden Markov model (HMM) on the generative side and the (linear-chain) conditional random field (CRF) on the discriminative side. Readers already familiar with these two models can skip ahead.

Hidden Markov Models (HMM)

For details on the HMM itself, see:

"If you fell in love with Xiao Xi... (part 1)" / "(part 2)"

Having covered the basic concepts of the HMM, let's see how an HMM model actually does word segmentation. The basic idea: turn segmentation into a classification problem for each character position in the sentence, i.e. a sequence labeling problem, with the four classes (B, M, E, S) mentioned above. Once every position has been classified, the word segmentation can be read directly off the resulting tag sequence.

Let's look at an example!

Our input is a sentence:

        小Q硕士毕业于中国科学院
      

Running the algorithm, we successfully predict the tag corresponding to each character:

        BEBEBMEBEBME
      

According to this state sequence we can cut the words:

        BE/BE/BME/BE/BME
      

So the word segmentation result is:

        小Q/硕士/毕业于/中国/科学院
      

Now the question becomes: given a ready-made HMM segmentation model, how do we use it to label an input character sequence? First look at the two core concepts of an HMM: the observation sequence and the state sequence.

The observation sequence is the sequence we can see directly — here the character sequence "小Q硕士毕业于中国科学院" — while the state sequence is the internal sequence that cannot be observed directly with the naked eye, which here is the tag sequence "BEBEBMEBEBME" above. What our HMM model does is take us from the observation sequence to the state sequence in one gorgeous turn!

Expressed mathematically: let \Lambda = \lambda_1 \lambda_2 \ldots \lambda_n denote the input sentence, where n is the sentence length and \lambda_i is a single character, and let o = o_1 o_2 \ldots o_n denote the output tag sequence. The ideal output is then:

 

\max_{o_1 o_2 \ldots o_n} P(o_1 o_2 \ldots o_n \mid \lambda_1 \lambda_2 \ldots \lambda_n)
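Putting the pieces together, here is a minimal Viterbi decoding sketch for this formulation. The start, transition and emission probabilities are assumed to have already been estimated (by counting) from a BMES-labeled corpus; the function names are mine and this is an illustration rather than a production HMM segmenter.

import math

STATES = ["B", "M", "E", "S"]

def viterbi(obs, start_p, trans_p, emit_p, min_p=1e-12):
    # obs: character sequence; start_p[tag], trans_p[prev][cur], emit_p[tag][char]
    # are probabilities assumed to be estimated from a BMES-labeled corpus
    V = [{s: math.log(start_p.get(s, min_p)) + math.log(emit_p[s].get(obs[0], min_p))
          for s in STATES}]
    path = {s: [s] for s in STATES}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in STATES:
            logp, prev = max(
                (V[t - 1][p] + math.log(trans_p[p].get(s, min_p))
                 + math.log(emit_p[s].get(obs[t], min_p)), p)
                for p in STATES)
            V[t][s] = logp
            new_path[s] = path[prev] + [s]
        path = new_path
    best_last = max(STATES, key=lambda s: V[-1][s])
    return path[best_last]

def tags_to_words(chars, tags):
    # cut the character sequence according to the predicted BMES tags
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words

# tags_to_words("小Q硕士毕业于中国科学院", list("BEBEBMEBEBME"))
# -> ['小Q', '硕士', '毕业于', '中国', '科学院']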

 

3 Neural-network-based methods

As everyone knows, deep learning has successfully taken over NLP, sweeping classification, sequence labeling and generation problems alike. As mentioned earlier, word segmentation can be modeled as sequence labeling, so the LSTM (long short-term memory network), which is good at processing sequence data, and the recently red-hot pre-trained models can also be used for Chinese word segmentation.

3.1 (Bi-)LSTM-based segmentation

Readers not yet familiar with the LSTM can read Xiao Xi's earlier "Step-by-step LSTM" piece; the basic theory of the LSTM will not be repeated here.

As noted in the language model section, context is crucial for resolving segmentation ambiguity, and the longer the context considered, the greater the ability to resolve ambiguity. The n-gram language model can only look at context within a fixed distance; is there a segmentation model that can, in theory, consider context at unlimited distance? The answer is the LSTM-based approach. Of course, an LSTM is directional; so that the classification at each position can consider both all the history (all the characters to the left) and all the future (all the characters to the right), we can use a bidirectional LSTM (Bi-LSTM) as the backbone of the sequence labeling model, as shown below.

 

[Figure: Bi-LSTM sequence labeling architecture for segmentation]

After the LSTM has encoded the context for every position, a softmax classification layer completes the final per-position classification, and the words are then cut from the resulting tag sequence, completing Chinese word segmentation as sequence labeling just like the CRF- and HMM-based approaches.
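A minimal PyTorch sketch of this Bi-LSTM + softmax tagger (the class name, dimensions and training details are placeholders, not any particular toolkit's implementation):

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    # Bi-LSTM sequence tagger: one {B, M, E, S} decision per character position
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim, num_tags)   # per-position classification over B/M/E/S

    def forward(self, char_ids):                     # char_ids: (batch, seq_len)
        emb = self.embed(char_ids)                   # (batch, seq_len, embed_dim)
        ctx, _ = self.lstm(emb)                      # (batch, seq_len, hidden_dim), both directions concatenated
        return self.fc(ctx)                          # (batch, seq_len, num_tags) logits

# Training would minimize per-position cross-entropy against the gold BMES tags:
# loss = nn.CrossEntropyLoss()(logits.view(-1, 4), gold_tags.view(-1))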

3.2 Pre-trained models + knowledge distillation

Over the last year or so, large pre-trained models such as BERT, ERNIE and XLNet have swept most areas of NLP, and they show a clear advantage on word segmentation as well.

 


However, as is well known, pre-trained models are huge and devour computing resources; for segmenting massive amounts of text, even eight 32 GB Tesla V100 cards would feel inadequate. One solution is to transfer the pre-trained model's segmentation knowledge to a small model (such as an LSTM or GRU) via knowledge distillation. The advanced segmentation model (actually a general lexical analysis model) that Jieba recently brought online was obtained exactly this way; interested readers can look into it on their own. There is already plenty of material on pre-trained models and knowledge distillation, so it is not repeated here.
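For intuition, the core of such distillation is usually a loss that mixes the normal hard-label cross-entropy with a soft-target term pulling the small student model's per-character tag distribution toward the big teacher's. A sketch of the generic recipe (this is not Jieba's or Baidu's actual training code):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_tags, T=2.0, alpha=0.5):
    # hard-label term: ordinary cross-entropy against the gold BMES tags
    hard = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           gold_tags.view(-1))
    # soft-target term: KL divergence between temperature-softened distributions
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft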

 

Tools

Below are a few of the more mainstream segmentation tools (in alphabetical order — all worth trying). The related papers can be obtained in the backend of the WeChat account "夕小瑶的卖萌屋".

1 Jieba

When it comes to segmentation tools, the first that comes to mind is surely the well-known "Jieba" (结巴, "stutter") Chinese word segmenter. Its main algorithm is the statistical approach described earlier of finding the best path in the word segmentation graph, and it recently also integrated a cutting-edge segmentation model distilled from Baidu's large-scale pre-trained model via the PaddlePaddle framework.

github project Address: https://github.com/fxsjy/jieba
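Example of use (a minimal sketch using jieba's default precise mode, with the same test sentence as the tools below):

#Jieba
#pip install jieba
import jieba

sentence = "不会讲课的程序员不是一名好的算法工程师"
print("Jieba: " + " ".join(jieba.cut(sentence)))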

 

2 THULAC(THU Lexical Analyzer for Chinese)

A Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University, providing Chinese word segmentation and POS tagging. The segmentation model it uses is a structured perceptron. For algorithm details, please refer to the github project and read the original paper.

github project Address: https://github.com/thunlp/THULAC

Papers Link: https://www.mitpressjournals.org/doi/pdf/10.1162/coli.2009.35.4.35403

Example of use:

#THULAC
#pip install thulac
import thulac

sentence = "不会讲课的程序员不是一名好的算法工程师"
thu1 = thulac.thulac(seg_only=True)  # segmentation only, no POS tagging
text = thu1.cut(sentence, text=True)  # segment a single sentence
print("THULAC: " + text)

#output
#Model loaded succeed
#THULAC: 不 会 讲课 的 程序员 不 是 一 名 好 的 算法 工程师

 

3 NLPIR-ICTCLAS Chinese word segmentation system

Released by the Big Data Search and Mining Lab (Big Data Search and Mining Lab. BDSM@BIT) of the Research Center for Massive Language Information Processing and Cloud Computing Engineering at Beijing Institute of Technology. It is based on a cascaded (hierarchical) HMM: the segmentation lexicon, word segmentation, POS tagging and NER are all incorporated into one framework and obtained through joint training of the cascaded HMM.

Home page: http://ictclas.nlpir.org/

github project Address: https://github.com/tsroten/pynlpir

Example of use:

#NLPIR-ICTCLAS
#pip install pynlpir
import pynlpir

sentence = "不会讲课的程序员不是一名好的算法工程师"
pynlpir.open()
tokens = [x[0] for x in pynlpir.segment(sentence)]
print("NLPIR-TCTCLAS: " + " ".join(tokens))
pynlpir.close()

#output
#NLPIR-TCTCLAS: 不 会 讲课 的 程序员 不 是 一 名 好 的 算法 工程

 

4 LTP

Produced by Harbin Institute of Technology (HIT). Like THULAC, LTP is also based on the structured perceptron (Structured Perceptron, SP), with the segmentation model learned under the maximum entropy criterion.

Project home: https://www.ltp-cloud.com/

github project Address: https://github.com/HIT-SCIR/ltp

Papers Link: http://jcip.cipsc.org.cn/CN/abstract/abstract1579.shtml

Example of use: the segmentation model must be downloaded before use ( http://ltp.ai/download.html )
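A minimal sketch using the pyltp interface (assuming the downloaded cws.model has been unpacked into a local ltp_data/ directory; the path is a placeholder):

#LTP
#pip install pyltp
from pyltp import Segmentor

sentence = "不会讲课的程序员不是一名好的算法工程师"
segmentor = Segmentor()
segmentor.load("ltp_data/cws.model")  # path to the downloaded segmentation model (placeholder)
print("LTP: " + " ".join(segmentor.segment(sentence)))
segmentor.release()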

      
 

5 HanLP

HanLP is the open-source NLP algorithm library that accompanies the book "Introduction to Natural Language Processing" (《自然语言处理入门》). Besides the classic 1.x line, which keeps being updated and iterated, a brand-new 2.0 version was also launched this year. The 1.x versions provide a dictionary-based segmentation tool and a CRF-based segmentation model; the open-source segmentation tools in version 2.0 are based on deep learning algorithms.

Version 1.x

github project Address: https://github.com/hankcs/pyhanlp

Version 2.0

github Address: https://github.com/hankcs/HanLP/tree/doc-zh

Example of use (version 2.0 requires Python 3.6 or above):

#HanLP
#v2.0
#pip install hanlp
import hanlp

sentence = "不会讲课的程序员不是一名好的算法工程师"
tokenizer = hanlp.load('PKU_NAME_MERGED_SIX_MONTHS_CONVSEG')
tokens = tokenizer(sentence)
print("hanlp 2.0: " + " ".join(tokens))
#output
#hanlp 2.0: 不 会 讲课 的 程序员 不 是 一 名 好 的 算法 工程

 

6 Stanford CoreNLP

A segmentation tool from Stanford that supports multiple languages. The core segmentation algorithm is based on a CRF model.

github project Address: https://github.com/Lynten/stanford-corenlp

Papers Link: https://nlp.stanford.edu/pubs/sighan2005.pdf

Example of use: you need to download the Chinese models from the Stanford CoreNLP website first ( https://stanfordnlp.github.io/CoreNLP/ )

###stanford CoreNLP
#pip install stanfordcorenlp
from stanfordcorenlp import StanfordCoreNLP

sentence = "不会讲课的程序员不是一名好的算法工程师"
with StanfordCoreNLP(r'stanford-chinese-corenlp-2018-10-05-models', lang='zh') as nlp:
    print("stanford: " + " ".join(nlp.word_tokenize(sentence)))

 
