Study notes CB007: word segmentation, named entity recognition, part-of-speech tagging, syntactic parse tree

Chinese word segmentation divides text into words. The process can also run in the other direction: characters and words that belong together are merged back into larger units, which is how named entities are found.

A conditional random field (CRF), a probabilistic graphical model, applies when the random variables to be labeled take a finite number of values conditioned on the observed sequence. Given an observation sequence X, the probability of a particular label sequence Y is proportional to the exponential function exp(∑λt + ∑μs), where the t are transition feature functions weighted by λ and the s are state feature functions weighted by μ. This form conforms to the maximum entropy principle. CRF-based named entity recognition is a supervised learning method, trained on a large-scale annotated corpus.
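To get a feel for the exp(∑λt + ∑μs) form, here is a minimal sketch that scores a candidate label sequence with one made-up transition feature and one made-up state feature. The feature functions, weights, and tiny label set are all invented for illustration; a real NER CRF learns many such features and their weights from the annotated corpus, and normalizes the scores over all label sequences.

```python
import math

def transition_feature(prev_label, label):
    # Toy t-feature: fires when a PER tag follows a PER tag (entities continue).
    return 1.0 if prev_label == "PER" and label == "PER" else 0.0

def state_feature(word, label):
    # Toy s-feature: fires when a capitalized word is tagged PER.
    return 1.0 if word[0].isupper() and label == "PER" else 0.0

def sequence_score(words, labels, lam=1.5, mu=2.0):
    """Unnormalized CRF score exp(lam*sum_t + mu*sum_s) for one labeling."""
    t_sum = sum(transition_feature(labels[i - 1], labels[i])
                for i in range(1, len(labels)))
    s_sum = sum(state_feature(w, y) for w, y in zip(words, labels))
    return math.exp(lam * t_sum + mu * s_sum)

words = ["John", "Smith", "eats"]
good = sequence_score(words, ["PER", "PER", "O"])
bad = sequence_score(words, ["O", "O", "PER"])
print(good > bad)  # the plausible labeling gets the higher score
```

Decoding picks the label sequence with the highest such score; the maximum entropy principle guarantees this exponential form makes no assumptions beyond the feature constraints.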

Named entities radiate clues into their surroundings: the words immediately preceding and following an entity (for example, a title before a person name) indicate its presence and type.

Feature templates: the characters, words, letters, digits, and punctuation at the n positions before and after the current position are used as features. Because the corpus is annotated, the part of speech and word form at each position are known. The choice of feature template depends on the specific entity category being recognized.
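A window-based feature template might be sketched as follows; the window size n, the feature-name format, and the digit-shape feature are illustrative choices, not a fixed standard.

```python
def window_features(tokens, i, n=2):
    """Features for position i: surface forms within n positions either side."""
    feats = {"w[0]=" + tokens[i]}
    for k in range(1, n + 1):
        if i - k >= 0:
            feats.add("w[-%d]=%s" % (k, tokens[i - k]))
        if i + k < len(tokens):
            feats.add("w[+%d]=%s" % (k, tokens[i + k]))
    # A shape feature: does the current token contain a digit?
    if any(c.isdigit() for c in tokens[i]):
        feats.add("has_digit")
    return feats

tokens = ["Obama", "visited", "Shanghai", "in", "2014"]
print(sorted(window_features(tokens, 2)))
```

Real templates would also include parts of speech and word shapes at each window position, chosen per entity category as the notes describe.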

Named entity categories: person names (politicians, entertainers, etc.), place names (cities, states, countries, buildings, etc.), organization names, times, numbers, and proper nouns (movie titles, book titles, project names, phone numbers, etc.). Mentions of an entity may be named, nominal (a common noun), or pronominal.

A model trained on word-form context gives the probability of an entity appearing in a given word-form context; a model trained on part-of-speech context gives the probability of an entity in a given part-of-speech context. Likewise, the probability that a given word-form string, or a given part-of-speech string, constitutes an entity can be estimated.

Traditional parts of speech: noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, particle, interjection, and onomatopoeia. NLP tagsets add finer categories such as distinguishing words, locative words, idioms, habitual phrases, organization names, and time words, up to about 100 tags. The biggest difficulty in Chinese part-of-speech tagging is multi-category words: a word takes different parts of speech in different contexts, and the distinction is hard to make from the word form alone.

The part-of-speech tagging process: labeling, in which parts of speech are assigned by rules or statistical methods; and calibration, in which the results are corrected by consistency checking and automatic proofreading.

Statistical part-of-speech tagging: choose a suitable mathematical model and train it on a large labeled corpus. The Hidden Markov Model (HMM), a probabilistic graphical model, is well suited to tagging based on an observation sequence.

Hidden Markov Model parameter initialization. Before training on the corpus, the model parameters are initialized at minimal cost with values as close as possible to the optimal solution. The HMM is a generative model based on conditional probabilities; for the emission (generation) probabilities, the simplest initialization that tends to lie near the optimum sets the probability of a word under each of its possible parts of speech to the reciprocal of the number of parts of speech that word can take. The possible parts of speech for each word come from the lexicon's tag list, which is easy to build since the corpus is already annotated; for tags a word cannot take, the emission probability is initialized to 0.
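A minimal Viterbi decoding sketch can show how such an HMM tags a sentence. The two-word lexicon, the tagset, and the transition probabilities below are all invented for illustration; emissions follow the initialization rule just described (reciprocal of the number of possible tags, 0 for tags the lexicon rules out).

```python
lexicon = {"time": ["N", "V"], "flies": ["N", "V"], "fast": ["ADV", "ADJ"]}
tags = ["N", "V", "ADV", "ADJ"]

def emit(word, tag):
    # Initialization rule from the notes: reciprocal of the number of
    # possible tags for the word; 0 for impossible tags.
    possible = lexicon[word]
    return 1.0 / len(possible) if tag in possible else 0.0

# Invented transition probabilities; unseen transitions get a small floor.
trans = {("N", "V"): 0.6, ("V", "ADV"): 0.5, ("N", "N"): 0.2}

def viterbi(words):
    """Most probable tag sequence under the toy model."""
    V = [{t: emit(words[0], t) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: V[-1][p] * trans.get((p, t), 0.01))
            col[t] = V[-1][best] * trans.get((best, t), 0.01) * emit(w, t)
            ptr[t] = best
        V.append(col)
        back.append(ptr)
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["time", "flies", "fast"]))
```

Training would then refine these initial emission and transition values against the corpus.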

Rule-based part-of-speech tagging. Given context rules describing collocation relationships, the actual context is matched and the part of speech assigned according to the rules. It works where rules already exist and is effective at recognizing multi-category words. Rules can also be extracted automatically by machine learning: the gap between an initial tagger's output and the manual annotation is used to generate corrective tag-transformation rules, an error-driven learning method. After manual calibration, the useful information gathered is used to supplement and adjust the rule base.

Combined statistical and rule-based tagging. Rule-based disambiguation, statistical labeling, and final proofreading together yield the correct result. Statistical tagging is applied first, and a confidence or error rate is computed at the same time to decide whether a result is suspect; only in suspect cases are rules used to resolve the ambiguity, achieving the best overall result.
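The dispatch logic of the combined strategy might look like this sketch: trust the statistical tagger when its confidence clears a threshold, otherwise fall back to a rule. The hard-coded model outputs, the threshold, and the single toy rule are all stand-ins.

```python
THRESHOLD = 0.8

def statistical_tag(word):
    # Stand-in for a trained model's output: (tag, confidence).
    guesses = {"record": ("N", 0.55), "dog": ("N", 0.97)}
    return guesses.get(word, ("N", 0.5))

def combined_tag(word, next_word):
    tag, conf = statistical_tag(word)
    if conf >= THRESHOLD:
        return tag  # confident statistical result stands
    # Suspect case: disambiguate by rule (toy rule: a multi-category
    # word directly before "the" is acting as a verb).
    return "V" if next_word == "the" else tag

print(combined_tag("record", "the"))  # low confidence, rule fires: V
print(combined_tag("dog", "the"))     # high confidence, statistics win: N
```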

Part-of-speech tagging verification checks correctness and corrects the results. Consistency checking: across all annotation results, the same word in the same context should receive the same tag. Multi-category words legitimately receive different tags; for words that are not multi-category, differing tags arise from manual verification errors or other causes. With many words and many parts of speech, a consistency index cannot be computed by a closed-form formula; instead, based on clustering and classification methods, a consistency index is defined by Euclidean distance and a threshold is set so that consistency stays within range. Automatic proofreading of part-of-speech tags needs no human participation: it directly finds and corrects tagging errors. For a word that should carry a single part of speech throughout a text, data mining and rule learning methods judge fairly accurately; a large-scale training corpus generates a part-of-speech proofreading decision table, which then automatically corrects all erroneous tags across the text.
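A much simpler form of the consistency check can be sketched directly: flag any word that receives different tags in the same immediate context. This is not the clustering-with-Euclidean-distance index the notes describe, just an exact-context version on toy data to show the idea.

```python
from collections import defaultdict

def find_inconsistencies(tagged_sentences):
    """Map each (prev, word, next) context to the set of tags observed;
    return only contexts where more than one tag appears."""
    seen = defaultdict(set)
    for sent in tagged_sentences:
        padded = [("<s>", None)] + sent + [("</s>", None)]
        for i in range(1, len(padded) - 1):
            ctx = (padded[i - 1][0], padded[i][0], padded[i + 1][0])
            seen[ctx].add(padded[i][1])
    return {ctx: tags for ctx, tags in seen.items() if len(tags) > 1}

corpus = [
    [("I", "PN"), ("saw", "V"), ("her", "PN")],
    [("I", "PN"), ("saw", "N"), ("her", "PN")],  # inconsistent tag for "saw"
]
print(find_inconsistencies(corpus))
```

Each flagged context would then go to manual verification (non-multi-category words) or be accepted (legitimate multi-category words).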

Syntactic parse tree generation: organizing a sentence into a tree according to its syntactic structure.

Syntactic analysis divides into syntactic structure analysis and dependency analysis. Syntactic structure analysis is phrase-structure analysis, extracting a sentence's noun phrases, verb phrases, and so on. It can be done by rule-based methods or statistical methods; the rule-based approach has many limitations. The statistical approach is based on Probabilistic Context-Free Grammar (PCFG), defined by a terminal set, a non-terminal set, and a rule set.

First a simple example to get a feel for the computation, then the theory.

The terminal set lists the words that can appear as leaf nodes of the parse tree. The non-terminal set contains the symbols of the tree's internal (non-leaf) nodes, which connect child nodes to express their relationships; these are the syntactic rule symbols. The rule set pairs each syntactic rule symbol with a probability learned in training, and the probabilities of all rules sharing the same left-hand side must sum to 1.
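The sum-to-1 constraint is easy to verify mechanically. In this sketch, rules are (left-hand side, right-hand side, probability) triples; the toy grammar is invented.

```python
from collections import defaultdict

rules = [
    ("S",  ("NP", "VP"), 1.0),
    ("NP", ("DT", "N"),  0.7),
    ("NP", ("N",),       0.3),
    ("VP", ("V", "NP"),  1.0),
]

def is_normalized(rules, tol=1e-9):
    """Check that probabilities with the same left-hand side sum to 1."""
    totals = defaultdict(float)
    for lhs, _rhs, p in rules:
        totals[lhs] += p
    return all(abs(total - 1.0) < tol for total in totals.values())

print(is_normalized(rules))  # True
```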

A sentence may have many possible syntactic structure trees; the structure with the highest probability is selected as the best. Let W = ω1ω2…ωn denote a sentence, where each ω is a word. A dynamic programming algorithm computes the probability that non-terminal A derives the substring ωiωi+1…ωj of W; call this probability αij(A). The recursion is: αii(A) = P(A → ωi) for a single word, and αij(A) = ∑B,C ∑k=i..j−1 P(A → BC) · αik(B) · α(k+1)j(C) for longer spans.
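The dynamic program above can be sketched for a tiny PCFG in Chomsky normal form. The grammar and the three-word sentence are invented; every rule here has probability 1, so the final inside probability comes out to 1 as well.

```python
# Lexical rules A -> word and binary rules A -> B C, with probabilities.
lex = {("DT", "the"): 1.0, ("N", "dog"): 1.0, ("VP", "barks"): 1.0}
bin_rules = [("S", "NP", "VP", 1.0), ("NP", "DT", "N", 1.0)]

def inside(words):
    """alpha[(i, j, A)] = probability that A derives words[i..j]."""
    n = len(words)
    alpha = {}
    for i, w in enumerate(words):           # base case: alpha_ii(A) = P(A -> w_i)
        for (A, word), p in lex.items():
            if word == w:
                alpha[(i, i, A)] = alpha.get((i, i, A), 0.0) + p
    for span in range(2, n + 1):            # recursion over longer spans
        for i in range(n - span + 1):
            j = i + span - 1
            for A, B, C, p in bin_rules:
                for k in range(i, j):       # split point, as in the formula
                    alpha[(i, j, A)] = (alpha.get((i, j, A), 0.0)
                        + p * alpha.get((i, k, B), 0.0)
                            * alpha.get((k + 1, j, C), 0.0))
    return alpha

a = inside(["the", "dog", "barks"])
print(a[(0, 2, "S")])  # probability that S derives the whole sentence
```

Selecting the best tree replaces the sums with maxima and records the winning rule and split point at each cell.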

Syntactic rule extraction and probabilistic parameter estimation for a PCFG require a large treebank as training data. The syntactic rules in the treebank are extracted into structural form, then merged and generalized to obtain the terminal set ∑, the non-terminal set, and the rule set R. For the probability parameters, the parameters are given random initial values and the EM algorithm iterates over the training data: counting how many times each rule is used yields a maximum-likelihood estimate of its probability, the probabilities are updated iteratively, and the estimates converge to the values that maximize the likelihood.
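The counting step at the heart of this estimation is relative frequency: count(A → β) / count(A). This sketch applies it to a hand-made list of "extracted" rule occurrences standing in for rules read off a treebank; full EM for a PCFG (the inside-outside algorithm) repeats such counting with expected counts.

```python
from collections import Counter

extracted = [("NP", ("DT", "N")), ("NP", ("DT", "N")), ("NP", ("N",)),
             ("VP", ("V", "NP"))]

def estimate_probs(extracted):
    """Maximum-likelihood rule probabilities: count(A -> beta) / count(A)."""
    rule_counts = Counter(extracted)
    lhs_counts = Counter(lhs for lhs, _ in extracted)
    return {(lhs, rhs): count / lhs_counts[lhs]
            for (lhs, rhs), count in rule_counts.items()}

probs = estimate_probs(extracted)
print(probs[("NP", ("DT", "N"))])  # 2 of the 3 NP expansions
```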

