Hidden Markov Model (HMM) / Perceptron / Conditional Random Field (CRF) ---- Part-of-Speech Tagging

These notes are reproduced in the GitHub project: https://github.com/NLP-LOVE/Introduction-NLP

7. Part-of-speech tagging

7.1 Overview of part-of-speech tagging

  1. What is a part of speech?

    In linguistics, a part of speech (Part-Of-Speech, POS) is a grammatical category of words, also called a word class. Words in the same category have similar grammatical properties. The set of all parts of speech is called the POS tagset. Different corpora use different tagsets, which typically include common parts of speech such as adjectives, verbs, and nouns. The following is HanLP's structured POS output for one sentence.

    我/r 的/u 希望/n 是/v 希望/v 张晚霞/nr 的/u 背影/n 被/p 晚霞/n 映/v 红/a

    The label after each slash is the word's POS tag:

    POS tag    Part of speech
    r          pronoun
    u          particle
    n          noun
    v          verb
    nr         person name
    p          preposition
    a          adjective
  2. Uses of parts of speech

    Parts of speech provide an abstract representation of words: the number of words is unbounded, but the number of parts of speech is finite. POS tags support many downstream applications. When a downstream application encounters an OOV word, it can use the POS tag to guess how to handle it; for example, in the sentence above the OOV word "张晚霞" is recognized as a person's name, while "晚霞" (sunset glow) is not.

    Parts of speech can also be used directly to extract information, for example extracting all the adjectives that describe a particular product.
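As a concrete illustration of this use, here is a minimal sketch (not HanLP code; the helper name and input string are invented for the example) that pulls out all words carrying a given tag from "word/tag" output like the sentence above:

```python
def extract_by_tag(tagged_sentence, tag):
    """Return the words in a 'word/tag word/tag ...' string that carry `tag`."""
    words = []
    for token in tagged_sentence.split():
        word, _, t = token.rpartition("/")  # rpartition: slashes inside words are safe
        if t == tag:
            words.append(word)
    return words

tagged = "我/r 的/u 希望/n 是/v 希望/v 张晚霞/nr 的/u 背影/n 被/p 晚霞/n 映/v 红/a"
print(extract_by_tag(tagged, "a"))   # adjectives -> ['红']
print(extract_by_tag(tagged, "nr"))  # person names -> ['张晚霞']
```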

  3. Part-of-speech tagging

    Part-of-speech tagging is the task of predicting a POS tag for each word in a sentence. It has two difficulties:

    • In Chinese it is common for one word to have multiple parts of speech, yet in a specific context its part of speech is unique.

    • OOV words, a problem in every natural language processing task.

  4. POS tagging models

    Statistical methods provide a solution to both difficulties: the sequence labeling models we are already familiar with. Simply replace the characters in Chinese word segmentation with words, and replace {B, M, E, S} with tags such as "noun, verb, adjective", and a sequence labeling model can be used for POS tagging.

    POS tagging can be treated either as a task that follows Chinese word segmentation, or as a single task integrated with it. In the latter case, the segmentation label and the POS label are concatenated into one label; a model that performs multiple tasks at once like this is called a joint model. Because it takes multiple supervision signals into account, a joint model outperforms independent models on almost all problems.

    In industry, however, the situation is less rosy: corpora annotated with both word segmentation and POS tags are scarce, because such annotation requires a great deal of manual labor.
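The joint-tagset idea described above can be sketched as follows. The segmentation tags {B, M, E, S} are crossed with a small illustrative POS inventory (not the actual PKU tagset), and per-character joint labels are decoded back into (word, POS) pairs:

```python
# Illustrative tag inventories; a real tagset (e.g. PKU) is much larger.
SEG_TAGS = ["B", "M", "E", "S"]
POS_TAGS = ["n", "v", "a", "r", "u", "p", "nr"]

# Joint labels such as "B-n": segment position plus part of speech.
joint_tagset = [f"{seg}-{pos}" for pos in POS_TAGS for seg in SEG_TAGS]
print(len(joint_tagset))   # 4 * 7 = 28

def decode(chars, joint_labels):
    """Recover (word, pos) pairs from per-character joint labels."""
    words, buf = [], ""
    for ch, label in zip(chars, joint_labels):
        seg, pos = label.split("-")
        buf += ch
        if seg in ("E", "S"):   # E or S closes the current word
            words.append((buf, pos))
            buf = ""
    return words

print(decode("希望上学", ["B-n", "E-n", "B-v", "E-v"]))
# [('希望', 'n'), ('上学', 'v')]
```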

7.2 POS tagging corpora and tagsets

As with Chinese word segmentation, linguists disagree on annotation standards, so there is currently no widely accepted standard for dividing Chinese parts of speech. Neither the granularity of the categories nor the tag labels are unified. On one hand, research institutions disagree and split into factions, producing large amounts of mutually incompatible corpora. On the other hand, some corpora are under strict copyright control and have become internal materials that cannot be freely shared.

This section selects corpora with relatively permissive licenses that are easy to obtain as case studies.

In the following examples we use the PKU tagset, as annotated in the "People's Daily" corpus.

7.3 POS tagging with hidden Markov models

We introduced hidden Markov models earlier; for details see: 4. Hidden Markov models and sequence labeling

For the HMM POS tagging code (the program downloads the PKU corpus automatically), see: hmm_pos.py

https://github.com/NLP-LOVE/Introduction-NLP/tree/master/code/ch07/hmm_pos.py

The output of the code is as follows:

一阶隐马尔可夫模型 (first-order HMM):
r, u, n, v, v, v
他/r 的/u 希望/n 是/v 希望/v 上学/v
他/代词 的/助词 希望/名词 是/动词 希望/动词 上学/动词
李狗蛋/动词 的/动词 希望/动词 是/动词 希望/动词 上学/动词

二阶隐马尔可夫模型 (second-order HMM):
r, u, n, v, v, v
他/r 的/u 希望/n 是/v 希望/v 上学/v
他/代词 的/助词 希望/名词 是/动词 希望/动词 上学/动词
李狗蛋/动词 的/动词 希望/动词 是/动词 希望/动词 上学/动词

As we can see, the hidden Markov model successfully distinguishes the two parts of speech of "希望" (n and v). But the OOV problem appears: it cannot recognize "李狗蛋" as a person's name, and both the first-order and second-order HMMs fail in the same way. The root cause is that a hidden Markov state can use only the identity of the current word as a feature, so it cannot infer from the surname "李" that "李狗蛋" is a name.
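To make this failure mode concrete, here is a toy first-order HMM with made-up probabilities and a two-tag inventory. Because the emission model only looks up the current word, an OOV word contributes the same smoothed score to every tag, and the decision falls back entirely on start and transition probabilities:

```python
import math

# Toy model: probabilities are invented for illustration only.
states = ["n", "v"]
start = {"n": 0.6, "v": 0.4}
trans = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.6, "v": 0.4}}
emit = {"n": {"希望": 0.5, "背影": 0.5}, "v": {"希望": 0.4, "是": 0.6}}

def viterbi(words, smooth=1e-6):
    # Log-space Viterbi. Unseen (word, tag) pairs fall back to `smooth`,
    # which is identical for every tag -- so an OOV word carries no signal.
    score = {s: math.log(start[s]) + math.log(emit[s].get(words[0], smooth))
             for s in states}
    path = {s: [s] for s in states}
    for w in words[1:]:
        prev_score, prev_path = score, path
        score, path = {}, {}
        for s in states:
            best = max(states, key=lambda p: prev_score[p] + math.log(trans[p][s]))
            score[s] = (prev_score[best] + math.log(trans[best][s])
                        + math.log(emit[s].get(w, smooth)))
            path[s] = prev_path[best] + [s]
    return path[max(states, key=score.get)]

print(viterbi(["希望", "是", "希望"]))    # -> ['n', 'v', 'n']
print(viterbi(["李狗蛋", "是", "希望"]))  # OOV first word, same fallback -> ['n', 'v', 'n']
```

Note how "李狗蛋" is simply assigned whichever tag the priors and transitions favor; nothing in the model can exploit the surname character.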

7.4 POS tagging with perceptrons

We introduced the perceptron model earlier; for details see: 5. Perceptron classification and sequence labeling

Judging from our experience with Chinese word segmentation, the perceptron can exploit rich contextual features, so it should also be a better choice than the hidden Markov model for POS tagging.

For the perceptron POS tagging code (the program downloads the PKU corpus automatically), see: perceptron_pos.py

https://github.com/NLP-LOVE/Introduction-NLP/tree/master/code/ch07/perceptron_pos.py

Training runs somewhat slowly; the results are as follows:

李狗蛋/nr 的/u 希望/n 是/v 希望/v 上学/v
李狗蛋/人名 的/助词 希望/名词 是/动词 希望/动词 上学/动词

The output is entirely correct: the perceptron successfully recognizes the part of speech of the OOV word "李狗蛋".
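The advantage comes from context features. The sketch below shows the kind of feature templates a perceptron tagger can use; the templates and weights are illustrative inventions, not HanLP's actual ones. Note the first-character feature, which lets the surname "李" fire even for the unseen word "李狗蛋":

```python
def features(words, i):
    """Illustrative feature templates for tagging words[i] in context."""
    w = words[i]
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"
    return {
        f"w={w}",
        f"prev_w={prev_w}",     # left context, invisible to an HMM emission
        f"next_w={next_w}",     # right context
        f"first_char={w[0]}",   # fires on the surname '李' even for OOV '李狗蛋'
        f"len={len(w)}",
    }

# Toy learned weights: (tag, feature) -> weight.
weights = {("nr", "first_char=李"): 2.0, ("n", "w=希望"): 1.5}

def score(tag, feats):
    return sum(weights.get((tag, f), 0.0) for f in feats)

feats = features(["李狗蛋", "的", "希望"], 0)
print("first_char=李" in feats)            # True
print(score("nr", feats) > score("n", feats))  # True: the surname pushes toward nr
```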

7.5 POS tagging with conditional random fields

We introduced conditional random fields earlier; for details see: 6. Conditional random fields and sequence labeling

For the CRF POS tagging code (the program downloads the PKU corpus automatically), see: crf_pos.py

https://github.com/NLP-LOVE/Introduction-NLP/tree/master/code/ch07/crf_pos.py

Training takes even longer; the results are as follows:

李狗蛋/nr 的/u 希望/n 是/v 希望/v 上学/v
李狗蛋/人名 的/助词 希望/名词 是/动词 希望/动词 上学/动词

The CRF also successfully recognizes the part of speech of the OOV word "李狗蛋".
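For intuition, here is a toy linear-chain CRF with invented scores: a tag sequence is scored by summing emission and transition scores, and its probability is globally normalized over all possible sequences. The normalizer is computed by brute-force enumeration, which is only feasible at toy sizes; real implementations use the forward algorithm.

```python
import itertools
import math

# Invented scores for a two-tag toy problem.
tags = ["n", "v"]
emit = {("希望", "n"): 1.0, ("希望", "v"): 0.8, ("是", "v"): 1.2, ("是", "n"): 0.1}
trans = {("n", "v"): 0.5, ("v", "n"): 0.4, ("n", "n"): 0.1, ("v", "v"): 0.3}

def seq_score(words, seq):
    """Unnormalized score: emissions plus transitions along the sequence."""
    s = sum(emit.get((w, t), 0.0) for w, t in zip(words, seq))
    s += sum(trans[(a, b)] for a, b in zip(seq, seq[1:]))
    return s

def prob(words, seq):
    """Globally normalized probability (brute-force partition function)."""
    z = sum(math.exp(seq_score(words, s))
            for s in itertools.product(tags, repeat=len(words)))
    return math.exp(seq_score(words, seq)) / z

words = ["希望", "是"]
best = max(itertools.product(tags, repeat=2), key=lambda s: seq_score(words, s))
print(best)                # ('n', 'v')
print(prob(words, best))   # well under 1: mass is shared with competing sequences
```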

7.6 Evaluation of POS tagging

Splitting the PKU corpus 9:1 into training and test sets and training each of the three models on it gives the following accuracies:

Algorithm                           Accuracy
First-order hidden Markov model     44.99%
Second-order hidden Markov model    40.53%
Structured perceptron               83.07%
Conditional random field            82.12%

As the table shows, the structured perceptron and the conditional random field both outperform the hidden Markov model: discriminative models can be trained with many more features, which improves accuracy considerably.
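The evaluation protocol above can be sketched as follows; only the 9:1 split and token-level accuracy are taken from the text, while the toy corpus, gold tags, and predictions are fabricated for illustration:

```python
def split_corpus(sentences, ratio=0.9):
    """Split a list of sentences into train/test at the given ratio."""
    cut = int(len(sentences) * ratio)
    return sentences[:cut], sentences[cut:]

def accuracy(gold, pred):
    """Token-level tagging accuracy over parallel lists of tag sequences."""
    total = correct = 0
    for g_sent, p_sent in zip(gold, pred):
        for g, p in zip(g_sent, p_sent):
            total += 1
            correct += (g == p)
    return correct / total

corpus = [f"sent{i}" for i in range(10)]   # placeholder sentences
train, test = split_corpus(corpus)
print(len(train), len(test))               # 9 1

gold = [["n", "v", "u"], ["r", "v"]]       # fabricated gold tags
pred = [["n", "v", "v"], ["r", "v"]]       # fabricated predictions, one error
print(accuracy(gold, pred))                # 0.8
```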

7.7 Custom parts of speech

In engineering, users often want certain specific words to be tagged with custom labels, called custom parts of speech. For example, a user in the e-commerce domain may want some mobile phone brands tagged with an appropriate label for later analysis. HanLP provides a custom part-of-speech feature, with two concrete implementations.

  1. Simple implementation

    This can be done by mounting a custom dictionary in HanLP:

    from pyhanlp import *

    # PerceptronLexicalAnalyzer is not exported by `from pyhanlp import *`;
    # load it from the HanLP jar via JClass.
    PerceptronLexicalAnalyzer = JClass('com.hankcs.hanlp.model.perceptron.PerceptronLexicalAnalyzer')

    CustomDictionary.insert("苹果", "手机品牌 1")      # word, "tag frequency"
    CustomDictionary.insert("iPhone X", "手机型号 1")
    analyzer = PerceptronLexicalAnalyzer()
    analyzer.enableCustomDictionaryForcing(True)      # custom entries override the model
    print(analyzer.analyze("你们苹果iPhone X保修吗?"))
    print(analyzer.analyze("多吃苹果有益健康"))

    Here the custom words are inserted in code; in a real project they could also be loaded from a dictionary file. The output is as follows:

    你们/r 苹果/手机品牌 iPhone X/手机型号 保修/v 吗/y ?/w
    多/ad 吃/v 苹果/手机品牌 有益健康/i

    As the results show, the dictionary does nothing but mechanical matching: the "苹果" in "多吃苹果" (eat more apples) is also tagged as a phone brand. This is a problem common to all rule-based systems. A dictionary alone cannot solve POS tagging; custom POS tagging, too, should be handed to statistical methods.

  2. Annotating data

    Determining a part of speech requires context, which is exactly what statistical models are good at. To implement custom parts of speech, the best practice is to annotate a corpus and then train a statistical model on it.

    As for the size of the corpus: as with all machine learning problems, the more data, the better the model.

7.8 GitHub

Notes on "Introduction to Natural Language Processing" by He Han (hankcs), the author of HanLP:

https://github.com/NLP-LOVE/Introduction-NLP

The project is continuously updated......

Table of Contents


Chapter 1: First steps
Chapter 2: Dictionary-based word segmentation
Chapter 3: Bigrams and Chinese word segmentation
Chapter 4: Hidden Markov models and sequence labeling
Chapter 5: Perceptron classification and sequence labeling
Chapter 6: Conditional random fields and sequence labeling
Chapter 7: POS tagging
Chapter 8: Named entity recognition
Chapter 9: Information extraction
Chapter 10: Text clustering
Chapter 11: Text classification
Chapter 12: Dependency parsing
Chapter 13: Deep learning and natural language processing


Origin www.cnblogs.com/mantch/p/12294619.html