8 HMM and CRF

In recent years in natural language processing, the HMM (Hidden Markov Model) and CRF (Conditional Random Field) algorithms have often been used for word segmentation, parsing, named entity recognition, part-of-speech tagging, and so on. Because the two share a lot of common ground, they often overlap in applications, but in named entity recognition and syntactic analysis CRF seems to do even better. In general, if you work in natural language processing, you should understand both models, so let's take a look at the contents of this article.

Understanding Generative and Discriminative Models via Bayes' Theorem

Before getting into the HMM (Hidden Markov Model) and CRF (Conditional Random Field) models, let's look at two concepts: the generative model and the discriminative model.

In machine learning, generative models and discriminative models are both used for supervised learning. Supervised learning means learning a model (also called a classifier) from data, then applying this model to predict the output Y corresponding to a given input X. The general form of such a model is either a decision function Y = f(X) or a conditional probability distribution P(Y|X).

First, a brief review of Bayes' theorem. If P(A) and P(B) denote the probabilities that events A and B occur, then P(A|B) represents the probability that event A occurs given that event B has occurred, and P(AB) represents the probability that events A and B occur simultaneously.

According to Bayes' theorem:

P(A|B) = P(AB) / P(B) = P(B|A) * P(A) / P(B)
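As a quick worked example (numbers invented for illustration): suppose 10% of sentences are questions, so P(A) = 0.1; a question mark appears in 80% of questions, so P(B|A) = 0.8; and question marks occur in 9% of all sentences, so P(B) = 0.09. Then P(A|B) = 0.8 * 0.1 / 0.09 ≈ 0.89, i.e., a sentence containing a question mark is a question with probability of about 89%.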

Generative model: it estimates the joint probability distribution P(X, Y) = P(Y|X) * P(X), and from the joint distribution P(X, Y) derives the conditional probability distribution P(Y|X) as the prediction model; that is, the generative model's formula is P(Y|X) = P(X, Y) / P(X). The basic idea is to first build a model of the joint probability distribution P(X, Y) from the samples, then obtain the posterior probability P(Y|X), and then use it to classify. Its main concern is the generative relationship by which a given input X produces an output Y.

Discriminative model: it estimates the conditional probability distribution P(Y|X) directly, i.e., the model of the target variable Y conditioned on the observed variables X. It learns the decision function Y = f(X) or the conditional probability distribution P(Y|X) directly from the data as the prediction model; its main concern is, for a given input X, what output Y should be predicted.

Therefore, the HMM generates observable states from hidden variables, and its generation probabilities are estimated statistically from a labeled corpus, so it is a generative model. Other common generative models include: Gaussian models, Naive Bayes, mixtures of multinomials, and so on.

The CRF works like the HMM in reverse: it discriminates the hidden variables from the observable states, with probabilities likewise estimated statistically from a labeled corpus, so it is a discriminative model. Other common discriminative models include: k-nearest neighbors, the perceptron, decision trees, logistic regression, maximum entropy models, support vector machines, and boosting methods.
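To make the distinction concrete, here is a minimal Python sketch (toy data and names invented for illustration, not part of the original tutorial). The generative route models the joint P(X, Y) together with P(X) and derives P(Y|X) from them; a discriminative model would fit P(Y|X) directly and never model X itself:

    from collections import Counter

    # Toy labeled data (invented for illustration): (observation x, label y)
    data = [("rain", "umbrella"), ("rain", "umbrella"), ("sun", "hat"),
            ("sun", "umbrella"), ("rain", "coat")]
    n = len(data)

    # Generative route: model the joint P(X, Y) and the marginal P(X)...
    joint = Counter(data)
    p_xy = {pair: c / n for pair, c in joint.items()}
    p_x = {x: c / n for x, c in Counter(x for x, _ in data).items()}

    # ...then derive the conditional for prediction: P(Y|X) = P(X, Y) / P(X)
    def p_y_given_x(y, x):
        return p_xy.get((x, y), 0.0) / p_x[x]

    print(p_y_given_x("umbrella", "rain"))  # 0.4 / 0.6 ≈ 0.667

    # A discriminative model (logistic regression, CRF, ...) would instead fit
    # P(Y|X) directly; it cannot generate new (x, y) pairs, because it never
    # models P(X).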

For the theory behind HMM (Hidden Markov Model) and CRF (Conditional Random Field), I recommend Zhou Zhihua's book "Machine Learning" (the "watermelon book").

Hands-on Practice: Training Your Own Python Chinese Word Segmenter Based on HMM

Model Introduction

An HMM model is defined by a "five-tuple":

  • StatusSet: the set of state values, here (B, M, E, S), where B is the first character of a word, M a middle character of a word, E the last character of a word, and S a single-character word. B, M, E, S thus mark the position of each character within a word.

    For example, the sentence "中国的人工智能发展进入高潮阶段" can be labeled as "中B 国E 的S 人B 工E 智B 能E 发B 展E 进B 入E 高B 潮E 阶B 段E", and the resulting segmentation is: ['中国', '的', '人工', '智能', '发展', '进入', '高潮', '阶段'].

  • ObservedSet: the set of observation values, which is the set of all characters in the corpus, including punctuation.

  • TransProbMatrix: the transition probability matrix, whose entries give the probability of transitioning from state X to state Y. It is a 4 × 4 matrix, i.e., {B, E, M, S} × {B, E, M, S}.

  • EmitProbMatrix: the emission probability matrix, each element of which is a conditional probability representing P(Observed[i] | Status[j]).

  • InitStatus: the initial state distribution, giving the probability that the first character of a sentence belongs to each of the four states {B, E, M, S}.

When the HMM is applied to word segmentation, the problem to solve is: given the parameters (ObservedSet, TransProbMatrix, EmitProbMatrix, InitStatus), find the most likely sequence of state values.

The most famous way to solve this problem is the Viterbi algorithm.
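In the notation of the five-tuple above (writing A for the TransProbMatrix, B for the EmitProbMatrix, and \pi for the InitStatus), the recurrence that the do_predict() method below implements can be sketched in standard textbook form:

    \delta_1(s) = \pi(s) \cdot B[s][o_1]
    \delta_t(s) = \max_{s'} \, \delta_{t-1}(s') \cdot A[s'][s] \cdot B[s][o_t]

Here \delta_t(s) is the probability of the best tag path ending in state s at position t (o_t is the t-th observed character); remembering which s' achieved each maximum lets the optimal B/M/E/S sequence be recovered by backtracking.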

Corpus Preparation

The training corpus was generated from short texts I crawled and processed myself. The whole corpus is 264 MB in size, contains 1,116,903 entries, is UTF-8 encoded, and separates words with spaces; it is used to train the segmentation model. The file is syj_trainCorpus_utf8.txt.

The corpus has been uploaded to CSDN resources; please click to download: Chinese natural language processing, Chinese word segmentation training corpus.

Corpus format, with words separated by spaces (the sample lines below are shown in English translation):

If the trend of bourgeois liberalization is allowed to continue spreading,

the Party will lose its cohesion and combat effectiveness,

and how could it become the core of leadership for the people of the whole country?

China would become a mess,

and what hope would there be?

Coding

(1) Predefinitions

First, import the libraries; these two are used to save the model:

    import pickle
    import json

Next, define the states used in the HMM, a small default probability EPS, and the set of punctuation marks treated as stops in Chinese text:

    STATES = {'B', 'M', 'E', 'S'}
    EPS = 0.0001
    # Punctuation treated as stop characters
    seg_stop_words = {
        " ", ",", "。", "“", "”", "?", "!", ":", "《", "》", "、", ";", "·",
        "‘", "’", "──", ",", ".", "?", "!", "`", "~", "@", "#", "$", "%", "^",
        "&", "*", "(", ")", "-", "_", "+", "=", "[", "]", "{", "}", '"', "'",
        "<", ">", "\\", "|", "\r", "\n", "\t"
    }

(2) Encapsulation in an object-oriented class

First, the HMM model is packaged into a separate class, HMM_Model; the structure of the class definition is given below:

    class HMM_Model:
        def __init__(self):
            pass
        # Initialization
        def setup(self):
            pass
        # Save the model
        def save(self, filename, code):
            pass
        # Load the model
        def load(self, filename, code):
            pass
        # Train the model
        def do_train(self, observes, states):
            pass
        # HMM computation
        def get_prob(self):
            pass
        # Model prediction
        def do_predict(self, sequence):
            pass

The first method, __init__(), is a special method: the constructor, also called the class initialization method, which is invoked when an instance of the class is created. It defines the data structures and initial variables, implemented as follows:

    def __init__(self):
        self.trans_mat = {}
        self.emit_mat = {}
        self.init_vec = {}
        self.state_count = {}
        self.states = {}
        self.inited = False

The data structures are defined as follows:

  • trans_mat: the state transition matrix; trans_mat[state1][state2] represents the number of times the training set transitions from state1 to state2.

  • emit_mat: the emission matrix; emit_mat[state][char] represents the number of times the character char is labeled with state in the training set.

  • init_vec: the initial state distribution vector; init_vec[state] represents the number of times state appears as the first state of a sequence in the training set.

  • state_count: the state count vector; state_count[state] indicates the total number of times state occurs.

  • word_set: the word set, containing all observed characters.
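As a hypothetical illustration (values invented; only nonzero entries are shown, whereas setup() below pre-fills zeros), after training on the single labeled line "中国 的" with states B E S, the structures would contain:

    trans_mat   = {'B': {'E': 1}, 'E': {'S': 1}}   # B->E once, E->S once
    emit_mat    = {'B': {'中': 1}, 'E': {'国': 1}, 'S': {'的': 1}}
    init_vec    = {'B': 1}                          # the line starts in state B
    state_count = {'B': 1, 'E': 1, 'S': 1}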

The second method, setup(), initializes the data structures defined by the first method; the specific implementation is as follows:

    # Initialize the data structures
    def setup(self):
        for state in self.states:
            # build trans_mat
            self.trans_mat[state] = {}
            for target in self.states:
                self.trans_mat[state][target] = 0.0
            self.emit_mat[state] = {}
            self.init_vec[state] = 0
            self.state_count[state] = 0
        self.inited = True

The third method, save(), is used to save the trained model. The parameter filename specifies the model file name, hmm.json by default. Two storage formats are provided, JSON and pickle, selected by the parameter code, whose value is code='json' or code='pickle', with code='json' by default. The specific implementation is as follows (note that pickle requires the file to be opened in binary mode):

    # Save the model
    def save(self, filename="hmm.json", code='json'):
        data = {
            "trans_mat": self.trans_mat,
            "emit_mat": self.emit_mat,
            "init_vec": self.init_vec,
            "state_count": self.state_count
        }
        if code == "json":
            fw = open(filename, 'w', encoding='utf-8')
            txt = json.dumps(data)
            txt = txt.encode('utf-8').decode('unicode-escape')
            fw.write(txt)
        elif code == "pickle":
            fw = open(filename, 'wb')   # pickle needs binary mode
            pickle.dump(data, fw)
        fw.close()

The fourth method, load(), is the counterpart of the third method save() and loads a saved model. filename specifies the model file name, hmm.json by default; the two storage formats, JSON and pickle, are again selected by the parameter code, with values code='json' or code='pickle' and code='json' by default. The specific implementation is as follows:

    # Load the model
    def load(self, filename="hmm.json", code="json"):
        if code == "json":
            fr = open(filename, 'r', encoding='utf-8')
            txt = fr.read()
            model = json.loads(txt)
        elif code == "pickle":
            fr = open(filename, 'rb')   # pickle needs binary mode
            model = pickle.load(fr)
        self.trans_mat = model["trans_mat"]
        self.emit_mat = model["emit_mat"]
        self.init_vec = model["init_vec"]
        self.state_count = model["state_count"]
        self.inited = True
        fr.close()

The fifth method, do_train(), is used to train the model. Because the data set used is already labeled, a relatively simple supervised learning algorithm suffices: the training function takes an observation sequence and a state sequence as input and updates the count matrices. The class keeps the model parameters as counts rather than probabilities; this design allows online training, so the model can keep accepting new training data and continue training without losing the results of previous training. The specific implementation is as follows:

    # Train the model
    def do_train(self, observes, states):
        if not self.inited:
            self.setup()
        for i in range(len(states)):
            if i == 0:
                self.init_vec[states[0]] += 1
                self.state_count[states[0]] += 1
            else:
                self.trans_mat[states[i - 1]][states[i]] += 1
                self.state_count[states[i]] += 1
            if observes[i] not in self.emit_mat[states[i]]:
                self.emit_mat[states[i]][observes[i]] = 1
            else:
                self.emit_mat[states[i]][observes[i]] += 1

The sixth method, get_prob(): before making predictions, the counts need to be converted into frequencies; the specific implementation is as follows:

    # Convert counts to frequencies
    def get_prob(self):
        init_vec = {}
        trans_mat = {}
        emit_mat = {}
        default = max(self.state_count.values())

        for key in self.init_vec:
            if self.state_count[key] != 0:
                init_vec[key] = float(self.init_vec[key]) / self.state_count[key]
            else:
                init_vec[key] = float(self.init_vec[key]) / default
        for key1 in self.trans_mat:
            trans_mat[key1] = {}
            for key2 in self.trans_mat[key1]:
                if self.state_count[key1] != 0:
                    trans_mat[key1][key2] = float(self.trans_mat[key1][key2]) / self.state_count[key1]
                else:
                    trans_mat[key1][key2] = float(self.trans_mat[key1][key2]) / default
        for key1 in self.emit_mat:
            emit_mat[key1] = {}
            for key2 in self.emit_mat[key1]:
                if self.state_count[key1] != 0:
                    emit_mat[key1][key2] = float(self.emit_mat[key1][key2]) / self.state_count[key1]
                else:
                    emit_mat[key1][key2] = float(self.emit_mat[key1][key2]) / default
        return init_vec, trans_mat, emit_mat

The seventh method, do_predict(), uses the Viterbi algorithm to find the optimal prediction path; the specific implementation is as follows:

    # Model prediction
    def do_predict(self, sequence):
        tab = [{}]
        path = {}
        init_vec, trans_mat, emit_mat = self.get_prob()
        # Initialization
        for state in self.states:
            tab[0][state] = init_vec[state] * emit_mat[state].get(sequence[0], EPS)
            path[state] = [state]
        # Build the dynamic programming table
        for t in range(1, len(sequence)):
            tab.append({})
            new_path = {}
            for state1 in self.states:
                items = []
                for state2 in self.states:
                    if tab[t - 1][state2] == 0:
                        continue
                    prob = tab[t - 1][state2] * trans_mat[state2].get(state1, EPS) \
                        * emit_mat[state1].get(sequence[t], EPS)
                    items.append((prob, state2))
                best = max(items)
                tab[t][state1] = best[0]
                new_path[state1] = path[best[1]] + [state1]
            path = new_path
        # Search for the optimal path
        prob, state = max([(tab[len(sequence) - 1][state], state) for state in self.states])
        return path[state]

The HMM_Model class above implements all seven methods. Next we implement the tokenizer itself; first we define two utility functions, which are standalone and do not belong to the class.

(1) Define the first utility function

This function labels each word of the training corpus. Since the training data separates words with spaces, each word can be converted directly into its state labels; this labeling is applied to the training data. The specific implementation is as follows:

    def get_tags(src):
        tags = []
        if len(src) == 1:
            tags = ['S']
        elif len(src) == 2:
            tags = ['B', 'E']
        else:
            m_num = len(src) - 2
            tags.append('B')
            tags.extend(['M'] * m_num)
            tags.append('E')
        return tags
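A quick check of what get_tags returns:

    print(get_tags("的"))          # ['S']  single-character word
    print(get_tags("中国"))        # ['B', 'E']
    print(get_tags("人工智能"))    # ['B', 'M', 'M', 'E']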

(2) Define the second utility function

This function turns a predicted state sequence back into words: given the input sentence and the predicted tag sequence, it resolves them into a list of words and returns that list. The specific implementation is as follows:

    def cut_sent(src, tags):
        word_list = []
        start = -1
        started = False

        if len(tags) != len(src):
            return None
        if tags[-1] not in {'S', 'E'}:
            if tags[-2] in {'S', 'E'}:
                tags[-1] = 'S'
            else:
                tags[-1] = 'E'
        for i in range(len(tags)):
            if tags[i] == 'S':
                if started:
                    started = False
                    word_list.append(src[start:i])
                word_list.append(src[i])
            elif tags[i] == 'B':
                if started:
                    word_list.append(src[start:i])
                start = i
                started = True
            elif tags[i] == 'E':
                started = False
                word = src[start:i + 1]
                word_list.append(word)
            elif tags[i] == 'M':
                continue
        return word_list
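And the inverse direction, from tags back to words:

    src = "中国的人工智能"
    tags = ['B', 'E', 'S', 'B', 'E', 'B', 'E']
    print(cut_sent(src, tags))    # ['中国', '的', '人工', '智能']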

Finally, we define the segmenter class HMMSoyoger, which inherits from the HMM_Model class and implements the training and segmentation functions of a Chinese word segmenter. The structure of the HMMSoyoger class is given first:

    class HMMSoyoger(HMM_Model):
        def __init__(self, *args, **kwargs):
            pass
        # Load the training data
        def read_txt(self, filename):
            pass
        # Model training function
        def train(self):
            pass
        # Segmentation prediction
        def lcut(self, sentence):
            pass

The first method, __init__(), the constructor, initializes and defines the variables; the specific implementation is as follows:

    def __init__(self, *args, **kwargs):
        super(HMMSoyoger, self).__init__(*args, **kwargs)
        self.states = STATES
        self.data = None

The second method, read_txt(), loads the training corpus. The file read is a txt file in UTF-8 encoding, to prevent garbled Chinese characters; the specific implementation is as follows:

    # Load the corpus
    def read_txt(self, filename):
        self.data = open(filename, 'r', encoding="utf-8")

The third method, train(): it generates the observation sequence and the state sequence from each line of the corpus and trains the model through the parent class's do_train() method; the specific implementation is as follows:

    def train(self):
        if not self.inited:
            self.setup()
        for line in self.data:
            line = line.strip()
            if not line:
                continue
            # Observation sequence
            observes = []
            for i in range(len(line)):
                if line[i] == " ":
                    continue
                observes.append(line[i])
            # State sequence
            words = line.split(" ")
            states = []
            for word in words:
                if word in seg_stop_words:
                    continue
                states.extend(get_tags(word))
            # Start training
            if len(observes) >= len(states):
                self.do_train(observes, states)

The fourth method, lcut(): once the model has been trained, this method segments test sentences; the specific implementation is as follows:

    def lcut(self, sentence):
        try:
            tags = self.do_predict(sentence)
            return cut_sent(sentence, tags)
        except Exception:
            return sentence

With the two classes and two utility functions above, the HMM-based Chinese word segmenter is complete; now let's train and test the model.

Model Training

First instantiate the HMMSoyoger class, then load the corpus with the read_txt() method, and then train online via train(). If the training corpus is relatively large, this may take a little while; the specific implementation is as follows:

    soyoger = HMMSoyoger()
    soyoger.read_txt("syj_trainCorpus_utf8.txt")
    soyoger.train()
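After training, you can persist the counts with the save() method defined earlier and reload them later instead of retraining; a minimal sketch:

    soyoger.save("hmm.json")    # persist the trained counts

    # Later, in a fresh session:
    soyoger2 = HMMSoyoger()
    soyoger2.load("hmm.json")
    print(soyoger2.lcut("中国的人工智能发展进入高潮阶段。"))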

Model Testing

After the model finishes training, we can test it:

    soyoger.lcut("中国的人工智能发展进入高潮阶段。")

The results obtained are:

['中国', '的', '人工', '智能', '发展', '进入', '高潮', '阶段', '。']

    soyoger.lcut("中文自然语言处理是人工智能技术的一个重要分支。")

The results obtained are:

['中文', '自然', '语言', '处理', '是', '人', '工智', '能技', '术', '的', '一个', '重要', '分支', '。']

As you can see, the final result is fairly good; if you want better results, you can prepare a larger and richer training data set yourself.

CRF in Practice: the Open-Source Chinese Word Segmentation Tool Genius

Genius is an open-source Chinese word segmentation tool based on CRF, using Wapiti for sequence-labeling training; it supports Python 2.x and Python 3.x.

Installation

(1) Download the source code

Download the source code from the GitHub address, extract it, and then install it via python setup.py install.

(2) PyPI installation

Install by executing the command easy_install genius or pip install genius.

Segmentation

First import genius, then segment the text:

    import genius

    text = u"""中文自然语言处理是人工智能技术的一个重要分支。"""
    seg_list = genius.seg_text(
        text,
        use_combine=True,
        use_pinyin_segment=True,
        use_tagging=True,
        use_break=True
    )
    print(' '.join([word.text for word in seg_list]))

The genius.seg_text function accepts five parameters, of which text is the only required one:

  • The first parameter, text, is required: the text to be segmented.
  • use_break indicates whether to break the text at sentence boundaries; the default value is True.
  • use_combine indicates whether to merge words using a dictionary; the default value is False.
  • use_tagging indicates whether to perform part-of-speech tagging; the default value is True.
  • use_pinyin_segment indicates whether to apply pinyin segmentation; the default value is True.
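Since only text is required, a minimal call relying on the defaults looks like this (assuming genius is installed as above):

    import genius

    seg_list = genius.seg_text(u"中文自然语言处理是人工智能技术的一个重要分支。")
    print(' '.join(word.text for word in seg_list))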

Summary

First, via Bayes' theorem, we came to understand the difference between discriminative and generative models; next came the hands-on practice of training our own Python Chinese word segmenter based on HMM, along with model validation; finally, we tried the CRF-based open-source Chinese word segmentation tool Genius.
