NLP Basics: Building a Word Segmenter with Enumeration and Viterbi

1. Task introduction

Task details

  • Use enumeration to implement word segmentation: first list all possible segmentation results, then use a Unigram model to select the best one, judging "best" by the sentence's perplexity / negative log probability. The difficulty in this part is generating all possible segmentations.
  • Use the Viterbi algorithm to implement word segmentation: first build a directed graph over the sentence, then compute the best segmentation with the Viterbi algorithm. Both the graph construction and the Viterbi step require some thought.

Dataset

综合类中文词库.xlsx (a comprehensive Chinese lexicon): contains Chinese words and their occurrence frequencies; it serves as the dictionary and is also used to compute the probability of each word.

2. Principle Introduction

Generally speaking, word segmentation methods fall into two main categories: maximum matching and methods that take semantics into account.

Maximum matching

Maximum matching comes in three variants: forward maximum matching, backward maximum matching, and bidirectional maximum matching. All three match the input against an existing dictionary; they differ only in the direction of matching. Taking forward maximum matching as an example:
Example: segment the sentence "我们经常有意见分歧" ("we often have differences of opinion"), given the dictionary ["我们", "经常", "有", "有意见", "意见", "分歧"]

  1. First set the maximum matching length, max_len = 5;
  2. Take max_len characters from the front of the sentence as the first window: "我们经常有"; then shrink the window character by character from the back until a dictionary word is found: "我们经常有" -> "我们经常" -> "我们经" -> "我们", so the first segmented word is "我们" ("we");
  3. Starting from the character after the matched word, take max_len characters again: "经常有意见"; shrink the window from the back in the same way and match "经常" ("often"). Repeating this until the sentence is exhausted gives the final segmentation ["我们", "经常", "有意见", "分歧"].

Because maximum matching does not consider semantics, it is not used in the segmenter built below; still, a minimal sketch is shown here for reference.
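
The sketch below implements the forward maximum matching procedure just described (the helper name forward_max_match and the plain-set dictionary are illustrative assumptions, not part of the original post):

def forward_max_match(sentence, dictionary, max_len=5):
    """Forward maximum matching: greedy, no semantics considered."""
    words = []
    i = 0
    while i < len(sentence):
        # Start from the longest window and shrink it from the back
        for j in range(min(i + max_len, len(sentence)), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                # Accept a dictionary word, or fall back to a single character
                words.append(sentence[i:j])
                i = j
                break
    return words

# Toy check with the example above
print(forward_max_match("我们经常有意见分歧",
                        {"我们", "经常", "有", "有意见", "意见", "分歧"}))
# -> ['我们', '经常', '有意见', '分歧']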

Considering semantics

Enumeration

The basic framework of the semantics-aware approach is: enumerate all possible segmentations of the input sentence according to the existing dictionary, score each candidate segmentation with a language model, and output the segmentation with the highest probability.
This raises two questions:
Q1. How to generate all possible segmentations;
Q2. How to compute the probability of each candidate segmentation.
For Q1, all possible segmentations can be generated recursively. The LeetCode problems 139 and 140 cover exactly this task of generating all word-break matches of a string against a dictionary; for a detailed walkthrough, see Huahua's videos.

LeetCode 139 Word Break I

Given a string and a dictionary, determine whether the string can be split so that every resulting word exists in the dictionary.
Word Break I

from typing import List

class Solution:
    def wordBreak(self, s: str, wordDict: List[str]) -> bool:
        def canBreak(s, m, wordDict):
            """
            Base cases of the recursion:
            1. If s is already a key of the memo dict m, return the cached result;
            2. If s itself is in wordDict, it splits into the empty string plus itself, so return True.
            """
            if s in m: return m[s]
            if s in wordDict:
                m[s] = True
                return True
            # Try every split point
            for i in range(1, len(s)):
                # Split s into two parts; r is the right half
                r = s[i:]
                # If the right half is in the dictionary and the left half is breakable, s is breakable
                if r in wordDict and canBreak(s[0:i], m, wordDict):
                    m[s] = True
                    return True
            # Otherwise there is no valid split
            m[s] = False
            return False
        return canBreak(s, {}, set(wordDict))
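
A quick sanity check (the toy dictionary here is chosen purely for illustration):

print(Solution().wordBreak("今天天气好", ["今天", "天气", "好"]))  # True
print(Solution().wordBreak("今天天气好", ["今天", "天气"]))        # False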

LeetCode 140 Word Break II

Split the string so that every word in the result is in the dictionary, and output all possible segmentations.
Word Break II

class Solution:
    def wordBreak(self, s, wordDict):
        # Store the dictionary as a set for fast membership tests
        words = set(wordDict)
        # Memo: key is a substring, value is the list of all its segmentations
        mem = {}
        def wordBreak(s):
            # Recursion exit: if s has been solved before, return its cached segmentations
            if s in mem: return mem[s]
            # mem accumulates solutions across calls, so cached entries must not be mutated
            # ans collects the segmentations of the current string s
            ans = []
            # If s itself is a dictionary word, it splits into the empty string plus itself
            if s in words: ans.append(s)
            # Try every split point
            for i in range(1, len(s)):
                right = s[i:]
                # If the right half is not in the dictionary, skip this split point
                if right not in words: continue
                ans += [w + ' ' + right for w in wordBreak(s[0:i])]
            mem[s] = ans
            return mem[s]
        return wordBreak(s)
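
Again a quick check with an illustrative toy dictionary:

print(Solution().wordBreak("今天天气好", ["今天", "天气", "好", "天", "气"]))
# -> ['今天 天气 好', '今天 天 气 好']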

For Q2, the probability of each candidate segmentation can be obtained with a language model.
Unigram model approach:

  1. First, estimate each word's probability as its frequency divided by the total word count in the corpus;
  2. Then multiply the probabilities of the words in a segmentation to get the probability of the whole sentence, and compare sentences by this probability;
  3. To prevent numerical underflow, take the log of each probability, which turns the product into a sum; sentences are then compared by their negative log probability, which is closely related to perplexity;
  4. Optimization: because the product (or the sum of -log values) depends directly on how many words a segmentation contains, it can bias the comparison toward a particular segmentation granularity; an averaging step can be added, dividing the total score by the number of words in the segmentation (see the sketch below).
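
A minimal sketch of this scoring scheme, assuming word_prob maps each word to its -log probability (as built in the implementation section) and using the post's 0.00001 floor for unknown words:

import math

def sentence_score(words, word_prob, unk_prob=0.00001, average=True):
    """Score a segmentation as the sum of -log p(word); lower is better."""
    score = sum(word_prob.get(w, -math.log(unk_prob)) for w in words)
    # Optional averaging, so the raw word count does not dominate the comparison
    return score / len(words) if average else score

The best segmentation is then simply the candidate with the smallest score.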

Viterbi

The drawback of the enumeration method is that when the input sentence is long, the number of possible segmentations explodes combinatorially, and all of them must be generated recursively. So consider computing the probability of each candidate word in advance, and recasting "which segmentation has the highest probability" as the problem of finding a shortest path from the first character to the last.

This problem again splits into two steps:
Q1. How to compute the probability of each candidate word in advance, turning maximization of the probability into minimization of a cost:
exactly as in the enumeration method, estimate each word's probability from the frequencies in the given lexicon. Then use the dictionary, the input sentence, and word_prob to build a weighted directed graph; edges without a probability are either removed or given a very large weight.

Q2. How to find the shortest path from the first node to the last node:
use the Viterbi algorithm to find the best PATH, which corresponds to the best segmentation.
The dynamic program fills in the negative log probabilities from left to right, recording a back-pointer at each position; once filled, the shortest path is recovered by following the back-pointers in reverse.
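
To make the graph concrete (a toy example with hypothetical dictionary entries; the real code below uses the full lexicon): node i is the position after the i-th character, and graph[i] lists the start positions j such that input_str[j:i] is a dictionary word.

# For input_str = "今天天气好" and a dictionary containing
# {"今", "今天", "天", "天气", "气", "好"}, the graph would be:
graph = {
    1: [0],     # "今"
    2: [1, 0],  # "天", "今天"
    3: [2],     # "天"
    4: [3, 2],  # "气", "天气"
    5: [4],     # "好"
}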

3. Implementation

1. Building a Chinese word segmenter with the enumeration method

Step 1: read all Chinese words from the xlsx file, compute each word's frequency, and convert it to a -log value

import math
import xlrd
import numpy as np

# Read the dictionary
print("Reading dic...")
# Open the workbook (a Book object)
workbook = xlrd.open_workbook("./data/综合类中文词库.xlsx")
dic_words = []  # words read from the lexicon
dic_freq = []   # their raw occurrence counts
# Get the first sheet
booksheet = workbook.sheet_by_index(0)
rows = booksheet.get_rows()
for row in rows:
    dic_words.append(row[0].value)
    dic_freq.append(int(row[2].value.strip()))
# Normalize counts into probabilities
dic_freq = np.array(dic_freq)
dic_freq = list(dic_freq / np.sum(dic_freq))
# Map each word to its probability
word_prob = dict(zip(dic_words, dic_freq))
# Replace each probability p with -log(p)
for word in word_prob.keys():
    word_prob[word] = round(-math.log(word_prob[word]), 2)
print("len:" + str(len(dic_words)))

Step 2: compute all possible segmentation results, ensuring that every segmented word exists in the dictionary:

def wordBreak(s, wordDict):
    words = set(wordDict)
    mem = {}
    def wordBreak(s):
        if s in mem: return mem[s]
        ans = []
        if s in words: ans.append(s)
        for i in range(1, len(s)):
            right = s[i:]
            if right not in words: continue
            ans += [w + " " + right for w in wordBreak(s[0:i])]
        mem[s] = ans
        return mem[s]
    return wordBreak(s)

Step 3: write the word_segment_naive function to segment the input string

#  Score (10)
## TODO: write the word_segment_naive function to segment the input string
def word_segment_naive(input_str):
    """
    1. Segment the input string and collect every feasible segmentation.
    2. Compute the sentence probability of each candidate.
    3. Return the candidate with the highest probability (i.e. the lowest -log score).

    input_str: input string, e.g. "今天天气好"
    best_segment: best segmentation, e.g. ["今天","天气","好"]
    """
    # Step 1: compute all possible segmentations, ensuring every word is in the dictionary
    res = wordBreak(input_str, dic_words)
    segments = [i.split(' ') for i in res]  # all segmentations; an empty list if the string cannot be fully segmented
    # Format: segments = [["今天","天气","好"], ["今天","天","气","好"], ["今","天","天气","好"], ...]
    # Step 2: loop over all segmentations, find the one with the best score, and return it
    best_segment = []
    best_score = math.inf  # scores are -log probabilities, so initialize with +infinity
    for seg in segments:
        score = 0
        for word in seg:
            if word in word_prob:
                score += word_prob[word]
            else:
                # Words missing from word_prob get the floor probability 0.00001
                score += round(-math.log(0.00001), 2)
        if score < best_score:
            best_score = score
            best_segment = seg
    return best_segment

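A hypothetical quick test using the docstring's example sentence (the exact output depends on the loaded lexicon; the original post shows a screenshot of its results):

print(word_segment_naive("今天天气好"))
# e.g. ['今天', '天气', '好']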

2. Optimization with the Viterbi algorithm

Write the word_segment_viterbi function to segment the input string:

def word_segment_viterbi(input_str):
    """
    1. Build a DAG from the input string, the dictionary, and the given unigram probabilities.
    2. Run the Viterbi algorithm to find the optimal PATH.
    3. Return the corresponding segmentation.

    input_str: input string, e.g. "今天天气好"
    best_segment: best segmentation, e.g. ["今天","天气","好"]
    """
    # Step 1: build a weighted directed graph from the dictionary, the input sentence,
    # and the given unigram probabilities. Every edge is a word (any entry of the
    # dictionary is a legal word); its weight comes from word_prob, and words that are
    # in the dictionary but not in word_prob get the uniform floor probability 0.00001.
    # Note: there is more than one reasonable way to store this graph. Here graph[i]
    # is the list of start positions j such that input_str[j:i] is an edge (a word).
    dic_set = set(dic_words)  # set membership is O(1), vs. O(n) on the list
    graph = {}
    N = len(input_str)
    for i in range(N, 0, -1):
        in_list = []
        for k in range(i - 1, -1, -1):
            if input_str[k:i] in dic_set:
                in_list.append(k)
        if (i - 1) not in in_list:
            # Fallback edge for an out-of-vocabulary single character,
            # so that every position stays reachable
            in_list.append(i - 1)
        graph[i] = in_list

    # Step 2: run the Viterbi algorithm to find the PATH that maximizes P(sentence),
    # i.e. minimizes -log P(sentence).
    mem = [0] * (N + 1)         # mem[i]: minimal cost of segmenting input_str[:i]
    last_index = [0] * (N + 1)  # back-pointer: start index of the last word ending at i
    for i in range(1, N + 1):
        min_dis = math.inf
        for j in graph[i]:
            # Edge weight: -log probability from word_prob, or -log(0.00001)
            # for words that are not in word_prob
            if input_str[j:i] in word_prob:
                dis = mem[j] + round(word_prob[input_str[j:i]], 1)
            else:
                dis = mem[j] + round(-math.log(0.00001), 1)
            if dis < min_dis:
                min_dis = dis
                last_index[i] = j
        mem[i] = min_dis

    # Step 3: follow the back-pointers from the end to recover the best segmentation
    best_segment = []
    j = N
    while j > 0:
        best_segment.append(input_str[last_index[j]:j])
        j = last_index[j]
    best_segment.reverse()
    return best_segment
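
As with the enumeration version, a hypothetical quick test (output depends on the lexicon):

print(word_segment_viterbi("今天天气好"))
# e.g. ['今天', '天气', '好']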

The enumeration method must materialize every candidate segmentation, so its running time grows exponentially with sentence length in the worst case, while the Viterbi method runs in O(n²).

Origin blog.csdn.net/qq_29027865/article/details/104567149