Statistical word segmentation (unigram)

1 Dictionary

Let $m$ be the length of the longest word in the dictionary and $n$ the length of the input string; in general $n \gg m$.

import xlrd  # note: xlrd >= 2.0 reads only .xls files; opening an .xlsx file needs xlrd < 2.0

def create_dic_words(file_path, sheet_index=0):
    workbook = xlrd.open_workbook(filename=file_path)
    worksheet  = workbook.sheet_by_index(sheet_index)
    
    dic_words = {}
    max_len_word = 0
    for idx in range(worksheet.nrows):
        word = worksheet.row(idx)[0].value.strip()
        dic_words[word] = 0.00001
        
        len_word = len(word)
        if len_word > max_len_word:
            max_len_word = len_word
            
    return dic_words, max_len_word

# TODO: Step 1: read all Chinese words from 综合类中文词库.xlsx.
#  hint: which data structure suits this dictionary best? Consider the cost of every word lookup.
dic_path = "./综合类中文词库.xlsx"
dic_words, max_len_word = create_dic_words(file_path=dic_path)    # words read from the dictionary file

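# Probabilities of individual words. To keep the problem simple, only a small subset is listed.
# Any word that is in the dictionary but not listed here gets the probability 0.00001,
# e.g. p("学院") = p("概率") = ... = 0.00001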

word_prob = {
    "北京": 0.03, "的": 0.08, "天": 0.005, "气": 0.005, "天气": 0.06, "真":0.04, "好": 0.05,
    "真好": 0.04, "啊": 0.01, "真好啊": 0.02, "今": 0.01, "今天": 0.07, "课程": 0.06, "内容": 0.06,
    "有": 0.05, "很": 0.03, "很有": 0.04, "意思": 0.06, "有意思": 0.005, "课": 0.01, "程": 0.005,
    "经常": 0.08, "意见": 0.08, "意": 0.01, "见": 0.005, "有意见": 0.02, "分歧": 0.04, "分": 0.02, "歧": 0.005
}

print(sum(word_prob.values()))
print(max_len_word)

for key, value in word_prob.items():
    dic_words[key] = value

1.0000000000000002
16

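2 Enumeration method

Data needed for this project:

综合类中文词库.xlsx: a list of Chinese words, used as the dictionary
word_prob: unigram probabilities for a subset of the words, provided as a variable

An example: given the dictionary [我们, 学习, 人工, 智能, 人工智能, 未来, 是] and the unigram probabilities p(我们)=0.25, p(学习)=0.15, p(人工)=0.05, p(智能)=0.1, p(人工智能)=0.2, p(未来)=0.1, p(是)=0.15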

2.1 For the given string "我们学习人工智能，人工智能是未来", find all possible segmentations

[我们，学习，人工智能，人工智能，是，未来]
[我们，学习，人工，智能，人工智能，是，未来]
[我们，学习，人工，智能，人工，智能，是，未来]
[我们，学习，人工智能，人工，智能，是，未来]
…

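2.2 Compute the probability of each segmentation and return the most probable one

Because the logarithm is monotonic, maximizing the product of the word probabilities is the same as minimizing the sum of their negative logs:

-log p(我们，学习，人工智能，人工智能，是，未来) = -log p(我们) - log p(学习) - log p(人工智能) - log p(人工智能) - log p(是) - log p(未来)
-log p(我们，学习，人工，智能，人工智能，是，未来) = -log p(我们) - log p(学习) - log p(人工) - log p(智能) - log p(人工智能) - log p(是) - log p(未来)
-log p(我们，学习，人工，智能，人工，智能，是，未来) = -log p(我们) - log p(学习) - log p(人工) - log p(智能) - log p(人工) - log p(智能) - log p(是) - log p(未来)
-log p(我们，学习，人工智能，人工，智能，是，未来) = -log p(我们) - log p(学习) - log p(人工智能) - log p(人工) - log p(智能) - log p(是) - log p(未来)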
…
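
The scoring rule can be checked by hand on the toy example. The sketch below is only an illustration: `toy_prob`, `neg_log_score`, and `candidates` are names introduced here, and the candidate list contains just the four segmentations written out above, not the full enumeration.

```python
import math

# Unigram probabilities from the toy example (illustrative variable name).
toy_prob = {
    "我们": 0.25, "学习": 0.15, "人工": 0.05, "智能": 0.1,
    "人工智能": 0.2, "未来": 0.1, "是": 0.15,
}

def neg_log_score(segmentation, prob):
    """Sum of -log p(w) over the words; a smaller score means a more probable segmentation."""
    return sum(-math.log(prob[word]) for word in segmentation)

# The four segmentations listed in 2.1.
candidates = [
    ["我们", "学习", "人工智能", "人工智能", "是", "未来"],
    ["我们", "学习", "人工", "智能", "人工智能", "是", "未来"],
    ["我们", "学习", "人工", "智能", "人工", "智能", "是", "未来"],
    ["我们", "学习", "人工智能", "人工", "智能", "是", "未来"],
]

best = min(candidates, key=lambda seg: neg_log_score(seg, toy_prob))
for seg in candidates:
    print(round(neg_log_score(seg, toy_prob), 3), seg)
print("best:", best)
```

The first candidate wins because p(人工智能) = 0.2 is much larger than p(人工) p(智能) = 0.005, so every split of 人工智能 adds to the score.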

import numpy as np

def segment_recur(input_str):
    """
    segment the input sentence recursively in forward way
    """
    len_input_str = len(input_str)
    
    segments = []
    # if the input_str is empty, return []
    if len_input_str == 0:
        return segments
    

    # maximum length of a possible word
    max_split = min(len_input_str, max_len_word) + 1
    for idx in range(1, max_split):
        word = input_str[0 : idx]
        
        # whether the word is contained in the dictionary
        # if it is true, segment the left sub string recursively
        if word in dic_words:
            segments_substr = segment_recur(input_str[idx :])
            # whether the sentence has been segmented completely
            # if so, add the word into the segments list
            # otherwise, prepend the word to every segmentation of the remainder
            if len(segments_substr) == 0 and len(input_str[idx :]) == 0:
                segments.append([word])
            else:
                for seg in segments_substr:
                    seg = [word] + seg
                    segments.append(seg)
                
    return segments

print(segment_recur("今天天气好f"))
print(segment_recur("今天天气好"))

[]
[['今', '天', '天', '气', '好'], ['今', '天', '天气', '好'], ['今', '天天', '气', '好'], ['今天', '天', '气', '好'], ['今天', '天气', '好']]

EPSILON = 1e-10

## TODO: write the word_segment_naive function to segment the input string
def word_segment_naive(input_str):
    """
    1. Segment the input string and collect every feasible segmentation.
    2. Compute the sentence probability of each segmentation.
    3. Return the segmentation with the highest probability.
    
    input_str: input string, e.g. "今天天气好"
    best_segment: best segmentation, e.g. ["今天", "天气", "好"]
    """

    # TODO: Step 1: compute all possible segmentations; every word must be in the dictionary.
    #       The number of results can be very large.
    # segments holds all segmentations; it is an empty list if the string cannot be fully segmented.
    # Format: segments = [["今天", "天气", "好"], ["今天", "天", "气", "好"], ["今", "天", "天气", "好"], ...]
    segments = segment_recur(input_str)
    
    # TODO: Step 2: loop over all segmentations and return the most probable one
    best_score = np.inf
    best_segment = []
    for seg in segments:
        # negative log probability of the segmentation (smaller is better)
        neg_log_prob = -1 * np.sum(np.log([dic_words[word] + EPSILON for word in seg]))
        
        # best score: minimum negative log probability
        if neg_log_prob < best_score:
            best_score = neg_log_prob
            best_segment = seg
            
    return best_segment

print(word_segment_naive("今天天气好"))

['今天', '天气', '好']

# Test
print(word_segment_naive("北京的天气真好啊"))
print(word_segment_naive("今天的课程内容很有意思"))
print(word_segment_naive("经常有意见分歧"))

['北京', '的', '天气', '真好啊']
['今天', '的', '课程', '内容', '很有', '意思']
['经常', '有意见', '分歧']

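2.3 Complexity analysis

(1) Time complexity:

$T(n) = \sum_{i = 1}^{m} (T(n - i) + 1) = \sum_{i = 1}^{m} T(n - i) + m = \mathcal{O} (m^{n})$

This bound is loose: a string of length $n$ has at most $2^{n - 1}$ segmentations, so the number of recursive calls is also $\mathcal{O} (2^{n})$. Either way the cost is exponential in $n$.

(2) Space complexity:

$S(n) = \mathcal{O} (n)$ for the recursion stack. The returned list of all segmentations is not included in this figure and can itself grow exponentially with $n$.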
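
To see the exponential growth concretely, the sketch below counts segmentations without building them, using memoization over start positions. It is an illustration under stated assumptions: `count_segmentations` and the artificial worst-case vocabulary (every run of "a" is a word) are introduced here and do not come from the dictionary file.

```python
from functools import lru_cache

def count_segmentations(text, vocabulary, max_word_len):
    """Count the ways to split text into words from vocabulary."""

    @lru_cache(maxsize=None)
    def count_from(start):
        # Exactly one way to segment the empty remainder.
        if start == len(text):
            return 1
        total = 0
        last_end = min(len(text), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            if text[start:end] in vocabulary:
                total += count_from(end)
        return total

    return count_from(0)

# Artificial worst case: every substring is a word, so the count is 2^(n-1).
for n in (5, 10, 20):
    vocab = {"a" * k for k in range(1, n + 1)}
    print(n, count_segmentations("a" * n, vocab, max_word_len=n))
```

Counting needs only one cached value per start position, which is the same observation that makes the Viterbi approach in the next section efficient.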

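3 Viterbi algorithm

Data needed for this project:

综合类中文词库.xlsx: a list of Chinese words, used as the dictionary
word_prob: unigram probabilities for a subset of the words, provided as a variable

3.1 Build a weighted directed acyclic graph (DAG) from the dictionary, the input sentence, and word_prob

Each edge of the graph carries the probability of one word (anything in the dictionary counts as a valid word); these probabilities are given in word_prob.
Note: think about which representation suits this graph. There is more than one way to store it.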

import pandas as pd

def create_graph(input_str):
    """
    create a directed acyclic graph (dag) for the input_str
    each edge (i, j) stands for the word input_str[i:j], weighted by its -log probability
    
    the graph is stored as a dense matrix (pandas DataFrame)
    """
    
    chars = list(input_str)
    num_chars = len(chars)
    
    # valid words
    word_list = []
    # valid edges, each of which represents a word split with [start, end]
    # e.g. split points [0, 1, 2, 3, 4, 5] for the string "今天天气好"
    #      the word "今" is represented as [0, 1],
    #      the word "今天" is represented as [0, 2],
    #      the word "天气" is represented as [2, 4],
    #      the word "好" is represented as [4, 5],
    edge_list = []
    for idx_start in range(num_chars):
        max_split = min(len(input_str[idx_start :]), max_len_word) + 1
        for idx_end in range(1, max_split):
            idx_end += idx_start
            word = input_str[idx_start : idx_end]
            if word in dic_words:
                word_list.append(word)
                edge_list.append([idx_start, idx_end])
                
    # create the dag matrix
    # row - start point, indices: 0, 1, 2, ..., n - 1
    # column - destination, indices: 1, 2, 3, ..., n
    # each element represents a possible edge with the corresponding (- log_prob)
    graph = pd.DataFrame(
        data=np.zeros(shape=(num_chars, num_chars)),
        index=list(range(num_chars)),
        columns=list(range(1, num_chars + 1))
    )
    for idx, edge in enumerate(edge_list):
        word = word_list[idx]
        graph.loc[edge[0], edge[1]] = dic_words[word]
    graph = -1 * np.log(graph + EPSILON)
    
    return graph

graph = create_graph("今天天气好")
print(graph)

           1          2          3          4          5
0   4.605170   2.659260  23.025851  23.025851  23.025851
1  23.025851   5.298317  11.512915  23.025851  23.025851
2  23.025851  23.025851   5.298317   2.813411  23.025851
3  23.025851  23.025851  23.025851   5.298317  23.025851
4  23.025851  23.025851  23.025851  23.025851   2.995732
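
The dense matrix above stores a cost for every (start, end) pair, including the many pairs that are not words. One alternative, sketched here as an assumption rather than as the post's method, is an adjacency list that keeps only real edges. `create_edge_lists` and `toy_prob` are hypothetical names; the probabilities are copied from word_prob, with 天天 at the default 0.00001.

```python
import math

def create_edge_lists(input_str, word_prob, max_word_len):
    """Return incoming[j] = [(i, cost), ...] for every word input_str[i:j]."""
    n = len(input_str)
    incoming = {end: [] for end in range(1, n + 1)}
    for start in range(n):
        last_end = min(n, start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = input_str[start:end]
            if word in word_prob:
                # Edge weight is the negative log probability of the word.
                incoming[end].append((start, -math.log(word_prob[word])))
    return incoming

toy_prob = {
    "今": 0.01, "天": 0.005, "气": 0.005, "好": 0.05,
    "今天": 0.07, "天气": 0.06, "天天": 0.00001,
}
edges = create_edge_lists("今天天气好", toy_prob, max_word_len=2)
for end, preds in edges.items():
    print(end, [(start, round(cost, 3)) for start, cost in preds])
```

This representation needs space proportional to the number of valid words found, at most $mn$ entries, and it does not need EPSILON to stand in for missing edges.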

3.2 Find the shortest path with the Viterbi algorithm

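## TODO: write the word_segment_viterbi function to segment the input string
def word_segment_viterbi(input_str):
    """
    1. Build the DAG from the input string, the dictionary, and the given unigram probabilities.
    2. Run the Viterbi algorithm to find the optimal path.
    3. Return the segmentation.
    
    input_str: input string, e.g. "今天天气好"
    best_segment: best segmentation, e.g. ["今天", "天气", "好"]
    """
    
    # TODO: Step 1: build the weighted directed graph from the dictionary, the input sentence,
    #      and the given unigram probabilities (see the course material).
    #      Each edge carries the probability of one word (anything in the dictionary is a valid word).
    #      The probabilities are in word_prob; dictionary words missing from word_prob get 0.00001.
    #      Note: think about which representation suits this graph; there is more than one way to store it.
    graph = create_graph(input_str)
    
    # TODO: Step 2: use the Viterbi algorithm to find the best path, i.e. the path with the
    #      largest P(sentence), or equivalently the smallest -log P(sentence).
    #      hint: why sum negative logs, -log p(w1) - log p(w2) - ..., instead of multiplying p(w1)p(w2)...?
    num_chars = len(input_str)
    
    # the minimum distance from 0 to each split point j
    optimal_path_distances = pd.Series(
        data=np.zeros(shape=(num_chars + 1, )),
        index=list(range(num_chars + 1))
    )
    # best predecessor of each split point on the optimal path
    # (np.int was removed in NumPy 1.24; the built-in int is used instead)
    optimal_path = pd.Series(
        data=np.zeros(shape=(num_chars + 1, ), dtype=int),
        index=list(range(num_chars + 1))
    )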
    for idx_end in range(1, num_chars + 1):
        possible_path_distances = np.zeros(shape=(idx_end, ))
        for idx_start in range(idx_end):
            possible_path_distances[idx_start] = \
                optimal_path_distances[idx_start] + graph.loc[idx_start, idx_end]
        optimal_path_distances[idx_end] = np.min(possible_path_distances)
        optimal_path[idx_end] = np.argmin(possible_path_distances)
    
    # TODO: Step 3: recover the best segmentation from the best path
    best_segment = []
    
    idx = num_chars
    while idx > 0:
        best_segment.append(input_str[optimal_path.loc[idx]: idx])
        idx = optimal_path.loc[idx]
    best_segment.reverse()
    
    return best_segment

best_segment = word_segment_viterbi("今天天气好")

# Test
print(word_segment_viterbi("北京的天气真好啊"))
print(word_segment_viterbi("今天的课程内容很有意思"))
print(word_segment_viterbi("经常有意见分歧"))

['北京', '的', '天气', '真好啊']
['今天', '的', '课程', '内容', '很有', '意思']
['经常', '有意见', '分歧']
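
On the hint about summing negative logs instead of multiplying probabilities: a product of many small numbers underflows in floating point, while the sum of logs stays well scaled. The sketch below illustrates this with a hypothetical sentence of 100 words that all carry the default probability 0.00001; the variable names are introduced here for the example.

```python
import math

p = 0.00001      # default probability used for unlisted dictionary words
num_words = 100  # hypothetical sentence length

# Multiplying probabilities: the true value 1e-500 is below the smallest
# positive double, so the product collapses to exactly 0.0.
product = 1.0
for _ in range(num_words):
    product *= p

# Summing negative logs: the same quantity stays representable.
neg_log_sum = sum(-math.log(p) for _ in range(num_words))

print(product)
print(neg_log_sum)
```

Once two candidate products have both underflowed to 0.0 they can no longer be compared, whereas their negative log sums still can.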

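3.3 Complexity analysis

(1) Time complexity:

Graph: $\mathcal{O} (n^{2})$, since allocating the $n \times n$ matrix dominates the $\mathcal{O} (m n)$ edge enumeration. Viterbi: $\mathcal{O} (n^{2})$ as written, because the inner loop scans every earlier split point; restricting it to the last $m$ split points would give $\mathcal{O} (m n)$. Backtrace: $\mathcal{O} (n)$. The overall time complexity is therefore $\mathcal{O} (n^{2})$.

(2) Space complexity:

Graph: $\mathcal{O} (n^{2})$; Viterbi: $\mathcal{O} (n)$; backtrace: $\mathcal{O} (1)$ beyond the returned segmentation. The overall space complexity is therefore $\mathcal{O} (n^{2})$.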
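
The quadratic terms come from the dense matrix, not from the algorithm itself. Below is a minimal sketch of a variant that looks back at most $m$ characters from each split point and skips the matrix, so it performs $\mathcal{O} (m n)$ dictionary lookups with $\mathcal{O} (n)$ extra space. `viterbi_segment` and `toy_prob` are hypothetical names, and returning an empty list for an unsegmentable input is a choice made for this sketch.

```python
import math

def viterbi_segment(input_str, word_prob, max_word_len):
    """Most probable segmentation under a unigram model, or [] if none exists."""
    n = len(input_str)
    best_cost = [0.0] + [math.inf] * n   # best_cost[j]: min cost to reach split point j
    back = [0] * (n + 1)                 # back[j]: predecessor of j on the best path
    for end in range(1, n + 1):
        # Only the last max_word_len split points can start a word ending here.
        for start in range(max(0, end - max_word_len), end):
            word = input_str[start:end]
            if word in word_prob and best_cost[start] < math.inf:
                cost = best_cost[start] - math.log(word_prob[word])
                if cost < best_cost[end]:
                    best_cost[end] = cost
                    back[end] = start
    if best_cost[n] == math.inf:
        return []
    # Walk the predecessor links from the end of the string back to 0.
    segments = []
    idx = n
    while idx > 0:
        segments.append(input_str[back[idx]:idx])
        idx = back[idx]
    segments.reverse()
    return segments

toy_prob = {
    "今": 0.01, "天": 0.005, "气": 0.005, "好": 0.05,
    "今天": 0.07, "天气": 0.06, "天天": 0.00001,
}
print(viterbi_segment("今天天气好", toy_prob, max_word_len=2))
```

Each lookup slices a substring of length at most $m$, so the character-level work is slightly higher than the lookup count, but it no longer depends on $n^{2}$.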

K5niper
