Building a Chatbot with Machine Learning (6): Principles

This is the sixth article in the "Building a Chatbot with Machine Learning" series. It introduces the principles behind the algorithms used in the code. Understanding these principles lets us know what the program is doing behind the scenes, why errors sometimes occur, which algorithm to choose next time, and which scenarios each algorithm suits best.

  • word2vec
The word vector model we use is trained with word2vec. word2vec is an NLP tool released by Google in 2013. Its key feature is that it turns every word into a vector, so that the relationship between any two words can be measured quantitatively and the connections between words can be mined.

word2vec is based on the idea of distributed representation (Distributed Representation), which, compared with one-hot encoding, can represent a word with a much lower-dimensional vector. An interesting finding about such word vector representations is that we can discover relationships like: King - Man + Woman = Queen.
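As an aside, such analogies can be checked directly with gensim's pretrained vectors. A minimal sketch, assuming the pretrained "word2vec-google-news-300" model is fetched via gensim's downloader (a sizable download; not part of the original project):

```python
# Sketch: verifying the King - Man + Woman ≈ Queen analogy with pretrained
# vectors. The model name and download step are assumptions, not SkyAAE code.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # downloads on first use
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected top result: 'queen'
```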
word2vec implements two neural network models, CBOW and Skip-Gram; SkyAAE uses the CBOW model to train its word vectors.

The core idea of CBOW (Continuous Bag-of-Words) is that a word can be predicted from the words around it. The model's training input is the feature vectors of the context words around a particular word, and the output is the vector of that particular word. For example, take the passage "... an efficient method for [learning] high quality distributed vector ...". Suppose we want to predict "learning" and choose a context window of size 4: the vector of "learning" is the output we want, and the corresponding context consists of eight words, four before and four after, which are the input to our model. Because CBOW uses a bag-of-words model, these eight words are all treated equally; that is, the distance between them and the target word is ignored, as long as they fall within the context window. Skip-Gram is the counterpart of CBOW: its core idea is to predict the surrounding words from the current word.
As can be seen, once a word2vec model is trained, the vector representing each word is fixed; afterwards, whenever you feed a word to the model, it gives you the corresponding vector. So given a sentence, we first segment it into words, look up the vector of each word, and then combine these vectors in some way to represent the whole sentence, for example by summing them element-wise and dividing by the number of words.
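To make this concrete, here is a minimal sketch of training a CBOW model and averaging word vectors into a sentence vector with gensim; the toy corpus and hyperparameters are illustrative, not the ones the SkyAAE project actually uses:

```python
# Sketch: CBOW training (sg=0) and sentence vectors as averaged word vectors.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    ["how", "is", "the", "rating", "of", "this", "movie"],
    ["when", "does", "this", "movie", "come", "out"],
    ["who", "are", "the", "actors", "in", "this", "movie"],
]

# sg=0 selects CBOW; window=4 means up to 4 words before and after the target.
model = Word2Vec(corpus, vector_size=50, window=4, min_count=1, sg=0)

def sentence_vector(words, model):
    """Average the vectors of the in-vocabulary words to represent a sentence."""
    vectors = [model.wv[w] for w in words if w in model.wv]
    return np.mean(vectors, axis=0)

print(sentence_vector(["how", "is", "this", "movie"], model).shape)  # (50,)
```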
  • Cosine similarity
The algorithm we use in the semantic matching stage is cosine similarity. Cosine similarity measures the similarity between two vectors by the cosine of the angle between them. We know that the cosine of 0° is 1 and the cosine of any other angle is no greater than 1; its minimum value is -1. However, cosine similarity is usually used in the positive space, which gives values between 0 and 1. When two vectors have the same orientation, the cosine similarity is 1; when the angle between them is 90°, it is 0. The cosine of the angle between two vectors can therefore tell us whether the two vectors point in roughly the same direction. The specific formula is as follows:

cos(θ) = (A · B) / (||A|| × ||B||) = Σ (Ai × Bi) / (√(Σ Ai²) × √(Σ Bi²))
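A small sketch of that formula in numpy (the vectors are hypothetical, for illustration only):

```python
# Sketch: cosine similarity between two vectors, cos = (a.b) / (|a| * |b|).
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])      # same direction as a
c = np.array([-2.0, 1.0, 0.0])     # orthogonal to a
print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```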
• Naive Bayes
We use the multinomial Naive Bayes algorithm in the intent classification stage to assign an input question to its corresponding intent category. Let us first look at what Naive Bayes is. Naive Bayes is a classification method based on Bayes' theorem together with the assumption of conditional independence between features. The derivation of Bayes' formula starts from the two ways of expanding the joint probability:

P(c, x) = P(c|x)P(x) = P(x|c)P(c)
c
One particular outcome of a random event. For example, the intent classes in movie-domain Q&A might include chit-chat, rating, release date, actors, and so on; if a user asking a question is the random event, then the user asking about a rating is one particular outcome of it.
x
The factors related to the random event, used here as the condition of the probability.
P(c|x)
The probability of c given condition x. For example, P("rating" | "How is the rating of the movie Kung Fu?") is the probability that the intent of the question "How is the rating of the movie Kung Fu?" is "rating".
P(x|c)
The probability of condition x given that c has occurred, i.e. the class-conditional probability (likelihood), which can be computed from historical data.
P(c)
The probability of c regardless of the related factors (the prior).
P(x)
The probability of x regardless of the related factors.
From the derivation we obtain:
P(c|x) = P(c)P(x|c)/P(x)
Suppose we have a dataset of movie-domain questions labeled with their intent classes. Then P(c(i)) = (number of occurrences of c(i)) / (total number of occurrences of all classes); for example, c(i) might be the "rating" intent or the "release date" intent.
From the "naive" assumption that features are conditionally independent, we get:
p(x|c) = Π p(xi|c) (1 <= i <= d), where d is the number of attributes.
This yields the concrete Naive Bayes formula (here c stands for c(i)):

P(c|x) = (P(c)/P(x)) × Π p(xi|c) (1 <= i <= d)
The idea behind classifying with this formula is to compute every p(c(i)|x) and take the c(i) with the largest value (probability) as the predicted class. Expressed as a formula:

h(x) = argmax_{c ∈ y} P(c) Π p(xi|c) (1 <= i <= d)
h is the hypothesis trained with the Naive Bayes algorithm; its value is the outcome c that the Bayes classifier considers most likely given the factors x. y is the set of values c can take. P(x) is dropped here because it does not depend on c and therefore does not affect which c is the largest.
Intuitively, Naive Bayes is all about the frequencies with which sample attributes and sample classes appear: from the existing samples' attributes and classes we compute the various probabilities, plug a new sample into the formula to obtain its probability of belonging to each class, and take the class with the largest probability as the new sample's class.
For the computation to be accurate, a few conditions should therefore be met (a toy worked sketch follows this list):
• The number of training samples in each class should be as balanced as possible.
• The attribute values of the training samples should cover all possible values of each attribute.
• Laplace correction should be applied for smoothing.
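Here is a toy, hand-rolled sketch of the decision rule h(x) = argmax P(c) Π p(xi|c); the sample data are made up, and no Laplace correction is applied yet, which is exactly why a zero count can zero out a whole product:

```python
# Sketch: Naive Bayes decision rule on made-up 0/1 feature vectors.
samples = [
    ([1, 0, 1], "rating"),
    ([1, 1, 0], "rating"),
    ([0, 1, 1], "release_date"),
    ([0, 1, 0], "release_date"),
]
classes = {c for _, c in samples}

def predict(x):
    best_class, best_score = None, -1.0
    for c in classes:
        in_class = [f for f, cc in samples if cc == c]
        score = len(in_class) / len(samples)              # prior P(c)
        for i, xi in enumerate(x):
            match = sum(1 for f in in_class if f[i] == xi)
            score *= match / len(in_class)                # P(xi | c), unsmoothed
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict([1, 0, 1]))  # "rating"
```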
• Multinomial Naive Bayes
When choosing the Naive Bayes classifier, we used the one-hot idea to build our sentence vectors, so all the feature values are discrete 0/1 values, which makes the multinomial model the more suitable way to compute p(xi|c) (for continuous values, a Gaussian model is more suitable):

P(xi|c) = |Dc,xi| / |Dc|
Dc denotes the set of samples of class c in the training set D, and a pair of vertical bars around a set denotes its number of elements;
Dc,xi denotes the set of samples in Dc whose i-th feature takes the value xi.
To avoid the case where some feature value xi never appears together with class c in the training samples, which would drive the posterior probability to 0, some smoothing is applied:

P(c) = (|Dc| + 1) / (|D| + K)
P(xi|c) = (|Dc,xi| + 1) / (|Dc| + Ni)
K is the total number of classes;
Ni is the number of possible values of the i-th feature.
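In practice these counts and the Laplace correction are handled by a library. A minimal sketch with scikit-learn's MultinomialNB; the questions, labels, and alpha value are illustrative, not the actual SkyAAE training data or settings:

```python
# Sketch: intent classification with 0/1 sentence vectors and MultinomialNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

questions = [
    "how is the rating of this movie",
    "what score did this movie get",
    "when does this movie come out",
    "what is the release date of this movie",
]
intents = ["rating", "rating", "release_date", "release_date"]

vectorizer = CountVectorizer(binary=True)   # binary=True -> 0/1 features
X = vectorizer.fit_transform(questions)

clf = MultinomialNB(alpha=1.0)              # alpha=1.0 is Laplace smoothing
clf.fit(X, intents)

print(clf.predict(vectorizer.transform(["what rating does this movie have"])))
```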
• Levenshtein distance
ChatterBot's default semantic matching algorithm is the Levenshtein distance, a kind of edit distance: the minimum number of edit operations required to transform one string into another, where the allowed operations are replacing one character with another, inserting a character, and deleting a character. For example, turning kitten into sitting takes three operations:
kitten (k→s) sitten (e→i) sittin (+g) sitting
The logic of this algorithm is clear and concise, but it is too simple to serve as a chatbot's semantic matching algorithm, so we chose not to use it; the specific reasons were covered in detail in "Hands-On Guide to Building a Chatbot (3): Design" and are not repeated here. A reference sketch of the algorithm follows.
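For reference, a short dynamic-programming sketch of the algorithm (not ChatterBot's own implementation):

```python
# Sketch: Levenshtein distance with a rolling one-row DP table.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1     # 0 when characters already match
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement / match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3: k->s, e->i, +g
```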

This article is the last in the "Hands-On Guide to Building a Chatbot" series; it introduced the principles of the algorithms behind the chatbot we built. The next post will give a summary of the whole series.

OK, that's all for this post. Thanks for reading! O(∩_∩)O

Origin: www.cnblogs.com/anai/p/12074358.html