Do-it-yourself NLP for bot developers

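I believe that, in most cases, it makes sense for chatbot developers to build their own natural language parser rather than rely on a third-party cloud API. There are good strategic and technical arguments for doing so, and I want to show you how easy it is to implement NLP yourself. This post has three parts:

Why you should do it yourself
The simplest implementation already works well
Something you can actually use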

So what NLP stack do you need for a typical bot? Let's say you're building a service to help people find restaurants. Your users might say something like:

I’m looking for a cheap Mexican place in the centre.

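To respond to this, you need to do two things:

Understand the user's intent: they are looking for a restaurant, not saying "hello", "goodbye", or "thanks".
Extract cheap, Mexican, and centre as the fields for your query.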
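To make the goal concrete, here is a rough sketch of the kind of structured result we are after for the sentence above. The intent name restaurant_search, the dictionary layout, and the to_query helper are my own assumptions for illustration; the entity labels mirror the ones used in the MITIE example further down.

```python
# Hypothetical target structure; the field names are illustrative only.
parsed = {
    "text": "I'm looking for a cheap Mexican place in the centre.",
    "intent": "restaurant_search",
    "entities": {"pricerange": "cheap", "cuisine": "Mexican", "area": "centre"},
}


def to_query(parsed):
    """Turn the extracted entities into lowercase query fields for a backend search."""
    return {field: value.lower() for field, value in parsed["entities"].items()}


print(to_query(parsed))
```

Everything that follows is about producing the intent and the entities from raw text.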

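In a previous post I mentioned that tools like wit and LUIS make intent classification and entity extraction so simple that you can build a chatbot over the course of a hackathon. I'm a big fan of these cloud services and of the teams behind them, but that doesn't mean they're the right fit for every situation.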

1. Three reasons to use an NLP library rather than a cloud API

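First, if you actually want to build a business based on conversational software, passing everything your users tell you over to Facebook or Microsoft is probably not a good strategy. Second, I don't believe that web APIs are the solution to every problem in programming. https calls are slow, and you're always constrained by the design choices that went into the API. Libraries, on the other hand, are hackable. Third, there's a good chance you'll be able to build something that performs better on your own data and use case. Remember, a general-purpose API has to do well on every problem, whereas you only have to do well on yours.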

2. Word vectors + heuristics - complexity = working code

First, we'll build the simplest possible model without using any libraries (other than numpy), to get a feel for how it works.

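I'm a firm believer that the only thing you can really do in machine learning is find a good representation. If that isn't obvious, I'm currently writing another post that explains it, so check back later. The point is that if you have a good way of representing your data, even extremely simple algorithms will do the job.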

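We're going to use word vectors, which are vectors of a few tens or hundreds of floating point numbers that capture, to some extent, the meaning of a word. The fact that this can be done at all, and the way these models are trained, are both very interesting. For our purposes it means the hard work has already been done for us: word embeddings like word2vec or GloVe are a powerful way to represent text data. I decided to use GloVe for these experiments. You can download pre-trained models from their repo, and I took the smallest (50-dimensional) pre-trained model.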
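If cosine similarity is new to you, the idea fits in a few lines. The three-dimensional vectors below are made up purely for illustration (real GloVe vectors have 50 or more dimensions), and cosine is just a helper name chosen here:

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up toy vectors, NOT real GloVe values.
toy = {
    "mexican": np.array([0.9, 0.1, 0.0]),
    "indian": np.array([0.8, 0.2, 0.1]),
    "cheap": np.array([0.1, 0.9, 0.2]),
}

# Two cuisine words point in a similar direction, so their cosine is higher
# than the cosine between a cuisine word and a price word.
print(cosine(toy["mexican"], toy["indian"]))
print(cosine(toy["mexican"], toy["cheap"]))
```

With vectors that are already normalized to unit length, the cosine reduces to a plain dot product, which is what the code in this section relies on.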

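The code below is based on the Python examples in the GloVe repo. All it does is load the word vectors for the whole vocabulary into memory: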

import numpy as np


class Embedding(object):
    def __init__(self,vocab_file,vectors_file):
        with open(vocab_file, 'r') as f:
            words = [x.rstrip().split(' ')[0] for x in f.readlines()]

        with open(vectors_file, 'r') as f:
            vectors = {}
            for line in f:
                vals = line.rstrip().split(' ')
                vectors[vals[0]] = [float(x) for x in vals[1:]]

        vocab_size = len(words)
        vocab = {w: idx for idx, w in enumerate(words)}
        ivocab = {idx: w for idx, w in enumerate(words)}

        vector_dim = len(vectors[ivocab[0]])
        W = np.zeros((vocab_size, vector_dim))
        for word, v in vectors.items():
            if word == '<unk>':
                continue
            W[vocab[word], :] = v

        # normalize each word vector to unit length (L2 norm)
        d = (np.sum(W ** 2, 1) ** (0.5))
        d[d == 0] = 1.0  # avoid dividing by zero for words that have no vector
        W_norm = (W.T / d).T

        self.W = W_norm
        self.vocab = vocab
        self.ivocab = ivocab

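Now let's try using these word vectors for the first task: picking out Mexican as a cuisine type in the sentence I’m looking for a cheap Mexican place in the centre. We'll use the simplest approach possible: look for the words in the sentence that are most similar to a few example cuisine types. We loop over the words in the sentence and pick out the ones whose average cosine similarity to the reference words is above some threshold: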

def find_similar_words(embed,text,refs,thresh):

    C = np.zeros((len(refs),embed.W.shape[1]))

    for idx, term in enumerate(refs):
        if term in embed.vocab:
            C[idx,:] = embed.W[embed.vocab[term], :]


    tokens = text.split(' ')
    scores = [0.] * len(tokens)
    found=[]

    for idx, term in enumerate(tokens):
        if term in embed.vocab:
            vec = embed.W[embed.vocab[term], :]
            cosines = np.dot(C,vec.T)
            score = np.mean(cosines)
            scores[idx] = score
            if (score > thresh):
                found.append(term)
    print(scores)

    return found

Let's try it out on an example sentence.

vocab_file ="/path/to/vocab_file"
vectors_file ="/path/to/vectors_file"

embed = Embedding(vocab_file,vectors_file)

cuisine_refs = ["mexican","chinese","french","british","american"]
threshold = 0.2

text = "I want to find an indian restaurant"

cuisines = find_similar_words(embed,text,cuisine_refs,threshold)
print(cuisines)
# >>> ['indian']

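Amazingly, the code above is enough to generalize correctly and to pick out indian as a cuisine type, based on its similarity to the reference words. This is why I say that once you have a good representation, the problem becomes easy.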

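Now on to classifying the user's intent. We want to be able to put sentences into categories like "greeting", "thanking", "requesting a restaurant", "specifying a location", "rejecting a suggestion", and so on, so that we can tell the bot's backend which code to run. There are many ways to build a sentence representation by combining word vectors but, once again, we'll go with the simplest one: adding them up. If you're skeptical that this could possibly be meaningful, the appendix explains why it works.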

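We can create these bag-of-words vectors for each sentence and, again, classify them using a simple distance criterion. Once more, amazingly, this already generalizes to sentences it has never seen before: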

import numpy as np

def sum_vecs(embed,text):

    tokens = text.split(' ')
    vec = np.zeros(embed.W.shape[1])

    for idx, term in enumerate(tokens):
        if term in embed.vocab:
            vec = vec + embed.W[embed.vocab[term], :]
    return vec


def get_centroid(embed,examples):

    C = np.zeros((len(examples),embed.W.shape[1]))
    for idx, text in enumerate(examples):
        C[idx,:] = sum_vecs(embed,text)

    centroid = np.mean(C,axis=0)
    assert centroid.shape[0] == embed.W.shape[1]
    return centroid


def get_intent(embed,text):
    intents = ['deny', 'inform', 'greet']
    vec = sum_vecs(embed,text)
    scores = np.array([ np.linalg.norm(vec-data[label]["centroid"]) for label in intents ])
    return intents[np.argmin(scores)]


embed = Embedding('/path/to/vocab','/path/to/vectors')


data={
  "greet": {
    "examples" : ["hello","hey there","howdy","hello","hi","hey","hey ho"],
    "centroid" : None
  },
  "inform": {
    "examples" : [
      "i'd like something asian",
      "maybe korean",
      "what mexican options do i have",
      "what italian options do i have",
      "i want korean food",
      "i want german food",
      "i want vegetarian food",
      "i would like chinese food",
      "i would like indian food",
      "what japanese options do i have",
      "korean please",
      "what about indian",
      "i want some vegan food",
      "maybe thai",
      "i'd like something vegetarian",
      "show me french restaurants",
      "show me a cool malaysian spot"
    ],
    "centroid" : None
  },
  "deny": {
    "examples" : [
      "nah",
      "any other places ?",
      "anything else",
      "no thanks",
      "not that one",
      "i do not like that place",
      "something else please",
      "no please show other options"
    ],
    "centroid" : None
  }
}


for label in data.keys():
    data[label]["centroid"] = get_centroid(embed,data[label]["examples"])


for text in ["hey you","i am looking for chinese food","not for me"]:
    print("text : '{0}', predicted_label : '{1}'".format(text,get_intent(embed,text)))

# output
# >>>text : 'hey you', predicted_label : 'greet'
# >>>text : 'i am looking for chinese food', predicted_label : 'inform'
# >>>text : 'not for me', predicted_label : 'deny'

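Neither the parsing nor the classification I've shown is particularly robust, so we'll move on to something better. But I hope I've shown that there's nothing mysterious going on, and that very simple approaches can already work.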
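One source of brittleness is the text.split(' ') call in the functions above: a token like centre. (with its trailing full stop) or Mexican (capitalised) will not be found in a lowercase vocabulary. Here is a minimal sketch of a slightly more forgiving tokenizer, assuming the embedding's vocabulary is lowercase. simple_tokenize and TOKEN_RE are hypothetical names, and this is not the tokenization GloVe itself was trained with:

```python
import re

# A word is a run of letters/digits, optionally followed by an apostrophe suffix ('m, 'd, ...).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def simple_tokenize(text):
    """Lowercase the text, normalise curly apostrophes, and drop punctuation."""
    normalized = text.lower().replace("\u2019", "'")
    return TOKEN_RE.findall(normalized)


print(simple_tokenize("I\u2019m looking for a cheap Mexican place in the centre."))
```

Swapping something like this in for the whitespace split means centre and mexican can be looked up directly, though contractions such as i'm may still be missing from a given vocabulary.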

3. Something you can actually use

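There are lots of things we can do better, for example tokenizing the text properly rather than just splitting on whitespace. One approach is to use a combination of SpaCy and textacy to clean up and parse the text, and scikit-learn to build the models. Here I'll use the Python bindings for MITIE (the MIT Information Extraction library) to do the job.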

There are two classes we can use directly. First, a text classifier:

import sys, os
from mitie import *

trainer = text_categorizer_trainer("/path/to/total_word_feature_extractor.dat")

data = {} # same as before  - omitted for brevity

for label in data.keys():
  for text in data[label]["examples"]:
    tokens = tokenize(text)
    trainer.add_labeled_text(tokens,label)

trainer.num_threads = 4
cat = trainer.train()

cat.save_to_disk("my_text_categorizer.dat")

# we can then use the categorizer to predict on new text
tokens = tokenize("somewhere that serves chinese food")
predicted_label, _ = cat(tokens)

Second, an entity recognizer:

import sys, os
from mitie import *
sample = ner_training_instance(["I", "am", "looking", "for", "some", "cheap", "Mexican", "food", "."])

# Token ranges are half-open: range(5,6) covers only the token at index 5
sample.add_entity(range(5,6), "pricerange")
sample.add_entity(range(6,7), "cuisine")

# And we add another training example
sample2 = ner_training_instance(["show", "me", "indian", "restaurants", "in", "the", "centre", "."])
sample2.add_entity(range(2,3), "cuisine")
sample2.add_entity(range(6,7), "area")


trainer = ner_trainer("/path/to/total_word_feature_extractor.dat")

trainer.add(sample)
trainer.add(sample2)

trainer.num_threads = 4

ner = trainer.train()

ner.save_to_disk("new_ner_model.dat")


# Now let's make up a test sentence and ask the ner object to find the entities.
tokens = ["I", "want", "expensive", "korean", "food"]
entities = ner.extract_entities(tokens)


print("\nEntities found:", entities)
print("\nNumber of entities detected:", len(entities))
for e in entities:
    token_range = e[0]  # renamed so it does not shadow the built-in range
    tag = e[1]
    entity_text = " ".join(tokens[i] for i in token_range)
    print("    " + tag + ": " + entity_text)

# output 
# >>> Number of entities detected: 2
# >>>     pricerange: expensive
# >>>     cuisine: korean

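The MITIE library is quite sophisticated and uses spectral word embeddings rather than GloVe. The text classifier is a straightforward SVM, whereas the entity recognizer uses a structural SVM. If you're interested, the github repo has links to the relevant literature.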

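As you'd expect, using a library like this (or SpaCy plus your favorite ML library) gives much better performance than the experimental code I posted at the start. In fact, in my experience you can quickly beat the performance of wit or LUIS, because you can tune the parameters for your own data set.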

Conclusion

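I hope I've convinced you that it's worth building your own NLP module when creating a chatbot. Please add your ideas, suggestions, and questions below. I look forward to the discussion. If you enjoyed this post, you can like it here or, even better, share it on twitter.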

Thanks to Alex, Kate, Norman and Joey for reading drafts!

Appendix: sparse recovery

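How can it possibly work to add up (or average) the word vectors in a sentence and use the result to represent the sentence? It's like being told that a class of 10 students scored an average of 75% on a test and then trying to figure out what each individual's score was. Well, almost. It turns out to be one of those counterintuitive things about high-dimensional geometry.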

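If you sample 10 vectors from a set of 1000, you can actually work out which ones you picked just from knowing their average, provided the vectors are of sufficiently high dimension (say 300). It boils down to the fact that there is a lot of room in ℝ³⁰⁰, so if you sample a pair of vectors at random you can expect them to be (almost) orthogonal.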

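We're not interested in the length of the word vectors (they are normalized in the code above), so we can think of them as points on the unit sphere. Say we have a set of N vectors V ⊂ ℝ^d which are independent and identically distributed (iid) on the unit d-sphere. The question is: given a subset S of V, how large does d need to be for us to recover all of S if we only know the sum x = Σ v_i over the vectors v_i in S? Recovery is only possible if the dot product between x and every v ∉ S is sufficiently small, while the vectors v ∈ S have v · x ≈ 1.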
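The recovery rule described here (keep the vectors with the largest dot product against x) is easy to try numerically. The sketch below uses random unit vectors rather than real word vectors, and random_unit_vectors and recover_subset are hypothetical helper names. I have deliberately picked a more generous dimension (d = 1000) than the 300 mentioned above so that a single run is all but certain to succeed; lowering d is a good way to see where recovery starts to miss.

```python
import numpy as np


def random_unit_vectors(n, d, rng):
    """Draw n points uniformly at random from the unit sphere in R^d."""
    v = rng.standard_normal((n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def recover_subset(vectors, x, k):
    """Guess which k rows were summed into x: take the k largest dot products."""
    scores = vectors @ x
    return set(np.argsort(scores)[-k:].tolist())


rng = np.random.default_rng(0)
N, k, d = 1000, 10, 1000  # d is an assumption chosen to make this toy run reliable

V = random_unit_vectors(N, d, rng)
chosen = set(rng.choice(N, size=k, replace=False).tolist())
x = V[sorted(chosen)].sum(axis=0)  # all we "know" is the sum of the chosen vectors

recovered = recover_subset(V, x, k)
print(recovered == chosen)
```

Members of the subset score roughly 1 against x, while everything else scores close to 0, and the gap between the two groups is what makes the sum usable as a representation.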

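We can use a result called concentration of measure, which tells us exactly what we need. For iid points on the unit d-sphere, the dot product between any two has mean zero and a typical size of |v · w| ~ 1/√d (its root-mean-square value is exactly 1/√d). The probability that the dot product is larger than a is P(v · w > a) ≤ (1 − a²)^(d/2). So we can write down the probability ε that some vector v ∉ S is too close to one of the vectors in S, in terms of the dimension of the space. This gives the result that, to bring the probability of failure down to ε, we need d > S log(NS/ε), where S here stands for the number of vectors in the subset. So if we want to recover a subset of 10 vectors from a total of 1000 with a 1% tolerance for failure, we can do it in about 138 dimensions or more.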
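The figure of 138 can be checked by plugging the numbers into the bound. This short calculation assumes the logarithm is the natural log, which is the reading that reproduces the quoted value; min_dimension is a hypothetical helper name.

```python
import math


def min_dimension(subset_size, total, eps):
    """Right-hand side of d > S log(N S / eps), using the natural logarithm."""
    return subset_size * math.log(total * subset_size / eps)


bound = min_dimension(subset_size=10, total=1000, eps=0.01)
print(round(bound, 1))  # 10 * ln(10^6), roughly 138.2
```

Note how weakly the bound depends on N and ε: both sit inside the logarithm, so the required dimension grows mainly with the size of the subset.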

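Going back to the test score analogy, thinking of it this way might make things clearer. Instead of the average total score, we now get the average score on each question. So you're given the average score per question for 10 students, and all we're saying is that it becomes easier to tell which students they were as the test contains more questions. That isn't so counterintuitive after all.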

Thanks to Alexander Weidauer.

Original article: Do-it-yourself NLP for bot developers