Operations on word vectors

Because word embeddings are computationally expensive to train, most ML practitioners load a pre-trained set of embeddings.

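After this assignment, you will be able to:

Load pre-trained word vectors and measure similarity using cosine similarity
Use word embeddings to solve word analogy problems such as "Man is to Woman as King is to ______"
Modify word embeddings to reduce their gender bias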

Let's get started! Run the following cell to load the packages you will need.

import numpy as np
from w2v_utils import *

Next, let's load the word vectors. For this assignment, we will use 50-dimensional GloVe vectors to represent words. Run the following cell to load the word_to_vec_map.

words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

You've loaded:

words: the set of words in the vocabulary.
word_to_vec_map: a dictionary mapping words to their GloVe vector representations.
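
The helper read_glove_vecs comes from w2v_utils and is not shown in this post. As a rough sketch of what such a loader might do, assuming the usual GloVe text format (one word per line followed by its space-separated vector components), the hypothetical function load_glove_sketch below builds the same two objects. The tiny temporary file is made up for illustration only.

```python
import os
import tempfile

import numpy as np


def load_glove_sketch(path):
    """Hypothetical loader: returns (set of words, dict word -> vector).

    Assumes each line looks like: word v1 v2 ... vn
    """
    words = set()
    word_to_vec_map = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word = parts[0]
            words.add(word)
            word_to_vec_map[word] = np.array(parts[1:], dtype=np.float64)
    return words, word_to_vec_map


# Demo on a made-up 3-dimensional file (not real GloVe data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
    demo_path = tmp.name

demo_words, demo_map = load_glove_sketch(demo_path)
os.remove(demo_path)
print(sorted(demo_words), demo_map["dog"].shape)
```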

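You've seen that one-hot vectors do not do a good job of capturing which words are similar. GloVe vectors provide much more useful information about the meaning of individual words. Let's now see how you can use GloVe vectors to decide how similar two words are.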
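
To make the one-hot limitation concrete, here is a minimal sketch with a made-up three-word vocabulary. The dot product between any two distinct one-hot vectors is zero, so every pair of different words looks equally unrelated.

```python
import numpy as np

# Toy vocabulary chosen for illustration only.
vocab = ["king", "queen", "apple"]
identity = np.eye(len(vocab))
one_hot = {w: identity[i] for i, w in enumerate(vocab)}

# Distinct one-hot vectors are always orthogonal.
king_queen = float(np.dot(one_hot["king"], one_hot["queen"]))
king_apple = float(np.dot(one_hot["king"], one_hot["apple"]))
print(king_queen, king_apple)  # 0.0 0.0
```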

1 - Cosine similarity

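To measure how similar two words are, we need a way to measure the degree of similarity between the two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows: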

$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\|u\|_2 \|v\|_2} = \cos(\theta) \tag{1}$

Figure 1: The cosine of the angle between two vectors is a measure of how similar they are

Exercise: Implement the function cosine_similarity() to evaluate the similarity between word vectors.


Reminder: The norm of $u$ is defined as $\|u\|_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

# GRADED FUNCTION: cosine_similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v
        
    Arguments:
        u -- a word vector of shape (n,)          
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    
    distance = 0.0
    
    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.linalg.norm(u)
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.linalg.norm(v)
    
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###
    
    return cosine_similarity

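After you get the correct expected output, feel free to modify the inputs and measure the cosine similarity between other pairs of words! Playing around with the cosine similarity of other inputs will give you a better sense of how word vectors behave.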
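
Before trying real word vectors, it can help to check formula (1) on toy inputs whose answer is known. The sketch below uses a stand-in function, cosine_similarity_demo (a hypothetical name, so it does not overwrite the graded function), and made-up 2-dimensional vectors.

```python
import numpy as np


def cosine_similarity_demo(u, v):
    """Formula (1): dot product divided by the product of the L2 norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


u = np.array([1.0, 2.0])

same_direction = cosine_similarity_demo(u, 3 * u)                 # angle 0
orthogonal = cosine_similarity_demo(u, np.array([-2.0, 1.0]))     # angle 90 degrees
opposite = cosine_similarity_demo(u, -u)                          # angle 180 degrees

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, perpendicular vectors score 0, and vectors pointing in opposite directions score close to -1, regardless of their lengths.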

2 - Word analogy task

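In the word analogy task, we complete the sentence "a is to b as c is to ____". An example is "man is to woman as king is to queen". In detail, we are trying to find a word d such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner: $e_b - e_a \approx e_d - e_c$. We will measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity.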

Exercise: Complete the code below to be able to perform word analogies!

# GRADED FUNCTION: complete_analogy

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____. 
    
    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors. 
    
    Returns:
    best_word --  the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    ### START CODE HERE ###
    # Get the word embeddings v_a, v_b and v_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###
    
    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:        
        # to avoid best_word being one of the input words, pass on them.
        if w in [word_a, word_b, word_c] :
            continue
        
        ### START CODE HERE ###
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)  (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        
        # If the cosine_sim is more than the max_cosine_sim seen so far,
            # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###
        
    return best_word

Run the cell below to test your code. This may take 1-2 minutes.

triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print ('{} -> {} :: {} -> {}'.format( *triad, complete_analogy(*triad,word_to_vec_map)))

#####################################
italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger

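Once you get the correct expected output, feel free to modify the input cell above to test your own analogies. Try to find some other analogy pairs that work, but also find some where the algorithm doesn't give the right answer: for example, you can try small -> smaller as big -> ?.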
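
The loop in complete_analogy calls cosine_similarity once per vocabulary word in Python, which is why the test cell takes a minute or two. As an optional sketch (not part of the assignment), the same search can be expressed with NumPy array operations. The function name complete_analogy_vectorized and the toy 2-dimensional embeddings are assumptions made for illustration. Candidates whose difference vector has zero norm are skipped.

```python
import numpy as np


def complete_analogy_vectorized(word_a, word_b, word_c, word_to_vec_map):
    """Hypothetical vectorized variant of the analogy search."""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    words = list(word_to_vec_map.keys())
    E = np.array([word_to_vec_map[w] for w in words])   # shape (N, n)

    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    target = e_b - e_a                                   # e_b - e_a
    diffs = E - e_c                                      # e_w - e_c for every w

    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = diffs @ target / norms                    # cosine similarities

    # Never return an input word or a candidate with undefined similarity.
    excluded = {word_a, word_b, word_c}
    for i, w in enumerate(words):
        if w in excluded or not np.isfinite(sims[i]):
            sims[i] = -np.inf
    return words[int(np.argmax(sims))]


# Made-up embeddings, not GloVe values.
toy_map = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, -2.0]),
}
toy_answer = complete_analogy_vectorized("man", "woman", "king", toy_map)
print(toy_answer)  # queen
```

On the real GloVe vocabulary this should pick the same word as the loop whenever the best match is unique, but it has not been checked against the graded function here.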

Congratulations!

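You've come to the end of this assignment. Here are the main points you should remember:

Cosine similarity is a good way to compare the similarity between pairs of word vectors. (Though L2 distance works too.)
For NLP applications, using a pre-trained set of word vectors from the internet is often a good way to get started.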
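
One difference between the two measures mentioned above is worth seeing once: cosine similarity depends only on direction, while L2 distance also depends on length. A small sketch with made-up vectors:

```python
import numpy as np

u = np.array([1.0, 1.0])
v = 10 * u  # same direction as u, ten times longer

cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
l2_uv = np.linalg.norm(u - v)

print(cos_uv)  # close to 1: identical direction
print(l2_uv)   # about 12.73: far apart in Euclidean terms
```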

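Even though you have finished the graded portions, we recommend you take a look at the rest of this notebook too.

Congratulations on finishing the graded portions of this notebook!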

3 - Debiasing word vectors (OPTIONAL/UNGRADED)

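In the following exercise, you will examine gender biases that can be reflected in a word embedding, and explore algorithms for reducing the bias. In addition to learning about the topic of debiasing, this exercise will also help hone your intuition about what word vectors are doing. This section involves a bit of linear algebra, though you can probably complete it even without being an expert in linear algebra, and we encourage you to give it a shot. This portion of the notebook is optional and is not graded.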

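Let's first see how the GloVe word embeddings relate to gender. You will first compute a vector $g = e_{woman} - e_{man}$, where $e_{woman}$ represents the word vector corresponding to the word woman, and $e_{man}$ corresponds to the word vector for the word man. The resulting vector $g$ roughly encodes the concept of "gender". (You might get a more accurate representation if you compute $g_1 = e_{mother} - e_{father}$, $g_2 = e_{girl} - e_{boy}$, etc. and average over them. But just using $e_{woman} - e_{man}$ will give good enough results for now.)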

g = word_to_vec_map['woman'] - word_to_vec_map['man']
print(g)

[-0.087144    0.2182     -0.40986    -0.03922    -0.1032      0.94165
 -0.06042     0.32988     0.46144    -0.35962     0.31102    -0.86824
  0.96006     0.01073     0.24337     0.08193    -1.02722    -0.21122
  0.695044   -0.00222     0.29106     0.5053     -0.099454    0.40445
  0.30181     0.1355     -0.0606     -0.07131    -0.19245    -0.06115
 -0.3204      0.07165    -0.13337    -0.25068714 -0.14293    -0.224957
 -0.149       0.048882    0.12191    -0.27362    -0.165476   -0.20426
  0.54376    -0.271425   -0.10245    -0.32108     0.2516     -0.33455
 -0.04371     0.01258   ]
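
The averaging idea mentioned in the parenthetical above could be sketched as follows. The helper name average_direction and the toy 2-dimensional vectors are assumptions for illustration. With the real embeddings you would pass word_to_vec_map instead of toy_vectors, assuming all six words are in the vocabulary.

```python
import numpy as np


def average_direction(pairs, word_to_vec_map):
    """Hypothetical helper: mean of e_first - e_second over the given pairs."""
    diffs = [word_to_vec_map[first] - word_to_vec_map[second]
             for first, second in pairs]
    return np.mean(diffs, axis=0)


# Made-up embeddings, not GloVe values.
toy_vectors = {
    "woman": np.array([1.0, 1.0]), "man": np.array([1.0, 0.0]),
    "mother": np.array([2.0, 1.5]), "father": np.array([2.0, 0.5]),
    "girl": np.array([0.5, 2.0]), "boy": np.array([0.2, 1.0]),
}
gender_pairs = [("woman", "man"), ("mother", "father"), ("girl", "boy")]

g_avg = average_direction(gender_pairs, toy_vectors)
print(g_avg)
```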

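Now, you will consider the cosine similarity of different words with $g$. Consider what a positive value of similarity means compared with a negative cosine similarity.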

print ('List of names and their similarities with constructed vector:')

# girls and boys name
name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

List of names and their similarities with constructed vector:
john -0.23163356146
marie 0.315597935396
sophie 0.318687898594
ronaldo -0.312447968503
priya 0.17632041839
rahul -0.169154710392
danielle 0.243932992163
reza -0.079304296722
katy 0.283106865957
yasmin 0.233138577679

As you can see, female first names tend to have a positive cosine similarity with our constructed vector $g$, while male first names tend to have a negative cosine similarity. This is not surprising, and the result seems acceptable.

But let's try with some other words.

print('Other words and their similarities:')
word_list = ['lipstick', 'guns', 'science', 'arts', 'literature', 'warrior','doctor', 'tree', 'receptionist', 
             'technology',  'fashion', 'teacher', 'engineer', 'pilot', 'computer', 'singer']
for w in word_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

Other words and their similarities:
lipstick 0.276919162564
guns -0.18884855679
science -0.0608290654093
arts 0.00818931238588
literature 0.0647250443346
warrior -0.209201646411
doctor 0.118952894109
tree -0.0708939917548
receptionist 0.330779417506
technology -0.131937324476
fashion 0.0356389462577
teacher 0.179209234318
engineer -0.0803928049452
pilot 0.00107644989919
computer -0.103303588739
singer 0.185005181365

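Do you notice anything surprising? It is astonishing how these results reflect certain unhealthy gender stereotypes. For example, "computer" is closer to "man" while "literature" is closer to "woman".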

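We'll see below how to reduce the bias of these vectors, using an algorithm due to Bolukbasi et al., 2016. Note that some word pairs such as "actor"/"actress" or "grandmother"/"grandfather" should remain gender specific, while other words such as "receptionist" or "technology" should be neutralized, i.e. not be gender-related. You will have to treat these two types of words differently when debiasing.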