Machine Learning Notes - Text Similarity Analysis with Pre-trained Word Embeddings

1. Overview

We use a pre-trained word embedding model (Word2Vec) to explore how embeddings let us examine similarities and relationships between words, for example finding a country's capital or a company's flagship product. Finally, we use t-SNE to project the high-dimensional space onto a 2D plot.

We start by downloading the pre-trained model built from Google News and decompressing it.

The download address is as follows:
https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

2. Loading the Word Vectors

import gensim
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE  # used in the clustering section below
from IPython.core.pylabtools import figsize
figsize(10, 10)

# Use a raw string so backslashes in the Windows path are not treated as escapes
model = gensim.models.KeyedVectors.load_word2vec_format(
    r'D:\GoogleNews-vectors-negative300.bin', binary=True)

3. Finding Similar Words

Let's use the statement below to see what is most similar to espresso:

model.most_similar(positive=['espresso'])

The output is as follows:

[('cappuccino', 0.6888187527656555),
 ('mocha', 0.6686208248138428),
 ('coffee', 0.6616825461387634),
 ('latte', 0.653675377368927),
 ('caramel_macchiato', 0.6491268277168274),
 ('ristretto', 0.6485546231269836),
 ('espressos', 0.6438629031181335),
 ('macchiato', 0.6428250074386597),
 ('chai_latte', 0.6308028101921082),
 ('espresso_cappuccino', 0.6280542612075806)]
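Under the hood, most_similar ranks every vocabulary word by cosine similarity to the query vector. A minimal sketch of that computation with NumPy, using made-up 3-dimensional toy vectors (the real model uses 300 dimensions):

```python
import numpy as np

# Hypothetical toy vocabulary; the real model maps each word to a 300-d vector
vocab = {
    'espresso':   np.array([0.9, 0.8, 0.1]),
    'cappuccino': np.array([0.8, 0.9, 0.2]),
    'vodka':      np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    # Cosine similarity: dot product of the vectors divided by their lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, topn=2):
    # Score every other word against the query and return the best matches
    query = vocab[word]
    scores = [(w, cosine(query, v)) for w, v in vocab.items() if w != word]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

print(most_similar('espresso'))  # cappuccino ranks above vodka
```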

4. Defining a Relationship Function

Define a lookup function: man is to woman as king is to what?

def A_is_to_B_as_C_is_to(a, b, c, topn=1):
    # Accept either a single word or a list of words for each argument
    a, b, c = map(lambda x: x if isinstance(x, list) else [x], (a, b, c))
    # Solve the analogy: find the word(s) closest to b - a + c
    res = model.most_similar(positive=b + c, negative=a, topn=topn)
    if len(res):
        if topn == 1:
            return res[0][0]
        return [x[0] for x in res]
    return None

Calling the function returns queen.

A_is_to_B_as_C_is_to('man', 'woman', 'king')
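The function solves the analogy by vector arithmetic: it looks for the word whose vector is closest to b - a + c (here, woman - man + king). A self-contained illustration with hypothetical 2-d toy vectors, one axis for "royalty" and one for "gender":

```python
import numpy as np

# Hypothetical toy embeddings: axis 0 = royalty, axis 1 = gender
toy = {
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
}

# woman - man + king points at the "royal female" region of the space
target = toy['woman'] - toy['man'] + toy['king']

# Pick the nearest word, excluding the three inputs, as most_similar does
candidates = {w: v for w, v in toy.items() if w not in ('man', 'woman', 'king')}
best = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))
print(best)  # queen
```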

If Berlin is the capital of Germany, what is the capital of Italy?

for country in 'Italy', 'France', 'India', 'China':
    print('%s is the capital of %s' % 
          (A_is_to_B_as_C_is_to('Germany', 'Berlin', country), country))

The output is as follows:

Rome is the capital of Italy
Paris is the capital of France
Delhi is the capital of India
Beijing is the capital of China 

Or we can do the same for the important products of specific companies. Here we seed the product equation with two products: iPhone for Apple and Starbucks_coffee for Starbucks. Note that in the embedding model, digits are replaced by #:

for company in 'Google', 'IBM', 'Boeing', 'Microsoft', 'Samsung':
    products = A_is_to_B_as_C_is_to(
        ['Starbucks', 'Apple'], 
        ['Starbucks_coffee', 'iPhone'], 
        company, topn=3)
    print('%s -> %s' % 
          (company, ', '.join(products)))

The output is as follows:

Google -> personalized_homepage, app, Gmail
IBM -> DB2, WebSphere_Portal, Tamino_XML_Server
Boeing -> Dreamliner, airframe, aircraft
Microsoft -> Windows_Mobile, SyncMate, Windows
Samsung -> MM_A###, handset, Samsung_SCH_B###

5. Clustering with t-SNE

Let's do some clustering by picking items from three categories: beverages, countries, and sports:

beverages = ['espresso', 'beer', 'vodka', 'wine', 'cola', 'tea']
countries = ['Italy', 'Germany', 'Russia', 'France', 'USA', 'India']
sports = ['soccer', 'handball', 'hockey', 'cycling', 'basketball', 'cricket']

items = beverages + countries + sports
len(items)

And look up their vectors:

item_vectors = [(item, model[item]) 
                    for item in items
                    if item in model]
len(item_vectors)

Now cluster with t-SNE:

from sklearn.manifold import TSNE

vectors = np.asarray([x[1] for x in item_vectors])
# Normalize each vector to unit length so t-SNE compares directions, not magnitudes
lengths = np.linalg.norm(vectors, axis=1)
norm_vectors = (vectors.T / lengths).T

tsne = TSNE(n_components=2, perplexity=10, verbose=2).fit_transform(norm_vectors)

The output is as follows:

[t-SNE] Computing 17 nearest neighbors...
[t-SNE] Indexed 18 samples in 0.000s...
[t-SNE] Computed neighbors for 18 samples in 0.052s...
[t-SNE] Computed conditional probabilities for sample 18 / 18
[t-SNE] Mean sigma: 0.581543
[t-SNE] Computed conditional probabilities in 0.003s
[t-SNE] Iteration 50: error = 70.7682343, gradient norm = 0.2115109 (50 iterations in 0.017s)
[t-SNE] Iteration 100: error = 50.6931763, gradient norm = 0.0493703 (50 iterations in 0.012s)
[t-SNE] Iteration 150: error = 72.4811478, gradient norm = 0.2386879 (50 iterations in 0.011s)
[t-SNE] Iteration 200: error = 61.9339905, gradient norm = 0.1581545 (50 iterations in 0.012s)
[t-SNE] Iteration 250: error = 64.9977417, gradient norm = 0.1168961 (50 iterations in 0.011s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 64.997742
[t-SNE] Iteration 300: error = 0.9344296, gradient norm = 0.0008748 (50 iterations in 0.013s)
[t-SNE] Iteration 350: error = 0.7699002, gradient norm = 0.0005786 (50 iterations in 0.011s)
[t-SNE] Iteration 400: error = 0.6187575, gradient norm = 0.0004745 (50 iterations in 0.012s)
[t-SNE] Iteration 450: error = 0.5282804, gradient norm = 0.0003208 (50 iterations in 0.011s)
[t-SNE] Iteration 500: error = 0.4986507, gradient norm = 0.0001888 (50 iterations in 0.011s)
[t-SNE] Iteration 550: error = 0.3673418, gradient norm = 0.0004975 (50 iterations in 0.012s)
[t-SNE] Iteration 600: error = 0.2507115, gradient norm = 0.0007413 (50 iterations in 0.011s)
[t-SNE] Iteration 650: error = 0.1724875, gradient norm = 0.0002562 (50 iterations in 0.011s)
[t-SNE] Iteration 700: error = 0.1552246, gradient norm = 0.0001649 (50 iterations in 0.012s)
[t-SNE] Iteration 750: error = 0.1389877, gradient norm = 0.0000916 (50 iterations in 0.012s)
[t-SNE] Iteration 800: error = 0.1303239, gradient norm = 0.0000812 (50 iterations in 0.011s)
[t-SNE] Iteration 850: error = 0.1220449, gradient norm = 0.0000533 (50 iterations in 0.010s)
[t-SNE] Iteration 900: error = 0.1205731, gradient norm = 0.0000198 (50 iterations in 0.011s)
[t-SNE] Iteration 950: error = 0.1201564, gradient norm = 0.0000153 (50 iterations in 0.011s)
[t-SNE] Iteration 1000: error = 0.1198082, gradient norm = 0.0000120 (50 iterations in 0.012s)
[t-SNE] KL divergence after 1000 iterations: 0.119808
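The normalization step above divides each row vector by its Euclidean length, so t-SNE works with unit vectors. A quick sanity check of that transpose-and-divide idiom on random data of the same shape:

```python
import numpy as np

# Random stand-in for the 18 item vectors of dimension 300
rng = np.random.default_rng(0)
vectors = rng.normal(size=(18, 300))

# Divide each row by its norm; the transpose trick broadcasts per row
lengths = np.linalg.norm(vectors, axis=1)
norm_vectors = (vectors.T / lengths).T

# Every row should now have unit length
print(np.allclose(np.linalg.norm(norm_vectors, axis=1), 1.0))  # True
```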


Display the result with matplotlib.
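The plotting snippet itself is not shown above; here is a minimal sketch. The items list and tsne coordinates below are synthetic stand-ins — in the notebook, tsne comes from the fit_transform call above and the labels come from item_vectors:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Stand-ins for the real data from the previous steps
items = ['espresso', 'beer', 'Italy', 'Germany', 'soccer', 'hockey']
tsne = np.random.default_rng(1).normal(size=(len(items), 2))

x = tsne[:, 0]
y = tsne[:, 1]

fig, ax = plt.subplots()
ax.scatter(x, y)
# Label each point with its word so the clusters are readable
for item, x1, y1 in zip(items, x, y):
    ax.annotate(item, (x1, y1), size=14)
plt.show()
```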

As you can see, the countries, sports, and beverages each form their own small clusters. Cricket and India arguably attract each other, though the relationship among wine, France, Italy, and espresso is perhaps less clear.

Reposted from blog.csdn.net/bashendixie5/article/details/123645713