TensorFlow 2.0 Learning (18): Embedding Preprocessing - Building the Vocabulary Index

Embedding

  • One-hot word vectors:
    Example: if a vocabulary contains 10 words, the one-hot encoding of its second word is [0, 1, 0, 0, 0, …]; this is a sparse vector.
  • Dense embedding:
    Example: a word in the vocabulary is represented by a dense vector such as [2.9, 1.1, -1.5, …]. These word vectors are learned during model training, and only after training do they carry real representational meaning (see the sketch after this list).
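A minimal sketch contrasting the two representations (the vocabulary size of 10 and the embedding dimension of 4 are arbitrary choices for illustration):
import tensorflow as tf

# One-hot: the second word (id 1) of a 10-word vocabulary -> a sparse 10-dimensional vector
print(tf.one_hot(1, depth=10).numpy())
# expected: [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]

# Dense embedding: each id is mapped to a small trainable vector (random before training)
embedding_layer = tf.keras.layers.Embedding(input_dim=10, output_dim=4)
print(embedding_layer(tf.constant([1])).numpy())
# e.g. [[ 0.03 -0.01  0.04 -0.02]] -- values become meaningful only after training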

Loading the Dataset and Building the Vocabulary Index

  • Import packages
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import sklearn
import pandas as pd
import os
import sys
import time
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(sys.version_info)
for module in mpl, np, pd, sklearn, tf, keras:
    print(module.__name__, module.__version__)
2.1.0
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
matplotlib 3.1.1
numpy 1.16.5
pandas 0.25.1
sklearn 0.21.3
tensorflow 2.1.0
tensorflow_core.python.keras.api._v2.keras 2.2.4-tf
  • Load and inspect the dataset
# The IMDB dataset shipped with keras (downloaded on first use)
imdb = keras.datasets.imdb
# Vocabulary size: how many distinct words to keep
vocab_size = 10000
# Offset applied to every word id
index_from = 3
# Words are ranked by frequency; only the top 10,000 are kept and
# everything rarer is mapped to the out-of-vocabulary id
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=vocab_size, index_from=index_from)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 4s 0us/step
# Print the first sample and its label
print(train_data[0], train_labels[0])
# Print the shapes: each review is a variable-length Python list of ids,
# so the arrays have shape (25000,)
print(train_data.shape, train_labels.shape)
# Print the lengths of the first and second reviews
print(len(train_data[0]), len(train_data[1]))
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32] 1
(25000,) (25000,)
218 189
print(test_data.shape, test_labels.shape)
(25000,) (25000,)
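As an optional sanity check (not in the original post), every id should now be smaller than vocab_size, since words outside the top 10,000 were replaced by the out-of-vocabulary id:
# Largest word id in the training data; expected to be 9999 (= vocab_size - 1)
print(max(max(review) for review in train_data))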
  • Inspect the vocabulary index and the corresponding words
# Load the raw word-to-id table
word_index = imdb.get_word_index()
print(len(word_index))
print(list(word_index.items())[:5]) 
# print(word_index)
88584
[('fawn', 34701), ('tsukino', 52006), ('nunnery', 52007), ('sonja', 16816), ('vani', 63951)]
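The raw index is 1-based and ordered by word frequency, so the most common word should map to id 1 (a quick check, assuming the standard IMDB index shipped with keras):
# 'the' is the most frequent word in the corpus, so its raw id should be 1
print(word_index['the'])
# expected: 1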
  • Shift the ids to reserve slots for special tokens
# Since index_from = 3 was used when loading, shift every id in word_index by 3 to match
word_index = {k: (v + 3) for k, v in word_index.items()}
  • Add the special tokens
# After the shift, ids 0-3 are free and are assigned to special tokens
word_index['<PAD>'] = 0 
word_index['<START>'] = 1 
word_index['<UNK>'] = 2
word_index['<END>'] = 3 
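These reserved ids are what later preprocessing relies on. As a sketch of a typical follow-up step (not covered in this post; the cutoff length of 500 is an arbitrary choice), <PAD> = 0 is the value used when padding every review to a common length:
max_length = 500   # assumed cutoff, for illustration only
train_data_padded = keras.preprocessing.sequence.pad_sequences(
    train_data,
    value=word_index['<PAD>'],   # pad with id 0
    padding='post',              # append padding at the end of each review
    maxlen=max_length)
print(train_data_padded.shape)
# expected: (25000, 500)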
  • Decode the first training sample back into text
# Build the reverse mapping: id -> word
reverse_word_index = dict(
    [(value, key) for key, value in word_index.items()])
# Convert a list of word ids back into readable text
def decode_review(text_ids):
    return " ".join(
        [reverse_word_index.get(word_id, "<UNK>") for word_id in text_ids])

decode_review(train_data[0])
"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
Reposted from blog.csdn.net/Smile_mingm/article/details/104616668