NLP | Bag-of-Words Model


The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order, but keeping multiplicity (word counts).

The bag-of-words model is commonly used in document classification, where the occurrence (frequency) of each word serves as a feature for training a classifier.

(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.

NLP pipelines usually cannot process a whole paragraph or document at once, so the first step is typically sentence splitting and tokenization. Here we only have single sentences, so tokenization is enough. For English sentences you can use the word_tokenize function from NLTK; for Chinese sentences you can use the jieba module (see the short sketch below). So the first step is tokenization.
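For Chinese text, a minimal sketch of tokenization with jieba might look like the following (the example sentence here is made up purely for illustration):

import jieba

print(list(jieba.cut("我喜欢看电影")))  # e.g. ['我', '喜欢', '看', '电影']

Returning to the two English sentences, NLTK's word_tokenize is used: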

sent1 = "John likes to watch movies. Mary likes movies too."
sent2 = "John also likes to watch football games."

from nltk import word_tokenize
sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]

print(texts)

Output:

[['John', 'likes', 'to', 'watch', 'movies', '.', 'Mary', 'likes', 'movies', 'too', '.'], ['John', 'also', 'likes', 'to', 'watch', 'football', 'games', '.']]

Next, build the corpus, i.e. the set of all words and punctuation marks that appear in the sentences.

words = []

for text in texts:
    words += text
    
corpus = set(words)
print(corpus)

Output:

{'games', 'movies', '.', 'also', 'football', 'to', 'watch', 'Mary', 'John', 'too', 'likes'}

The corpus contains 11 elements (words and punctuation marks). Next, map each element of the corpus to an integer index, which makes the subsequent vector representation of the sentences easier.

corpus_dict = dict(zip(corpus, range(len(corpus))))
print(corpus_dict)

Output:

{'too': 9, 'football': 4, 'movies': 1, 'likes': 10, 'watch': 6, 'to': 5, 'also': 3, 'Mary': 7, 'John': 8, '.': 2, 'games': 0}

Now build the vector representation of each sentence. Rather than simply using 0/1 to mark whether a word or punctuation mark appears, we use its frequency in the sentence as the corresponding numeric value. Combined with the corpus dictionary above, the code for the sentence vectors is as follows.

# Build the vector representation of a sentence
def vector(text, corpus_dict):
    vec = []
    for key in corpus_dict.keys():
        if key in text:
            # (token index, frequency of the token in the sentence)
            vec.append((corpus_dict[key], text.count(key)))
        else:
            vec.append((corpus_dict[key], 0))
    # Sort by token index so that vectors of different sentences are aligned
    vec = sorted(vec, key=lambda x: x[0])
    return vec

vec1 = vector(texts[0], corpus_dict)
vec2 = vector(texts[1], corpus_dict)
print(vec1)
print(vec2)

Output:

[(0, 0), (1, 2), (2, 2), (3, 0), (4, 0), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2)]
[(0, 1), (1, 0), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 0), (8, 1), (9, 0), (10, 1)]

Here, (10, 2) means that the word with index 10, 'likes', appears twice in the first sentence.

In NLP, once we have vector representations of two sentences, cosine similarity is usually chosen as their similarity measure; the cosine similarity of two vectors is simply the cosine of the angle between them.
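Concretely, for two vectors a and b the cosine similarity is

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}

For the two sentence vectors above, the dot product is 7 and the squared norms are 17 and 8, so the similarity is 7 / sqrt(17 × 8) ≈ 0.6002, matching the result computed by the code below.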

import math

def similarity(vec1, vec2):
    # Accumulate the dot product and the squared norms of the two vectors
    inner = 0
    square_vec1 = 0
    square_vec2 = 0
    for tup1, tup2 in zip(vec1, vec2):
        inner += tup1[1] * tup2[1]
        square_vec1 += tup1[1] ** 2
        square_vec2 += tup2[1] ** 2

    # Cosine similarity = dot product divided by the product of the norms
    return inner / math.sqrt(square_vec1 * square_vec2)

cosine = similarity(vec1, vec2)
print('Cosine similarity between the two sentences: %.4f' % cosine)

Output:

Cosine similarity between the two sentences: 0.6002

Implementing the bag-of-words model with gensim

from gensim import corpora
from gensim.similarities import Similarity

# Build the dictionary (token -> integer id mapping)
dictionary = corpora.Dictionary(texts)

# doc2bow converts each tokenized sentence into its bag-of-words (id, count) representation
corpus = [dictionary.doc2bow(text) for text in texts]
similarity = Similarity('-Similarity-index', corpus, num_features=len(dictionary))
print(similarity)

# Compute the similarity of a query sentence against the indexed corpus
new_sentence = sent1
test_corpus_1 = dictionary.doc2bow(word_tokenize(new_sentence))

# similarity[test_corpus_1] returns the similarities to all indexed documents; index 1 is sent2
cosine_sim = similarity[test_corpus_1][1]
print("Similarity between the two sentences computed with gensim: %.4f" % cosine_sim)

Output:

Similarity index with 2 documents in 0 shards (stored under -Similarity-index)
Similarity between the two sentences computed with gensim: 0.6002

Sometimes a 0/1 (binary) vector representation is used instead: if the word at a given index appears in the sentence, the value at that index is set to 1, regardless of how many times it occurs. After a short sketch of this idea, a TensorFlow example applies this kind of representation to spam SMS classification.
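As a quick illustration, here is a minimal sketch of such a binary representation, reusing the corpus_dict and tokenized texts built earlier (the helper name binary_vector is mine, not part of the original code):

def binary_vector(text, corpus_dict):
    # One slot per corpus token; 1 marks presence, regardless of frequency
    vec = [0] * len(corpus_dict)
    for token in text:
        if token in corpus_dict:
            vec[corpus_dict[token]] = 1
    return vec

print(binary_vector(texts[0], corpus_dict))
print(binary_vector(texts[1], corpus_dict))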

# Imports
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import numpy as np
import csv
import string
import requests
import io
from zipfile import ZipFile
from tensorflow.contrib import learn        # built-in vocabulary processor / tokenizer
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()

if not os.path.exists('temp'):
    os.makedirs('temp')
    
# Load the data from a local cache if it exists; otherwise download it and save it
save_file_name = os.path.join('temp', 'temp_spam_data.csv')
if os.path.isfile(save_file_name):
    text_data = []
    with open(save_file_name, 'r') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            text_data.append(row)
else:
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x)>=1]
    with open(save_file_name, 'w') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)
print(text_data[0:5])
texts = [x[1] for x in text_data if len(x)>0]
target = [x[0] for x in text_data if len(x)>0]
target = [1 if x=='spam' else 0 for x in target]

# To reduce the vocabulary size, normalize the texts: lowercase them and remove punctuation and digits
texts = [x.lower() for x in texts]
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
texts = [' '.join(x.split()) for x in texts]

# Choose the maximum sentence size from a histogram of text lengths and take a cutoff (25)
text_lengths = [len(x.split()) for x in texts]
text_lengths = [x for x in text_lengths if x < 50]
plt.hist(text_lengths, bins=25)
plt.title('Histogram of # of Words in Texts')
sentence_size = 25  # 30 or 40 would also work
min_word_freq = 3

# TensorFlow's built-in vocabulary processor
vocab_processor = learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)
vocab_processor.fit(texts)  # build and freeze the vocabulary (applies min_frequency)
transformed_texts = np.array([x for x in vocab_processor.transform(texts)])
embedding_size = len(np.unique(transformed_texts))  # unique() keeps the distinct values in the array

# Split the data into training and test sets (80/20)
train_indices = np.random.choice(len(texts), round(len(texts)*0.8), replace=False)
test_indices = np.array(list(set(range(len(texts))) - set(train_indices)))
texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]
texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]
target_train = [x for ix, x in enumerate(target) if ix in train_indices]
target_test = [x for ix, x in enumerate(target) if ix in test_indices]

# Declare the 'embedding' matrix as the identity matrix: each word index maps to a one-hot row
identity_mat = tf.diag(tf.ones(shape=[embedding_size]))

# Logistic regression will be used for the final prediction, so declare its variables and placeholders
A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32)

# Use embedding_lookup to map each word index in the sentence to a one-hot row of the identity matrix, then sum the rows to obtain the sentence's bag-of-words vector
# tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)
x_embed = tf.nn.embedding_lookup(identity_mat, x_data)
x_col_sums = tf.reduce_sum(x_embed, 0)

# Logistic regression model
x_col_sums_2D = tf.expand_dims(x_col_sums, 0)
model_output = tf.add(tf.matmul(x_col_sums_2D, A), b)

# Loss function (sigmoid cross-entropy)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))

# Prediction
prediction = tf.sigmoid(model_output)

# Optimizer (gradient descent)
my_opt = tf.train.GradientDescentOptimizer(0.001)
train_step = my_opt.minimize(loss)

# Initialize variables
init = tf.global_variables_initializer()
sess.run(init)

# train
loss_vec = []
train_acc_all = []
train_acc_avg = []
for ix, t in enumerate(vocab_processor.transform(texts_train)):  # reuse the vocabulary fitted on all texts
    y_data = [[target_train[ix]]]
    sess.run(train_step, feed_dict={x_data:t, y_target:y_data})
    temp_loss = sess.run(loss, feed_dict={x_data:t, y_target:y_data})
    loss_vec.append(temp_loss)
    if (ix+1) % 10 == 0:
        print('Training Observation #{}, Loss = {}'.format(ix+1, temp_loss))
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})
    train_acc_temp = target_train[ix]==np.round(temp_pred)
    train_acc_all.append(train_acc_temp)
    if len(train_acc_all) >= 50:
        train_acc_avg.append(np.mean(train_acc_all[-50:]))

# Testing
test_acc_all = []
for ix, t in enumerate(vocab_processor.transform(texts_test)):  # same vocabulary as in training
    y_data = [[target_test[ix]]]
    
    if (ix + 1) % 50 == 0:
        print('Test Observation #{}'.format(str(ix+1)))
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})
    test_acc_temp = target_test[ix]==np.round(temp_pred)
    test_acc_all.append(test_acc_temp)
print('\nOverall Test Accuracy: {}'.format(np.mean(test_acc_all)))

# Plot
plt.plot(range(len(train_acc_avg)), train_acc_avg, 'k-', label='Train Accuracy')
plt.title('Avg Training Acc Over Past 50 Generations')
plt.xlabel('Generation')
plt.ylabel('Training Accuracy')
plt.show()
