TensorFlow and NLP (Bag-of-Words Model: SMS Spam Detection)

Introduction

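I didn't update the TensorFlow series yesterday, so I've slipped the schedule a bit. Lately I've been interviewing for algorithm engineer positions at some smaller companies. Somewhat embarrassingly, the head of the algorithm team at 西山居 (Seasun) concluded yesterday that I was rather weak, and that the text we work with at school is too far removed from what companies deal with. That stung for a while. Honestly, Chinese text processing still has plenty of open problems, and I took a hard look at my own practical skills. It has also made me more determined to finish this blog series.
Since the topic is NLP, let me first share an interview question from a big company: what are the shortcomings of RNNs, and which of them does LSTM address? I'll leave that as an open question for now and come back to the pros and cons when we implement RNNs and LSTMs. Enough rambling; let's get started with today's post.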

The Bag-of-Words Model

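Words are the most basic elements of text. In Chinese the basic unit is of course the character, but when processing Chinese we still prefer to split text into words, which is the word segmentation task. If I get the chance to cover some basic NLP tasks later, I'll go through segmentation in detail and walk through some code. Now for today's protagonist, the bag-of-words model. As usual I'll skip the dry theory and let you get a feel for it through code, with only an intuitive introduction here. Taken literally, you throw a pile of words into a bag, which means they are assumed to have no order: each word gets an id as its identity, and they all sit in the bag unordered. It also counts as a kind of word-vector representation; the concrete steps are in the code.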
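
To make the "words in a bag" idea concrete before touching TensorFlow, here is a minimal pure-Python sketch. The helper names `build_vocab` and `bag_of_words` and the sample sentences are made up for illustration and are not part of the post's code.

```python
from collections import Counter

def build_vocab(sentences):
    """Assign each distinct word an integer id in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bag_of_words(sentence, vocab):
    """Return a count vector: position i holds how often word i occurs."""
    counts = Counter(sentence.split())
    vector = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] = n
    return vector

# Hypothetical sample sentences
sentences = ["free prize call now", "call me now now"]
vocab = build_vocab(sentences)

# Word order does not matter: both sentences map to the same vector
vec_a = bag_of_words("now call", vocab)
vec_b = bag_of_words("call now", vocab)
print(vocab)
print(vec_a, vec_b)
```

Because only counts are kept, "now call" and "call now" are indistinguishable, which is exactly the "no order" assumption described above.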

Task

Now that the bag-of-words model has been introduced, let's look at the task for today.

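Dataset: sms+spam+collection, which I download directly in the code
Task: SMS spam detection
Model: the logistic regression model we covered earlier, a typical binary classification task
Features: one-hot encoding + word frequency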

Code

Downloading the dataset

import csv
import io
import os
from zipfile import ZipFile

import requests

save_file_name = os.path.join('C:\\Users\\Dave', 'temp_spam_data.csv')  # path of the cached CSV
if os.path.isfile(save_file_name):
    # A cached copy exists: load it
    text_data = []
    with open(save_file_name, 'r', newline='') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            text_data.append(row)
else:
    # Download the zip archive and read the data file from it
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    # Format Data
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x)>=1]

    # And write to csv (newline='' avoids blank rows on Windows)
    with open(save_file_name, 'w', newline='') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)

The processed CSV file

It has two columns: column 0 is the label and column 1 is the text. We map ham and spam to numbers to make the binary classification easier later on.
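
The step that splits the rows into texts and numeric labels is not shown in the post, so here is a minimal sketch of how it might look. The mapping spam -> 1, ham -> 0 is an assumption (the post only says the labels become numbers), and `sample_rows` and `split_rows` are hypothetical names standing in for the real `text_data`.

```python
# Hypothetical rows in the same shape as text_data: [label, text]
sample_rows = [
    ["ham", "see you at lunch"],
    ["spam", "you won a free prize call now"],
    [],  # blank rows can show up when a CSV is read back
]

def split_rows(rows):
    """Return (texts, target), skipping rows that lack both columns."""
    rows = [r for r in rows if len(r) >= 2]
    texts = [r[1] for r in rows]
    # Assumption: spam -> 1, ham -> 0
    target = [1 if r[0] == "spam" else 0 for r in rows]
    return texts, target

texts, target = split_rows(sample_rows)
print(texts)
print(target)
```

With the real data one would call `split_rows(text_data)`; the resulting `texts` and `target` lists are the names the rest of the code relies on.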

Data preprocessing

Remove punctuation, digits, and extra whitespace; in short, strip out everything that isn't a word.

import string

# Normalize text
# Lower case
texts = [x.lower() for x in texts]

# Remove punctuation
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

# Remove numbers
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

# Trim extra whitespace
texts = [' '.join(x.split()) for x in texts]
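
The four normalization steps can also be bundled into one function, which makes it easy to check on a single string what the cleaning actually does. This is only a sketch; the function name `normalize` and the sample message are made up.

```python
import string

def normalize(text):
    """Lower-case, drop punctuation and digits, collapse whitespace."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = "".join(c for c in text if c not in string.digits)
    return " ".join(text.split())

# Hypothetical sample message
cleaned = normalize("WINNER!!  Call 0800 now,   it's FREE")
print(cleaned)
```

Note that dropping punctuation also merges contractions ("it's" becomes "its"), which is acceptable for a bag-of-words model but worth knowing about.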

Count the number of words in each text. This is mainly used to decide the fixed sentence length needed for the bag-of-words model.

import matplotlib.pyplot as plt

# Plot histogram of text lengths
text_lengths = [len(x.split()) for x in texts]
text_lengths = [x for x in text_lengths if x < 50]
plt.hist(text_lengths, bins=25)  # split the lengths into 25 bins
plt.title('Histogram of # of Words in Texts')
plt.show()

Clearly most texts are shorter than 25 words, so we set the maximum sentence length to 25.

from tensorflow.contrib import learn

sentence_size = 25
min_word_freq = 3
# Minimum word frequency is 3; rarer words are dropped when building the vocabulary
vocab_processor = learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)
# tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)
# max_document_length: maximum document length. Longer texts are truncated; shorter ones are padded with 0.
# min_frequency: minimum word frequency; words that occur fewer times are not added to the vocabulary.
# vocabulary: a CategoricalVocabulary object.
# tokenizer_fn: the tokenizer function.

This function is fairly important; for the details, see the documentation for VocabularyProcessor.

# Feed in all the texts to build the vocabulary
vocab_processor.fit_transform(texts)
embedding_size = len(vocab_processor.vocabulary_)
# embedding_size is the vocabulary size, i.e. the word dimension used below (about 2108 words here)

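The whole process exists just to build our vocabulary. The vocabulary plays an extremely important role in NLP tasks, and I'll come back to it in later posts.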
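
Since VocabularyProcessor lives in tf.contrib and is a bit of a black box, the following pure-Python sketch shows roughly what it is described as doing above: drop rare words, map the rest to ids, and pad or truncate every text to a fixed length. It is not the library's implementation. The names `fit_vocab` and `to_ids` are made up, and the exact cutoff (keeping words whose count is at least `min_frequency`) follows the description in this post rather than a check of the library source.

```python
from collections import Counter

def fit_vocab(texts, min_frequency=0):
    """Map words seen at least min_frequency times to ids 1..N.

    Id 0 is reserved for unknown words and padding (an assumption that
    mirrors the "pad with 0" behaviour described above).
    """
    freq = Counter(word for text in texts for word in text.split())
    kept = [w for w, n in freq.items() if n >= min_frequency]
    return {word: i + 1 for i, word in enumerate(kept)}

def to_ids(text, vocab, max_len):
    """Convert text to exactly max_len ids: truncate long texts, pad short ones with 0."""
    ids = [vocab.get(word, 0) for word in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

# Hypothetical sample texts
sample_texts = ["free prize now", "call now", "free call now please"]
vocab = fit_vocab(sample_texts, min_frequency=2)
short_ids = to_ids("free prize now", vocab, max_len=5)
long_ids = to_ids("free call now please free call", vocab, max_len=5)
print(vocab)
print(short_ids, long_ids)
```

In this sketch the dimension of the one-hot vectors would be `len(vocab) + 1`, because id 0 also needs a slot.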

Splitting the dataset

import numpy as np

# Split up data set into train/test (80/20) by index
train_indices = np.random.choice(len(texts), round(len(texts)*0.8), replace=False)
test_indices = np.array(list(set(range(len(texts))) - set(train_indices)))
texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]
texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]
target_train = [x for ix, x in enumerate(target) if ix in train_indices]
target_test = [x for ix, x in enumerate(target) if ix in test_indices]

The split is done by index: 80% of the data for training and 20% for testing.
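
The same index-based split can be sketched with only the standard library, which makes it easy to verify that texts and labels stay aligned. The function name `train_test_split_by_index`, the fixed seed, and the demo data are all illustrative assumptions.

```python
import random

def train_test_split_by_index(items, labels, train_fraction=0.8, seed=0):
    """Shuffle the indices once, then use the same indices for items and labels."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_train = round(len(items) * train_fraction)
    train_idx = sorted(indices[:n_train])
    test_idx = sorted(indices[n_train:])

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical demo data: label i % 2 belongs to text "t<i>"
demo_texts = ["t%d" % i for i in range(10)]
demo_target = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = train_test_split_by_index(demo_texts, demo_target)
print(len(x_train), len(x_test))
```

Picking both lists with one shared index list is what keeps every text paired with its own label.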

Defining the model

import tensorflow as tf

# Setup Index Matrix for one-hot-encoding
identity_mat = tf.diag(tf.ones(shape=[embedding_size]))
#http://blog.csdn.net/lenbow/article/details/52152766
# tf.diag returns a diagonal tensor with the given diagonal values:
# here a 2108*2108 matrix with 1 on the diagonal and 0 everywhere else

# Create variables for logistic regression
A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))  # shape 2108*1
b = tf.Variable(tf.random_normal(shape=[1,1]))

# Initialize placeholders
# x_data has 25 entries; each value is a word's position in the vocabulary
x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32)

# Text-Vocab Embedding
# For each value in x_data, take that row of the 2108*2108 matrix and stack the rows into an embedding matrix
x_embed = tf.nn.embedding_lookup(identity_mat, x_data)
# x_data is a 25-dimensional vector; x_embed is 25*2108 (one-hot encoding)
x_col_sums = tf.reduce_sum(x_embed, 0)
# All rows are summed together, giving a 2108-dimensional vector
#https://www.zhihu.com/question/51325408?from=profile_question_card

# Declare model operations
x_col_sums_2D = tf.expand_dims(x_col_sums, 0)  # expand to 1*2108

model_output = tf.add(tf.matmul(x_col_sums_2D, A), b)

# Declare loss function (Cross Entropy loss)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))

# Prediction operation
prediction = tf.sigmoid(model_output)

# Declare optimizer
my_opt = tf.train.GradientDescentOptimizer(0.001)
train_step = my_opt.minimize(loss)
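
The combination of an identity-matrix lookup and a sum over rows is easier to see with tiny numbers. The sketch below redoes that part of the graph in plain Python lists (vocabulary size 4, sentence length 5). The helper names and the numbers are made up, and this is only an illustration of the shapes, not the TensorFlow code itself.

```python
import math

def one_hot_rows(ids, vocab_size):
    """Pick row i of the identity matrix for every id (what the lookup does)."""
    return [[1.0 if col == i else 0.0 for col in range(vocab_size)] for i in ids]

def column_sums(rows):
    """Sum over the rows, giving one count per vocabulary entry."""
    return [sum(col) for col in zip(*rows)]

def predict(ids, weights, bias):
    """Logistic regression on the count vector: sigmoid(counts . weights + bias)."""
    counts = column_sums(one_hot_rows(ids, len(weights)))
    logit = sum(c * w for c, w in zip(counts, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

ids = [2, 3, 2, 0, 0]  # a "sentence" of length 5, padded with 0
rows = one_hot_rows(ids, 4)   # 5 rows of length 4 (sentence_size x vocab size)
counts = column_sums(rows)    # length 4: how often each id occurs
print(counts)
print(predict(ids, [0.0, 0.0, 0.0, 0.0], 0.0))
```

One side effect is visible here: the padding id 0 is counted as well, so position 0 of the count vector grows as the text gets shorter.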

Training the model

# Create the session and initialize all variables
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

# Start Logistic Regression
print('Starting Training Over {} Sentences.'.format(len(texts_train)))
loss_vec = []
train_acc_all = []
train_acc_avg = []
# The vocabulary was already fitted on all texts, so only transform here
for ix, t in enumerate(vocab_processor.transform(texts_train)):
    y_data = [[target_train[ix]]]


    sess.run(train_step, feed_dict={x_data: t, y_target: y_data})
    temp_loss = sess.run(loss, feed_dict={x_data: t, y_target: y_data})
    loss_vec.append(temp_loss)

    if (ix+1)%10==0:
        print('Training Observation #' + str(ix+1) + ': Loss = ' + str(temp_loss))

    # Keep trailing average of past 50 observations accuracy
    # Get prediction of single observation
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})
    # Get True/False if prediction is accurate
    train_acc_temp = target_train[ix]==np.round(temp_pred)
    train_acc_all.append(train_acc_temp)
    if len(train_acc_all) >= 50:
        train_acc_avg.append(np.mean(train_acc_all[-50:]))

# Get test set accuracy
print('Getting Test Set Accuracy For {} Sentences.'.format(len(texts_test)))
test_acc_all = []
# Reuse the fitted vocabulary; do not refit it on the test set
for ix, t in enumerate(vocab_processor.transform(texts_test)):
    y_data = [[target_test[ix]]]

    if (ix+1)%50==0:
        print('Test Observation #' + str(ix+1))    

    # Keep trailing average of past 50 observations accuracy
    # Get prediction of single observation
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})
    # Get True/False if prediction is accurate
    test_acc_temp = target_test[ix]==np.round(temp_pred)
    test_acc_all.append(test_acc_temp)

print('\nOverall Test Accuracy: {}'.format(np.mean(test_acc_all)))

# Plot training accuracy over time
plt.plot(range(len(train_acc_avg)), train_acc_avg, 'k-', label='Train Accuracy')
plt.title('Avg Training Acc Over Past 50 Generations')
plt.xlabel('Generation')
plt.ylabel('Training Accuracy')
plt.show()
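
What a single training step does to the parameters can be written out by hand. For sigmoid cross-entropy on one example, the gradient with respect to the logit is sigmoid(z) - y, so each weight moves by the learning rate times that error times its input count. The following is a plain-Python sketch with made-up numbers and function names; it is not a replacement for the optimizer in the code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(counts, label, weights, bias):
    """Sigmoid cross-entropy for one example."""
    p = sigmoid(sum(c * w for c, w in zip(counts, weights)) + bias)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def sgd_step(counts, label, weights, bias, lr=0.001):
    """One gradient-descent update on a single example."""
    z = sum(c * w for c, w in zip(counts, weights)) + bias
    error = sigmoid(z) - label  # d(loss)/d(logit)
    new_weights = [w - lr * error * c for w, c in zip(weights, counts)]
    new_bias = bias - lr * error
    return new_weights, new_bias

# Hypothetical example: a spam message (label 1) with a 3-word vocabulary
counts, label = [2.0, 0.0, 1.0], 1
weights, bias = [0.0, 0.0, 0.0], 0.0
before = loss(counts, label, weights, bias)
# A larger learning rate than in the post, so the effect is visible after one step
new_weights, new_bias = sgd_step(counts, label, weights, bias, lr=0.1)
after = loss(counts, label, new_weights, new_bias)
print(before, after)
```

Weights of words that do not occur in the message (count 0) are left untouched, which is why each update only affects the handful of words in the current text.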

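I was short on time, so the detailed explanations are given in the code comments; I'll revise this post properly later. OK, that's it for today.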
