Code source: TensorFlow Machine Learning Cookbook (translated by Zeng Yiqiang, September 2017), Chapter 7: Natural Language Processing
Code address: https://github.com/nfmcclure/tensorflow-cookbook
Problem solved: use TF-IDF features for spam prediction (with a logistic regression classifier)
Limitation: word order is not taken into account
TF-IDF: TF Term Frequency, IDF Inverse Document Frequency.
TF measures how frequently the term t appears in document d.
The idea behind IDF is that the fewer documents contain term t (i.e. the smaller the denominator below), the larger the IDF, and the better term t discriminates between categories.
The TF-IDF value of term i in document j is computed as:

    tfidf(i, j) = tf(i, j) * log( |D| / (1 + |{ d : term i in d }|) )

where |D| is the total number of documents and the denominator counts the documents containing term i; the +1 is Laplace smoothing, which prevents the denominator from being 0 when no document contains the term.
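To make the formula concrete, here is a toy hand computation (illustrative only; the corpus and terms are made up, and scikit-learn's TfidfVectorizer used later differs slightly: with smooth_idf=True it computes idf = ln((1 + |D|) / (1 + df)) + 1 and L2-normalizes each row):

import math

docs = [['spam', 'free', 'money'],
        ['meeting', 'tomorrow', 'money'],
        ['free', 'free', 'offer']]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # term frequency within this document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / (1 + df))    # +1 Laplace smoothing in the denominator
    return tf * idf

print(tfidf('spam', docs[0], docs))  # (1/3) * ln(3/2), approximately 0.135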
Proceed as follows:
step1: import the required packages
step2: Prepare the dataset
step3: tokenize and build the text vectors
step4: Split the dataset
step5: Build the graph
step6: train and track performance
step1: import the required packages
import tensorflow as tf
import matplotlib.pyplot as plt
import csv
import numpy as np
import os
import string
import requests
import io
import nltk
from zipfile import ZipFile
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.python.framework import ops

ops.reset_default_graph()

# Start a graph session
sess = tf.Session()

# Define batch size and feature vector length
batch_size = 200
max_features = 1000
step2: Prepare the dataset
Refer to the data preparation in the companion article "Using TensorFlow in text processing: bag of words"; it builds the texts list and the spam/ham target labels used below.
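For convenience, here is a minimal sketch of that preparation, assuming the UCI SMS Spam Collection URL used in the cookbook repository (the cache file name temp_spam_data.csv is illustrative); it relies on the imports from step1:

save_file_name = 'temp_spam_data.csv'
if os.path.isfile(save_file_name):
    # Reuse the cached copy if it exists
    text_data = []
    with open(save_file_name, 'r') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            if row:
                text_data.append(row)
else:
    # Download and unzip the SMS Spam Collection
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    text_data = file.decode(errors='ignore').split('\n')
    text_data = [x.split('\t') for x in text_data if len(x) >= 1]
    with open(save_file_name, 'w') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)

texts = [x[1] for x in text_data]
target = [x[0] for x in text_data]
# Relabel 'spam' as 1, 'ham' as 0
target = [1 if x == 'spam' else 0 for x in target]
# Normalize: lowercase, strip punctuation and digits, collapse whitespace
texts = [x.lower() for x in texts]
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
texts = [' '.join(x.split()) for x in texts]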
step3: tokenize and build the text vectors
# Define tokenizer
# (nltk.word_tokenize needs the 'punkt' tokenizer models; run nltk.download('punkt') once if they are missing)
def tokenizer(text):
    words = nltk.word_tokenize(text)
    return words

# Create TF-IDF of texts
tfidf = TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features)
sparse_tfidf_texts = tfidf.fit_transform(texts)
At this point each text has been converted into a max_features-dimensional (here 1000-dimensional) TF-IDF vector, and the texts together form a matrix. Note that the matrix is sparse; use sparse_tfidf_texts.todense() to inspect the values.
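A quick sanity check on the matrix (the attribute names below are standard scikit-learn API):

print(sparse_tfidf_texts.shape)              # (number of texts, 1000)
print(sparse_tfidf_texts[0].todense())       # dense view of the first text's vector
print(list(tfidf.vocabulary_.items())[:5])   # a few (term, column index) pairs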
step4: Split the dataset
# Split up data set into train/test
train_indices = np.random.choice(sparse_tfidf_texts.shape[0], round(0.8 * sparse_tfidf_texts.shape[0]), replace=False)
test_indices = np.array(list(set(range(sparse_tfidf_texts.shape[0])) - set(train_indices)))
texts_train = sparse_tfidf_texts[train_indices]
texts_test = sparse_tfidf_texts[test_indices]
target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
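The split is random, so results vary from run to run; to make the partition reproducible you can seed NumPy before the np.random.choice call above (an optional addition, not part of the original recipe):

np.random.seed(42)  # run this before the train/test split above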
step5: Build the graph
# Create variables for logistic regression: weights and bias
A = tf.Variable(tf.random_normal(shape=[max_features, 1]))
b = tf.Variable(tf.random_normal(shape=[1, 1]))

# Initialize the data placeholders
x_data = tf.placeholder(shape=[None, max_features], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)

# Declare logistic model (the sigmoid is applied inside the loss function)
model_output = tf.add(tf.matmul(x_data, A), b)

# Declare loss function (cross-entropy loss)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))

# Actual prediction
prediction = tf.round(tf.sigmoid(model_output))
predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)
accuracy = tf.reduce_mean(predictions_correct)

# Declare optimizer: update the weights with gradient descent to minimize the loss
my_opt = tf.train.GradientDescentOptimizer(0.0025)
train_step = my_opt.minimize(loss)
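Note that model_output holds raw logits: tf.nn.sigmoid_cross_entropy_with_logits applies the sigmoid internally in a numerically stable form, which is why the prediction line applies tf.sigmoid explicitly and then rounds, i.e. classifies at a 0.5 probability threshold.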
step6: train and track performance
# Initialize variables
init = tf.global_variables_initializer()
sess.run(init)

# Start logistic regression
train_loss = []
test_loss = []
train_acc = []
test_acc = []
i_data = []
for i in range(10000):
    rand_index = np.random.choice(texts_train.shape[0], size=batch_size)
    rand_x = texts_train[rand_index].todense()
    rand_y = np.transpose([target_train[rand_index]])
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})

    # Record loss and accuracy every 100 generations
    if (i + 1) % 100 == 0:
        i_data.append(i + 1)
        train_loss_temp = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
        train_loss.append(train_loss_temp)
        test_loss_temp = sess.run(loss, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
        test_loss.append(test_loss_temp)
        train_acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})
        train_acc.append(train_acc_temp)
        test_acc_temp = sess.run(accuracy, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
        test_acc.append(test_acc_temp)

    # Print a status line every 500 generations
    if (i + 1) % 500 == 0:
        acc_and_loss = [i + 1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
        acc_and_loss = [np.round(x, 2) for x in acc_and_loss]
        print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))
The results are plotted as follows:
# Plot loss over time
plt.plot(i_data, train_loss, 'k-', label='Train Loss')
plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)
plt.title('Cross Entropy Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Cross Entropy Loss')
plt.legend(loc='upper right')
plt.show()

# Plot train and test accuracy
plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')
plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)
plt.title('Train and Test Accuracy')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()
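Once training finishes, the same graph can score unseen messages. A minimal sketch (not part of the original recipe) that reuses the fitted tfidf vectorizer and the open session, with made-up example messages:

new_texts = ['win a free prize now', 'are we still meeting tomorrow']  # hypothetical examples
new_x = tfidf.transform(new_texts).todense()
preds = sess.run(prediction, feed_dict={x_data: new_x})
print(preds.ravel())  # 1 = spam, 0 = ham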