Using TensorFlow in text processing - the TF-IDF algorithm

Code source: TensorFlow Machine Learning Cookbook (Chinese edition, translated by Zeng Yiqiang, September 2017) - Chapter 7: Natural Language Processing

Code address: https://github.com/nfmcclure/tensorflow-cookbook

Problem solved: spam prediction using TF-IDF features and a logistic regression classifier.

Limitation: word order is not taken into account.


TF-IDF: TF stands for Term Frequency, IDF for Inverse Document Frequency.

TF is the frequency with which a term appears in document d.

The main idea of IDF: the fewer the documents that contain term t (i.e., the smaller the denominator), the larger the IDF, and the better term t distinguishes between categories.

The tf-idf value of word i in document j is computed as

tfidf(i, j) = tf(i, j) × idf(i), where idf(i) = log( |D| / (number of documents containing word i + 1) )

|D| is the total number of documents. The denominator is the number of documents that contain word i; because it can be 0, Laplace smoothing is applied by adding 1 to it.
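
A minimal numerical sketch of this formula on a made-up toy corpus (the documents and words here are purely illustrative; they are not part of the spam dataset):

import math
from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative only)
docs = [["free", "money", "now"],
        ["call", "me", "now"],
        ["money", "talks"]]

def tfidf(term, doc, docs):
    counts = Counter(doc)
    tf = counts[term] / len(doc)               # frequency of the term in this document
    df = sum(1 for d in docs if term in d)     # number of documents containing the term
    idf = math.log(len(docs) / (df + 1))       # |D| / (df + 1): Laplace (+1) smoothing
    return tf * idf

print(tfidf("free", docs[0], docs))   # ~0.135: "free" appears in only one of three documents

Note that scikit-learn's TfidfVectorizer, used in step3 below, applies its own smoothing and normalization conventions, so its values will not match this toy formula exactly.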


 Proceed as follows:

step1: import the required packages

step2: Prepare the dataset

step3: Tokenize and build the text vectors

step4: Split the dataset

step5: Build the graph

step6: Train the model and track loss and accuracy


step1: import the required packages

import tensorflow as tf
import matplotlib.pyplot as plt
import csv
import numpy as np
import os
import string
import requests
import io
import nltk
from zipfile import ZipFile
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.python.framework import ops
ops.reset_default_graph()

# Start a graph session
sess = tf.Session()


# Define batch size and feature vector length
batch_size = 200
max_features = 1000

step2: Prepare the dataset

Refer to the companion post "Using TensorFlow in text processing - bag of words" for downloading and cleaning the spam data; that step produces the texts list (the messages) and the target list (the spam/ham labels) used in the code below.
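
For reference, here is a rough sketch of what that preparation step yields. The download URL, the archive file name, and the tab-separated "label, message" layout are assumptions based on the cookbook's SMS Spam Collection example, so adjust them to match however you actually obtain the data:

# Sketch only: build `texts` (messages) and `target` (1 = spam, 0 = ham) as used below.
# URL and file layout are assumptions; each line is assumed to be "label<TAB>message".
import io
import requests
from zipfile import ZipFile

zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
response = requests.get(zip_url)
archive = ZipFile(io.BytesIO(response.content))
raw = archive.read('SMSSpamCollection').decode(errors='ignore')

rows = [line.split('\t', 1) for line in raw.strip().split('\n')]
target = [1 if label == 'spam' else 0 for label, _ in rows]
texts = [message.lower() for _, message in rows]

Any further cleaning (e.g. removing punctuation and numbers) is covered in the bag-of-words post and is omitted here.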

step3: Tokenize and build the text vectors

# Define tokenizer
# Note: nltk.word_tokenize requires the 'punkt' tokenizer models; run nltk.download('punkt') once if needed
def tokenizer(text):
    words = nltk.word_tokenize(text)
    return words

# Create TF-IDF of texts
tfidf = TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features)
sparse_tfidf_texts = tfidf.fit_transform(texts)

At this point, sparse_tfidf_texts holds each text as a 1000-dimensional TF-IDF vector (max_features = 1000), and the texts together form a sparse matrix (use sparse_tfidf_texts.todense() to view the values).
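
A quick sanity check on the converted data (variable names as above; the exact numbers depend on your dataset):

print(sparse_tfidf_texts.shape)                  # (number of texts, max_features)
print(sparse_tfidf_texts[0].todense()[:, :10])   # first 10 TF-IDF weights of the first text
print(list(tfidf.vocabulary_.items())[:5])       # a few (term, column index) pairs from the vocabulary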

step4: Split the dataset

# Split up data set into train/test
train_indices = np.random.choice(sparse_tfidf_texts.shape[0], round(0.8*sparse_tfidf_texts.shape[0]), replace=False)
test_indices = np.array(list(set(range(sparse_tfidf_texts.shape[0])) - set(train_indices)))
texts_train = sparse_tfidf_texts[train_indices]
texts_test = sparse_tfidf_texts[test_indices]
target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])

step5: Build the graph

# Create variables for logistic regression (weights and bias)
A = tf.Variable(tf.random_normal(shape=[max_features, 1]))
b = tf.Variable(tf.random_normal(shape=[1, 1]))

# Initialize placeholders for the input features and the target labels
x_data = tf.placeholder(shape=[None, max_features], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)

# Declare the logistic model (the sigmoid is applied inside the loss function)
model_output = tf.add(tf.matmul(x_data, A), b)

# Declare the loss function (sigmoid cross-entropy)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))

# Actual prediction: round the sigmoid output to 0/1 and measure accuracy
prediction = tf.round(tf.sigmoid(model_output))
predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)
accuracy = tf.reduce_mean(predictions_correct)

# Declare the optimizer: gradient descent to minimize the loss
my_opt = tf.train.GradientDescentOptimizer(0.0025)
train_step = my_opt.minimize(loss)
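
As a side note, here is a small standalone check (not part of the cookbook code) of what the sigmoid cross-entropy above computes for a single example: with p = sigmoid(logit), the loss is -[y·log(p) + (1 - y)·log(1 - p)].

import numpy as np

logit, y = 0.7, 1.0
p = 1.0 / (1.0 + np.exp(-logit))
print(-(y * np.log(p) + (1 - y) * np.log(1 - p)))   # manual cross-entropy, ~0.403
print(np.log(1.0 + np.exp(-logit)))                 # equivalent form for y = 1, same value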

step6: Train the model and track loss and accuracy

# Initialize variables
init = tf.global_variables_initializer()
sess.run(init)

# Start Logistic Regression
train_loss = []
test_loss = []
train_acc = []
test_acc = []
i_data = []
for i in range(10000):
    rand_index = np.random.choice(texts_train.shape[0], size=batch_size)
    rand_x = texts_train[rand_index].todense()
    rand_y = np.transpose([target_train[rand_index]])
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    
    # Record loss and accuracy every 100 generations; print a status line every 500 generations
    if (i+1) % 100 == 0:
        i_data.append(i+1)
        train_loss_temp = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
        train_loss.append(train_loss_temp)
        
        test_loss_temp = sess.run(loss, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
        test_loss.append(test_loss_temp)
        
        train_acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})
        train_acc.append(train_acc_temp)
    
        test_acc_temp = sess.run(accuracy, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
        test_acc.append(test_acc_temp)
    if (i+1)%500==0:
        acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
        acc_and_loss = [np.round(x,2) for x in acc_and_loss]
        print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))

The result is as follows (the training log screenshot is omitted here):

Plotting the loss and accuracy curves:

# Plot loss over time
plt.plot(i_data, train_loss, 'k-', label='Train Loss')
plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)
plt.title('Cross Entropy Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Cross Entropy Loss')
plt.legend(loc='upper right')
plt.show()

# Plot train and test accuracy
plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')
plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)
plt.title('Train and Test Accuracy')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()

 

 
