Introduction to Natural Language Processing (NLP) (2)


V. Build a model that recognizes text emotions

Using the headline dataset published by a Kaggle user, in which each headline has been labeled 'sarcastic' or 'not sarcastic': 'sarcastic' is represented by 1 and 'not sarcastic' by 0.
Purpose: train a text classifier and test whether a given sentence is sarcastic.
Step 1: the data inside the dataset is stored as JSON, so we must first convert it to Python format
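For reference, each record looks roughly like the sketch below after json.load. The three field names match the ones used in the code that follows; the values themselves are made up for illustration:

item = {
    "article_link": "https://www.example.com/...",  # illustrative URL
    "headline": "example headline text",            # illustrative headline
    "is_sarcastic": 0,                               # 1 = sarcastic, 0 = not sarcastic
}
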
Step 2: Extract the labels, headline text, and article links from the JSON file

import json  # load the json library

with open('sarcasm.json', 'r') as f:  # load the sarcasm-headline JSON file with the json library
    datastore = json.load(f)

sentences = []  # create lists for the headlines, labels, and article links
labels = []
urls = []

for item in datastore:  # copy each required value from the JSON file into the Python lists
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
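
A quick sanity check on the loaded lists (the count of 26,709 records comes from the corpus details shown later in this post):

print(len(sentences))            # 26709 headlines in the dataset
print(sentences[0], labels[0])   # first headline and its label (1 = sarcastic, 0 = not)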

Step 3: Preprocess the text
1. Extract all the words in the headlines and map each word to its token index

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(oov_token="<OOV>")  # out-of-vocabulary words will map to the "<OOV>" token
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

Output: a dictionary of all words and their token indices
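A minimal sketch of how oov_token works, on a made-up toy corpus: any word not seen during fit_on_texts is mapped to the index of '<OOV>', which the tokenizer assigns index 1.

from tensorflow.keras.preprocessing.text import Tokenizer

toy = Tokenizer(oov_token="<OOV>")
toy.fit_on_texts(["i love my dog", "i love my cat"])
print(toy.word_index)  # {'<OOV>': 1, 'i': 2, 'love': 3, 'my': 4, 'dog': 5, 'cat': 6}
print(toy.texts_to_sequences(["i love my hamster"]))  # 'hamster' is unseen -> [[2, 3, 4, 1]]
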
2. Convert the headlines into token sequences, padding with 0 at the end

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])     # print the token sequence for the first sentence
print(padded.shape)  # show the details of the whole corpus

Output: the entire corpus contains 26,709 sentences = 26,709 sequences, and each sequence is padded to a length of 40 tokens, so padded.shape is (26709, 40)
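
A small sketch of what padding='post' does compared with the default 'pre', on toy sequences with illustrative numbers:

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6]]
print(pad_sequences(seqs, padding='post'))  # zeros appended at the back
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]
print(pad_sequences(seqs))  # default padding='pre' puts the zeros in front
# [[1 2 3]
#  [0 4 5]
#  [0 0 6]]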

Step 4: Improve the preprocessing by splitting the data into a training set and a test set, so that the neural network only ever sees the training data and the tokenizer is fitted only on the training data. The hyperparameters used below are defined in the sketch that follows.
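
The code below refers to several hyperparameters that this post does not define explicitly. training_size = 20000 comes from the comment in the code; the other values here are typical choices for this exercise and are assumptions, not taken from the post:

vocab_size = 10000     # assumption: dictionary size
embedding_dim = 16     # assumption: dimensionality of the embedding vectors
max_length = 40        # assumption: matches the padded length observed above
trunc_type = 'post'    # assumption: truncate long sequences at the end
padding_type = 'post'  # assumption: pad short sequences at the end
oov_tok = "<OOV>"      # out-of-vocabulary token
training_size = 20000  # from the code comment: first 20000 sentences for training
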

training_sentences = sentences[0:training_size]  # with training_size = 20000, the first 20000 sentences form the training set
testing_sentences = sentences[training_size:]    # the sentences from 20000 onward form the test set
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)  # instantiate the tokenizer
tokenizer.fit_on_texts(training_sentences)  # fit the tokenizer on the training split only
word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)  # create the training sequences
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)  # create the test sequences
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
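
One practical note: in TensorFlow 2.x, model.fit typically raises a data-adapter error when the features are NumPy arrays but the labels are plain Python lists, so it is safest to convert the labels as well (a minimal fix, assuming NumPy is available):

import numpy as np

# pad_sequences already returns NumPy arrays; convert the label lists too
training_labels = np.array(training_labels)
testing_labels = np.array(testing_labels)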

Step 5: Build a network model using an embedding layer (Embedding)

Role of the embedding layer (Embedding): words that appear in sarcastic sentences acquire a strong tilt in the 'sarcastic' direction, while words that appear in non-sarcastic sentences acquire a strong tilt in the 'not sarcastic' direction. As the neural network trains on more and more sentences, each word is given a coordinate vector that captures its sentiment. When we then feed a sentence into the trained network, the vectors of its words are combined (here, averaged) to produce the sentiment judgment for the whole sentence. This process is embedding.
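A minimal sketch of the shapes involved (toy sizes, randomly initialized weights rather than trained ones): the embedding layer is a (vocab_size × embedding_dim) lookup table that turns a batch of token sequences into a batch of vector sequences, and global average pooling then collapses the sequence dimension.

import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=100, output_dim=16)  # toy vocab of 100 words, 16-dim vectors
batch = tf.constant([[3, 7, 0, 0], [5, 2, 9, 1]])              # 2 sentences, each padded to length 4
vectors = emb(batch)
print(vectors.shape)  # (2, 4, 16): one 16-dim vector per token

pooled = tf.keras.layers.GlobalAvgPool1D()(vectors)
print(pooled.shape)   # (2, 16): one averaged vector per sentence
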

import tensorflow as tf

# vocab_size: dictionary size
# embedding_dim: output size of this layer, i.e. the dimensionality of the generated embeddings
# input_length: length of the input; since the input is padded, this is usually the defined max_length
model = tf.keras.Sequential([
    # the sentiment direction of every word is learned here again and again;
    # defines a (vocab_size x embedding_dim) matrix and outputs (batch_size, max_length, embedding_dim),
    # turning token-index sentences into embeddings
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAvgPool1D(),            # average the word vectors of each sentence
    tf.keras.layers.Dense(24, activation='relu'),  # feed the result into an ordinary dense network
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

num_epoch = 30
history = model.fit(training_padded, training_labels, epochs=num_epoch,
                    validation_data=(testing_padded, testing_labels), verbose=2)

Training results:
Step 6: Test the model by judging the sentiment of new sentences

new_sentences = ['granny starting to fear spiders in the garden might be real',
                 'the weather today is bright']
sequences = tokenizer.texts_to_sequences(new_sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(model.predict(padded))  # one probability per sentence: close to 1 = sarcastic, close to 0 = not
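
To turn the predicted probabilities into hard labels, you can threshold at 0.5 (a small usage sketch):

predictions = model.predict(padded)
print((predictions > 0.5).astype(int))  # 1 = sarcastic, 0 = not sarcastic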

Prediction results:

Question: what does tf.keras.layers.GlobalAvgPool1D() do? How are the word vectors "added"?
Answer: it performs global average pooling over the sequence dimension, i.e. it averages the embedding vectors of the words in each sentence; the pooling layer itself has no trainable weights.
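
A tiny numeric sketch of what that averaging means (made-up numbers):

import tensorflow as tf

# one "sentence" of 3 words, each with a 2-dim embedding vector
x = tf.constant([[[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]]])  # shape (1, 3, 2)
print(tf.keras.layers.GlobalAvgPool1D()(x))  # [[3. 4.]]: the element-wise mean of the 3 vectors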


Origin blog.csdn.net/qq_45234219/article/details/114494467