Song lyrics generation based on deep learning

"Natural Language Processing" course report

Summary

Song Ci is a "new-style" verse form relative to classical poetry. It embodies the wisdom of the scholars and literati of the Song Dynasty and marks the highest achievement of Song Dynasty literature. Song Ci generation belongs to text generation in the field of natural language processing. Current approaches to text generation mainly include natural language generation based on language models and natural language generation using deep learning methods. Based on the TensorFlow deep learning framework, this project builds several Song Ci generation models, trains and evaluates them, and draws conclusions by comparing the results.
First, 5,000 Song Ci were collected from GitHub as the data set. The data set was then preprocessed, including handling missing values, removing abnormal symbols, and building a dictionary. Next, using single characters as the granularity, a sliding-window method was used to generate the training and test sets. A Word2Vec model was trained in an unsupervised way on all of the Song Ci text to obtain a vector for each character, which is later used to initialize the weights of the neural networks' embedding layers. TextCNN, LSTM, Attention, Transformer and other network structures and mechanisms were then used to build Song Ci generation models. This project implements six models: TextCNN, BiLSTM, BiLSTM+Attention, CNN+BiLSTM+Attention, TextCNN+BiLSTM+Attention and Transformer. Finally, the models were trained; their accuracies on the test set are 0.63, 0.68, 0.73, 0.81, 0.69 and 0.85 respectively. The Transformer model works best, with an accuracy of 85%. Taking "evening wind blowing gently" as input, its output is: "Evening wind blows gently in Chu Township, the moon is at the door, and the cool heart is here. Thousands of immortals are separated from the painting maze, with white hair flowing back, resting in a high place. There are thousands of waves in the red, and the lonely building is short of the moon. It is always difficult to stay in the swallow. Try to turn on the light to write, but still can't bear it." It can be seen that the generated Song Ci is of reasonable quality. Finally, the results of the above models are summarized and analyzed in combination with their principles.

1. Project introduction

Song Ci is a "new-style" verse form relative to classical poetry. It embodies the wisdom of the scholars and literati of the Song Dynasty and marks the highest achievement of Song Dynasty literature. Song Ci generation belongs to text generation in the field of natural language processing. Current approaches to text generation mainly include natural language generation based on language models and natural language generation using deep learning methods, and text generation technology has broad application prospects. This project uses Word2Vec, TextCNN, LSTM, Attention, Transformer and other network structures and mechanisms to build Song Ci generation models, and compares and analyzes the performance of the various models.

2. Project implementation plan

This project builds its models on Word2Vec, TextCNN, LSTM, Attention, Transformer and other networks and mechanisms, varying the network structure across models. The overall flow of the project is shown in Figure 1.

Figure 1 Project flow chart

2.1 Preliminary preparation

2.1.1 Dataset source

Data source: chinese-poetry/chinese-poetry (github.com), the most comprehensive database of ancient Chinese poetry, covering nearly 14,000 poets of the Tang and Song dynasties, with nearly 55,000 Tang poems and 260,000 Song poems; for the two Song dynasties it contains 1,564 ci poets and 21,050 ci poems. This project selects 5,000 Song Ci as its data set. The data is in JSON format and consists of 5 JSON files, each containing 1,000 Song Ci.
Figure 2 Sample of the data set

2.1.2 Data preprocessing

After obtaining the data, missing values were handled first: entries with no content were deleted, and incorrect punctuation symbols were removed, cleaning the data as a whole. After cleaning, the lengths of the Song Ci were counted, as shown in Figure 3. The figure shows that most Song Ci are between 50 and 100 characters long, and a few exceed 150 characters.
![Figure 3 Distribution of Song Ci lengths](https://img-blog.csdnimg.cn/74ed0c1a57614bf893be887d5d595df9.png)

Since Song Ci are ancient texts, modern word segmentation tools are not suitable; forcing their use would give unsatisfactory results. Therefore, this project uses the Tokenizer tool in keras.preprocessing.text to preprocess the Song Ci documents at character granularity, building a dictionary that maps each character to an index. After counting, there are 4,394 characters in total. Then texts_to_sequences is used to convert each Song Ci into the corresponding dictionary indices. Next, the training and test sets are generated with a sliding window: with a window size of 10, the window slides from the beginning of each Song Ci to the right, taking the first 10 characters as the input and the following character as the label. The sampling method is shown in Figure 4, and 74,693 samples are obtained in total. Splitting them 8:2 yields 59,754 training samples and 14,939 test samples.

Figure 4 Sentence sampling
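
The preprocessing described above can be sketched as follows. This is a minimal illustration, assuming the chinese-poetry JSON layout (a list of entries with a `paragraphs` field); the file name, the character-level Tokenizer setting and the variable names are illustrative assumptions rather than the project's actual code.

```python
import json
import numpy as np
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

# Load one of the dataset files (hypothetical name) and join each ci into one string.
with open("ci.song.0.json", encoding="utf-8") as f:
    raw = json.load(f)
poems = ["".join(item["paragraphs"]) for item in raw]

# Character-level dictionary: every character is treated as one token.
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(poems)                      # builds the ~4,394-entry dictionary
sequences = tokenizer.texts_to_sequences(poems)    # each ci -> list of dictionary indices

# Sliding window: first 10 characters are the input, the 11th is the label.
window = 10
X, y = [], []
for seq in sequences:
    for i in range(len(seq) - window):
        X.append(seq[i:i + window])
        y.append(seq[i + window])
X, y = np.array(X), np.array(y)

# 8:2 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```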

2.1.3 Environment Construction

This project uses the TensorFlow, Keras and keras-transformer frameworks.
Versions: Python 3.6.3, TensorFlow 2.6.2, Keras 2.6.0, keras-transformer 0.40.0.
The main packages used are listed below:

import numpy as np
import tensorflow as tf
from keras import backend as K
from keras import activations
from gensim.models import word2vec
from keras.layers import Layer
from keras.preprocessing import sequence
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential, Model
from keras.layers import (Input, Dense, Embedding, LSTM, Bidirectional, Dropout,
                          MaxPooling1D, Flatten, concatenate, Conv1D)
from keras.callbacks import EarlyStopping
import keras.utils as ku
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

2.2 Model preparation

2.2.1 Word2vec

Word2vec is an NLP tool released by Google in 2013. Its key feature is that it converts words into vector representations, so that the relationships between words can be measured quantitatively and the connections between words can be mined. Because Word2vec takes context into account, it is faster and more general than earlier embedding methods and can be used in all kinds of NLP tasks. The Word2vec structure is shown in the figure below.
Figure 5 Word2vec structure diagram

The training model of Word2Vec is essentially a neural network with only one hidden layer. Its input is a one-hot encoded vocabulary vector, and its output is also a one-hot encoded vocabulary vector. The network is trained on all the samples to guess the words adjacent to the input word; after convergence, the weights from the input layer to the hidden layer are the distributed-representation word vectors. In other words, once training is complete the Word2Vec model maps each word to a vector that can represent the relationships between words, and this vector is simply the hidden-layer representation of the network. In this project, Word2Vec is first trained on all the Song Ci, with the window size set to 5 and the word-vector dimension set to 100. All characters are kept, and the trained weights are used as the initial weights of the embedding layer in the subsequent neural networks.
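
This pre-training step can be sketched with gensim as follows, using the settings given above (window 5, 100-dimensional vectors, all characters kept). The `poems` and `tokenizer` objects are assumed to come from the preprocessing sketch in 2.1.2, and the resulting `embedding_matrix` is what later initializes the Keras Embedding layers.

```python
import numpy as np
from gensim.models import Word2Vec

# Work at character granularity: each ci becomes a list of characters.
sentences = [list(poem) for poem in poems]

# window=5, 100-dimensional vectors, min_count=1 keeps every character
# (in gensim 3.x the keyword is `size` instead of `vector_size`).
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Build an embedding matrix aligned with the Tokenizer's index,
# used to initialize the Embedding layer weights of the later models.
vocab_size = len(tokenizer.word_index) + 1          # +1 for the padding index 0
embedding_matrix = np.zeros((vocab_size, 100))
for ch, idx in tokenizer.word_index.items():
    if ch in w2v.wv:
        embedding_matrix[idx] = w2v.wv[ch]
```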

2.2.2 CNN and TextCNN

Convolutional Neural Network (CNN) is a feed-forward neural network composed of several convolutional and pooling layers. The basic structure of a CNN consists of an input layer, convolutional layers, pooling layers, fully connected layers and an output layer. There are usually several convolutional and pooling layers, arranged alternately: each convolutional layer is followed by a pooling layer, which is followed by another convolutional layer, and so on. The convolutional neural network evolved from the multi-layer perceptron (MLP). For high-dimensional input, it is impractical to fully connect every neuron to all neurons of the previous layer, so partial connections (local receptive fields) are used instead, which reduces the number of connections and parameters many times over.
In 2014, Yoon Kim adapted the input layer of the CNN and proposed the text classification model TextCNN. Compared with CNNs for images, TextCNN makes no real change to the network structure (it is even simpler). As Figure 6 shows, TextCNN has only one convolutional layer and one max-pooling layer, and finally feeds the output into a softmax layer for n-way classification. The biggest advantage of TextCNN is its simple network structure; even so, by introducing pre-trained word vectors it achieves very good results and surpasses benchmarks on multiple data sets. Its disadvantage is weak interpretability: when tuning the model it is difficult to adjust specific features based on the training results, because TextCNN has no notion of feature importance like that of GBDT models, so it is hard to evaluate the importance of each feature.

Figure 6 TextCNN Architecture Diagram (Source: https://zhuanlan.zhihu.com/p/129808195)

In this project, using part of a sentence to predict the next character is essentially a classification task, so we tried using TextCNN to extract text features and predict the next character, and also used models combining CNN with BiLSTM for the same prediction.

2.2.3 LSTM and BiLSTM

LSTM stands for Long Short-Term Memory and is a type of RNN (Recurrent Neural Network). Thanks to its design, LSTM is well suited to modeling sequential data such as text. BiLSTM is short for Bi-directional Long Short-Term Memory and is composed of a forward LSTM and a backward LSTM; both are commonly used to model contextual information in natural language processing tasks. LSTMs have achieved considerable success on many problems and are widely used. LSTM avoids the long-term dependency problem by deliberate design: remembering long-term information is in practice its default behaviour rather than an ability acquired at great cost. All RNNs take the form of a chain of repeated neural network modules; the LSTM structure is shown in Figure 7.

Figure 7 LSTM structure diagram

LSTM can only predict the output at the next moment from the information of previous moments, but in some problems the output at the current moment is related not only to previous states but possibly also to future states. For example, predicting a missing word in a sentence requires judging not only from the preceding text but also from the content that follows, so that the judgment is truly based on context. BiLSTM is composed of two LSTMs, one running forward and one backward, and its output is determined by the states of both, as shown in Figure 8.

Figure 8 BiLSTM structure diagram

2.2.4 Attention mechanism

The attention mechanism was first applied in the image field. Bahdanau et al., in the paper "Neural Machine Translation by Jointly Learning to Align and Translate", then used an attention-like mechanism to translate and align simultaneously in machine translation; their work was the first to apply the attention mechanism to NLP. Since then the attention mechanism has been widely used in various NLP tasks based on neural network models such as RNNs and CNNs. In essence, the attention mechanism draws its inspiration from human visual attention: when we perceive a scene, we usually do not scan it from beginning to end every time, but observe and attend to a specific part according to our needs. And when we find that what we want to observe often appears in a certain part of a scene, we learn to focus on that part when similar scenes appear in the future.
Figure 9 Attention mechanism diagram

Computing the attention value with the attention function takes three steps: first, compute the similarity between Q and K to obtain weights; second, normalize these weights (typically with softmax); third, take the weighted sum of V using the normalized weights to obtain the attention value. The working process of the attention mechanism can also be understood from another angle, as soft addressing: each element of the sequence is stored in memory as a key (address) and value (element) pair, and when a query arrives, the value of the matching element (that is, the attention value for the query) is retrieved. Unlike traditional addressing, the value is not fetched by an exact address; instead, addressing is completed by computing the similarity between the keys and the query. This is so-called soft addressing: it may take out the values at all addresses, with the similarities computed in the first step determining how important each retrieved value is, and then merge the values according to that importance to obtain the final attention value.
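
The three steps can be illustrated with a toy NumPy sketch (shapes and values below are arbitrary; this is not the attention layer used in the project):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # step 1: similarity between Q and K (scaled dot product)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # step 2: softmax normalization of the weights
    return weights @ V                                 # step 3: weighted sum of the values

Q = np.random.rand(1, 64)         # one query vector
K = np.random.rand(10, 64)        # keys of a 10-step sequence
V = np.random.rand(10, 64)        # values of the same sequence
print(attention(Q, K, V).shape)   # (1, 64): a single context vector
```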
The attention mechanism can flexibly capture both global and local connections, and it does so in a single step. As the attention function shows, it compares each element of the sequence with every other element, so the distance between any two elements is one; this makes it far better at capturing long-term dependencies than RNNs, which process the sequence step by step and become weaker at capturing long-term dependencies as the sequence grows longer. It also supports parallel computation, which reduces training time: each step of the attention computation does not depend on the result of the previous step, so it can be processed in parallel like a CNN. A CNN, however, captures only local information at each layer and obtains a global view only by stacking layers. In addition, the attention mechanism has low model complexity and few parameters.

2.2.5 Transformer

The Transformer abandons the traditional CNN and RNN; its entire network structure is composed purely of attention mechanisms. More precisely, the Transformer consists only of self-attention and feed-forward neural networks. A trainable Transformer-based network can be built by stacking Transformer blocks; with an encoder and a decoder of 6 layers each, 12 encoder-decoder layers in total, it reached a new high in BLEU score on machine translation.

Figure 10 Transformer structure diagram

Unlike an RNN, the Transformer can be trained in parallel more easily. The Transformer itself cannot use the order of the words, so positional embeddings must be added to the input; otherwise the Transformer degenerates into a bag-of-words model. The core of the Transformer is the self-attention structure, in which the Q, K and V matrices are obtained by linear transformations of the input. The multi-head attention in the Transformer contains multiple self-attention heads, which can capture attention scores between words across multiple representation dimensions.
Although the Transformer ultimately does not escape the traditional learning paradigm, being only a combination of fully connected layers (or one-dimensional convolutions) and attention, its design is innovative enough: it abandons the RNN and CNN that were fundamental to NLP and still achieves very good results. The design of the algorithm is elegant and well worth careful study by anyone working on deep learning. The biggest key to the Transformer's performance gain is that it sets the distance between any two words to 1, which is very effective against the long-term dependency problem that is so difficult in NLP. The Transformer is not only applicable to machine translation, and is not even limited to NLP; it is a direction with great research potential. The algorithm also parallelizes very well, which suits the current hardware (mainly GPU) environment.

2.4 Model building

2.4.1 TextCNN

The TextCNN model is used to generate Song Ci. The first layer of the model is the input layer and the second is the embedding layer, whose weight matrix is initialized with the Word2Vec pre-trained vectors. Three convolution kernels are introduced: the kernel length equals the word-vector dimension, the kernel widths are set to 3, 4 and 5 respectively, equal-width (same-padding) convolution is used, and the number of channels is set to 64. Max pooling is then applied to the three convolutional feature maps, the pooled matrices are spliced together by a concatenate layer and flattened by a Flatten layer, and finally, after dropout, a fully connected layer with softmax performs the multi-class prediction of the next character. The network structure is shown in Figure 11.
Figure 11 TextCNN model structure
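
A Keras sketch of this architecture is shown below. It reuses `vocab_size` and `embedding_matrix` from the earlier sketches; the pooling size, dropout rate and loss function are assumptions, since the report does not state them.

```python
from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, MaxPooling1D, Flatten,
                          Dropout, Dense, concatenate)

inp = Input(shape=(10,))                                          # 10-character window
emb = Embedding(vocab_size, 100, weights=[embedding_matrix])(inp) # Word2Vec-initialized

branches = []
for width in (3, 4, 5):
    c = Conv1D(64, width, padding='same', activation='relu')(emb) # equal-width convolution, 64 channels
    branches.append(MaxPooling1D(pool_size=2)(c))

x = concatenate(branches)
x = Flatten()(x)
x = Dropout(0.5)(x)
out = Dense(vocab_size, activation='softmax')(x)                  # predict the next character

textcnn = Model(inp, out)
textcnn.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
```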

2.4.2 BiLSTM

The BiLSTM model is used to generate Song Ci. The first layer of the model is the input layer, the second is the embedding layer whose weight matrix is initialized with the Word2Vec pre-trained vectors, and the third is a bidirectional LSTM with dimension 128. A Flatten layer flattens the output vectors, and finally, after dropout, a fully connected layer performs the classification. The specific network structure is shown in Figure 12.

Figure 12 BiLSTM model structure
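
A corresponding Keras sketch (again reusing `vocab_size` and `embedding_matrix`; the dropout rate and loss function are assumptions):

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Flatten, Dropout, Dense

bilstm = Sequential([
    Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=10),
    Bidirectional(LSTM(128, return_sequences=True)),   # per-step output, shape (10, 256)
    Flatten(),
    Dropout(0.5),
    Dense(vocab_size, activation='softmax'),
])
bilstm.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])
```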

2.4.3 BiLSTM+Attention

On top of the BiLSTM structure, an attention layer is added after the bidirectional LSTM layer. The attention layer converts the 10x256 output into a 1x256 vector, which is fed directly into the fully connected layer to classify and predict the next character. The specific network structure is shown in Figure 13.

Figure 13 BiLSTM+Attention model structure
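
The report does not show the attention layer's implementation, so the sketch below uses an assumed additive attention-pooling layer that reduces the (10, 256) BiLSTM output to a single 256-dimensional vector before the fully connected layer:

```python
import tensorflow as tf
from keras.models import Model
from keras.layers import Layer, Input, Embedding, Bidirectional, LSTM, Dense

class AttentionPooling(Layer):
    """Additive attention over time steps: (batch, steps, dim) -> (batch, dim)."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name='W', shape=(dim, dim), initializer='glorot_uniform')
        self.u = self.add_weight(name='u', shape=(dim, 1), initializer='glorot_uniform')
        super().build(input_shape)

    def call(self, x):
        scores = tf.tensordot(tf.tanh(tf.tensordot(x, self.W, axes=1)), self.u, axes=1)
        weights = tf.nn.softmax(scores, axis=1)        # one weight per time step
        return tf.reduce_sum(weights * x, axis=1)      # weighted sum over the 10 steps

inp = Input(shape=(10,))
x = Embedding(vocab_size, 100, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(128, return_sequences=True))(x)    # (10, 256)
x = AttentionPooling()(x)                                  # (256,)
out = Dense(vocab_size, activation='softmax')(x)
bilstm_att = Model(inp, out)
bilstm_att.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
```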

2.4.4 CNN+BiLSTM+Attention

CNN is combined with BiLSTM and Attention: one convolution layer and one pooling layer are added after the embedding layer, followed by the bidirectional LSTM. The rest of the structure is the same as in 2.4.3. The specific network structure is shown in Figure 14.
Figure 14 CNN+BiLSTM+Attention model structure
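
A sketch of this variant, reusing the `AttentionPooling` layer defined in 2.4.3; the filter count, kernel width and pooling size are assumptions:

```python
from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, Bidirectional, LSTM, Dense

inp = Input(shape=(10,))
x = Embedding(vocab_size, 100, weights=[embedding_matrix])(inp)
x = Conv1D(64, 3, padding='same', activation='relu')(x)   # one convolution layer
x = MaxPooling1D(pool_size=2)(x)                           # one pooling layer
x = Bidirectional(LSTM(128, return_sequences=True))(x)
x = AttentionPooling()(x)
out = Dense(vocab_size, activation='softmax')(x)
cnn_bilstm_att = Model(inp, out)
cnn_bilstm_att.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])
```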

2.4.5 TextCNN+BiLSTM+Attention

TextCNN is combined with BiLSTM and Attention. Unlike 2.4.4, TextCNN has multiple convolution kernels, with the same kernel configuration as in 2.4.1. The vectors produced by each convolution are spliced together by a concatenate layer and then fed into the BiLSTM+Attention network. The specific network structure is shown in Figure 15.
Figure 15 TextCNN+BiLSTM+Attention model structure
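
A sketch of this variant, with three convolution branches (widths 3, 4 and 5 as in 2.4.1) whose outputs are concatenated before the BiLSTM; it again reuses the `AttentionPooling` layer and the other assumptions above:

```python
from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, Bidirectional, LSTM, Dense, concatenate

inp = Input(shape=(10,))
emb = Embedding(vocab_size, 100, weights=[embedding_matrix])(inp)
branches = [Conv1D(64, w, padding='same', activation='relu')(emb) for w in (3, 4, 5)]
x = concatenate(branches)                                  # (10, 192): channels concatenated
x = Bidirectional(LSTM(128, return_sequences=True))(x)
x = AttentionPooling()(x)
out = Dense(vocab_size, activation='softmax')(x)
textcnn_bilstm_att = Model(inp, out)
textcnn_bilstm_att.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                           metrics=['accuracy'])
```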

2.4.6 Transformer

This model is built with the get_model module of keras_transformer. The parameters of get_model are set as follows: token_num=10000, embed_dim=128, encoder_num=3, decoder_num=2, head_num=4, hidden_dim=256, attention_activation='relu', feed_forward_activation='relu', dropout_rate=0.1. The cross-entropy loss function is used, and the optimizer is Adam. Since the full printout of the Transformer network is too large, only its general structure is shown in Figure 16.
Figure 16 Transformer model structure
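
A sketch of this setup with the parameter values listed above; the compile call follows the library's documented usage, and the data pipeline for the encoder/decoder inputs is omitted:

```python
from keras_transformer import get_model

transformer = get_model(
    token_num=10000,
    embed_dim=128,
    encoder_num=3,
    decoder_num=2,
    head_num=4,
    hidden_dim=256,
    attention_activation='relu',
    feed_forward_activation='relu',
    dropout_rate=0.1,
)
transformer.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
transformer.summary()
```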

3. Experimental results
After training the above models, the results are shown in Table 1.
Table 1 Accuracy of each model on the test set

| Model | Test accuracy |
| --- | --- |
| TextCNN | 0.63 |
| BiLSTM | 0.68 |
| BiLSTM+Attention | 0.73 |
| CNN+BiLSTM+Attention | 0.81 |
| TextCNN+BiLSTM+Attention | 0.69 |
| Transformer | 0.85 |

As can be seen from the table, the Transformer model has the highest accuracy on the test set, followed by the CNN+BiLSTM+Attention model, while the TextCNN model has the lowest. Then, using "evening wind blowing gently" as the model input, each of the above models is applied to generate Song Ci; the results are shown in Table 2.
Table 2 Song Ci generated by each model

As Table 2 shows, the Song Ci generated by the Transformer have a relatively regular sentence structure, while TextCNN and BiLSTM fail to generate correct Song Ci once the sentences get long. Adding the Attention mechanism improves the quality of the generated Song Ci. This is probably because the earlier models cannot capture longer-range information: when the sequence is too long, earlier content is forgotten, and the Attention mechanism alleviates this, making the encoding more comprehensive.
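
For reference, the sketch below shows how a trained next-character model of the kind built in section 2.4 can be rolled out from a short prompt to produce a complete ci. The helper function, the greedy decoding and the Chinese prompt string are illustrative assumptions, not the project's actual generation code.

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate(model, prompt, length=50, window=10):
    """Repeatedly predict the next character and append it to the text."""
    index_to_char = {i: c for c, i in tokenizer.word_index.items()}
    text = prompt
    for _ in range(length):
        seq = tokenizer.texts_to_sequences([text[-window:]])
        seq = pad_sequences(seq, maxlen=window)        # left-pad prompts shorter than the window
        probs = model.predict(seq, verbose=0)[0]
        next_idx = int(np.argmax(probs))               # greedy decoding
        if next_idx == 0:                              # padding index: nothing sensible to add
            break
        text += index_to_char[next_idx]
    return text

# Hypothetical prompt roughly corresponding to "evening wind blowing gently".
print(generate(bilstm_att, "晚风轻轻吹"))
```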

4. Project Summary

4.1 Project Evaluation and Innovation

This project implements six models: TextCNN, BiLSTM, BiLSTM+Attention, CNN+BiLSTM+Attention, TextCNN+BiLSTM+Attention and Transformer, among which Transformer and CNN+BiLSTM+Attention perform best. The innovation of this project lies in combining CNN with BiLSTM+Attention, which achieves good results. In some tasks, adding a BiLSTM layer after the CNN can enhance the model's understanding of semantics: the CNN is responsible for extracting text features, while the BiLSTM is responsible for understanding the semantic information of sentences, and combining the two often improves the results. When the input text is very long, it is difficult for a bidirectional long short-term memory model alone to produce a good vector representation of the text, so the attention mechanism can be used to grasp the key parts of the text. Specifically, the Attention mechanism retains the intermediate outputs of the BiLSTM encoder over the input sequence and trains the model to selectively attend to these outputs when producing its prediction. In this project, introducing the Attention module improves the models' results.
TextCNN+BiLSTM+Attention does not perform well in this project. After consulting the relevant literature, we found that TextCNN is generally not combined with BiLSTM: TextCNN is already a strong model on its own, and adding a recurrent network after it often only increases computation time without helping semantic understanding, and may even interfere with the results. The Transformer model performs best in this project, which has much to do with its architecture: it abandons the traditional CNN and RNN structures and uses self-attention modules, giving it stronger decoding ability than Seq2Seq-style models.

4.2 Experience

Many difficulties were encountered during the project, such as environment configuration problems: mismatched TensorFlow and Keras versions caused the code to keep reporting errors later on, and some library functions could not be used, so using Anaconda to manage the various packages is recommended. When using the Transformer, not understanding its parameters also frequently led to errors.
Although this course was short, I learned a lot. Many things I had not understood when searching for information online or watching videos suddenly became clear through the teacher's explanations in class. Looking back at the project I did during the comprehensive training, at that time I did not understand some of the theoretical knowledge and did not know why errors occurred. After studying this course, my theoretical knowledge has become more solid, I have gradually formed an NLP framework in my mind, and I now have a comprehensive understanding of the whole processing flow and the models used in NLP tasks. Although the natural language processing course is over, the learning and exploration of NLP is not. This project not only consolidated the knowledge I learned in class, but also strengthened my self-study ability and gave me project experience, which will be of great help in later study and research.



Origin blog.csdn.net/weixin_55085530/article/details/127814021