Machine learning technology: Use deep learning to process text

So far, we have applied machine learning in many settings: topic modeling, clustering, classification, and text summarization all rely on it, and even the POS taggers and NER taggers we used are trained with machine learning. In this chapter, we begin to explore a cutting-edge machine learning technique: deep learning. Deep learning builds algorithmic structures inspired by biology to carry out text tasks such as text generation, classification, and word embedding. This chapter discusses the basics of deep learning and how to implement deep learning models for text. The topics covered in this chapter are as follows:

  • Deep learning;
  • Application of deep learning to text;
  • Text generation.

13.1 Deep learning

The previous chapters introduced machine learning techniques, including topic models, clustering and classification algorithms, and what we call shallow learning: word embedding. Word embeddings are the first neural network models that readers encounter in this book, and they can learn semantic information.

A neural network can be understood as a computing system or machine learning algorithm whose structure is inspired by biological neurons in the brain. We can only describe the analogy in these general terms, because the human brain itself is still not thoroughly understood. Neural networks borrow the idea of neural connections and structure from the brain; perceptrons and single-layer neural networks are the simplest examples.

A standard neural network consists of neuron nodes that act as arithmetic units and interact through connections. In this sense, the model resembles the structure of the brain: nodes represent neurons, and edges represent the connections between neurons. Neurons in different layers perform different types of operations. The network shown in Figure 13.1 contains an input layer, multiple hidden layers, and an output layer.

 

Figure 13.1 Example of neural network structure
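As a rough illustration of the layered structure in Figure 13.1, the sketch below pushes one input vector through a hidden layer and an output layer in plain Python. The layer sizes, the weights, and the helper names (dense, forward) are made up for illustration and do not come from the book; a real network would learn its weights from data.

```python
import math

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each output node sums its weighted inputs."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: 3 input nodes -> 2 hidden nodes -> 1 output node.
HIDDEN_W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [0.05]

def forward(x):
    hidden = dense(x, HIDDEN_W, HIDDEN_B, math.tanh)   # hidden layer
    return dense(hidden, OUTPUT_W, OUTPUT_B, sigmoid)  # output layer

print(forward([1.0, 0.0, -1.0]))
```

Each layer only sees the output of the layer before it, which is the property the figure is meant to convey.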

Conversely, the research of neural networks has also promoted the development of cognitive science, and neural networks can help understand the human brain. The tasks of classification, clustering, and vector creation of words and documents mentioned before can all be accomplished by machine learning algorithms implemented by neural networks.

Neural networks have also achieved great success outside the field of text analysis. Current research results in image classification, machine vision, speech recognition, and medical diagnosis are usually obtained with neural networks. As mentioned earlier, neural networks can generate word vectors: the values stored in the hidden layer in the figure can be used as word vectors.

This section introduces deep learning by extending the discussion of neural networks. Deep learning is essentially a neural network with multiple layers. Most neural networks in use today have such a multi-layer structure, so they count as deep learning. There are exceptions: in Word2Vec, for example, we only take the weights from a single layer.

Neural networks and deep learning have applications in many fields. Although this book does not explain their mathematics in detail, it still treats them as a preferred solution for natural language processing, so we will introduce how to apply deep learning to text analysis.

13.2 Application of deep learning to text

When learning word embeddings, we already saw some of the power of neural networks. There, we obtained useful information from the structure of the network itself, namely its weights, but neural networks can do more than that. With deeper networks, extracting useful information from the weights is no longer practical; instead, we are more interested in the actual output of the network. We can train neural networks to perform many tasks related to text analysis. In fact, for some of these tasks, neural networks have completely changed the way we approach them.

One of the best-known deep learning use cases is machine translation, especially Google's neural translation model. Until September 2016, Google used statistical and rule-based methods and models for language translation, but the Google Brain research team then switched to neural networks and a technique called zero-shot translation. Previously, to translate from Malay to Arabic, Google would first translate the source language into English as an intermediate language. With the neural network, the model accepts a source-language input sentence and does not immediately output the translated target sentence; instead, a set of scoring mechanisms, such as grammar checking, runs behind it. Compared with the traditional approach, which splits the source sentence, applies rule-based translation to the pieces, and then reassembles them into a sentence, the deep translation model is more concise. Although the deep model requires more training data and longer training time, its model file is still smaller than that of the statistical translation model. More and more language pairs are being switched to deep models, and the results surpass those of the earlier models; the newly released Hindi translation model is a notable example.

Although machine translation technology has made great progress, it still has many shortcomings. For example, users need more grammatically accurate translation results, and current translation systems can only provide results in the target language with close semantics. Just as deep models shine in other fields, people also hope that neural networks can greatly improve the quality of machine translation.

Word embedding is another very popular application of neural networks in text processing. Because word vectors and document vectors are used in so many NLP tasks, word embeddings have a place in many machine learning algorithms that involve text. In fact, once the earlier vector representations are replaced with word embeddings, every such algorithm or application includes a neural network, which can capture the contextual information of words and help improve classification and clustering.

Neural networks are widely used in classification and clustering tasks. Text classification is indispensable in many complex scenarios, such as chatbots. Sentiment analysis is essentially a classification task: deciding whether the sentiment of a text is positive or negative (or one of several finer-grained emotions). Complex networks such as convolutional neural networks and recurrent neural networks can be used for these text classification tasks, although even the simplest single-layer neural network can achieve good prediction results.
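To make the point about single-layer networks concrete, here is a minimal sketch of a one-layer classifier (logistic regression over bag-of-words counts) trained with plain gradient descent. The four training sentences, the learning rate, and the function names are invented for illustration; it shows the mechanics only and says nothing about accuracy on real data.

```python
import math

# Tiny made-up training set: (text, label) with 1 = positive, 0 = negative.
TRAIN = [
    ("great movie loved it", 1),
    ("wonderful acting great story", 1),
    ("terrible movie hated it", 0),
    ("boring story awful acting", 0),
]

vocab = sorted({word for text, _ in TRAIN for word in text.split()})
index = {word: i for i, word in enumerate(vocab)}

def featurize(text):
    """Bag-of-words count vector; unknown words are ignored."""
    vec = [0.0] * len(vocab)
    for word in text.split():
        if word in index:
            vec[index[word]] += 1.0
    return vec

def predict_proba(weights, bias, vec):
    """Probability of the positive class from a single sigmoid unit."""
    z = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=200, lr=0.5):
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for text, label in data:
            vec = featurize(text)
            error = predict_proba(weights, bias, vec) - label  # gradient of the log loss
            weights = [w - lr * error * x for w, x in zip(weights, vec)]
            bias -= lr * error
    return weights, bias

weights, bias = train(TRAIN)
for text in ["great story", "awful movie"]:
    print(text, round(predict_proba(weights, bias, featurize(text)), 3))
```

A probability above 0.5 is read as positive sentiment and one below 0.5 as negative.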

Looking back at the POS tagging and NER tagging introduced earlier, the taggers actually use neural networks to identify parts of speech and named entities, so we were already using deep learning when we tagged parts of speech with spaCy.

The mathematical principles of neural networks are beyond the scope of this book. When discussing different types of neural networks and how to use them, we only discuss their architecture, hyperparameters, and practical applications. Hyperparameters are the configurable parameters of a machine learning algorithm; their values usually need to be set before the algorithm is run.

For ordinary neural networks, and even convolutional neural networks, the sizes of the input and output spaces are fixed and set by the developer. The inputs and outputs can be images, sentences, or essentially any set of vectors. In natural language processing, the output vector often represents the probability that a document belongs to each category. A recurrent neural network is a neural network with a special architecture that accepts sequences as input and can handle prediction tasks far more complicated than classification. Recurrent neural networks are very common in text analysis because they treat the input as a sequence, which lets them capture the context of the words in a sentence.

Another application of neural networks to text is building probabilistic language models, which compute the probability of the next word (or character) given the preceding text. In other words, the model uses contextual information to calculate the probability of the current word. This idea was widely used before the emergence of neural networks, for example in n-gram models, and the working principle is similar. Traditional methods use a corpus or text database to estimate the co-occurrence probability of adjacent words. For example, we would consider New York a phrase because the two words co-occur with very high probability; the co-occurrence probability is calculated using conditional probability and the chain rule of probability.
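The n-gram idea can be sketched in a few lines. The example below estimates bigram probabilities by counting, then multiplies them together using the chain rule under a bigram approximation. The toy corpus and the function names are invented for illustration, and real systems add smoothing for unseen word pairs.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = "i love new york . new york is big . i love new shoes .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

def sentence_prob(words):
    """Chain rule with a bigram (first-order Markov) approximation."""
    prob = unigram_counts[words[0]] / len(corpus)
    for prev_word, word in zip(words, words[1:]):
        prob *= bigram_prob(prev_word, word)
    return prob

print(bigram_prob("new", "york"))   # 2 of the 3 occurrences of "new" precede "york"
print(bigram_prob("new", "shoes"))
print(sentence_prob(["i", "love", "new", "york"]))
```

Because "york" follows "new" more often than any other word in this corpus, the pair behaves like a phrase, which is the intuition behind the New York example.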

A neural network does not approach this by counting the occurrence probabilities of words and characters; instead, it learns a sequence generator, which makes it a generative model. Generative models for natural language are very interesting: because they learn which kinds of sentences are probable, they can be used to produce new text, including synthetic text data for training.

Word embedding is built on this idea: if the word blue is as likely as the word red to appear after the text "the wall is painted", the embedding will place the two words close together in the same semantic space. This kind of semantic representation later developed into shared representations, which map inputs that have the same meaning but different types into the same vector space. For example, the English word dog and the Chinese word for dog have the same meaning, so they can be mapped to very similar vectors in a shared Chinese-English vector space. The remarkable thing about neural networks is that, through training, they can even map images and text into the same space. Automatic captioning of images is one such technique.
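The intuition behind the blue/red example can be checked with simple counting. The sketch below describes each word by the words that appear around it and compares words with cosine similarity. It uses raw co-occurrence counts rather than learned embeddings, so it is a simplification of what Word2Vec does, and the sentences and function names are invented for illustration.

```python
import math
from collections import Counter

# Toy sentences, invented for illustration.
sentences = [
    "the wall is painted blue",
    "the wall is painted red",
    "the door is painted blue",
    "the door is painted red",
    "the dog is barking loudly",
]

def context_vector(target, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == target:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                counts.update(w for j, w in enumerate(words[lo:hi], lo) if j != i)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(context_vector("blue"), context_vector("red")))
print(cosine(context_vector("blue"), context_vector("dog")))
```

Words that occur in the same contexts, such as blue and red here, end up with nearly identical vectors, while a word used differently gets a lower score.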

Deep models that incorporate reinforcement learning (the technology of training models through rewards and punishments for learning errors) can already defeat humans in the game of Go, and Go was once considered the most difficult field for artificial intelligence to break through.

Text summarization is one of the earliest natural language processing tasks. The traditional solution is to rank sentences by how much information they provide and select a subset of them; this book used that approach in the chapter on text summarization. Deep learning can instead generate a piece of text directly, which is closer to how humans work: it skips the step of selecting key sentences and creates the summary directly from a probabilistic model. This technique is often referred to as Natural Language Generation (NLG).

The neural machine translation model mentioned earlier is a similar generative model, which directly generates sentences in the target language. Below, we use this kind of generative approach to construct our first deep model for text.

13.3 Text generation

The previous sections discussed deep learning and natural language processing at length, and noted that text generation techniques can produce convincing results. Next, we will implement an example of text generation.

The neural network structure we will use is a recurrent neural network, specifically an LSTM (long short-term memory network). This kind of network can capture both long-range and short-range context for each word. The most popular blog post about LSTMs is Understanding LSTM Networks, written by Christopher Olah (colah). Readers can learn more about the internal principles of LSTMs from that article.

Andrej Karpathy wrote about a similar architecture in his blog post The Unreasonable Effectiveness of Recurrent Neural Networks. His implementation is written in Lua, whereas our examples use Keras (a highly abstract deep learning framework).

The deep learning ecosystem based on the Python language is developing rapidly. According to the actual situation, developers can use a variety of methods to build a deep learning system. This book uses a relatively abstract high-level framework to easily show the reader the training process. In 2018, choosing a deep learning framework is not easy, so this book uses Keras as an example framework, but before that, let’s briefly discuss and compare the characteristics of various frameworks.

  • TensorFlow:  TensorFlow is a neural network framework released by Google, and it is also the framework used by Google Brain, Google's artificial intelligence team. Rather than being a purely commercial development tool, TensorFlow is maintained by an active open source community and supports running on GPUs. GPU support is a very important feature, because a GPU can perform the required mathematical operations much faster than an ordinary CPU. Because TensorFlow is based on a computational graph model, it fits neural network models well. The framework offers both high-level and low-level interfaces, and it is currently the most popular choice in industry and research.
  • Theano:  Theano is one of the earliest deep learning frameworks, developed at MILA (Montreal Institute for Learning Algorithms) under Yoshua Bengio, a pioneer of deep learning. It builds deep learning models from symbolic graphs, provides low-level interface operations, and is a very powerful deep learning system. Although its code is no longer maintained, it is still worth studying, even if only to understand this history. Lasagne and Blocks are high-level interfaces for Theano that abstract and encapsulate some of its low-level operations.
  • Caffe & Caffe2:  Caffe is one of the first frameworks dedicated to deep learning, developed at the University of California, Berkeley. The framework is fast and modular, but it can be a bit clumsy to use, because it was not developed in Python and you need to configure a .prototxt file to use a neural network. This extra step should not put readers off learning it, since the framework still has some excellent features.
  • PyTorch:  PyTorch is a framework based on the Lua-based Torch library. It has rapidly become an established member of the deep learning framework family. Its author, Facebook AI Research (FAIR), has released it to the open source community and provides several sets of APIs. Because it has attractive features such as dynamic computation graphs, readers are encouraged to look into it.
  • Keras:  Keras is the deep learning framework used in the examples in this book. With its high-level, abstract, and concise interfaces, it is considered the most suitable deep learning framework for prototyping. It supports both TensorFlow and Theano as backends. We will see how easy it is to use when implementing the text generation example. Keras also has a large and active community, and TensorFlow has announced that Keras will be packaged with a later release, which means that Keras will remain relevant for a long time to come.

Readers are encouraged to get to know each deep learning framework so that they can pick the best one for a given application scenario. The techniques underlying these frameworks are the same, so the logic and the text generation process carry over from one framework to another.

The example in this chapter uses a recurrent neural network. The advantage of this kind of network is that it can remember context. The state of the network at the current step is computed from the information passed on by the previous step, which is where the name recurrent comes from, and this lets it achieve better results on sequence data than other neural network structures.
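The way a recurrent network carries context forward can be shown with a single recurrent unit. In the sketch below, the hidden state at each step mixes the current input with the previous state; the weights 0.8 and 0.5 and the function name rnn_states are arbitrary choices for illustration, and an LSTM adds gating on top of this basic loop.

```python
import math

def rnn_states(inputs, w_in=0.8, w_rec=0.5):
    """Run a one-unit recurrent cell over a sequence and return every hidden state."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)   # new state mixes input and previous state
        states.append(h)
    return states

# The same values in a different order leave different traces in the state.
print(rnn_states([1.0, 0.0, 0.0]))
print(rnn_states([0.0, 0.0, 1.0]))
```

In the first sequence the state stays non-zero after the input has dropped to zero, which is the memory effect described above; it also fades with each step, which is the weakness that LSTMs are designed to reduce.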

We will implement the following example with an LSTM (long short-term memory network), a variant of the recurrent neural network that can retain information over long spans. LSTMs often achieve good results when the input is a time series. In natural language, each word is affected by the context of the sentence, which makes this property of LSTMs even more important: the network can take the context of the surrounding words into account while still remembering earlier words.

Readers who are interested in the mathematical principles behind RNNs and LSTMs can refer to the following two articles:

  • Understanding LSTM Networks
  • Unreasonable Effectiveness of Recurrent Neural Networks

The first step of the sample code is to load the necessary libraries. Make sure Keras and TensorFlow are installed in your local environment, using pip or conda.

The following code is adapted, with slight modifications, from a Jupyter Notebook:

import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import numpy as np

Here we use Keras's Sequential model and add an LSTM layer to it. The next step is to organize the training data. In theory, any text data can be used as input, depending on the kind of text we want to generate. This is where developers can be creative: an RNN can imitate the writing style of J. K. Rowling, Shakespeare, or even yourself, provided there is enough data.

Generating text with Keras requires a mapping from every distinct character to an integer to be constructed in advance (the example here is character-based). Suppose the input text is source_data.txt. In the sample code below, the values of all variables depend on the selected data set, but the code runs normally whatever text file is chosen.

filename    = 'data/source_data.txt'
data        = open(filename).read()
data        = data.lower()
# Find all the unique characters
chars       = sorted(list(set(data)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
ix_to_char  = dict((i, c) for i, c in enumerate(chars))
vocab_size  = len(chars)

Both dictionaries in the above code are needed: char_to_int to pass characters to the model, and ix_to_char to turn its predictions back into text. For a typical input, printing chars, vocab_size and char_to_int gives output like the following.

The content of the character set is as follows:

['\n', ' ', '!', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

The dictionary size is:

51

After mapping to id, the dictionary content is as follows:

{'\n': 0, ' ': 1, '!': 2, '&': 3, "'": 4, '(': 5, ')': 6, ',': 7, '-': 8, '.': 9, '0': 10, '1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20, ';': 21, '?': 22, '[': 23, ']': 24, 'a': 25, 'b': 26, 'c': 27, 'd': 28, 'e': 29, 'f': 30, 'g': 31, 'h': 32, 'i': 33, 'j': 34, 'k': 35, 'l': 36, 'm': 37, 'n': 38, 'o': 39, 'p': 40, 'q': 41, 'r': 42, 's': 43, 't': 44, 'u': 45, 'v': 46, 'w': 47, 'x': 48, 'y': 49, 'z': 50}
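The two mappings are inverses of each other, which can be verified with a round trip on a short string. This sketch reuses the variable names from the listing above but builds them from a made-up sample string rather than from source_data.txt.

```python
text = "hello world\n"   # made-up sample standing in for the source file

chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
ix_to_char = {i: c for i, c in enumerate(chars)}

# Encode the text as integers, then decode it again.
encoded = [char_to_int[c] for c in text]
decoded = "".join(ix_to_char[i] for i in encoded)

print(chars)
print(encoded)
assert decoded == text
```

Sorting the character set makes the mapping reproducible from run to run, and it is also why the newline character ends up with index 0.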

An RNN accepts character sequences as input and outputs similar sequences. We now split the source data into fixed-length input sequences, each paired with the character that follows it:

seq_length = 100
list_X = []
list_Y = []
# Slide a window over the text: seq_length characters in, the next character out
for i in range(0, len(data) - seq_length, 1):
    seq_in = data[i:i + seq_length]
    seq_out = data[i + seq_length]
    list_X.append([char_to_int[char] for char in seq_in])
    list_Y.append(char_to_int[seq_out])
n_patterns = len(list_X)
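The windowing step is easier to see on a short string with a small window. The helper name make_windows and the sample text below are made up for illustration; the logic is the same sliding window, with each window paired with the character that follows it.

```python
def make_windows(text, seq_length):
    """Return (input window, next character) pairs for every position."""
    pairs = []
    for i in range(len(text) - seq_length):
        pairs.append((text[i:i + seq_length], text[i + seq_length]))
    return pairs

for seq_in, seq_out in make_windows("hello world", seq_length=4):
    print(repr(seq_in), "->", repr(seq_out))
```

A text of n characters yields n - seq_length training pairs, so even a modest corpus produces a large number of examples.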

The lists need further processing to convert them into the format the model expects as input:

# Reshape to (samples, time steps, features), as the LSTM layer expects
X = np.reshape(list_X, (n_patterns, seq_length, 1))
# Encode output as one-hot vector
y = np_utils.to_categorical(list_Y)

Because each prediction outputs a single character, character-level one-hot encoding is essential. This example uses np_utils.to_categorical for the encoding. For example, the letter m, which has index 37, is encoded like this:

[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
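One-hot encoding itself is simple enough to write by hand. The helper below is a plain-Python sketch of what the encoding does for a single index; the name one_hot is made up, and it is not how Keras implements to_categorical internally.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

# The letter m has index 37 in the 51-character vocabulary of the running example.
encoded_m = one_hot(37, 51)
print(encoded_m.index(1.0), len(encoded_m), sum(encoded_m))
```

The vector has one slot per character in the vocabulary, so its length always equals vocab_size.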

Now we can create the neural network model:

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

The above example creates a model with a single LSTM layer of 256 units, followed by a Dropout layer with a dropout rate of 0.2 and a Dense output layer with a softmax activation function. The optimization algorithm is Adam.
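To get a feel for the size of this model, we can count its trainable parameters by hand. The sketch below assumes the standard LSTM formulation with bias terms (four gates, each with input weights, recurrent weights, and a bias) and a vocabulary of 51 characters as in the running example; the function names are made up for illustration.

```python
def lstm_params(units, input_dim):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * input_dim + units * units + units)

def dense_params(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs

units, input_dim, vocab_size = 256, 1, 51   # values from the running example
total = lstm_params(units, input_dim) + dense_params(units, vocab_size)
print(lstm_params(units, input_dim), dense_params(units, vocab_size), total)
```

The Dropout layer adds no parameters of its own, so almost all of the model's capacity sits in the recurrent weights of the LSTM layer.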

Dropout is used to reduce overfitting, which is when a neural network performs well only on the data set it was trained on. The activation function determines how the output value of a neuron is computed from its input, and the optimization algorithm helps the network reduce the error between the predicted value and the true value.
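Two of these pieces are small enough to sketch directly. The code below shows a softmax function, which turns raw scores into a probability distribution, and one common form of dropout (inverted dropout), which zeroes a random fraction of values and rescales the rest. Both are plain-Python illustrations with made-up inputs, not the Keras implementations.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])

rng = random.Random(0)
print(dropout([1.0] * 10, rate=0.2, rng=rng))
```

Dropout is applied only during training; at prediction time every unit is kept, which is why the surviving values are scaled up during training to keep the expected total the same.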

Choosing the values of these hyperparameters is a matter of practical experience. In the next chapter, we will briefly introduce how to select appropriate hyperparameter values for text processing tasks. For now, you can treat hyperparameter selection as a black box. The hyperparameters used here are all standard choices for generating text with Keras.

Training the model is simple: as in scikit-learn, we just call the fit function:

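filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
# Save the weights whenever the training loss improves
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1,
                             save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model; scale the integer codes to [0, 1), matching the scaling applied at generation time
model.fit(X / float(vocab_size), y, epochs=20, batch_size=128, callbacks=callbacks_list)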

The fit function trains on the input for the given number of epochs (20 here), and the callback saves the weights whenever the loss improves on the best value seen so far.
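The effect of save_best_only=True with mode='min' can be sketched without Keras. The function below walks through a list of per-epoch losses and reports which checkpoint files such a policy would write, using the same file name pattern as the listing above. The loss values are hypothetical, and numbering epochs from 1 is an assumption of this sketch.

```python
FILEPATH = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

def checkpoints_written(losses):
    """Return the file names a save-best-only policy would write, in order."""
    best = float("inf")
    written = []
    for epoch, loss in enumerate(losses, start=1):   # epochs numbered from 1 here
        if loss < best:                               # mode='min': lower is better
            best = loss
            written.append(FILEPATH.format(epoch=epoch, loss=loss))
    return written

# Hypothetical per-epoch losses, not results from the book's model.
for name in checkpoints_written([2.91, 2.75, 2.80, 2.60]):
    print(name)
```

No file is written for the third epoch because its loss is worse than the best value so far, so the most recent file on disk always holds the best weights.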

How long the fit function takes depends on the size of the training set; training often takes several hours or even days.

Alternatively, instead of training from scratch, we can load the weights of an already trained model:

filename = "weights.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

We now have a trained model and can start generating character-level text sequences.

start   = np.random.randint(0, len(X) - 1)
pattern = np.ravel(X[start]).tolist()

To make the generated text vary from run to run, the seed sequence above is picked at random with NumPy. The loop below then predicts 250 characters, one at a time:

output = []
for i in range(250):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_size)
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    output.append(index)
    # Slide the window: append the predicted character and drop the oldest one
    pattern.append(index)
    pattern = pattern[1:]
print('"' + ''.join(ix_to_char[value] for value in output) + '"')

As you can see, given the current input sequence x, the model predicts the next character, choosing the one with the highest probability (the argmax function returns the id of the most probable character). The index is appended to the output list and later converted back into a character. The loop runs once per generated character, so the number of iterations (250 here) determines how much text we see in the output.
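Always taking the argmax means the same context always produces the same next character, which can make the output repetitive. A common alternative, which the code above does not use, is to sample the next character from the predicted distribution, optionally reshaped by a temperature parameter. The sketch below compares the two on a made-up three-character distribution; the function names are chosen for illustration.

```python
import math
import random

def greedy(probs):
    """Index of the most probable character (what np.argmax returns)."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, temperature, rng):
    """Draw an index after sharpening (<1) or flattening (>1) the distribution."""
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.1, 0.6, 0.3]            # made-up model output over three characters
rng = random.Random(42)
print(greedy(probs))
print([sample(probs, 1.0, rng) for _ in range(10)])
```

With a temperature close to zero, sampling behaves like argmax; higher temperatures give more varied but less predictable text.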

The network in this LSTM example is not complicated, and readers can add more layers to it to achieve better predictions than this example. Even a simple model improves after training for many epochs. Andrej Karpathy's blog demonstrates this, with experimental results for models trained on Shakespeare and on the Linux code base.

Preprocessing the input data and training for more epochs can also improve the predictions. Increasing the number of layers or the number of training epochs also increases the time cost of training. If readers just want to experiment with RNNs rather than build a scalable production model, Keras is enough.

13.4 Summary

This chapter demonstrated the power of deep learning. We trained a text generator whose output comes close to human writing in grammar and spelling. Creating a more realistic chatbot would require further tuning and additional logic.

Although text generated at this quality is far from perfect, neural networks can produce satisfactory predictions in other text analysis scenarios, such as text classification and clustering. The next chapter will explore the use of Keras and spaCy for text classification.

Before ending this chapter, readers are recommended to read the following articles to deepen their understanding of deep learning text generation techniques:

  • NLP Best Practices
  • Deep Learning and Representations
  • Unreasonable Effectiveness of Recurrent Neural Networks
  • Best of 2017 for NLP and DL

This article is taken from "Natural Language Processing and Computational Linguistics"

 

This book describes how to use natural language processing and computational linguistic algorithms to reason about the data you have and gain insights. These algorithms are based on statistical machine learning and artificial intelligence techniques. Tools that use these algorithms are now readily available and can be used in tools such as Python, Gensim, and spaCy.

This book starts with data cleaning and then introduces related concepts of computational linguistics. After mastering these topics, you can work with real language and text, and use Python to explore more complex areas of statistical NLP and deep learning. You will learn how to use appropriate tools to annotate, parse, and model texts, and master the corresponding frameworks. You will also know when to choose a tool like Gensim for topic modeling, and when to use Keras for deep learning.

This book balances theory and practical cases, so you can carry out your own natural language processing projects while mastering the theory. You will discover Python's rich ecosystem of natural language processing tools and enter the interesting world of modern text analysis.
