Encoder and Decoder with Attention Model

The Encoder-Decoder with Attention model is a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure. It uses a multilayered Gated Recurrent Unit (GRU) to map the input sequence to a vector of fixed dimensionality, and then another deep GRU to decode the target sequence from that vector.

A sequence-to-sequence model has two parts – an encoder and a decoder. Both parts are practically two different neural network models combined into one giant network. The task of the encoder network is to understand the input sequence and create a smaller-dimensional representation of it. This representation is then forwarded to a decoder network, which generates a sequence of its own to represent the output. The input is passed through the encoder model, which gives us the encoder output. Each input word is assigned a weight by the attention mechanism, which the decoder then uses to predict the next word in the sentence. We use Bahdanau attention for the encoder.
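
To make the weighting step concrete, here is a minimal NumPy sketch of Bahdanau-style (additive) attention. The names (encoder_outputs, decoder_hidden, W1, W2, v) and the sizes are illustrative assumptions, not code from any particular tutorial:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(decoder_hidden, encoder_outputs, W1, W2, v):
    """Additive (Bahdanau) attention.

    decoder_hidden:  (hidden_dim,)         current decoder state
    encoder_outputs: (seq_len, hidden_dim) one vector per input word
    W1, W2:          (hidden_dim, attn_dim) learned projections
    v:               (attn_dim,)            learned scoring vector
    """
    # score each encoder output against the current decoder state
    scores = np.tanh(encoder_outputs @ W1 + decoder_hidden @ W2) @ v  # (seq_len,)
    weights = softmax(scores)                # one weight per input word
    context = weights @ encoder_outputs      # weighted sum of encoder outputs, (hidden_dim,)
    return context, weights

# toy example: 5 input words, hidden size 8, attention size 16
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))
dec = rng.normal(size=(8,))
W1, W2, v = rng.normal(size=(8, 16)), rng.normal(size=(8, 16)), rng.normal(size=(16,))
context, weights = bahdanau_attention(dec, enc, W1, W2, v)
print(weights)  # sums to 1; the decoder uses `context` to help predict the next word
```

In a real model the weight matrices are trained jointly with the encoder and decoder; here they are random only so the snippet runs on its own.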

One type of network built with attention is called a transformer (explained below). If you understand the transformer, you understand attention. And the best way to understand the transformer is to contrast it with the neural networks that came before. They differ in the way they process input (which in turn contains assumptions about the structure of the data to be processed, assumptions about the world) and automatically recombine that input into relevant features.

Recurrent Networks and LSTMs
How do words work? Well, for one thing, you say them one after another. They are expressed in a two-dimensional line that is somehow related to time. That is, reading a text or singing the lyrics of a song happens one word at a time as the seconds tick by.

Compare that to images for a moment. An image can be glimpsed in its totality in an instant. Most images contain three-dimensional data, at a minimum. If you consider each major color to be its own dimension, and the illusion of depth, then the image contains many more than two. And if you’re dealing with video (or life), you’ve added the dimension of time as well.

One neural network that showed early promise in processing two-dimensional processions of words is called a recurrent neural network (RNN), in particular one of its variants, the Long Short-Term Memory network (LSTM).

RNNs process text like a snow plow going down a road. One direction. All they know is the road they have cleared so far. The road ahead of them is blinding white; i.e. the end of the sentence is totally unclear and gives them no additional information. And the remote past is probably getting a little snowed under already; i.e. if they are dealing with a really long sentence, they have probably already forgotten parts of it.

So RNNs tend to have a lot more information to make good predictions by the time they get to the end of a sentence than they do at the beginning, because they are carrying more context about the next word they want to predict. (But some of the context needed to predict that word might be further down the sentence, in a stretch they haven't fully plowed yet.) And that makes for bad performance.

RNNs basically understand words they encounter late in the sentence given the words they have encountered earlier. (This is partly corrected by sending snow plows down the street in both directions with something called bidirectional LSTMs, as sketched below.)
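
For concreteness, a bidirectional LSTM in PyTorch is just the same recurrent layer run over the sequence in both directions, with the forward and backward states concatenated; the layer sizes below are arbitrary:

```python
import torch
import torch.nn as nn

# one "snow plow" in each direction: a forward and a backward pass over the sequence
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True, bidirectional=True)

x = torch.randn(1, 10, 32)   # a batch of 1 sentence, 10 words, 32-dim embeddings
outputs, (h, c) = lstm(x)
print(outputs.shape)          # (1, 10, 128): forward and backward hidden states concatenated
```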

RNNs also have a memory problem. There’s only so much they can remember about long-range dependencies (the words they saw a long time ago that are somehow related to the next word).

That is, RNNs put too much emphasis on words being close to one another, and too much emphasis on upstream context over downstream context.

Attention fixes that.

Attention Mechanisms

Attention takes two sentences, turns them into a matrix where the words of one sentence form the columns, and the words of another sentence form the rows, and then it makes matches, identifying relevant context. This is very useful in machine translation.

So that’s cool, but it gets better.
You don’t just have to use attention to correlate meaning between sentences in two different languages. You can also put the same sentence along the columns and the rows, in order to understand how some parts of that sentence relate to others. For example, where are my pronouns’ antecedents? This is called “self-attention”, although it is so common that many people simply call it attention.
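
Here is a toy sketch of that matrix, assuming each word has already been turned into an embedding vector: the words of one sentence form the rows, the words of the other form the columns, and each row is softmax-normalized into weights. Passing the same sentence for both arguments gives self-attention. The embeddings here are random placeholders, just so the example runs:

```python
import numpy as np

def attention_matrix(queries, keys):
    """queries: (m, d) embeddings of one sentence's words (rows)
    keys:       (n, d) embeddings of the other sentence's words (columns)
    Returns an (m, n) matrix whose rows are attention weights."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])  # similarity of every row word to every column word
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True) # each row sums to 1

rng = np.random.default_rng(0)
english = rng.normal(size=(6, 16))   # 6 words, 16-dim embeddings
french  = rng.normal(size=(7, 16))   # 7 words in the other language

cross = attention_matrix(english, french)    # translation-style attention between two sentences
self_ = attention_matrix(english, english)   # self-attention: same sentence on rows and columns
print(cross.shape, self_.shape)              # (6, 7) (6, 6)
```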

Linking pronouns to antecedents is an old problem, which resulted in this chestnut, mimicking the call and response of a protest march:
WHAT DO WE WANT?

Natural language processing!

WHEN DO WE WANT IT?

Sorry, when do we want what?
A neural network armed with an attention mechanism can actually understand what “it” is referring to. That is, it knows how to disregard the noise and focus on what’s relevant, how to connect two related words that in themselves do not carry markers pointing to each other.

So attention allows you to look at the totality of a sentence, the Gesamtbedeutung as the Germans might say, to make connections between any particular word and its relevant context. This is very different from the small-memory, upstream-focused RNNs, and also quite distinct from the proximity-focused convolutional networks.

Language is this two-dimensional array that somehow manages to express relationships inherent in life over many dimensions (time, space, colors, causation), but it can only do so by creating syntactic bonds among words that are not immediately next to each other in a sentence.

Attention allows you to travel through wormholes of syntax to identify relationships with other words that are far away — all the while ignoring other words that just don’t have much bearing on whatever word you’re trying to make a prediction about (Borges’s “idle details”).

Transformer

While attention was initially used in addition to other algorithms, like RNNs or CNNs, it has been found to perform very well on its own. Combined with feed-forward layers, attention units can simply be stacked to form encoders.

Transformers use attention mechanisms to gather information about the relevant context of a given word, and then encode that context in the vector that represents the word. So in a sense, attention and transformers are about smarter representations.
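
As an illustration of “attention plus feed-forward, stacked”, here is a minimal encoder layer built on PyTorch’s nn.MultiheadAttention. The dimensions and layer count are arbitrary, and the layer normalization and residual connections follow the standard Transformer recipe rather than any specific code from this post:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: self-attention followed by a feed-forward network."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # each word gathers context from the whole sentence
        x = self.norm1(x + attn_out)       # residual connection + normalization
        x = self.norm2(x + self.ff(x))     # position-wise feed-forward, again with a residual
        return x

# stacking a few of these blocks gives an encoder
encoder = nn.Sequential(*[EncoderLayer() for _ in range(3)])
tokens = torch.randn(1, 10, 64)            # 1 sentence, 10 words, 64-dim embeddings
print(encoder(tokens).shape)               # (1, 10, 64): one context-aware vector per word
```

The output has the same shape as the input, but each word’s vector now encodes its relevant context, which is the “smarter representation” described above.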

Feed-forward networks treat features as independent (gender, siblings); convolutional networks focus on relative location and proximity; RNNs and LSTMs have memory limitations and tend to read in one direction. In contrast to these, attention and the transformer can grab context about a word from distant parts of a sentence, both earlier and later than where the word appears, in order to encode information that helps us understand the word and its role in the system called a sentence.

There are various architectures for transformers. One of them uses key-value stores and a form of memory.

To quote Hassabis et al:

While attention is typically thought of as an orienting mechanism for perception, its “spotlight” can also be focused internally, toward the contents of memory. This idea, a recent focus in neuroscience studies (Summerfield et al., 2006), has also inspired work in AI. In some architectures, attentional mechanisms have been used to select information to be read out from the internal memory of the network. This has helped provide recent successes in machine translation (Bahdanau et al., 2014) and led to important advances on memory and reasoning tasks (Graves et al., 2016). These architectures offer a novel implementation of content-addressable retrieval, which was itself a concept originally introduced to AI from neuroscience (Hopfield, 1982).

In this architecture, you have a key, a value, and a search query. The query searches over the keys of all words that might supply context for it. Those keys are related to values that encode more meaning about the key word. Because any given word can have multiple meanings and relate to other words in different ways, it can have more than one query-key-value complex attached to it. That’s “multi-headed attention.”
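
A minimal NumPy sketch of that query-key-value mechanism with several heads follows. The projection matrices Wq, Wk, Wv and the split into heads are the standard formulation; the concrete sizes and random inputs are made up for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    """X: (seq_len, d_model) word vectors.
    Wq, Wk, Wv: (d_model, d_model) learned projections to queries, keys, values."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # project, then split the last dimension into independent heads
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # each head: queries search over keys, and the resulting weights mix the values
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))      # (heads, seq, seq)
    out = weights @ V                                                  # (heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)            # concatenate the heads

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))                  # 5 words, 32-dim vectors
Wq, Wk, Wv = (rng.normal(size=(32, 32)) for _ in range(3))
print(multi_head_attention(X, Wq, Wk, Wv, n_heads=4).shape)  # (5, 32)
```

Each head gets its own slice of the query, key, and value projections, so each head can attend to a different kind of relationship between words.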

One thing to keep in mind is that the relation of queries to keys, and of keys to values, is differentiable. That is, an attention mechanism can learn to reshape the relationship between a search word and the words providing its context as the network trains.
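
A quick way to see that, assuming PyTorch: gradients flow back through the softmax of the query-key scores, so the projections that produce queries and keys receive updates during training. The matrices here are random stand-ins for learned parameters:

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 16)                        # 5 word vectors
Wq = torch.randn(16, 16, requires_grad=True)  # query projection (learnable)
Wk = torch.randn(16, 16, requires_grad=True)  # key projection (learnable)

scores = (X @ Wq) @ (X @ Wk).T / 16 ** 0.5    # query-key similarities
weights = torch.softmax(scores, dim=-1)       # attention weights
loss = (weights @ X).sum()                    # any loss built on the attended output

loss.backward()                               # gradients reach the projections...
print(Wq.grad.abs().sum() > 0)                # ...so the query-key relationship is learnable
```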

Reposted from blog.csdn.net/weixin_44064649/article/details/103314676