Attention is All You Need

The paper I’d like to discuss is Attention Is All You Need by Google. The paper proposes a new architecture, called the Transformer, that replaces RNNs with attention alone. The Transformer has revolutionized the NLP field, especially machine translation.

So what’s so special about the Transformer? It is the first sequence-to-sequence model without any recurrent component, relying entirely on attention to compute representations.

It shows strong results on translation, and since there is no recurrence, only matrix multiplications, it is faster to train on GPUs thanks to parallel computation.

What is Transformer? Before That…

Before we dive into the technical details, there are some prerequisite papers that I found useful when trying to figure out the Transformer:

Encoder-decoder Structure

If you don’t have any prior knowledge, that’s fine. Let’s start with the very basic encoder and decoder. Take a recurrent neural network (RNN) that encodes an input sequence of symbols into a sequence of continuous representations, the hidden state S. Given S, the decoder then generates an output sequence of symbols one element at a time. At each step the model consumes the previously generated prediction as additional input when generating the next. The actual Transformer structure is much more complicated than this.

source: http://ruder.io/deep-learning-nlp-best-practices/
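
Here is a minimal Python sketch of that encode-then-decode loop, just to make the idea concrete; `encoder_step`, `decoder_step`, and `output_layer` are hypothetical stand-ins for learned networks, not anything from the paper:

```python
import numpy as np

def encode(input_ids, encoder_step, hidden_size=512):
    """Run the encoder RNN over the whole input and return the final hidden state S."""
    s = np.zeros(hidden_size)
    for token_id in input_ids:
        s = encoder_step(token_id, s)  # hypothetical RNN cell update
    return s

def decode(s, decoder_step, output_layer, bos_id=1, eos_id=2, max_len=50):
    """Generate the output one symbol at a time, feeding each prediction back in."""
    hidden, prev_token, outputs = s, bos_id, []
    for _ in range(max_len):
        hidden = decoder_step(prev_token, hidden)           # consume the previous prediction
        prev_token = int(np.argmax(output_layer(hidden)))   # greedy choice of the next symbol
        if prev_token == eos_id:
            break
        outputs.append(prev_token)
    return outputs
```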

At a high level, the Transformer consists of three parts: encoders, decoders, and the connections between them. Both the encoder and the decoder are composed of a stack of N = 6 identical layers to enhance performance.

Compared to the previous slide, the encoder-decoder structure is still the same. We still encode the input, but now using attention instead of RNNs.

source: http://jalammar.github.io/illustrated-transformer/

RNNs vs Attention Mechanism

Let’s compare RNNs and attention. RNNs cram everything they know about a sequence into the final hidden state of the network, so the decision layer can only access the memory of the corresponding time step. An attention mechanism, on the other hand, takes into account the inputs from several time steps and assigns different weights to those inputs to make one prediction.


source: https://skymind.ai/wiki/attention-mechanism-memory-network

Encoder-Decoder with Attention Mechanism

Using attention in an encoder-decoder structure is not new. The idea is that attention acts as the only channel for information to flow from the encoder to the decoder, letting the decoder decide which encoder outputs to attend to and how much weight to give each of them.

With the output vectors from the encoder side, you query each output, asking how relevant it is to the current computation on the decoder side. Each encoder output then gets a relevance score, which we can turn into a probability distribution that sums to one via the softmax activation. We can then extract a context vector that is a weighted sum of the encoder outputs, weighted by how relevant we think they are.

source: http://phontron.com/class/nn4nlp2017/assets/slides/nn4nlp-09-attention.pdf
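
To make that concrete, here is a minimal numpy sketch of the idea; the names `decoder_state` and `encoder_outputs` are just illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(decoder_state, encoder_outputs):
    """encoder_outputs: [src_len, d]; decoder_state: [d].
    Score each encoder output against the decoder state, turn the scores
    into a probability distribution, and return the weighted sum."""
    scores = encoder_outputs @ decoder_state   # one relevance score per source position
    weights = softmax(scores)                  # sums to one
    return weights @ encoder_outputs           # weighted combination of encoder outputs
```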

Calculating Attention with Key, Value, and Query

If we look at how attention is actually calculated, there are three important vectors: the query, the keys, and the values. The key and value vectors are split from the hidden state by a non-trivial transformation, as the graph indicates here.

An attention function computes the compatibility of the query with the key vectors and uses the result to retrieve the corresponding values.

Source-target-attention vs Self-attention (divided by where the inputs come from)

If we look at where the query vector comes from, there are two kinds of attention: Source-Target-Attention, a.k.a. Encoder-Decoder Attention, and Self-Attention. Again, all of this existed before the Transformer was proposed.

For Source-Target-Attention, the Key and Value come from the encoder’s hidden layer (i.e. the source sentence), while the Query comes from the decoder’s hidden layer (i.e. the target sentence). This is the Encoder-Decoder Attention we just talked about in the previous slide.

The other one on the right is called Self-Attention: the Query, Key, and Value all come from the same place (that’s why it’s called “self”). For example, in the encoder, the Query, Key, and Value all come from the output of the previous layer.

Attention Score Functions


So how do we actually calculate the scores? There are two kinds of scoring functions. The older one is called additive attention, which uses a feed-forward network to compute the attention weight. The newer one is called dot-product attention; since it needs no extra parameters, it is faster and more space-efficient. The Transformer uses this type of scoring function.
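
Roughly, the two scoring functions look like this (a sketch; the parameter names `W_q`, `W_k`, and `v` are illustrative, and the Transformer additionally scales the dot product, as described in the next section):

```python
import numpy as np

def additive_score(query, key, W_q, W_k, v):
    """Additive attention: a small feed-forward network with learned parameters."""
    return v @ np.tanh(W_q @ query + W_k @ key)

def dot_product_score(query, key):
    """Dot-product attention: no extra parameters, just a dot product."""
    return query @ key
```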

Self-Attention Scores

With that in mind, we can now look at how self-attention in the Transformer is actually computed, step by step. What is special about self-attention is that it allows each word to look at the other positions in the input sequence. This helps the Transformer get a better encoding for each word by scoring every word of the input sentence against it. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

  • Get the word embedding for each word, then create three vectors from the embedding: a query vector, a key vector, and a value vector.
  • Calculate score by taking the dot product of the query vector with the key vector of the respective word we’re scoring.
  • Divide the scores by square root of the dimension of the key vectors to normalize the value to get more stable gradients.
  • Pass the result through a softmax operation to normalize the scores so they add up to 1.
  • Multiply each value vector by the softmax score to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words.
  • The final step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

In the actual implementation, the calculation is done in matrix form, stacking the word vectors into a matrix for faster processing; each row in the matrix represents a word.
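
Here is a minimal numpy sketch of that matrix form; the projection matrices `W_q`, `W_k`, `W_v` would be learned in practice, and the shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: [seq_len, d_model], one row per word embedding."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # step 1: query/key/value vectors for every word
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # steps 2-3: dot products, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)         # step 4: each row sums to 1
    return weights @ V                         # steps 5-6: weighted sum of value vectors

# Example usage with random placeholder weights:
X = np.random.randn(5, 512)                    # 5 words, d_model = 512
W_q, W_k, W_v = (np.random.randn(512, 64) * 0.01 for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)           # [5, 64], one output row per word
```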

We can also learn more features with multi-headed attention. Each head has its own Q/K/V projection matrices, each initialized randomly. After training, each set projects the input embeddings (or the vectors from lower encoders/decoders) into a different representation subspace and can learn different features. For example, some heads might learn semantic features while others learn syntactic dependencies.
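
A sketch of multi-head attention, reusing the `self_attention` helper above: each head has its own projections, and the concatenated head outputs are projected back with an output matrix (here called `W_o`, also learned; all names and shapes are illustrative):

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """heads: a list of (W_q, W_k, W_v) tuples, one per attention head."""
    outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1) @ W_o  # concatenate heads, project back to d_model

# Example: 8 heads of size 64, recombined into d_model = 512
X = np.random.randn(5, 512)
heads = [tuple(np.random.randn(512, 64) * 0.01 for _ in range(3)) for _ in range(8)]
W_o = np.random.randn(8 * 64, 512) * 0.01
out = multi_head_attention(X, heads, W_o)      # [5, 512]
```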

Three Kinds of Attention

If we zoom in, each encoder layer has two sub-layers while each decoder layer has three. There are three kinds of attention used in the Transformer’s encoder-decoder structure. The first is the multi-head self-attention mechanism in the encoder, and the second is the familiar encoder-decoder attention in the decoder, which performs multi-head attention over the output of the encoder stack. The decoder stack also has a self-attention layer, but it is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions, setting them to negative infinity before the softmax step.
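
The masking step can be sketched like this: positions a word is not allowed to look at are set to negative infinity in the score matrix, so they get zero weight after the softmax (a sketch, not the paper’s exact code):

```python
import numpy as np

def mask_future_positions(scores):
    """scores: [seq_len, seq_len]; position i may only attend to positions <= i."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    return np.where(future, -np.inf, scores)  # -inf becomes 0 after the softmax
```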

Residual Connection

Six layers is pretty deep, and the deeper a network is, the harder it is to train: the further a gradient has to travel, the more it risks vanishing or exploding. One solution for avoiding vanishing gradients is the residual connection. In the Transformer, each sub-layer in each encoder has a residual connection around it, followed by a layer-normalization step.
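
In other words, each sub-layer computes LayerNorm(x + Sublayer(x)); a minimal sketch (the learned scale and bias of layer normalization are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Residual connection around a sub-layer (attention or feed-forward), then layer norm."""
    return layer_norm(x + sublayer(x))
```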


The attention mechanism enhances this model by enabling the decoder to “glance back” at the input sentence at each decoding step. Each decoder output now depends not just on the last decoder state, but on a weighted combination of all the input states.

Positional Encoding

For a sequence-to-sequence model, order and position are important information. Besides embedding the input tokens, the Transformer also encodes the positions: each position gets its own encoding vector (fixed sinusoidal functions of different frequencies in the paper, though learned position embeddings work comparably), and the two are combined by adding the positional encoding to the word embedding to form the final input embedding.
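
A minimal numpy sketch of the sinusoidal encoding used in the paper (assuming an even `d_model`); the encoding is simply added to the word embeddings:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Each dimension pair uses a sine and cosine of a different frequency."""
    positions = np.arange(max_len)[:, None]                                  # [max_len, 1]
    freqs = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))  # [d_model/2]
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

# final_input = word_embeddings + positional_encoding(seq_len, d_model)
```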

Linear and Softmax Layer

To get the next predicted word, the last decoder output is passed through a fully connected layer that projects the vector produced by the stack of decoders into a much, much larger vector called the logits vector, whose size equals the vocabulary size.

The Softmax layer then turns those scores into probabilities. The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
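
A minimal sketch of that final step; `W_vocab`, `b_vocab`, and `vocab` are illustrative names for the learned projection and the vocabulary list:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_word(decoder_output, W_vocab, b_vocab, vocab):
    """Project to vocabulary-sized logits, softmax, and pick the most probable word."""
    logits = decoder_output @ W_vocab + b_vocab   # [vocab_size]
    probs = softmax(logits)                       # probabilities summing to 1
    return vocab[int(np.argmax(probs))]           # greedy choice for this time step
```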

