Attention is All You Need? Comprehend Transformer (I)


In this part, I would like to explain my personal understanding of the training process of the Transformer. First of all, here are some materials that I think contribute to a better understanding of this model.

As for the basic properties of attention, I suppose you have already encountered ideas like:

  • Unlike an auto-regressive CNN, which references things by position, the Transformer references things by content;
  • No RNN, LSTM or GRU structure is embedded; only attention.

In the paper Attention is All You Need, the structure of the Transformer is presented as in Figure 1. The diagram on the left is an overview of the whole system. We can see that multi-head attention is the innovative part of the Transformer; the other components, although they may look complex, are more conventional by comparison.


Suppose we want to translate English into Chinese, and we have the following two sample sentences:

English: Forget him, I would like to steal electric motorcycles to feed you.
Chinese: 忘了他,我偷电瓶车养你。


Before the Encoder Layer

Here we call the layer consisting of multi-head attention and a feed-forward network the encoder layer. As shown in Figure 1, it is repeated N times, where N = 6 in [1].


For sentences like the one given (Forget him, I would like to steal electric motorcycles to feed you.), the Transformer groups $BATCH\_NUM$ of them into a training batch, then pads them so that every sentence in the batch has exactly the same length ($MAX\_LEN$). Next, a word embedding layer transforms every word of these $BATCH\_NUM$ sentences into its corresponding word vector. This gives a matrix of size $BATCH\_NUM\times{MAX\_LEN}\times{WORD\_DIM}$.
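As a minimal sketch of this batching step (toy sizes and made-up token ids; the actual vocabulary, tokenizer and dimensions are not specified here):

```python
import numpy as np

BATCH_NUM, MAX_LEN, WORD_DIM = 2, 6, 8  # toy sizes; the paper uses WORD_DIM = 512
PAD_ID = 0

# Two toy sentences as token ids, of different lengths (hypothetical indices).
sentences = [[5, 3, 9, 2], [7, 4, 1, 6, 8, 2]]

# Pad every sentence to MAX_LEN with PAD_ID.
batch = np.full((BATCH_NUM, MAX_LEN), PAD_ID, dtype=np.int64)
for row, sent in enumerate(sentences):
    batch[row, :len(sent)] = sent

# Word embedding: a lookup table mapping each token id to a WORD_DIM vector.
VOCAB_SIZE = 10
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((VOCAB_SIZE, WORD_DIM))

embedded = embedding_table[batch]  # (BATCH_NUM, MAX_LEN, WORD_DIM)
```

The result is exactly the $BATCH\_NUM\times{MAX\_LEN}\times{WORD\_DIM}$ matrix described above.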

Positional encoding is used to add position information to the sentence matrix, since, intuitively, attention only refers to content, i.e. the relationship between two words, rather than to position. However, word position is an essential factor in sentence construction and comprehension. What the authors did here is create a position matrix of size $MAX\_LEN\times{WORD\_DIM}$, where every word position corresponds to a $WORD\_DIM$-dim vector. This matrix is generated according to the following rules:

$$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

Positional encoding thus depends on two elements: the position $pos$ of a certain word in a sentence, and the index $i$ of an element within its vector. For example, the word steal is the 8th word in the example sentence (punctuation included), so $pos = 8$. With $WORD\_DIM = 512$, the corresponding $WORD\_DIM$-dim (512-dim) vector is:

$$\left[\sin\left(8/10000^{0/512}\right),\ \cos\left(8/10000^{0/512}\right),\ \sin\left(8/10000^{2/512}\right),\ \cos\left(8/10000^{2/512}\right),\ \ldots\right]$$
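The sinusoidal rule can be sketched in NumPy as follows (a minimal version; variable names are my own):

```python
import numpy as np

def positional_encoding(max_len, word_dim):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000**(2i / word_dim))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / word_dim))
    """
    pos = np.arange(max_len)[:, None]         # (max_len, 1)
    i = np.arange(0, word_dim, 2)[None, :]    # (1, word_dim / 2), the even indices 2i
    angles = pos / np.power(10000.0, i / word_dim)
    pe = np.zeros((max_len, word_dim))
    pe[:, 0::2] = np.sin(angles)              # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)              # odd dimensions get cosine
    return pe

pe = positional_encoding(16, 512)
# pe[8] is then the 512-dim position vector for pos = 8 (the word "steal").
```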

After the word embedding layer, this word corresponds to a 512-dim word vector, and what is sent to the next stage (multi-head attention) is the sum of the position vector and the word vector. A 512-dim vector plus a 512-dim vector is simply another 512-dim vector: a quite simple operation, but one that seems to play an important role in the success of the Transformer. As for the padding tokens, their position indexes are zero.

In addition, the authors mention that they experimented with learned positional embeddings, but these performed almost the same as the sinusoidal version. Why the sinusoidal positional encoding works calls for further reading.

Multi-head Attention & Scaled Dot-product Attention

In this section, the two most important concepts in the Transformer emerge: multi-head attention and scaled dot-product attention. Briefly speaking, we only need to figure out two basic ideas: multi-head and scaled.


Here, multi-head is opposed to the single-head setup of more common attention models. The input to multi-head attention ($Prepro\_input$) is the output of section 2.1: every single word is a $WORD\_DIM$-dim vector, and every sentence consists of $MAX\_LEN$ such vectors. In single-head attention, we conduct self-attention only once with regard to these vectors, or to a linear transformation of them. For more information on basic 'single-head' self-attention, please refer to Stanford NLP Lecture 11 #27. In the encoder's attention mechanism, $Q$, $K$ and $V$ are all exactly the same as $Prepro\_input$. Before entering the self-attention block, they are processed by linear transformation blocks. Since the Transformer uses multi-head attention with $n\_head = 8$, there are $3\times{8}$ linear transformation blocks, each with its own weights and bias.

Actually, I have to admit that I feel a little uneasy and frustrated with the idea here (maybe I am wrong...). In attention models, $Q$ (query), $K$ (key) and $V$ (value) usually have distinct and clear physical definitions, which are quite apparent in a basic Seq2seq-with-attention model, where the query is a certain decoded vector and the key is the hidden vector of each word in the source sentence. Here, however, the difference between $Q$, $K$ and $V$ is vague. They certainly have different physical meanings, but from the point of view of the mathematical calculation they are almost the same, except for the weights and bias of their corresponding linear transformations. The distinction is left entirely to the weights and enforced only by back-propagation, which makes me think a more proper prior, or something like it, may be needed.

Scaled dot-product attention is quite straightforward. Unlike multiplicative attention and additive attention, dot-product attention has no parameters to learn during training. The only difference between basic dot-product attention and the scaled version is that in the scaled one, every attention score is divided by a constant, $\sqrt{d_{k}}$.

Basic dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{T}\right)V$$

Scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
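The scaled version can be sketched directly in NumPy (a minimal single-head version with random toy inputs):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V  -- no learned parameters."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (len_q, len_k) scaled scores
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query positions, d_k = 8
K = rng.standard_normal((6, 8))   # 6 key positions
V = rng.standard_normal((6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# out has one weighted combination of V's rows per query position.
```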

In summary for this part, the multi-head attention block is encoder self-attention and consists of the following steps:

  1. $n\_head$ scaled dot-product attentions;
  2. concatenate the results of the $n\_head$ scaled dot-product attentions;
  3. linearly project the concatenated result from dimension $n\_head\times{dim\_q}$ to $d\_model$;
  4. dropout;
  5. add the residual ($Q$) to the dropout result and conduct layer normalization.
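These steps can be sketched as follows (an inference-time sketch with random stand-in weights, so step 4, dropout, is omitted; all names are my own):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def multi_head_self_attention(x, n_head=8):
    """Steps 1-5 above for encoder self-attention (Q = K = V = x)."""
    max_len, d_model = x.shape
    dim_q = d_model // n_head
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(n_head):
        # Per-head linear transformations for Q, K and V (random stand-ins).
        Wq, Wk, Wv = (rng.standard_normal((d_model, dim_q)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        attn = softmax(Q @ K.T / np.sqrt(dim_q))     # step 1: scaled dot-product
        heads.append(attn @ V)
    concat = np.concatenate(heads, axis=-1)          # step 2: (max_len, n_head*dim_q)
    Wo = rng.standard_normal((n_head * dim_q, d_model))
    projected = concat @ Wo                          # step 3: project to d_model
    return layer_norm(x + projected)                 # step 5: residual + layer norm

x = np.random.default_rng(1).standard_normal((5, 64))
y = multi_head_self_attention(x, n_head=8)
```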

Position-wise Feed-forward Network

The position-wise feed-forward network can be regarded as a fully-connected feed-forward network, or as a convolution with kernel size 1. It can be expressed mathematically as:

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$

The parameters $W_1$, $b_1$, $W_2$ and $b_2$ are exactly the same across different words in different sentences, but vary from layer to layer. Therefore, I actually could not understand why this part is called a position-wise network. Maybe position-blind would be more suitable?

As for other details, the position-wise feed-forward network also uses dropout, the residual mechanism and layer normalization.
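A sketch of the two-layer transformation above, which also demonstrates that the same weights are applied to every position independently (toy dimensions; the paper uses $d_{model} = 512$ and an inner dimension of 2048):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, same weights at every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

x = rng.standard_normal((5, d_model))         # 5 word positions
y = position_wise_ffn(x, W1, b1, W2, b2)

# Feeding each position through separately gives the identical result,
# which is why the layer is "position-wise" (or, indeed, position-blind).
y_rowwise = np.stack([position_wise_ffn(x[i], W1, b1, W2, b2) for i in range(5)])
```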


The blocks described in 2.2 (multi-head attention) and 2.3 (position-wise feed-forward network) are stacked into a sequential structure. It outputs a $MAX\_LEN\times{WORD\_DIM}$ matrix which can be regarded as an abstract representation of the source sentence.


The decoder shares a similar structure with the encoder, except for two aspects: a masked multi-head self-attention, and a multi-head attention combining information from both the source sentence and the target sentence.


As for the decoder self-attention ($Deco\_atten1$ in Figure 4), it needs to be masked because in the decoder, the $i^{th}$ word cannot observe the $(i+1)^{th}$ word; therefore, all words after the $i^{th}$ word are masked. The mask used here is the combination of the padding mask and the post-target mask. For example, if we have a sentence of 5 words but the maximal length in this batch of sentences is 8, the overall $8\times{8}$ mask matrix (1 = may attend, 0 = masked) may look like this:

$$\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0\\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0
\end{bmatrix}$$

In other words, when we decode the $i^{th}$ word, it does not consider attention contributions from the words after it. All other details are exactly the same as the self-attention in the encoder.
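Building such a combined mask can be sketched as (my own convention: 1 = may attend, 0 = masked):

```python
import numpy as np

def decoder_mask(sent_len, max_len):
    """Combine the post-target (causal) mask with the padding mask."""
    # Causal part: position i may only attend to positions j <= i.
    causal = np.tril(np.ones((max_len, max_len), dtype=int))
    # Padding part: key positions beyond the real sentence length are hidden.
    padding = np.zeros((max_len, max_len), dtype=int)
    padding[:, :sent_len] = 1
    return causal * padding   # a position is visible only if both masks allow it

mask = decoder_mask(sent_len=5, max_len=8)
```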

$Deco\_atten2$ in Figure 4 is a more conventional attention rather than the self-attention mentioned above; it is the type used in the Seq2seq-with-attention model. Its $Q$ corresponds to the output of the decoder's masked multi-head attention (and its subsequent post-processing), while $K$ and $V$ both correspond to the encoder output. Attention scores are calculated from the dot-product of the encoder output and the decoder's masked multi-head attention output; these scores are then used to weight the encoder output.

Cost Function and Other Training Details

The output of the decoder is an array of size $BATCH\_NUM\times{MAX\_LEN}\times{d\_model}$, where $n\_head\times{dim\_q} = d\_model$. Last comes a linear transformation block, whose input is the $d\_model$-dim vector of each single word in this batch of sentences. It projects the $d\_model$-dim vector into a $VOCAB\_SIZE$-dim vector, and after a softmax, each dimension of this $VOCAB\_SIZE$-dim vector corresponds to the probability of this position using the corresponding word. For example, if the $10^{th}$ number in this $VOCAB\_SIZE$-dim vector is $0.8$, then the probability that this word is the $10^{th}$ word of the given vocabulary is $0.8$.
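The final projection and softmax can be sketched as (toy dimensions and random stand-in weights):

```python
import numpy as np

def output_probabilities(decoder_out, W_vocab):
    """Project each d_model-dim vector to VOCAB_SIZE logits, then softmax."""
    logits = decoder_out @ W_vocab                     # (..., VOCAB_SIZE)
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

d_model, VOCAB_SIZE = 8, 20
rng = np.random.default_rng(0)
W_vocab = rng.standard_normal((d_model, VOCAB_SIZE))

decoder_out = rng.standard_normal((3, 5, d_model))     # (batch, max_len, d_model)
probs = output_probabilities(decoder_out, W_vocab)
# Each position now holds a probability distribution over the vocabulary.
```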

The cost function is the cross-entropy loss between each word's real index in the vocabulary and the corresponding softmax output for that word. A batch of sentences is concatenated into one sequence of word indexes; we take the negative log of the softmax value at each real word index, multiply it by a weight, and sum all of these values up. The weights are defined so that padding words get weight zero and all other words get weight one. This process can be presented as the following formula:

$$L = -\sum_{t} w_{t}\log p_{t}(y_{t}), \qquad w_{t} = \begin{cases} 0 & \text{if word } t \text{ is padding} \\ 1 & \text{otherwise} \end{cases}$$
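The weighted loss can be sketched as (toy data; index 0 doubles as the padding id here):

```python
import numpy as np

PAD_ID = 0

def masked_cross_entropy(probs, targets):
    """Sum of -log p(true word), with zero weight on padding positions."""
    weights = (targets != PAD_ID).astype(float)        # 0 for padding, 1 otherwise
    picked = probs[np.arange(len(targets)), targets]   # softmax value of the true word
    return float((-np.log(picked) * weights).sum())

# Flattened batch of target word indexes; trailing zeros mark padding.
targets = np.array([3, 1, 2, 0, 0])
# Uniform softmax output over a toy 4-word vocabulary for each position.
probs = np.full((5, 4), 0.25)
loss = masked_cross_entropy(probs, targets)
# Only the 3 non-padding positions contribute, each -log(0.25).
```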


[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. 2017: 5998-6008.