Paper Review: Attention Is All You Need

Aakrit Singhal
Mar 22, 2021


Introduction

The Attention Is All You Need paper was a breakthrough for Natural Language Processing, especially for Machine Translation, because it takes a different route and relies entirely on the attention mechanism. The state-of-the-art approaches for sequence and language modelling were recurrent neural networks (RNNs), such as long short-term memory (LSTM) and gated recurrent networks (GRU). This paper establishes a new model architecture, the Transformer, which uses attention instead of recurrence to reach a new state of the art in translation quality.

Machine Translation maps a sequence to a sequence, typically with an encoder-decoder model. The problem is that the decoder needs a different set of values at each timestep, so it has to read everything before it in the source sentence in order to generate the target sentence. For instance, for a sentence such as "I like the colour red more than black", once the decoder has output 'red', it needs to know what red was being compared to, but it no longer needs to remember 'red' itself. The attention mechanism allows the decoder to look at the entire source sentence and selectively extract the information it needs while decoding. And while RNN-based architectures may work for shorter sentences, they are hard to parallelize and inefficient at learning long-term dependencies within the input and output sequences.

Hence, the Transformer uses multiple attention distributions and multiple outputs for a single input to address these problems. It also uses layer normalization and residual connections to make optimization easier, as summarized next.

Model Architecture

Figure 1: Model Architecture

In Figure 1 of the paper, the structure of a Transformer is laid out clearly. There are two parts to it: the structure on the left is the encoder, while the structure on the right is the decoder. Each encoder layer contains two sub-layers, a Multi-Head Attention mechanism and a position-wise Feed Forward network, and each sub-layer is wrapped in a residual connection followed by layer normalization.
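To make this pattern concrete, here is a minimal NumPy sketch (not the authors' code) of how each sub-layer is wrapped in a residual connection followed by layer normalization; multi_head_attention and feed_forward in the usage comment are hypothetical stand-ins for the two sub-networks described above.

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_fn):
    # Residual connection around the sub-layer, then layer normalization:
    # LayerNorm(x + Sublayer(x)), as described in the paper.
    return layer_norm(x + sublayer_fn(x))

# One encoder layer applies this twice (illustrative only):
# x = add_and_norm(x, lambda t: multi_head_attention(t, t, t))  # self-attention
# x = add_and_norm(x, feed_forward)                             # feed-forward network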

A residual connection takes the input of a sub-layer and adds it to that sub-layer's output. The source sentence is first turned into input embeddings (word vectors), from which encoding takes place. Positional encoding is added to these embeddings to give each token information about its relative/absolute position in the sequence, which is essentially what lets the model tell which word/token comes before another and so maintain the sequence order.
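As a small illustration of the positional encoding, here is a sketch of the sinusoidal formulation given in the paper (assuming an even d_model):

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even indices get sine
    pe[:, 1::2] = np.cos(angles)                     # odd indices get cosine
    return pe

# The resulting (seq_len, d_model) matrix is simply added to the input embeddings.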

The encoder contains self-attention layers, which means that the keys, values, and queries all come from the same place, making it possible for each position to attend to all positions in the previous layer of the encoder.

The decoder is similar to the encoder; however, it adds a Masked Multi-Head Attention sub-layer which, in summary, attends over the previous decoder states and masks future tokens while decoding a word, so that predictions for a position can depend only on the outputs that come before it (preserving the auto-regressive property). It therefore plays a similar role to the decoder hidden states in a traditional machine translation model.
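The masking itself can be pictured as an upper-triangular matrix added to the attention scores before the softmax; a minimal NumPy sketch (an illustration, not the authors' code):

import numpy as np

def causal_mask(seq_len):
    # Position i may only attend to positions 0..i; future positions get -inf,
    # so their attention weights become zero after the softmax.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

# For a 4-token target sequence, this produces a (4, 4) mask that is added to
# the decoder's self-attention scores, keeping decoding auto-regressive.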

The attention mechanism of the Transformer uses a Scaled Dot-Product Attention approach, as shown in Figure 2 (left) of the paper. This is how the weights on the values are obtained: the dot products of each query with all the keys are computed, each is divided by the square root of the key dimension, and a softmax function is then applied to get the weights.
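In other words, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch of this computation (the mask argument is optional and corresponds to the decoder mask above):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # scores: (..., seq_len_q, seq_len_k), scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask                       # e.g. the causal mask
    weights = softmax(scores, axis=-1)               # attention weights
    return weights @ V, weights

Multi-head attention simply runs several of these in parallel on learned linear projections of Q, K, and V and concatenates the results.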

Results

For training the architecture, the Adam optimizer was used, with a learning rate that is first increased and then decreased according to a formula. Residual dropout was used as a regularization technique to prevent overfitting, and label smoothing was used so that the model learns to be more unsure of its predictions; this hurts perplexity but improves accuracy and BLEU score, and it can also be regarded as preventing the model from becoming overconfident.
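The schedule from the paper increases the learning rate linearly for the first warmup_steps training steps and then decreases it proportionally to the inverse square root of the step number. A small sketch (default values taken from the paper's base model):

def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at step == warmup_steps and decays afterwards, e.g.:
# transformer_lr(100), transformer_lr(4000), transformer_lr(100000)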

In brief, the results surpassed RNN-based models and set new state-of-the-art records in terms of BLEU score for English-to-German and English-to-French machine translation. Even the base model performed better than previously published models, at a fraction of the training cost of any competitive model.

Conclusion

Hence, this paper shows that for translation tasks, Transformers can be trained significantly faster than recurrent or convolution-based neural networks. As mentioned in the paper, some problems with those models that the Transformer addresses are:

(i) Learning long-range dependencies in sequence transduction tasks, especially when forward and backward signals have to traverse long paths through the network (it is difficult to remember things over long distances).

(ii) When a sequence is processed by an RNN, each hidden state depends on the previous one, and this sequential dependency forces the GPU to wait for data to become available instead of computing in parallel.

(iii) Self-attention allows much more of the computation to be parallelized, as measured by the minimum number of sequential operations required.

References:

[1] Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems (2017).
