                                                   A meaningful representation of the input, you must encode 

This is Part 1/2 of Dissecting BERT written by Miguel Romero and Francisco Ingham. Each article was written jointly by both authors. If you already understand the Encoder architecture from Attention is All You Need and you are interested in the differences that make BERT awesome, head on to BERT Specifics.


Many thanks to Yannet Interian for her revision and feedback.

In this blog post, we are going to examine the Encoder architecture in depth (see Figure 1) as described in Attention Is All You Need. In BERT Specifics we will dive into the novel modifications that make BERT particularly effective.


                                                                                    Figure 1: The Encoder


Before we begin, let’s define the notation we will use throughout the article:

emb_dim: Dimension of the token embeddings. 


input_length: Length of the input sequence (the same in all sequences in a specific batch due to padding).


hidden_dim: Size of the Feed-Forward network’s hidden layer.


vocab_size: Number of words in the vocabulary (derived from the corpus).



The Encoder used in BERT is an attention-based architecture for Natural Language Processing (NLP) that was introduced in the paper Attention Is All You Need a year ago. The paper introduces an architecture called the Transformer which is composed of two parts, the Encoder and the Decoder. Since BERT only uses the Encoder we are only going to explain that in this blog post (if you want to learn about the Decoder and how it is integrated with the Encoder, we wrote a separate blog post on this).


Transfer learning has quickly become a standard for state of the art results in NLP since the release of ULMFiT earlier this year. After that, remarkable advances have been achieved by combining the Transformer with transfer learning. Two iconic examples of this combination are OpenAI’s GPT and Google AI’s BERT.


This series aims to:

  1. Provide an intuitive understanding of the Transformer and BERT’s underlying architecture.
  2. Explain the fundamental principles of what makes BERT so successful in NLP tasks.

To explain this architecture we will go from the general to the specific. We will start by looking at the information flow in the architecture, and then dive into the inputs and outputs of the Encoder as presented in the paper. Next, we will look into each of the encoder blocks and understand how Multi-Head Attention is used. Don't worry if you don't know what that is yet; we will make sure you understand it by the end of this article.


Information Flow

The data flow through the architecture is as follows:

  1. The model represents each token as a vector of emb_dim size. With one embedding vector for each of the input tokens, we have a matrix of dimensions (input_length) x (emb_dim) for a specific input sequence.
  2. It then adds positional information (positional encoding). This step returns a matrix of dimensions (input_length) x (emb_dim), just like in the previous step.
  3. The data goes through N encoder blocks. After this, we obtain a matrix of dimensions (input_length) x (emb_dim).

                                                                        Figure 2: Information flow in the Encoder

Note: The dimensions of the input and output of the encoder block are the same. Hence, it makes sense to use the output of one encoder block as the input of the next encoder block.


Note: In BERT's experiments, the number of blocks N (or L, as they call it) was set to 12 for BERT-Base and 24 for BERT-Large.


Note: The blocks do not share weights with each other.


From words to vectors


Tokenization, numericalization and word embeddings


                                               Figure 3: Where tokenization, numericalization and embeddings happen.

Tokenization, numericalization and embeddings are done the same way as with RNNs. Given a sentence in a corpus:


“Hello, how are you?”

The first step is to tokenize it:


“Hello, how are you?” → [“Hello”, “,”, “how”, “are”, “you”, “?”]

This is followed by numericalization, mapping each token to a unique integer in the corpus’ vocabulary.


[“Hello”, “,”, “how”, “are”, “you”, “?”] → [34, 90, 15, 684, 55, 193]
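The two steps above can be sketched in a few lines of Python. This is only an illustration: the regex tokenizer and the tiny `vocab` dictionary are our own stand-ins (BERT itself uses a WordPiece tokenizer), and the ids are simply the ones from the example.

```python
import re

# Hypothetical vocabulary; the ids are copied from the example above.
vocab = {"Hello": 34, ",": 90, "how": 15, "are": 684, "you": 55, "?": 193}

def tokenize(text):
    # Split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def numericalize(tokens, vocab):
    # Map each token to its integer id in the vocabulary.
    return [vocab[token] for token in tokens]

tokens = tokenize("Hello, how are you?")
ids = numericalize(tokens, vocab)
print(tokens)  # ['Hello', ',', 'how', 'are', 'you', '?']
print(ids)     # [34, 90, 15, 684, 55, 193]
```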

Next, we get the embedding for each token in the sequence. Each token is mapped to an emb_dim-dimensional vector that the model will learn during training. You can think of it as a vector look-up for each token. The elements of those vectors are treated as model parameters and are optimized with back-propagation just like any other weights.


Therefore, for each token, we look up the corresponding vector:


Stacking each of the vectors together we obtain a matrix Z of dimensions (input_length) x (emb_dim):
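As a rough sketch of this look-up, assume a randomly initialized embedding table (the sizes and the name `embedding_table` are ours, chosen only to keep the example small). In a real model the table would be a trainable parameter.

```python
import random

random.seed(0)
vocab_size, emb_dim = 1000, 4  # illustrative sizes, not the paper's

# Embedding table: one row of length emb_dim per vocabulary id.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(emb_dim)]
    for _ in range(vocab_size)
]

ids = [34, 90, 15, 684, 55, 193]

# Look up one vector per token and stack them into Z.
Z = [embedding_table[token_id] for token_id in ids]
print(len(Z), len(Z[0]))  # 6 4, i.e. (input_length) x (emb_dim)
```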


It is important to note that padding is used to make the input sequences in a batch have the same length. That is, we increase the length of some of the sequences by adding ‘<pad>’ tokens. The sequence after padding might be:


[“<pad>”, “<pad>”, “<pad>”, “Hello”, “,”, “how”, “are”, “you”, “?”] → [5, 5, 5, 34, 90, 15, 684, 55, 193]

if the input_length was set to 9.
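A minimal padding helper could look like the sketch below. It pads on the left only to match the example; many implementations pad on the right instead. The function name `pad_left` is ours.

```python
PAD_ID = 5  # id of "<pad>" in the example above

def pad_left(ids, input_length, pad_id=PAD_ID):
    # Prepend pad ids until the sequence reaches input_length.
    if len(ids) > input_length:
        raise ValueError("sequence is longer than input_length")
    return [pad_id] * (input_length - len(ids)) + ids

padded = pad_left([34, 90, 15, 684, 55, 193], input_length=9)
print(padded)  # [5, 5, 5, 34, 90, 15, 684, 55, 193]
```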



Positional Encoding

                                                             Figure 4: Where Positional Encoding is computed.

Note: In BERT the authors used learned positional embeddings. If you are only interested in BERT you can skip this section, where we explain the functions used to calculate the positional encodings in Attention Is All You Need.


At this point, we have a matrix representation of our sequence. However, these representations are not encoding the fact that words appear in different positions.


Intuitively, we aim to be able to modify the represented meaning of a specific word depending on its position. We don't want to change the full representation of the word but we want to modify it a little to encode its position.


The approach chosen in the paper is to add numbers in the range [-1, 1], computed with predetermined (non-learned) sinusoidal functions, to the token embeddings. Observe that now, for the rest of the Encoder, a word will be represented slightly differently depending on its position (even if it is the same word).


Moreover, we would like the Encoder to be able to use the fact that some words are in a given position while, in the same sequence, other words are in other specific positions. That is, we want the network to be able to understand relative positions and not only absolute ones. The sinusoidal functions chosen by the authors allow positions to be represented as linear combinations of each other and thus allow the network to learn relative relationships between the token positions.


The approach chosen in the paper to add this information is adding to Z a matrix P with positional encodings.

Z + P

The authors chose to use a combination of sinusoidal functions. Mathematically, using i for the position of the token in the sequence and j for the position of the embedding feature:
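Writing j for the index of the embedding feature, one way to write these functions, consistent with the formulation in Attention Is All You Need, is the following (p_{i,j} is our notation for the entry of P at row i and column j):

```latex
p_{i,j} =
\begin{cases}
\sin\left(\dfrac{i}{10000^{\,j/\text{emb\_dim}}}\right) & \text{if } j \text{ is even} \\[2ex]
\cos\left(\dfrac{i}{10000^{\,(j-1)/\text{emb\_dim}}}\right) & \text{if } j \text{ is odd}
\end{cases}
```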


More specifically, for a given sentence, the positional encoding matrix P would be as follows:


The authors explain that using this deterministic method instead of learning positional representations (just like we did with the tokens) led to similar performance. Moreover, this approach had some specific advantages over learned positional representations:


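  • The input_length is not capped by a learned table, since the functions can be calculated for any arbitrary position (the paper suggests this may let the model extrapolate to sequences longer than those seen in training).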
  • Fewer parameters needed to be learned and the model trained quicker.



The resulting matrix:

X = Z + P

is the input of the first encoder block and has dimensions (input_length) x (emb_dim).
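To make the shapes concrete, here is a small sketch that builds P with the sinusoidal functions from the paper and adds it to a stand-in for Z. The helper names and the constant matrix used for Z are our own assumptions; real token embeddings would come from the look-up described earlier.

```python
import math

def positional_encoding(input_length, emb_dim):
    P = [[0.0] * emb_dim for _ in range(input_length)]
    for i in range(input_length):      # position of the token
        for j in range(emb_dim):       # position of the embedding feature
            # Each even/odd pair of features shares one frequency.
            angle = i / (10000 ** ((j - j % 2) / emb_dim))
            P[i][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return P

def add_matrices(A, B):
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

P = positional_encoding(input_length=6, emb_dim=4)
Z = [[0.1] * 4 for _ in range(6)]  # stand-in for the token embeddings
X = add_matrices(Z, P)             # (input_length) x (emb_dim)
print(P[0])  # [0.0, 1.0, 0.0, 1.0]
```

Every entry of P lies in [-1, 1], and X keeps the same dimensions as Z.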


Encoder block

A total of N encoder blocks are chained together to generate the Encoder’s output. A specific block is in charge of finding relationships between the input representations and encoding them in its output.


                                                                           Figure 5: Encoder block.

Intuitively, this iterative process through the blocks will help the neural network capture more complex relationships between words in the input sequence. You can think about it as iteratively building the meaning of the input sequence as a whole.


Multi-Head Attention

                                                              Figure 6: Where Multi-Head Attention happens.

The Transformer uses Multi-Head Attention, which means it computes attention h different times with different weight matrices and then concatenates the results together.


The result of each of those parallel computations of attention is called a head. We are going to denote a specific head and the associated weight matrices with the subscript i.


                                              Figure 7: Illustration of the parallel heads computations and their concatenation

As shown in Figure 7, once all the heads have been computed they will be concatenated. This will result in a matrix of dimensions (input_length) x (h*d_v). Afterwards, a linear layer with weight matrix W^O of dimensions (h*d_v) x (emb_dim) will be applied, leading to a final result of dimensions (input_length) x (emb_dim). Mathematically:
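In the notation of Attention Is All You Need, where the output weight matrix is written W^O and the per-head projection matrices are written with superscripts Q, K and V, this reads:

```latex
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O
\qquad \text{where} \qquad
\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
```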


Where Q, K and V are placeholders for different input matrices. In this case, Q, K and V will all be replaced by X, the output matrix of the previous step.


Scaled Dot-Product Attention


Each head is going to be characterized by three different projections (matrix multiplications) given by matrices:
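Using the paper's superscript notation for the three weight matrices of head i, and the dimensions d_k and d_v used throughout this section:

```latex
W_i^K \in \mathbb{R}^{\text{emb\_dim} \times d_k}, \qquad
W_i^Q \in \mathbb{R}^{\text{emb\_dim} \times d_k}, \qquad
W_i^V \in \mathbb{R}^{\text{emb\_dim} \times d_v}
```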


To compute a head we will take the input matrix X and separately project it with the above weight matrices:
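Written out with the shapes of each result, assuming the paper's names W_i^K, W_i^Q and W_i^V for the three weight matrices:

```latex
K_i = X W_i^K \quad (\text{input\_length} \times d_k), \qquad
Q_i = X W_i^Q \quad (\text{input\_length} \times d_k), \qquad
V_i = X W_i^V \quad (\text{input\_length} \times d_v)
```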


Note: In the paper d_k and d_v are set such that d_k = d_v = emb_dim/h

Once we have K_i, Q_i and V_i we use them to compute the Scaled Dot-Product Attention:
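Following the definition in Attention Is All You Need, applied to the projections of head i:

```latex
\text{Attention}(Q_i, K_i, V_i) = \text{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i
```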



                                                              Figure 8: Illustration of the Dot-Product Attention.

Note: In the encoder block the computation of attention does not use a mask. In our Decoder post we explain how the decoder uses masking.

Going Deeper

This is the key of the architecture (the name of the paper is no coincidence) so we need to understand it carefully. Let’s start by looking at the matrix product between Q_i and K_i transposed:


Remember that Q_i and K_i were different projections of the tokens into a d_k-dimensional space. Therefore, we can think about the dot product of those projections as a measure of similarity between token projections. For every vector projected through Q_i the dot product with the projections through K_i measures the similarity between those vectors. If we call v_i and u_j the projections of the i-th token and the j-th token through Q_i and K_i respectively, their dot product can be seen as:
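This is the standard geometric identity for the dot product, where θ_{ij} (our notation) is the angle between the two projected vectors:

```latex
v_i \cdot u_j = \lVert v_i \rVert_2 \, \lVert u_j \rVert_2 \, \cos(\theta_{ij})
```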


Thus, this is a measure of how similar the directions of v_i and u_j are and how large their lengths are (the closer the directions and the larger the lengths, the greater the dot product).


Another way of thinking about this matrix product is as the encoding of a specific relationship between each pair of tokens in the input sequence (the relationship is defined by the matrices K_i and Q_i).


After this multiplication, the matrix is divided element-wise by the square root of d_k for scaling purposes.


The next step is a Softmax applied row-wise (one softmax computation for each row):

(Note: at this point the dimensions are (input_length, d_k) · (d_k, input_length) = (input_length, input_length).)

In our example, this could be:

                                                                                         Before Softmax

                                                                                          After Softmax

The result is rows with numbers between zero and one that sum to one. Finally, the result is multiplied by V_i to get the result of the head.
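The whole computation for a single head can be sketched in plain Python as follows. The helper functions and the toy Q, K and V values are ours and only meant to show the order of the steps (matrix product, scaling, row-wise softmax, multiplication by V_i); a real implementation would use a tensor library.

```python
import math

def matmul(A, B):
    # Plain matrix product of two lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    # Subtract the max for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))            # (input_length) x (input_length)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # each row sums to 1
    return matmul(weights, V), weights          # head: (input_length) x (d_v)

# Toy projections: 3 tokens, d_k = d_v = 2 (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
head, weights = scaled_dot_product_attention(Q, K, V)
print(len(head), len(head[0]))  # 3 2
```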


Example 1

For the sake of understanding let’s propose a dummy example. Suppose that the resulting first row of:


is [0,0,0,0,1,0]. Hence, because 1 is in the 5th position of the vector, the result will then be:


Where v_{token} is the projection through V_i of the token’s representation. Observe that in this case the word “Hello” ends up with a representation based on the 5th token “you” of the input for that head.


(Note: V_i actually has dimensions (input_length, d_v); here v_{token} denotes a single d_v-dimensional row vector of it.)
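Example 1 can be checked numerically. In the sketch below the rows of V are made-up projections (d_v = 2), one per token; multiplying the one-hot attention row for “Hello” by V simply picks out the row that belongs to “you”.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

tokens = ["Hello", ",", "how", "are", "you", "?"]

# Made-up projections through V_i, one row per token (d_v = 2).
V = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]

# First row of the softmax output: all the attention goes to the 5th token.
attention_row_hello = [[0, 0, 0, 0, 1, 0]]

result = matmul(attention_row_hello, V)
print(result[0] == V[tokens.index("you")])  # True
```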

Suppose an equivalent example holds for the rest of the heads. The word “Hello” will now be represented by the concatenation of the different projections of other words. The network will learn over the course of training which relationships are more useful and will relate tokens to each other based on these relationships.


Example 2

Let us now complicate the example a little. Consider the more general scenario where there isn’t just a single 1 per row but positive decimal numbers that sum to 1:


If we do as in the previous example and multiply that by V_i:


This results in a matrix where each row is a weighted combination of the tokens’ projections through V_i:


Observe that we can think about the resulting representation of “Hello” as a weighted combination (centroid) of the projected vectors through V_i of the input tokens.
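A tiny numeric illustration of this weighted combination, using three made-up projected vectors and attention weights that sum to 1 (all values are ours, chosen so the arithmetic is easy to follow):

```python
def weighted_combination(weights, vectors):
    # Sum the vectors feature by feature, each scaled by its attention weight.
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]

# Made-up projections through V_i for 3 tokens (d_v = 2).
V = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]

# Attention weights of the first token over all tokens.
weights_first_token = [0.5, 0.25, 0.25]

representation = weighted_combination(weights_first_token, V)
print(representation)  # [0.5, 0.5]
```

Because the weights are non-negative and sum to 1, the result always lies inside the region spanned by the projected vectors, which is why we can think of it as a centroid.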


Thus, a specific head captures a specific relationship between the input tokens. Now, if we do that h times (a total of h heads) each encoder block is capturing different relationships between input tokens.


Following up, assume that the example above referred to the first head. Then the first row would be:


Then the first row of the result of the Multi-Head Attention layer, i.e. the representation of “Hello” at this point, would be


Which is a vector of length emb_dim given that the matrix W^O has dimensions (d_v*h) x (emb_dim). Applying the same logic to the rest of the rows/token representations we obtain a matrix of dimensions (input_length) x (emb_dim).


Thus, at this point, the representation of the token is the concatenation of weighted combinations of token representations (centroids) through the different learned projections.
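Putting the pieces together, a full Multi-Head Attention layer can be sketched as below. The weights are random and the sizes are tiny, so this only demonstrates the shapes and the order of operations; every function and variable name is our own.

```python
import math
import random

random.seed(0)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def attention_head(X, W_q, W_k, W_v):
    # Project X three times, then apply scaled dot-product attention.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    outputs = [attention_head(X, *head_weights) for head_weights in heads]
    # Concatenate the heads feature-wise: (input_length) x (h * d_v).
    concat = [sum((out[t] for out in outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)  # (input_length) x (emb_dim)

input_length, emb_dim, h = 6, 8, 2
d_k = d_v = emb_dim // h
X = random_matrix(input_length, emb_dim)
heads = [
    (random_matrix(emb_dim, d_k), random_matrix(emb_dim, d_k), random_matrix(emb_dim, d_v))
    for _ in range(h)
]
W_o = random_matrix(h * d_v, emb_dim)
out = multi_head_attention(X, heads, W_o)
print(len(out), len(out[0]))  # 6 8
```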


Position-wise Feed-Forward Network

                                                                                        Figure 9: Feed Forward

This step is composed of the following layers:

                                                        Figure 10: Scheme of the Feed-Forward Neural Network

Mathematically, for each row in the output of the previous layer:
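In the form given in Attention Is All You Need, with a ReLU between the two linear layers and bias vectors b_1 and b_2:

```latex
\text{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2
```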


where W_1 and W_2 are (emb_dim) x (d_F) and (d_F) x (emb_dim) matrices respectively, and d_F is the hidden_dim defined in the notation above.

Observe that during this step, vector representations of tokens don’t “interact” with each other. It is equivalent to running the calculations row-wise and stacking the resulting rows in a matrix.


The output of this step has dimension (input_length) x (emb_dim).
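The row-wise nature of this step is easy to see in code. The sketch below assumes the ReLU activation from Attention Is All You Need and uses tiny hand-picked weights (emb_dim = 2, d_F = 3) so the result can be verified by hand.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(x, W, b):
    # x has len(W) features; W maps them to len(b) features.
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Hand-picked weights: emb_dim = 2, d_F = 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

X = [[1.0, 1.0], [2.0, 0.0]]  # two token representations

# Each row goes through the same network independently of the others.
out = [feed_forward(row, W1, b1, W2, b2) for row in X]
print(out)  # [[1.0, 1.0], [3.0, 1.0]]
```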

Dropout, Add & Norm

                                                            Figure 11: Where Dropout, Addition and Normalization happen.

Before this layer, there is always a layer for which inputs and outputs have the same dimensions (Multi-Head Attention or Feed-Forward). We will call that layer Sublayer and its input x.


After each Sublayer, dropout is applied with 10% probability. Call this result Dropout(Sublayer(x)). This result is added to the Sublayer’s input x, and we get x + Dropout(Sublayer(x)).


Observe that in the context of a Multi-Head Attention layer, this means adding the original representation of a token to the representation based on the relationship with other tokens. It is like telling the token:


“Learn the relationship with the rest of the tokens, but don’t forget what we already learned about yourself!”

Finally, a token-wise/row-wise normalization is computed with the mean and standard deviation of each row. This improves the stability of the network.
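For a single token (one row), these three operations might be sketched as follows. This is a simplification under our own assumptions: the dropout shown is the common "inverted" variant, and the learned gain and bias that layer normalization normally includes are left out.

```python
import math
import random

def dropout(row, p=0.1, training=True):
    if not training:
        return list(row)
    # Inverted dropout: zero each value with probability p, rescale the rest.
    return [0.0 if random.random() < p else v / (1.0 - p) for v in row]

def layer_norm(row, eps=1e-6):
    # Normalize one row with its own mean and standard deviation.
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def add_and_norm(x, sublayer_out, p=0.1, training=True):
    dropped = dropout(sublayer_out, p, training)
    return layer_norm([a + b for a, b in zip(x, dropped)])

x = [1.0, 2.0, 3.0, 4.0]               # input of the Sublayer for one token
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # made-up Sublayer(x)
y = add_and_norm(x, sublayer_out, training=False)
print(y)
```

After normalization the row has a mean of approximately zero and a standard deviation of approximately one.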

The output of these layers is:
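Combining the residual addition and the normalization described above, and writing LayerNorm for the row-wise normalization:

```latex
\text{LayerNorm}\big(x + \text{Dropout}(\text{Sublayer}(x))\big)
```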

And that’s it! This is the architecture behind all of the magic in state of the art NLP.

If you have any feedback please let us know in the comment section!


Attention Is All You Need; Vaswani et al., 2017.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Devlin et al., 2018.

The Annotated Transformer; Alexander Rush, Vincent Nguyen and Guillaume Klein.

Universal Language Model Fine-tuning for Text Classification; Howard and Ruder, 2018.

Improving Language Understanding by Generative Pre-Training; Radford et al., 2018.

