BERT BERT BERT!
No, not Bert on Sesame Street!
BERT (Bidirectional Encoder Representations from Transformers) is an NLP model created by Google in 2018. Upon release, it was considered a state-of-the-art model. Employing the Transformer architecture, BERT broke performance records on NLP tasks such as sentence classification, Named Entity Recognition (NER) tagging, and question answering.
The model gained popularity as the first of its kind to apply the Transformer's bidirectional training to Natural Language Processing tasks, which deepens its understanding of context.
BERT is a pre-trained language model. A fine example of transfer learning.
Transfer learning is a machine learning technique where a model trained on one task is re-purposed on a second related task. It is an optimization that allows rapid progress or improved performance when modeling the second task.
- A Gentle Introduction to Transfer Learning for Deep Learning (Machine Learning Mastery)
How does BERT work?
As mentioned previously, BERT utilizes the Transformer, an attention mechanism that learns contextual relations between words in a text. A Transformer consists of two parts: an encoder and a decoder. The encoder reads the input text while the decoder produces a prediction for the task. Since BERT's goal is to produce a language representation rather than generate text, it only requires the encoder. The paper "Attention Is All You Need" by researchers at Google describes the Transformer architecture in detail.
Unlike directional models, BERT reads the entire text input at once, making it a bidirectional model. This means the model can learn the context of a word based on the text surrounding it on both the left and the right.
The input is a sequence of tokens, which are first embedded into vectors and then fed into the neural network to be processed. The output is a sequence of vectors of the same length, in which each vector corresponds to the input token at the same index.
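To make this concrete, here is a minimal sketch of running the BERT encoder and inspecting its per-token output vectors. It assumes the Hugging Face transformers library and PyTorch, which the article itself does not prescribe:

```python
import torch
from transformers import BertTokenizer, BertModel

# Load a pre-trained BERT encoder and its matching tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT reads the whole sentence at once.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One hidden-size vector per input token (including [CLS] and [SEP]):
# the shape is (batch_size, number_of_tokens, 768) for bert-base.
print(outputs.last_hidden_state.shape)
```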
The input is first shaped and adorned with some extra metadata:
- Token embeddings: A [CLS] token is added to the input word tokens at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
- Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token allowing the encoder to distinguish between sentences.
- Positional embeddings: A positional embedding is added to each token to indicate its position in the sentence.
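As an illustration of this input formatting, here is a small sketch assuming the Hugging Face transformers library (the example sentences are made up; the segment markers surface as token_type_ids, and positional embeddings are added inside the model itself):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair; the tokenizer inserts [CLS] and [SEP] automatically.
encoded = tokenizer("The cat sat on the mat.", "It fell asleep.")

# [CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# Segment markers: 0 for sentence A tokens, 1 for sentence B tokens.
print(encoded["token_type_ids"])
```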
Standard models before BERT used a single-direction approach: they were trained by predicting the next word in the sequence. Unfortunately, this method did not provide a deep sense of understanding, as learning was restricted to one direction. To overcome this hurdle, BERT utilizes two training strategies.
1. Masked LM (MLM): 15% of the tokens are masked before being fed into the model. The model then predicts these masked values based on the context provided by the surrounding non-masked words.
Predicting the masked tokens requires:
- Adding a classification layer on top of the encoder output.
- Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
- Finding the probability of each word present in the vocabulary using softmax.
The BERT loss function only takes the prediction of the masked values into consideration, ignoring the unmasked ones. As a result, the model converges more slowly than the previously standard left-to-right or right-to-left models, but this slower convergence is offset by increased contextual awareness.
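The steps above correspond closely to what a pre-trained masked-language-modeling head already does. A hedged sketch, assuming the Hugging Face transformers library and PyTorch (the example sentence and the expected prediction are illustrative only):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")

with torch.no_grad():
    # Logits over the vocabulary for every position: (1, seq_len, vocab_size).
    logits = model(**inputs).logits

# Find the [MASK] position and take the most probable vocabulary token.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # likely "paris"
```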
2. Next Sentence Prediction (NSP): This technique is used to understand the relationship between two sentences. The model is given pairs of sentences and learns to predict whether the second sentence in the pair actually follows the first. As mentioned previously, the sentences in the input are separated by the [SEP] token. During training, sentence pairs are fed in the following manner:
- 50% of the time, the pair of sentences are subsequent sentences.
- The rest of the time (another 50%), random sentences are paired together.
To predict if the second sentence is indeed connected to the first, the following steps are performed:
- The entire input sequence goes through the Transformer model.
- The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
- The probability of IsNextSequence is calculated with softmax.
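For reference, here is a minimal sketch of next-sentence prediction with a pre-trained NSP head, again assuming the Hugging Face transformers library and PyTorch (the sentence pair is made up):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "She opened the fridge."
sentence_b = "It was completely empty."
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    # Two logits per pair: index 0 scores "B follows A", index 1 scores "B is random".
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
print(f"P(IsNextSequence) = {probs[0, 0].item():.3f}")
```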
Both training methods take place simultaneously, and the model is trained to minimize their combined loss function.
Calibrating BERT
BERT can be customized for many Natural Language Processing tasks such as question-answering, text classification, Named Entity Recognition (NER), and so on. This can be done by keeping the core implementation as it is and carrying out a few tweaks to fine-tune it to the task at hand.
- Classification tasks are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
- A NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label. More on NER here.
- In Question Answering tasks, the software receives a question regarding a text sequence and is required to mark the correct answer within that sequence. A Q&A BERT model can be trained by learning two extra vectors: a start vector and an end vector, which mark the beginning and the end of the answer.
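As an example of the question-answering setup, here is a hedged sketch using a publicly available SQuAD-fine-tuned BERT checkpoint with the Hugging Face transformers library; the checkpoint name and the example question/context are illustrative choices, not part of the original article:

```python
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "Who created BERT?"
context = "BERT is an NLP model created by Google in 2018."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# start_logits and end_logits come from the two extra vectors learned for Q&A:
# they score each token as the start or end of the answer span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
answer_ids = inputs["input_ids"][0][start : end + 1]
print(tokenizer.decode(answer_ids))  # likely "google"
```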
Most of the hyper-parameters remain the same across various NLP tasks. The BERT paper gives specific details on the hyper-parameters that require tuning.
Conclusion
Since its inception, BERT has inspired the creation of other pre-trained language models such as RoBERTa, ALBERT, DistilBERT, XLNet, and UniLM. As the cornerstone of a new era in NLP, it continues to be one of the most widely used models in 2021. This article serves as a quick guide to BERT for those who have prior experience with NLP and want to know more.
References
- Attention Is All You Need
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Named Entity Recognition
- Pre-trained Language Models: Simplified
- BERT: A Guide