BERT

  • Bidirectional Encoder Representations from Transformers
  • Uses token, segment, and position embeddings
  • Self-supervised pretraining
  • Masked language modeling, next sentence prediction
  • [CLS]: classification token prepended to every input (its output vector is used for sentence-level tasks), [SEP]: separator placed between and after sentences, [MASK]: placeholder for a masked token
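
A minimal sketch of where these special tokens end up in an encoded input. The Hugging Face `transformers` tokenizer and the model name are assumptions used purely for illustration; they are not part of the notes.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair: the tokenizer prepends [CLS], puts [SEP] between the
# sentences, and closes the sequence with a final [SEP].
encoded = tokenizer("The cat sat on the mat.", "It was very comfortable.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'cat', ..., '[SEP]', 'it', 'was', ..., '[SEP]']

# [MASK] is the placeholder that the masked language modeling objective predicts.
print(tokenizer.mask_token, tokenizer.mask_token_id)
```
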
  • christianversloot #Roam-Highlights
    • BERT base, which has 12 Encoder Segments stacked on top of each other, has a 768-dimensional intermediate state and uses 12 Attention heads (hence 768/12 = 64-dimensional Attention heads).

    • BERT large, which has 24 Encoder Segments, a 1024-dimensional intermediate state, and 16 Attention heads (1024/16 = 64-dimensional Attention heads again).
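
The head-dimension arithmetic above can be checked with a small sketch. `BertConfig` from Hugging Face `transformers` is used here only as a convenient container for the published numbers; the library itself is an assumption, not part of the notes.

```python
from transformers import BertConfig

base = BertConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12)
large = BertConfig(hidden_size=1024, num_hidden_layers=24, num_attention_heads=16)

for name, cfg in [("BERT base", base), ("BERT large", large)]:
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    print(f"{name}: {cfg.num_hidden_layers} encoder segments, "
          f"{cfg.hidden_size}-dim state, {cfg.num_attention_heads} heads "
          f"of {head_dim} dims each")
# BERT base: 12 encoder segments, 768-dim state, 12 heads of 64 dims each
# BERT large: 24 encoder segments, 1024-dim state, 16 heads of 64 dims each
```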

    • BERT utilizes only the encoder segment of the Transformer, meaning that it outputs a vector for every input token. The first vector, the one corresponding to [CLS], is also called C in the BERT paper: it is the “class vector” that contains sentence-level information (or, in the case of multiple sentences, information about the sentence pair). All other vectors represent information about their specific token.

    • In other words, structuring BERT this way allows us to perform both sentence-level tasks and token-level tasks. If we use BERT and want to work with sentence-level information, we build on top of the [CLS] token, i.e. its output vector C.
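
A rough sketch of that split, assuming PyTorch and Hugging Face `transformers` (the model name is illustrative): the first output vector, for [CLS], feeds sentence-level heads, while the remaining vectors feed token-level heads.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT outputs one vector per token.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state   # shape: (1, sequence_length, 768)
cls_vector = hidden[:, 0, :]         # the "class vector" C -> sentence-level tasks
token_vectors = hidden[:, 1:, :]     # per-token vectors -> token-level tasks
print(cls_vector.shape, token_vectors.shape)
```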

    • Masked Language Modeling (MLM): mask a fraction of the input tokens and train the model to predict the original tokens at those positions.
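
The notes only name the objective; below is a toy sketch of the standard masking scheme from Devlin et al. (2018): 15% of tokens are selected, of which 80% become [MASK], 10% are replaced by a random token, and 10% stay unchanged. The helper name and the tiny vocabulary are made up for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply BERT-style masking; returns corrupted tokens plus prediction targets."""
    masked, labels = [], []
    for token in tokens:
        if token in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            masked.append(token)
            labels.append(None)          # not a prediction target
            continue
        labels.append(token)             # the model must recover this token
        roll = random.random()
        if roll < 0.8:
            masked.append("[MASK]")      # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(random.choice(vocab))  # 10%: random token
        else:
            masked.append(token)         # 10%: kept as-is, still predicted
    return masked, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
print(mask_tokens(tokens, vocab=["dog", "ran", "blue", "mat"]))
```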

    • Next Sentence Prediction (NSP)

      • This task ensures that the model learns sentence-level information. It is also really simple, and is the reason why the BERT inputs can sometimes be a pair of sentences. NSP involves textual entailment, or understanding the relationship between two sentences.
      • Constructing a training dataset for this task is simple: given an unlabeled corpus, we take a sentence A and, for the 50% of cases labeled ‘is next’, take the actual next sentence as B; for the other 50%, we pick a random sentence from the corpus as B (Devlin et al., 2018). This way, we can construct a dataset with a 50/50 split between ‘is next’ and ‘is not next’ pairs.
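
A toy sketch of this 50/50 pair construction. The function name and example corpus are made up; note that a randomly drawn B can occasionally coincide with the true next sentence, which a real pipeline would guard against.

```python
import random

def make_nsp_pairs(corpus):
    """Build (sentence A, sentence B, label) triples for next sentence prediction."""
    pairs = []
    for i in range(len(corpus) - 1):
        sentence_a = corpus[i]
        if random.random() < 0.5:
            pairs.append((sentence_a, corpus[i + 1], "is_next"))      # true next sentence
        else:
            pairs.append((sentence_a, random.choice(corpus), "not_next"))  # random sentence
    return pairs

corpus = ["He went to the store.", "He bought some milk.",
          "Penguins live in Antarctica.", "They eat fish."]
for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)
```
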
    • Pretraining data: BooksCorpus and English Wikipedia

    • It thus does not matter whether your downstream task involves single text or text pairs: BERT can handle it.

      • Sentence pairs in paraphrasing tasks.
      • Hypothesis-premise pairs in textual entailment tasks.
      • Question-answer pairs in question answering.
      • Text-empty pair in text classification.
    • Yes, you read it right: sentence B is empty if your goal is to fine-tune for text classification. There simply is no sentence after the [SEP] token.
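
A short sketch of the single-text case, again assuming the Hugging Face tokenizer from the earlier sketches: with no sentence B, the input is just [CLS] text [SEP], and the segment (token type) IDs are all zero.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

single = tokenizer("This movie was great.")                           # sentence B is empty
pair = tokenizer("This movie was great.", "I would watch it again.")  # sentence pair

print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
# ['[CLS]', 'this', 'movie', 'was', 'great', '.', '[SEP]']  <- only one segment
print(single["token_type_ids"])   # all 0s: everything belongs to segment A
print(pair["token_type_ids"])     # 0s for segment A, 1s for segment B
```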

    • Fine-tuning is also really inexpensive compared to pretraining.