What is BERT?

The BERT language model is an open source machine learning framework for natural language processing (NLP). BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context. The BERT framework was pretrained using text from Wikipedia and can be fine-tuned with question-and-answer data sets.

BERT, which stands for Bidirectional Encoder Representations from Transformers, is based on transformers, a deep learning model in which every output element is connected to every input element, and the weightings between them are dynamically calculated based upon their connection.

Historically, language models could only read input text sequentially -- either left-to-right or right-to-left -- but couldn't do both at the same time. BERT is different because it's designed to read in both directions at once. The introduction of transformer models enabled this capability, which is known as bidirectionality. Using bidirectionality, BERT is pretrained on two different but related NLP tasks: masked language modeling (MLM) and next sentence prediction (NSP).

The objective of MLM training is to hide a word in a sentence and then have the program predict what word has been hidden based on the hidden word's context. The objective of NSP training is to have the program predict whether two given sentences have a logical, sequential connection or whether their relationship is simply random.
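The NSP objective described above can be illustrated by how its training pairs might be built. The following is a minimal sketch, not BERT's actual data pipeline: the function name make_nsp_pairs is hypothetical, and negatives are drawn from the same short list of sentences for simplicity, whereas a real setup would typically draw them from elsewhere in a large corpus. It assumes at least three sentences.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered list.

    Roughly half of the pairs keep the true next sentence; the rest
    substitute a different sentence, so a model can learn to tell a
    sequential connection from a random one.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except the current one
            # and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

# Illustrative input only; these sentences are made up.
sentences = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live in the Southern Hemisphere.",
]
pairs = make_nsp_pairs(sentences, random.Random(0))
for a, b, is_next in pairs:
    print(is_next, "|", a, "->", b)
```

Each pair carries its own label, so no manual annotation is needed: the ordering of the text itself supplies the supervision.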

How BERT works

The goal of any given NLP technique is to understand human language as it is spoken naturally. In BERT's case, this means predicting a word in a blank. To do this, models typically train using a large repository of specialized, labeled training data. This process involves linguists doing laborious manual data labeling.

BERT, however, was pretrained using only a collection of unlabeled, plain text, namely the entirety of English Wikipedia and the BooksCorpus data set. It continues to learn through unsupervised learning from unlabeled text and improves even as it's being used in practical applications such as Google search.

BERT's pretraining serves as a base layer of knowledge from which it can build its responses. From there, BERT can adapt to the ever-growing body of searchable content and queries, and it can be fine-tuned to a user's specifications. This process is known as transfer learning. Aside from this pretraining process, BERT has multiple other aspects it relies on to function as intended, including the following:

Transformers

Google's work on transformers made BERT possible. The transformer is the part of the model that gives BERT its increased capacity for understanding context and ambiguity in language. The transformer processes any given word in relation to all other words in a sentence, rather than processing them one at a time. By looking at all surrounding words, the transformer enables BERT to understand the full context of the word and therefore better understand searcher intent.

This contrasts with the traditional method of language processing, known as word embedding. This approach was used in models such as GloVe and word2vec. It maps every word to a single fixed vector, which represents that word the same way regardless of the context in which it appears.
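A toy lookup table makes this limitation concrete. The sketch below is not GloVe or word2vec; the table EMBEDDINGS, its two-dimensional values and the helper embed are all made up for illustration. It only shows that a static table returns the same vector for "bank" whether the neighboring word is "river" or "loan".

```python
# Toy static embedding table: one fixed vector per word (values are made up).
EMBEDDINGS = {
    "river": [0.9, 0.1],
    "bank": [0.5, 0.5],
    "loan": [0.1, 0.9],
}

def embed(sentence):
    """Look up each known word; the surrounding words play no role."""
    return [EMBEDDINGS[word] for word in sentence.split() if word in EMBEDDINGS]

bank_near_river = embed("river bank")[1]
bank_near_loan = embed("bank loan")[0]

# The two vectors are identical even though the senses differ.
print(bank_near_river == bank_near_loan)
```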

Masked language modeling

Word embedding models require large data sets of structured data. While they are adept at many general NLP tasks, they fail at the context-heavy, predictive nature of question answering because all words are in some sense fixed to a vector or meaning.

BERT uses an MLM method to keep the word in focus from seeing itself, or having a fixed meaning independent of its context. BERT is forced to identify the masked word based on context alone. In BERT, words are defined by their surroundings, not by a prefixed identity.
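The masking step can be sketched in a few lines. This is a simplified illustration, not BERT's exact procedure: the helper mask_tokens is hypothetical, it splits on whole words rather than subword pieces, and it always substitutes the [MASK] token, whereas the published recipe sometimes keeps the original word or swaps in a random one. The 15% default mirrors the masking rate reported for BERT.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (masked_tokens, labels).

    labels holds the original token at each masked position and None
    elsewhere, so a model is scored only on the hidden words.
    """
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)    # hide the word from the model
            labels.append(token)   # remember what it has to predict
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

tokens = "the man went to the store to buy milk".split()
masked, labels = mask_tokens(tokens, random.Random(0))
print(masked)
print(labels)
```

Because the hidden word is removed from the input, the only evidence available for predicting it is the words on either side.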

Self-attention mechanisms

BERT also relies on a self-attention mechanism that captures and understands relationships among words in a sentence. The bidirectional transformers at the center of BERT's design make this possible. This is significant because often, a word may change meaning as a sentence develops. Each word added augments the overall meaning of the word the NLP algorithm is focusing on. The more words that are present in each sentence or phrase, the more ambiguous the word in focus becomes. BERT accounts for the augmented meaning by reading bidirectionally, accounting for the effect of all other words in a sentence on the focus word and eliminating the left-to-right momentum that biases words towards a certain meaning as a sentence progresses.
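The idea that every word is weighed against every other word can be sketched as scaled dot-product self-attention. This is a bare-bones illustration under stated assumptions: the word vectors are made up, and the learned query, key and value projections, multiple heads and stacked layers of a real transformer are all omitted, so each vector serves as its own query, key and value.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Returns (outputs, weights): each output is a weighted average of
    all input vectors, with weights based on dot-product similarity.
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, vectors)) for i in range(dim)]
        )
    return outputs, weights

# Made-up two-dimensional vectors for three words.
river, bank, loan = [0.9, 0.1], [0.5, 0.5], [0.1, 0.9]

out_a, weights_a = self_attention([river, bank])  # "river bank"
out_b, weights_b = self_attention([bank, loan])   # "bank loan"

bank_in_a = out_a[1]
bank_in_b = out_b[0]
print(bank_in_a, bank_in_b)
```

The input vector for "bank" is the same in both phrases, but its output differs because it is blended with different neighbors. Every position attends to words on both its left and its right, which is the bidirectional behavior described above.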

What is BERT used for?

Google uses BERT to optimize the interpretation of user search queries. BERT excels at functions that make this possible, including the following:

Sequence-to-sequence language generation tasks such as:

  • Question answering.

  • Abstract summarization.

  • Sentence prediction.

  • Conversational response generation.

NLU tasks such as:

  • Polysemy and coreference resolution. Polysemy means a single word that has multiple meanings; coreference means different expressions that refer to the same entity.

  • Word sense disambiguation.

  • Natural language inference.

  • Sentiment classification.

BERT is open source, meaning anyone can use it. Google claims that users can train a state-of-the-art question-and-answer system in just 30 minutes on a cloud tensor processing unit, and in a few hours using a graphics processing unit. Many other organizations, research groups and separate teams within Google are adapting the model, either modifying its architecture and training to optimize it for efficiency or pretraining and fine-tuning it on domain-specific data to specialize it for particular tasks. Examples include the following:

  • PatentBERT. This BERT model is fine-tuned to perform patent classification tasks.

  • DocBERT. This model is fine-tuned for document classification tasks.

  • BioBERT. This biomedical language representation model is for biomedical text mining.

  • VideoBERT. This joint visual-linguistic model is used in unsupervised learning of unlabeled data on YouTube.

  • SciBERT. This model is for scientific text.

  • G-BERT. This pretrained BERT model uses medical codes with hierarchical representations through graph neural networks and is then fine-tuned for making medical recommendations.

  • TinyBERT by Huawei. This smaller, "student" BERT learns from the original "teacher" BERT, performing transformer distillation to improve efficiency. TinyBERT produced promising results in comparison to BERT-base while being 7.5 times smaller and 9.4 times faster at inference.

  • DistilBERT by Hugging Face. This smaller, faster and cheaper version of BERT is distilled from BERT, with certain architectural components removed to improve efficiency.

  • ALBERT. This lighter version of BERT lowers memory consumption and increases the speed with which the model is trained.

  • SpanBERT. This model improved BERT's ability to predict spans of text.

  • RoBERTa. Through more advanced training methods, this model was trained on a bigger data set for a longer time to improve performance.

  • ELECTRA. This version has been tailored to generate high-quality representations of text.
