From the course: Introduction to Transformer Models for NLP
WordPiece tokenization
- Section 4.2, WordPiece Tokenization. We've zoomed in on multiple aspects of transformers and BERT. We've talked about the attention mechanisms that power the contextual representations of tokens. We've talked about the fact that BERT needs to tokenize inputs, but we haven't talked about how BERT actually tokenizes and embeds the initial inputs, like "Istanbul is a great city" here. It does so through something called WordPiece tokenization. So consider the following sentence, "Another beautiful day." Not the best sentence, but it gets the job done. To tokenize this, we split it up into a list of tokens from our vocabulary, which contains over 30,000 tokens. We also add those two special tokens: CLS at the beginning and SEP at the end. Remember, CLS is meant to represent the entire sequence, and SEP is meant to represent a separation between sentences if we're passing two sentences at once. So tokenizing "Another beautiful day" would end up looking like this. We'd…
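As a supplement to the lecture: here is a minimal sketch of the tokenization step just described, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the video itself doesn't name a specific library or checkpoint):

```python
# Minimal sketch of BERT's WordPiece tokenization, assuming the
# Hugging Face transformers library (pip install transformers).
from transformers import BertTokenizer

# Load BERT's pretrained WordPiece tokenizer; its vocabulary holds
# roughly 30,000 tokens (bert-base-uncased has 30,522).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "Another beautiful day"

# Split the sentence into WordPiece tokens from the vocabulary.
print(tokenizer.tokenize(sentence))
# -> ['another', 'beautiful', 'day']

# encode() also adds the two special tokens: [CLS] at the start,
# which is meant to represent the entire sequence, and [SEP] at the
# end, which separates sentences when two are passed at once.
ids = tokenizer.encode(sentence)
print(tokenizer.convert_ids_to_tokens(ids))
# -> ['[CLS]', 'another', 'beautiful', 'day', '[SEP]']
```

For a word outside the vocabulary, the tokenizer would fall back to subword pieces (for example, splitting it into a root token plus continuation tokens prefixed with "##"), which is what makes the 30,000-token vocabulary sufficient for open-ended text.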