From the course: Large Language Models: Text Classification for NLP using BERT


Tokenizers

- [Instructor] Let's go ahead and head over to Google Colab. We've installed the transformers library, and we're going to use bert-base-uncased as our checkpoint. Here you can see that the vocabulary size is 30,522 tokens, and here are a couple of examples of tokens in the vocabulary. Each token is mapped to a token ID. The original sentence is, "I like NLP." When we use WordPiece tokenization with the bert-base-uncased checkpoint, all of the tokens are converted to lowercase, and you get i, like, nl, and then double hash p (##p). These are then mapped to numerical IDs based on the vocabulary. And you can see that the CLS token is added to the front and the SEP token at the back. The CLS token has an ID of 101, and the SEP token has an ID of 102. Now, what happens if we pick a Unicode character that isn't in the vocabulary? What would the WordPiece tokenizer…
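For reference, here is a minimal sketch of the steps just described, assuming the transformers library is installed as in the Colab setup above. The IDs 101 and 102 come from the transcript; the IDs for the remaining tokens depend on the bert-base-uncased vocabulary:

```python
# Minimal sketch: WordPiece tokenization with the bert-base-uncased checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The vocabulary maps 30,522 tokens to numerical IDs.
print(tokenizer.vocab_size)  # 30522

# WordPiece lowercases the input and splits "NLP" into subword pieces.
print(tokenizer.tokenize("I like NLP"))  # ['i', 'like', 'nl', '##p']

# Encoding adds [CLS] (ID 101) at the front and [SEP] (ID 102) at the back.
ids = tokenizer("I like NLP")["input_ids"]
print(ids)                                   # [101, ..., 102]
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', 'i', 'like', 'nl', '##p', '[SEP]']
```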