From the course: Large Language Models: Text Classification for NLP using BERT


Tokenizers

- [Instructor] Let's go ahead and head over to Google Colab. We've installed the transformers library, and we're going to use bert-base-uncased as our checkpoint. Here you can see that the vocabulary size is 30,522 tokens, and here are a couple of examples of tokens in the vocabulary. Each token is mapped to a token ID. The original sentence is, "I like NLP." When we use WordPiece tokenization with the bert-base-uncased checkpoint, all of the tokens are converted to lowercase, and you get i, like, nl, and then double hash p (##p). These are then mapped to numerical IDs based on the vocabulary. And you can see that the CLS token is added to the front and the SEP token at the back. The CLS token has an ID of 101, and the SEP token has an ID of 102. Now, what happens if we pick a Unicode character that isn't in the vocabulary? What would the WordPiece tokenizer…
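For reference, here is a minimal sketch of the steps just described, assuming the transformers library is installed as in the Colab setup above. The IDs 101 and 102 come from the transcript; the IDs for the remaining tokens depend on the bert-base-uncased vocabulary:

```python
# Minimal sketch: WordPiece tokenization with the bert-base-uncased checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The vocabulary maps 30,522 tokens to numerical IDs.
print(tokenizer.vocab_size)  # 30522

# WordPiece lowercases the input and splits "NLP" into subword pieces.
print(tokenizer.tokenize("I like NLP"))  # ['i', 'like', 'nl', '##p']

# Encoding adds [CLS] (ID 101) at the front and [SEP] (ID 102) at the back.
ids = tokenizer("I like NLP")["input_ids"]
print(ids)                                   # [101, ..., 102]
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', 'i', 'like', 'nl', '##p', '[SEP]']
```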