13. A tokenizer processes the input prompt and prepares the actual input for the language model: a list of token IDs.
14. If we want to inspect those IDs, we can use the tokenizer’s
decode method to translate the IDs back into text that we
can read:
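A minimal sketch of both steps (assuming the transformers library and the microsoft/Phi-3-mini-4k-instruct checkpoint used later in these slides; the prompt is just an illustration):

from transformers import AutoTokenizer

# Any Hugging Face checkpoint works the same way; Phi-3 is assumed here
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap."
input_ids = tokenizer(prompt).input_ids   # the list of token IDs fed to the model
print(input_ids)

# Translate each ID back into readable text
for token_id in input_ids:
    print(token_id, "->", tokenizer.decode(token_id))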
15. Notice the following:
● The first token is ID 1 (<s>), a special token indicating the
beginning of the text.
● Some tokens are complete words (e.g., Write, an, email).
● Some tokens are parts of words (e.g., apolog, izing, trag,
ic).
● Punctuation characters are their own tokens.
16. We can also inspect the tokens generated by the model by printing the generation_output variable.
17. We can pass the decode method an individual token ID or a list of token IDs.
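A sketch covering both points (same assumptions as the snippet above):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

input_ids = tokenizer("Write an email apologizing to Sarah.", return_tensors="pt").input_ids
generation_output = model.generate(input_ids=input_ids, max_new_tokens=20)

print(generation_output)                            # tensor of token IDs: prompt plus completion
print(tokenizer.decode(generation_output[0]))       # decode accepts a list of token IDs
print(tokenizer.decode(generation_output[0][-1]))   # ...or a single token ID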
18. How Does the Tokenizer Break Down Text? (Three factors)
First, at model design time, the creator of the model chooses a tokenization method. Popular methods include byte pair encoding (BPE) (widely used by GPT models) and WordPiece (used by BERT).
Second, after choosing the method, we need to make a number of
tokenizer design choices like vocabulary size and what special tokens to
use.
Third, the tokenizer needs to be trained on a specific dataset to establish
the best vocabulary it can use to represent that dataset.
21. Word Tokens
This approach was common with earlier methods like word2vec but is
being used less and less in NLP. Its usefulness, however, led it to be used
outside of NLP for use cases such as recommendation systems.
One challenge with word tokenization is that the tokenizer may be unable
to deal with new words that enter the dataset after the tokenizer was
trained.
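A toy sketch (not from the source) of why a fixed word-level vocabulary struggles with words it never saw during training:

# Word-level tokenizer with a vocabulary fixed at training time
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def word_tokenize(text):
    # Any word missing from the vocabulary falls back to <unk>
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(word_tokenize("the cat sat"))      # [1, 2, 3]
print(word_tokenize("the cat yawned"))   # [1, 2, 0]  <- "yawned" becomes <unk>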
22. Subword Tokens
This method's vocabulary contains full and partial words. In addition to the vocabulary expressivity mentioned earlier, another benefit of the approach is its ability to represent new words by breaking them down into smaller pieces, which tend to be part of the vocabulary.
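For instance, with the GPT-2 BPE tokenizer covered later in these slides (the example word is an arbitrary choice):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A rare word has no dedicated token, so it is broken into subword pieces
token_ids = tokenizer("unfathomability").input_ids
print([tokenizer.decode(t) for t in token_ids])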
23. Character Tokens
This is another method that can deal successfully with new words because it has
the raw letters to fall back on. While that makes the representation easier to
tokenize, it makes the modeling more difficult.
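A one-line illustration (not from the source):

# Character-level tokenization: any word can be represented,
# but sequences get long and the model must learn spelling patterns itself
print(list("apologizing"))   # ['a', 'p', 'o', 'l', 'o', 'g', 'i', 'z', 'i', 'n', 'g']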
24. Byte Tokens
One additional tokenization method breaks down tokens into the individual bytes that are used to represent Unicode characters.
● CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
● ByT5: Towards a token-free future with pre-trained byte-to-byte models
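A quick illustration of the idea (not from the source):

# Byte-level tokenization falls back to the UTF-8 bytes of the text,
# so a vocabulary of only 256 byte values can cover any character
print(list("鸟".encode("utf-8")))   # [233, 184, 159]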
27. Given the following text:
text = """
English and CAPITALIZATION
鸟
show_tokens False None elif == >= else: two tabs:"		" Three tabs: "			"
12.0*50=600
"""
We will see how each tokenizer deals with a number of different kinds of tokens:
● Capitalization
● Languages other than English
● Emojis
● Programming code with keywords and whitespace often used for indentation (in languages like Python, for example)
● Numbers and digits
● Special tokens
28. Given the following code:

from transformers import AutoTokenizer

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    # Load the tokenizer and print each token on its own colored background
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
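The function can then be called with the test text above and any tokenizer name from the Hugging Face Hub. For example (a sketch; the checkpoint names are common public ones, not prescribed by these slides):

show_tokens(text, "bert-base-uncased")
show_tokens(text, "gpt2")
show_tokens(text, "google/flan-t5-small")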
29. BERT base model (uncased) (2018) (Hugging Face link)
1. Tokenization method: WordPiece, introduced in "Japanese and Korean Voice Search"
2. Vocabulary size: 30,522
3. Special tokens:
a. unk_token [UNK]: An unknown token for text the tokenizer has no specific encoding for.
b. sep_token [SEP]: A separator that enables certain tasks that require giving the model two texts.
c. pad_token [PAD]: A padding token used to pad unused positions in the model's input (as the model expects a certain length of input, its context size).
d. cls_token [CLS]: A special classification token for classification tasks.
e. mask_token [MASK]: A masking token used to hide tokens during the training process.
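These values can be checked directly on the loaded tokenizer (a minimal sketch, assuming the transformers library and the bert-base-uncased checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)           # 30522
print(tokenizer.special_tokens_map)   # shows [UNK], [SEP], [PAD], [CLS], [MASK]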
30. BERT base model (uncased) (2018) (Tokenized text)
● The newline breaks are gone
● All the text is in lowercase.
● The word “capitalization” is encoded as two subtokens: capital ##ization. The ##
characters are used to indicate this token is a partial token connected to the token
that precedes it.
● The emoji and Chinese characters are gone and replaced with the [UNK] special
token indicating an “unknown token.”
31. GPT-2 (2019) (Hugging Face link)
1. Tokenization method: Byte pair encoding (BPE), introduced in "Neural Machine Translation of Rare Words with Subword Units"
2. Vocabulary size: 50,257
3. Special tokens: <|endoftext|>
32. GPT-2 (2019) (Tokenized text)
● The newline breaks are represented in the tokenizer
● Capitalization is preserved.
● The emoji and Chinese characters are now represented by multiple tokens each. While we see these tokens printed as the � character, they actually stand for different tokens.
33. GPT-2 (2019) (Tokenized text)
● The two tabs are represented as two tokens (token number 197 in that
vocabulary) and the four spaces are represented as three tokens (number 220)
with the final space being a part of the token for the closing quote character.
34. Flan-T5 (2022) (Hugging Face link)
1. Tokenization method: SentencePiece (SentencePiece: A simple
and language independent subword tokenizer and detokenizer
for Neural Text Processing), which supports BPE and the
unigram language model
2. Vocabulary size: 32,100
3. Special tokens:
a. unk_token <unk>
b. pad_token <pad>
35. Flan-T5 (2022) (Tokenized text)
● No newline or whitespace tokens; this would make it
challenging for the model to work with code.
● The emoji and Chinese characters are both replaced by the
<unk> token, making the model completely blind to them.
36. GPT-4 (2023)
1. Tokenization method: BPE
2. Vocabulary size: A little over 100,000
3. Special tokens:
a. <|endoftext|>
b. Fill-in-the-middle tokens. These three tokens enable the LLM to generate a completion given not only the text before it but also the text after it:
i. <|fim_prefix|>
ii. <|fim_middle|>
iii. <|fim_suffix|>
37. GPT-4 (2023) (Tokenized text)
The GPT-4 tokenizer behaves similarly to its ancestor, the GPT-2 tokenizer.
Some differences are:
● The GPT-4 tokenizer represents the four spaces as a single token.
● The Python keyword elif has its own token.
● The GPT-4 tokenizer uses fewer tokens to represent most words. Examples here include "CAPITALIZATION".
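The GPT-4 tokenizer can also be inspected with OpenAI's tiktoken library (a sketch under that assumption; the exact splits are best checked by running it):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Four leading spaces followed by the Python keyword elif
ids = enc.encode("    elif")
print([enc.decode([i]) for i in ids])

# How many tokens a long word takes
print([enc.decode([i]) for i in enc.encode("CAPITALIZATION")])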
38. StarCoder2 (2024)
1. Tokenization method: Byte pair encoding (BPE)
2. Vocabulary size: 49,152
3. Special tokens:
a. <|endoftext|>
b. Fill in the middle tokens:
i. <fim_prefix>
ii. <fim_middle>
iii. <fim_suffix>
iv. <fim_pad>
c. Special tokens for the name of the repository and the filename:
i. <filename>
ii. <reponame>
iii. <gh_stars>
39. StarCoder2 (2024) (Tokenized text)
This is a model that focuses on code generation:
● It encodes the list of whitespace characters as a single token.
● Each digit is assigned its own token (so 600 becomes 6 0 0).
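To see the per-digit behavior, the show_tokens function from earlier can be pointed at a StarCoder2 checkpoint (the checkpoint name is an assumption; any StarCoder2 variant on the Hugging Face Hub should tokenize digits the same way):

show_tokens("12.0*50=600", "bigcode/starcoder2-15b")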
40. Phi-3 (and Llama 2)
1. Tokenization method: Byte pair encoding (BPE)
2. Vocabulary size: 32,000
3. Special tokens:
a. <|endoftext|>
b. Chat tokens:
i. <|user|>
ii. <|assistant|>
iii. <|system|>
41. Tokenizer Properties
1. Tokenization methods: Byte pair encoding (BPE), WordPiece, etc.
2. Tokenizer parameters:
a. Vocabulary size
b. Special tokens: Beginning of text token (e.g., <s>), End of text token, Padding token, Unknown token, CLS token, Masking token
c. Capitalization
3. The domain of the data
43. Token embeddings
The next piece of the puzzle is finding the best numerical
representation for these tokens that the model can use to
calculate and properly model the patterns in the text.
These patterns reveal themselves to us as a model's coherence in a specific language, its capability to code, etc.
44. A language model holds an embedding vector associated
with each token in its tokenizer.
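A small sketch of inspecting that embedding matrix (assuming the transformers library; bert-base-uncased is just an example checkpoint):

from transformers import AutoModel

# One embedding vector (row) per token in the tokenizer's vocabulary
model = AutoModel.from_pretrained("bert-base-uncased")
print(model.get_input_embeddings().weight.shape)   # (vocabulary size, embedding dimension)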
45. Language models produce contextualized token
embeddings that improve on raw, static token embeddings.
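A sketch of the difference (same assumptions as above): passing token IDs through the model produces one output vector per token that depends on the surrounding context, unlike the static rows of the embedding matrix.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank of the river", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, number of tokens, hidden size) -> one contextualized vector per token
print(outputs.last_hidden_state.shape)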