Tokens and embeddings
Generative AI
Marlon S. Viñán Ludeña
Learning outcomes
Use Generative AI techniques and tools in the development of
solutions with business impact.
Learning objectives
Understand the tokenization and embedding processes used in
different generative AI tools.
Summary
● LLM tokenization
○ Word Versus Subword Versus Character
Versus Byte Tokens
○ Comparing Trained LLM Tokenizers
● Token embeddings
LLM Tokenization
Tokens and embeddings
How Tokenizers Prepare the Inputs to the Language
Model
High-level view of a language
model and its input prompt.
OpenAI platform example: "Have the bards who preceded
me left any theme unsung?"
Downloading and Running an LLM
The torch_dtype="auto" setting instructs the
library to automatically select an
appropriate data type.
Downloading and Running an LLM
CUDA (Compute Unified
Device Architecture) is a
parallel computing
platform and API
developed by NVIDIA
Downloading and Running an LLM
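The loading and generation code is not reproduced on the slide; the following is a minimal sketch using Hugging Face transformers. The model name and prompt are illustrative assumptions (the prompt is chosen to match the tokens discussed on the next slides):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model; any causal LLM on the Hub works similarly
model_name = "microsoft/Phi-3-mini-4k-instruct"

# torch_dtype="auto" lets the library pick an appropriate data type;
# device_map="cuda" places the model on an NVIDIA GPU via CUDA
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize an example prompt and generate a completion
prompt = "Write an email apologizing for the tragic gardening mishap."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
generation_output = model.generate(input_ids=input_ids, max_new_tokens=40)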
Let’s print input_ids to see what it holds inside: a list of integers.
A tokenizer processes the input prompt and prepares the
actual input into the language model: a list of token IDs.
If we want to inspect those IDs, we can use the tokenizer’s
decode method to translate the IDs back into text that we
can read:
output
Notice the following:
● The first token is ID 1 (<s>), a special token indicating the
beginning of the text.
● Some tokens are complete words (e.g., Write, an, email).
● Some tokens are parts of words (e.g., apolog, izing, trag,
ic).
● Punctuation characters are their own token.
We can also inspect the tokens generated by the model by
printing the generation_output variable.
We can pass the decode method an individual token ID or a list of them.
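For example (a sketch building on the variables defined above):

# Decode the full list of input token IDs back into text
print(tokenizer.decode(input_ids[0]))

# Decode the tokens generated by the model
print(tokenizer.decode(generation_output[0]))

# decode also accepts a single token ID
print(tokenizer.decode(input_ids[0][1]))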
How Does the Tokenizer Break Down Text? (Three factors)
First, at model design time, the creator of the model chooses a
tokenization method. Popular methods include byte pair encoding (BPE)
(widely used by GPT models) and WordPiece (used by BERT).
Second, after choosing the method, we need to make a number of
tokenizer design choices like vocabulary size and what special tokens to
use.
Third, the tokenizer needs to be trained on a specific dataset to establish
the best vocabulary it can use to represent that dataset.
Tokenizers are also used to process the output of the
model
Word Versus Subword Versus Character Versus
Byte Tokens
Word Tokens
This approach was common with earlier methods like word2vec but is
being used less and less in NLP. Its usefulness, however, led it to be used
outside of NLP for use cases such as recommendation systems.
One challenge with word tokenization is that the tokenizer may be unable
to deal with new words that enter the dataset after the tokenizer was
trained.
Subword Tokens
This method contains full and partial words. In addition to the vocabulary
expressivity mentioned earlier, another benefit of the approach is its ability to
represent new words by breaking down the new token into smaller characters,
which tend to be a part of the vocabulary.
Character Tokens
This is another method that can deal successfully with new words because it has
the raw letters to fall back on. While that makes the representation easier to
tokenize, it makes the modeling more difficult.
Byte Tokens
One additional tokenization method breaks down tokens into the individual bytes that are
used to represent Unicode characters.
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
Representation
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Methods of tokenization (summary)
Comparing Trained LLM Tokenizers
Given the following text:
text = """
English and CAPITALIZATION
鸟
show_tokens False None elif ==
>= else: two tabs:" " Three
tabs:
" "
12.0*50=600
"""
We will see how each tokenizer deals with
a number of different kinds of tokens:
● Capitalization
● Languages other than English
● Emojis
● Programming code with keywords
and whitespaces often used for
indentation (in languages like Python
for example).
● Numbers and digits.
● Special tokens
Given the following code:
from transformers import AutoTokenizer

# Background colors (ANSI RGB values) used to highlight alternating tokens
colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47',
]

def show_tokens(sentence, tokenizer_name):
    # Load the tokenizer and split the sentence into token IDs
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    # Print each decoded token with a rotating background color
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m'
            + tokenizer.decode(t)
            + '\x1b[0m',
            end=' ',
        )
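For example, show_tokens(text, "bert-base-uncased") prints the sample text colored token by token (assuming the bert-base-uncased checkpoint on the Hugging Face Hub for the BERT model discussed next).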
BERT base model (uncased) (2018) (Link hugging face)
1. Tokenization method: WordPiece, introduced in the paper "Japanese and
Korean Voice Search"
2. Vocabulary size: 30,522
3. Special tokens:
a. unk_token [UNK]: An unknown token that the tokenizer has no specific encoding for.
b. sep_token [SEP]: A separator that enables certain tasks that require giving the model two texts
c. pad_token [PAD]: A padding token used to pad unused positions in the model’s input (as the
model expects a certain length of input, its context-size).
d. cls_token [CLS]: A special classification token for classification tasks
e. mask_token [MASK]: A masking token used to hide tokens during the training process.
BERT base model (uncased) (2018) (Tokenized text)
● The newline breaks are gone
● All the text is in lowercase.
● The word “capitalization” is encoded as two subtokens: capital ##ization. The ##
characters are used to indicate this token is a partial token connected to the token
that precedes it.
● The emoji and Chinese characters are gone and replaced with the [UNK] special
token indicating an “unknown token.”
GPT-2 (2019) (Link hugging face)
1. Tokenization method: Byte pair encoding (BPE),
introduced in "Neural Machine Translation of Rare
Words with Subword Units"
2. Vocabulary size: 50,257
3. Special tokens: <|endoftext|>
GPT-2 (2019) (Tokenized text)
● The newline breaks are represented in the tokenizer
● Capitalization is preserved.
● The emoji and the Chinese character 鸟 are now represented by multiple
tokens each. While we see these tokens printed as the � character, they
actually stand for different tokens
GPT-2 (2019) (Tokenized text)
● The two tabs are represented as two tokens (token number 197 in that
vocabulary) and the four spaces are represented as three tokens (number 220)
with the final space being a part of the token for the closing quote character.
Flan-T5 (2022) (Link hugging face)
1. Tokenization method: SentencePiece (SentencePiece: A simple
and language independent subword tokenizer and detokenizer
for Neural Text Processing), which supports BPE and the
unigram language model
2. Vocabulary size: 32,100
3. Special tokens:
a. unk_token <unk>
b. pad_token <pad>
Flan-T5 (2022) (Tokenized text)
● No newline or whitespace tokens; this would make it
challenging for the model to work with code.
● The emoji and Chinese characters are both replaced by the
<unk> token, making the model completely blind to them.
GPT-4 (2023)
1. Tokenization method: BPE
2. Vocabulary size: A little over 100,000
3. Special tokens:
a. <|endoftext|>
b. Three fill-in-the-middle tokens that enable the LLM to generate a
completion given not only the text before it but also the text after
it:
i. <|fim_prefix|>
ii. <|fim_middle|>
iii. <|fim_suffix|>
GPT-4 (2023) (Tokenized text)
The GPT-4 tokenizer behaves similarly to its ancestor, the GPT-2 tokenizer.
Some differences are:
● The GPT-4 tokenizer represents the four spaces as a single token.
● The Python keyword elif has its own token
● The GPT-4 tokenizer uses fewer tokens to represent most words. Examples here include
“CAPITALIZATION”
StarCoder2 (2024)
1. Tokenization method: Byte pair encoding (BPE)
2. Vocabulary size: 49,152
3. Special tokens:
a. <|endoftext|>
b. Fill in the middle tokens:
i. <fim_prefix>
ii. <fim_middle>
iii. <fim_suffix>
iv. <fim_pad>
c. Special tokens for the name of the repository and the filename:
i. <filename>
ii. <reponame>
iii. <gh_stars>
StarCoder2 (2024) (Tokenized text)
This tokenizer belongs to a model that focuses on code generation:
● Encodes the list of whitespaces as a single token.
● Each digit is assigned its own token (so 600 becomes 6 0
0)
Phi-3 (and Llama 2)
1. Tokenization method: Byte pair encoding (BPE)
2. Vocabulary size: 32,000
3. Special tokens:
a. <|endoftext|>
b. Chat tokens:
i. <|user|>
ii. <|assistant|>
iii. <|system|>
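Chat tokens like these are normally inserted by the tokenizer's chat template rather than typed by hand; a minimal sketch, assuming the microsoft/Phi-3-mini-4k-instruct checkpoint:

from transformers import AutoTokenizer

# Assumed checkpoint; any chat model with a chat template works similarly
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "user", "content": "Write a haiku about tokenizers."},
]

# Render the conversation with the model's special chat tokens
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # shows the chat markup (e.g., <|user|>, <|assistant|>) inserted by the template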
Tokenizer Properties
1. Tokenization methods: Byte pair encoding (BPE), WordPiece, etc.
2. Tokenizer parameters:
a. Vocabulary size
b. Special tokens: Beginning of text token (e.g., <s>), End of text
token, Padding token, Unknown token, CLS token, Masking token
c. Capitalization
3. The domain of the data
Token embeddings
Token embeddings
The next piece of the puzzle is finding the best numerical
representation for these tokens that the model can use to
calculate and properly model the patterns in the text.
These patterns reveal themselves to us as a model’s coherence
in a specific language, or capability to code, etc.
A language model holds an embedding vector associated
with each token in its tokenizer.
Language models produce contextualized token
embeddings that improve on raw, static token embeddings.
Code to generate contextualized word embeddings
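The code itself is not reproduced on the slide; the following is a minimal sketch, assuming the microsoft/deberta-v3-xsmall checkpoint (whose hidden size of 384 matches the output described below):

from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint with a hidden size of 384
model_name = "microsoft/deberta-v3-xsmall"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a short input and run it through the model
tokens = tokenizer("Hello world", return_tensors="pt")
output = model(**tokens)[0]          # last hidden state

print(output.shape)                  # e.g., torch.Size([1, 4, 384])
for token_id in tokens["input_ids"][0]:
    print(tokenizer.decode(token_id))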
Output
The output contains four tokens, each represented by a vector of 384 values.
The leading batch dimension is there for cases (like training) where multiple
inputs are processed at once.
But what are these four vectors? They correspond to the four token IDs
produced by the tokenizer, including the tokens it added around the input text.
Our language model has now processed the text input.
The applications of large language models build on top of
outputs like this.
A language model operates on raw, static embeddings as
its input and produces contextual text embeddings.
Text Embeddings (for Sentences and Whole
Documents)
An embedding model extracts features from the input text and
converts it into embeddings.
Code for sentence embeddings
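A minimal sketch using the sentence-transformers library (the all-mpnet-base-v2 checkpoint is an illustrative assumption):

from sentence_transformers import SentenceTransformer

# Assumed checkpoint; any sentence-embedding model works similarly
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "Tokenizers split text into token IDs.",
    "Embeddings turn tokens into vectors.",
]

# Each sentence is converted into a single fixed-size vector
embeddings = model.encode(sentences)
print(embeddings.shape)   # (2, 768) for this assumed model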
In the coming weeks we will see how important these sentence embeddings are for RAG applications.
Questions?