13. A tokenizer processes the input prompt and prepares the actual input for the language model: a list of token IDs.
14. If we want to inspect those IDs, we can use the tokenizer’s
decode method to translate the IDs back into text that we
can read:
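A minimal sketch of both steps (assuming the transformers library and the microsoft/Phi-3-mini-4k-instruct checkpoint used later in these slides; the prompt is just an illustration):

from transformers import AutoTokenizer

# Any Hugging Face checkpoint works the same way; Phi-3 is assumed here
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

prompt = "Write an email apologizing to Sarah for the tragic gardening mishap."
input_ids = tokenizer(prompt).input_ids   # the list of token IDs fed to the model
print(input_ids)

# Translate each ID back into readable text
for token_id in input_ids:
    print(token_id, "->", tokenizer.decode(token_id))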
15. Notice the following:
● The first token is ID 1 (<s>), a special token indicating the
beginning of the text.
● Some tokens are complete words (e.g., Write, an, email).
● Some tokens are parts of words (e.g., apolog, izing, trag,
ic).
● Punctuation characters are their own tokens.
16. We can also inspect the tokens generated by the model by printing the generation_output variable.
17. We can pass the decode method an individual token ID or a list of token IDs.
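A sketch covering both points (same assumptions as the snippet above):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

input_ids = tokenizer("Write an email apologizing to Sarah.", return_tensors="pt").input_ids
generation_output = model.generate(input_ids=input_ids, max_new_tokens=20)

print(generation_output)                            # tensor of token IDs: prompt plus completion
print(tokenizer.decode(generation_output[0]))       # decode accepts a list of token IDs
print(tokenizer.decode(generation_output[0][-1]))   # ...or a single token ID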
18. How Does the Tokenizer Break Down Text? (Three factors)
First, at model design time, the creator of the model chooses a tokenization method. Popular methods include byte pair encoding (BPE) (widely used by GPT models) and WordPiece (used by BERT).
Second, after choosing the method, we need to make a number of
tokenizer design choices like vocabulary size and what special tokens to
use.
Third, the tokenizer needs to be trained on a specific dataset to establish
the best vocabulary it can use to represent that dataset.
21. Word Tokens
This approach was common with earlier methods like word2vec but is
being used less and less in NLP. Its usefulness, however, led it to be used
outside of NLP for use cases such as recommendation systems.
One challenge with word tokenization is that the tokenizer may be unable
to deal with new words that enter the dataset after the tokenizer was
trained.
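A toy sketch (not from the source) of why a fixed word-level vocabulary struggles with words it never saw during training:

# Word-level tokenizer with a vocabulary fixed at training time
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def word_tokenize(text):
    # Any word missing from the vocabulary falls back to <unk>
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(word_tokenize("the cat sat"))      # [1, 2, 3]
print(word_tokenize("the cat yawned"))   # [1, 2, 0]  <- "yawned" becomes <unk>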
22. Subword Tokens
This method's vocabulary contains full and partial words. In addition to the vocabulary expressivity mentioned earlier, another benefit of the approach is its ability to represent new words by breaking them down into smaller pieces, which tend to be part of the vocabulary.
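For instance, with the GPT-2 BPE tokenizer covered later in these slides (the example word is an arbitrary choice):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A rare word has no dedicated token, so it is broken into subword pieces
token_ids = tokenizer("unfathomability").input_ids
print([tokenizer.decode(t) for t in token_ids])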
23. Character Tokens
This is another method that can deal successfully with new words because it has
the raw letters to fall back on. While that makes the representation easier to
tokenize, it makes the modeling more difficult.
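A one-line illustration (not from the source):

# Character-level tokenization: any word can be represented,
# but sequences get long and the model must learn spelling patterns itself
print(list("apologizing"))   # ['a', 'p', 'o', 'l', 'o', 'g', 'i', 'z', 'i', 'n', 'g']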
24. Byte Tokens
One additional tokenization method breaks down tokens into the individual bytes that are used to represent Unicode characters.
● CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
● ByT5: Towards a token-free future with pre-trained byte-to-byte models
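A quick illustration of the idea (not from the source):

# Byte-level tokenization falls back to the UTF-8 bytes of the text,
# so a vocabulary of only 256 byte values can cover any character
print(list("鸟".encode("utf-8")))   # [233, 184, 159]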
27. Given the following text:
text = """
English and CAPITALIZATION
鸟
show_tokens False None elif == >= else: two tabs:"		" Three tabs: "			"
12.0*50=600
"""
We will see how each tokenizer deals with a number of different kinds of tokens:
● Capitalization
● Languages other than English
● Emojis
● Programming code with keywords and whitespace often used for indentation (in languages like Python, for example)
● Numbers and digits
● Special tokens
28. Given the following code:

from transformers import AutoTokenizer

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    # Load the tokenizer and print each token on its own colored background
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
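The function can then be called with the test text above and any tokenizer name from the Hugging Face Hub. For example (a sketch; the checkpoint names are common public ones, not prescribed by these slides):

show_tokens(text, "bert-base-uncased")
show_tokens(text, "gpt2")
show_tokens(text, "google/flan-t5-small")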
29. BERT base model (uncased) (2018) (Hugging Face link)
1. Tokenization method: WordPiece, introduced in "Japanese and Korean Voice Search"
2. Vocabulary size: 30,522
3. Special tokens:
a. unk_token [UNK]: An unknown token for text the tokenizer has no specific encoding for.
b. sep_token [SEP]: A separator that enables certain tasks that require giving the model two texts.
c. pad_token [PAD]: A padding token used to pad unused positions in the model's input (as the model expects a certain length of input, its context size).
d. cls_token [CLS]: A special classification token for classification tasks.
e. mask_token [MASK]: A masking token used to hide tokens during the training process.
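These values can be checked directly on the loaded tokenizer (a minimal sketch, assuming the transformers library and the bert-base-uncased checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)           # 30522
print(tokenizer.special_tokens_map)   # shows [UNK], [SEP], [PAD], [CLS], [MASK]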
30. BERT base model (uncased) (2018) (Tokenized text)
● The newline breaks are gone
● All the text is in lowercase.
● The word “capitalization” is encoded as two subtokens: capital ##ization. The ##
characters are used to indicate this token is a partial token connected to the token
that precedes it.
● The emoji and Chinese characters are gone and replaced with the [UNK] special
token indicating an “unknown token.”
31. GPT-2 (2019) (Hugging Face link)
1. Tokenization method: Byte pair encoding (BPE), introduced in "Neural Machine Translation of Rare Words with Subword Units"
2. Vocabulary size: 50,257
3. Special tokens: <|endoftext|>
32. GPT-2 (2019) (Tokenized text)
● The newline breaks are represented in the tokenizer
● Capitalization is preserved.
● The emoji and Chinese characters are now represented by multiple tokens each. While we see these tokens printed as the � character, they actually stand for different tokens.
33. GPT-2 (2019) (Tokenized text)
● The two tabs are represented as two tokens (token number 197 in that
vocabulary) and the four spaces are represented as three tokens (number 220)
with the final space being a part of the token for the closing quote character.
34. Flan-T5 (2022) (Hugging Face link)
1. Tokenization method: SentencePiece (SentencePiece: A simple
and language independent subword tokenizer and detokenizer
for Neural Text Processing), which supports BPE and the
unigram language model
2. Vocabulary size: 32,100
3. Special tokens:
a. unk_token <unk>
b. pad_token <pad>
35. Flan-T5 (2022) (Tokenized text)
● No newline or whitespace tokens; this would make it
challenging for the model to work with code.
● The emoji and Chinese characters are both replaced by the
<unk> token, making the model completely blind to them.
36. GPT-4 (2023)
1. Tokenization method: BPE
2. Vocabulary size: A little over 100,000
3. Special tokens:
a. <|endoftext|>
b. Fill-in-the-middle tokens. These three tokens enable the LLM to generate a completion given not only the text before it but also the text after it:
i. <|fim_prefix|>
ii. <|fim_middle|>
iii. <|fim_suffix|>
37. GPT-4 (2023) (Tokenized text)
The GPT-4 tokenizer behaves similarly to its ancestor, the GPT-2 tokenizer.
Some differences are:
● The GPT-4 tokenizer represents the four spaces as a single token.
● The Python keyword elif has its own token.
● The GPT-4 tokenizer uses fewer tokens to represent most words. Examples here include "CAPITALIZATION".
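The GPT-4 tokenizer can also be inspected with OpenAI's tiktoken library (a sketch under that assumption; the exact splits are best checked by running it):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Four leading spaces followed by the Python keyword elif
ids = enc.encode("    elif")
print([enc.decode([i]) for i in ids])

# How many tokens a long word takes
print([enc.decode([i]) for i in enc.encode("CAPITALIZATION")])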
38. StarCoder2 (2024)
1. Tokenization method: Byte pair encoding (BPE)
2. Vocabulary size: 49,152
3. Special tokens:
a. <|endoftext|>
b. Fill in the middle tokens:
i. <fim_prefix>
ii. <fim_middle>
iii. <fim_suffix>
iv. <fim_pad>
c. Special tokens for the name of the repository and the filename:
i. <filename>
ii. <reponame>
iii. <gh_stars>
39. StarCoder2 (2024) (Tokenized text)
This is a model that focuses on code generation:
● It encodes the list of whitespace characters as a single token.
● Each digit is assigned its own token (so 600 becomes 6 0 0).
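To see the per-digit behavior, the show_tokens function from earlier can be pointed at a StarCoder2 checkpoint (the checkpoint name is an assumption; any StarCoder2 variant on the Hugging Face Hub should tokenize digits the same way):

show_tokens("12.0*50=600", "bigcode/starcoder2-15b")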
40. Phi-3 (and Llama 2)
1. Tokenization method: Byte pair encoding (BPE)
2. Vocabulary size: 32,000
3. Special tokens:
a. <|endoftext|>
b. Chat tokens:
i. <|user|>
ii. <|assistant|>
iii. <|system|>
41. Tokenizer Properties
1. Tokenization methods: Byte pair encoding (BPE), WordPiece, etc.
2. Tokenizer parameters:
a. Vocabulary size
b. Special tokens: Beginning of text token (e.g., <s>), End of text token, Padding token, Unknown token, CLS token, Masking token
c. Capitalization
3. The domain of the data
43. Token embeddings
The next piece of the puzzle is finding the best numerical
representation for these tokens that the model can use to
calculate and properly model the patterns in the text.
These patterns reveal themselves to us as a model's coherence in a specific language, its capability to code, etc.
44. A language model holds an embedding vector associated
with each token in its tokenizer.
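A small sketch of inspecting that embedding matrix (assuming the transformers library; bert-base-uncased is just an example checkpoint):

from transformers import AutoModel

# One embedding vector (row) per token in the tokenizer's vocabulary
model = AutoModel.from_pretrained("bert-base-uncased")
print(model.get_input_embeddings().weight.shape)   # (vocabulary size, embedding dimension)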
45. Language models produce contextualized token
embeddings that improve on raw, static token embeddings.
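A sketch of the difference (same assumptions as above): passing token IDs through the model produces one output vector per token that depends on the surrounding context, unlike the static rows of the embedding matrix.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank of the river", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, number of tokens, hidden size) -> one contextualized vector per token
print(outputs.last_hidden_state.shape)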