Tokenization: How LLMs Process Text into Tokens
In our previous post, we explored how language modeling enables LLMs to understand and generate text. But before any of that happens, the text must be processed into a machine-readable format. This is where tokenization comes in—a crucial step where words or subwords are converted into tokens and then assigned unique numerical IDs.
What is Tokenization?
Tokenization is the process of splitting text into smaller units called tokens, which could be:
Words: "Artificial intelligence is amazing!" → ["Artificial", "intelligence", "is", "amazing", "!"].
Subwords: For example, "unbelievable" → ["un", "believable"].
Characters: In some cases, each letter can be treated as a token (e.g., "AI" → ["A", "I"]).
Once tokens are created, they are converted into unique numerical IDs that the model can process.
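To make the three granularities concrete, here is a tiny sketch in plain Python (the subword split shown is only illustrative; real subword splits come from a trained tokenizer, not from hand-written rules):

```python
text = "Artificial intelligence is amazing!"

# Word-level: split on whitespace (punctuation stays attached in this naive version)
word_tokens = text.split()              # ['Artificial', 'intelligence', 'is', 'amazing!']

# Character-level: every character becomes its own token
char_tokens = list("AI")                # ['A', 'I']

# Subword-level: a trained tokenizer decides the splits, e.g. "unbelievable" -> ["un", "believable"]
subword_tokens = ["un", "believable"]   # illustrative only

print(word_tokens, char_tokens, subword_tokens)
```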
How Are Tokens Assigned Numbers?
Each token is mapped to a unique number based on the model’s vocabulary file, which is essentially a lookup table of all possible tokens the model knows. Here’s how it works:
Vocabulary Creation
Before training, the model's tokenizer builds a vocabulary from its training dataset. This vocabulary contains all unique tokens (words, subwords, or characters) the tokenizer can produce, and each token is assigned a unique ID. In frequency-based schemes such as byte-pair encoding, the pieces that appear most often are added to the vocabulary first, so:
Frequently used tokens (e.g., "the", "is", "and") tend to receive lower IDs.
Rare or complex words (e.g., "photosynthesis") tend to receive higher IDs, or are split into several smaller subword tokens.
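As a toy sketch of that ordering idea (my own illustration; production tokenizers such as BPE or WordPiece learn subword merges rather than counting whole words, so this is not how they are actually trained):

```python
from collections import Counter

# Toy corpus (real vocabularies are built from billions of tokens)
corpus = "the cat sat on the mat and the cat slept".split()

# Count how often each token appears
counts = Counter(corpus)

# Assign IDs in order of decreasing frequency: common tokens get the smallest IDs
vocab = {token: idx for idx, (token, _) in enumerate(counts.most_common())}
print(vocab)
# {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'and': 5, 'slept': 6}
```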
Token-to-ID Mapping
When you input text into the model, each token is looked up in this vocabulary file and replaced with its corresponding ID. For example:
Vocabulary: {"artificial": 1212, "intelligence": 7099, "is": 2003, "amazing": 6429}
Input Sentence: "Artificial intelligence is amazing!"
Token IDs: [1212, 7099, 2003, 6429] (assuming a lowercasing tokenizer; the ID for "!" is omitted from this toy vocabulary for brevity).
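A minimal sketch of that lookup step, using the toy vocabulary above (all IDs here are illustrative, not taken from a real model):

```python
# Toy vocabulary: a plain lookup table from token to ID
vocab = {"artificial": 1212, "intelligence": 7099, "is": 2003, "amazing": 6429}

tokens = ["artificial", "intelligence", "is", "amazing"]

# Replace each token with its ID from the lookup table
token_ids = [vocab[token] for token in tokens]
print(token_ids)  # [1212, 7099, 2003, 6429]
```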
Encoding Efficiency:
Common words are stored in the vocabulary as single tokens, so typical text is encoded with relatively few tokens, and subword tokenization ensures that even unknown words can be broken down into smaller components and mapped to existing IDs rather than being treated as unrecognized.
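For instance, assuming the Hugging Face transformers package is available, you can watch a made-up word fall back to known subword pieces (the exact pieces depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "tokenizationally" is not a real word, so it is unlikely to be a single vocabulary entry;
# the tokenizer breaks it into smaller pieces it already knows.
print(tokenizer.tokenize("tokenizationally"))
```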
Why Does ‘Artificial’ Become 1212 and Not 2981?
The assignment of numbers depends entirely on the vocabulary file created when the tokenizer was trained. In frequency-based schemes, tokens that appear more often in the training data are added to the vocabulary earlier:
More frequent tokens tend to get smaller IDs (e.g., "the" might be 1).
Less frequent or rare tokens get larger IDs (e.g., "photosynthesis" might be 10,000), or are split into several subword pieces.
The number itself carries no special meaning during inference; it is simply the token's index in the vocabulary, which the model uses to look up that token's embedding.
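You can inspect that lookup table directly. A short sketch, assuming the transformers package and the GPT-2 tokenizer (whatever IDs you see are simply whatever that tokenizer's vocabulary file contains):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

vocab = tokenizer.get_vocab()   # token -> ID lookup table
print(len(vocab))               # total number of tokens the model knows
print(vocab["Ġthe"])            # the ID is simply this token's position in that table
```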
Example Code to Understand Tokenization
Here’s how you can tokenize text and see how tokens are converted into numerical IDs using Hugging Face’s transformers library:
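(A minimal sketch, assuming the transformers package is installed. The GPT-2 tokenizer is used because it matches the "Ġ" convention discussed in the note below; the splits shown in the comments are illustrative and depend on the vocabulary.)

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer (GPT-2 uses byte-level BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Artificial intelligence is amazing!"

# 1. Split the text into subword tokens
tokens = tokenizer.tokenize(text)
print(tokens)
# e.g. ['Artificial', 'Ġintelligence', 'Ġis', 'Ġamazing', '!'] (exact split depends on the vocabulary)

# 2. Map each token to its numerical ID in the vocabulary
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# 3. Convert the IDs back into readable text
print(tokenizer.decode(token_ids))  # "Artificial intelligence is amazing!"
```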
Note:
The "Ġ" in tokenized outputs represents a space before a word and is used by some tokenizers (like GPT-2) to explicitly mark word boundaries for clarity and efficiency. Instead of treating spaces as separate tokens, the tokenizer prepends "Ġ" to indicate that the token follows a space.
For example, in the sentence "Artificial intelligence is amazing!", the tokens are ['Artificial', 'Ġintelligence', 'Ġis', 'Ġamazing', '!'], where "Ġintelligence" means there’s a space before "intelligence."
This approach helps LLMs distinguish between words at the start of a sentence versus those following others, improving context understanding and processing efficiency. During decoding, the "Ġ" is removed, reconstructing the original text seamlessly.
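A quick way to see this in practice, again assuming the GPT-2 tokenizer from transformers (the exact splits depend on its vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The same word tokenizes differently depending on whether a space precedes it:
print(tokenizer.tokenize("intelligence"))    # as at the start of a sentence
print(tokenizer.tokenize(" intelligence"))   # mid-sentence, so the token carries the 'Ġ' marker
```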
Here’s what’s happening:
The tokenizer splits the sentence into subword tokens such as ['Artificial', 'Ġintelligence', 'Ġis', 'Ġamazing', '!'].
Each token is replaced with its corresponding ID from the vocabulary file.
The decode step converts those numerical IDs back into readable text.
Real-Life Applications of Tokenization
Tokenization enables LLMs to perform tasks like:
Text Generation: Breaking down input prompts for coherent responses.
Machine Translation: Splitting sentences for accurate translations.
Search Engines: Analyzing queries to match relevant results.
For example: when you type "Translate ‘hello’ to French" into a chatbot, tokenization breaks your query into tokens and IDs so the model can understand and respond appropriately.
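A tiny illustration of that first step, assuming the same GPT-2 tokenizer as above (the chatbot’s model only ever sees the resulting IDs):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

query = "Translate 'hello' to French"
print(tokenizer.tokenize(query))   # the query split into subword tokens
print(tokenizer.encode(query))     # the numerical IDs the model actually receives
```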
Why Does This Matter?
Tokenization is the first step in making raw text understandable for LLMs. By converting words into numerical representations, models can analyze patterns and generate meaningful responses efficiently.
What’s Next?
In our next post, we’ll dive deeper into another core concept: the attention mechanism, which allows LLMs to focus on the most relevant parts of input text while generating outputs.