Tokenization: How LLMs Process Text into Tokens
In our previous post, we explored how language modeling enables LLMs to understand and generate text. But before any of that happens, the text must be processed into a machine-readable format. This is where tokenization comes in—a crucial step where words or subwords are converted into tokens and then assigned unique numerical IDs.
What is Tokenization?
Tokenization is the process of splitting text into smaller units called tokens, which could be:
Words: "Artificial intelligence is amazing!" → ["Artificial", "intelligence", "is", "amazing", "!"].
Subwords: For example, "unbelievable" → ["un", "believable"].
Characters: In some cases, each letter can be treated as a token (e.g., "AI" → ["A", "I"]).
Once tokens are created, they are converted into unique numerical IDs that the model can process.
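To make the three granularities concrete, here is a tiny sketch in plain Python (the subword split shown is only illustrative; real subword splits come from a trained tokenizer, not from hand-written rules):

```python
text = "Artificial intelligence is amazing!"

# Word-level: split on whitespace (punctuation stays attached in this naive version)
word_tokens = text.split()              # ['Artificial', 'intelligence', 'is', 'amazing!']

# Character-level: every character becomes its own token
char_tokens = list("AI")                # ['A', 'I']

# Subword-level: a trained tokenizer decides the splits, e.g. "unbelievable" -> ["un", "believable"]
subword_tokens = ["un", "believable"]   # illustrative only

print(word_tokens, char_tokens, subword_tokens)
```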
How Are Tokens Assigned Numbers?
Each token is mapped to a unique number based on the model’s vocabulary file, which is essentially a lookup table of all possible tokens the model knows. Here’s how it works:
Vocabulary Creation
Before training, the model's tokenizer builds a vocabulary from its training dataset. This vocabulary contains all unique tokens (words, subwords, or characters) the tokenizer can produce, and each token is assigned a unique ID. In frequency-based schemes such as byte-pair encoding, the pieces that appear most often are added to the vocabulary first, so:
Frequently used tokens (e.g., "the", "is", "and") tend to receive lower IDs.
Rare or complex words (e.g., "photosynthesis") tend to receive higher IDs, or are split into several smaller subword tokens.
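As a toy sketch of that ordering idea (my own illustration; production tokenizers such as BPE or WordPiece learn subword merges rather than counting whole words, so this is not how they are actually trained):

```python
from collections import Counter

# Toy corpus (real vocabularies are built from billions of tokens)
corpus = "the cat sat on the mat and the cat slept".split()

# Count how often each token appears
counts = Counter(corpus)

# Assign IDs in order of decreasing frequency: common tokens get the smallest IDs
vocab = {token: idx for idx, (token, _) in enumerate(counts.most_common())}
print(vocab)
# {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'and': 5, 'slept': 6}
```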
Token-to-ID Mapping
When you input text into the model, each token is looked up in this vocabulary file and replaced with its corresponding ID. For example:
Vocabulary: {"artificial": 1212, "intelligence": 7099, "is": 2003, "amazing": 6429}
Input Sentence: "Artificial intelligence is amazing!"
Token IDs: [1212, 7099, 2003, 6429] (assuming a lowercasing tokenizer; the ID for "!" is omitted from this toy vocabulary for brevity).
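A minimal sketch of that lookup step, using the toy vocabulary above (all IDs here are illustrative, not taken from a real model):

```python
# Toy vocabulary: a plain lookup table from token to ID
vocab = {"artificial": 1212, "intelligence": 7099, "is": 2003, "amazing": 6429}

tokens = ["artificial", "intelligence", "is", "amazing"]

# Replace each token with its ID from the lookup table
token_ids = [vocab[token] for token in tokens]
print(token_ids)  # [1212, 7099, 2003, 6429]
```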
Encoding Efficiency:
Common words are stored in the vocabulary as single tokens, so typical text is encoded with relatively few tokens, and subword tokenization ensures that even unknown words can be broken down into smaller components and mapped to existing IDs rather than being treated as unrecognized.
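For instance, assuming the Hugging Face transformers package is available, you can watch a made-up word fall back to known subword pieces (the exact pieces depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# "tokenizationally" is not a real word, so it is unlikely to be a single vocabulary entry;
# the tokenizer breaks it into smaller pieces it already knows.
print(tokenizer.tokenize("tokenizationally"))
```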
Why Does ‘Artificial’ Become 1212 and Not 2981?
The assignment of numbers depends entirely on the vocabulary file created when the tokenizer was trained. In frequency-based schemes, tokens that appear more often in the training data are added to the vocabulary earlier:
More frequent tokens tend to get smaller IDs (e.g., "the" might be 1).
Less frequent or rare tokens get larger IDs (e.g., "photosynthesis" might be 10,000), or are split into several subword pieces.
The number itself carries no special meaning during inference; it is simply the token's index in the vocabulary, which the model uses to look up that token's embedding.
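You can inspect that lookup table directly. A short sketch, assuming the transformers package and the GPT-2 tokenizer (whatever IDs you see are simply whatever that tokenizer's vocabulary file contains):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

vocab = tokenizer.get_vocab()   # token -> ID lookup table
print(len(vocab))               # total number of tokens the model knows
print(vocab["Ġthe"])            # the ID is simply this token's position in that table
```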
Example Code to Understand Tokenization
Here’s how you can tokenize text and see how tokens are converted into numerical IDs using Hugging Face’s transformers library:
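(A minimal sketch, assuming the transformers package is installed. The GPT-2 tokenizer is used because it matches the "Ġ" convention discussed in the note below; the splits shown in the comments are illustrative and depend on the vocabulary.)

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer (GPT-2 uses byte-level BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Artificial intelligence is amazing!"

# 1. Split the text into subword tokens
tokens = tokenizer.tokenize(text)
print(tokens)
# e.g. ['Artificial', 'Ġintelligence', 'Ġis', 'Ġamazing', '!'] (exact split depends on the vocabulary)

# 2. Map each token to its numerical ID in the vocabulary
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# 3. Convert the IDs back into readable text
print(tokenizer.decode(token_ids))  # "Artificial intelligence is amazing!"
```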
Note:
The "Ġ" in tokenized outputs represents a space before a word and is used by some tokenizers (like GPT-2) to explicitly mark word boundaries for clarity and efficiency. Instead of treating spaces as separate tokens, the tokenizer prepends "Ġ" to indicate that the token follows a space.
For example, in the sentence "Artificial intelligence is amazing!", the tokens are ['Artificial', 'Ġintelligence', 'Ġis', 'Ġamazing', '!'], where "Ġintelligence" means there’s a space before "intelligence."
This approach helps LLMs distinguish between words at the start of a sentence versus those following others, improving context understanding and processing efficiency. During decoding, the "Ġ" is removed, reconstructing the original text seamlessly.
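A quick way to see this in practice, again assuming the GPT-2 tokenizer from transformers (the exact splits depend on its vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The same word tokenizes differently depending on whether a space precedes it:
print(tokenizer.tokenize("intelligence"))    # as at the start of a sentence
print(tokenizer.tokenize(" intelligence"))   # mid-sentence, so the token carries the 'Ġ' marker
```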
Here’s what’s happening:
The tokenizer splits the sentence into subword tokens such as ['Artificial', 'Ġintelligence', 'Ġis', 'Ġamazing', '!'].
Each token is replaced with its corresponding ID from the vocabulary file.
The decode step converts those numerical IDs back into readable text.
Real-Life Applications of Tokenization
Tokenization enables LLMs to perform tasks like:
Text Generation: Breaking down input prompts for coherent responses.
Machine Translation: Splitting sentences for accurate translations.
Search Engines: Analyzing queries to match relevant results.
For example: when you type "Translate ‘hello’ to French" into a chatbot, tokenization breaks your query into tokens and IDs so the model can understand and respond appropriately.
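A tiny illustration of that first step, assuming the same GPT-2 tokenizer as above (the chatbot’s model only ever sees the resulting IDs):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

query = "Translate 'hello' to French"
print(tokenizer.tokenize(query))   # the query split into subword tokens
print(tokenizer.encode(query))     # the numerical IDs the model actually receives
```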
Why Does This Matter?
Tokenization is the first step in making raw text understandable for LLMs. By converting words into numerical representations, models can analyze patterns and generate meaningful responses efficiently.
What’s Next?
In our next post, we’ll dive deeper into another core concept: the attention mechanism, which allows LLMs to focus on the most relevant parts of input text while generating outputs.