How I Built a Transformer from Scratch!
Not too long ago, if you wanted a machine to understand language, you had to rely on models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory Networks (LSTMs). These models worked sequentially, meaning they read sentences one word at a time, like how humans read a book.
But they had serious limitations:
They struggled with long sentences: if a sentence was too long, information from earlier words would get “forgotten.”
They couldn’t process words in parallel, making them slow and hard to scale.
They had trouble capturing context: For example, in "I just tried that chocolate brownie ice cream from that cafe yesterday and it was amazing!" an LSTM might struggle to figure out whether “it” refers to “ice cream” or “cafe”.
For years, researchers tried different tricks to fix these problems, such as adding attention mechanisms on top of LSTMs, but the improvements were limited.
Then, in 2017, something amazing happened. A research paper titled “Attention Is All You Need” by Vaswani et al. introduced Transformers, a completely new architecture that replaced RNNs and LSTMs entirely.
And the impact? Massive!
BERT: The model that changed Google Search and NLP forever.
GPT-3 & GPT-4: The foundation of ChatGPT, trained on hundreds of billions of words.
DALL·E & Stable Diffusion: Even AI-generated images use Transformers!
AlphaFold: Revolutionized protein folding research using Transformer-based models.
Transformers are now everywhere, and they have changed not just NLP, but AI as a whole.
So I decided that if I really wanted to understand deep learning at its core, I needed to build a Transformer from scratch. No shortcuts. No pre-built libraries like Hugging Face. Just pure PyTorch and math.
In this post, I’ll break down:
How Transformers work (in simple words)
The steps I followed to build one
What I learned along the way
Let's jump right in!
How Do Transformers Work?
Imagine you're reading a long paragraph. As a human, you don’t read word by word and forget everything before it. Instead, your brain focuses on important words while keeping the overall meaning in mind.
That’s exactly what Transformers do: they focus on relevant words in a sentence using a mechanism called "Self-Attention."
Let’s break it down step by step:
Step 1: Understanding Self-Attention
Self-Attention is the core idea behind Transformers. Instead of processing words one by one like RNNs, a Transformer looks at the entire sentence at once and figures out which words are important to each other.
Example: Let’s take a sentence: "The dog got tired and slept on the rug since it was cozy"
For a model to understand what “it” refers to, it needs to pay more attention to “rug” than to “dog”. Self-attention helps the model dynamically assign importance to words based on context.
The Transformer calculates this using three key vectors:
Query (Q): What are we looking for?
Key (K): What does each word represent?
Value (V): What is the actual meaning of each word?
Each word in the sentence is converted into these Q, K, and V vectors. Then, the Transformer calculates attention scores, determining how much importance each word should give to others.
Step 2: Multi-Head Attention: Seeing from Multiple Angles
Instead of using just one self-attention mechanism, Transformers use multiple attention heads.
Think of it like looking at an object from different angles. One attention head might focus on grammatical structure, another on meaning, and another on word position.
This helps the model understand different aspects of a sentence simultaneously.
Step 3: Positional Encoding: Handling Word Order
Unlike RNNs, Transformers don’t read sentences in order; they process everything at once. So how do they understand the position of words?
They use Positional Encoding, a technique that adds a unique pattern to each word's representation, allowing the model to recognize word order even though everything is processed in parallel.
Step 4: The Encoder-Decoder Architecture
A standard Transformer consists of two main parts:
The Encoder: Reads and processes the input sentence.
The Decoder: Generates the output (used in tasks like translation).
If you’ve used GPT-4 or ChatGPT, you’ve used a Decoder-only model. If you’ve used BERT, it’s an Encoder-only model.
Why Is This Better Than RNNs?
Parallel Processing: Instead of processing words one by one, Transformers look at the entire sentence at once, making them faster and more scalable.
Better Context Understanding: Self-Attention allows Transformers to capture long-range dependencies, meaning they remember important words even in long paragraphs.
Handles Large Datasets: Models like GPT-4 are reportedly trained on trillions of tokens because Transformers can efficiently process massive amounts of text.
How I Built a Transformer from Scratch in PyTorch
Now that we understand how Transformers work, let’s dive into building one step by step!
Instead of relying on pre-built libraries like Hugging Face’s transformers, I implemented everything from scratch using PyTorch. Here’s the breakdown:
Step 1: Input Embeddings: Converting Words into Vectors
Since deep learning models work with numbers, the first step was to convert words into dense vectors using an embedding layer.
Here’s how I implemented it:
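The sketch below captures the idea; class names and hyperparameters are illustrative rather than the exact ones from my notebook. Following the original paper, the embeddings are scaled by √d_model:

```python
import math
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Maps token ids to dense vectors and scales them by sqrt(d_model), as in the original paper."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # x: (batch, seq_len) of token ids -> (batch, seq_len, d_model)
        return self.embedding(x) * math.sqrt(self.d_model)
```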
Step 2: Adding Positional Encodings
Since Transformers process all words at once, they need a way to understand word order. That’s where Positional Encoding comes in.
How It Works:
We create a fixed matrix of values based on sinusoidal functions (sin/cos).
The even indices use sin and the odd indices use cos, creating a unique encoding for each position.
This helps the Transformer distinguish words based on their position in the sequence.
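Here’s a minimal sketch of that encoding in PyTorch (the max_len and dropout values are illustrative):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal position information to the token embeddings."""
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)                            # (max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)                   # even indices -> sin
        pe[:, 1::2] = torch.cos(position * div_term)                   # odd indices  -> cos
        self.register_buffer("pe", pe.unsqueeze(0))                    # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the positional pattern, then apply dropout.
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```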
Step 3: Multi-Head Self-Attention: The Core of Transformers
After embedding the input and adding positional encodings, the next step is to let the model learn the relationships between words. The way Transformers do this is through self-attention, a mechanism that allows each word to attend to every other word in the sentence.
However, a single attention mechanism might not capture all the nuances of a sentence. That’s why the Transformer splits attention into multiple heads, each focusing on different aspects of the data.
How Multi-Head Attention Works:
Project Input Into Queries, Keys, and Values
Each word in the sequence is projected into three different vectors: a Query (Q) for what we are looking for, a Key (K) for what each word offers, and a Value (V) for the information each word carries.
These are created using three linear layers:
Compute Attention Scores
Attention is computed using the Scaled Dot-Product Attention formula:
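Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V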
This means we take the dot product of the Query (Q) and Key (K) to find relevance, scale the result by √d_k to stabilize gradients, apply softmax to normalize the scores, and finally multiply by the Values (V) to get the weighted sum.
Apply Multi-Head Attention
Instead of applying attention once, we split d_model into multiple smaller attention heads and run attention in parallel.
We combine the results at the end using another linear transformation.
Final Implementation Code:
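Below is a condensed sketch of the multi-head attention module; variable names and the exact shape handling in my notebook may differ slightly:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention split across several heads."""
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.w_q = nn.Linear(d_model, d_model)   # projects input to Queries
        self.w_k = nn.Linear(d_model, d_model)   # projects input to Keys
        self.w_v = nn.Linear(d_model, d_model)   # projects input to Values
        self.w_o = nn.Linear(d_model, d_model)   # final output projection
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        def split_heads(x):
            return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(self.w_q(q))
        k = split_heads(self.w_k(k))
        v = split_heads(self.w_v(v))

        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(torch.softmax(scores, dim=-1))
        out = torch.matmul(attn, v)

        # Recombine heads: (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.w_o(out)
```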
Step 4: Transformer Encoder: Stacking Attention & Feed-Forward Networks
After computing Multi-Head Self-Attention, we need to process the outputs in a structured way. This is done through Encoder Blocks, which are the fundamental building blocks of the Transformer Encoder.
What Happens in an Encoder Block?
Each Encoder Block consists of:
Multi-Head Self-Attention: Helps the model focus on relevant words in the sequence.
Add & Norm (Residual Connection + Layer Norm): Stabilizes training and prevents vanishing gradients.
Feed-Forward Network (FFN): Expands the model’s capacity to capture complex patterns.
Add & Norm Again: Another residual connection to keep training stable.
These blocks are stacked on top of each other to form the full Transformer Encoder.
EncoderBlock
Applies Multi-Head Self-Attention with a residual connection.
Applies a Feed-Forward Network (FFN) with another residual connection.
Encoder
A stack of EncoderBlocks (e.g., 6 layers in the original Transformer).
Applies Layer Normalization after processing all layers.
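Here’s a condensed sketch of both pieces, reusing the MultiHeadAttention module from the previous step (layer sizes and the residual/normalization arrangement shown are illustrative):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention and a feed-forward network, each wrapped in a residual connection + LayerNorm."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, src_mask)))  # Add & Norm
        x = self.norm2(x + self.dropout(self.ffn(x)))                        # Add & Norm
        return x

class Encoder(nn.Module):
    """A stack of N encoder blocks followed by a final LayerNorm."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int, num_layers: int = 6, dropout: float = 0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            [EncoderBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, src_mask=None):
        for layer in self.layers:
            x = layer(x, src_mask)
        return self.norm(x)
```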
Step 5: Transformer Decoder: Generating Predictions
So far, we’ve built the Transformer Encoder that processes input sequences. But what about generating output sequences? That’s where the Decoder comes in!
The Decoder follows a structure similar to the Encoder but with a few key differences.
How the Transformer Decoder Works
Each Decoder Block consists of:
Masked Multi-Head Self-Attention:
Prevents the model from “cheating” by seeing future tokens during training.
Ensures predictions are generated one token at a time.
Cross-Attention:
Attends to the encoder’s output and extracts relevant information.
This helps the decoder understand how input and output are related.
Feed-Forward Network (FFN):
Expands the model’s ability to learn complex patterns.
Add & Norm (Residual Connections + Layer Norm):
Stabilizes training and prevents gradient issues.
DecoderBlock
First applies Masked Multi-Head Self-Attention (to prevent the model from looking ahead).
Then applies Cross-Attention (so it can use information from the Encoder).
Uses a Feed-Forward Network (FFN) to transform the output.
Decoder
A stack of DecoderBlocks (e.g., 6 layers in the original Transformer).
Applies Layer Normalization at the end.
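A condensed sketch of the decoder side, again reusing the MultiHeadAttention module from earlier (names and layer sizes are illustrative):

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention, cross-attention over the encoder output, then a feed-forward network."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.cross_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        # Masked self-attention: the causal tgt_mask hides future tokens.
        x = self.norms[0](x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Cross-attention: queries come from the decoder, keys/values from the encoder output.
        x = self.norms[1](x + self.dropout(self.cross_attn(x, enc_out, enc_out, src_mask)))
        x = self.norms[2](x + self.dropout(self.ffn(x)))
        return x

class Decoder(nn.Module):
    """A stack of N decoder blocks followed by a final LayerNorm."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int, num_layers: int = 6, dropout: float = 0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            [DecoderBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        for layer in self.layers:
            x = layer(x, enc_out, src_mask, tgt_mask)
        return self.norm(x)
```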
Final Step: Assembling the Full Transformer Model!
We’ve built all the core components (Embedding, Encoder, Decoder, Attention, Feed-Forward Layers), and now we’re putting everything together into a fully functional Transformer model!
How the Transformer Works
Encodes the input sequence using Embeddings, Positional Encoding, and the Encoder stack.
Decodes the output sequence by attending to both previous outputs and the encoded input.
Projects the decoder output onto the vocabulary space for predictions.
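Here’s how the pieces fit together, as a sketch that reuses the modules defined above (the default hyperparameters mirror the “base” model from the paper; the exact values in my notebook may differ):

```python
import torch.nn as nn

class Transformer(nn.Module):
    """Ties together embeddings, positional encoding, encoder, decoder, and the output projection."""
    def __init__(self, src_vocab: int, tgt_vocab: int, d_model: int = 512,
                 num_heads: int = 8, d_ff: int = 2048, num_layers: int = 6, dropout: float = 0.1):
        super().__init__()
        self.src_embed = InputEmbeddings(d_model, src_vocab)
        self.tgt_embed = InputEmbeddings(d_model, tgt_vocab)
        self.pos_enc = PositionalEncoding(d_model, dropout=dropout)
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.project = nn.Linear(d_model, tgt_vocab)   # maps decoder output to vocabulary logits

    def encode(self, src, src_mask=None):
        return self.encoder(self.pos_enc(self.src_embed(src)), src_mask)

    def decode(self, tgt, enc_out, src_mask=None, tgt_mask=None):
        return self.decoder(self.pos_enc(self.tgt_embed(tgt)), enc_out, src_mask, tgt_mask)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc_out = self.encode(src, src_mask)
        dec_out = self.decode(tgt, enc_out, src_mask, tgt_mask)
        return self.project(dec_out)   # (batch, tgt_len, tgt_vocab)
```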
Dataset
For training the machine translation model, I used the OPUS Books dataset, a multilingual parallel corpus derived from translated books. It is well-suited for machine translation because it provides high-quality, structured, and contextually rich sentence pairs. For this project, I used 20% of the training set, with the goal of translating English text into French.
Why OPUS Books?
High-Quality Translations: Since the dataset is sourced from books, it contains well-formed, grammatically correct sentences.
Diverse Language Pairs: Supports multiple language combinations, making it useful for training bilingual models.
Publicly Available: Easily accessible through the Hugging Face datasets library, ensuring effortless integration into deep learning pipelines.
Dataset Link: Opus Books Dataset
Next, we preprocess the dataset by tokenizing the text and preparing it for model training.
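As a rough sketch, loading the 20% slice and training simple word-level tokenizers with the Hugging Face datasets and tokenizers libraries looks something like this (the special tokens and min_frequency here are assumptions, not necessarily the exact settings I used):

```python
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Load the English-French configuration of OPUS Books and keep a 20% slice for training.
raw = load_dataset("opus_books", "en-fr", split="train[:20%]")

def build_tokenizer(lang):
    """Train a simple word-level tokenizer on one side of the corpus."""
    tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"], min_frequency=2)
    tokenizer.train_from_iterator((item["translation"][lang] for item in raw), trainer=trainer)
    return tokenizer

tokenizer_src = build_tokenizer("en")   # English side
tokenizer_tgt = build_tokenizer("fr")   # French side
```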
Training the Model
The training process involves several key steps to ensure the model learns efficiently and performs well on translation tasks.
1. Setting Up the Environment
We first determine whether to use a GPU (if available) or fall back to a CPU. This ensures that our training process runs optimally based on available hardware resources.
For this project, I used Google Colab with a T4 GPU, which provided the necessary computing power to train the Transformer model efficiently.
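The device selection itself is a one-liner:

```python
import torch

# Use a GPU if one is available (a Colab T4 in my case), otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```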
2. Preparing the Data
The dataset is loaded and split into training and validation sets. The data is tokenized and converted into numerical format so the model can process it. The train dataset is used for learning, while the validation dataset helps monitor performance.
3. Initializing the Model
A transformer-based model is instantiated with vocabulary sizes based on the tokenized dataset. If a pre-trained model is available, it is loaded to continue training from a previous checkpoint.
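In code, that looks roughly like this (the hyperparameters shown are the “base” values from the paper and are illustrative):

```python
# Vocabulary sizes come from the tokenizers trained earlier.
model = Transformer(
    src_vocab=tokenizer_src.get_vocab_size(),
    tgt_vocab=tokenizer_tgt.get_vocab_size(),
    d_model=512, num_heads=8, d_ff=2048, num_layers=6, dropout=0.1,
).to(device)
```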
4. Defining the Optimization Process
The Adam optimizer is used to adjust the model’s parameters.
Cross-entropy loss with label smoothing is used to improve generalization.
The loss function ignores padding tokens to avoid misleading error signals.
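A sketch of that setup, assuming the model and target tokenizer from the previous steps (the learning rate and smoothing value shown are illustrative):

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-9)

# Ignore padding positions and apply label smoothing for better generalization.
pad_id = tokenizer_tgt.token_to_id("[PAD]")
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)
```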
5. Training the Model
During each epoch:
The model processes input data, generating predictions.
The loss is computed by comparing predictions with actual translations.
The optimizer updates model weights to minimize loss.
Accuracy is measured to track model improvements.
Training progress is logged using TensorBoard for analysis.
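Here’s a simplified sketch of one training epoch; the batch keys and mask shapes are assumptions about the data pipeline rather than the exact ones in my notebook:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/transformer")  # illustrative log directory

def train_one_epoch(model, train_loader, optimizer, loss_fn, device, epoch):
    model.train()
    for step, batch in enumerate(train_loader):
        src = batch["encoder_input"].to(device)        # source token ids
        tgt_in = batch["decoder_input"].to(device)     # shifted target token ids
        src_mask = batch["encoder_mask"].to(device)    # padding mask
        tgt_mask = batch["decoder_mask"].to(device)    # padding + causal mask
        labels = batch["label"].to(device)             # expected output token ids

        logits = model(src, tgt_in, src_mask, tgt_mask)             # (batch, seq_len, vocab)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        writer.add_scalar("train/loss", loss.item(), epoch * len(train_loader) + step)
```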
6. Running Validation
After each epoch, the model is evaluated on the validation dataset to measure performance on unseen data. This helps detect overfitting and ensures the model generalizes well.
7. Saving the Model
At the end of each epoch, the model's state is saved. This allows training to resume from the last checkpoint without starting over.
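A minimal checkpointing sketch, run inside the epoch loop (the file name is illustrative):

```python
import torch

# Save everything needed to resume training from this epoch later.
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, f"transformer_epoch_{epoch:02d}.pt")
```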
This structured training loop ensures the model learns effectively while tracking progress and maintaining flexibility for future improvements.
Results
After training the model for 20 epochs, which took approximately 1 day and 19 hours, we observed significant performance improvements. The key metrics are as follows:
Final Training Loss: 2.8
Training Accuracy: 55%
Overall Trend: The model steadily learned to generate better translations, with loss reducing and accuracy increasing as training progressed.
These results indicate that the model successfully captured meaningful language patterns, though there may still be room for further improvements through hyperparameter tuning and additional training.
Qualitative Analysis
Source Text: "As that was the case, neither Jane, to whom I related the whole, nor I, thought it necessary to make our knowledge public; for of what use could it apparently be to any one, that the good opinion which all the neighbourhood had of him should then be overthrown?"
Target (Actual) Translation: "Quand je suis revenue à la maison, le régiment allait bientôt quitter Meryton ; ni Jane, ni moi n’avons jugé nécessaire de dévoiler ce que nous savions."
Predicted Translation: "Quant à ce que je ne pouvais pas dire que je ne pouvais pas accepter de la peine , Jane , si je le pouvais , si je le pouvais pas , si je pouvais , si je le pourrais , si je lui , si ce serait une chose qui pouvait se faire savoir si ce serait la chose de son amour ?"
Observations
Some lexical elements match (e.g., Jane and si je le pouvais have some relevance).
Major issues with fluency & structure: The predicted translation is repetitive, lacks coherence, and deviates significantly from the original meaning.
Contextual failure: The model does not correctly capture the essence of the source text, leading to nonsensical outputs.
Conclusion
While the model demonstrates basic learning, it struggles with sentence structure, context preservation, and fluency. This suggests that more training, better tokenization, or fine-tuning with pre-trained models would be necessary for better performance.
Ups and Downs of This Project
Ups (Successes & Strengths)
Effective Learning: The model showed a steady decrease in loss and an increase in accuracy over 20 epochs, demonstrating its ability to learn meaningful translation patterns.
Scalability: The OPUS Books dataset provided a solid foundation, proving that even training on a subset (20%) can yield promising results.
Efficient Training Pipeline: The use of tokenization, batching, and optimized data handling ensured smooth execution over long training durations.
Downs (Challenges & Limitations)
Long Training Time: Training for 20 epochs took nearly 1 day and 19 hours, making hyperparameter tuning and further experimentation slow and resource-intensive.
Translation Quality: While the model improved, it still requires fine-tuning or more training to achieve high-quality translations.
Expensive to Train from Scratch: Training a transformer model from scratch is computationally expensive. Using pre-trained models or transfer learning could be a more practical approach.
Hardware Constraints: The training speed was limited by available resources, and a more powerful GPU setup would greatly improve efficiency.
Despite these challenges, the project provided valuable insights into building and optimizing transformer-based translation models.
Resources
Here are some great videos and articles that helped me throughout this project:
Articles & Documentation
Attention is All You Need: The Official Academic Paper
Hugging Face Datasets Documentation: Understanding how to load and process datasets
Hugging Face Tokenizers: Building Efficient Tokenizers
The Illustrated Transformer by Jay Alammar: A visual guide to how Transformers work
Opus Books Dataset: A collection of copyright-free books aligned by Andras Farkas
Understanding Positional Encoding in Transformers: What exactly is Positional Embedding
Videos & Tutorials
Transformer Neural Networks, ChatGPT's foundation, Clearly Explained!!! by StatQuest with Josh Starmer
Attention for Neural Networks, Clearly Explained!!! by StatQuest with Josh Starmer
Coding a Transformer from scratch on PyTorch, with full explanation, training and inference. by Umar Jamil
Attention is all you need (Transformer) - Model explanation (including math), Inference and Training by Umar Jamil
Pytorch Transformers from Scratch (Attention is all you need) by Aladdin Persson
Try It Yourself!
You can access and edit the project using the links below:
Google Colab: Google Colab Notebook
GitHub Repository: GitHub Repository
Final Conclusion
This project was an exciting journey in building a translation model from scratch! Over 20 epochs (1 day and 19 hours of training), the model learned and improved, but there were still challenges.
What Went Well?: The model gradually got better at translating. Loss decreased, and accuracy improved over time.
What Could Be Better?: The translations were sometimes repetitive and unclear. Training from scratch was expensive and slow.
What’s Next?: Fine-tuning a pre-trained model (like mBART) can give better results in less time. Using more data and optimizing tokenization can improve fluency.
Final Thoughts
This was a great learning experience! While training from scratch was tough, I now understand the process far better and know how to improve it next time. Onward to more improvements!