From Single-Head Simplicity to Multi-Head Mastery: My Transformer Model Journey

Introduction

In my earlier articles, I shared how AI models have grown — from basic one-hot encodings to word embeddings, deep CNNs, and even models that understand both images and text together. This time, I explored Transformers, one of the most powerful tools in modern AI. In this project, I built a Transformer model to solve a fun challenge: for every letter in a sentence, can the model tell how many times it has already appeared — or how often it shows up in total? The twist? We only care if it appeared 0, 1, or at least 2 times. By training the model and visualizing its attention, I got to see how it learns to “count” letters by focusing on patterns in the sequence.

What’s the Task?

The challenge is simple, yet interesting: given a 20-character string made up of lowercase letters and spaces, the model has to make a prediction for each character.

But what’s it predicting?

There are two tasks:

  • Task 1 — Count Before: For each character, predict how many times it has already appeared earlier in the string.

  • Task 2 — Count Before and After: For each character, predict how many times it appears in the entire string (excluding the current position).

And one more twist: We don’t need the exact count — we only care whether the number of appearances is 0, 1, or 2 or more.

To train the model, I used a dataset of 10,000 training samples and 1,000 evaluation samples. So essentially, the model is learning to “count” characters in context — but in a simplified, category-based way. And that’s where attention comes in!

Preprocessing the Data

Before jumping into building the model, I had to prepare the data in a way that the Transformer model understands:

1. Character to Index Mapping

Each character (from ‘a’ to ‘z’ and space ‘ ’) was mapped to a unique index. This gives us a vocabulary size of 27.
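A minimal sketch of that mapping (the exact index order is my choice here for illustration; only the vocabulary size of 27 matters):

```python
import string

# 'a'-'z' plus the space character -> indices 0-26 (vocabulary size 27)
chars = list(string.ascii_lowercase) + [" "]
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

print(len(char_to_idx))  # 27
```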

2. Turning Strings into Tensors

For each 20-character input string:

  • I converted the characters to their corresponding indices using the mapping.

  • These index sequences were then turned into PyTorch tensors, ready to be used by the embedding layer.
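Building on the mapping above, the conversion can look roughly like this (the encode helper name is just for illustration):

```python
import torch

def encode(s: str) -> torch.Tensor:
    """Convert a string into a LongTensor of character indices."""
    return torch.tensor([char_to_idx[ch] for ch in s], dtype=torch.long)

x = encode("ed by rank and file ")  # 20 characters including the trailing space
print(x.shape)  # torch.Size([20])
```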

3. Generating Target Labels

Depending on the task:

  • Task 1: I calculated how many times each character appeared before the current position (capped at 2).

  • Task 2: I counted how many times each character appeared in the entire string (excluding the current one), also capped at 2.

These labels were then converted into tensors of the same length as the input sequence.
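Here is one way to compute both label types (a sketch; the cap at 2 collapses every higher count into the "2 or more" class):

```python
import torch

def count_before(s: str) -> torch.Tensor:
    """Task 1: occurrences of each character strictly before its position, capped at 2."""
    return torch.tensor([min(s[:i].count(ch), 2) for i, ch in enumerate(s)])

def count_before_and_after(s: str) -> torch.Tensor:
    """Task 2: occurrences anywhere else in the string (excluding the position itself), capped at 2."""
    return torch.tensor([min(s.count(ch) - 1, 2) for ch in s])

print(count_before("aba a"))            # tensor([0, 0, 1, 0, 2])
print(count_before_and_after("aba a"))  # tensor([2, 0, 2, 0, 2])
```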

Building the Basic Transformer Model

With the data preprocessed, it was time to build the model. I started simple: a Transformer with just one layer and one attention head. This made it easy to understand how attention works — and to watch the model learn over time.

Why Attention Matters

Unlike older models that read sequences left to right (like RNNs), Transformers use a mechanism called self-attention. This allows each token (character) in the sequence to look at other tokens — not just before it, but anywhere in the sequence.

For example, if the model is looking at the letter ‘d’, it can “pay attention” to other ‘d’s that came earlier (or later), helping it decide how often that character has already appeared.

But Wait — Don’t We Need Order?

Yes! One challenge with Transformers is that they don’t understand position on their own. That’s where positional encoding comes in.

In my model, I used learnable positional embeddings, which are added to the character embeddings. This tells the model not just what character it’s looking at, but also where it is in the sequence. That way, it knows if a repeated ‘a’ came before or after the current one.
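A sketch of how the learnable positional embeddings combine with the character embeddings (the embedding size of 64 is an assumed value, not something fixed by the task):

```python
import torch
import torch.nn as nn

seq_len, d_model, vocab_size = 20, 64, 27   # d_model = 64 is an assumed value

char_emb = nn.Embedding(vocab_size, d_model)   # "what" the character is
pos_emb = nn.Embedding(seq_len, d_model)       # "where" it sits -- learned during training

x = torch.randint(0, vocab_size, (1, seq_len))   # a batch of one encoded string
positions = torch.arange(seq_len).unsqueeze(0)   # [[0, 1, ..., 19]]
h = char_emb(x) + pos_emb(positions)             # shape: (1, 20, 64)
```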

What My Simple Transformer Looked Like

  1. Embedding Layer Turns each character into a dense vector that the model can learn from.

  2. Positional Encoding Helps the model understand order — essential for tasks like counting!

  3. Self-Attention Layer (1 layer, 1 head) The heart of the Transformer. Each character can “look at” other characters to help make its decision.

  4. Feedforward + Output Layer After processing with attention, the model passes through a feedforward network and then predicts one of three categories: 0, 1, or 2+ appearances.

Architecture of the Basic Transformer Model (1 Layer, 1 Head)
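Putting those four pieces together, here is a minimal PyTorch sketch of the model, using nn.TransformerEncoderLayer for the attention + feedforward block (sizes like dim_feedforward=128 are assumptions; my actual implementation may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCountTransformer(nn.Module):
    def __init__(self, vocab_size=27, d_model=64, nhead=1, num_layers=1,
                 seq_len=20, num_classes=3):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(seq_len, d_model)          # learnable positions
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, num_classes)             # classes: 0, 1, 2+

    def forward(self, x):                                      # x: (batch, seq_len)
        positions = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        h = self.char_emb(x) + self.pos_emb(positions)
        h = self.encoder(h)                                    # self-attention + feedforward
        return F.log_softmax(self.out(h), dim=-1)              # log-probs for NLLLoss

model = CharCountTransformer()                                 # 1 layer, 1 head
log_probs = model(torch.randint(0, 27, (4, 20)))               # (4, 20, 3)
```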

Training Results

Even this simple Transformer learned really well:

  • Task 1 (count-before): It quickly learned to look backward and achieved over 90% accuracy.

  • Task 2 (count-before-and-after): Slightly harder, but still reached ~90% with just one head and one layer!

Tuning the Basics: Hyperparameters That Made a Difference

To get my model learning effectively, I had to experiment with a few key hyperparameters. Here’s what worked — and what didn’t:

  • Learning Rate: I started with 0.001, which gave stable results. But when I increased it to 0.005, the model didn’t improve — in fact, performance dropped. Lesson learned: even small tasks need carefully chosen learning rates.

  • Number of Epochs: I trained for 10 epochs. The model reached over 90% accuracy in just a few epochs and then stabilized, so there was no need to go much further.

  • Loss Function: I used NLLLoss (Negative Log Likelihood Loss) because my model outputs log-probabilities using log_softmax. This loss function is a natural fit — it measures how well the predicted log-probabilities match the correct class.

  • I also tested CrossEntropyLoss, which is commonly used for classification tasks. However, it expects raw logits (before applying log_softmax) and internally computes the log-probabilities itself. Since I had already applied log_softmax in my model, using CrossEntropyLoss resulted in incorrect training behavior. So NLLLoss was the right choice for my setup; the snippet below shows the distinction.
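Reusing the model sketch from above, the training step looks roughly like this (the batch tensors here are dummies for illustration):

```python
import torch
import torch.nn as nn

# Dummy batch for illustration; real batches come from the 10,000-sample dataset
batch_x = torch.randint(0, 27, (4, 20))     # encoded input strings
batch_y = torch.randint(0, 3, (4, 20))      # labels: 0, 1, or 2+

criterion = nn.NLLLoss()
log_probs = model(batch_x)                                   # (4, 20, 3) log-probabilities
loss = criterion(log_probs.reshape(-1, 3), batch_y.reshape(-1))
loss.backward()

# CrossEntropyLoss would only be correct on raw logits (the linear output
# *before* log_softmax); stacking it on top of log_softmax applies the
# normalisation twice and distorts training.
```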

Takeaway: Sometimes, a simple model with well-tuned hyperparameters can outperform deeper or more complex architectures — especially for focused tasks like this one!

Visualizing Attention: What Did the Model Learn?

To understand how the model actually “counts” characters, I visualized the attention maps generated by the trained Transformer. These maps show which characters the model is focusing on when making predictions for each position in the string.
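The full plotting code isn’t shown here, but a rough sketch of the idea looks like this: feed the embedded string into the encoder layer’s attention module directly and plot the returned weights (this assumes the model and encode helper sketched earlier, and approximates the layer by querying its attention module on the raw embeddings):

```python
import matplotlib.pyplot as plt
import torch

text = "ed by rank and file "                     # padded to 20 characters
x = encode(text).unsqueeze(0)                     # (1, 20)

with torch.no_grad():
    positions = torch.arange(x.size(1)).unsqueeze(0)
    h = model.char_emb(x) + model.pos_emb(positions)
    # Query the (only) encoder layer's attention module directly to get the weights
    _, attn = model.encoder.layers[0].self_attn(h, h, h, need_weights=True)

plt.imshow(attn[0], cmap="viridis")               # rows: query position, cols: attended position
plt.xticks(range(len(text)), list(text))
plt.yticks(range(len(text)), list(text))
plt.xlabel("attended character")
plt.ylabel("query character")
plt.show()
```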

Here are attention maps from both tasks using the same input string: “ed by rank and file”

  • 🟠 Task 1 Attention Map: Count Before

  • 🔵 Task 2 Attention Map: Count Before and After

What’s Similar in Both?

  • Diagonal Focus: In both maps, there’s a bright diagonal line. This means each letter is paying attention to itself. That’s expected — the model needs to remember what it’s currently looking at!

  • Matching Tokens Light Up: When characters like 'e', 'a', 'n', 'd', or ' ' appear more than once, the attention map highlights those earlier or later spots. That’s the model linking repeated characters to help it count.

Task 1: Only Look Back 🔙

The Task 1 map shows the model mostly attending to letters that came before. For example, if the second 'e' appears, the model strongly focuses on the first 'e' to decide the count. This backward attention makes sense — the goal is to count only previous occurrences.

Task 2: Look Both Ways 🔄

In contrast, Task 2 shows more spread-out attention. The model now attends to both earlier and later characters. That’s because it needs to consider all other instances of a character, no matter their position.

A Note on Spaces ' '

Spaces are very common and light up clearly in both maps. In Task 1, the model looks at previous spaces. In Task 2, it looks both before and after — again matching the task’s requirement.

Scaling It Up: Going Deeper with More Layers

After training a simple 1-layer Transformer, I wanted to see what happens if we go deeper. So, I tried a multi-layer, single-head Transformer on Task 1.
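With the sketch from earlier, going deeper is a one-argument change (five layers here, matching the five attention maps shown below; the exact depth I settled on may differ):

```python
# Same sketch as before, just deeper: multiple encoder layers, still one head.
deep_model = CharCountTransformer(nhead=1, num_layers=5)
```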

How did it perform? At first, accuracy improved a lot — reaching above 99% in just a few epochs! But then, it started to drop. The model became unstable and showed signs of overfitting. Despite its deeper architecture, it didn’t consistently outperform the simpler model.

Key Insight: Adding more layers brings complexity, but it also requires careful tuning. For small tasks like this, a shallow model might be all you need.

After trying a deeper Transformer with multiple layers, I visualized how each layer attends to different parts of the sequence. Here’s what the attention maps reveal for the input: “ed by rank and file”

Attention Maps Across Transformer Layers (Task 1)

Layer-by-Layer Breakdown:

Layer 1: Broad Context. The model casts a wide net here, lightly attending to most of the sequence. It’s trying to gather a rough idea of the overall structure.

Layer 2: Self-Focus Emerges. We start seeing a strong diagonal pattern. Each character begins to attend more to itself and nearby characters — great for tracking local information.

Layer 3: Spotting Repeats. This layer shows more distinct off-diagonal spots. The model starts linking repeated characters — an important step for counting earlier appearances.

Layer 4: Sharper Focus. Now attention becomes more selective. The model filters out less relevant parts and hones in on key tokens that influence the prediction.

Layer 5: Final Refinement. Only a few high-confidence links remain. The model confidently zeroes in on just the most important characters to make the final count.

Why It Matters: By stacking layers like this, the model gradually learns to shift from broad context to pinpointed reasoning — perfect for a task like counting repeated characters.

Exploring Multi-Head Attention: More Eyes on the Sequence 👀

After experimenting with deeper models, I wanted to test another idea: what if we let the model look at the sequence in multiple ways at the same time?

That’s exactly what multi-head attention does.

In a single-head model, attention comes from one perspective. But with multiple heads, each can learn to focus on different patterns — like one head might specialize in spotting repeated vowels, while another might focus on spacing or neighboring characters.

So I trained a 1-layer, 4-head Transformer on Task 1.
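Against the earlier sketch, this is again a one-argument change (note that d_model must be divisible by the number of heads):

```python
# One layer, four heads; d_model must be divisible by nhead (64 / 4 = 16 per head).
multihead_model = CharCountTransformer(nhead=4, num_layers=1)
```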

How did it perform?

Epoch 1: Train Accuracy — 88.9%, Eval Accuracy — 91.6%

Epoch 5: Train Accuracy — 88.5%, Eval Accuracy — 98.4%

Epoch 10: Train Accuracy — 88.9%, Eval Accuracy — 99.05%

Observation: The training accuracy remained static, but evaluation accuracy steadily improved. This suggests that different heads specialize in different behaviors, and when combined, they generalize better, even if individual heads aren’t perfect.

What Did Each Head Learn?

By visualizing each attention head, we can understand how they divide up the work:

Attention maps from each head of the multi-head, single-layer Transformer model (Task 1).

  • Head 1: Mostly attends to the current token and its nearby neighbors. It helps keep the identity and local context clear.

  • Head 2: Links the current token to earlier occurrences of the same character — useful for counting past appearances.

  • Head 3: Spreads attention more broadly across the sequence to understand the overall frequency of characters.

  • Head 4: Acts like a high-level filter, focusing sharply on the most relevant positions for making the final classification.

Why Multi-Head Works: Multi-head attention gives the model the flexibility to analyze the sequence from multiple perspectives. Like a team of readers each noticing different things, the heads combine their insights to form a more complete understanding — boosting performance without needing a deeper model.

Final Thoughts

What started as a simple letter-counting task turned into a powerful learning experience with Transformers. From a basic one-head model to experimenting with multiple layers and heads, I saw how attention helps models focus, learn patterns, and make accurate predictions — all by “looking” at the right parts of the input.

Sometimes, small tasks teach big lessons. 🚀
