Day 8/50: Building a Small Language Model from Scratch: What are Rotary Positional Embeddings (RoPE)

Over the past two days, we focused on understanding positional embeddings and coding them.

What Are Rotary Positional Embeddings (RoPE) and Why Do They Matter in Transformers?

In modern deep learning, the transformer is the critical building block. From ChatGPT to BERT, most cutting-edge models rely on the transformer architecture, which processes sequences of data in parallel.

But there’s a fundamental challenge at the heart of transformers:

They have no natural sense of order.

That’s where positional embeddings come into play. More recently, a powerful variant called Rotary Positional Embeddings (RoPE) has emerged as a more efficient and elegant solution.

Let’s break this down.

Recap: Why Positional Embeddings Are Needed

Transformers treat the input as a set of tokens processed simultaneously. Unlike RNNs, which read words one at a time, transformers look at all words at once. That is great for speed, but it removes the concept of order.

For example, to a raw transformer:

  • “The cat sat on the mat.”
  • “The mat sat on the cat.”

…are just the same tokens scrambled around. No sense of “first”, “last”, or “next”.

To solve this, we inject information about each token’s position in the sequence.

Traditional Positional Embeddings: How They Work

Two popular methods:

  1. Learned positional embeddings: the model learns a unique vector for each position (1, 2, 3, …) during training.
  2. Sinusoidal embeddings: fixed vectors for each position, generated with sine and cosine functions (a minimal sketch follows after the next list).

They work, but have limitations:

  • Positions are either hard-coded or learned one vector at a time.
  • They don’t generalize well to sequences longer than those seen in training.
  • They aren’t easily reflected in the pairwise attention scores (important for some models).
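
For concreteness, here is a minimal NumPy sketch of the sinusoidal variant (the function name and shapes are illustrative, not taken from any particular library); the resulting matrix is simply added to the token embeddings before the first transformer layer.

# Minimal sketch of sinusoidal positional embeddings (illustrative names)
import numpy as np

def sinusoidal_positional_embeddings(seq_len, dim, base=10000.0):
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,) per-pair frequencies, dim assumed even
    angles = positions * freqs                         # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cos
    return pe

# Usage: embeddings = token_embeddings + sinusoidal_positional_embeddings(seq_len, dim)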

Enter Rotary Positional Embeddings (RoPE)

RoPE was introduced in RoFormer (Su et al., 2021) and is now widely used in models like LLaMA and DeepSeek.

What’s the Big Idea?

Instead of adding position information to token embeddings, RoPE rotates the token vectors in space based on their position. It works within the attention mechanism, modifying the query and key vectors directly.

This rotation encodes relative positions into the attention scores, a key difference from absolute positional embeddings.

How RoPE Works (Intuition)

Imagine each pair of embedding dimensions as a 2D point on a plane. Now, for each position p, we rotate that point by an angle proportional to p.

More formally, RoPE represents rotation using complex numbers or 2D vector pairs. The rotation is done dimension-wise in even-odd pairs:

# RoPE rotation of a single vector x at a given position (NumPy)
import numpy as np

def apply_rope(x, position, base=10000.0):
    out = x.copy()
    dim = len(x)
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)          # each even-odd pair gets its own frequency
        x1, x2 = x[i], x[i + 1]
        out[i] = x1 * np.cos(angle) - x2 * np.sin(angle)
        out[i + 1] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

This makes the attention score between tokens naturally reflect their relative positions, something traditional embeddings don’t do well.
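
As a quick sanity check of that claim (reusing the apply_rope sketch above), the dot product between a rotated query and a rotated key comes out the same whenever the distance between their positions is the same:

# Sanity check: with RoPE, the q·k score depends only on the relative offset
import numpy as np

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_near = apply_rope(q, position=3) @ apply_rope(k, position=1)      # offset 2
score_far = apply_rope(q, position=103) @ apply_rope(k, position=101)   # offset 2
print(np.allclose(score_near, score_far))  # True: only the offset matters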

RoPE vs Traditional Positional Embeddings

| Feature                        | Traditional Embeddings        | Rotary Positional Embeddings (RoPE)    |
|--------------------------------|-------------------------------|----------------------------------------|
| Where position is injected     | Added to input embeddings     | Applied inside the attention mechanism |
| Absolute or relative?          | Absolute                      | Relative                               |
| Generalizes to long sequences? | Poorly                        | Well                                   |
| Learnable parameters?          | Sometimes (if learned)        | No                                     |
| Adopted in SOTA models?        | Less common now               | Yes (LLaMA, DeepSeek)                  |

Why RoPE Is So Useful

  1. Relative Attention Encoding: RoPE allows attention layers to understand how far apart tokens are.
  2. No Extra Parameters: The rotation is deterministic and parameter-free.
  3. Extends to Long Sequences: RoPE generalizes better when dealing with unseen long sequences.
  4. Simple to Implement: it’s just rotating even-odd dimension pairs, with no positional vectors to store or learn (a vectorized sketch follows below).
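
As a taste of tomorrow’s from-scratch session, here is one common way to vectorize the rotation with a precomputed cos/sin cache; the function names are illustrative sketches, not the actual API of the linked repository.

# Vectorized RoPE: precompute a cos/sin cache once, then rotate a whole
# (seq_len, dim) matrix of queries or keys in one shot.
import numpy as np

def rope_cache(seq_len, dim, base=10000.0):
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim/2,) per-pair frequencies
    angles = np.arange(seq_len)[:, None] * freqs       # (seq_len, dim/2)
    return np.cos(angles), np.sin(angles)

def apply_rope_matrix(x, cos, sin):
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # split even/odd dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Usage: cos, sin = rope_cache(seq_len, dim); q_rotated = apply_rope_matrix(q, cos, sin)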

Use Cases in Real Models

  • LLaMA (Meta): Uses RoPE for better generalization and efficiency.
  • DeepSeek: uses RoPE on separate query/key heads, enabling efficient long-context attention without bloating memory.

If you’re building a transformer-based model, especially for long documents, code, or scientific texts, RoPE might be your go-to.

Final Thoughts

Rotary Positional Embeddings are a powerful enhancement over traditional techniques. Their integration into attention mechanisms makes them ideal for models that require scaling to long sequences and complex patterns.

They’re elegant, efficient, and now a standard in state-of-the-art transformer architectures.

Next time you’re building or tweaking a transformer, consider rotating your embeddings, literally!

Tomorrow, we’re going to code this from scratch and see how we use it in https://github.com/ideaweaver-ai/DeepSeek-Children-Stories-15M-model

👉 If you’re looking to go beyond toy problems and actually build real stuff, this free bootcamp is built on experience, not templates. No fluff, just the kind of GenAI projects you can fine-tune, deploy, and scale:

https://www.linkedin.com/events/free2monthgenerativeaicourse-be7345645002171011073/
