How Transformers Power GPT Models
I was in the middle of a draft wherein I tried to critique the ability of LLMs like ChatGPT to “think” and “believe”. Perhaps it’s just a linguistic struggle with the use of “thought” and “belief” in the context of LLMs that bothered me enough to want to write about it.
But I somehow ended up in a rabbit hole of uninteresting technical jargon on LLMs, “transformers”, and Chain-of-Thought reasoning. I eventually gave up the draft, not from disinterest, but because the long-form text I was working on can be summarized in this one sentence:
“Just because a model explains its steps doesn’t mean it is actually reasoning.”
Having established that, I turn my focus to the question at hand: What are transformers in artificial intelligence, and what do they have to do with LLMs and NLP?
Introduction
If you’ve ever wondered how ChatGPT can hold a conversation, answer questions, or write stories with surprisingly human-like language, the answer comes down to a breakthrough solution in artificial intelligence called transformers. This architecture revolutionized the way machines understand natural language.
What are Transformers?
Transformers are the core technology behind today’s most advanced AI models. One of the most powerful features of a transformer is something called self-attention, which enables tools like ChatGPT to understand context across entire paragraphs, make connections between ideas, even if they’re far apart, and respond in ways that are fluent, relevant, and surprisingly natural.
This mechanism helps the model focus on the most important words while also keeping memory use manageable.
The model does this within something called a context window, which refers to the amount of text it can consider at once. The bigger the window, the more words the model can pay attention to at the same time, leading to a deeper understanding of the overall meaning.
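To make the idea of a context window concrete, here is a toy sketch (my own simplification, not how any real model is implemented): the model can only "see" the most recent slice of the token sequence, and anything before that slice falls outside its view.

```python
# Toy illustration of a context window: at each step, only the most
# recent `window` tokens are visible to the model.
def visible_context(tokens, window):
    """Return the slice of tokens the model can attend to."""
    return tokens[-window:]

tokens = "the cat that the dog chased ran away".split()
print(visible_context(tokens, 4))  # only the last 4 tokens are visible
```

A larger `window` means more of the earlier text stays visible, which is why bigger context windows give the model a deeper grasp of the overall meaning.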
The Self-attention Layer
Self-attention in machine learning is similar to how humans focus their attention in daily life. Both involve picking out the most important parts of a larger situation in order to understand it more clearly. In psychology, self-attention means focusing on your own thoughts or actions. In deep learning, it means the model focuses on the most relevant parts of the input text to make better predictions.
The self-attention layer allows a model to focus on different parts of the input when processing each part, effectively capturing contextual information. Instead of treating each part of the input independently, it learns to relate each part to all other parts, creating a richer representation.
I don’t have the technical depth to grasp, let alone enjoy, the ins and outs of it, but you can read all about it in the paper “Attention Is All You Need” by Ashish Vaswani et al.
Transformers are so foundational because, with the self-attention layer, GPT models can handle natural language tasks well: the attention layer allows the model to compute the relationships between words regardless of the distance between them.
Example: In the sentence “The cat that the dog chased ran away,” the model needs to figure out who ran away. Thanks to self-attention, it correctly understands that the cat ran, not the dog.
This ability to weigh relationships between words helps the model stay focused on the context even in long or complex sentences.
How Self-Attention Works in Transformers
IBM has a great resource on how transformers work, and being a visual learner myself, I leveraged ChatGPT to turn that process into a visual chart.
1. Embedding the Input Sequence
When the model receives a sentence (or sequence), it breaks it down into tokens, which are made up of words or parts of words. These tokens are then converted into embeddings, which are numerical representations of each token’s meaning.
I think of embeddings like coordinates on a map that place similar locations (in this case, words) closer together. For example, “cat” and “kitten” would be in nearby locations.
2. Turning Words into Numbers (Vectors)
Machines can’t “read” the way humans do; instead, they translate words into numbers, called vectors. A vector is just a list of numbers that represents a word’s meaning. These numbers capture things like tone, context, and similarity to other words. For example, the words “king” and “queen” have different meanings, but their vectors are close in shape because they’re related.
These word-vectors flow through the model and get refined along the way, helping the system understand subtle differences in meaning.
For each token, the model creates three special vectors:
Query (Q) — What this token is looking for in other tokens
Key (K) — What this token offers to be found
Value (V) — The actual content or meaning the token contributes
These vectors are made by doing some math (matrix multiplication, a type of linear transformation) that reshapes the data so the model can compare words more easily.
3. Compute Attention Scores
The model then computes the attention scores by comparing the query of each token with the keys of all the other tokens to determine how well each token (word) matches with the others. The scores are then adjusted to keep the math from getting out of control as the input gets bigger. This step helps the model determine how much focus one word should give to another.
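The score computation above is a dot product between queries and keys, scaled down so the numbers stay manageable as the vectors grow. A minimal sketch with made-up random vectors (real ones come from the learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_k = 5, 4
Q = rng.normal(size=(n_tokens, d_k))  # hypothetical query vectors
K = rng.normal(size=(n_tokens, d_k))  # hypothetical key vectors

# Raw attention scores: how well each token's query matches every
# token's key. Dividing by sqrt(d_k) keeps the scores from growing
# out of control as the vector dimension increases.
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)  # (5, 5): one score for every pair of tokens
```

The result is a square grid of scores: entry (i, j) says how much token i should pay attention to token j.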
4. Transform Scores into Probabilities
GPT models are classified as statistical learners that model language using probabilistic distributions over sequences of tokens (words). What this simply means is that they operate through statistical associations learned from vast datasets, predicting the most likely next word or phrase in a given context.
To transform the score into probabilities, the model uses something called the softmax function. This makes the scores easier to work with by converting them into a set of values that add up to 1.
These probabilities are then used to combine the value vectors, giving more weight to the most relevant tokens. The result is a weighted mix of information that captures context, relationships, and meaning, ultimately helping generate more accurate responses.
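Steps 3 and 4 together form what the paper calls scaled dot-product attention. Here is a compact, self-contained sketch with random stand-in vectors, showing the softmax turning scores into probabilities that sum to 1, and those probabilities weighting the value vectors:

```python
import numpy as np

def softmax(x):
    # Subtracting the row max keeps exp() numerically stable.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n_tokens, d_k = 5, 4
Q = rng.normal(size=(n_tokens, d_k))  # stand-in query vectors
K = rng.normal(size=(n_tokens, d_k))  # stand-in key vectors
V = rng.normal(size=(n_tokens, d_k))  # stand-in value vectors

# Scores -> probabilities -> weighted mix of values.
weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
output = weights @ V                       # context-aware token vectors
print(weights.sum(axis=-1))                # every row is 1.0
```

Each row of `output` is a blend of all the value vectors, weighted by relevance, which is the "weighted mix of information" described above.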
Learning Through Layers
A transformer is made up of many layers, each one helping the model understand more. Think of them like filters. The first layers might look at simple things, like word order. Later layers look at higher-level ideas like grammar, style, or tone. The more layers, the deeper the understanding.
Once the model has processed everything, it uses that knowledge to predict the next word. Just like how you finish someone’s sentence when you know what they’re going to say, the model uses what it knows to choose the most likely next word, then repeats the process word by word.
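The "choose the most likely next word, then repeat" loop can be imitated with a toy bigram table. The counts below are invented for illustration; a real GPT model derives its probabilities from billions of learned parameters, not a lookup table:

```python
# Toy "autocomplete": pick the most frequent next word from
# hand-made bigram counts (real models learn these from data).
bigram_counts = {
    "the": {"cat": 5, "dog": 3},
    "cat": {"ran": 4, "sat": 2},
    "ran": {"away": 6},
}

def next_word(word):
    """Greedily choose the most likely continuation, if any."""
    candidates = bigram_counts.get(word, {})
    return max(candidates, key=candidates.get) if candidates else None

def generate(start, max_words=5):
    """Repeat word-by-word prediction, like the model's loop."""
    words = [start]
    while len(words) < max_words:
        nxt = next_word(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(generate("the"))  # "the cat ran away"
```

This greedy loop is a drastic simplification, but it captures the core idea: generation is prediction applied repeatedly, one token at a time.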
This reminds me of a critique I read on LinkedIn that described ChatGPT as merely an advanced autocomplete technology that can create entire paragraphs that flow naturally.
Curious to Learn More?
I tried to make sense of the technical concept of transformers, but you don’t need to be an expert to figure out that it is more complex than I have made it look. If you want to explore this further, check out this video on YouTube, an evergreen visual explainer that breaks down how transformers work inside LLMs.
AI Hackathon
Interested in trying your hand at building with AI even without experience? Hackathons are a great way to learn about, and earn from, artificial intelligence. DevPost is my go-to resource hub for finding hackathons, both paid and unpaid.
I tried to host an AI hackathon on LinkedIn; you can check it out here. It’s a simple exercise that forces us to think about AI-generated content and how to spot it.
Further Reading
- Vanna Winland — What is Self-Attention, IBM — https://www.ibm.com/think/topics/self-attention
AI Use Disclosure
We used AI to generate the visual chart inspired by IBM’s guide on how transformers work, and to help simplify complex terms, particularly mathematical and technical concepts related to artificial intelligence and LLMs. No other content in this article was created or written using AI.