How Transformers Power GPT Models
I was in the middle of a draft wherein I tried to critique the ability of LLMs like ChatGPT to “think” and “believe”. Perhaps it’s just a linguistic struggle with the use of “thought” and “belief” in the context of LLMs that bothered me enough to want to write about it.
But I somehow ended up in a rabbit hole of uninteresting technical jargon on LLMs, “transformers”, and Chain-of-Thought reasoning. I eventually gave up the draft, not from disinterest, but because the long-form text I was working on can be summarized in this one sentence:
“Just because a model explains its steps doesn’t mean it is actually reasoning.”
Having established that, I turn my focus to the question at hand: What are transformers in artificial intelligence, and what do they have to do with LLMs and NLP?
Introduction
If you’ve ever wondered how ChatGPT can hold a conversation, answer questions, or write stories with surprisingly human-like language, the answer comes down to a breakthrough solution in artificial intelligence called transformers. This architecture revolutionized the way machines understand natural language.
What are Transformers?
Transformers are the core technology behind today’s most advanced AI models. One of the most powerful features of a transformer is something called self-attention, which enables tools like ChatGPT to understand context across entire paragraphs, make connections between ideas, even if they’re far apart, and respond in ways that are fluent, relevant, and surprisingly natural.
This mechanism helps the model focus on the most important words while also keeping memory use manageable.
The model does this within something called a context window, which refers to the amount of text it can consider at once. The bigger the window, the more words the model can pay attention to at the same time, leading to a deeper understanding of the overall meaning.
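To make the idea of a context window concrete, here is a toy sketch (my own simplification, not how any real model is implemented): the model can only "see" the most recent slice of the token sequence, and anything before that slice falls outside its view.

```python
# Toy illustration of a context window: at each step, only the most
# recent `window` tokens are visible to the model.
def visible_context(tokens, window):
    """Return the slice of tokens the model can attend to."""
    return tokens[-window:]

tokens = "the cat that the dog chased ran away".split()
print(visible_context(tokens, 4))  # only the last 4 tokens are visible
```

A larger `window` means more of the earlier text stays visible, which is why bigger context windows give the model a deeper grasp of the overall meaning.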
The Self-attention Layer
Self-attention in machine learning is similar to how humans focus their attention in daily life. Both involve picking out the most important parts of a larger situation in order to understand it more clearly. In psychology, self-attention means focusing on your own thoughts or actions. In deep learning, it means the model focuses on the most relevant parts of the input text to make better predictions.
The self-attention layer allows a model to focus on different parts of the input when processing each part, effectively capturing contextual information. Instead of treating each part of the input independently, it learns to relate each part to all other parts, creating a richer representation.
I don’t have the technical depth to grasp, let alone enjoy, the ins and outs of it, but you can read all about it in the paper “Attention Is All You Need” by Ashish Vaswani et al.
Transformers are so foundational because, with the self-attention layer, GPT models can handle natural language tasks well: the attention layer allows the model to compute the relationships between words regardless of the distance between them.
Example: In the sentence “The cat that the dog chased ran away,” the model needs to figure out who ran away. Thanks to self-attention, it correctly understands that the cat ran, not the dog.
This ability to weigh relationships between words helps the model stay focused on the context even in long or complex sentences.
How Self-Attention Works in Transformers
IBM has a great resource on how transformers work, and being a visual learner myself, I leveraged ChatGPT to turn that process into a visual chart.
1. Embedding the Input Sequence
When the model receives a sentence (or sequence), it breaks it down into tokens, which are made up of words or parts of words. These tokens are then converted into embeddings, which are numerical representations of each token’s meaning.
I think of embeddings like coordinates on a map that place similar locations (in this case, words) closer together. For example, “cat” and “kitten” would be in nearby locations.
2. Turning Words into Numbers (Vectors)
Machines can’t “read” the way humans do; instead, they translate words into numbers, called vectors. A vector is just a list of numbers that represents a word’s meaning. These numbers capture things like tone, context, and similarity to other words. For example, the words “king” and “queen” have different meanings, but their vectors are close in shape because they’re related.
These word-vectors flow through the model and get refined along the way, helping the system understand subtle differences in meaning.
For each token, the model creates three special vectors:
Query (Q) — What this token is looking for in other tokens
Key (K) — What this token offers to be found
Value (V) — The actual content or meaning the token contributes
These vectors are made by doing some math (matrix multiplication, a type of linear transformation) that reshapes the data so the model can compare words more easily.
3. Compute Attention Scores
The model then computes the attention scores by comparing the query of each token with the keys of all the other tokens to determine how well each token (word) matches with the others. The scores are then adjusted to keep the math from getting out of control as the input gets bigger. This step helps the model determine how much focus one word should give to another.
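The score computation above is a dot product between queries and keys, scaled down so the numbers stay manageable as the vectors grow. A minimal sketch with made-up random vectors (real ones come from the learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_k = 5, 4
Q = rng.normal(size=(n_tokens, d_k))  # hypothetical query vectors
K = rng.normal(size=(n_tokens, d_k))  # hypothetical key vectors

# Raw attention scores: how well each token's query matches every
# token's key. Dividing by sqrt(d_k) keeps the scores from growing
# out of control as the vector dimension increases.
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)  # (5, 5): one score for every pair of tokens
```

The result is a square grid of scores: entry (i, j) says how much token i should pay attention to token j.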
4. Transform Scores into Probabilities
GPT models are classified as statistical learners that model language using probabilistic distributions over sequences of tokens (words). What this simply means is that they operate through statistical associations learned from vast datasets, predicting the most likely next word or phrase in a given context.
To transform the score into probabilities, the model uses something called the softmax function. This makes the scores easier to work with by converting them into a set of values that add up to 1.
These probabilities are then used to combine the value vectors, giving more weight to the most relevant tokens. The result is a weighted mix of information that captures context, relationships, and meaning, ultimately helping generate more accurate responses.
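Steps 3 and 4 together form what the paper calls scaled dot-product attention. Here is a compact, self-contained sketch with random stand-in vectors, showing the softmax turning scores into probabilities that sum to 1, and those probabilities weighting the value vectors:

```python
import numpy as np

def softmax(x):
    # Subtracting the row max keeps exp() numerically stable.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n_tokens, d_k = 5, 4
Q = rng.normal(size=(n_tokens, d_k))  # stand-in query vectors
K = rng.normal(size=(n_tokens, d_k))  # stand-in key vectors
V = rng.normal(size=(n_tokens, d_k))  # stand-in value vectors

# Scores -> probabilities -> weighted mix of values.
weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
output = weights @ V                       # context-aware token vectors
print(weights.sum(axis=-1))                # every row is 1.0
```

Each row of `output` is a blend of all the value vectors, weighted by relevance, which is the "weighted mix of information" described above.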
Learning Through Layers
A transformer is made up of many layers, each one helping the model understand more. Think of them like filters. The first layers might look at simple things, like word order. Later layers look at higher-level ideas like grammar, style, or tone. The more layers, the deeper the understanding.
Once the model has processed everything, it uses that knowledge to predict the next word. Just like how you finish someone’s sentence when you know what they’re going to say, the model uses what it knows to choose the most likely next word, then repeats the process word by word.
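The "choose the most likely next word, then repeat" loop can be imitated with a toy bigram table. The counts below are invented for illustration; a real GPT model derives its probabilities from billions of learned parameters, not a lookup table:

```python
# Toy "autocomplete": pick the most frequent next word from
# hand-made bigram counts (real models learn these from data).
bigram_counts = {
    "the": {"cat": 5, "dog": 3},
    "cat": {"ran": 4, "sat": 2},
    "ran": {"away": 6},
}

def next_word(word):
    """Greedily choose the most likely continuation, if any."""
    candidates = bigram_counts.get(word, {})
    return max(candidates, key=candidates.get) if candidates else None

def generate(start, max_words=5):
    """Repeat word-by-word prediction, like the model's loop."""
    words = [start]
    while len(words) < max_words:
        nxt = next_word(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(generate("the"))  # "the cat ran away"
```

This greedy loop is a drastic simplification, but it captures the core idea: generation is prediction applied repeatedly, one token at a time.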
This reminds me of a critique I read on LinkedIn that described ChatGPT as merely an advanced autocomplete technology that can create entire paragraphs that flow naturally.
Curious to Learn More?
I tried to make sense of the technical concept of transformers, but you don’t need to be an expert to figure out that it is more complex than I have made it look. If you want to explore this further, check out this video on YouTube, an evergreen visual explainer that breaks down how transformers work inside LLMs.
AI Hackathon
Interested in trying your hand at building with AI even without experience? Hackathons are a great way to learn about, and earn from, artificial intelligence. DevPost is my go-to resource hub for finding hackathons, both paid and unpaid.
I tried to host an AI hackathon on LinkedIn; you can check it out here. It’s a simple exercise that forces us to think about AI-generated content and how to spot it.
Further Reading
- Vanna Winland — What is Self-Attention, IBM — https://www.ibm.com/think/topics/self-attention
AI Use Disclosure
We used AI to generate the visual chart inspired by IBM’s guide on how transformers work, and to help simplify complex terms, particularly mathematical and technical concepts related to artificial intelligence and LLMs. No other content in this article was created or written using AI.