The beauty of the CPU

Hussein Nasser

Software Engineer | Talks about backend, databases and operating systems

Published May 9, 2025

If you are bored of contemporary topics of AI and need a breather, I invite you to join me to explore a mundane, fundamental and earthy topic.

The CPU.

Take the CPU, a beautiful instrument for executing instructions.

A simple yet very common operation is to add two numbers and store the sum in a variable in memory, say c = a + b.

Both “a” and “b” are in main memory. To the CPU that is a colossal distance, like the Sun and the Moon.

You see, the CPU cannot do anything with data while it sits in memory. The data has to be close to it, in small and very fast storage locations inside the CPU called registers, often 64 bits wide in modern CPUs.

So you load the variable “a” into a CPU register, then load “b” into another register on the same core. Registers are private to each core, so on a multi-core CPU both loads have to happen on the core that will execute the add. You don’t want “a” loaded into a register on core 1 and “b” loaded into a register on core 4.

I’m embarrassed to say I’m glossing over an extraordinary amount of work and complexity in what it takes to load data from the RAM DIMMs into the CPU registers, but maybe I’ll explore that in another post.

You now have both numbers close by in the CPU registers. You then execute 1 instruction to add the two specified registers and store the sum in a third register. The third register is then written back to memory where “c” is supposed to live. The process can then enjoy reading the value of “c” and work with it.
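The load, load, add, store sequence above can be sketched in C. This is only an illustration: the function name `add_pair` and the local names `r1`, `r2`, `r3` are made up for this example, and the compiler decides which real registers and instructions are used.

```c
#include <stdint.h>

/* Adds *a and *b and writes the sum to *c.
 * The locals r1, r2, r3 stand in for CPU registers (hypothetical names);
 * the comments describe the conceptual steps, not guaranteed machine code. */
void add_pair(const int32_t *a, const int32_t *b, int32_t *c)
{
    int32_t r1 = *a;       /* load "a" from memory into a register   */
    int32_t r2 = *b;       /* load "b" into a second register        */
    int32_t r3 = r1 + r2;  /* one add: third register = r1 + r2      */
    *c = r3;               /* write the third register back to "c"   */
}
```

Calling `add_pair(&a, &b, &c)` with `a = 2` and `b = 3` leaves 5 in `c`.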

It is important to mention that the add itself is an instruction that lives in main memory (in the text segment of the process), and it too is fetched from memory and stored in a special register called the IR (instruction register).
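To make the idea of fetching instructions concrete, here is a toy model in C. It is not how any real CPU is built: the `toy_cpu` struct, the four opcodes and the instruction layout are all invented for this sketch. The point is only that the program sits in a `text` array, and each step copies the next instruction into `ir` before executing it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical instruction set for a toy machine. */
enum opcode { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };

struct instr {
    enum opcode op;
    uint8_t r;     /* LOAD/ADD: destination register, STORE: source register */
    uint8_t a, b;  /* ADD only: the two source registers                     */
    size_t addr;   /* LOAD/STORE only: index into data memory                */
};

struct toy_cpu {
    size_t pc;        /* program counter: index of the next instruction */
    struct instr ir;  /* instruction register: the fetched instruction  */
    int64_t reg[4];   /* four 64-bit general-purpose registers          */
};

/* Runs the program in text[] against data[] until OP_HALT is fetched. */
void toy_run(struct toy_cpu *cpu, const struct instr *text, int64_t *data)
{
    cpu->pc = 0;
    for (;;) {
        cpu->ir = text[cpu->pc++];  /* fetch from "memory" into the IR */
        switch (cpu->ir.op) {       /* decode and execute              */
        case OP_LOAD:
            cpu->reg[cpu->ir.r] = data[cpu->ir.addr];
            break;
        case OP_ADD:
            cpu->reg[cpu->ir.r] = cpu->reg[cpu->ir.a] + cpu->reg[cpu->ir.b];
            break;
        case OP_STORE:
            data[cpu->ir.addr] = cpu->reg[cpu->ir.r];
            break;
        case OP_HALT:
            return;
        }
    }
}
```

A five-instruction program (load, load, add, store, halt) run against `data = {2, 3, 0}` leaves 5 in `data[2]`, which is the c = a + b walk-through from earlier.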

But I must ask a question: what if I want to sum 100 pairs? You might say, well, that is 100 add instructions similar to the one we have just explored.

Here is the thing: it doesn’t have to be 100 instructions. What if I told you that you can sum 100 pairs in 25 add instructions?

Executing 25 instructions is more efficient than executing 100.

Meet SIMD: single instruction, multiple data. It allows one instruction to operate on multiple data elements at once and produce multiple outputs, as long as the CPU supports it of course.

So in our example you can store 4 variables in a special vector register and another 4 in a second vector register, and have the CPU execute one instruction to sum 4 pairs of integers at once. Why 4? Well, that is just what fits in the vector size the CPU supports.

Think of it as a function that takes 8 parameters, a1, a2, a3, a4, b1, b2, b3, b4, adds each pair at once and produces c1, c2, c3, c4 (c1 = a1 + b1, and so on), all in one shot with a single instruction. This is with a 128-bit vector and 32-bit integers: 128 / 32 = 4 integers per vector.
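Here is a minimal sketch of that 8-in, 4-out function in C. It assumes a GCC or Clang compiler, because it uses their `vector_size` extension rather than standard C. The names `v4i32` and `add4` are made up for this example, and whether the `+` compiles to a single SIMD add depends on the target CPU and compiler flags.

```c
#include <stdint.h>
#include <string.h>

/* A 128-bit vector holding four 32-bit integers (GCC/Clang extension). */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* c[i] = a[i] + b[i] for i = 0..3, written as one vector addition. */
void add4(const int32_t a[4], const int32_t b[4], int32_t c[4])
{
    v4i32 va, vb, vc;
    memcpy(&va, a, sizeof va);  /* load a1..a4 into one vector            */
    memcpy(&vb, b, sizeof vb);  /* load b1..b4 into another vector        */
    vc = va + vb;               /* element-wise add of all four pairs     */
    memcpy(c, &vc, sizeof vc);  /* store c1..c4 back to memory            */
}
```

With `a = {1, 2, 3, 4}` and `b = {10, 20, 30, 40}`, `c` ends up as `{11, 22, 33, 44}`.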

Brilliant.

This can add up, especially in CPU-bound applications with heavy workloads.
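The 100 pairs in 25 adds arithmetic can be checked with a small loop. This is a sketch under the same assumptions as before: 128-bit vectors, 32-bit integers, and the GCC/Clang `vector_size` extension. The function `add_pairs` and its counter are hypothetical, and the counter counts vector additions in the source, not instructions measured on real hardware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A 128-bit vector holding four 32-bit integers (GCC/Clang extension). */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* Computes c[i] = a[i] + b[i] for n pairs, four pairs per vector add.
 * Returns how many vector additions were performed. */
size_t add_pairs(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    size_t vector_adds = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        v4i32 va, vb, vc;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vc = va + vb;                 /* four sums at once */
        memcpy(c + i, &vc, sizeof vc);
        vector_adds++;
    }

    /* Leftover pairs when n is not a multiple of 4 are added one by one. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];

    return vector_adds;
}
```

For `n = 100` the function returns 25, since 100 / 4 = 25 and there is no leftover tail.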

For example, there is research on combining SIMD with B+Trees, where we have a lot of keys and values (in pages) that we want to process with SIMD.
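One way SIMD could help inside a B+Tree page is comparing a search key against several stored keys at once. The sketch below is my own illustration of that idea, not a method taken from any particular paper or database. It assumes a node with exactly 4 sorted 32-bit keys and the GCC/Clang `vector_size` extension, and the name `node_lower_bound4` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* A 128-bit vector holding four 32-bit integers (GCC/Clang extension). */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* Given 4 keys sorted in ascending order, returns the index of the first
 * key that is >= needle, or 4 if every key is smaller. */
int node_lower_bound4(const int32_t keys[4], int32_t needle)
{
    v4i32 vk;
    v4i32 vn = {needle, needle, needle, needle};
    memcpy(&vk, keys, sizeof vk);

    /* One vector comparison: each lane is -1 where keys[i] < needle, else 0. */
    v4i32 lt = vk < vn;

    /* Because the keys are sorted, the count of smaller keys is the index. */
    return -(lt[0] + lt[1] + lt[2] + lt[3]);
}
```

With keys `{10, 20, 30, 40}`, a needle of 25 returns 2, which is the slot a B+Tree search would follow or insert at. A real node holds many more keys, so this would run in a loop over the page.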

I just love this stuff.

The things that are so fundamental that we think they can’t be improved actually can be.

Ok now back to why LLMs are being disobedient.

Ved Joshi

software engineer and polymath

3mo

the beauty of Hussein Nasser’s lessons !

Nodar Okroshiashvili

Data Scientist | Data Engineer | Python Developer

3mo

“If you are bored of contemporary topics of AI…” Thats how each good, really good article should be starting 🙂 Really nice write up Hussein

Robson Cassiano

Senior Java Software Engineer @Epic Games | English Teacher CELTA

3mo

And to think it's made of stones.

Said Mohamed 🍉

DevOps Engineer @ Foxconn | B.S. Mathematics

3mo

Great article Hussein Nasser! This level of detail makes us appreciate the basic components of the systems we take for granted.

Amr Marey

Software Engineer

3mo

I like the way you share your courses tutorials and whatever. It's something like here's my experience and what I think about things, not just that specific known words can find on internet.
