Why do we even need SIMD instructions ?

Daniel Lemire

Computer Scientist

Published Aug 9, 2025

Last week, I was chatting with a student and I was explaining what SIMD instructions were. I was making the point that, in practice, all modern processors have SIMD instructions or the equivalent. Admittedly, some small embedded processors do not, but they lack many other standard features as well. SIMD stands for Single Instruction, Multiple Data, a type of parallel computing architecture that allows a single instruction to process multiple data elements simultaneously. For example, you can compare 16 bytes with 16 other bytes using a single instruction.

Suppose you have the following string: stuvwxyzabcdefgh. You want to know whether the string contains the character ‘e‘. What you can do with SIMD instructions is load the input string in a register, and then compare it (using a single instruction) with the string eeeeeeeeeeeeeeee. The result would be something equivalent to 0000000000001000 indicating that there is, indeed, a letter e in the input.

Our programming languages tend to abstract away these SIMD instructions and it perfectly possible to have a long career in the software industry without even knowing what SIMD is. In fact, I suspect that most programmers do not know about SIMD instructions. If you are programming web applications in JavaScript, it is not likely to come up as a topic. (Fun fact, there was an attempt to introduce SIMD in JavaScript by the JavaScript SIMD API.)

Yet if SIMD is everywhere but few people know about it, is it even needed ?

Suppose that you are looking for the first instance of a given character in a string. In C or C++, you might implement a function like so:

The naive_find function searches for the first occurrence of a specific character within a range of characters defined by two pointers, startand end. It takes as input a pointer to the beginning of the range (start), a pointer to the end (end), and the character to find (character). The function iterates through the range character by character using a while loop, checking at each step if the current character (*start) matches the target character. If a match is found, the function returns a pointer to that position. Otherwise, it increments start to move to the next character. If no matching character is found before reaching end, the function returns end, indicating that the character was not found in the specified range. My function is not Unicode-aware, but it is still fairly generic.

What is wrong with this function? As implemented, it might require about 6 CPU instruction per character. Indeed, you have to compare the pointers, de-reference the pointer, compare the result, increment the point, and so forth. Either you or the compiler can improve this number somewhat, but that’s the basic result. Unfortunately, your processor may not be able to retire more than 6 instructions per cycle. In fact, it is likely that your processor might not even be able to sustain 6 instructions per cycle.

Thus, naively implemented, a simple search for a character in a string will run at the speed of your processor or less: if your processor runs at 4 GHz, you will run through the string at 4 GB/s. Importantly, that’s likely true irrespective of whether the string is small and fits in CPU cache, or whether it is large and located outside of the CPU cache.

Is that a problem ? Isn’t 4 GB/s very fast ? Well. It is slower than a disk. The disk in my aging PlayStation 5 has a bandwidth of 5 GB/s. You can go to Amazon and order a disk with a 15 GB/s bandwidth.

Instead, let us compare against the ‘find’ function that we implemented in the simdutf library, using SIMD instructions. The performance you get depends on the processor and the SIMD instructions it supports. Let me use my Apple M4 processor as a reference. It has relatively weak SIMD support with only 16-byte SIMD registers. It pales in comparison to recent AMD processors (Zen 5) which have full support for 64-byte SIMD registers. Still, we can use about 4 instructions per block of 16 bytes. That’s over 20 times fewer instructions per input character. For strings of a few kilobytes or more, I get the following speeds.

That is, the simdutf::find function is more than 20 times faster because it drastically reduces the number of required instructions.

Given our current CPU designs, I believe the SIMD instructions are effectively a requirement to achieve decent performance (i.e., process data faster than it can be read from a disk) on common tasks like a character search.

The source code of my benchmark is available. You might also be interested by the simdutf library which offers many fast string functions.

Why do we even need SIMD instructions ?

Daniel Lemire

Computer Scientist

More articles by this author

Others also viewed

Understanding Conditional Statements and Comparison Operators

Master JavaScript Random Number Generation: A Guide to Math.random() and Beyond

JavaScript Array Methods Code plus 8 Video Examples Included

JavaScript

What Are JavaScript Math Expressions?

"Maximizing Efficiency with For Loops in JavaScript"

What are Arrays in JavaScript: Understanding Data Structures with Real-Life Examples

Hoisting in Javascript

Understanding JavaScript Memory Architecture and Lifecycle

3 Keys Introduction To Objects Arrays

Explore topics

Predictable memory accesses are much faster

Aug 15, 2025

Innovation starts with consumers, not academia

Jul 16, 2025

Dot product on misaligned data

Jul 14, 2025

Why measuring productivity is hard

Jul 12, 2025

Memory-level parallelism :: Apple M2 vs Apple M4

Jul 9, 2025

Just say no to broken JSON

Jul 4, 2025

Base64 for compression

Jul 3, 2025

From code reuse to the impact of generative AI

Jun 27, 2025

Discover C++26’s compile-time reflection

Jun 22, 2025

Life expectancy of men in Canadian provinces

Jun 21, 2025