Why do we even need SIMD instructions ?

Why do we even need SIMD instructions ?

Last week, I was chatting with a student and I was explaining what SIMD instructions were. I was making the point that, in practice, all modern processors have SIMD instructions or the equivalent. Admittedly, some small embedded processors do not, but they lack many other standard features as well. SIMD stands for Single Instruction, Multiple Data, a type of parallel computing architecture that allows a single instruction to process multiple data elements simultaneously. For example, you can compare 16 bytes with 16 other bytes using a single instruction.

Suppose you have the following string: stuvwxyzabcdefgh. You want to know whether the string contains the character ‘e‘. What you can do with SIMD instructions is load the input string in a register, and then compare it (using a single instruction) with the string eeeeeeeeeeeeeeee. The result would be something equivalent to 0000000000001000 indicating that there is, indeed, a letter e in the input.

Our programming languages tend to abstract away these SIMD instructions and it perfectly possible to have a long career in the software industry without even knowing what SIMD is. In fact, I suspect that most programmers do not know about SIMD instructions. If you are programming web applications in JavaScript, it is not likely to come up as a topic. (Fun fact, there was an attempt to introduce SIMD in JavaScript by the JavaScript SIMD API.)

Yet if SIMD is everywhere but few people know about it, is it even needed ?

Suppose that you are looking for the first instance of a given character in a string.  In C or C++, you might implement a function like so:

The naive_find function searches for the first occurrence of a specific character within a range of characters defined by two pointers, startand end. It takes as input a pointer to the beginning of the range (start), a pointer to the end (end), and the character to find (character). The function iterates through the range character by character using a while loop, checking at each step if the current character (*start) matches the target character. If a match is found, the function returns a pointer to that position. Otherwise, it increments start to move to the next character. If no matching character is found before reaching end, the function returns end, indicating that the character was not found in the specified range. My function is not Unicode-aware, but it is still fairly generic.

What is wrong with this function? As implemented, it might require about 6 CPU instruction per character. Indeed, you have to compare the pointers, de-reference the pointer, compare the result, increment the point, and so forth. Either you or the compiler can improve this number somewhat, but that’s the basic result. Unfortunately, your processor may not be able to retire more than 6 instructions per cycle. In fact, it is likely that your processor might not even be able to sustain 6 instructions per cycle.

Thus, naively implemented, a simple search for a character in a string will run at the speed of your processor or less: if your processor runs at 4 GHz, you will run through the string at 4 GB/s. Importantly, that’s likely true irrespective of whether the string is small and fits in CPU cache, or whether it is large and located outside of the CPU cache.

Is that a problem ? Isn’t 4 GB/s very fast ? Well. It is slower than a disk. The disk in my aging PlayStation 5 has a bandwidth of 5 GB/s. You can go to Amazon and order a disk with a 15 GB/s bandwidth.

Instead, let us compare against the ‘find’ function that we implemented in the simdutf library, using SIMD instructions. The performance you get depends on the processor and the SIMD instructions it supports. Let me use my Apple M4 processor as a reference. It has relatively weak SIMD support with only 16-byte SIMD registers. It pales in comparison to recent AMD processors (Zen 5) which have full support for 64-byte SIMD registers. Still, we can use about 4 instructions per block of 16 bytes. That’s over 20 times fewer instructions per input character. For strings of a few kilobytes or more, I get the following speeds.

That is, the simdutf::find function is more than 20 times faster because it drastically reduces the number of required instructions.

Given our current CPU designs, I believe the SIMD instructions are effectively a requirement to achieve decent performance (i.e., process data faster than it can be read from a disk) on common tasks like a character search.

The source code of my benchmark is available. You might also be interested by the simdutf library which offers many fast string functions.

So we can figuratively pull a rabbit out of our hat for the mouth breathers.

To view or add a comment, sign in

Others also viewed

Explore topics