Single Instruction, Multiple Data (SIMD) is a parallel processing technique that allows a single instruction to operate on multiple data elements simultaneously. It is a fundamental concept in modern computer architecture and has been widely adopted to enhance the performance of various applications.
Instead of processing data elements one at a time, the processor can simultaneously perform the same operation on multiple data elements using SIMD. This parallel processing capability is made possible by specialised hardware components called SIMD registers, which can store and manipulate multiple data elements as a single unit.
SIMD can significantly accelerate code execution, but fully realising its benefits requires careful structuring of code and data to align with SIMD operations. Developers may need to rethink their algorithms and data structures, applying techniques such as data transposition, vectorisation, and loop unrolling.
Quantitative simulations in finance rely heavily on vector and matrix operations, making them a domain that naturally lends itself to SIMD parallelism. Linear algebra and data frame libraries, such as polars, abstract away the use of SIMD.
Most modern processors, including those from Intel, AMD, and ARM, support SIMD. Each processor family has its own SIMD instruction set extensions, such as SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) for x86 processors, and NEON for ARM processors. These instruction sets provide a rich set of SIMD operations that can be leveraged by developers to optimise their code.
Rust provides SIMD support through the experimental standard SIMD module and the widely-used wide crate. These tools provide a high-level interface for working with SIMD operations, making it easier for developers to harness the power of SIMD in their Rust programs. In this blog post, we will explore the concept of SIMD in more detail and demonstrate its usage in Rust. We will implement a simple bitset using three different approaches: a naive implementation, an implementation using the experimental standard SIMD module, and an implementation using the wide crate. These examples will demonstrate the performance benefits of SIMD and provide insights into its effective utilisation in Rust.
So, let’s dive in and unlock the potential of SIMD to supercharge our Rust programs!
Overview of the SIMD ecosystem
At the lowest level, one can write SIMD code using SIMD intrinsics. For Rust, these can be found in the core::arch module. This module is organised into architecture-specific submodules, such as core::arch::aarch64 and core::arch::x86.
Therein lies the problem with directly using core::arch: writing portable code using it involves a lot of error-prone branching and conditional compilation.
Going one step up in the ecosystem, we have std::simd. The aim of this module is to provide architecture-agnostic, portable abstractions for SIMD, mapping high-level operations to architecture-specific ones for supported instruction sets.
Unfortunately, std::simd is currently a nightly-only experimental API.
A widely adopted (no pun intended) stable option in the open-source ecosystem is the wide crate. This crate offers data types that aim to leverage SIMD as much as possible.
At the highest level of the ecosystem sit the plethora of crates that leverage SIMD under the hood, but hide all the complexity. polars stands out as the most performant data frame library, making it a cornerstone of modern data science, not only in Rust, but also in the Python world.
As always, one should reach for the highest level of abstraction that does the job. Nonetheless, it’s good to be aware of what’s going on under the hood.
Bitsets
We’ll demonstrate the use of std::simd and wide through a simple fixed-size bitset implementation.
For those less familiar with bitsets, a bitset is a compact data structure that represents a collection of boolean values using individual bits. Each bit in the bitset corresponds to a specific boolean value, where a set bit (1) represents true and an unset bit (0) represents false. Bitsets provide a space-efficient way to store and manipulate boolean values, as they pack multiple values into a single unit of memory.
Bitsets can be used to represent a set of integers by mapping each integer to a specific bit position within the bitset. The presence or absence of an integer in the set is determined by the value of the corresponding bit.
Bitsets can be implemented using unsigned integers, where each bit of the integer represents a single boolean value. The number of bits in the unsigned integer determines the maximum size of the bitset. Bitwise operations, such as AND, OR, XOR, and NOT, are used to manipulate the individual bits of the unsigned integer, allowing for efficient set operations and membership testing.
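For example, using a single u8 as a tiny eight-element set:

```rust
fn main() {
    // The set {1, 3} and the set {3, 5}, each packed into a single u8.
    let a: u8 = (1 << 1) | (1 << 3); // 0b0000_1010
    let b: u8 = (1 << 3) | (1 << 5); // 0b0010_1000

    assert_eq!(a | b, 0b0010_1010); // union: {1, 3, 5}
    assert_eq!(a & b, 0b0000_1000); // intersection: {3}
    assert_ne!(a & (1 << 3), 0);    // membership test: 3 is in the set
}
```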
This makes bitsets incredibly performant in some areas (even without SIMD), such as graph theory.
A concrete example is variations of the Bron-Kerbosch algorithm, which involve many union (|), intersection (&) and difference (& !) operations.
Using bitsets to represent sets of vertices and adjacency matrices can make the implementation extremely fast, particularly for dense graphs.
We’re using bitsets as an example in this blog post because the underlying data and the required operations map closely to SIMD concepts.
A simple bitset using usize
Let’s implement a simple bitset with Vec<usize>
as the backing data structure. The idea is
that each usize
can store either 32 or 64 bits (depending on the CPU architecture), and we
use Vec
to store enough of these blocks to represent the complete set.
We will only implement the union
operation, but other operations can be easily
implemented following the same pattern. We’ll focus on the interesting bits (again, no pun intended),
omitting some of the code.
The complete code is on GitHub.
Our bitset struct is very simple, just a wrapper around the vector of blocks:
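A minimal sketch of the struct (the repository has the full definition):

```rust
/// A fixed-size bitset backed by a vector of bit blocks.
pub struct BitSet<B> {
    blocks: Vec<B>,
}
```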
As we’ll later implement bit blocks using SIMD as well, we used a generic B
as the type of blocks, rather than the concrete usize
type.
Let’s define what interface we need blocks to implement:
As we’ll only focus on the union
operation, we only require the BitOr
interface from the standard library. Other than that, we have SIZE
to tell us the number of bits the block can store and functions to
create new blocks, set the value of a bit, and get the value of a bit.
Using this trait, we can go ahead and implement the bitset.
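A sketch of the implementation, assuming the BitSet struct and Block trait sketched above and an in-place union (the version on GitHub may differ in the details):

```rust
impl<B: Block> BitSet<B> {
    /// Create a bitset that can hold at least `size` bits.
    pub fn new(size: usize) -> Self {
        // Round up so the last few bits also get a block.
        let num_blocks = (size + B::SIZE - 1) / B::SIZE;
        Self { blocks: vec![B::new(); num_blocks] }
    }

    /// Set or clear the bit at `index`.
    pub fn set(&mut self, index: usize, value: bool) {
        self.blocks[index / B::SIZE].set(index % B::SIZE, value);
    }

    /// Return whether the bit at `index` is set.
    pub fn get(&self, index: usize) -> bool {
        self.blocks[index / B::SIZE].get(index % B::SIZE)
    }

    /// Union with another bitset of the same size, block by block.
    pub fn union(&mut self, other: &Self) {
        for (a, b) in self.blocks.iter_mut().zip(&other.blocks) {
            *a = *a | *b;
        }
    }
}
```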
In the union function, we iterate over the blocks and perform the bitwise OR operation on each. The rest of the code isn't really noteworthy.
Finally, let’s take a look at the Block
implementation for usize
:
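A sketch of what this implementation might look like:

```rust
// The number of bits in a usize, resolved at compile time from the target.
#[cfg(target_pointer_width = "64")]
const POINTER_WIDTH: usize = 64;
#[cfg(target_pointer_width = "32")]
const POINTER_WIDTH: usize = 32;

impl Block for usize {
    const SIZE: usize = POINTER_WIDTH;

    fn new() -> Self {
        0
    }

    fn set(&mut self, index: usize, value: bool) {
        if value {
            *self |= 1 << index;
        } else {
            *self &= !(1 << index);
        }
    }

    fn get(&self, index: usize) -> bool {
        (*self >> index) & 1 == 1
    }
}
```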
This code treats every usize as a block of bits. Setting and getting the value of a bit involve some basic bit operations.
POINTER_WIDTH requires further explanation. As the size of usize depends on the architecture, we have added a new constant called POINTER_WIDTH, whose value is set at compile time depending on the target.
Next, let’s swap this implementation out for a SIMD one!
Using std::simd
To leverage SIMD for the union operator, we’ll use std::simd::u64x4
to store
the bits. This is a SIMD vector with 4 lanes, each containing a 64-bit unsigned integer.
As std::simd is nightly-only, let's switch over to the nightly version of rustc. There are various ways this can be done.
We’ve added a rust-toolchain.toml
to the repository, which specifies nightly as the
channel to use for this project.
To make the use of nightly optional, you may also want to add a new
feature to Cargo.toml
to enable the SIMD-specific code.
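For reference, the two additions might look roughly like this (the feature name simd is just the one used in this post):

```toml
# rust-toolchain.toml
[toolchain]
channel = "nightly"

# Cargo.toml
[features]
simd = []
```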
Now that we have a simd feature to conditionally include the SIMD block, let's implement it:
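A sketch of the std::simd block, assuming the Block trait sketched earlier lives in the parent module; note that std::simd also requires #![feature(portable_simd)] at the crate root:

```rust
// Requires a nightly toolchain and #![feature(portable_simd)] at the crate root.
#[cfg(feature = "simd")]
mod simd_block {
    use std::simd::u64x4;

    use super::Block;

    impl Block for u64x4 {
        // Four 64-bit lanes, i.e. 256 bits per block.
        const SIZE: usize = 256;

        fn new() -> Self {
            u64x4::splat(0)
        }

        fn set(&mut self, index: usize, value: bool) {
            let (lane, bit) = (index / 64, index % 64);
            if value {
                self.as_mut_array()[lane] |= 1 << bit;
            } else {
                self.as_mut_array()[lane] &= !(1 << bit);
            }
        }

        fn get(&self, index: usize) -> bool {
            let (lane, bit) = (index / 64, index % 64);
            (self.as_array()[lane] >> bit) & 1 == 1
        }
    }
}
```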
As std::simd::u64x4 already implements BitOr, we have nothing to do in that regard. The union function will work fine with just this much code.
At the end, we’ll benchmark the code to show the benefits of SIMD. Before we do that, let’s quickly add the implementation using wide.
Using wide
As wide is an external crate, add it as a dependency and add a new feature to make the wide implementation optional.
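The Cargo.toml additions might look something like this (the version and feature name are illustrative):

```toml
[dependencies]
wide = { version = "0.7", optional = true }

[features]
wide = ["dep:wide"]
```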
The actual implementation is nearly identical to the std::simd one.
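A sketch, assuming the Block trait from earlier and wide's u64x4 with its array conversions (From<[u64; 4]> and to_array); check the crate documentation for the exact API of the version you use:

```rust
#[cfg(feature = "wide")]
mod wide_block {
    use wide::u64x4;

    use super::Block;

    impl Block for u64x4 {
        // Four 64-bit lanes, i.e. 256 bits per block.
        const SIZE: usize = 256;

        fn new() -> Self {
            u64x4::from([0u64; 4])
        }

        fn set(&mut self, index: usize, value: bool) {
            // wide exposes the lanes via plain arrays, so we round-trip through one.
            let mut lanes = self.to_array();
            let (lane, bit) = (index / 64, index % 64);
            if value {
                lanes[lane] |= 1 << bit;
            } else {
                lanes[lane] &= !(1 << bit);
            }
            *self = u64x4::from(lanes);
        }

        fn get(&self, index: usize) -> bool {
            let lanes = self.to_array();
            (lanes[index / 64] >> (index % 64)) & 1 == 1
        }
    }
}
```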
Now that we have 3 implementations to compare, let’s run some benchmarks.
Benchmarking the code
Predicting the performance gains from applying SIMD explicitly can be tricky. The overhead of data preparation, memory bottlenecks, and cache behaviours can influence the performance in unexpected ways. As a result, the gains need to be measured by benchmarking the specific code on the target hardware.
Cargo provides the built-in cargo bench command for benchmarking. For more sophisticated and statistically reliable benchmarks, a commonly used crate is criterion. Let's add it as a dependency and configure the benchmarks in Cargo.toml.
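The configuration might look roughly like this; criterion benchmarks need harness = false so that criterion can provide its own main function (the version and bench name are illustrative):

```toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "union"
harness = false
```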
We’ll test the union
function with 1 million bits.
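A sketch of what such a benchmark could look like, assuming the BitSet type sketched earlier (the crate name is illustrative):

```rust
// benches/union.rs
use criterion::{criterion_group, criterion_main, Criterion};

use simd_bitset::BitSet; // illustrative crate name

const NUM_BITS: usize = 1_000_000;

fn bench_union(c: &mut Criterion) {
    // One benchmark per block type; here just the plain usize variant.
    let mut a = BitSet::<usize>::new(NUM_BITS);
    let b = BitSet::<usize>::new(NUM_BITS);

    c.bench_function("union_usize", |bencher| {
        bencher.iter(|| a.union(&b));
    });
}

criterion_group!(benches, bench_union);
criterion_main!(benches);
```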
The complete code is on GitHub.
Running the benchmarks on a MacBook Pro with an M3 Max CPU yields the following results:
As expected, the two SIMD implementations (using std::simd and wide) perform nearly the same. They both complete nearly twice as fast as the non-SIMD implementation using usize. These results are specific to the M3 CPU with Neon support.
Notes on compilation options
When compiling a project, rustc (and therefore cargo) picks a default CPU target for the target architecture.
You can check your target architecture by running the rustc version command in verbose mode.
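For example:

```sh
rustc --version --verbose   # short form: rustc -vV
```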
The host field should show the architecture triple of the host.
You can see what CPU targets are available for the architecture with the --print target-cpus flag.
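For example:

```sh
rustc --print target-cpus
```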
This will also show what the default CPU and the native CPU are, as identified by rustc.
Based on this, rustc enables certain target features. You can view the enabled features using the --print cfg flag.
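For example (the target_feature entries in the output are the enabled features):

```sh
rustc --print cfg
```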
This will reflect the features enabled for the default CPU. Requesting rustc to use the native CPU may enable further, more specific features.
For example, you may find that avx is not enabled on an x86_64 system unless you pass in target-cpu=native. The complete list of features available for a specific platform can be listed using the --print target-features flag.
You may want to specify target-cpu=native for your benchmarks to enable features more specific to the CPU. However, as the generated code will be optimised for that specific CPU, this should only be used if the compiled code will run on machines with the same CPU.
You can also enable/disable individual features, allowing you to isolate the effect of certain features.
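For example, using RUSTFLAGS (the specific feature names below are just x86_64 examples):

```sh
# Optimise for the CPU of the machine running the build
RUSTFLAGS="-C target-cpu=native" cargo bench

# Explicitly enable or disable individual target features
RUSTFLAGS="-C target-feature=+avx2,-avx512f" cargo bench
```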
Conclusion
SIMD is a powerful technique that can significantly boost the performance of Rust code by leveraging parallel processing capabilities. By implementing a simple bitset using three different approaches - a naive implementation, one using the experimental standard SIMD module, and another using the wide crate - we have demonstrated the potential performance gains that can be achieved through SIMD optimisations. However, it’s important to keep in mind that the actual performance benefits may vary depending on factors such as data preparation, memory bottlenecks, and cache behaviours, making benchmarking on target hardware crucial for accurate predictions.
As Rust continues to evolve and mature, the ecosystem surrounding SIMD is expected to grow and stabilise further. With the ongoing efforts to stabilise std::simd and the availability of mature crates like wide, Rust developers have a range of options to harness the power of SIMD in their programs.
By carefully considering the trade-offs and measuring the performance impact through benchmarking, developers can make informed decisions on when and how to apply SIMD optimisations to their Rust code, ultimately unlocking new levels of performance and efficiency in their applications.
David Steiner
I'm a software engineer and architect focusing on performant cloud-native distributed systems.