Faster Rust with SIMD

An overview of using SIMD to speed up Rust code.

David Steiner

· 14 min read

Single Instruction, Multiple Data (SIMD) is a parallel processing technique that allows a single instruction to operate on multiple data elements simultaneously. It is a fundamental concept in modern computer architecture and has been widely adopted to enhance the performance of various applications.

Instead of processing data elements one at a time, the processor can simultaneously perform the same operation on multiple data elements using SIMD. This parallel processing capability is made possible by specialised hardware components called SIMD registers, which can store and manipulate multiple data elements as a single unit.

SIMD can significantly accelerate code execution, but fully realising its benefits requires careful structuring of code and data to align with SIMD operations. Developers may need to rethink their algorithms and data structures, involving techniques such as data transposition, vectorisation, and loop unrolling.
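
To make this concrete, here is a minimal sketch: the scalar loop below performs one addition per iteration, while the SIMD variant applies the same addition to all four lanes with a single vector operation (it uses std::simd, the nightly module we'll look at shortly).

fn add_scalar(a: &[f32; 4], b: &[f32; 4]) -> [f32; 4] {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i]; // four separate additions
    }
    out
}

fn add_simd(a: std::simd::f32x4, b: std::simd::f32x4) -> std::simd::f32x4 {
    a + b // one vector addition across all four lanes
}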

Quantitative simulations in finance, which rely heavily on vector and matrix operations, are a domain that naturally lends itself to SIMD parallelism. Linear algebra and data frame libraries, such as polars, abstract away the use of SIMD.

Most modern processors, including those from Intel, AMD, and ARM, support SIMD. Each processor family has its own SIMD instruction set extensions, such as SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) for x86 processors, and NEON for ARM processors. These instruction sets provide a rich set of SIMD operations that can be leveraged by developers to optimise their code.

Rust offers SIMD support through the experimental std::simd module in the standard library and through third-party crates such as the widely used wide crate. These tools provide a high-level interface for working with SIMD operations, making it easier for developers to harness the power of SIMD in their Rust programs. In this blog post, we will explore the concept of SIMD in more detail and demonstrate its usage in Rust. We will implement a simple bitset using three different approaches: a naive implementation, an implementation using the experimental standard SIMD module, and an implementation using the wide crate. These examples will demonstrate the performance benefits of SIMD and provide insights into its effective utilisation in Rust.

So, let’s dive in and unlock the potential of SIMD to supercharge our Rust programs!

Overview of the SIMD ecosystem

At the lowest level, one can write SIMD code using SIMD intrinsics. For Rust, these can be found in the core::arch module. This module is organised into architecture-specific submodules, such as core::arch::aarch64 and core::arch::x86. Therein lies the problem with directly using core::arch - writing portable code using it involves a lot of error-prone branching and conditional compilation.
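
To illustrate, here is a sketch of what even a single 128-bit bitwise OR looks like with raw intrinsics (or_128 is a hypothetical helper, written separately for each architecture):

#[cfg(target_arch = "x86_64")]
fn or_128(a: [u64; 2], b: [u64; 2]) -> [u64; 2] {
    use core::arch::x86_64::*;
    // SSE2 is part of the x86_64 baseline, so these intrinsics are always available.
    unsafe {
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        let mut out = [0u64; 2];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_or_si128(va, vb));
        out
    }
}

#[cfg(target_arch = "aarch64")]
fn or_128(a: [u64; 2], b: [u64; 2]) -> [u64; 2] {
    use core::arch::aarch64::*;
    // NEON is part of the aarch64 baseline.
    unsafe {
        let vc = vorrq_u64(vld1q_u64(a.as_ptr()), vld1q_u64(b.as_ptr()));
        let mut out = [0u64; 2];
        vst1q_u64(out.as_mut_ptr(), vc);
        out
    }
}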

Going one step up in the ecosystem, we have std::simd. The aim of this module is to provide architecture-agnostic, portable abstractions for SIMD, mapping high-level operations to architecture-specific ones for supported instruction sets. Unfortunately, std::simd is currently a nightly-only experimental API.
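
With std::simd, the same operation becomes a single line of portable, safe code (a sketch; it needs a nightly toolchain and the portable_simd feature gate, both covered below):

fn or_256(a: std::simd::u64x4, b: std::simd::u64x4) -> std::simd::u64x4 {
    a | b
}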

A widely adopted (no pun intended) stable option in the open-source ecosystem is the wide crate. This crate offers data types that aim to leverage SIMD as much as possible.
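
A quick taste, using only the APIs we'll rely on later in this post (conversion from an array, to_array and the | operator):

let a = wide::u64x4::from([1, 2, 4, 8]);
let b = wide::u64x4::from([8, 4, 2, 1]);
assert_eq!((a | b).to_array(), [9, 6, 6, 9]);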

At the highest level of the ecosystem sit the plethora of crates that leverage SIMD under the hood but hide all the complexity. polars stands out as one of the most performant data frame libraries, making it a cornerstone of modern data science, not only in Rust but also in the Python world.

As always, one should reach for the highest level of abstraction that does the job. Nonetheless, it’s good to be aware of what’s going on under the hood.

Bitsets

We’ll demonstrate the use of std::simd and wide through a simple fixed-size bitset implementation.

For those less familiar with bitsets, a bitset is a compact data structure that represents a collection of boolean values using individual bits. Each bit in the bitset corresponds to a specific boolean value, where a set bit (1) represents true and an unset bit (0) represents false. Bitsets provide a space-efficient way to store and manipulate boolean values, as they pack multiple values into a single unit of memory.

Bitsets can be used to represent a set of integers by mapping each integer to a specific bit position within the bitset. The presence or absence of an integer in the set is determined by the value of the corresponding bit.

Bitsets can be implemented using unsigned integers, where each bit of the integer represents a single boolean value. The number of bits in the unsigned integer determines the maximum size of the bitset. Bitwise operations, such as AND, OR, XOR, and NOT, are used to manipulate the individual bits of the unsigned integer, allowing for efficient set operations and membership testing.

This makes bitsets incredibly performant in some areas (even without SIMD), such as graph theory. A concrete example is variations of the Bron-Kerbosch algorithm, which involve many union (|), intersection (&) and difference (& !) operations. Using bitsets to represent sets of vertices and adjacency matrices can make the implementation extremely fast, particularly for dense graphs.
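
For instance, with a single u8 as the block, the set {1, 3, 6} and its interaction with {3, 4} look like this (a small illustrative snippet, separate from the bitset we build below):

let a: u8 = (1 << 1) | (1 << 3) | (1 << 6); // {1, 3, 6}
let b: u8 = (1 << 3) | (1 << 4);            // {3, 4}
assert!(a & (1 << 3) != 0);                 // membership: 3 is in the set
assert_eq!(a | b, 0b0101_1010);             // union:        {1, 3, 4, 6}
assert_eq!(a & b, 0b0000_1000);             // intersection: {3}
assert_eq!(a & !b, 0b0100_0010);            // difference:   {1, 6}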

We’re using bitsets as an example in this blog post because the underlying data and the required operations map closely to SIMD concepts.

A simple bitset using usize

Let’s implement a simple bitset with Vec<usize> as the backing data structure. The idea is that each usize can store either 32 or 64 bits (depending on the CPU architecture), and we use Vec to store enough of these blocks to represent the complete set.

We will only implement the union operation, but other operations can be easily implemented following the same pattern. We’ll focus on the interesting bits (again, no pun intended), omitting some of the code. The complete code is on GitHub.

Our bitset struct is very simple, just a wrapper around the vector of blocks:

src/lib.rs
#[derive(Clone)]
pub struct BitSet<B> {
    data: Vec<B>,
}

As we’ll later implement bit blocks using SIMD as well, we use a generic B as the block type, rather than the concrete usize type.

Let’s define what interface we need blocks to implement:

use std::ops::BitOr;

pub trait Block: Clone + Sized + BitOr<Output = Self> {
    const SIZE: usize;

    fn zeros() -> Self;
    fn ones() -> Self;
    fn set(&mut self, idx: usize);
    fn get(&self, idx: usize) -> bool;
}

As we’ll only focus on the union operation, we only require the BitOr interface from the standard library. Other than that, we have SIZE to tell us the number of bits the block can store and functions to create new blocks, set the value of a bit, and get the value of a bit.

Using this trait, we can go ahead and implement the bitset.

src/lib.rs
use itertools::{EitherOrBoth, Itertools};

#[inline(always)]
fn div_rem(idx: usize, block_size: usize) -> (usize, usize) {
    (idx / block_size, idx % block_size)
}

impl<B: Block> BitSet<B> {
    #[inline(always)]
    pub fn required_blocks(n: usize) -> usize {
        let (mut blocks, rem) = div_rem(n, B::SIZE);
        blocks += (rem > 0) as usize;
        blocks
    }

    pub fn ones(n: usize) -> Self {
        Self {
            data: vec![B::ones(); Self::required_blocks(n)],
        }
    }

    pub fn zeros(n: usize) -> Self {
        Self {
            data: vec![B::zeros(); Self::required_blocks(n)],
        }
    }

    #[inline]
    pub fn set(&mut self, idx: usize) {
        let (block_idx, bit_idx) = div_rem(idx, B::SIZE);
        if let Some(block) = self.data.get_mut(block_idx) {
            block.set(bit_idx);
        } else {
            panic!("setting {idx}, which is out of range");
        }
    }

    #[inline]
    pub fn get(&self, idx: usize) -> bool {
        let (block_idx, bit_idx) = div_rem(idx, B::SIZE);
        if let Some(block) = self.data.get(block_idx) {
            block.get(bit_idx)
        } else {
            panic!("getting {idx}, which is out of range");
        }
    }

    pub fn union(self, other: Self) -> Self {
        // zip_longest and EitherOrBoth come from the itertools crate; they let us
        // handle bitsets whose backing vectors have different lengths.
        let zipped = self.data.into_iter().zip_longest(other.data);
        let blocks = zipped.map(|pair| match pair {
            EitherOrBoth::Both(lhs, rhs) => lhs | rhs,
            EitherOrBoth::Left(lhs) => lhs,
            EitherOrBoth::Right(rhs) => rhs,
        });
        blocks.collect::<Vec<B>>().into()
    }
}

impl<B: Block> BitOr for BitSet<B> {
    type Output = Self;

    fn bitor(self, rhs: Self) -> Self::Output {
        self.union(rhs)
    }
}

impl<B: Clone + Block> From<Vec<B>> for BitSet<B> {
    fn from(data: Vec<B>) -> Self {
        Self { data }
    }
}

In the union function, we iterate over the blocks and perform the bitwise or operation on each. The rest of the code isn’t really noteworthy.

Finally, let’s take a look at the Block implementation for usize:

src/lib.rs
impl Block for usize {
    const SIZE: usize = POINTER_WIDTH;

    fn zeros() -> Self {
        0
    }

    fn ones() -> Self {
        usize::MAX
    }

    fn set(&mut self, idx: usize) {
        *self |= 1 << idx;
    }

    fn get(&self, idx: usize) -> bool {
        self & (1 << idx) != 0
    }
}

This code treats every usize as a block of bits. Setting and getting the value of a bit involve some basic bit operations.

POINTER_WIDTH requires further explanation. As the size of usize depends on the architecture, we have added a new constant called POINTER_WIDTH, whose value is set at compile time depending on the target.

src/lib.rs
#[cfg(target_pointer_width = "32")]
const POINTER_WIDTH: usize = 32;
#[cfg(target_pointer_width = "64")]
const POINTER_WIDTH: usize = 64;
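
With that, the usize-backed bitset is usable end to end. A quick usage sketch:

let mut a: BitSet<usize> = BitSet::zeros(1_000);
let mut b: BitSet<usize> = BitSet::zeros(1_000);
a.set(3);
b.set(42);
let c = a | b;
assert!(c.get(3) && c.get(42));
assert!(!c.get(7));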

Next, let’s swap this implementation out for a SIMD one!

Using std::simd

To leverage SIMD for the union operator, we’ll use std::simd::u64x4 to store the bits. This is a SIMD vector with 4 lanes, each containing a 64-bit unsigned integer.

As std::simd is nightly-only, let’s switch over to the nightly version of rustc. There are various ways this can be done. We’ve added a rust-toolchain.toml to the repository, which specifies nightly as the channel to use for this project.
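
A minimal version of that file might look like this (rustup picks it up automatically when you run cargo inside the project):

rust-toolchain.toml
[toolchain]
channel = "nightly"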

To make the use of nightly optional, you may also want to add a new feature to Cargo.toml to enable the SIMD-specific code.

Cargo.toml
[package]
name = "simd-playground"
version = "0.1.0"
edition = "2021"
[features]
simd = []
wide = ["dep:wide"]
[dependencies]
itertools = "0.12"
wide = { version = "0.7", optional = true }
[dev-dependencies]
criterion = "0.5"
[[bench]]
name = "benchmark"
harness = false

Now that we have a simd feature to conditionally include the SIMD block, let’s implement it:

src/lib.rs
#[cfg(feature = "simd")]
impl Block for std::simd::u64x4 {
const SIZE: usize = 64 * 4;
fn zeros() -> Self {
std::simd::u64x4::splat(0)
}
fn ones() -> Self {
std::simd::u64x4::splat(u64::MAX)
}
#[inline(always)]
fn set(&mut self, idx: usize) {
let (lane, bit) = div_rem(idx, 64);
self[lane] |= 1 << bit;
}
#[inline(always)]
fn get(&self, idx: usize) -> bool {
let (lane, bit) = div_rem(idx, 64);
let value = self[lane] & (1 << bit);
value != 0
}
}

As std::simd::u64x4 already implements BitOr, we have nothing to do in that regard. The union function will work fine with just this much code.
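
One more detail: std::simd lives behind the portable_simd nightly feature, so the crate root has to opt in. One way to do this, gated on our simd Cargo feature, is:

src/lib.rs
#![cfg_attr(feature = "simd", feature(portable_simd))]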

At the end, we’ll benchmark the code to show the benefits of SIMD. Before we do that, let’s quickly add the implementation using wide.

Using wide

As wide is an external crate, add it as a dependency and add a new feature to make the wide implementation optional.

Cargo.toml
[package]
name = "simd-playground"
version = "0.1.0"
edition = "2021"
[features]
simd = []
wide = ["dep:wide"]
[dependencies]
itertools = "0.12"
wide = { version = "0.7", optional = true }
[dev-dependencies]
criterion = "0.5"
[[bench]]
name = "benchmark"
harness = false

The actual implementation is nearly identical to the std::simd one.

src/lib.rs
#[cfg(feature = "wide")]
impl Block for wide::u64x4 {
const SIZE: usize = 64 * 4;
fn zeros() -> Self {
Self::MIN
}
fn ones() -> Self {
Self::MAX
}
#[inline(always)]
fn set(&mut self, idx: usize) {
let (lane, bit) = div_rem(idx, 64);
let mut array = self.to_array();
array[lane] |= 1 << bit;
*self = Self::from(array)
}
#[inline(always)]
fn get(&self, idx: usize) -> bool {
let (lane, bit) = div_rem(idx, 64);
let value = self.to_array()[lane] & (1 << bit);
value != 0
}
}

Now that we have three implementations to compare, let’s run some benchmarks.

Benchmarking the code

Predicting the performance gains from applying SIMD explicitly can be tricky. The overhead of data preparation, memory bottlenecks, and cache behaviours can influence the performance in unexpected ways. As a result, the only reliable way to assess the gains is to benchmark the specific code on the target hardware.

Cargo provides the built-in cargo bench command for benchmarking. For more sophisticated and statistically reliable benchmarks, a commonly used crate is criterion. Let’s add it as a dependency and configure the benchmarks in Cargo.toml.

Cargo.toml
[package]
name = "simd-playground"
version = "0.1.0"
edition = "2021"
[features]
simd = []
wide = ["dep:wide"]
[dependencies]
itertools = "0.12"
wide = { version = "0.7", optional = true }
[dev-dependencies]
criterion = "0.5"
[[bench]]
name = "benchmark"
harness = false

We’ll test the union function with 1 million bits. The complete code is on GitHub.
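
For context, here is a trimmed-down sketch of what such a criterion benchmark can look like (the function name and setup are illustrative rather than the exact code from the repository):

benches/benchmark.rs
use criterion::{criterion_group, criterion_main, Criterion};
use simd_playground::BitSet;

const BITS: usize = 1_000_000;

fn bench_union(c: &mut Criterion) {
    let x: BitSet<usize> = BitSet::ones(BITS);
    let y: BitSet<usize> = BitSet::zeros(BITS);
    c.bench_function("union with usize", |b| b.iter(|| x.clone() | y.clone()));
}

criterion_group!(benches, bench_union);
criterion_main!(benches);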

Running the tests on a MacBook Pro with an M3 Max CPU yields these results:

union with usize     time: [27.947 µs 28.298 µs 28.617 µs]
                     change: [-4.8234% +4.5867% +13.657%] (p = 0.38 > 0.05)
                     No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

union with std SIMD  time: [15.410 µs 16.030 µs 16.595 µs]
                     change: [-5.6976% +3.1249% +14.338%] (p = 0.62 > 0.05)
                     No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

union with wide SIMD time: [15.454 µs 15.908 µs 16.285 µs]
                     change: [-5.3262% +0.8923% +7.6818%] (p = 0.79 > 0.05)
                     No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

As expected, the two SIMD implementations (using std::simd and wide) perform nearly the same. Both complete in roughly half the time of the non-SIMD implementation using usize. These results are specific to the M3 CPU with Neon support.

Notes on compilation options

When compiling a project, rustc (and therefore cargo) picks a default CPU target for the target architecture.

You can check your target architecture by running

Terminal window
rustc -vV

The host field shows the target triple of the host machine.

You can see what CPU targets are available for the architecture by running

Terminal window
rustc --print target-cpus

This will also show what the default CPU and the native CPU are, as identified by rustc.

Based on this, rustc enables certain target features. You can view the enabled features using the --print cfg flag.

Terminal window
rustc --print cfg

This will reflect the features enabled for the default CPU. Requesting rustc to use the native CPU may enable further, more specific features.

Terminal window
rustc -C target-cpu=native --print cfg

For example, you may find that avx is not enabled on an x86_64 system, unless you pass in target-cpu=native. The complete list of features available for a specific platform can be listed using the --print target-features flag.

Terminal window
rustc --print target-features
rustc --target x86_64-unknown-linux-gnu --print target-features

You may want to specify target-cpu=native for your benchmarks to enable features more specific to the CPU. However, as the generated code will be optimised for that specific CPU, this should only be used if the compiled code will run on machines with the same CPU.

Terminal window
RUSTFLAGS='-C target-cpu=native' cargo bench --all-features

You can also enable/disable individual features, allowing you to isolate the effect of certain features.

Terminal window
RUSTFLAGS='-C target-feature=+avx2' cargo bench --all-features

Conclusion

SIMD is a powerful technique that can significantly boost the performance of Rust code by leveraging parallel processing capabilities. By implementing a simple bitset using three different approaches - a naive implementation, one using the experimental standard SIMD module, and another using the wide crate - we have demonstrated the potential performance gains that can be achieved through SIMD optimisations. However, it’s important to keep in mind that the actual performance benefits may vary depending on factors such as data preparation, memory bottlenecks, and cache behaviours, making benchmarking on target hardware crucial for accurate predictions.

As Rust continues to evolve and mature, the ecosystem surrounding SIMD is expected to grow and stabilise further. With the ongoing efforts to stabilise std::simd and the availability of mature crates like wide, Rust developers have a range of options to harness the power of SIMD in their programs. By carefully considering the trade-offs and measuring the performance impact through benchmarking, developers can make informed decisions on when and how to apply SIMD optimisations to their Rust code, ultimately unlocking new levels of performance and efficiency in their applications.


David Steiner

I'm a software engineer and architect focusing on performant cloud-native distributed systems.
