Exploring Parallelization (and the Galaxy)

22 November 2025 · 991 words · 5 minute read

I've always been a big fan of space, and a clear intersection between space and computer science is the N-body simulation. The problem is these simulations are brutal to compute because every body interacts with every other body. The core force calculation is $O(N^2)$ per step, giving a total cost of $O(N^2 T)$ across $T$ timesteps.

Each body has a location, velocity, and mass. Its acceleration is the sum of all incoming gravitational forces. Straightforward but heavy compute.
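
Concretely, the acceleration of body $i$ is the standard Newtonian sum over every other body (implementations typically add a small softening term so near-collisions don't blow up):

\mathbf{a}_i = G \sum_{j \ne i} \frac{m_j \, (\mathbf{r}_j - \mathbf{r}_i)}{\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^3}

Evaluating that sum for all $N$ bodies every step is exactly where the $O(N^2)$ comes from.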

Barnes-Hut $\big[\, O(N \log(N)) \,\big]$ and the Fast Multipole Method $\big[\, O(N) \,\big]$ improve the complexity of the force calculation, and I recently implemented a 2D Barnes-Hut simulation in Python but was unsatisfied for multiple reasons:

  1. It was only 2D.
  2. It was written in Python.
  3. See 2.

Python is great, but it's not the language for writing high-performance, parallelizable code. That's where Rust comes in: low-level control over memory will help maximize performance for this experiment.

Parallel Processing (without GPUs)

To squeeze real performance out of a CPU, you have two main avenues: parallel threads (MIMD) and vector instructions (SIMD). Threads handle the broad-level work, and SIMD handles the fine-grained operations inside that work. If the data layout doesn't align nicely with the vector units, the CPU spends its time shuffling data around instead of computing.

As an aside: Modern GPU architecture uses SIMT, which is (roughly) a hybrid of MIMD and SIMD. Running simulations on a GPU, with its thousands of cores, would still obviously dog walk our hyper-optimized CPU performance, but this post is a pro-CPU safe space. We don't talk about those dirty GPUs in here... perhaps in a future post.

Multiple Instruction, Multiple Data (MIMD)

Given $N$ bodies and $K$ CPU cores, we can divide the workload evenly into $K$ partitions of $\frac{N}{K}$ bodies each and launch one thread per core. Unlike the GPU's SIMT model, the threads running on the CPU are genuinely independent instruction streams. However, since the force on any particle depends on all $N$ particles, these threads still need to sync each step to exchange positions and accumulate forces before proceeding to the next time step.
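
A minimal sketch of that idea using scoped threads (illustrative only, not the code benchmarked below; the flat-array layout, the function name, and the softening constant are assumptions): each thread writes accelerations for its own slice of bodies while reading the shared position array, and the implicit join at the end of the scope is the per-step sync point.

// Hypothetical MIMD sketch: K scoped threads, each owning ~N/K bodies.
fn compute_accelerations(pos: &[[f64; 3]], mass: &[f64], acc: &mut [[f64; 3]], k: usize) {
    let n = pos.len();
    let chunk = ((n + k - 1) / k).max(1); // ceil(N / K)
    std::thread::scope(|s| {
        for (t, acc_chunk) in acc.chunks_mut(chunk).enumerate() {
            let start = t * chunk;
            s.spawn(move || {
                for (i, a) in acc_chunk.iter_mut().enumerate() {
                    let p = pos[start + i];
                    let mut sum = [0.0f64; 3];
                    for j in 0..n {
                        let d = [pos[j][0] - p[0], pos[j][1] - p[1], pos[j][2] - p[2]];
                        // Softening keeps r2 > 0, so the j == i term contributes zero.
                        let r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + 1e-9;
                        let w = mass[j] / (r2 * r2.sqrt()); // G folded into the units
                        sum[0] += w * d[0];
                        sum[1] += w * d[1];
                        sum[2] += w * d[2];
                    }
                    *a = sum;
                }
            });
        }
    }); // every thread joins here: the per-step synchronization barrier
}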

Single Instruction, Multiple Data (SIMD)

Array of Structures (Scalar Processing)

We're all taught OOP when we first learn to program, so it just makes sense to organize our data in a similar structure: each body and its data held in a struct, and the structs stored in a list. The more bodies, the longer the list.

// AoS: Array of Structures
struct BodyAoS {
    x: f64,
    y: f64,
    z: f64,
}
// E.g.
// [ {x: x_1, y: y_1, z: z_1}, {...}, ... ]

This follows the AoS approach and, as it turns out, is absolutely miserable for vectorization. Why?

The CPU wants to load a chunk of one field (e.g., all the $x$'s) into a wide vector register. AoS forces it to hop over $y$ and $z$ every time. This triggers slow, scattered memory loads (called gathers) and drags unrelated values into cache lines. It works, but it wastes the hardware.

Let's visualize this layout in memory:

\text{Memory}: \underbrace{x_1, y_1, z_1}_\text{Body 1}, \underbrace{x_2, y_2, z_2}_\text{Body 2}, \underbrace{x_3, y_3, z_3}_\text{Body 3}, \cdots

Stop polluting your memory cache. It's bad for the environment and your CPU latency.

Structure of Arrays (Vector Processing)

SoA fixes the layout problem by grouping identical fields together. Now the CPU can pull in a whole vector of positions with one clean load, and cache lines fill with useful data instead of unwanted noise.

// SoA: Structure of Arrays
struct BodiesSoA {
    x: Vec<f64>,
    y: Vec<f64>,
    z: Vec<f64>,
}
/* E.g.
{ 
    x: <x_1, x_2, x_3, ...>,
    y: <y_1, y_2, y_3, ...>,
    z: <z_1, z_2, z_3, ...>
}
*/

Let's see what the memory looks like now:

\text{Memory}: \underbrace{x_1, x_2, x_3, x_4, \cdots}_{x \text{ array}}, \underbrace{y_1, y_2, y_3, y_4, \cdots}_{y \text{ array}}, \underbrace{z_1, z_2, z_3, z_4, \cdots}_{z \text{ array}}, \cdots

Look at that data locality! This is the layout SIMD actually wants. All values adjacent in memory.

But switching the layout isn't enough on its own. If the loop structure still looks like scalar code, the compiler won't auto-vectorize anything meaningful. You now have to get your hands dirty if you want to see the compute payoff: manually organizing the loop so the CPU can actually use the hardware you just laid the data out for.
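
To make that concrete, here is a rough sketch of what "organizing the loop" can look like over the SoA arrays (illustrative, not the benchmarked code; the kernel name, the separate mass slice, and the softening constant are assumptions, and a real version might also use pointer/chunk iteration or explicit intrinsics). The inner loop is unrolled by four, each accumulator lane walks a single coordinate array with unit stride, and there is no per-body struct to hop across.

// Illustrative SIMD-friendly kernel: acceleration on body `i` from all bodies,
// reading straight down the SoA arrays.
fn accel_on(i: usize, b: &BodiesSoA, mass: &[f64]) -> [f64; 3] {
    let (xi, yi, zi) = (b.x[i], b.y[i], b.z[i]);
    let n = b.x.len();
    let (mut ax, mut ay, mut az) = ([0.0f64; 4], [0.0f64; 4], [0.0f64; 4]);
    let mut j = 0;
    // Main unrolled loop: four independent accumulator lanes keep the FP
    // pipeline busy and map naturally onto 4 x f64 vector registers.
    while j + 4 <= n {
        for lane in 0..4 {
            let k = j + lane;
            let (dx, dy, dz) = (b.x[k] - xi, b.y[k] - yi, b.z[k] - zi);
            // Softening keeps r2 > 0, so the k == i term contributes zero.
            let r2 = dx * dx + dy * dy + dz * dz + 1e-9;
            let w = mass[k] / (r2 * r2.sqrt());
            ax[lane] += w * dx;
            ay[lane] += w * dy;
            az[lane] += w * dz;
        }
        j += 4;
    }
    // Scalar tail for body counts not divisible by 4.
    while j < n {
        let (dx, dy, dz) = (b.x[j] - xi, b.y[j] - yi, b.z[j] - zi);
        let r2 = dx * dx + dy * dy + dz * dz + 1e-9;
        let w = mass[j] / (r2 * r2.sqrt());
        ax[0] += w * dx;
        ay[0] += w * dy;
        az[0] += w * dz;
        j += 1;
    }
    [ax.iter().sum(), ay.iter().sum(), az.iter().sum()]
}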

Implementation and Benchmarking

Approaches

Let's implement and benchmark these four different approaches:

  1. AoS - will serve as our baseline and follows traditional OOP principles with no consideration for CPU architecture. There will be no optimization focus with this approach.
  2. SoA - a single-threaded, naive approach to SoA where only memory is restructured to be stored contiguously. The optimization focus for this approach is data layout and CPU caching.
  3. SoA (Optimized) - a single-threaded but complete implementation of SIMD. More specifically, the core force loop is optimized by manually unrolling values to force the CPU to use its vector registers. The optimization focus is fully leveraging the memory contiguity from the previous approach for more efficient data parallelism.
  4. SoA (Hybrid) - a multi-threaded (MIMD) implementation of our third approach (SIMD) across all CPU cores, with help from the rayon crate.
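
As a rough sketch of the hybrid layer (again illustrative rather than the benchmarked code, and assuming the accel_on kernel sketched earlier), rayon supplies the MIMD half while each task runs the SIMD-friendly inner loop:

// Hypothetical hybrid sketch: rayon threads on the outside (MIMD),
// the unrolled SoA kernel on the inside (SIMD).
use rayon::prelude::*;

fn step_hybrid(bodies: &BodiesSoA, mass: &[f64], acc: &mut [[f64; 3]]) {
    acc.par_iter_mut()
        .enumerate()
        .for_each(|(i, a)| *a = accel_on(i, bodies, mass));
    // rayon joins its workers here, giving the per-step barrier before
    // positions and velocities are integrated.
}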

I expect each approach to outperform the one before it, with a significant gap between SoA (Hybrid) and AoS.

Results

Naive SoA vs AoS

Surprisingly, plain SoA is slower than AoS. Even though SoA fixes the memory layout, the loop is still written and executed like chunky scalar code. Every iteration pulls from multiple arrays, which means more independent pointer jumps and more cache misses, adding overhead. AoS vectorizes poorly, but in this case it still comes out ahead.

Naive SoA underperforms basic AoS while optimized SoA achieves a ~30% improvement in performance.

Optimized SoA

Once the inner loop is rewritten to actually feed the vector units using:

  • manual unrolling
  • pointer-based iteration
  • predictable strides

we see the results. The CPU can finally load and process wide batches of coordinates without detouring through scattered loads or unrelated fields. This cuts out a huge chunk of memory stalls and latency, which is where the ~1.3× speedup comes from.

Parallel processing crushes all single-threaded approaches.

Hybrid SoA (SIMD + MIMD)

Hybrid SoA scales well across cores, but tops out around 3.4×. That's still the best performer by a wide margin, but it fell short of what I expected. The ceiling is perhaps related to memory bandwidth, threading overhead, or cache behavior, and is something to investigate later.

Conclusion

SoA isn’t a magic bullet. The data layout only pays off when the code is built to exploit it. Otherwise, AoS wins by default.

Next steps for this project:

  1. Port this to a GPU.
  2. Implement Fast Multipole Method.
  3. Compare scaling.

Related Work

The full code is available in this gist.