SIMD

SIMD (Single Instruction, Multiple Data) processes multiple values in a single CPU instruction. A 256-bit AVX register holds 4 Float64 values — one instruction operates on all 4 simultaneously. Mojo exposes SIMD as a first-class type, not a library hack.

Code

from sys.info import simd_width_of

fn main():
    # SIMD vector of 4 Float32 values
    var a = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
    var b = SIMD[DType.float32, 4](10.0, 20.0, 30.0, 40.0)

    # One instruction: adds all 4 pairs simultaneously
    var c = a + b
    print(c)  # [11.0, 22.0, 33.0, 44.0]

    # Query hardware SIMD width
    var width = simd_width_of[DType.float32]()
    print("SIMD width for float32:", width)

Alignment

SIMD loads perform best on aligned memory. On most modern CPUs a misaligned access costs extra cycles, and aligned-only load instructions fault outright. When allocating buffers for SIMD, ensure the alignment matches the vector width:

# Allocation for SIMD operations (UnsafePointer.alloc aligns to the
# element type; over-align if you need vector-width alignment)
var ptr = UnsafePointer[Float32].alloc(256)

# Load a SIMD-width chunk starting at element offset 0
var vec = ptr.load[width=4](0)
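Putting allocation and loads together, a whole buffer can be processed in SIMD-width strides. This is a minimal sketch assuming the `UnsafePointer` alloc/load API shown above; the buffer length (256) is chosen to divide evenly by the vector width, so no scalar tail loop is needed:

```mojo
fn fill_and_sum() -> Float32:
    var ptr = UnsafePointer[Float32].alloc(256)
    for i in range(256):
        ptr[i] = Float32(i)

    # Accumulate 4 partial sums in parallel
    var acc = SIMD[DType.float32, 4](0)
    # Stride by the vector width: 64 iterations instead of 256
    for i in range(0, 256, 4):
        acc += ptr.load[width=4](i)

    ptr.free()
    # Collapse the 4 partial sums into one scalar
    return acc.reduce_add()
```

Note the accumulator stays vectorized inside the loop; the horizontal reduction happens once, at the end.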

Constraint

Create two SIMD vectors of 8 Float32 values. Compute their element-wise product and sum the result: a dot product of 8 elements as one vector multiply plus one horizontal reduction, instead of an 8-iteration scalar loop.
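One possible solution sketch, assuming `SIMD.reduce_add` performs the horizontal sum (the operand values here are arbitrary):

```mojo
fn main():
    var a = SIMD[DType.float32, 8](1, 2, 3, 4, 5, 6, 7, 8)
    var b = SIMD[DType.float32, 8](8, 7, 6, 5, 4, 3, 2, 1)

    # One SIMD multiply: all 8 products at once
    var prod = a * b
    # One horizontal reduction: collapse the vector to a scalar
    var dot = prod.reduce_add()
    print(dot)  # 120.0
```

Work it by hand to check: 1·8 + 2·7 + 3·6 + 4·5 + 5·4 + 6·3 + 7·2 + 8·1 = 120.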

Why It Matters

Without SIMD, a loop over 1024 floats takes 1024 multiply instructions. With 8-wide SIMD, it takes 128. That's an 8x speedup from a single change. This is why data layout (Phase 2) matters — SIMD needs contiguous, aligned data.
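That loop can be sketched with the `vectorize` helper from Mojo's `algorithm` package, which calls a parameterized closure once per SIMD-width chunk and handles any scalar tail. Treat this as a sketch of the pattern; exact signatures may differ across Mojo versions:

```mojo
from algorithm import vectorize
from sys.info import simd_width_of

fn scale(ptr: UnsafePointer[Float32], n: Int):
    alias width = simd_width_of[DType.float32]()

    @parameter
    fn body[w: Int](i: Int):
        # One SIMD multiply covers w contiguous elements
        ptr.store(i, ptr.load[width=w](i) * 2.0)

    # With 8-wide SIMD and n = 1024, body runs 128 times, not 1024
    vectorize[body, width](n)
```

Because `vectorize` strides through memory in contiguous chunks, it only pays off when the data layout is contiguous and aligned, which is exactly the Phase 2 point above.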