SIMD
SIMD (Single Instruction, Multiple Data) processes multiple values with a single CPU instruction. A 256-bit AVX register holds 4 Float64 values (or 8 Float32 values), and one instruction operates on every lane simultaneously. Mojo exposes SIMD as a first-class type, not a library hack.
Code
from sys.info import simd_width_of

fn main():
    # SIMD vector of 4 Float32 values
    var a = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
    var b = SIMD[DType.float32, 4](10.0, 20.0, 30.0, 40.0)

    # One instruction: adds all 4 pairs simultaneously
    var c = a + b
    print(c)  # [11.0, 22.0, 33.0, 44.0]

    # Query the hardware SIMD width for a given element type
    var width = simd_width_of[DType.float32]()
    print("SIMD width for float32:", width)
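Beyond lane-wise arithmetic, SIMD values also support horizontal reductions, lane casts, and lane-wise math functions. A short sketch (method names follow the recent Mojo stdlib and may differ between versions):

```mojo
from math import sqrt

fn demo():
    var v = SIMD[DType.float32, 4](1.0, 4.0, 9.0, 16.0)

    # Horizontal reduction: sum every lane into one scalar
    var total = v.reduce_add()  # 30.0

    # Widen each lane to Float64
    var wide = v.cast[DType.float64]()

    # Lane-wise square root: [1.0, 2.0, 3.0, 4.0]
    var roots = sqrt(v)
    print(total, wide, roots)
```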
Alignment
SIMD loads and stores are fastest on aligned memory. Misaligned access costs extra cycles on most modern CPUs, and alignment-checked instructions can fault outright. When allocating buffers for SIMD, make sure the alignment matches the vector width:
# Allocate a buffer of 256 Float32 values
# (UnsafePointer.alloc returns memory aligned for the element type;
# for stricter alignment, use an explicitly aligned allocator)
var ptr = UnsafePointer[Float32].alloc(256)

# Load a SIMD-width chunk starting at element offset 0
var vec = ptr.load[width=4](0)
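Put together, loads and stores let you walk a whole buffer one SIMD chunk at a time. A minimal sketch, assuming n is a multiple of the chunk width and that UnsafePointer.store infers the width from the SIMD value it is given:

```mojo
fn scale_by_two(ptr: UnsafePointer[Float32], n: Int):
    # Process 4 lanes per iteration: one load, one multiply, one store
    for i in range(0, n, 4):
        var chunk = ptr.load[width=4](i)
        ptr.store(i, chunk * 2.0)
```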
Constraint
Create two SIMD vectors of 8 Float32 values. Compute their element-wise product and sum the result: a dot product of 8 elements in just two vector operations (a lane-wise multiply plus a horizontal reduce).
Why It Matters
Without SIMD, a loop over 1024 floats takes 1024 multiply instructions. With 8-wide SIMD, it takes 128. That's an 8x speedup from a single change. This is why data layout (Phase 2) matters — SIMD needs contiguous, aligned data.
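That 1024-to-128 reduction is exactly a chunked loop whose stride is the hardware width. A hedged sketch of a SIMD sum over a contiguous Float32 buffer (assumes the length divides evenly by the width; API names per recent Mojo):

```mojo
from sys.info import simd_width_of

fn sum_all(ptr: UnsafePointer[Float32], n: Int) -> Float32:
    alias width = simd_width_of[DType.float32]()  # e.g. 8 on AVX2
    var acc = SIMD[DType.float32, width](0.0)
    # With width == 8, 1024 elements take 128 iterations, not 1024
    for i in range(0, n, width):
        acc += ptr.load[width=width](i)
    # One final horizontal reduce collapses the lanes to a scalar
    return acc.reduce_add()
```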