Performance Primitives
Before touching SIMD or parallelism, you need to understand the primitives that make code fast on a single core: loop ordering, cache locality, tiling (blocking), and arithmetic intensity. These are hardware-level concepts that Mojo lets you control directly.
Code
Naive matrix multiply vs. tiled — same result, vastly different cache behavior:
fn matmul_naive(A: List[Float64],
                B: List[Float64],
                mut C: List[Float64],
                N: Int):
    # Row-major C += A * B. The inner k loop walks B down a column,
    # striding N elements between accesses: a fresh cache line nearly
    # every iteration.
    for i in range(N):
        for j in range(N):
            for k in range(N):
                var c_elem = C[i * N + j]
                c_elem += A[i * N + k] * B[k * N + j]
                C[i * N + j] = c_elem
fn matmul_tiled(A: List[Float64],
                B: List[Float64],
                mut C: List[Float64],
                N: Int,
                TILE: Int):
    # Same arithmetic, reordered so each TILE x TILE block of A, B,
    # and C stays resident in cache while it is reused.
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for kk in range(0, N, TILE):
                for i in range(ii, min(ii + TILE, N)):
                    for j in range(jj, min(jj + TILE, N)):
                        for k in range(kk, min(kk + TILE, N)):
                            var c_elem = C[i * N + j]
                            c_elem += A[i * N + k] * B[k * N + j]
                            C[i * N + j] = c_elem
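A quick way to convince yourself the reordering is safe: both versions compute the same C. A Python transliteration of the two Mojo functions above (assumed faithful; flat row-major lists, same loop structure) can be checked for agreement:

```python
# Python sketch of the two Mojo functions above; N and TILE are kept
# small so the check runs instantly.
import random

def matmul_naive(A, B, C, N):
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i * N + j] += A[i * N + k] * B[k * N + j]

def matmul_tiled(A, B, C, N, TILE):
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for kk in range(0, N, TILE):
                for i in range(ii, min(ii + TILE, N)):
                    for j in range(jj, min(jj + TILE, N)):
                        for k in range(kk, min(kk + TILE, N)):
                            C[i * N + j] += A[i * N + k] * B[k * N + j]

N, TILE = 16, 4
A = [random.random() for _ in range(N * N)]
B = [random.random() for _ in range(N * N)]
C1 = [0.0] * (N * N)
C2 = [0.0] * (N * N)
matmul_naive(A, B, C1, N)
matmul_tiled(A, B, C2, N, TILE)
# Same result, element for element (up to floating-point rounding).
assert all(abs(a - b) < 1e-9 for a, b in zip(C1, C2))
```

Tiling only permutes the order in which the N³ multiply-adds happen, so the sums agree up to floating-point rounding.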
Arithmetic Intensity
The ratio of compute operations to memory operations. High arithmetic intensity means the CPU stays busy computing instead of waiting for memory:
- Vector add: 1 FLOP per 3 memory operations (2 loads + 1 store) ≈ 0.33 (memory-bound)
- Matrix multiply: 2N³ FLOPs over ~3N² elements ≈ O(N) (compute-bound at scale, given the reuse that tiling provides)
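The two bullet points can be checked with a few lines of arithmetic. This Python sketch counts element accesses rather than bytes, a simplification, and assumes perfect reuse for the matmul case:

```python
# Arithmetic intensity = FLOPs / memory operations.
def vector_add_intensity(n):
    flops = n               # 1 add per element
    mem_ops = 3 * n         # 2 loads + 1 store per element
    return flops / mem_ops  # 1/3 regardless of n: memory-bound

def matmul_intensity(n):
    flops = 2 * n**3        # 1 multiply + 1 add per inner iteration
    mem_ops = 3 * n**2      # each matrix touched once, assuming reuse
    return flops / mem_ops  # grows linearly with n: compute-bound

print(vector_add_intensity(1024))   # 0.333...
print(matmul_intensity(64))         # 42.666...
```

Note that the matmul figure is an upper bound: the naive loop order forfeits the reuse, which is exactly what tiling recovers.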
Constraint
Implement both naive and tiled matrix multiply for 64×64 matrices. Use a tile size of 8. Measure the difference by counting cache-friendly vs. cache-unfriendly memory accesses.
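One way to do the counting is a toy cache simulation. The Python sketch below (a hypothetical setup: a direct-mapped cache of 64 lines, 8 Float64 values per 64-byte line) replays just the B[k * N + j] index stream from both loop orders, since B's column-order walk is where the two versions differ most, and counts misses:

```python
# Toy direct-mapped cache: 64 slots, 8 Float64 values (64 bytes) per
# line. Only the B index stream is simulated; A and C are friendlier.
LINE = 8  # Float64 values per cache line

def count_misses(addresses, num_lines=64):
    cache = [-1] * num_lines          # line tag held in each slot
    misses = 0
    for addr in addresses:
        line = addr // LINE
        slot = line % num_lines
        if cache[slot] != line:       # cold or conflict miss
            cache[slot] = line
            misses += 1
    return misses

N, TILE = 64, 8  # N divisible by TILE, so no min() clamping needed
naive = [k * N + j
         for i in range(N) for j in range(N) for k in range(N)]
tiled = [k * N + j
         for ii in range(0, N, TILE) for jj in range(0, N, TILE)
         for kk in range(0, N, TILE)
         for i in range(ii, ii + TILE) for j in range(jj, jj + TILE)
         for k in range(kk, kk + TILE)]
print(count_misses(naive), count_misses(tiled))
```

The absolute counts depend on the assumed cache geometry; the point is the ratio, which favors the tiled order by a wide margin.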
Why It Matters
A modern CPU can execute 16+ FLOPs per cycle but load only about one cache line per cycle. If your code is memory-bound, the CPU sits idle most of the time. Tiling keeps the working set in L1 cache (~32 KB, ~1 ns access) instead of round-tripping to main memory (~100 ns). That's roughly a 100x latency difference.
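To put a number on that gap, here is some illustrative arithmetic (the 3 GHz clock is an assumption; the other figures come from the text):

```python
# Illustrative only: round numbers, assumed 3 GHz clock.
freq_hz = 3.0e9
peak_flops = 16 * freq_hz                  # 16 FLOPs/cycle -> 48 GFLOP/s
vals_per_line = 8                          # 64-byte line / 8-byte Float64
dram_latency_s = 100e-9                    # ~100 ns per line from DRAM
dram_values_per_s = vals_per_line / dram_latency_s
# FLOPs the core could have executed per value fetched from DRAM:
print(peak_flops / dram_values_per_s)      # 600.0
```

In other words, unless each loaded value participates in hundreds of FLOPs, the core spends most of its cycles waiting, which is why raising arithmetic intensity matters more than raw FLOP throughput.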