Performance Primitives
Before touching SIMD or parallelism, you need to understand the primitives that make code fast on a single core: loop ordering, cache locality, tiling (blocking), and arithmetic intensity. These are hardware-level concepts that Mojo lets you control directly.
Code
Naive matrix multiply vs. tiled — same result, vastly different cache behavior:
fn matmul_naive(A: List[Float64],
                B: List[Float64],
                mut C: List[Float64],
                N: Int):
    # Row-major C += A * B. The inner k loop walks B down a column,
    # striding N elements between accesses: a fresh cache line nearly
    # every iteration.
    for i in range(N):
        for j in range(N):
            for k in range(N):
                var c_elem = C[i * N + j]
                c_elem += A[i * N + k] * B[k * N + j]
                C[i * N + j] = c_elem
fn matmul_tiled(A: List[Float64],
                B: List[Float64],
                mut C: List[Float64],
                N: Int,
                TILE: Int):
    # Same arithmetic, reordered so each TILE x TILE block of A, B,
    # and C stays resident in cache while it is reused.
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for kk in range(0, N, TILE):
                for i in range(ii, min(ii + TILE, N)):
                    for j in range(jj, min(jj + TILE, N)):
                        for k in range(kk, min(kk + TILE, N)):
                            var c_elem = C[i * N + j]
                            c_elem += A[i * N + k] * B[k * N + j]
                            C[i * N + j] = c_elem
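A quick way to convince yourself the reordering is safe: both versions compute the same C. A Python transliteration of the two Mojo functions above (assumed faithful; flat row-major lists, same loop structure) can be checked for agreement:

```python
# Python sketch of the two Mojo functions above; N and TILE are kept
# small so the check runs instantly.
import random

def matmul_naive(A, B, C, N):
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i * N + j] += A[i * N + k] * B[k * N + j]

def matmul_tiled(A, B, C, N, TILE):
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for kk in range(0, N, TILE):
                for i in range(ii, min(ii + TILE, N)):
                    for j in range(jj, min(jj + TILE, N)):
                        for k in range(kk, min(kk + TILE, N)):
                            C[i * N + j] += A[i * N + k] * B[k * N + j]

N, TILE = 16, 4
A = [random.random() for _ in range(N * N)]
B = [random.random() for _ in range(N * N)]
C1 = [0.0] * (N * N)
C2 = [0.0] * (N * N)
matmul_naive(A, B, C1, N)
matmul_tiled(A, B, C2, N, TILE)
# Same result, element for element (up to floating-point rounding).
assert all(abs(a - b) < 1e-9 for a, b in zip(C1, C2))
```

Tiling only permutes the order in which the N³ multiply-adds happen, so the sums agree up to floating-point rounding.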
Arithmetic Intensity
The ratio of compute operations to memory operations. High arithmetic intensity means the CPU stays busy computing instead of waiting for memory:
- Vector add: 1 FLOP per 3 memory operations (2 loads + 1 store) ≈ 0.33 (memory-bound)
- Matrix multiply: 2N³ FLOPs over ~3N² elements ≈ O(N) (compute-bound at scale, given the reuse that tiling provides)
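The two bullet points can be checked with a few lines of arithmetic. This Python sketch counts element accesses rather than bytes, a simplification, and assumes perfect reuse for the matmul case:

```python
# Arithmetic intensity = FLOPs / memory operations.
def vector_add_intensity(n):
    flops = n               # 1 add per element
    mem_ops = 3 * n         # 2 loads + 1 store per element
    return flops / mem_ops  # 1/3 regardless of n: memory-bound

def matmul_intensity(n):
    flops = 2 * n**3        # 1 multiply + 1 add per inner iteration
    mem_ops = 3 * n**2      # each matrix touched once, assuming reuse
    return flops / mem_ops  # grows linearly with n: compute-bound

print(vector_add_intensity(1024))   # 0.333...
print(matmul_intensity(64))         # 42.666...
```

Note that the matmul figure is an upper bound: the naive loop order forfeits the reuse, which is exactly what tiling recovers.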
Constraint
Implement both naive and tiled matrix multiply for 64×64 matrices. Use a tile size of 8. Measure the difference by counting cache-friendly vs. cache-unfriendly memory accesses.
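One way to do the counting is a toy cache simulation. The Python sketch below (a hypothetical setup: a direct-mapped cache of 64 lines, 8 Float64 values per 64-byte line) replays just the B[k * N + j] index stream from both loop orders, since B's column-order walk is where the two versions differ most, and counts misses:

```python
# Toy direct-mapped cache: 64 slots, 8 Float64 values (64 bytes) per
# line. Only the B index stream is simulated; A and C are friendlier.
LINE = 8  # Float64 values per cache line

def count_misses(addresses, num_lines=64):
    cache = [-1] * num_lines          # line tag held in each slot
    misses = 0
    for addr in addresses:
        line = addr // LINE
        slot = line % num_lines
        if cache[slot] != line:       # cold or conflict miss
            cache[slot] = line
            misses += 1
    return misses

N, TILE = 64, 8  # N divisible by TILE, so no min() clamping needed
naive = [k * N + j
         for i in range(N) for j in range(N) for k in range(N)]
tiled = [k * N + j
         for ii in range(0, N, TILE) for jj in range(0, N, TILE)
         for kk in range(0, N, TILE)
         for i in range(ii, ii + TILE) for j in range(jj, jj + TILE)
         for k in range(kk, kk + TILE)]
print(count_misses(naive), count_misses(tiled))
```

The absolute counts depend on the assumed cache geometry; the point is the ratio, which favors the tiled order by a wide margin.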
Why It Matters
A modern CPU can execute 16+ FLOPs per cycle but load only about one cache line per cycle. If your code is memory-bound, the CPU sits idle most of the time. Tiling keeps the working set in L1 cache (~32 KB, ~1 ns access) instead of round-tripping to main memory (~100 ns). That's roughly a 100x latency difference.
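To put a number on that gap, here is some illustrative arithmetic (the 3 GHz clock is an assumption; the other figures come from the text):

```python
# Illustrative only: round numbers, assumed 3 GHz clock.
freq_hz = 3.0e9
peak_flops = 16 * freq_hz                  # 16 FLOPs/cycle -> 48 GFLOP/s
vals_per_line = 8                          # 64-byte line / 8-byte Float64
dram_latency_s = 100e-9                    # ~100 ns per line from DRAM
dram_values_per_s = vals_per_line / dram_latency_s
# FLOPs the core could have executed per value fetched from DRAM:
print(peak_flops / dram_values_per_s)      # 600.0
```

In other words, unless each loaded value participates in hundreds of FLOPs, the core spends most of its cycles waiting, which is why raising arithmetic intensity matters more than raw FLOP throughput.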