Kernel Design
A compute kernel is a tight function that does one thing: transform data. The goal is to minimize memory movement and maximize data reuse. Every concept from Phases 1-3 converges here — types, memory layout, SIMD, and cache locality all serve the kernel.
Code
A SIMD-vectorized dot product kernel:
from sys.info import simdwidthof
from memory import UnsafePointer

alias width = simdwidthof[DType.float32]()

fn dot_product(a: UnsafePointer[Float32],
               b: UnsafePointer[Float32],
               n: Int) -> Float32:
    var acc = SIMD[DType.float32, width](0.0)
    # Vectorized main loop: width elements per iteration
    for i in range(0, n - n % width, width):
        var va = a.load[width](i)
        var vb = b.load[width](i)
        acc += va * vb
    # Reduce the SIMD accumulator to a scalar
    var result = acc.reduce_add()
    # Scalar tail: the n % width elements the main loop skipped
    for i in range(n - n % width, n):
        result += a[i] * b[i]
    return result

fn main():
    var a = UnsafePointer[Float32].alloc(8)
    var b = UnsafePointer[Float32].alloc(8)
    for i in range(8):
        a[i] = Float32(i + 1)
        b[i] = Float32(10 * (i + 1))
    var result = dot_product(a, b, 8)
    print("Dot product:", result)
    print("Expected:", 1*10 + 2*20 + 3*30 + 4*40 + 5*50 + 6*60 + 7*70 + 8*80)
    a.free()
    b.free()
Design Principles
- Minimize loads: Reuse data already in registers or L1 cache
- Vectorize: Process SIMD-width elements per iteration
- Handle remainders: Fall back to a scalar tail loop when N isn't divisible by the SIMD width
- Avoid branches: Predictable loop bounds keep the pipeline full and let the hardware prefetcher stream data ahead of the loop
Constraint
Implement a SIMD-vectorized element-wise multiply of two 1024-element Float32 arrays, storing the result in a third array. Handle the case where 1024 is not divisible by your SIMD width.
Why It Matters
This is where all prior knowledge pays off. A well-designed kernel on a single core can approach the theoretical peak FLOPS of the hardware. A poorly designed one wastes 90%+ of available compute. The difference between a 10ms and 100ms inference pass is kernel quality.