Projects
Each project here requires integrating concepts from multiple phases. With the primitives in hand, the goal is to combine them into a working system.
Project 1: Optimized GEMM
Implement General Matrix Multiply (GEMM) with tiling, SIMD vectorization, and cache-aware blocking. Target: within 2× of NumPy's BLAS-backed performance on 512×512 Float32 matrices.
- Requires: Structs, Memory Model, Arrays, Performance Primitives, SIMD, Kernel Design
- Metric: GFLOPS achieved vs. theoretical peak
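A minimal sketch of the blocking structure, in Rust. The tile size and loop order here are illustrative, not tuned; explicit SIMD is omitted, relying instead on the compiler auto-vectorizing the unit-stride inner loop.

```rust
// Illustrative tile size; in practice, tune per cache level.
const TILE: usize = 64;

/// Row-major n×n GEMM, C += A * B, with square blocking so each tile of
/// A, B, and C is reused from cache before being evicted.
fn gemm_tiled(n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for ii in (0..n).step_by(TILE) {
        for kk in (0..n).step_by(TILE) {
            for jj in (0..n).step_by(TILE) {
                for i in ii..(ii + TILE).min(n) {
                    for k in kk..(kk + TILE).min(n) {
                        let aik = a[i * n + k]; // hoisted; reused across the j loop
                        for j in jj..(jj + TILE).min(n) {
                            // innermost loop is unit-stride over b and c,
                            // which the compiler can auto-vectorize
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The jj loop sits outside the i/k loops so that a TILE×TILE block of B stays hot in cache while the whole row panel of A streams past it.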
Project 2: Tensor Struct with Strides
Build a Tensor struct that supports arbitrary dimensions, strides, and views (slicing without copying). Implement reshape, transpose, and element access with stride-based indexing.
- Requires: Structs, Memory Model, Arrays, Ownership
- Metric: Zero-copy transpose and slice operations
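One way the core idea can look in Rust (field and method names are assumptions, and a real implementation would add views that borrow rather than own the buffer). The key point is that transpose only swaps shape and stride entries; no element moves.

```rust
/// A tensor addressed by strides: flat index = offset + dot(idx, strides).
struct Tensor {
    data: Vec<f32>,
    shape: Vec<usize>,
    strides: Vec<usize>, // in elements, not bytes
    offset: usize,
}

impl Tensor {
    /// Contiguous row-major tensor: strides are running products of the shape.
    fn new(shape: Vec<usize>) -> Self {
        let n: usize = shape.iter().product();
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        Tensor { data: vec![0.0; n], shape, strides, offset: 0 }
    }

    /// Stride-based element access for an arbitrary-rank index.
    fn get(&self, idx: &[usize]) -> f32 {
        let flat: usize = self.offset
            + idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum::<usize>();
        self.data[flat]
    }

    /// Zero-copy transpose of a 2-D tensor: swap shape and strides only.
    fn transpose(&mut self) {
        self.shape.swap(0, 1);
        self.strides.swap(0, 1);
    }
}
```

Reshape of a contiguous tensor is the same trick (recompute strides for the new shape); slicing adjusts `offset` and `shape` while keeping the strides.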
Project 3: Execution Graph
Build a simple computation graph where nodes are operations (add, multiply, matmul) and edges are tensors. Implement forward evaluation by topologically sorting the graph and executing nodes in order.
- Requires: Structs, Memory Model, Ownership, Abstraction Costs
- Metric: Correct evaluation of a 10-node graph with shared inputs
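A compressed sketch of the evaluation pass, with scalars standing in for tensors to keep it short. This version cheats on the sorting step: nodes may only reference earlier indices, so the storage order is already a valid topological order (a general graph would need Kahn's algorithm or a DFS first). Shared inputs are computed once and reused by index.

```rust
/// Operations are nodes; operands are indices of earlier nodes (the edges).
enum Op {
    Input(f32),
    Add(usize, usize),
    Mul(usize, usize),
}

/// Forward evaluation in (implicit) topological order: every node's operands
/// are already in `out` by the time the node executes.
fn evaluate(graph: &[Op]) -> Vec<f32> {
    let mut out: Vec<f32> = Vec::with_capacity(graph.len());
    for op in graph {
        let v = match op {
            Op::Input(x) => *x,
            Op::Add(a, b) => out[*a] + out[*b],
            Op::Mul(a, b) => out[*a] * out[*b],
        };
        out.push(v);
    }
    out
}
```

Note how node 0 below is consumed by two downstream nodes but evaluated only once; that sharing is exactly what the 10-node test in the metric exercises.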
Project 4: Operator Fusion
Extend the execution graph to fuse consecutive element-wise operations into a single kernel. Instead of writing intermediate results to memory, compute them in registers.
- Requires: All prior phases
- Metric: Fused kernel runs faster than sequential separate kernels
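The contrast can be shown in miniature with two element-wise ops, scale then add (a hypothetical pair, chosen only for illustration). Unfused, the intermediate array makes a round trip through memory; fused, it lives in a register for one element at a time.

```rust
/// Unfused: two passes, with the intermediate written to and read from memory.
fn scale_then_add_unfused(x: &[f32], s: f32, b: f32) -> Vec<f32> {
    let tmp: Vec<f32> = x.iter().map(|v| v * s).collect(); // intermediate buffer
    tmp.iter().map(|v| v + b).collect()
}

/// Fused: one pass; `v * s` exists only in a register before the add.
fn scale_then_add_fused(x: &[f32], s: f32, b: f32) -> Vec<f32> {
    x.iter().map(|v| v * s + b).collect()
}
```

Element-wise ops are memory-bound, so halving the traffic (no intermediate store/load) is where the speedup in the metric comes from; the fusion pass in the graph just performs this rewrite mechanically on chains of element-wise nodes.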
Validation
For each project, you should be able to explain: which memory layout you chose and why; where the bottleneck is (compute-bound or memory-bound); and what the compiler can and cannot optimize in your implementation.