Projects

Each project requires integrating concepts from multiple phases. With the primitives in hand, the goal is to build the system.

Project 1: Optimized GEMM

Implement General Matrix Multiply (GEMM) with tiling, SIMD vectorization, and cache-aware blocking. Target: within 2× of NumPy's BLAS-backed performance on 512×512 Float32 matrices.

  • Requires: Structs, Memory Model, Arrays, Performance Primitives, SIMD, Kernel Design
  • Metric: GFLOPS achieved vs. theoretical peak
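One way to structure the blocking is sketched below. This is a minimal scalar version that shows the tiling and loop order only, not the SIMD vectorization; `TILE = 64` is an assumed tile size you would tune against your cache sizes.

```rust
// Minimal sketch of a cache-blocked f32 GEMM (C += A * B), row-major,
// square n x n matrices. TILE = 64 is an assumption; tune per cache level.
const TILE: usize = 64;

fn gemm_tiled(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    for ii in (0..n).step_by(TILE) {
        for kk in (0..n).step_by(TILE) {
            for jj in (0..n).step_by(TILE) {
                // Work on one TILE x TILE block at a time so the
                // active working set stays resident in cache.
                for i in ii..(ii + TILE).min(n) {
                    for k in kk..(kk + TILE).min(n) {
                        let aik = a[i * n + k]; // hoisted: reused across j
                        for j in jj..(jj + TILE).min(n) {
                            // Innermost loop streams contiguously over
                            // rows of B and C, which the compiler can
                            // auto-vectorize; an explicit SIMD kernel
                            // would replace this loop.
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The `i-k-j` inner ordering keeps the innermost accesses to `B` and `C` contiguous in memory, which matters as much as the tiling itself.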

Project 2: Tensor Struct with Strides

Build a Tensor struct that supports arbitrary dimensions, strides, and views (slicing without copying). Implement reshape, transpose, and element access with stride-based indexing.

  • Requires: Structs, Memory Model, Arrays, Ownership
  • Metric: Zero-copy transpose and slice operations
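The core idea can be sketched as follows. This is a hypothetical, read-only version (no ownership of the buffer, no slicing, no bounds checks on indices); the names `Tensor`, `get`, and `transpose` are illustrative, not a prescribed API.

```rust
// Sketch of a stride-based tensor view over a shared buffer.
// Strides are in elements, not bytes.
struct Tensor<'a> {
    data: &'a [f32],
    shape: Vec<usize>,
    strides: Vec<usize>,
}

impl<'a> Tensor<'a> {
    // Construct a row-major (C-contiguous) view: the last dimension
    // has stride 1, each earlier stride is the product of later dims.
    fn new(data: &'a [f32], shape: Vec<usize>) -> Self {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        Tensor { data, shape, strides }
    }

    // Stride-based element access: offset = sum(index[d] * stride[d]).
    fn get(&self, idx: &[usize]) -> f32 {
        let offset: usize = idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
        self.data[offset]
    }

    // Zero-copy transpose: reverse shape and strides, reuse the buffer.
    // No element is moved; only the index mapping changes.
    fn transpose(&self) -> Tensor<'a> {
        Tensor {
            data: self.data,
            shape: self.shape.iter().rev().cloned().collect(),
            strides: self.strides.iter().rev().cloned().collect(),
        }
    }
}
```

Slicing works the same way: a view with an adjusted base offset and shape, but the original strides. Reshape is the one operation that may require a copy, when the view is not contiguous.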

Project 3: Execution Graph

Build a simple computation graph where nodes are operations (add, multiply, matmul) and edges are tensors. Implement forward evaluation by topologically sorting the graph and executing nodes in order.

  • Requires: Structs, Memory Model, Ownership, Abstraction Costs
  • Metric: Correct evaluation of a 10-node graph with shared inputs
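A stripped-down sketch of the evaluation loop, using scalars instead of tensors to keep it short. Nodes refer to other nodes by index; the evaluator repeatedly executes any node whose inputs are ready, which is equivalent to processing the graph in topological order (the `Op` enum and `evaluate` signature here are assumptions, not a prescribed design).

```rust
// Scalar computation graph: nodes are ops, edges are index references.
// Nodes may appear in any order in the vector.
#[derive(Clone, Copy)]
enum Op {
    Input(f64),
    Add(usize, usize),
    Mul(usize, usize),
}

// Forward evaluation: keep sweeping until no node makes progress.
// Each sweep executes every node whose inputs are already computed,
// so nodes are effectively visited in topological order. Shared
// inputs are computed once and read by every consumer.
fn evaluate(nodes: &[Op]) -> Vec<Option<f64>> {
    let mut values: Vec<Option<f64>> = vec![None; nodes.len()];
    let mut progressed = true;
    while progressed {
        progressed = false;
        for (i, op) in nodes.iter().enumerate() {
            if values[i].is_some() {
                continue;
            }
            let v = match *op {
                Op::Input(x) => Some(x),
                Op::Add(a, b) => match (values[a], values[b]) {
                    (Some(x), Some(y)) => Some(x + y),
                    _ => None,
                },
                Op::Mul(a, b) => match (values[a], values[b]) {
                    (Some(x), Some(y)) => Some(x * y),
                    _ => None,
                },
            };
            if v.is_some() {
                values[i] = v;
                progressed = true;
            }
        }
    }
    values
}
```

A real implementation would compute an explicit topological order once (e.g. Kahn's algorithm) rather than sweeping, and nodes would produce tensors rather than scalars; the ownership question of who holds each intermediate tensor is the interesting part.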

Project 4: Operator Fusion

Extend the execution graph to fuse consecutive element-wise operations into a single kernel. Instead of writing intermediate results to memory, compute them in registers.

  • Requires: All prior phases
  • Metric: Fused kernel runs faster than sequential separate kernels

Validation

For each project, you should be able to explain: what memory layout you chose and why, whether the bottleneck is compute-bound or memory-bound, and what the compiler can and cannot optimize in your implementation.
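For the compute-bound vs. memory-bound question, a back-of-envelope arithmetic-intensity estimate is often enough. A sketch for GEMM, under the idealized assumption that each of the three matrices is touched exactly once (`gemm_arithmetic_intensity` is a hypothetical helper, not from the projects above):

```rust
// Arithmetic intensity (FLOPs per byte) of an n x n f32 GEMM,
// assuming A and B are each read once and C is written once,
// i.e. perfect caching. Real traffic is higher without blocking.
fn gemm_arithmetic_intensity(n: u64) -> f64 {
    let flops = 2 * n * n * n; // one multiply + one add per inner step
    let bytes = 3 * n * n * 4; // three n x n arrays of 4-byte floats
    flops as f64 / bytes as f64
}
```

For n = 512 this gives 2n/12 ≈ 85 FLOPs/byte, far above the FLOPs-per-byte ratio of typical hardware, so a well-blocked GEMM should be compute-bound; an element-wise chain, by contrast, sits near 0.1 FLOPs/byte and is memory-bound.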