6.3 Matrix multiplication as a performance model

Explanation

General matrix multiplication, often called GEMM, computes:

C = A B

Use this shape convention: A has shape m x k, B has shape k x n, and C has shape m x n. Each output value is

C[i, j] = sum over p of A[i, p] * B[p, j]

The basic algorithm performs about 2 * m * k * n floating-point operations if one multiply and one add are counted separately. The inputs contain m * k + k * n values, and the output contains m * n values. Real speed depends on both FLOPs and how many bytes must move through memory.

For hand-written low-level Rust GEMM in this exercise, use flat Vec<f64> buffers with explicit shape and indexing. A simple row-major index is:

fn row_major(row: usize, col: usize, cols: usize) -> usize {
    row * cols + col
}

With row-major A, B, and C, the direct i, j, p loop order reads A[i, p] contiguously as p changes, but reads B[p, j] with a stride of n. That strided access can waste cache bandwidth. Changing loop order, working with row slices, or blocking the loops can improve reuse.

Contiguous memory matters because cache loads nearby values together. Row-major and column-major layouts choose different fastest-changing indices. Blocking, also called tiling, groups work so a small block of A, B, or C can be reused while it is still in cache. Vectorization lets the CPU apply one instruction to several values. Bounds-check elimination matters because safe Rust indexing checks that an index is in range unless the compiler can prove it from surrounding code.

A practical safe route is:

Check shapes up front: a.len() == m * k, b.len() == k * n, and c.len() == m * n.
Take row slices before inner loops when possible.
Keep loop bounds simple so the compiler can see that indices remain in range.
Measure the result instead of assuming the change helped.

For the library/tensor comparison path, use tenferro typed tensors, currently tenferro_tensor::TypedTensor in the tenferro-rs crate layout. The comparison should check both correctness and the cost difference between a teaching implementation and a tensor or library implementation.

Things to look up

GEMM
Matrix shape convention
Row-major layout
Column-major layout
Contiguous memory
Cache locality
Blocking / tiling
Vectorization
Bounds checks
Bounds-check elimination
FLOPs
Memory bandwidth
tenferro_tensor::TypedTensor
Rust benchmarking

Exercise

Ask an AI agent to create a Cargo project in work/chapter_6.3/ for this experiment. Keep correctness tests and benchmark runs separate.

In that project:

Generate a simple safe Rust GEMM using flat Vec<f64> buffers and explicit m, k, n shapes.
Test small matrices by hand, including at least one non-square case.
Benchmark a controlled run. Record matrix sizes, build mode, CPU/thread assumptions, elapsed time, and an approximate FLOP rate.
Check the current tenferro-rs API, crate names, import paths, and resolved Git revision, then compare the result with a tenferro typed tensor calculation for the same data and shapes.
Ask the agent to optimize the hand-written implementation using safe Rust. Consider shape checks outside the hot loop, row slices, loop order, and blocking.
Re-run the correctness tests after every optimization.
Explain the memory access pattern before and after optimization.
State whether the optimization preserved the scientific contract: same shapes, same mathematical result within the chosen tolerance, and no hidden change to the data layout assumptions.

Notes for the exercise

Unsafe indexing is not part of the main route for this exercise.
If unsafe indexing appears in an AI-generated answer, require a written list of invariants and tests that exercise the boundary cases.
Performance claims need measurements. Include the exact command and input size used for each timing result.
A faster result that fails a correctness test is not an optimization.
Compare memory access order separately from Rust syntax changes.
Because tenferro is under active development, verify the current API and resolved Git revision before trusting AI-generated tensor code.
Do not expect a short teaching implementation to match a tuned library on large dense matrices.