4.3 Cost model: FLOPS, bandwidth, and cache

Explanation

This page belongs to the performance and systems model. On a first pass, you may read the basic memory-layout material and then continue to validation, returning here when speed and memory traffic become important.

Performance is not determined only by the number of arithmetic operations. A processor may be able to perform many floating-point operations per second, but the calculation can still be slow if data arrive from memory too slowly. On modern computers, many scientific programs are limited by memory bandwidth before they reach peak FLOPS.

Registers and cache are small and fast; RAM and storage are larger and slower.

A useful rough order is:

  1. Registers and cache are fastest.
  2. RAM is slower.
  3. SSD/HDD access is much slower.

Bandwidth is how much data can move per unit time. Latency is the delay before data arrive. Cache locality means using data that are close together in memory so that cache works well.

For high performance, it is often more important to reduce memory transfer than to reduce the number of floating-point operations. Contiguous buffers and cache-friendly access order matter. Blocking, also called tiling, is a common idea: split a large computation into smaller blocks so that reused data stay in cache.

For dense matrix operations, a practical rule is to use the system-optimized BLAS and LAPACK libraries instead of writing the inner loops yourself. In Rust, tenferro is the default tensor path for this course, while ndarray and ndarray-linalg are common alternatives in the wider ecosystem when their API or linear algebra integration fits the task.

Performance claims should be measured. A hand-written loop that looks simple may be slower than a library call, and a clever expression may be limited by memory bandwidth. Measure the actual program, and keep correctness tests separate from longer timing experiments.

Things to look up

  • FLOPS
  • Memory bandwidth
  • Cache
  • Latency
  • Cache locality
  • Blocking / tiling
  • Contiguous memory
  • BLAS and LAPACK ecosystem
  • tenferro
  • ndarray-linalg
  • Benchmarking
  • BLAS
  • LAPACK

Exercise

Consider two programs that perform the same number of additions over a large array. One reads memory in order. The other jumps through memory with a large stride. Which one do you expect to be faster, and why?

Then write one sentence explaining how you would measure the difference without mixing the timing code with correctness tests.

Notes for the exercise

  • Do not explain performance only by counting arithmetic operations.
  • Mention memory access order.
  • Mention cache locality.
  • Distinguish bandwidth from FLOPS.
  • Connect this idea to large arrays in scientific computing.
  • Mention whether the data are stored in a contiguous buffer.
  • Do not trust a performance explanation until it has been measured on the relevant problem size.