6.3 Reproducibility as validation

Explanation

A result is more trustworthy when someone can reproduce how it was obtained. Reproducibility is not only a reporting habit. It is part of validation, because a result that cannot be inspected later cannot be checked seriously.

Saving only the final number is usually not enough:

0.33335

A useful saved result keeps the number together with the information needed to understand and rerun the calculation:

{
  "quantity": "integral of x^2 from 0 to 1",
  "method": "trapezoidal_rule",
  "n_intervals": 100,
  "value": 0.33335,
  "expected_scale": "near 1/3",
  "tolerance": 0.0001,
  "git_commit": "git commit hash here",
  "rustc_version": "rustc version here",
  "cargo_version": "cargo version here",
  "command": "cargo run --bin integrate -- --n-intervals 100",
  "cargo_lock_status": "committed and unchanged"
}

The exact format depends on the project. The important point is to save more than the final number: inputs, parameters, code version, software environment, random seed when relevant, output, and enough context to connect the output to the run that produced it.
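As a minimal sketch of the idea, the Rust program below computes the trapezoidal estimate and writes it together with part of its context. It uses only the standard library and formats the JSON by hand; the function names trapezoid and result_record and the output file name result.json are illustrative assumptions, not part of any existing project, and the git commit stays a placeholder.

```rust
use std::fs;

/// Composite trapezoidal rule on [a, b] with n equal intervals.
fn trapezoid(f: fn(f64) -> f64, a: f64, b: f64, n: usize) -> f64 {
    assert!(n > 0, "n must be positive");
    let h = (b - a) / n as f64;
    let interior: f64 = (1..n).map(|i| f(a + i as f64 * h)).sum();
    h * (0.5 * f(a) + interior + 0.5 * f(b))
}

/// Build the record as a JSON string (hypothetical helper).
/// Assumption: the strings passed in contain no characters that need
/// JSON escaping, so plain formatting is enough for this sketch.
fn result_record(n: usize, value: f64, tolerance: f64, git_commit: &str) -> String {
    let mut s = String::from("{\n");
    s.push_str("  \"quantity\": \"integral of x^2 from 0 to 1\",\n");
    s.push_str("  \"method\": \"trapezoidal_rule\",\n");
    s.push_str(&format!("  \"n_intervals\": {},\n", n));
    s.push_str(&format!("  \"value\": {},\n", value));
    s.push_str(&format!("  \"tolerance\": {},\n", tolerance));
    s.push_str(&format!("  \"git_commit\": \"{}\"\n", git_commit));
    s.push_str("}\n");
    s
}

fn square(x: f64) -> f64 {
    x * x
}

fn main() {
    let n = 100;
    let value = trapezoid(square, 0.0, 1.0, n);
    // The commit hash is left as a placeholder, as in the record above.
    let record = result_record(n, value, 1e-4, "git commit hash here");
    fs::write("result.json", &record).expect("could not write result.json");
    print!("{}", record);
}
```

A real project would more likely use a JSON library than hand formatting; the point here is only that the number and its context are written in one step.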

For a Rust project, useful metadata includes the rustc version, cargo version, exact command, git commit, and whether Cargo.lock was committed and unchanged. If the output is small, JSON or plain text may be enough. Large or multidimensional array outputs should use an appropriate array or container format, such as .npy, .npz, or HDF5, rather than an ad hoc text dump.
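The metadata listed above can be collected by the program itself. The sketch below is one possible way to do it, assuming that rustc, cargo, and git are on the PATH and that the program runs inside the project's git repository. The helper names tool_output, or_unknown, and cargo_lock_status are hypothetical. Note that the versions reported are those of the tools found at run time, which may differ from the toolchain that built the binary.

```rust
use std::process::Command;

/// Run a program and return its trimmed stdout, or None if the program
/// is missing or exits with a failure status.
fn tool_output(program: &str, args: &[&str]) -> Option<String> {
    let out = Command::new(program).args(args).output().ok()?;
    if !out.status.success() {
        return None;
    }
    Some(String::from_utf8_lossy(&out.stdout).trim().to_string())
}

/// Replace a missing value with the string "unknown".
fn or_unknown(value: Option<String>) -> String {
    value.unwrap_or_else(|| "unknown".to_string())
}

/// Report whether Cargo.lock is tracked by git and free of local changes.
fn cargo_lock_status() -> String {
    // `git ls-files --error-unmatch` fails if the file is not tracked.
    if tool_output("git", &["ls-files", "--error-unmatch", "Cargo.lock"]).is_none() {
        return "not tracked by git (or git unavailable)".to_string();
    }
    // Empty porcelain output means no staged or unstaged changes.
    match tool_output("git", &["status", "--porcelain", "--", "Cargo.lock"]) {
        Some(s) if s.is_empty() => "committed and unchanged".to_string(),
        Some(_) => "tracked but has uncommitted changes".to_string(),
        None => "unknown".to_string(),
    }
}

fn main() {
    // The arguments as this process received them. This is not the full
    // `cargo run ...` command line, which cargo does not pass through.
    let argv: Vec<String> = std::env::args().collect();

    println!("rustc_version: {}", or_unknown(tool_output("rustc", &["--version"])));
    println!("cargo_version: {}", or_unknown(tool_output("cargo", &["--version"])));
    println!("git_commit: {}", or_unknown(tool_output("git", &["rev-parse", "HEAD"])));
    println!("cargo_lock_status: {}", cargo_lock_status());
    println!("argv: {}", argv.join(" "));
}
```

Recording "unknown" instead of failing is a design choice for this sketch: a missing tool should not stop the computation, but it should be visible in the saved record.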

Separate computation from plotting when possible. The computation program should write result data and metadata first. The plotting script should read that file instead of recomputing the result, so the plotted figure can be traced back to one saved run.
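The separation can be sketched with two functions that share nothing but a file. In this example the result is stored as plain key=value lines, which is an assumption made to keep the code within the standard library; write_result, read_result, and the file name are hypothetical. The reading side stands in for the plotting script and only prints the value.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// Computation side: write the value and its metadata as key=value lines.
/// Assumption: keys contain no '=' and values contain no line breaks.
fn write_result(path: &Path, fields: &BTreeMap<String, String>) -> io::Result<()> {
    let mut text = String::new();
    for (key, value) in fields {
        text.push_str(&format!("{}={}\n", key, value));
    }
    fs::write(path, text)
}

/// Plotting side: read the saved run back. It never recomputes the value.
fn read_result(path: &Path) -> io::Result<BTreeMap<String, String>> {
    let text = fs::read_to_string(path)?;
    Ok(text
        .lines()
        .filter_map(|line| line.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect())
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("integrate_result.txt");

    // Step 1: the computation program saves its run.
    let mut fields = BTreeMap::new();
    fields.insert("method".to_string(), "trapezoidal_rule".to_string());
    fields.insert("n_intervals".to_string(), "100".to_string());
    fields.insert("value".to_string(), "0.33335".to_string());
    write_result(&path, &fields)?;

    // Step 2: the plotting step works only from the saved file.
    let saved = read_result(&path)?;
    let value: f64 = saved["value"].parse().expect("value should be a number");
    println!("plotting value {} from {}", value, path.display());
    Ok(())
}
```

In a real project the two steps would normally be separate programs, so that the figure can be regenerated without rerunning the computation.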

There are several levels of reproducibility:

  • rerun the same code with the same inputs and get the same result,
  • rerun after a code change and check that trusted examples still pass,
  • compare against independent reference data or a simpler reference method.
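
The three levels can be illustrated on the integral used earlier in this section. The sketch below assumes the trapezoidal estimate with 100 intervals as the trusted example and uses Simpson's rule and the exact value 1/3 as independent references; the function names are illustrative.

```rust
/// Composite trapezoidal rule on [a, b] with n equal intervals.
fn trapezoid(f: fn(f64) -> f64, a: f64, b: f64, n: usize) -> f64 {
    assert!(n > 0, "n must be positive");
    let h = (b - a) / n as f64;
    let interior: f64 = (1..n).map(|i| f(a + i as f64 * h)).sum();
    h * (0.5 * f(a) + interior + 0.5 * f(b))
}

/// Composite Simpson's rule, used here as an independent reference method.
fn simpson(f: fn(f64) -> f64, a: f64, b: f64, n: usize) -> f64 {
    assert!(n > 0 && n % 2 == 0, "n must be positive and even");
    let h = (b - a) / n as f64;
    let mut sum = f(a) + f(b);
    for i in 1..n {
        let weight = if i % 2 == 1 { 4.0 } else { 2.0 };
        sum += weight * f(a + i as f64 * h);
    }
    sum * h / 3.0
}

fn square(x: f64) -> f64 {
    x * x
}

fn main() {
    // Level 1: same code, same inputs, same result (bit for bit here,
    // because the calculation is sequential and deterministic).
    let first = trapezoid(square, 0.0, 1.0, 100);
    let second = trapezoid(square, 0.0, 1.0, 100);
    assert_eq!(first.to_bits(), second.to_bits(), "rerun differs");

    // Level 2: regression check against a trusted saved value.
    assert!((first - 0.33335).abs() < 1e-12, "trusted example changed");

    // Level 3: comparison with an independent method and the exact answer.
    let reference = simpson(square, 0.0, 1.0, 100);
    assert!((reference - 1.0 / 3.0).abs() < 1e-12, "reference is off");
    assert!((first - reference).abs() < 1e-4, "outside tolerance");

    println!("all three checks passed: {}", first);
}
```

Bitwise equality is a reasonable level 1 check only for deterministic code on one machine; parallel or cross-platform runs usually need a tolerance instead.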

The third level, comparison against independent reference data, is especially important when using AI agents. If the same agent can freely edit both implementation code and tests, it may accidentally or intentionally make the tests match the bug. One defense is to keep verification data separate from the code under test: fixed inputs, expected outputs, tolerances, provenance, and notes that are reviewed separately from ordinary code edits.

Such data are often called oracle data. One public example is tensor-ad-oracles, which stores machine-readable JSON oracle data and mathematical notes for checking derivative correctness of tensor and linear algebra operations.
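
A verification record of the kind described above might look like the following sketch. The OracleCase struct and its field names are assumptions made for illustration; they do not reproduce the format used by tensor-ad-oracles. In practice the cases would live in a separately reviewed data file rather than in the source code.

```rust
/// One verification case: fixed input, expected output, tolerance,
/// and a note on where the expected value came from.
struct OracleCase {
    name: &'static str,
    input: &'static [f64],
    expected: f64,
    tolerance: f64,
    provenance: &'static str,
}

/// The implementation under test: arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    assert!(!xs.is_empty(), "mean of an empty slice is undefined");
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Compare an actual value with the stored expectation.
fn check(case: &OracleCase, actual: f64) -> Result<(), String> {
    let error = (actual - case.expected).abs();
    if error <= case.tolerance {
        Ok(())
    } else {
        Err(format!(
            "{}: got {}, expected {} (tolerance {}, provenance: {})",
            case.name, actual, case.expected, case.tolerance, case.provenance
        ))
    }
}

/// Illustrative cases whose expected values were worked out by hand.
const CASES: &[OracleCase] = &[
    OracleCase {
        name: "mean_of_one_to_four",
        input: &[1.0, 2.0, 3.0, 4.0],
        expected: 2.5,
        tolerance: 1e-12,
        provenance: "computed by hand: (1 + 2 + 3 + 4) / 4",
    },
    OracleCase {
        name: "mean_of_symmetric_values",
        input: &[-2.0, 0.0, 2.0],
        expected: 0.0,
        tolerance: 1e-12,
        provenance: "symmetric about zero, so the mean is zero",
    },
];

fn main() {
    for case in CASES {
        match check(case, mean(case.input)) {
            Ok(()) => println!("{}: ok", case.name),
            Err(message) => println!("FAILED {}", message),
        }
    }
}
```

Because the expected values carry their own provenance, a reviewer can judge a change to the data without reading the implementation.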

Things to look up

  • Reproducibility
  • Random seed
  • Metadata
  • Version information
  • Provenance
  • Oracle data
  • Tolerance
  • JSON
  • rustc --version
  • cargo --version
  • Cargo.lock
  • .npy
  • .npz
  • HDF5

Exercise

Suppose a program computes a numerical estimate and writes a result file. List the information that should be saved with the result so another student can inspect the calculation.

Then design a small verification-data record for one simple calculation, such as an average, a numerical integral, or a root-finding problem. Include inputs, expected output, tolerance, and provenance.

Advanced: ask an AI agent to inspect tensor-ad-oracles and summarize what kinds of information are stored for each verification case. Focus on the structure of the stored data, not on the details of automatic differentiation.

Notes for the exercise

  • Include parameters.
  • Include software or code version information.
  • Include rustc version, cargo version, command, git commit, and Cargo.lock status for a Rust project.
  • Include the random seed if randomness is used.
  • Include enough information to connect the output to the run that produced it.
  • Choose a result format that fits the data: JSON or text for small scalar or tabular outputs, and .npy, .npz, or HDF5 for large or multidimensional arrays.
  • Keep computation and plotting separate when a result will be plotted.
  • Say which checks should be fast and which checks may be expensive.
  • Explain why verification data should be reviewed carefully when an AI agent edits the project.