ADR 0004: Frozen data as JSON / NPZ, not pickle

Status: Accepted (2026-04-29)

Context

The codegen step emits SSOT data (cycles, gears, anchors, periods, fragments) for inclusion in the wheel. Three serialisation formats were candidates: JSON, NPZ, and pickle.

Decision

  • JSON for everything text-shaped (cycles, gears, anchors, periods, fragments, manifest).
  • NPZ for numerical arrays only (deterministic HDC channel basis vectors).
  • No pickle anywhere.
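As an illustration of the split above, here is a minimal sketch of how text-shaped data and numerical arrays might be written and read back. The helper names (`dump_json`, `dump_npz`, `load_npz`) and the sample records are hypothetical and not taken from the codegen step; the only library calls used are `json.dumps`, `numpy.savez`, and `numpy.load`.

```python
import io
import json

import numpy as np


def dump_json(obj) -> str:
    # Sorted keys and a trailing newline keep the text stable from run to run.
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def dump_npz(arrays: dict) -> bytes:
    # np.savez accepts a file-like object; each keyword becomes an array name.
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    return buf.getvalue()


def load_npz(data: bytes) -> dict:
    # allow_pickle=False makes numpy refuse object arrays, which are the only
    # route by which an .npz file could carry pickled data.
    with np.load(io.BytesIO(data), allow_pickle=False) as npz:
        return {name: npz[name] for name in npz.files}


# Hypothetical sample data, for illustration only.
cycles = {"name": "example", "length": 7}
text = dump_json(cycles)

basis = {"channel_basis": np.arange(6, dtype=np.int8).reshape(2, 3)}
restored = load_npz(dump_npz(basis))
```

Passing `allow_pickle=False` explicitly is redundant on current numpy releases, where it is already the default, but it keeps the "no pickle" rule visible at the call site.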

Consequences

  • JSON is human-readable, language-agnostic, and CodeQL-clean.
  • NPZ is the standard numpy serialisation for arrays; smaller than JSON for big numerical data.
  • Pickle is forbidden: unpickling can execute arbitrary code on load (CWE-502), so anyone who managed to tamper with _data/ could run code in the user's interpreter.
  • Frozen-data drift is caught by the manifest-SHA check in test_data_freshness.py.
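A manifest-SHA check of the kind mentioned in the last bullet could look roughly like the sketch below. It is not the contents of test_data_freshness.py: the manifest file name (`manifest.json`), its layout (file name mapped to a SHA-256 hex digest), and the function names are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = "manifest.json"  # assumed name; the real one may differ


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(data_dir: Path) -> dict:
    # Map every data file (except the manifest itself) to its SHA-256 digest.
    return {
        p.name: sha256_of(p)
        for p in sorted(data_dir.iterdir())
        if p.is_file() and p.name != MANIFEST_NAME
    }


def stale_files(data_dir: Path) -> list:
    # Names whose digest differs from the recorded one, including files that
    # were added or removed since the manifest was written.
    recorded = json.loads((data_dir / MANIFEST_NAME).read_text(encoding="utf-8"))
    current = build_manifest(data_dir)
    names = recorded.keys() | current.keys()
    return sorted(n for n in names if recorded.get(n) != current.get(n))


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "gears.json").write_text("{}\n", encoding="utf-8")
    manifest = json.dumps(build_manifest(data_dir), sort_keys=True)
    (data_dir / MANIFEST_NAME).write_text(manifest, encoding="utf-8")
    fresh = stale_files(data_dir)    # nothing has changed yet
    (data_dir / "gears.json").write_text('{"edited": true}\n', encoding="utf-8")
    drifted = stale_files(data_dir)  # the edit is detected
```

A test built on this idea would simply assert that `stale_files(...)` returns an empty list for the shipped data directory.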

Alternatives considered

  • Pickle. Rejected on security grounds.
  • Protobuf / msgpack. Would add a runtime dependency for what's already a tiny corpus; not worth it.
  • CSV. Considered for tabular data (gears) but JSON's nested structure is a better fit for teeth: dict[reconstruction] -> int.
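The security objection to pickle can be shown with a deliberately harmless sketch. The `Payload` class is hypothetical; it relies only on the documented `__reduce__` hook, which lets a pickled object name any callable to be invoked at load time. Here the callable is `eval` on a fixed arithmetic string, but a tampered file could name anything importable.

```python
import pickle


class Payload:
    # __reduce__ tells pickle how to rebuild the object: "call this callable
    # with these arguments". pickle.loads performs that call unconditionally.
    def __reduce__(self):
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs eval("6 * 7"); no Payload instance is ever created.
result = pickle.loads(blob)
```

After loading, `result` is the integer 42 rather than a `Payload` object, which is the point: the bytes decide what runs. `json.loads`, and `numpy.load` with `allow_pickle=False`, have no equivalent hook.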