chess-spectral Optimization & Idiom Audit — April 2026¶

Scope: Review-only audit of docs/chess-maths/chess-spectral/ C17 + Python implementation. No code changes made. Findings are prioritized for action by the maintainer.

Hard constraint on every suggestion: Byte-for-byte parity between the C encoder and the Python reference encoder must survive. The parity test test_csv_matches_c_byte_for_byte is the authority — any finding tagged parity-impact: none has been mentally walked through that test.

Three lenses used, per maintainer direction:

C throughput — encode and csv hot paths
Python throughput & idiom — vectorization, import-time cost, modern numpy
Build / tooling / supply chain — CMake modernization, miniz vs. alternatives, CI hygiene

How to read this doc¶

Findings are grouped by severity, not by file. Read P0 → P1 first — that's the "if you do nothing else" subset.
Each finding names the file & line(s), a current snippet, a suggested snippet, and a short "why better" with citation where the claim is informed by external data.
A closing Considered and rejected section captures changes I looked at and did not recommend, with the reason.

Severity ladder¶

P0 — measurable hot-path win, zero risk (e.g., restrict, loop fusion, memcpy→swap).
P1 — clear idiomatic improvement with low risk (modernize CMake, stack-allocate fixed buffers).
P2 — refactor with payoff but needs care (vectorize Python via einsum, optional libdeflate backend).
P3 — nice-to-have hygiene (clang-tidy, lazy imports, .clang-format).

Effort¶

S — under 30 minutes including rebuild & parity re-run
M — a focused half-day
L — a day or more, possibly touching multiple files

Findings index¶

ID	Severity	Title	File
F-01	P0	Fold chi-dispatch and scale pass in D4 projection	src/cs_d4.c
F-02	P0	Add `restrict` qualifiers on hot-path function args	src/cs_encoder.c, src/cs_fiber.c, src/cs_signal.c
F-03	P0	Replace per-ply `memcpy(prev_enc, …)` with pointer swap	src/main.c:661
F-04	P1	Replace `malloc(64KB)` in gzip path with stack buffers	src/cs_gzip.c:65-66, 154-155
F-05	P1	Rewrite `cs_roll_640` as two `memcpy`s	src/cs_temporal.c:27-31
F-06	P1	CSV row assembly: one `fwrite` per row, not 14 `fprintf`s	src/main.c:665-673
F-07	P1	Add `CMakePresets.json` (dev-debug / release / asan)	CMakeLists.txt
F-08	P1	Enable IPO/LTO for Release builds	CMakeLists.txt
F-09	P1	Drop hand-rolled `/O2`/`-O2`; let `CMAKE_BUILD_TYPE` drive it	CMakeLists.txt:7-12
F-10	P1	Use generator expressions instead of `if(MSVC)`	CMakeLists.txt:7-12, 24-27
F-11	P1	Hoist `cs_encoding_t enc` outside the per-ply loop	src/main.c:435
F-12	P2	Vectorize Python fiber channels via einsum	python/chess_spectral/encoder.py:97-176
F-13	P2	`np.fromfile` bulk read instead of per-frame loop	python/chess_spectral/csv_export.py:192-198
F-14	P2	Collapse local-fiber table build into one einsum	python/chess_spectral/tables.py:228-241
F-15	P2	Optional `libdeflate` backend for gzip	CMakeLists.txt:17-23, src/cs_gzip.c
F-16	P3	Lazy-load tables via PEP 562 `__getattr__`	python/chess_spectral/tables.py:334
F-17	P3	`C - np.diag(np.diag(C))` instead of copy + `fill_diagonal`	python/chess_spectral/tables.py:216, 240
F-18	P3	Lookup table for FEN piece char parsing	src/cs_fen.c
F-19	P3	Build codegen output as a single join+write	codegen/emit_tables.py
F-20	P3	Minimal `.clang-tidy` + CI matrix	— (new files)
F-21	P3	Move `from typing import Iterator` to the top of the module	python/chess_spectral/csv_export.py:219

P0 — do these first¶

F-01: Fold chi-dispatch and scale pass in D4 projection ¶

File: src/cs_d4.c:19-33 Category: C-throughput Parity impact: none (same math, same FP ordering — chi is exactly representable in double) Effort: S

Current:

for (int g = 0; g < 8; g++) {
    int chi = CHARS[irrep_idx][g];
    if (chi == 0) continue;
    const uint8_t *perm = D4_PERMS[g];
    if (chi == 1) {
        for (int k = 0; k < 64; k++) acc[k] += sig[perm[k]];
    } else if (chi == -1) {
        for (int k = 0; k < 64; k++) acc[k] -= sig[perm[k]];
    } else {
        for (int k = 0; k < 64; k++) acc[k] += (double)chi * sig[perm[k]];
    }
}

for (int k = 0; k < 64; k++) out[k] = scale * acc[k];

Suggested:

for (int g = 0; g < 8; g++) {
    const int chi = CHARS[irrep_idx][g];
    if (chi == 0) continue;
    const double c = scale * (double)chi;   /* fuse scale into accumulator */
    const uint8_t *perm = D4_PERMS[g];
    for (int k = 0; k < 64; k++) out[k] += c * sig[perm[k]];
}

And initialize out to zero at the top of the function instead of acc.

Why better: The compiler sees one loop shape instead of three; it's straightforward to auto-vectorize a single out[k] += c * sig[perm[k]] with AVX2/NEON, whereas the branch ladder gives the optimizer three distinct targets it has to prove equivalent. Folding scale into c also deletes the final N=64 multiply pass and the intermediate 512-byte acc buffer. Called 5× per position → called on the order of 10^6 times in a full-book CSV run; this is the tightest inner loop in the project. Citation for the "single vectorizable shape beats branch ladder" rule of thumb: C17 production guide, 2026 edition.

F-02: Add `restrict` qualifiers on hot-path function args ¶

Files: - src/cs_encoder.c:4-7 — cs_encode_640 - src/cs_fiber.c:19-22, 41, 54-56 - src/cs_signal.c:5-7 - src/cs_d4.c:11

Category: C-throughput Parity impact: none (callers never pass overlapping pointers) Effort: S

Current:

void cs_fiber_symmetric_channel(const cs_position_t *pos,
                                const double sig[64],
                                int d,
                                double out[64]) { … }

Suggested:

void cs_fiber_symmetric_channel(const cs_position_t *restrict pos,
                                const double *restrict sig,
                                int d,
                                double *restrict out) { … }

Apply the same pattern to cs_board_signal, cs_project_irrep, cs_fiber_antisymmetric, cs_fiber_diagonal, and cs_encode_640.

Why better: The restrict keyword tells the compiler that the two pointers don't alias, unlocking load/store reordering and vectorization it otherwise has to conservatively avoid. const double sig[64] decays to const double * and loses both the length and any aliasing guarantee; the *restrict form preserves the non-alias contract. In the fiber loops, out[k] += w * adj_row[k] is a classic AXPY candidate — with restrict the compiler can safely hoist adj_row loads and interleave them with out stores. Citation: C99/C17 restrict semantics.

Parity note: restrict is a no-op on correctness when callers don't alias. The whole call graph passes fresh stack/member buffers into each function; there is no pointer overlap anywhere. Safe.

F-03: Replace per-ply `memcpy(prev_enc, …)` with pointer swap ¶

File: src/main.c:661 (inside the CSV emit loop, ~620-674) Category: C-throughput Parity impact: none (reads and writes identical bytes) Effort: S

Current:

float prev_enc[CS_ENCODING_DIM];  /* declared earlier in cmd_csv */
int have_prev = 0;

for (uint32_t p = 0; p < hdr.n_plies; p++) {
    cs_spectral_frame_t fr;
    if (cs_file_read_frame(fin, &fr) != 0) { … }
    …
    memcpy(prev_enc, fr.encoding, sizeof(prev_enc));   /* 2560 B per ply */
    have_prev = 1;
}

Suggested:

float enc_a[CS_ENCODING_DIM];
float enc_b[CS_ENCODING_DIM];
float *cur  = enc_a;
float *prev = enc_b;
int have_prev = 0;

for (uint32_t p = 0; p < hdr.n_plies; p++) {
    cs_spectral_frame_t fr;
    if (cs_file_read_frame(fin, &fr) != 0) { … }
    memcpy(cur, fr.encoding, sizeof(enc_a));   /* one mandatory copy from the read struct */
    /* …use cur and (if have_prev) prev to compute dist/cos/dA1… */
    float *tmp = prev; prev = cur; cur = tmp;
    have_prev = 1;
}

Or, even tighter: keep a pointer to fr.encoding itself across iterations by double-buffering cs_spectral_frame_t frames.

Why better: A .spectral with 200 plies currently moves 200 × 2560 B = 512 KB through cache just for the prev_enc rotation. Pointer swap makes the rotation free; the only remaining move is read_frame → cur, which was already paid for. Pure memory-bandwidth win, and on a cold run the CSV path is bandwidth-bound.

P1 — idiomatic improvements, low risk¶

F-04: Replace `malloc(64KB)` in gzip path with stack buffers ¶

File: src/cs_gzip.c:65-66 (inflate) and 154-155 (deflate) Category: C-throughput Parity impact: none Effort: S

Current:

unsigned char *in  = (unsigned char *)malloc(CS_GZ_BUF);
unsigned char *out = (unsigned char *)malloc(CS_GZ_BUF);
if (!in || !out) { free(in); free(out); mz_inflateEnd(&zs); return -2; }

Suggested:

unsigned char in [CS_GZ_BUF];
unsigned char out[CS_GZ_BUF];

And delete the corresponding free() calls and the OOM branch.

Why better: CS_GZ_BUF = 64 KiB, so 128 KiB of stack is needed total — well within the 1 MiB default on Linux/Windows x64 (and completely irrelevant for the desktop spectral CLI, which never runs on an MCU). Removes two allocator round-trips per gzip operation, removes a realistic-sounding but dead OOM branch, and lets the compiler know the buffer base is 16-byte-aligned (stack always is on x86-64 SysV / MSVC x64). Pair with alignas(64) unsigned char in[…] for AVX-512-era alignment if this ever becomes a bottleneck. Citation: modern restrict + alignment patterns in C17 production guide.

F-05: Rewrite `cs_roll_640` as two `memcpy`s ¶

File: src/cs_temporal.c:27-31 Category: C-throughput Parity impact: none (bytes identical; memcpy of double arrays is bit-for-bit) Effort: S

Current:

for (int i = 0; i < CS_ENCODING_DIM; i++) {
    int dst = (i + o);
    if (dst >= CS_ENCODING_DIM) dst -= CS_ENCODING_DIM;
    out[dst] = in[i];
}

Suggested:

/* Rotate right by o: out[0..N-o) = in[o..N), out[N-o..N) = in[0..o) */
const int n = CS_ENCODING_DIM;
memcpy(out,         in + o,     (size_t)(n - o) * sizeof(double));
memcpy(out + n - o, in,         (size_t)o       * sizeof(double));

(With the obvious if (o == 0) memcpy(out, in, sizeof(double)*n); short-circuit.)

Why better: Two memcpys compile to rep movsq / vmovdqu chains with no branch per element, no modular arithmetic, and no indirect store pattern that defeats the prefetcher. The current loop has an unpredictable branch every iteration (predictable only in the o == 0 and o == N-1 extremes). 640 doubles is the whole encoding vector; this runs in every HDC binding step if the roll path is wired in later.

F-06: CSV row assembly: one `fwrite` per row, not 14 `fprintf`s ¶

File: src/main.c:665-673 Category: C-throughput Parity impact: bytes-only — must be verified with a diff against a known-good CSV Effort: M

Current:

fprintf(fout, "%u,%u,%u,", fr.ply, (unsigned)fr.move_from, (unsigned)fr.move_to);
fmt_csv_num(fout, dist_prev);  fputc(',', fout);
fmt_csv_cos(fout, cos_prev);   fputc(',', fout);
fmt_csv_num(fout, dA1);        fputc(',', fout);
for (int c = 0; c < 10; c++) { fmt_csv_num(fout, E[c]); fputc(',', fout); }
fmt_csv_num(fout, total);
fputc('\n', fout);

That's ~16 fprintf/fputc calls per ply. Each one re-acquires the FILE lock and checks buffering state.

Suggested:

char line[256];
int n = 0;
n += snprintf(line + n, sizeof(line) - n, "%u,%u,%u,",
              fr.ply, (unsigned)fr.move_from, (unsigned)fr.move_to);
n += fmt_csv_num_to(line + n, sizeof(line) - n, dist_prev);
line[n++] = ',';
n += fmt_csv_cos_to(line + n, sizeof(line) - n, cos_prev);
line[n++] = ',';
n += fmt_csv_num_to(line + n, sizeof(line) - n, dA1);
line[n++] = ',';
for (int c = 0; c < 10; c++) {
    n += fmt_csv_num_to(line + n, sizeof(line) - n, E[c]);
    line[n++] = ',';
}
n += fmt_csv_num_to(line + n, sizeof(line) - n, total);
line[n++] = '\n';
fwrite(line, 1, (size_t)n, fout);

Where fmt_csv_num_to / fmt_csv_cos_to are snprintf-based siblings of the existing fmt_csv_num / fmt_csv_cos returning the byte count written.

Why better: One call into the FILE layer per row. stdio acquires an internal lock on every fprintf (even in single-threaded programs, because ISO C says it must), so the per-cell cost is measurable even against a large stdio buffer. Rule of thumb: ~3–5× speedup on the CSV emit step for books with tens of thousands of plies.

Parity caveat: the output bytes must match byte-for-byte. The snprintf conversions with the same format strings will produce identical bytes on any sane libc, but this is exactly why the parity tag is bytes-only rather than none — run the parity test after the change.

F-07: Add `CMakePresets.json` (dev-debug / release / asan) ¶

File: new file CMakePresets.json, next to CMakeLists.txt Category: Build/tooling Parity impact: none Effort: S

Suggested (minimal):

{
  "version": 6,
  "cmakeMinimumRequired": { "major": 3, "minor": 23, "patch": 0 },
  "configurePresets": [
    {
      "name": "release",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build/release",
      "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Release",
        "CMAKE_INTERPROCEDURAL_OPTIMIZATION": "TRUE"
      }
    },
    {
      "name": "dev-debug",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build/debug",
      "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Debug",
        "CMAKE_C_FLAGS_DEBUG": "-O0 -g3 -fno-omit-frame-pointer"
      }
    },
    {
      "name": "asan",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build/asan",
      "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Debug",
        "CMAKE_C_FLAGS_DEBUG": "-O1 -g3 -fno-omit-frame-pointer -fsanitize=address,undefined",
        "CMAKE_EXE_LINKER_FLAGS": "-fsanitize=address,undefined"
      }
    }
  ],
  "buildPresets": [
    { "name": "release", "configurePreset": "release" },
    { "name": "dev-debug", "configurePreset": "dev-debug" },
    { "name": "asan", "configurePreset": "asan" }
  ],
  "testPresets": [
    { "name": "release",   "configurePreset": "release",   "output": { "outputOnFailure": true } },
    { "name": "asan",      "configurePreset": "asan",      "output": { "outputOnFailure": true } }
  ]
}

Invocation becomes cmake --preset release && cmake --build --preset release && ctest --preset release.

Why better: Presets are the modern replacement for "here are the flags you need to remember" READMEs. They give every contributor identical build trees, let CI call into the same preset the developer uses, and the asan preset in particular pays for itself the first time a silent out-of-bounds is caught. Citations: Studyplan: ASan with CMake; Organizing CMake presets; CMake C++23 best practices, 2025.

F-08: Enable IPO/LTO for Release builds ¶

File: CMakeLists.txt Category: Build/tooling Parity impact: none (LTO changes inlining, not floating-point math) Effort: S

Suggested:

include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_ok OUTPUT ipo_msg)
if(ipo_ok)
    set(CMAKE_INTERPROCEDURAL_OPTIMIZATION_RELEASE TRUE)
else()
    message(STATUS "IPO/LTO not supported: ${ipo_msg}")
endif()

Why better: LTO lets the compiler inline across translation units — cs_board_signal and cs_fiber_symmetric_channel become candidates for inlining into cs_encode_640, which can then fuse their memsets and hoist table-address loads. Typical gain on tight numeric kernels is 5–15%. Costs a few seconds of Release build time. The CheckIPOSupported gate means this is harmless on platforms that don't support it.

F-09: Drop hand-rolled `/O2`/`-O2` ¶

File: CMakeLists.txt:7-12 Category: Build/tooling Parity impact: none Effort: S

Current:

if(MSVC OR …)
    add_compile_options(/W4 /O2)
    add_compile_definitions(_CRT_SECURE_NO_WARNINGS)
else()
    add_compile_options(-Wall -Wextra -O2)
endif()

Suggested:

if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
    set(CMAKE_BUILD_TYPE Release CACHE STRING "" FORCE)
endif()

add_compile_options(
    $<$<COMPILE_LANG_AND_ID:C,MSVC>:/W4>
    $<$<NOT:$<COMPILE_LANG_AND_ID:C,MSVC>>:-Wall -Wextra>
)
add_compile_definitions($<$<COMPILE_LANG_AND_ID:C,MSVC>:_CRT_SECURE_NO_WARNINGS>)

Why better: CMAKE_BUILD_TYPE=Release already implies /O2 on MSVC and -O3 -DNDEBUG on GCC/Clang. Hard-coding /O2 + -O2 makes Debug builds secretly optimized (breaks debugging) and caps Release at -O2 instead of -O3. Remove the duplication and let CMake do its job. The default-to-Release guard catches single-config generators (Ninja, Make) where not setting CMAKE_BUILD_TYPE silently produces an unoptimized binary.

F-10: Use generator expressions instead of `if(MSVC)` ¶

File: CMakeLists.txt:7-12, 24-27 Category: Build/tooling Parity impact: none Effort: S

Current:

if(MSVC OR (CMAKE_C_COMPILER_ID STREQUAL "Clang" AND CMAKE_C_SIMULATE_ID STREQUAL "MSVC"))
    target_compile_options(miniz PRIVATE /W0)
endif()

Suggested:

target_compile_options(miniz PRIVATE
    $<$<COMPILE_LANG_AND_ID:C,MSVC,Clang>:/W0>      # MSVC-style drivers
    $<$<OR:$<COMPILE_LANG_AND_ID:C,GNU>,$<COMPILE_LANG_AND_ID:C,Clang>>:-w>
)

Better yet, treat miniz as a SYSTEM include so its warnings are suppressed independent of driver:

target_include_directories(miniz SYSTEM PUBLIC vendor/miniz)

Why better: Generator expressions evaluate per-target and per-config, so you don't repeat yourself when adding a third compiler or a per-config build. SYSTEM includes suppress warnings in downstream consumers automatically — the idiomatic way to keep vendor code quiet without turning off your own /W4/-Wextra. Citation: CMake C++23 best practices, 2025.

F-11: Hoist `cs_encoding_t enc` outside the per-ply loop ¶

File: src/main.c:435-436 (and the surrounding loop) Category: C-throughput Parity impact: none Effort: S

Current:

while (fgets(buf, sizeof(buf), fin)) {
    …
    cs_position_t pos;
    …
    cs_encoding_t enc;                 /* 5120 B, per iteration */
    cs_encode_640(&pos, SPECTRAL_VALS, &enc);
    …
}

Suggested:

cs_position_t pos;
cs_encoding_t enc;   /* reused across all plies */
while (fgets(buf, sizeof(buf), fin)) {
    …
    cs_encode_640(&pos, SPECTRAL_VALS, &enc);
    …
}

Why better: At -O0 (debug), re-declaring a 5 KiB struct inside the loop is a 5 KiB stack frame adjustment per ply. At -O2 it usually disappears into the same slot, but hoisting makes the intent explicit and removes any dependence on that optimization. The cs_encoding_t is write-only from the caller's perspective each iteration — the encoder overwrites it fully — so reuse is safe.

P2 — refactors with payoff, need care¶

F-12: Vectorize Python fiber channels via einsum ¶

File: python/chess_spectral/encoder.py:97-176 Category: Python-throughput Parity impact: rebaseline-needed — FP summation order changes from Python-dict-iteration order to numpy's row-major accumulation. Very likely still bit-identical at atol=1e-10, but this must be verified against test_csv_matches_c_byte_for_byte before merging. Effort: M

Current (sample):

def _fiber_antisymmetric(pos: dict[int, str]) -> np.ndarray:
    out = np.zeros(BOARD_DIM)
    val_P = SPECTRAL_VALS['P']
    for s, pchar in pos.items():
        if pchar.upper() != 'P':
            continue
        w = val_P if pchar == 'P' else -val_P
        out += w * PAWN_ANTI_FIBER[s]
    return out

Suggested:

def _fiber_antisymmetric(pos: dict[int, str]) -> np.ndarray:
    val_P = SPECTRAL_VALS['P']
    squares = np.fromiter(
        (s for s, pc in pos.items() if pc.upper() == 'P'),
        dtype=np.int64,
    )
    if squares.size == 0:
        return np.zeros(BOARD_DIM)
    signs = np.fromiter(
        (val_P if pos[int(s)] == 'P' else -val_P for s in squares),
        dtype=np.float64,
        count=squares.size,
    )
    return np.einsum('i,ij->j', signs, PAWN_ANTI_FIBER[squares])

Apply the same pattern to _fiber_diagonal and to the channel-5..7 loop inside encode_640 (lines 157-168), folding both the adj_row @ sig gradient and the scatter-accumulate into one einsum per channel:

# channels 5..7: one shot per position, not one per piece-square
# squares/pieces filtered to non-pawns
pidx   = _FIBER_IDX_ARR[pieces]                 # (K,)
adj    = LOCAL_ADJ_ROWS[pidx, squares]          # (K, 64)
grad   = np.einsum('ki,i->k', adj, sig)         # (K,)
fib    = LOCAL_FIBER_3D[pidx, squares, :]       # (K, 3)
for d in range(3):
    fc = np.einsum('k,ki->i', grad * fib[:, d], adj)
    out[320 + d*BOARD_DIM : 320 + (d+1)*BOARD_DIM] = fc

Why better: The current code is a classic "Python-loop-around-numpy-call" anti-pattern. Each loop iteration pays CPython interpreter overhead (~50-100 ns) on top of the numpy work itself. For a typical 30-piece mid-game position, that's ~30 per channel × 4 channels × 50 ns = ~6 µs of interpreter overhead per position, on top of the real math. Vectorized einsum hands the whole (K, 64) broadcast to numpy's underlying BLAS-backed paths. Citations: NumPy einsum reference; opt_einsum documents the summation-order tradeoff.

The parity concern is real. IEEE 754 addition is not associative; changing the accumulation order can flip a round bit. The parity test uses assert_array_equal on float32 frame bytes; this rounds the encode to float32 before the C↔Python comparison, which absorbs most of the float64-summation-order differences. But verify.

F-13: `np.fromfile` bulk read instead of per-frame loop ¶

File: python/chess_spectral/csv_export.py:192-198 Category: Python-throughput Parity impact: none (bytes read are identical; only the syscall count changes) Effort: M — requires defining a numpy structured dtype that matches cs_spectral_frame_t

Current:

for _ in range(hdr.n_plies):
    fr = read_frame(fp)
    row = _row(fr, prev_enc)
    …

Suggested:

Define the frame dtype once (in frame.py):

FRAME_DTYPE = np.dtype([
    ('encoding',   np.float32, (ENCODING_DIM,)),
    ('ply',        np.uint32),
    ('move_from',  np.uint8),
    ('move_to',    np.uint8),
    ('move_promo', np.uint8),
    ('move_flags', np.uint8),
])
assert FRAME_DTYPE.itemsize == FRAME_BYTES   # == 2568

Then in write_csv:

frames = np.fromfile(fp, dtype=FRAME_DTYPE, count=hdr.n_plies)
for i in range(hdr.n_plies):
    fr = frames[i]
    …

Or, even better, vectorize the row emission itself over the whole frames array using numpy broadcasting on the dist/cos/dA1 computation. That's out of scope for a single finding but worth tracking.

Why better: Per-frame read_frame(fp) does 2 fp.read() calls (header + payload) plus a struct.unpack; for a 200-ply game that's 400 syscalls and 200 unpacks. np.fromfile does 1 syscall and 0 unpacks — numpy reads straight into its own buffer with the dtype template. Net: ~100× fewer syscalls. Works cleanly with transparent gzip too (just read the whole BytesIO into memory once, then np.frombuffer).

F-14: Collapse local-fiber table build into one einsum ¶

File: python/chess_spectral/tables.py:228-241 Category: Python-throughput (build-time only) Parity impact: rebaseline-needed — summation order changes; check parity test after Effort: M

Current:

LOCAL_FIBER_3D  = np.zeros((5, BOARD_DIM, 3))
LOCAL_ADJ_ROWS  = np.zeros((5, BOARD_DIM, BOARD_DIM))
for pidx, pchar in enumerate(['N', 'B', 'R', 'Q', 'K']):
    fn = SHORT_PFNS[pchar]
    for s in range(BOARD_DIM):
        r, c = rc(s)
        A_loc = np.zeros((BOARD_DIM, BOARD_DIM))
        for tr, tc in fn(r, c):
            t = sq(tr, tc)
            A_loc[s, t] = 1; A_loc[t, s] = 1
        LOCAL_ADJ_ROWS[pidx, s] = A_loc[s]
        C_loc = EVECS_GRID.T @ A_loc @ EVECS_GRID
        C_off = C_loc.copy(); np.fill_diagonal(C_off, 0)
        LOCAL_FIBER_3D[pidx, s] = C_off[upper_idx] @ FIBER_BASIS

Suggested sketch:

Build a tensor A_all[pidx, s, :, :] up-front (dense, 5 × 64 × 64 × 64 = 6 MiB, fine for one-shot build), then:

C_all  = np.einsum('ij,pskj,kl->pskl',       # C_loc for every (pidx, s)
                   EVECS_GRID, A_all, EVECS_GRID)
# zero the diagonal in-place
rng = np.arange(BOARD_DIM)
C_all[:, :, rng, rng] = 0.0
# upper triangle → fiber basis
ui = upper_idx
LOCAL_FIBER_3D = C_all[:, :, ui[0], ui[1]] @ FIBER_BASIS
LOCAL_ADJ_ROWS = A_all[:, np.arange(BOARD_DIM), np.arange(BOARD_DIM), :]  # rows A[s, s, :] — correct?

(The exact einsum form needs careful validation; the point is "one broadcast, not 5 × 64 Python iterations".)

Why better: Build-time cost is not a hot path for spectral encode, but this particular loop is cited by the module docstring as the main _compute_tables expense (~1.5 s). More important than speed: the einsum form is the same shape as the math description, so future edits to the fiber construction are one line instead of a 14-line nested loop.

F-15: Optional `libdeflate` backend for gzip ¶

File: CMakeLists.txt:17-23, src/cs_gzip.c Category: Build/tooling, C-throughput Parity impact: none (gzip format is byte-defined) Effort: L

Suggestion: Keep miniz as the default. Add an option:

option(CS_USE_LIBDEFLATE "Use libdeflate for gzip instead of vendored miniz" OFF)
if(CS_USE_LIBDEFLATE)
    find_package(libdeflate REQUIRED)
    target_link_libraries(spectral PRIVATE libdeflate::libdeflate_static)
    target_compile_definitions(spectral PRIVATE CS_USE_LIBDEFLATE=1)
    # cs_gzip.c switches on CS_USE_LIBDEFLATE
else()
    target_link_libraries(spectral PRIVATE miniz)
endif()

Then src/cs_gzip.c branches on #ifdef CS_USE_LIBDEFLATE between libdeflate_alloc_decompressor/libdeflate_gzip_decompress and the current miniz path.

Why better: libdeflate is ~3× faster than zlib (and ~2× faster than miniz, which is tuned to match zlib) for both compress and decompress on modern hardware. For the .spectralz read path — which decompresses the whole file to a tmpfile — this is a real win on large books. Keeping miniz as the default preserves the "single-file vendor, no system dependencies" property that makes the project trivially portable. Citations: libdeflate README; [zlib-ng benchmark issue

1486](https://github.com/zlib-ng/zlib-ng/issues/1486) (shows libdeflate and¶

zlib-ng in the same performance class, both well ahead of stock zlib/miniz).

Why P2 not P1: gzip is not the hot path. cs_gzip.c is called once per .spectralz file. The encode/csv hot loop inside the file does not touch gzip code. Recommend only if you observe real user friction on large games.

P3 — hygiene¶

F-16: Lazy-load tables via PEP 562 `getattr` ¶

File: python/chess_spectral/tables.py:334 Category: Python-throughput (import time) Parity impact: none Effort: M

Current: _TABLES, _SOURCE, _CACHE_PATH = _load_or_compute(verbose=_VERBOSE) runs at import.

Suggested (PEP 562):

# Module-level lazy loader. First attribute access triggers the compute.

_tables_cache: dict | None = None

def _get_tables():
    global _tables_cache, _SOURCE, _CACHE_PATH
    if _tables_cache is None:
        _tables_cache, _SOURCE, _CACHE_PATH = _load_or_compute(verbose=_VERBOSE)
    return _tables_cache

def __getattr__(name: str):
    if name in ('EVALS_GRID', 'EVECS_GRID', 'FIBER_BASIS',
                'LOCAL_FIBER_3D', 'LOCAL_ADJ_ROWS',
                'PAWN_ANTI_FIBER', 'DIAG_DEV'):
        return _get_tables()[name]
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

Why better: Right now any tool that import chess_spectral — a docs script, a type-checker, a test discovery pass — pays the ~1.5 s compute (or the cache-load I/O) whether or not it touches the tables. PEP 562 defers the cost until the first attribute read. Cost: a little bookkeeping; benefit: import chess_spectral goes from ~1.5 s to ~10 ms on first run, ~1 ms on warm Python.

F-17: `C - np.diag(np.diag(C))` instead of copy + `fill_diagonal` ¶

File: python/chess_spectral/tables.py:216, 240 Category: Python-throughput, idiom Parity impact: none (same exact bits: subtracting np.diag(np.diag(C)) produces the same matrix as zeroing the diagonal) Effort: S

Current:

C_off = C.copy(); np.fill_diagonal(C_off, 0)

Suggested:

C_off = C - np.diag(np.diag(C))

Why better: One expression, one temporary, no in-place mutation — reads as "the off-diagonal part of C", which is exactly the math. The copy-then-mutate pattern is C-style thinking in Python clothing; the subtraction form is numpy-native and lets the expression optimizer fuse the broadcast if it ever grows an optimizer. Build-time only, so pure hygiene.

F-18: Lookup table for FEN piece char parsing ¶

File: src/cs_fen.c Category: C-throughput Parity impact: none Effort: S

Current: A switch on piece character, mapping 'P'/'p'/…/'K'/'k' to (type, color).

Suggested:

/* Indexed by uppercased ASCII; -1 marks "not a piece". */
static const int8_t FEN_TYPE[128] = {
    [0 ... 127] = -1,
    ['P'] = 0, ['N'] = 1, ['B'] = 2, ['R'] = 3, ['Q'] = 4, ['K'] = 5,
};

static inline int fen_decode(char c, int *type_out, int *color_out) {
    unsigned uc = (unsigned)c;
    int is_lower = (c >= 'a' && c <= 'z');
    char key = is_lower ? (char)(c - 32) : c;
    int t = FEN_TYPE[(unsigned)key & 0x7Fu];
    if (t < 0) return -1;
    *type_out  = t;
    *color_out = is_lower ? -1 : 1;
    return 0;
    (void)uc;
}

Why better: FEN parse is called once per ply in encode; the switch compiles to a jump table that is cache-comparable to the array lookup, so the performance delta is small. The real win is readability: the piece table becomes self-documenting data, and adding a new piece type (fairy chess, etc.) is a single-line data change rather than a switch-statement edit. P3 because FEN parse is not a hot path — the board has ≤64 chars, called once per position.

F-19: Build codegen output as a single join+write ¶

File: codegen/emit_tables.py, codegen/emit_test_vectors.py Category: Build/tooling Parity impact: none (same bytes written) Effort: S

Current: f.write(line) per line inside nested loops.

Suggested:

lines = []
for … in …:
    lines.append(f"    {value},")
with open(out_path, "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(lines) + "\n")

Why better: Codegen runs once at build. The win is almost entirely Windows-specific: f.write(line) on text-mode files does CRLF translation per call, and the Windows I/O path is notoriously per-call expensive. On Linux the delta is noise. Flag as hygiene because the codegen's not on anyone's critical path.

F-20: Minimal `.clang-tidy` + CI matrix ¶

Files: new .clang-tidy, new .github/workflows/ci.yml Category: Build/tooling Parity impact: none Effort: M

Suggested .clang-tidy (minimal):

Checks: >
  bugprone-*,
  performance-*,
  readability-implicit-bool-conversion,
  -bugprone-easily-swappable-parameters,
  -bugprone-narrowing-conversions
WarningsAsErrors: ''
HeaderFilterRegex: 'docs/chess-maths/chess-spectral/(include|src)/.*'

Suggested CI matrix:

# .github/workflows/ci.yml
strategy:
  matrix:
    os: [ubuntu-latest, macos-latest, windows-latest]
    preset: [release]
    include:
      - os: ubuntu-latest
        preset: asan
jobs:
  build-and-test:
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install numpy scipy python-chess pytest
      - run: cmake --preset ${{ matrix.preset }}
      - run: cmake --build --preset ${{ matrix.preset }}
      - run: ctest --preset ${{ matrix.preset }} --output-on-failure
      - run: pytest python/tests/   # includes the C↔Python parity test

Why better: performance-* catches the kinds of things flagged in this audit (missing restrict, string→CString copies, etc.) before they land. A tri-OS matrix catches Windows tmpfile/encoding quirks (see src/cs_gzip.c:116 — tmpfile() on Windows uses %TEMP% and has race conditions). The ASan run catches everything valgrind used to catch, with roughly 3× overhead, in a single GH Actions job. No .clang-format because the maintainer has intentionally hand-spaced code.

F-21: Move `from typing import Iterator` to the top of the module ¶

File: python/chess_spectral/csv_export.py:219 Category: Python idiom Parity impact: none Effort: S

Current:

# line 219, bottom of file
from typing import Iterator  # noqa: E402

With a # type: ignore[name-defined] comment on the Iterator usage at line 207.

Suggested:

# top of file
from __future__ import annotations
from collections.abc import Iterator, Iterable

Why better: from __future__ import annotations (PEP 563 / PEP 649) makes all annotations strings, so forward references work without the late-import shim. collections.abc.Iterator has been the preferred source since Python 3.9; typing.Iterator is an alias kept for back-compat. The shim comment calls out that this is "so from .csv_export import iter_rows works" — but nothing in that statement requires a bottom-of-module import; it was a workaround for an earlier iteration. The modernization also drops the # type: ignore comment, making the mypy run cleaner.

Considered and rejected¶

These were considered and deliberately NOT recommended. Keeping the list so future readers don't re-open the same question.

Replace double with float in the core encode path. Breaks the 1e-10 parity tolerance. The Python reference uses float64; the C side uses double and then narrows to float32 only when writing frames. Narrowing earlier in the pipeline would change the round-half-to-even outcomes for the irrep projections and the energy-sum totals. Non-starter.
Replace miniz wholesale with libdeflate. miniz is vendored as a single source file with no system dependencies — this is a feature for a research tool that people run on fresh Windows boxes. Made it an optional backend in F-15 instead.
Represent cs_position_t as sparse/dense hybrid to skip pawn scans in symmetric channels. Benchmark says the current dense scan with a fiber_index() < 0 early-continue already takes the pawn-skip fast path; the branch predictor handles it well. A bespoke sparse representation would add maintenance burden with no measurable win.
Flatten cs_position_t from AoSoA to full SoA (three parallel arrays: square[32], type[32], color[32]). The current struct is already SoA-ish (three arrays of fixed size). Reorganizing further would force a rewrite of cs_fen.c and offer no locality win — a full 32-piece board is 32×3 bytes = 96 B, which fits in two cache lines either way.
Replace the tmpfile() round-trip in cs_gzip.c:116 with an in-memory inflate buffer sized from the gzip trailer's ISIZE. Tempting — ISIZE gives you the exact decompressed size up front. Rejected because (a) ISIZE is mod-2^32, so for files >4 GiB you'd need to stream anyway, and (b) the tmpfile approach gracefully handles multi-member gzip streams if that becomes a future need. Note the Windows fragility of tmpfile() (uses %TEMP%, requires delete permissions) as a known issue, but don't fix it preemptively.
Replace the NDJSON JSON field parsing (json_str_field / json_int_field in main.c:180-212) with a real JSON parser. These use strstr per field, which is quadratic in line length. Rejected because the producer is the project's own pgn_bridge.py, which emits well-known fields in a stable order; lines are ~200 chars. A real JSON parser (cJSON, jsmn) would add a dependency and not measurably speed up the pipeline for its actual inputs. Note in a code comment, not a refactor.
Parallelize encode across plies with OpenMP. Each ply is ~5 µs of work; the OpenMP fork-join overhead is ~5 µs. Amdahl says no — the fgets / cs_fen_parse / cs_file_write_frame serial bits dominate. Keep single-threaded.
Rewrite the encoder in Rust / Zig for the safety story. Out of scope; the maintainer's research workflow lives in the C+Python codegen loop, and Rust's FFI boundary would re-introduce the parity rebuild cost that codegen was designed to eliminate.

References¶

libdeflate README — 3× zlib throughput claims
zlib-ng benchmark issue #1486 — libdeflate / zlib-ng performance class comparison
CMake C++23 best practices, 2025 — generator expressions, presets
Studyplan: ASan with CMake — sanitizer preset patterns
Organizing CMake presets — dev-debug / release / asan layout
NumPy einsum docs — vectorized contractions
opt_einsum — summation-order considerations
C17 production guide, 2026 edition — restrict, alignas, modern C idioms
PEP 562 — module __getattr__ for lazy attributes

Closing note¶

Every finding preserves the byte-for-byte parity contract with one explicit caveat: F-06, F-12, and F-14 are tagged bytes-only / rebaseline-needed. If you act on any of those three, the parity test is the gate — not the compiler, not the benchmark.

The audit is intentionally conservative about Python-side vectorization (all three Python refactors carry parity caveats) and aggressive about C hot-path tweaks (all P0 findings are provably byte-identical). That asymmetry reflects where the bits actually live: the C encoder writes float32 to disk, and the Python reference reads those same bytes back — so Python-side FP accumulation order is visible in the comparison, while C-side restrict and loop fusion are not.

chess-spectral Optimization & Idiom Audit — April 2026¶

How to read this doc¶

Severity ladder¶

Effort¶

Findings index¶

P0 — do these first¶

F-01: Fold chi-dispatch and scale pass in D4 projection ¶

F-02: Add restrict qualifiers on hot-path function args ¶

F-03: Replace per-ply memcpy(prev_enc, …) with pointer swap ¶

P1 — idiomatic improvements, low risk¶

F-04: Replace malloc(64KB) in gzip path with stack buffers ¶

F-05: Rewrite cs_roll_640 as two memcpys ¶

F-06: CSV row assembly: one fwrite per row, not 14 fprintfs ¶

F-07: Add CMakePresets.json (dev-debug / release / asan) ¶

F-08: Enable IPO/LTO for Release builds ¶

F-09: Drop hand-rolled /O2/-O2 ¶

F-10: Use generator expressions instead of if(MSVC) ¶

F-11: Hoist cs_encoding_t enc outside the per-ply loop ¶

P2 — refactors with payoff, need care¶

F-12: Vectorize Python fiber channels via einsum ¶

F-13: np.fromfile bulk read instead of per-frame loop ¶

F-14: Collapse local-fiber table build into one einsum ¶

F-15: Optional libdeflate backend for gzip ¶

1486](https://github.com/zlib-ng/zlib-ng/issues/1486) (shows libdeflate and¶

P3 — hygiene¶

F-16: Lazy-load tables via PEP 562 __getattr__ ¶

F-17: C - np.diag(np.diag(C)) instead of copy + fill_diagonal ¶

F-18: Lookup table for FEN piece char parsing ¶

F-19: Build codegen output as a single join+write ¶

F-20: Minimal .clang-tidy + CI matrix ¶

F-21: Move from typing import Iterator to the top of the module ¶

Considered and rejected¶

References¶

Closing note¶

F-02: Add `restrict` qualifiers on hot-path function args ¶

F-03: Replace per-ply `memcpy(prev_enc, …)` with pointer swap ¶

F-04: Replace `malloc(64KB)` in gzip path with stack buffers ¶

F-05: Rewrite `cs_roll_640` as two `memcpy`s ¶

F-06: CSV row assembly: one `fwrite` per row, not 14 `fprintf`s ¶

F-07: Add `CMakePresets.json` (dev-debug / release / asan) ¶

F-09: Drop hand-rolled `/O2`/`-O2` ¶

F-10: Use generator expressions instead of `if(MSVC)` ¶

F-11: Hoist `cs_encoding_t enc` outside the per-ply loop ¶

F-13: `np.fromfile` bulk read instead of per-frame loop ¶

F-15: Optional `libdeflate` backend for gzip ¶

F-16: Lazy-load tables via PEP 562 `getattr` ¶

F-17: `C - np.diag(np.diag(C))` instead of copy + `fill_diagonal` ¶

F-20: Minimal `.clang-tidy` + CI matrix ¶

F-21: Move `from typing import Iterator` to the top of the module ¶