Tournament baseline at depth 4: first empirical recording
Date: 2026-05-09
Run config: python tests/run_evaluator_tournament.py --depth 4 --games-per-pair 2 --max-plies 150 --output tests/bench_baselines/tournament_d4.json --quiet
Wall clock: 2 246 s (37 min)
Variants: material, spectral_float64, spectral_hybrid_8bit_lru
Closes the §16.7 deferred item (partially)
The 1.16.0 ship added the tournament runner infrastructure (tests/run_evaluator_tournament.py); the §16.7 amendment de-gated B-spike-3 (search-engine evaluator) from the contaminated Othello prior. The open empirical question:
Does spectral_hybrid_8bit_lru actually beat material at deep search?
This baseline is the first real recorded answer. It's directional, not statistically locked.
Final ELO ranking
| Variant | Final ELO |
|---|---|
| material | 1530.6 |
| spectral_float64 | 1485.6 |
| spectral_hybrid_8bit_lru | 1483.8 |
Material wins by ~45-47 ELO over both spectral variants. The two spectral variants are statistically indistinguishable at this depth + game count.
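To give a feel for how few results sit behind these ratings, here is a minimal sketch of a sequential Elo update. The starting rating of 1500 and K-factor of 32 are assumptions (the runner's actual values and game ordering are not stated here), so this is not expected to reproduce the table exactly.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Logistic Elo expectation for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update(ratings: dict, a: str, b: str, score_a: float, k: float = 32.0) -> None:
    """Apply one game in place; score_a is 1.0, 0.5, or 0.0 from A's side."""
    delta = k * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta  # zero-sum: the rating total is conserved


# Assumed start rating and K-factor; the runner's real values may differ.
ratings = {
    "material": 1500.0,
    "spectral_float64": 1500.0,
    "spectral_hybrid_8bit_lru": 1500.0,
}

# One decisive game between equally rated players moves each side by K/2.
update(ratings, "material", "spectral_float64", 1.0)
print(ratings["material"], ratings["spectral_float64"])
```

Under these assumptions a single decisive game shifts each rating by 16 points, so a gap of ~45 points corresponds to only a couple of net results.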
Pair records
| Pair (A vs B, one game per color) | A wins | A losses | Draws |
|---|---|---|---|
| material vs spectral_float64 | 1 | 0 | 1 |
| material vs spectral_hybrid_8bit_lru | 1 | 0 | 1 |
| spectral_float64 vs spectral_hybrid_8bit_lru | 1 | 1 | 0 |
Material won both of its black-side games by checkmate and drew both of its white-side games at the ply cap. The two spectral variants split their pair, each winning as white.
Termination histogram
| Game | Termination |
|---|---|
| material (W) vs spectral_float64 (B) | max_plies (drew) |
| spectral_float64 (W) vs material (B) | checkmate (material won as black) |
| material (W) vs spectral_hybrid_8bit_lru (B) | max_plies (drew) |
| spectral_hybrid_8bit_lru (W) vs material (B) | checkmate (material won as black) |
| spectral_float64 (W) vs spectral_hybrid_8bit_lru (B) | checkmate (spectral_float64 won) |
| spectral_hybrid_8bit_lru (W) vs spectral_float64 (B) | checkmate (spectral_hybrid_8bit_lru won) |
Mean plies per game: 61. Four of the six games ended in checkmate; the other two hit the max_plies cap and were scored as draws. The terminations are not pathological.
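The per-game table above is enough to re-derive the pair records and the termination counts. The sketch below transcribes the six games by hand; the tuple layout and helper names are illustrative and not the schema of tournament_d4.json.

```python
from collections import Counter

# (white, black, termination, winner), transcribed from the table above.
# winner is None for games that hit the ply cap and were scored as draws.
GAMES = [
    ("material", "spectral_float64", "max_plies", None),
    ("spectral_float64", "material", "checkmate", "material"),
    ("material", "spectral_hybrid_8bit_lru", "max_plies", None),
    ("spectral_hybrid_8bit_lru", "material", "checkmate", "material"),
    ("spectral_float64", "spectral_hybrid_8bit_lru", "checkmate", "spectral_float64"),
    ("spectral_hybrid_8bit_lru", "spectral_float64", "checkmate", "spectral_hybrid_8bit_lru"),
]

terminations = Counter(term for _, _, term, _ in GAMES)


def record(variant):
    """Return (wins, losses, draws) for one variant across all its games."""
    played = [g for g in GAMES if variant in (g[0], g[1])]
    wins = sum(1 for g in played if g[3] == variant)
    draws = sum(1 for g in played if g[3] is None)
    return wins, len(played) - wins - draws, draws


def wins_by_color(variant):
    """Return (wins as white, wins as black) for one variant."""
    as_white = sum(1 for w, _, _, winner in GAMES if winner == variant and w == variant)
    as_black = sum(1 for _, b, _, winner in GAMES if winner == variant and b == variant)
    return as_white, as_black


print(dict(terminations))         # {'max_plies': 2, 'checkmate': 4}
print(record("material"))         # (2, 0, 2)
print(wins_by_color("material"))  # (0, 2)
```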
Honest interpretation
Statistical caveat: 2 games per pair is NOT statistically meaningful. At this game count, single coin-flips dominate the ELO numbers; a 47-point gap could easily flip with another 4 games. The usual rule of thumb is ≥ 100 games per pair for roughly ±30 ELO, and that figure is about one standard error rather than a 95% interval.
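A rough way to see where the game-count rule of thumb comes from is to propagate the sampling error of the match score through the Elo curve. The sketch below assumes independent games, two evenly matched players, and the standard logistic Elo model. It uses a linear approximation around a 50% score, which is crude for very small samples but shows the order of magnitude.

```python
import math


def elo_standard_error(games: int, draw_rate: float = 0.0) -> float:
    """Approximate 1-sigma Elo error for two evenly matched players.

    Assumes independent games, an expected score of 0.5, and the logistic
    Elo curve, linearized around 0.5.
    """
    # Per-game variance of the score (1, 0.5, 0) when wins and losses are
    # equally likely and a fraction draw_rate of games are drawn.
    variance = 0.25 * (1.0 - draw_rate)
    score_se = math.sqrt(variance / games)
    # Slope of the Elo difference with respect to the score at s = 0.5.
    elo_per_score = 400.0 / (math.log(10) * 0.25)
    return score_se * elo_per_score


for n in (2, 100, 400):
    print(n, round(elo_standard_error(n), 1))
```

With no draws this gives about 35 ELO at 100 games per pair and well over 200 ELO at 2 games per pair, which is why the rankings here are only directional.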
Directional finding: at depth 4, material (the trivial integer-addition baseline) outperforms both spectral evaluator variants by ~45 ELO each. The two spectral variants are within noise of each other.
Why might this be? Three hypotheses, ordered by plausibility:
- Spectral evaluators are tuned for static eval, not search. The 1.13.0 spectral_hybrid_8bit_lru was benched at ~50 µs/call vs spectral_float64's ~870 µs (~17× speedup). But static-eval throughput is not search-tree throughput. The depth-4 search-tree bench (bench_search_tree.py, tournament_d4.json from 1.16.0) showed spectral_hybrid_8bit_lru slightly slower than spectral_float64 in nodes/sec, since TT re-visits (and therefore cache hits) are rare at low depth. This tournament is consistent with that picture: at depth 4, the spectral evaluator family shows no edge over material, plausibly because the search isn't deep enough to reward fine-grained position evaluation.
- Material is a strong baseline for chess. Pure integer piece-counting goes most of the way in casual chess; deviations from it (positional understanding, structural awareness) only matter when search depth is too shallow to find tactics. At depth 4 + quiescence, tactics dominate; both sides find them; material counts decide.
- Spectral evaluator weights are uniform. The spectral.evaluate and spectral_hybrid functions sum channel energies with default uniform weights (1.0 across all 10/11 channels). No tuning. Material, by contrast, uses canonical chess piece values (P=1, N=3, B=3, R=5, Q=9, K=∞). The "untuned vs canonical" gap is real and would be addressed by a tuning sweep, which is out of scope here.
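To make the contrast in the third hypothesis concrete, here is a minimal sketch of the two scoring styles. Both functions are hypothetical illustrations: material_eval counts pieces from a FEN string using the piece values quoted above, and weighted_energy is a stand-in for a uniform-weight channel sum. Neither is the project's actual material or spectral.evaluate implementation.

```python
# Piece values from the text. The king is omitted because both sides always
# have exactly one, so its (infinite) value cancels out of the difference.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}


def material_eval(fen: str) -> int:
    """Material balance from White's point of view, in pawn units."""
    board = fen.split()[0]  # first FEN field is the piece placement
    score = 0
    for ch in board:
        value = PIECE_VALUES.get(ch.lower())
        if value is None:
            continue  # digits (empty squares), '/' separators, and kings
        score += value if ch.isupper() else -value
    return score


def weighted_energy(channel_energies, weights=None):
    """Weighted sum of channel energies; uniform 1.0 weights by default."""
    if weights is None:
        weights = [1.0] * len(channel_energies)
    return sum(w * e for w, e in zip(weights, channel_energies))


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
NO_BLACK_QUEEN = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(material_eval(START))           # 0
print(material_eval(NO_BLACK_QUEEN))  # 9
```

The material weights encode long-established relative piece values, whereas the uniform default treats every channel as equally informative until a tuning pass says otherwise.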
What this baseline does NOT close
- Higher-depth question still open: at depth 5-7, TT re-visits become more common, and the spectral_hybrid cache-hit advantage might surface. This baseline tested only depth 4.
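- Game count too small for statistical confidence: a 47-ELO gap at 2 games per pair means very little. Real lock-in needs 100+ games per pair. Extrapolating from this run (2 246 s for 6 games, ~374 s per game), 300 games at depth 4 is ~31 hours sequential, so the 5+ hour wall-clock figure is reachable only with parallel workers; depth 5-6 was estimated at 1-3 days.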
- Untuned spectral weights: the spectral evaluator family ships with uniform channel weights. A tuning sweep (e.g., logistic regression on (channel_energies → win_probability) over a labeled corpus) would likely close the gap somewhat — but that's a separate ship.
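As an illustration of what such a tuning sweep could look like, the sketch below fits per-channel weights with plain logistic regression, starting from uniform weights. Everything in it is an assumption: the features are hypothetical per-channel energy differences, the corpus is synthetic (channel 0 drives the outcome, channel 1 is pure noise), and there is no bias term. It only shows the mechanics, not a result about the real evaluators.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_channel_weights(samples, labels, weights, lr=0.5, epochs=300):
    """Batch gradient descent on mean log-loss; returns the tuned weights."""
    weights = list(weights)
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                grad[j] += (p - y) * xi / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Synthetic stand-in for a labeled corpus (NOT real tournament data).
rng = random.Random(0)
samples = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(400)]
labels = [1.0 if rng.random() < sigmoid(2.0 * x[0]) else 0.0 for x in samples]

# Start from uniform weights, as the untuned evaluators do.
tuned = fit_channel_weights(samples, labels, weights=[1.0, 1.0])
print(tuned)
```

On this toy corpus the fit moves weight toward the informative channel and away from the noise channel, which is the kind of reweighting a real sweep would be looking for.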
Recommended next steps
- Run depth 5 with 4-8 games per pair when compute budget allows. Estimated at ~24-30 hours if run sequentially; the games are independent, so parallelizing them could bring wall-clock into the 2-5 hour range.
- Run depth 4 with ≥ 100 games per pair for a statistically locked ±30 ELO baseline. ~30+ hour sequential estimate; small enough to be a CI artifact.
- Investigate spectral channel weight tuning before running larger sweeps — uniform weights leave performance on the table.
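The sequential estimates above can be sanity-checked by scaling this run's wall clock linearly with game count. In the sketch below, the slowdown parameter for deeper search is a hypothetical knob; no depth-5 timing has been measured here.

```python
BASELINE_SECONDS = 2246   # wall clock of this depth-4 run
BASELINE_GAMES = 3 * 2    # 3 pairs x 2 games per pair


def estimate_hours(games_per_pair: int, pairs: int = 3, slowdown: float = 1.0) -> float:
    """Linear extrapolation of sequential wall clock, in hours.

    slowdown is an assumed per-game cost multiplier relative to depth 4.
    """
    seconds_per_game = BASELINE_SECONDS / BASELINE_GAMES
    return games_per_pair * pairs * seconds_per_game * slowdown / 3600.0


print(round(estimate_hours(100), 1))  # 31.2
```

This assumes games at a given depth take about the same time on average, which two-game pairs cannot really confirm.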
Provenance
- This baseline was originally dispatched via subagent SA1 with target depth=5 / 2 games per pair. SA1 silently stalled at the 30-minute mark with an empty output file (the tournament process exited but the agent didn't write its summary). Parent agent took over directly: scaled back to depth=4 to get a finishable baseline within available time, ran the bench, wrote this summary.
- The SA1 stall is a documented workflow constraint: subagents dispatched with long-running background processes (>20 min) can lose track of their own state. Mitigation for future RBS-deferred dispatches: provide an explicit fallback budget in the subagent prompt and a watchdog timeout.
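One way to implement the watchdog part of that mitigation is to run the tournament under a hard time budget, so that a stall is reported instead of passing silently. This is a minimal sketch using the standard library; run_with_watchdog is a hypothetical helper, not part of the existing runner.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s):
    """Run cmd; return its exit code, or None if the time budget was exceeded."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising,
        # so no orphaned process is left behind.
        return None
    return completed.returncode


# Demo with a short-lived child. For the real tournament, cmd would be the
# run config at the top of this page, with a budget such as 45 * 60 seconds.
status = run_with_watchdog([sys.executable, "-c", "print('done')"], timeout_s=30)
print("timed out" if status is None else f"exit code {status}")
```

A caller can then treat a None result as the signal to fall back, for example by dropping to a shallower depth as was done by hand here.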