Bit-packed HDC ALU — first run¶
Status: Research finding (this branch).
Date: 2026-05-04
Code:
- research/bit_alu.py — bit-packed binary HDC primitives
- research/bit_alu_benchmark.py — perf comparison vs the existing complex128 reference encoder
- research/cycle_alignment_investigation.py — solar vs sidereal day framing for permute = sigma_day
- Public facade: antikythera_spectral.bit_alu
TL;DR¶
- Memory is a 125× win. A D=940 hypervector takes 120 B as bit-packed
uint64[15]vs 15 KB ascomplex128[940]. Same algebraic structure (cyclic-group representation of the Antikythera dials), 1/125 of the memory. bindis a 2-10× speedup, growing with D. XOR over a uint64 array is much faster than complex-multiply over a complex128 array.permuteandsimilarityare slower in this first-run implementation —permuteuses Python big-int rotate (good correctness, poor scaling at D=13440);similarityuses Pythonint.bit_count()in a loop. Both have known optimisation paths (numpy bit-pack/unpack for permute;np.bitwise_countfrom numpy ≥ 2.0 for similarity) that would push them to native speed; not done in this PR.- The "ALU only" semantics are met. Every primitive (bind, bundle, permute, hamming, similarity) reduces to integer bit-operations: XOR, popcount, shift. No floating-point, no complex arithmetic, no matrix multiplies. The memory + bind wins are immediate; the permute / similarity wins are an optimisation away.
- Cycle alignment of
sigma_day⇔ one Julian day = one mean solar day. Decisive: the existing encoder keyscycle_period_daysin Julian days; the bit-ALU'spermute(v, 1, D)advance therefore corresponds to one Julian / mean-solar day. Sidereal-day framing would shift the Callippic residue by 1 over a full 940-tick rotation (drift = 2.57 days). Solar/Julian is correct.
What we built¶
research/bit_alu.py — primitives¶
| Operation | Bit ALU | Reference (complex128) |
|---|---|---|
random_hv(D, rng) |
uniform Bernoulli over uint64[n_words(D)] |
unit-norm complex Gaussian over complex128[D] |
bind(a, b) |
a ^ b (XOR) |
a * b (complex multiply) |
bundle(stack) |
bitwise majority threshold | sum + L2-normalise |
permute(v, n, D) |
cyclic bit-rotation by n |
np.roll(v, n) |
hamming_distance(a, b) |
popcount of XOR | n/a |
similarity(a, b, D) |
1 − 2H/D (bipolar-equivalent in [-1, 1]) |
\|⟨a, b⟩\| |
Plus an encoder analogue:
encode_ant_bit(jd, D)— same algebraic structure as_encode_denseinencode_ant.py(sum of permuted bases per dial), butnp.roll → permute,+ → bundle_majority, complex Gaussian basis → binary Bernoulli basis.decode_dial_bit(state, D, dial_idx, modulus)— argmax of similarity over candidate residues.
research/cycle_alignment_investigation.py — solar vs sidereal¶
Resolution of the user's question: "should one rotation of the key be a complete terrestrial or solar day?"
- Mean solar day: 86 400 SI s. The Julian Date increment. The Antikythera's hand-crank tick.
- Sidereal day: 86 164.09 SI s. Earth's rotation in the inertial frame.
- Ratio: 1.002 738 sidereal/solar, drift 235.9 s/day, drift over 940 days = 2.57 mean-solar days.
Empirical run at D = 940, n_ticks = 940:
| Framing | Span | Callippic residue (actual / expected) | Metonic residue (actual / expected) |
|---|---|---|---|
| Solar / Julian | 940 JD | 127 / 127 ✓ | 31 / 31 ✓ |
| Sidereal | 937.43 JD | 126 / 127 ⨯ | 31 / 31 ✓ |
The Callippic dial drifts by 1 residue position over a full bit-rotation cycle under sidereal framing. Metonic happens to be 31 in both because its modulus is small enough that the 0.273 % drift doesn't cross a residue boundary.
Conclusion: the bit-ALU's permute(v, 1, D) ⇔ one Julian / mean-solar day. The encoder's cycle_period_days are documented in Julian days (see encode_ant.DialSpec.residue and astronomical_cycles.Cycle.mechanism_days); the Julian Date is by definition tied to the mean solar day. The mechanism's hand-crank turns once per mean-solar day. All three pieces (encoder spec, time-scale convention, mechanism semantics) align on solar/Julian.
Sidereal-day framing IS meaningful for sidereal-frame DIALS (zodiac longitude in the inertial sky, planetary sidereal periods), but those are already expressed in Julian days inside encode_ant.py; one would convert at decode time if a sidereal interpretation of the residue were wanted.
Benchmark results¶
python -m research.bit_alu_benchmark --reps 5000 --encode-reps 100
Hardware: contemporary x86-64, single-threaded, NumPy 2.4.4. Results in microseconds per operation.
Memory footprint¶
| D | complex128 HV | bit_alu HV | Ratio |
|---|---|---|---|
| 940 (Callippic) | 15 040 B | 120 B | 125× |
| 13 440 (packing) | 215 040 B | 1 680 B | 128× |
Operation timings¶
| op | D | reps | cx128 µs/op | bit µs/op | speedup |
|---|---|---|---|---|---|
bind |
940 | 5 000 | 5.314 | 2.239 | 2.37× |
bundle (M=8) |
940 | 1 000 | 86.513 | 96.905 | 0.89× |
permute (n=1) |
940 | 1 000 | 36.928 | 53.571 | 0.69× |
similarity |
940 | 5 000 | 5.534 | 16.558 | 0.33× |
encode_full |
940 | 100 | 497.681 | 578.756 | 0.86× |
bind |
13 440 | 5 000 | 40.257 | 4.068 | 9.90× |
bundle (M=8) |
13 440 | 1 000 | 925.073 | 389.967 | 2.37× |
permute (n=1) |
13 440 | 1 000 | 62.325 | 1 143.138 | 0.05× |
similarity |
13 440 | 5 000 | 78.314 | 87.771 | 0.89× |
encode_full |
13 440 | 100 | 2 134.221 | 14 985.298 | 0.14× |
Reading the table¶
bindscales with D, decisively. 2.37× at D=940 → 9.90× at D=13 440. XOR overuint64[n]is much cheaper than complex-multiply overcomplex128[64n]. This is the operation HDC algorithms call most often.bundleis competitive at small D and 2.37× faster at large D. The bit-ALU bundle usesnp.unpackbits+sum+ threshold — vectorisable and cache-friendly at scale. At D=940 the unpack-pack overhead competes with numpy's native sum + L2-normalise; at D=13 440 the bit-ALU pulls ahead.permuteis the weak point. This first-run implementation packs to a Python big-int, rotates within the D-bit window with<</>>, and unpacks. Big-int operations are ~O(D) but constant-factor expensive at D=13 440 (1.1 ms vs 0.06 ms fornp.roll). Numpy bit-pack / unpack withnp.rollwould close most of this gap; vectorised cross-word shifts on uint64 arrays would close the rest. Tracked as a follow-up; see "What's next" below.similarityat D=940 is the second weak point. Python'sint.bit_count()in a sum-loop pays a per-word overhead.np.bitwise_count(numpy ≥ 2.0) would vectorise this to native speed. Same follow-up aspermute.- The full encoder is currently slower because
permuteis hot in the encode path (one rotate per dial).
What an actual ALU implementation would look like¶
The Python-level numbers above are bounded by interpreter overhead. At the hardware level — which is the natural framing for "ALU only":
bind=XORinstruction on a packed register (1 cycle for the whole hypervector if D ≤ register width; constant cycles per cache line for larger D).hamming_distance=XOR+POPCNTinstructions (POPCNT is 3-5 cycle latency on modern x86; ~0.3 µs for D=940 vectors using AVX2 + popcount). NEON has equivalent.permute= combined shift + OR across cache lines. Hardware-friendly; an FPGA could do a 940-bit cyclic shift in O(log D) gates.bundle (majority)= column-wise popcount + threshold. AVX-512 has_mm512_popcnt_epi64; SVE2 hascnt. Bit-serial-style, but parallel across the D axis.
A hand-tuned C / NEON / AVX-512 implementation of the bit ALU would close the remaining gaps to native speed and likely turn the 0.05–0.89× rows into 5–50× wins. That's a v0.4.x scope; for v0.3.0 the "first run" is to establish that the algebraic substrate is sound (this PR) and measure the pure-Python baseline.
What's next (deferred to v0.4.x)¶
- Numpy-fast
permute— replace Python big-int rotate withnp.unpackbits+np.roll+np.packbits, or with vectorised cross-word<</>>onuint64[]. Estimated 10-50× speedup at D=13 440. np.bitwise_countforsimilarity— drop the Python-levelint.bit_count()loop. Estimated 5-20× speedup at all D.- Encoder fast-path — once
permuteis fast, the full bit-ALU encoder should beat the complex128 encoder by the bind ratio (2-10×) at all D. - Bit-ALU H-battery rows — does the bit-ALU encoder pass the same H-battery checks as the complex128 encoder (round-trip, σ_day unit-operator, etc.)? Almost certainly yes (same algebraic structure), but worth pinning.
- Hardware-style rotation operators — the Antikythera's gear ratios are themselves a kind of rotation. There's a research thread on whether the
bit_alu.permute(v, k)for non-trivialkcorresponds to gear-ratio-style multi-day advance, and whether that's a faithful representation of the mechanism's gear DAG. Out of scope for this PR.
Cross-references¶
- Notebook §6 (E-H2 / E-H3 / E-H4) — the H-battery rows the bit-ALU should also pass once exercised.
- Notebook §3 (cyclic-group algebra B-H1 / B-H2) — the algebraic substrate the bit-ALU implements verbatim, just in BSC (Binary Spatter Code) instead of FHRR.
- ADR 0012 — the "algebra/eigenbasis modelling, not CAD/fabrication" scope discipline. The bit-ALU is the cleanest possible implementation OF that discipline: every operation is bitwise / phase-space algebra.
Sources¶
- Kanerva (2009). "Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors." Cognitive Computation 1:139.
- Plate (2003). Holographic Reduced Representation: Distributed Representation for Cognitive Structures. CSLI Publications.
- Schlegel, Neubert & Protzel (2022). "A Comparison of Vector Symbolic Architectures." Artif. Intell. Rev. 55:4523. (Survey; this module implements the BSC variant.)
- IAU 2009 / SOFA constants — sidereal day length, mean solar day length.