PCA canonical-frame voxel HDC: best of both worlds (2026-05-12)

Origin: PR B of a 3-PR sequence after PR #337/#338. The Zernike falsification (PR #338) left two questions open from PR #337's flagged follow-ups:

1. PCA canonical-frame alignment of voxel HDC
2. SO(3) bispectrum / phase-aligned SH

This PR tests (1): pre-rotate protein coordinates into the PCA principal-axis canonical frame, then apply voxel HDC. The hypothesis is that this gives both rotation invariance and high discrimination.

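Verdict: WIN. PCA canonical-frame voxel HDC achieves rotation invariance (similarity 1.0000, to machine precision, for all five rotations tested) AND better cross-protein discrimination than raw voxel HDC (−0.58 vs −0.34 on the 1UBQ vs 1BPI pair; lower cross-protein similarity means better discrimination). It is the best geometric fingerprint of the four tested.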

Reproduce: python -X utf8 docs/srmech/notes/pca_canonical_voxel_hdc_script.py. Runtime ~5s.

Method

1. Compute weighted PCA of protein Cα coordinates (weights = |Fiedler eigenvector|)
2. Sort eigenvectors by descending variance → canonical axis assignment
3. Sign-disambiguate via the third moment (skewness) along each axis (flip each axis so its skew is positive)
4. Apply the resulting rotation matrix to the coordinates → canonical frame
5. Voxelize the Fiedler eigenvector at the canonical orientation
6. Sign-only HDC encoding (per PR #333)

Fiedler-weighted PCA ensures the canonical axes reflect bipartition structure, not just overall mass distribution.
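Steps 1 to 4 of the method can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of pca_canonical_voxel_hdc_script.py: the function name `canonical_frame`, the `eps` guard, and the choice to take the per-residue weights as a ready-made input array are all hypothetical.

```python
import numpy as np

def canonical_frame(coords, weights, eps=1e-12):
    """Rotate coords (N, 3) into a weighted-PCA canonical frame.

    weights: per-point weights, e.g. |Fiedler eigenvector| (assumed given).
    Returns (canonical_coords, axes); the columns of `axes` are the axes.
    """
    coords = np.asarray(coords, dtype=float)
    w = np.abs(np.asarray(weights, dtype=float))
    w = w / (w.sum() + eps)                      # normalize weights to sum to 1

    # Weighted centroid and weighted 3x3 covariance.
    centered = coords - w @ coords
    cov = (centered * w[:, None]).T @ centered

    # eigh returns ascending eigenvalues; reorder to descending variance.
    evals, evecs = np.linalg.eigh(cov)
    axes = evecs[:, np.argsort(evals)[::-1]]

    # Sign disambiguation: flip each axis so its weighted third moment is positive.
    proj = centered @ axes                       # (N, 3) coordinates along each axis
    third_moment = (w[:, None] * proj**3).sum(axis=0)
    axes = axes * np.where(third_moment < 0, -1.0, 1.0)

    return centered @ axes, axes
```

Forcing positive skew on all three axes independently can produce an improper rotation (determinant −1, a mirror image). The note does not say whether the script corrects for this, so the sketch leaves the axes as they come out.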

Results

Rotation invariance: THE LOAD-BEARING TEST

| Rotation | PCA-canonical | Raw voxel HDC (PR #336) |
|---|---|---|
| 30° z | 1.0000 | 0.668 |
| 90° z | 1.0000 | 0.138 |
| 180° z | 1.0000 | 0.006 |
| 90° x | 1.0000 | 0.293 |
| 90° y | 1.0000 | 0.186 |

All five tested rotations: 1.0000 to machine precision. PCA canonicalization fully recovers SO(3) invariance on these cases; structures with degenerate covariance have not been tested.
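A harness for this kind of check can be sketched as below. It assumes the reported numbers are cosine similarities between the fingerprint of the original and the rotated structure, which the note does not state explicitly. `fingerprint` is a hypothetical callable standing in for whichever encoder is under test, and the helper names are illustrative.

```python
import numpy as np

def axis_rotation(axis, degrees):
    """Right-handed rotation matrix about the 'x', 'y' or 'z' axis."""
    t = np.radians(degrees)
    c, s = np.cos(t), np.sin(t)
    return {
        "x": np.array([[1, 0, 0], [0, c, -s], [0, s, c]]),
        "y": np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]]),
        "z": np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]]),
    }[axis]

def cosine(a, b):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = np.ravel(a).astype(float), np.ravel(b).astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The five rotations listed in the table above.
TEST_ROTATIONS = [("z", 30), ("z", 90), ("z", 180), ("x", 90), ("y", 90)]

def rotation_invariance_report(coords, fingerprint):
    """Similarity of fingerprint(coords) to fingerprint(rotated coords)."""
    reference = fingerprint(coords)
    return {
        f"{deg} deg {axis}": cosine(
            reference, fingerprint(coords @ axis_rotation(axis, deg).T)
        )
        for axis, deg in TEST_ROTATIONS
    }
```

Passing the same coordinates through a raw and a canonicalized encoder would give the two columns of the table, one row per rotation.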

Translation invariance

Translation 10Å: 1.0000 ✓

Thermal noise tolerance

| Noise σ | PCA-canonical | Raw voxel HDC (PR #336) |
|---|---|---|
| 0.5 Å | 0.958 | 0.973 |
| 2.0 Å | 0.881 | 0.868 |
| 5.0 Å | 0.571 | 0.716 |

Comparable to raw voxel HDC up to σ = 2.0 Å (0.958 vs 0.973 and 0.881 vs 0.868), but clearly degraded at σ = 5.0 Å (0.571 vs 0.716). Reason: noisy coordinates → noisy PCA axis estimates → slightly rotated canonical frame → different voxel indices. The canonicalization step is itself sensitive to coordinate noise.
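The mechanism can be made concrete by measuring how far the principal axes tilt when Gaussian noise is added. The sketch below is illustrative only: it uses unweighted PCA for brevity, and `toy` is a hypothetical anisotropic point cloud standing in for real Cα coordinates, so the printed angles say nothing about 1UBQ or 1BPI.

```python
import numpy as np

def principal_axes(coords):
    """Unweighted PCA axes (columns), sorted by descending variance."""
    centered = coords - coords.mean(axis=0)
    evals, evecs = np.linalg.eigh(centered.T @ centered / len(coords))
    return evecs[:, np.argsort(evals)[::-1]]

def axis_tilt_degrees(coords, sigma, rng):
    """Angle between clean and noisy principal axes, one value per axis."""
    noisy = coords + rng.normal(scale=sigma, size=coords.shape)
    clean_axes = principal_axes(coords)
    noisy_axes = principal_axes(noisy)
    # abs() ignores the arbitrary sign of each eigenvector.
    cosines = np.abs(np.sum(clean_axes * noisy_axes, axis=0)).clip(0.0, 1.0)
    return np.degrees(np.arccos(cosines))

rng = np.random.default_rng(0)
# Hypothetical stand-in for C-alpha coordinates (units: angstrom).
toy = rng.normal(size=(76, 3)) * np.array([12.0, 9.0, 7.0])
for sigma in (0.5, 2.0, 5.0):
    print(sigma, axis_tilt_degrees(toy, sigma, rng).round(2))
```

Any tilt of the frame shifts points across voxel boundaries, which is the route by which coordinate noise reaches the fingerprint twice: once directly and once through the frame.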

This is the documented tradeoff: PCA canonical frame trades thermal-noise robustness for rotation-invariance + improved cross-protein discrimination.

Cross-protein discrimination: THE SURPRISE WIN

| Fingerprint | 1UBQ vs 1BPI |
|---|---|
| Raw voxel HDC (PR #333/#336) | −0.343 |
| SH power spectrum (PR #337) | +0.929 |
| 3D Zernike (PR #338) | +0.99999 |
| PCA-canonical voxel HDC (this PR) | −0.582 |

PCA canonicalization IMPROVES cross-protein discrimination vs raw voxel HDC on this pair (−0.58 vs −0.34). Why: PCA aligns each protein's principal axes to a canonical orientation, so what is compared is intrinsic shape, not orientation noise. Two different proteins look MORE different in the canonical frame because the comparison sees the actual shape difference instead of an arbitrary embedding-orientation artifact.

The updated geometric-fingerprint landscape

| Fingerprint | Rotation invariance | Cross-protein discrim | Thermal tolerance (σ = 0.5 / 2.0 / 5.0 Å) | Catalog status |
|---|---|---|---|---|
| Raw voxel HDC | BROKEN (0.14 @ 90°) | −0.34 | 0.97 / 0.87 / 0.72 | ACTIVE |
| SH per-shell power | invariant (1.000) | +0.93 (poor) | (similar) | ACTIVE |
| 3D Zernike (n_max=10) | invariant (1.000) | +0.99999 (collapsed) | n/a | REJECTED |
| PCA-canonical voxel HDC | invariant (1.000) | −0.58 (BEST) | 0.96 / 0.88 / 0.57 | ACTIVE-PRIMARY |

PCA canonical voxel HDC is now the primary geometric fingerprint. On the rotation × discrimination axes it is at least as good as every alternative and better on at least one axis:

vs Raw voxel HDC: PCA-canonical wins on rotation and on discrimination (−0.58 vs −0.34)
vs SH per-shell: PCA-canonical wins on discrimination (−0.58 vs +0.93); ties on rotation
vs 3D Zernike: PCA-canonical wins on discrimination by about 1.6 units (−0.58 vs +0.99999); ties on rotation

Use case decision tree (refined):

- If you need SO(3) invariance AND maximum discrimination → PCA-canonical voxel HDC
- If you need SO(3) invariance AND thermal-noise robustness matters above all → SH per-shell (less affected by coordinate noise since it integrates over angular regions)
- If orientation is fixed/known → raw voxel HDC (still highest information content)

Why this works: the load-bearing insight

The voxel grid breaks SO(3) by quantizing axes to its cubic O_h symmetry (PR #336 finding). PCA canonicalization rotates the data into the grid's symmetry frame — solving the embedding-substrate symmetry mismatch from the data side rather than the substrate side.

This is the right architectural move: when the substrate has a discrete symmetry (cubic lattice) and the data has a continuous symmetry (SO(3)), canonicalize the data to the substrate's preferred orientation. Voxel HDC then operates in its natural setting.

The sign disambiguation via skewness handles the residual 2³ = 8-fold ambiguity (each eigenvector × ±1) by picking the convention that the third moment is positive along each axis. This is empirically robust for proteins (skewness usually isn't near zero). There are two distinct failure modes: the sign choice becomes unstable when the skewness along an axis is near zero, and the axes themselves become ill-defined on rotationally symmetric structures (where the PCA covariance has degenerate eigenvalues).
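The invariance argument can be written out in a few lines. The derivation below is a sketch under two assumptions that the note implies but does not state: the weights w_i are normalized to sum to 1, and they are unchanged by a rigid motion (plausible if the Fiedler eigenvector is computed from a graph built on internal distances). R is a rotation, t a translation, C the weighted covariance with distinct eigenvalues λ_k, and m_k the weighted third moment along axis k.

```latex
\begin{align}
x_i' &= R\,x_i + t, \qquad w_i' = w_i \\
\mu' &= \sum_i w_i\, x_i' = R\,\mu + t \\
C' &= \sum_i w_i\,(x_i' - \mu')(x_i' - \mu')^{\top} = R\,C\,R^{\top} \\
C\,v_k = \lambda_k v_k \;&\Longrightarrow\; C'(R\,v_k) = \lambda_k\,(R\,v_k)
  \;\Longrightarrow\; v_k' = s_k\,R\,v_k, \quad s_k \in \{+1, -1\} \\
y_{ik}' &= (v_k')^{\top}(x_i' - \mu') = s_k\, v_k^{\top} R^{\top} R\,(x_i - \mu) = s_k\, y_{ik} \\
m_k' &= \sum_i w_i\,(y_{ik}')^{3} = s_k^{3}\, m_k = s_k\, m_k
\end{align}
```

Requiring m_k > 0 in both frames forces s_k = +1 whenever m_k ≠ 0, so the canonical coordinates y_ik agree exactly. The argument breaks in the two places named above: repeated eigenvalues (v_k is no longer unique up to sign) and m_k ≈ 0 (the sign rule has nothing to grip).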

What this confirms across the question-tree arc

This spike retrospectively validates:

PR #336: the cubic-lattice symmetry-breaking diagnosis was correct. Fix it by data-side canonicalization, not substrate-side.
PR #337: SH power spectrum was a working alternative but not optimal — it discards too much information.
PR #338: 3D Zernike polynomial compression is the wrong direction — it discards even more.
The geometric tier of the three-tier hierarchy is now stable: voxel HDC in the canonical frame is the primary primitive; SH per-shell is the secondary primitive for noise-robust applications.

Honest caveats

PCA stability under rotation: my PCA + skewness disambiguation is robust to the rotations tested (1.0 at all 5) but could fail under degenerate-covariance proteins (rotationally symmetric structures like trimeric viral capsids). Worth testing in future.
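Sign disambiguation tie-breaking: skewness can be small for some bipartitions. Real production code would have a fallback (e.g., enforce positive first non-zero coordinate per axis). Note that a coordinate-based fallback like this depends on the input frame, so it keeps the output deterministic but not rotation-invariant.
Sensitivity to coordinate noise: at thermal σ = 5 Å the similarity drops to 0.57 vs 0.72 for raw voxel HDC. For thermally-noisy applications, SH power is more robust.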
Sign-only encoding is still being used. The PCA canonicalization is orthogonal to the encoding step.
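The first two caveats can be turned into a runtime check that flags inputs where the canonical frame is likely to be unstable. This is a hedged sketch rather than part of the script: the function name `canonicalization_diagnostics` and the thresholds `gap_tol` and `skew_tol` are hypothetical and untuned.

```python
import numpy as np

def canonicalization_diagnostics(coords, weights, gap_tol=0.05, skew_tol=0.05):
    """Flag inputs where a PCA + skewness canonical frame is likely unstable.

    gap_tol and skew_tol are illustrative thresholds, not tuned values.
    """
    coords = np.asarray(coords, dtype=float)
    w = np.abs(np.asarray(weights, dtype=float))
    w = w / w.sum()
    centered = coords - w @ coords
    cov = (centered * w[:, None]).T @ centered

    evals, evecs = np.linalg.eigh(cov)
    evals, evecs = evals[::-1], evecs[:, ::-1]   # descending variance

    # Relative gap between neighbouring eigenvalues: near 0 means the two
    # axes can rotate freely within the plane they span.
    rel_gaps = (evals[:-1] - evals[1:]) / max(evals[0], 1e-12)

    # Standardized weighted skewness along each axis: near 0 means the
    # sign choice is decided by noise.
    proj = centered @ evecs
    skew = (w[:, None] * proj**3).sum(axis=0) / np.maximum(evals, 1e-12) ** 1.5

    return {
        "rel_gaps": rel_gaps,
        "skewness": skew,
        "axes_degenerate": bool(np.any(rel_gaps < gap_tol)),
        "sign_ambiguous": bool(np.any(np.abs(skew) < skew_tol)),
    }
```

A caller could route flagged structures to a rotation-invariant fallback such as SH per-shell power instead of trusting the canonical frame.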

Open follow-ups (now refined)

Test rotationally-symmetric proteins (trimers, viral capsids) for PCA stability under degenerate covariance
Stochastic-noise-robust canonicalization (median orientation across noisy realizations)
PCA on the voxel field directly (not coordinates) — equivalent in theory; may be more numerically robust
PR C of this 3-PR sequence: dynamic Laplacian revisit on SH substrate

Files

pca_canonical_voxel_hdc_script.py — reproducible (~5s)
pca-canonical-voxel-hdc-per-test-2026-05-12.ndjson — per-test results
pca-canonical-voxel-hdc-comparison-2026-05-12.png — four-fingerprint bar chart (Voxel HDC raw / SH power / 3D Zernike / PCA-canonical)
pca-canonical-voxel-hdc-2026-05-12.md — findings

Citations

Principal Component Analysis (Pearson 1901; Hotelling 1933) — standard
Skewness-based PCA sign disambiguation: occasionally used in computer vision / chemometrics
Fiedler-weighted PCA: applies the standard PCA to a weighted moment matrix