Skip to content

πŸŒ€ SpectorQuant β€” SVASQ (Spector Vector-Aligned Scalar Quantization)

How Spector achieves INT8 precision rivaling INT12–INT16 using the Fast Walsh-Hadamard Transform. SVASQ is Spector's custom quantization technique that combines mathematical rotation with affine scalar quantization to minimize information loss. This page explains the theory, implementation, and why it outperforms standard scalar quantization.


πŸ€” The Problem with Standard Scalar Quantization

Standard INT8 quantization maps each dimension independently:

quantized[i] = round(255 Γ— (value[i] - min[i]) / (max[i] - min[i]))

This works well when all dimensions have similar variance. But real embeddings often have outlier dimensions β€” a few dimensions with much larger ranges than the rest:

Dim 0:  range [-0.05, 0.05]  β†’ 255 bins across 0.10 range β†’ precision: 0.0004
Dim 42: range [-3.50, 3.50]  β†’ 255 bins across 7.00 range β†’ precision: 0.0275

Dimension 42 has 70Γ— worse precision than dimension 0. Since distance computation sums all dimensions, these imprecise outlier dimensions dominate the quantization error β€” dragging down recall.

Note

This problem is particularly acute for transformer embeddings (BERT, GPT, etc.), which often have a few "dominant" dimensions with disproportionately large values.


πŸ’‘ The SVASQ Insight: Rotate First, Then Quantize

SVASQ solves the outlier problem with a two-step approach:

  1. Rotate the vector using a mathematical transform that spreads variance uniformly across all dimensions
  2. Quantize the rotated vector using standard INT8 β€” now every dimension has similar precision

The rotation doesn't change any distances (it's an orthogonal transform), but it dramatically improves quantization quality.

graph LR
    V["Raw Vector\n(uneven variance)"] --> FWHT["πŸŒ€ FWHT Rotation\n(spread variance)"]
    FWHT --> SQ["πŸ”’ INT8 Quantization\n(uniform precision)"]
    SQ --> Store["πŸ’Ύ Store\n(4Γ— compressed)"]

πŸ”¬ The Fast Walsh-Hadamard Transform (FWHT)

What It Does

The FWHT is an orthogonal transform (like the Fourier Transform, but using only +1 and -1 instead of complex exponentials). It multiplies each vector by a Hadamard matrix:

\[\hat{x} = H_n \cdot x\]

Where \(H_n\) is the Hadamard matrix of order \(n\) (a power of 2):

\[H_1 = [1], \quad H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad H_4 = \begin{bmatrix} H_2 & H_2 \\ H_2 & -H_2 \end{bmatrix}\]

Why It Spreads Variance

Each output dimension of the Hadamard transform is a sum or difference of all input dimensions (with alternating signs). If one input dimension has a spike, the Hadamard transform distributes that spike's energy equally across all output dimensions.

Before FWHT: One outlier dimension (dim 42) has 100Γ— the variance of others.

After FWHT: Every output dimension has roughly equal variance β€” the outlier's energy is smeared uniformly.

Why It's Fast

Unlike the FFT (which requires O(n log n) complex multiplications), the FWHT uses only additions and subtractions β€” no multiplications at all:

// In-place FWHT: O(n log n) additions, zero multiplications
for (int len = 1; len < n; len <<= 1) {
    for (int i = 0; i < n; i += len << 1) {
        for (int j = i; j < i + len; j++) {
            float u = data[j];
            float v = data[j + len];
            data[j]       = u + v;  // butterfly add
            data[j + len] = u - v;  // butterfly subtract
        }
    }
}

On a modern CPU with SIMD, this processes 128-dim vectors in under 50 nanoseconds.

Key Properties

Property Value
Complexity O(n log n) β€” only additions/subtractions
Invertible Yes β€” FWHT(FWHT(x)) = nΒ·x
Orthogonal Yes β€” preserves L2 distances: β€–Hx - Hyβ€– = β€–x - yβ€–
Real-valued Yes β€” no complex numbers (unlike FFT)
Dimension requirement Power of 2 (pad if needed)

Important

Distance preservation is the critical property. Because the Hadamard matrix is orthogonal, L2(FWHT(x), FWHT(y)) = L2(x, y). This means quantizing in the rotated space doesn't introduce any systematic bias β€” only the random quantization noise, which is now spread uniformly.


πŸ—οΈ SVASQ Pipeline

Ingestion (Encoding)

For each vector x:

graph LR
    X["x (float32)"] --> Pad["Pad to\npower-of-2"]
    Pad --> FWHT["FWHT\nrotation"]
    FWHT --> Norm["Extract\nβ€–xΜ‚β€–β‚‚ norm"]
    FWHT --> Quant["INT8\nquantize"]
    Norm --> Store["Store: [norm₃₂ | int8[D]]"]
    Quant --> Store
  1. Pad the vector to the next power of 2 (e.g., 768 β†’ 1024)
  2. Apply FWHT β€” the in-place butterfly transform
  3. Extract and store the L2 norm of the rotated vector (float32, 4 bytes)
  4. Calibrate per-dimension min/max from a representative sample
  5. Quantize each rotated dimension to INT8: q[i] = round(255 Γ— (xΜ‚[i] - min[i]) / scale[i])
  6. Store as [4-byte norm | D bytes of INT8]

Search (Asymmetric Distance Computation)

graph LR
    Q["query (float32)"] --> Pad2["Pad to\npower-of-2"]
    Pad2 --> FWHT2["FWHT\nrotation"]
    FWHT2 --> Prep["Pre-multiply:\nq̃[i] = (q̂[i] - min[i]) / scale[i]"]
    Prep --> Scan["SIMD scan:\ndot(q̃, int8_stored)"]
    Scan --> Result["Approximate L2"]

The key optimization: query pushdown. Instead of dequantizing each stored vector, we transform the query into the quantized coordinate system:

q̃[i] = (q̂[i] - min[i]) / scale[i]

Then the approximate L2 distance reduces to a simple dot product between the transformed float32 query and the stored INT8 codes β€” which SIMD can compute at billions of operations per second.


🧬 Residual SVASQ: The IVF Superpower

When SVASQ is used inside an IVF index (like SpectorIndex), vectors are quantized as residuals β€” the difference from their assigned centroid:

\[r = x - c_{\text{nearest}}\]

Why Residuals Matter

Residual vectors are much tighter than absolute vectors: - Absolute coordinates might span [-3.0, 3.0] β†’ 255 INT8 bins cover a range of 6.0 - Residual coordinates span [-0.2, 0.2] β†’ 255 INT8 bins cover a range of 0.4

That's a 15Γ— improvement in quantization precision β€” the same 8-bit integer now represents a 15Γ— smaller step size.

Tip

INT8 residual quantization gives the spatial precision of INT12–INT16 absolute quantization. This is why SpectorIndex achieves excellent recall despite using only 1 byte per dimension.

The FWHT Order

When combining FWHT with IVF residual quantization, the order of operations matters:

graph LR
    X["x"] --> C["Find nearest\ncentroid c"]
    C --> R["r = x - c\n(residual)"]
    R --> FWHT3["FWHT(r)\n(rotate residual)"]
    FWHT3 --> Q["INT8 quantize\n(rotated residual)"]

CRITICAL: Apply FWHT to the residual, not to the raw vector. Applying FWHT before centroid assignment would break the spatial clustering β€” the Hadamard transform scrambles the dimensions, making K-Means clusters meaningless.


πŸ“Š SVASQ vs Other Quantization

Technique Compression Recall@10 Speed Notes
Float32 (baseline) 1Γ— 100% ⚑ Reference
Scalar INT8 4Γ— 95-99% ⚑⚑ Simple, good baseline
SVASQ INT8 ~4Γ— 97-99.5% ⚑⚑ FWHT rotation removes outlier impact
SVASQ-4 (INT4) 6-8Γ— 95-99% ⚑⚑ Nibble-packed FWHT + 3Γ— rescore recommended
Scalar INT4 8Γ— 85-95% ⚑⚑ Aggressive, needs rescore
Product Quantization 32Γ— 80-92% ⚑ Complex, requires training

SVASQ achieves the compression of standard INT8 with recall approaching float32 β€” because the FWHT rotation ensures every dimension contributes equally to the quantized distance.


πŸ”’ SVASQ-4: INT4 Nibble-Packed Quantization

SVASQ-4 extends the SVASQ pipeline to 4-bit quantization, achieving ~2Γ— additional compression over SVASQ-8 (6–8Γ— total vs float32).

Why It Works

The FWHT rotation that makes SVASQ-8 work is equally beneficial for INT4:

  • After FWHT, all dimensions contribute equally β†’ INT4 quantization noise is isotropic
  • With IVF residuals, the tight range means INT4 on residuals β‰ˆ INT6–INT7 on absolute vectors
  • 15 quantization levels (vs 255 for INT8) is sufficient for ranking with oversampling rescore

Memory Layout

[float32 normSq (4 bytes)] [INT4 Γ— paddedDim nibble-packed (paddedDim/2 bytes)]

Two 4-bit values are packed per byte using offset encoding (shifting [-7, 7] to [0, 14]):

byte = (hiNibble << 4) | loNibble
Dims Float32 SVASQ-8 SVASQ-4 SVASQ-4 Compression
384 β†’ 512 1,536 B 516 B 260 B 5.9Γ—
768 β†’ 1024 3,072 B 1,028 B 516 B 6.0Γ—
4096 16,384 B 4,100 B 2,052 B 8.0Γ—

Calibration

SVASQ-4 uses tighter clipping than SVASQ-8 (2.5Οƒ vs 3.0Οƒ) to optimize for 15 quantization levels:

SvasqParams params = SvasqCalibrator.calibrate4bit(corpus, dimensions, seed);
// params.bitWidth() == 4
// params.bytesPerVector() == 4 + paddedDim / 2

SIMD Kernel

The Svasq4SimdKernel extracts nibbles via shift+mask in each loop iteration, providing natural instruction-level parallelism:

// Load VL packed bytes = 2Γ—VL dimensions
ByteVector packed = ByteVector.fromMemorySegment(B_SPECIES, segment, offset, nativeOrder);

// Extract high nibbles (even dims) and low nibbles (odd dims)
ByteVector hi = packed.lanewise(LSHR, 4).and(0x0F);  // β†’ [0, 14]
ByteVector lo = packed.and(0x0F);                      // β†’ [0, 14]

// Widen to float32 and FMA with deinterleaved query arrays
accHi = ((FloatVector) hi.castShape(F_SPECIES, 0)).fma(qTildeHi[i], accHi);
accLo = ((FloatVector) lo.castShape(F_SPECIES, 0)).fma(qTildeLo[i], accLo);

The hi/lo split gives the CPU two independent FMA chains β€” one for even dimensions and one for odd β€” maximizing pipeline utilization.

Usage

SpectorEngine engine = SpectorEngine.builder()
    .dimensions(768)
    .capacity(500_000)
    .svasq4()              // SVASQ-4 with default 3Γ— rescore
    .build();
SpectorConfig config = SpectorConfig.DEFAULT
    .withDimensions(768)
    .withSvasq4(5);        // 5Γ— oversampling for higher recall
QuantizedHnswIndex index = QuantizedHnswIndex.svasq4(
    768, 100_000, SimilarityFunction.COSINE, HnswParams.DEFAULT, 3);

Expected Recall

Configuration Recall@10 Notes
SVASQ-4 (no rescore) ~95–97% Direct quantized distance only
SVASQ-4 (2Γ— rescore) ~96–98% Moderate oversampling
SVASQ-4 (3Γ— rescore) ~97–99% Recommended default
SVASQ-4 (5Γ— rescore) ~98–99% Higher latency, diminishing returns
SVASQ-8 (no rescore) ~97–99.5% For comparison

πŸ’» Implementation in Spector

SvasqCalibrator

Calibrates min/max statistics per dimension from a representative sample:

// SVASQ-8 calibration
SvasqParams params8 = SvasqCalibrator.calibrate(flatData, sampleSize, dimensions);

// SVASQ-4 calibration (tighter clipping for 15 levels)
SvasqParams params4 = SvasqCalibrator.calibrate4bit(flatData, sampleSize, dimensions);

SvasqStrategy / Svasq4Strategy

Encodes vectors and computes asymmetric distances:

// SVASQ-8
SvasqStrategy strategy = new SvasqStrategy(params, SimilarityFunction.EUCLIDEAN);

// SVASQ-4
Svasq4Strategy strategy4 = new Svasq4Strategy(params4, SimilarityFunction.EUCLIDEAN);

// Both implement QuantizationStrategy β€” same API
byte[] encoded = strategy.encode(residualVector);
float dist = strategy.computeDistance(segment, offset, qs);

SvasqSimdKernel / Svasq4SimdKernel

The Panama SIMD kernel that computes SVASQ distances directly from off-heap memory:

// SVASQ-8: Zero-copy INT8 codes from MemorySegment
float l2Dist = SvasqSimdKernel.computeL2(segment, offset, paddedDim, queryState);

// SVASQ-4: Zero-copy nibble-packed INT4 codes from MemorySegment
float l2Dist4 = Svasq4SimdKernel.computeL2(segment, offset, halfPaddedDim, queryState4);

πŸ“ Mathematical Proof: Distance Preservation

For completeness, here's why FWHT preserves L2 distance.

The Hadamard matrix \(H_n\) (normalized by \(1/\sqrt{n}\)) is orthogonal: \(H^T H = I\).

For any two vectors \(x, y\):

\[\|Hx - Hy\|^2 = (Hx - Hy)^T(Hx - Hy) = (x - y)^T H^T H (x - y) = (x - y)^T(x - y) = \|x - y\|^2\]

Therefore: \(L2(Hx, Hy) = L2(x, y)\). QED.

The quantization error is now distributed uniformly across all dimensions (because FWHT spread the variance), so the expected quantization error is minimized β€” this is the optimality condition proven by Lyubarskii & Vershynin (2010) for random orthogonal transforms.


πŸ”— See Also