π SpectorQuant β SVASQ (Spector Vector-Aligned Scalar Quantization)¶
How Spector achieves INT8 precision rivaling INT12βINT16 using the Fast Walsh-Hadamard Transform. SVASQ is Spector's custom quantization technique that combines mathematical rotation with affine scalar quantization to minimize information loss. This page explains the theory, implementation, and why it outperforms standard scalar quantization.
π€ The Problem with Standard Scalar Quantization¶
Standard INT8 quantization maps each dimension independently:
This works well when all dimensions have similar variance. But real embeddings often have outlier dimensions β a few dimensions with much larger ranges than the rest:
Dim 0: range [-0.05, 0.05] β 255 bins across 0.10 range β precision: 0.0004
Dim 42: range [-3.50, 3.50] β 255 bins across 7.00 range β precision: 0.0275
Dimension 42 has 70Γ worse precision than dimension 0. Since distance computation sums all dimensions, these imprecise outlier dimensions dominate the quantization error β dragging down recall.
Note
This problem is particularly acute for transformer embeddings (BERT, GPT, etc.), which often have a few "dominant" dimensions with disproportionately large values.
π‘ The SVASQ Insight: Rotate First, Then Quantize¶
SVASQ solves the outlier problem with a two-step approach:
- Rotate the vector using a mathematical transform that spreads variance uniformly across all dimensions
- Quantize the rotated vector using standard INT8 β now every dimension has similar precision
The rotation doesn't change any distances (it's an orthogonal transform), but it dramatically improves quantization quality.
graph LR
V["Raw Vector\n(uneven variance)"] --> FWHT["π FWHT Rotation\n(spread variance)"]
FWHT --> SQ["π’ INT8 Quantization\n(uniform precision)"]
SQ --> Store["πΎ Store\n(4Γ compressed)"]
π¬ The Fast Walsh-Hadamard Transform (FWHT)¶
What It Does¶
The FWHT is an orthogonal transform (like the Fourier Transform, but using only +1 and -1 instead of complex exponentials). It multiplies each vector by a Hadamard matrix:
Where \(H_n\) is the Hadamard matrix of order \(n\) (a power of 2):
Why It Spreads Variance¶
Each output dimension of the Hadamard transform is a sum or difference of all input dimensions (with alternating signs). If one input dimension has a spike, the Hadamard transform distributes that spike's energy equally across all output dimensions.
Before FWHT: One outlier dimension (dim 42) has 100Γ the variance of others.
After FWHT: Every output dimension has roughly equal variance β the outlier's energy is smeared uniformly.
Why It's Fast¶
Unlike the FFT (which requires O(n log n) complex multiplications), the FWHT uses only additions and subtractions β no multiplications at all:
// In-place FWHT: O(n log n) additions, zero multiplications
for (int len = 1; len < n; len <<= 1) {
for (int i = 0; i < n; i += len << 1) {
for (int j = i; j < i + len; j++) {
float u = data[j];
float v = data[j + len];
data[j] = u + v; // butterfly add
data[j + len] = u - v; // butterfly subtract
}
}
}
On a modern CPU with SIMD, this processes 128-dim vectors in under 50 nanoseconds.
Key Properties¶
| Property | Value |
|---|---|
| Complexity | O(n log n) β only additions/subtractions |
| Invertible | Yes β FWHT(FWHT(x)) = nΒ·x |
| Orthogonal | Yes β preserves L2 distances: βHx - Hyβ = βx - yβ |
| Real-valued | Yes β no complex numbers (unlike FFT) |
| Dimension requirement | Power of 2 (pad if needed) |
Important
Distance preservation is the critical property. Because the Hadamard matrix is orthogonal, L2(FWHT(x), FWHT(y)) = L2(x, y). This means quantizing in the rotated space doesn't introduce any systematic bias β only the random quantization noise, which is now spread uniformly.
ποΈ SVASQ Pipeline¶
Ingestion (Encoding)¶
For each vector x:
graph LR
X["x (float32)"] --> Pad["Pad to\npower-of-2"]
Pad --> FWHT["FWHT\nrotation"]
FWHT --> Norm["Extract\nβxΜββ norm"]
FWHT --> Quant["INT8\nquantize"]
Norm --> Store["Store: [normββ | int8[D]]"]
Quant --> Store
- Pad the vector to the next power of 2 (e.g., 768 β 1024)
- Apply FWHT β the in-place butterfly transform
- Extract and store the L2 norm of the rotated vector (float32, 4 bytes)
- Calibrate per-dimension min/max from a representative sample
- Quantize each rotated dimension to INT8:
q[i] = round(255 Γ (xΜ[i] - min[i]) / scale[i]) - Store as
[4-byte norm | D bytes of INT8]
Search (Asymmetric Distance Computation)¶
graph LR
Q["query (float32)"] --> Pad2["Pad to\npower-of-2"]
Pad2 --> FWHT2["FWHT\nrotation"]
FWHT2 --> Prep["Pre-multiply:\nqΜ[i] = (qΜ[i] - min[i]) / scale[i]"]
Prep --> Scan["SIMD scan:\ndot(qΜ, int8_stored)"]
Scan --> Result["Approximate L2"]
The key optimization: query pushdown. Instead of dequantizing each stored vector, we transform the query into the quantized coordinate system:
Then the approximate L2 distance reduces to a simple dot product between the transformed float32 query and the stored INT8 codes β which SIMD can compute at billions of operations per second.
𧬠Residual SVASQ: The IVF Superpower¶
When SVASQ is used inside an IVF index (like SpectorIndex), vectors are quantized as residuals β the difference from their assigned centroid:
Why Residuals Matter¶
Residual vectors are much tighter than absolute vectors: - Absolute coordinates might span [-3.0, 3.0] β 255 INT8 bins cover a range of 6.0 - Residual coordinates span [-0.2, 0.2] β 255 INT8 bins cover a range of 0.4
That's a 15Γ improvement in quantization precision β the same 8-bit integer now represents a 15Γ smaller step size.
Tip
INT8 residual quantization gives the spatial precision of INT12βINT16 absolute quantization. This is why SpectorIndex achieves excellent recall despite using only 1 byte per dimension.
The FWHT Order¶
When combining FWHT with IVF residual quantization, the order of operations matters:
graph LR
X["x"] --> C["Find nearest\ncentroid c"]
C --> R["r = x - c\n(residual)"]
R --> FWHT3["FWHT(r)\n(rotate residual)"]
FWHT3 --> Q["INT8 quantize\n(rotated residual)"]
CRITICAL: Apply FWHT to the residual, not to the raw vector. Applying FWHT before centroid assignment would break the spatial clustering β the Hadamard transform scrambles the dimensions, making K-Means clusters meaningless.
π SVASQ vs Other Quantization¶
| Technique | Compression | Recall@10 | Speed | Notes |
|---|---|---|---|---|
| Float32 (baseline) | 1Γ | 100% | β‘ | Reference |
| Scalar INT8 | 4Γ | 95-99% | β‘β‘ | Simple, good baseline |
| SVASQ INT8 | ~4Γ | 97-99.5% | β‘β‘ | FWHT rotation removes outlier impact |
| SVASQ-4 (INT4) | 6-8Γ | 95-99% | β‘β‘ | Nibble-packed FWHT + 3Γ rescore recommended |
| Scalar INT4 | 8Γ | 85-95% | β‘β‘ | Aggressive, needs rescore |
| Product Quantization | 32Γ | 80-92% | β‘ | Complex, requires training |
SVASQ achieves the compression of standard INT8 with recall approaching float32 β because the FWHT rotation ensures every dimension contributes equally to the quantized distance.
π’ SVASQ-4: INT4 Nibble-Packed Quantization¶
SVASQ-4 extends the SVASQ pipeline to 4-bit quantization, achieving ~2Γ additional compression over SVASQ-8 (6β8Γ total vs float32).
Why It Works¶
The FWHT rotation that makes SVASQ-8 work is equally beneficial for INT4:
- After FWHT, all dimensions contribute equally β INT4 quantization noise is isotropic
- With IVF residuals, the tight range means INT4 on residuals β INT6βINT7 on absolute vectors
- 15 quantization levels (vs 255 for INT8) is sufficient for ranking with oversampling rescore
Memory Layout¶
Two 4-bit values are packed per byte using offset encoding (shifting [-7, 7] to [0, 14]):
| Dims | Float32 | SVASQ-8 | SVASQ-4 | SVASQ-4 Compression |
|---|---|---|---|---|
| 384 β 512 | 1,536 B | 516 B | 260 B | 5.9Γ |
| 768 β 1024 | 3,072 B | 1,028 B | 516 B | 6.0Γ |
| 4096 | 16,384 B | 4,100 B | 2,052 B | 8.0Γ |
Calibration¶
SVASQ-4 uses tighter clipping than SVASQ-8 (2.5Ο vs 3.0Ο) to optimize for 15 quantization levels:
SvasqParams params = SvasqCalibrator.calibrate4bit(corpus, dimensions, seed);
// params.bitWidth() == 4
// params.bytesPerVector() == 4 + paddedDim / 2
SIMD Kernel¶
The Svasq4SimdKernel extracts nibbles via shift+mask in each loop iteration, providing natural instruction-level parallelism:
// Load VL packed bytes = 2ΓVL dimensions
ByteVector packed = ByteVector.fromMemorySegment(B_SPECIES, segment, offset, nativeOrder);
// Extract high nibbles (even dims) and low nibbles (odd dims)
ByteVector hi = packed.lanewise(LSHR, 4).and(0x0F); // β [0, 14]
ByteVector lo = packed.and(0x0F); // β [0, 14]
// Widen to float32 and FMA with deinterleaved query arrays
accHi = ((FloatVector) hi.castShape(F_SPECIES, 0)).fma(qTildeHi[i], accHi);
accLo = ((FloatVector) lo.castShape(F_SPECIES, 0)).fma(qTildeLo[i], accLo);
The hi/lo split gives the CPU two independent FMA chains β one for even dimensions and one for odd β maximizing pipeline utilization.
Usage¶
Expected Recall¶
| Configuration | Recall@10 | Notes |
|---|---|---|
| SVASQ-4 (no rescore) | ~95β97% | Direct quantized distance only |
| SVASQ-4 (2Γ rescore) | ~96β98% | Moderate oversampling |
| SVASQ-4 (3Γ rescore) | ~97β99% | Recommended default |
| SVASQ-4 (5Γ rescore) | ~98β99% | Higher latency, diminishing returns |
| SVASQ-8 (no rescore) | ~97β99.5% | For comparison |
π» Implementation in Spector¶
SvasqCalibrator¶
Calibrates min/max statistics per dimension from a representative sample:
// SVASQ-8 calibration
SvasqParams params8 = SvasqCalibrator.calibrate(flatData, sampleSize, dimensions);
// SVASQ-4 calibration (tighter clipping for 15 levels)
SvasqParams params4 = SvasqCalibrator.calibrate4bit(flatData, sampleSize, dimensions);
SvasqStrategy / Svasq4Strategy¶
Encodes vectors and computes asymmetric distances:
// SVASQ-8
SvasqStrategy strategy = new SvasqStrategy(params, SimilarityFunction.EUCLIDEAN);
// SVASQ-4
Svasq4Strategy strategy4 = new Svasq4Strategy(params4, SimilarityFunction.EUCLIDEAN);
// Both implement QuantizationStrategy β same API
byte[] encoded = strategy.encode(residualVector);
float dist = strategy.computeDistance(segment, offset, qs);
SvasqSimdKernel / Svasq4SimdKernel¶
The Panama SIMD kernel that computes SVASQ distances directly from off-heap memory:
// SVASQ-8: Zero-copy INT8 codes from MemorySegment
float l2Dist = SvasqSimdKernel.computeL2(segment, offset, paddedDim, queryState);
// SVASQ-4: Zero-copy nibble-packed INT4 codes from MemorySegment
float l2Dist4 = Svasq4SimdKernel.computeL2(segment, offset, halfPaddedDim, queryState4);
π Mathematical Proof: Distance Preservation¶
For completeness, here's why FWHT preserves L2 distance.
The Hadamard matrix \(H_n\) (normalized by \(1/\sqrt{n}\)) is orthogonal: \(H^T H = I\).
For any two vectors \(x, y\):
Therefore: \(L2(Hx, Hy) = L2(x, y)\). QED.
The quantization error is now distributed uniformly across all dimensions (because FWHT spread the variance), so the expected quantization error is minimized β this is the optimality condition proven by Lyubarskii & Vershynin (2010) for random orthogonal transforms.
π See Also¶
- Large-Scale Benchmarks β Empirical sweeps for real embeddings and HNSW shard promotions.
- Roadmap β Future compression improvements (SVASQ-PQ, padding-aware storage, norm f16)
- Understanding Quantization β All quantization techniques compared
- SpectorIndex Architecture β How SVASQ fits into the IVF-HNSW index
- SVASQ Whitepaper β Academic treatment with proofs and benchmarks
- Quantization Comparison β How Spector compares to other engines' quantization