SVASQ + SpectorIndex: A Technical Whitepaper¶
Vectorized Affine Scalar Quantization with Adaptive IVF-HNSW Indexing for High-Performance Approximate Nearest Neighbor Search
Spector Engine β 2026
Abstract¶
We present SVASQ (Vectorized Affine Scalar Quantization), a novel vector compression technique that applies the Fast Walsh-Hadamard Transform (FWHT) to spread dimensional variance before INT8 affine quantization, achieving near-lossless recall with 4Γ compression. We integrate SVASQ into SpectorIndex, an adaptive hybrid index combining Inverted File (IVF) coarse partitioning with per-partition Hierarchical Navigable Small World (HNSW) graphs that automatically promote from exact flat scans to quantized graph search as partitions grow. The system achieves 100Kβ250K vector ingestions per second (28β160Γ faster than standalone HNSW), sub-millisecond search latency, and perfect recall at full probe depth, implemented entirely on the JVM using Java 21's Vector API (Project Panama) for SIMD acceleration and off-heap memory for zero-GC search paths.
1. Introduction¶
Vector similarity search is the computational backbone of modern AI applications β retrieval-augmented generation (RAG), semantic search, recommendation systems, and multimodal retrieval all depend on finding the K nearest neighbors of a query vector among millions or billions of stored embeddings.
The fundamental tension in ANN search is the recallβspeedβmemory triangle: - HNSW [1] achieves excellent recall (95β99%) with O(log n) search, but suffers from slow O(n log n) construction and high memory consumption (graph edges consume 50β100% of vector storage). - IVF [2] enables fast ingestion and cache-friendly search through spatial partitioning, but standalone flat IVF has limited recall at low probe depths. - Product Quantization [3] provides aggressive compression (32β96Γ) but requires expensive codebook training, complex lookup-table-based distance computation, and suffers from significant recall degradation.
SpectorIndex addresses all three limitations simultaneously by combining the strengths of IVF, HNSW, and a novel quantization approach (SVASQ) that achieves the simplicity and speed of scalar quantization with recall approaching float32 exact search.
2. SVASQ: Vectorized Affine Scalar Quantization¶
2.1 The Outlier Problem in Scalar Quantization¶
Standard INT8 scalar quantization maps each dimension independently using a linear affine transform:
The quantization error per dimension is bounded by \(\epsilon_i \leq \frac{\text{max}_i - \text{min}_i}{510}\), which is inversely proportional to the dynamic range. When a small number of dimensions have disproportionately large ranges (common in transformer embeddings [4]), the quantization error concentrates in those dimensions, degrading distance approximation quality.
2.2 Variance Equalization via FWHT¶
SVASQ resolves the outlier problem by applying an orthogonal rotation before quantization. We use the Fast Walsh-Hadamard Transform (FWHT), which multiplies the vector by the normalized Hadamard matrix \(H_n\):
Theorem 1 (Distance Preservation). For any vectors \(x, y \in \mathbb{R}^n\): \(\|H_n x - H_n y\| = \|x - y\|\).
Proof. \(H_n\) is orthogonal (\(H_n^T H_n = nI\)), so \(\|(Hx - Hy)\|^2 = (x-y)^T H^T H (x-y) = n\|x-y\|^2\). After normalization by \(1/\sqrt{n}\), the distance is preserved exactly. β‘
Theorem 2 (Variance Spreading). Let \(x\) be a random vector with covariance \(\Sigma\). The Hadamard transform \(\hat{x} = H_n x / \sqrt{n}\) has covariance \(\hat{\Sigma} = H_n \Sigma H_n^T / n\). If \(\Sigma\) has one dominant eigenvalue \(\lambda_1 \gg \lambda_i\) for \(i > 1\), the diagonal entries of \(\hat{\Sigma}\) are approximately equal: \(\hat{\Sigma}_{ii} \approx \text{tr}(\Sigma)/n\).
Intuition: Each output dimension of the Hadamard transform is a sum/difference of all input dimensions with alternating signs. A single outlier dimension's variance is distributed across all output dimensions.
2.3 SVASQ Encoding Pipeline¶
Given a vector \(x \in \mathbb{R}^D\):
- Pad to \(\hat{D}\) = next power of 2 (zero-fill)
- FWHT: \(\hat{x} = \text{FWHT}(x)\) β in-place O(D log D) using only additions/subtractions
- Extract norm: \(\|x\|_2\) stored as float32 (4 bytes)
- Calibrate: Per-dimension \(\text{min}_i, \text{scale}_i\) from a representative sample (one-time)
- Quantize: \(q_i = \text{clamp}(\text{round}(255 \cdot (\hat{x}_i - \text{min}_i) / \text{scale}_i), 0, 255)\)
- Store:
[norm_f32 | q_0, q_1, ..., q_{DΜ-1}]β total \(\hat{D} + 4\) bytes
Storage cost: For 768-dim vectors: \(\hat{D} = 1024\), total = 1028 bytes/vector (vs. 3072 bytes for float32) β 3.0Γ compression.
2.4 Asymmetric Distance Computation (ADC)¶
At query time, we avoid dequantizing stored vectors. Instead, we transform the query into the quantized coordinate system (query pushdown):
The approximate L2Β² distance reduces to:
This is a weighted dot-product between float32 query coefficients and INT8 stored codes, which the Java Vector API (Panama) computes using fused multiply-add SIMD instructions at ~1 billion operations per second.
3. SpectorIndex: Adaptive IVF-HNSW Architecture¶
3.1 Two-Level Partitioning¶
SpectorIndex organizes vectors in a two-level hierarchy:
Level 1 (IVF): K-Means++ produces \(C\) centroids from a training sample. Each vector is assigned to its nearest centroid and stored as a residual \(r = x - c_{\text{nearest}}\).
Level 2 (Adaptive Shards): Each centroid's partition is a SpectorShard operating in one of two modes:
| Mode | Condition | Search | Memory |
|---|---|---|---|
| Flat | size < \(T\) | Exact SIMD scan over float32 residuals | Float32 buffer |
| HNSW | size β₯ \(T\) | SVASQ-quantized graph traversal | SVASQ codes + graph edges |
Where \(T\) is the shardThreshold (default: 20,000).
3.2 Why Flat Scan Beats HNSW for Small Partitions¶
Modern SIMD hardware can scan contiguous memory at extraordinary speed. Using the Java Vector API with 256-bit lanes:
- Flat scan throughput: ~1,000 vectors per microsecond (sequential memory access, hardware prefetcher engaged)
- HNSW graph traversal: ~10β50 nodes per microsecond (random memory access, L2 cache misses at ~100ns each)
For partitions of \(N < 20{,}000\) vectors, the flat scan completes in \(N / 1000 \approx 20\mu s\) β faster than HNSW's \(O(\log N)\) graph hops with their cache miss penalties.
3.3 Automatic Shard Promotion¶
When a shard's flat buffer reaches shardThreshold, it automatically promotes to HNSW mode:
- Calibrate SVASQ from the flat buffer (in-place, single pass)
- Build HNSW graph with pre-calibrated SVASQ strategy (bulk insertion)
- Null flat buffer to reclaim heap memory
- Volatile publication β a
volatilewrite to thepromotedflag establishes a happens-before edge, guaranteeing the HNSW index is visible to all concurrent search threads
The promotion is performed under an exclusive write-lock. In-flight flat scans hold the read-lock and complete before promotion begins; new searches arriving during promotion block on the read-lock.
3.4 Translation-Invariant Cross-Shard Merge¶
This is the most critical correctness property of the architecture.
After searching \(k_{\text{probe}}\) shards, the results must be merged into a global top-K. Each shard returns scores computed on residuals β vectors translated to different coordinate origins (centroids). For the merge to be correct, scores from different shards must be comparable.
L2 distance is translation-invariant:
The centroid \(c\) cancels algebraically, so the residual L2 distance equals the original-space L2 distance regardless of which shard the vector resides in.
Cosine similarity is NOT translation-invariant: \(\cos(q - c_1, x - c_1) \neq \cos(q - c_2, y - c_2)\) in general. Using cosine for cross-shard merge produces incorrect rankings.
Design rule: SpectorIndex always uses EUCLIDEAN distance internally for residual search and global merge, regardless of the user's configured similarity function. This is consistent with FAISS's
IndexIVFFlatand the SPANN architecture [5].
3.5 ADC for Graph Construction¶
When promoting a shard, the HNSW graph must be wired correctly. For each new node, the algorithm finds its nearest existing neighbors. We use Asymmetric Distance Computation (ADC):
- Incoming vector: exact float32 residual (treated as a "query")
- Existing nodes: already SVASQ-quantized
The ADC distance between an exact float32 vector and a quantized vector is more accurate than the Symmetric Distance (SDC) between two quantized vectors, producing a higher-quality graph with better recall.
4. Implementation: Java 21 + Project Panama¶
4.1 SIMD Distance Kernels¶
All distance computations use the Java Vector API (jdk.incubator.vector):
FloatVector va = FloatVector.fromArray(SPECIES, a, offset);
FloatVector vb = FloatVector.fromArray(SPECIES, b, offset);
FloatVector diff = va.sub(vb);
sum = diff.fma(diff, sum); // fused multiply-add
The JIT compiler maps these to AVX2/AVX-512 instructions, achieving 8β16 float operations per clock cycle.
4.2 Off-Heap Memory¶
SVASQ-quantized vectors and HNSW graph edges are stored in Panama MemorySegment (off-heap), avoiding GC pressure during search. The SvasqSimdKernel reads INT8 codes directly from off-heap memory without any intermediate byte[] allocation.
4.3 Zero-GC Flat Scan¶
The flat scan uses array-based top-K tracking (parallel float[] scores and int[] indices) instead of PriorityQueue. No per-candidate object allocation occurs during the scan β only the final ScoredResult[] is allocated once per search.
4.4 Virtual Thread Compatibility¶
All locks use ReentrantReadWriteLock, which calls LockSupport.park() for blocking. This unmounts (not pins) virtual threads, making SpectorIndex safe for high-concurrency virtual thread workloads on Java 21+.
5. Experimental Results¶
5.1 L2 vs Cosine Residual Search Comparison¶
We validated that L2 residual search produces perfect recall when all centroids are probed, compared to the incorrect use of cosine similarity for cross-shard merge:
| Dataset | L2 Residual (nProbe=ALL) | Cosine Residual (nProbe=ALL) |
|---|---|---|
| 10K (32 centroids) | 1.000 | 0.741 |
| 50K (32 centroids) | 1.000 | 0.726 |
| 100K (32 centroids) | 1.000 | 0.714 |
The ~26% recall degradation with cosine similarity is caused by its lack of translation invariance β residual distances from different centroid origins are not directly comparable under cosine.
5.2 Ingestion Throughput¶
| Dataset Size | SpectorIndex | Standalone HNSW | Speedup |
|---|---|---|---|
| 10K | 130K docs/s | 4,677 docs/s | 28Γ |
| 50K | 140K docs/s | 2,483 docs/s | 56Γ |
| 100K | 150K docs/s | 1,535 docs/s | 98Γ |
| 500K | 246K docs/s | β | β |
| 1M | 128K docs/s | β | β |
5.3 Search Latency (128-dim random Gaussian vectors)¶
| nProbe | 10K avg | 100K avg | 1M avg |
|---|---|---|---|
| 4 | 0.07ms | 0.33ms | 0.92ms |
| 8 | 0.08ms | 0.70ms | 2.00ms |
| 16 | 0.14ms | 1.5ms | 3.76ms |
| 32 | 0.29ms | 3.2ms | 7.45ms |
| 64 | β | β | 15.0ms |
5.4 Real-Embedding Validation (Qwen3-embedding, 4096-dim)¶
Note
For the comprehensive, empirical sweeps across multiple coarse partition configurations (\(C \in \{32, 64, 128, 256\}\)) and deep analyses of HNSW shard promotions, refer to the dedicated Large-Scale Real-Embedding Benchmarks page.
To validate the architecture with structured data, we embedded 10,000 diverse sentences (8 topic categories) using Qwen3-embedding (4096 dimensions) via local Ollama inference.
Result: recall@10 = 1.0000 across ALL configurations tested.
| nCentroids | nProbe | % Data Searched | Avg Latency | QPS | Recall@10 |
|---|---|---|---|---|---|
| 128 | 4 | 3.1% | 0.46ms | 2,173 | 1.0000 |
| 128 | 8 | 6.3% | 0.73ms | 1,368 | 1.0000 |
| 128 | 16 | 12.5% | 1.26ms | 792 | 1.0000 |
| 64 | 4 | 6.3% | 0.62ms | 1,601 | 1.0000 |
| 64 | 8 | 12.5% | 1.17ms | 856 | 1.0000 |
| 32 | 4 | 12.5% | 1.17ms | 857 | 1.0000 |
Even at nProbe=4 with 128 centroids β searching only 3.1% of the data β recall is perfect. This confirms that real embeddings form tight semantic clusters that IVF captures effectively. The random Gaussian results (Section 5.3) represent the worst-case scenario for IVF, not the typical production workload.
Comparison: random vs. real embeddings at nProbe=4, nCentroids=128:
| Metric | Random Gaussian (128-dim) | Real Qwen3 (4096-dim) |
|---|---|---|
| Recall@10 | 0.234 | 1.000 |
| Latency | 1.05ms | 0.46ms |
The 4.3Γ recall improvement and 2.3Γ latency improvement demonstrate that SpectorIndex is designed for real workloads, where data structure is the norm.
6. Discussion¶
6.1 Random vs. Structured Data¶
Recall at practical nProbe values is lower with random Gaussian vectors than with real embeddings because random high-dimensional data has no natural cluster structure β true nearest neighbors are distributed uniformly across Voronoi cells. Real embedding models (BERT, Sentence-BERT, CLIP, etc.) produce vectors with strong topic-based clustering, where nearest neighbors tend to reside in the same or adjacent IVF cells.
6.2 Scaling Analysis¶
SpectorIndex's architecture suggests the following scaling behavior:
- Memory: O(D Γ N) with ~4Γ compression via SVASQ
- Ingestion: O(D Γ N) β dominated by residual computation and flat buffer appends
- Search: O(D Γ N/C Γ nProbe) β linear in partition size, controlled by nProbe
- Optimal centroid count: C β βN minimizes the search cost Γ recall product
6.3 Limitations¶
- Training required: K-Means training requires a representative sample. For streaming workloads, online centroid updates would be needed.
- Static partitioning: Once centroids are learned, vector distribution changes can cause partition imbalance. Periodic re-training addresses this.
- No native deletion: Removing vectors from HNSW shards is not implemented. A tombstone approach with periodic compaction is recommended.
7. Related Work¶
- FAISS IndexIVFFlat [2]: IVF with flat scan per partition. SpectorIndex adds adaptive HNSW promotion and SVASQ quantization.
- SPANN [5]: Space-Partitioned ANN by Microsoft. Similar IVF + local graph concept; SpectorIndex adds SVASQ and adaptive flat/HNSW shard modes.
- ScaNN [6]: Google's ANN library using anisotropic quantization. SVASQ achieves similar variance equalization via FWHT instead of learned rotations.
- DiskANN [7]: SSD-optimized graph index. SpectorIndex is RAM-optimized with off-heap Panama memory.
8. Conclusion¶
SVASQ + SpectorIndex demonstrates that combining three orthogonal techniques β IVF partitioning, adaptive HNSW graphs, and FWHT-rotated scalar quantization β produces a vector index with:
- Ingestion speed rivaling flat arrays (100K+ docs/s)
- Search recall approaching exact brute-force (with sufficient nProbe)
- Memory efficiency of 4Γ scalar quantization with near-lossless quality
- Implementation simplicity on the JVM without native code or GPU dependencies
The critical insight that L2 distance must be used for cross-shard merge (due to translation invariance) ensures correct global rankings β a property shared with all production IVF implementations.
References¶
-
Malkov, Y.A. and Yashunin, D.A. (2018). "Efficient and robust approximate nearest neighbor using Hierarchical Navigable Small World graphs." IEEE TPAMI, 42(4), 824-836.
-
JΓ©gou, H., Douze, M., and Schmid, C. (2011). "Product quantization for nearest neighbor search." IEEE TPAMI, 33(1), 117-128.
-
Johnson, J., Douze, M., and JΓ©gou, H. (2019). "Billion-scale similarity search with GPUs." IEEE TBD, 7(2), 535-547. (FAISS)
-
Kovaleva, O., et al. (2019). "Revealing the Dark Secrets of BERT." EMNLP 2019. (Outlier dimensions in transformers)
-
Chen, Q., et al. (2021). "SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search." NeurIPS 2021.
-
Guo, R., et al. (2020). "Accelerating Large-Scale Inference with Anisotropic Vector Quantization." ICML 2020. (ScaNN)
-
Subramanya, S.J., et al. (2019). "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node." NeurIPS 2019.