๐๏ธ SpectorIndex: IVF-HNSW-SVASQ Architecture¶
The flagship adaptive vector index of the Spector search engine. SpectorIndex combines Inverted File partitioning, Hierarchical Navigable Small World graphs, and SVASQ residual quantization into a single index that scales from 10K to millions of vectors with excellent recall, fast ingestion, and minimal memory.
๐ฏ Design Goals¶
SpectorIndex was designed to solve the fundamental limitations of standalone HNSW:
| Problem with HNSW | SpectorIndex Solution |
|---|---|
| Slow ingestion (O(n log n)) | IVF partitioning + flat buffer โ 100K+ docs/s |
| High memory (graph edges) | SVASQ INT8 residuals โ 4ร compression |
| Doesn't scale past ~10M | IVF coarse search โ only probe relevant partitions |
| No compression | SVASQ with FWHT โ near-lossless INT8 |
๐๏ธ Three-Layer Architecture¶
SpectorIndex combines three orthogonal techniques:
graph TD
subgraph "Layer 1: IVF โ Coarse Partitioning"
Q["๐ Query"] --> KM["K-Means\nFind nProbe\nclosest centroids"]
KM --> S1["Shard 1"]
KM --> S2["Shard 2"]
KM --> S3["Shard N"]
end
subgraph "Layer 2: Adaptive Shards"
S1 --> F1["< 20K vectors:\nExact Flat Scan\n(SIMD, zero GC)"]
S2 --> H1["โฅ 20K vectors:\nLocal HNSW Graph\n(SVASQ-quantized)"]
S3 --> F2["< 20K vectors:\nExact Flat Scan"]
end
subgraph "Layer 3: SVASQ Residual Quantization"
H1 --> V1["Residual: r = x - centroid\nFWHT rotation โ INT8\n4ร compression"]
end
F1 --> M["Global Merge\n(L2 distance, top-K)"]
H1 --> M
F2 --> M
M --> R["๐ฏ Results"]
Layer 1: IVF (Inverted File)¶
K-Means clustering partitions the vector space into nCentroids Voronoi cells. At query time, only the nProbe closest cells are searched, reducing the effective search space by nCentroids / nProbe.
Layer 2: Adaptive Shards¶
Each Voronoi cell contains a SpectorShard โ an adaptive data structure that operates in one of two modes:
- Flat mode (size <
shardThreshold): Stores float32 residuals in a contiguous buffer. Search is an exact SIMD scan โ faster than HNSW for small partitions because there's no pointer-chasing overhead. - HNSW mode (size โฅ
shardThreshold): A local SVASQ-quantized HNSW graph. The flat buffer is consumed during promotion and released to free heap memory.
Layer 3: SVASQ Residual Quantization¶
Vectors are stored as residuals (r = x โ centroid), then compressed with SVASQ:
1. Apply FWHT (Fast Walsh-Hadamard Transform) to spread variance
2. Quantize to INT8 with calibrated min/max per dimension
3. Store: [4-byte L2 norm | D bytes of INT8 codes]
๐ Lifecycle¶
Phase 1: Training¶
SpectorIndex index = SpectorIndex.builder()
.dimensions(768)
.nCentroids(256)
.nProbe(16)
.shardThreshold(20_000)
.build();
index.train(representativeVectors); // K-Means++ โ learn centroids
Training runs K-Means++ on a representative sample to learn nCentroids centroids. This is a one-time operation (typically < 10 seconds for 50K training vectors).
Phase 2: Ingestion¶
For each vector:
1. Find nearest centroid (KMeans.nearestCentroid)
2. Compute residual: r = vector - centroid
3. Store in the centroid's shard (flat buffer, no graph construction)
4. If shard crosses threshold โ automatic promotion to HNSW
Phase 3: Search¶
- Find
nProbeclosest centroids to query - For each probed centroid: compute residual query
q_res = query - centroid - Search each shard with its residual query
- Global merge using L2 distance (translation-invariant)
- Return top-K
๐ Critical Design Decision: L2 for Residual Search¶
This is the most important architectural decision in SpectorIndex, and getting it wrong destroys recall.
The Problem¶
When searching across multiple shards, each shard returns results with scores computed in its own translated coordinate system (centered at its centroid). If you use cosine similarity on residuals, scores from different shards are not comparable:
Shard A (centroid c_A): cosine(q - c_A, x - c_A) โ angle in c_A's space
Shard B (centroid c_B): cosine(q - c_B, y - c_B) โ angle in c_B's space
A score of 0.95 from shard A and 0.93 from shard B might not reflect their true relative similarity to the query.
The Solution¶
L2 distance is translation-invariant:
The centroid cancels out! So L2 on residuals gives the exact same distance as L2 on original vectors, regardless of which centroid's shard the vector resides in. This makes cross-shard scores directly comparable.
Important
SpectorIndex always uses EUCLIDEAN distance internally for residual search and global merge, regardless of the user's configured similarityFunction. The user's metric is used only for centroid routing (where it operates in absolute space). This is the same approach used by FAISS's IndexIVFFlat.
Mathematical Proof¶
For any two vectors \(q, x\) and any centroid \(c\):
This identity holds exactly in floating-point arithmetic (the centroid terms cancel algebraically before any rounding).
๐๏ธ Performance Characteristics¶
Ingestion: 100Kโ250K docs/s¶
SpectorIndex's ingestion is 28-160ร faster than standalone HNSW because: - No graph construction during add (flat buffer append) - Residual computation is O(D) โ just subtraction - Memory-mapped flat arrays with sequential writes - Graph construction is deferred until shard promotion
Search: Sub-millisecond at Optimal Config¶
Real embeddings (Qwen3-embedding, 4096-dim, 10K vectors):
| nCentroids | nProbe | % Searched | Latency | QPS | Recall@10 |
|---|---|---|---|---|---|
| 128 | 4 | 3.1% | 0.46ms | 2,173 | 1.0000 |
| 128 | 8 | 6.3% | 0.73ms | 1,368 | 1.0000 |
| 64 | 4 | 6.3% | 0.62ms | 1,601 | 1.0000 |
| 64 | 8 | 12.5% | 1.17ms | 856 | 1.0000 |
| 32 | 4 | 12.5% | 1.17ms | 857 | 1.0000 |
Tip
With real embeddings, even nProbe=4 at 128 centroids gives perfect recall while searching only 3.1% of the data. Real embeddings have natural cluster structure that IVF exploits beautifully.
Memory: 4ร Compression with SVASQ¶
After shard promotion, SVASQ quantization compresses stored residuals to ~(D + 4) bytes per vector โ approximately 4ร compression versus float32.
โ๏ธ Configuration Guide¶
Centroid Count (nCentroids)¶
The number of IVF partitions. More centroids = finer partitioning = better recall at low nProbe, but slower training.
| Dataset Size | Recommended nCentroids |
|---|---|
| 10Kโ50K | 32โ64 |
| 50Kโ500K | 64โ256 |
| 500Kโ5M | 256โ1024 |
| 5Mโ50M | 1024โ4096 |
Rule of thumb: nCentroids โ โN (square root of dataset size).
Probe Count (nProbe)¶
How many centroids to search at query time. The primary recall/speed knob.
nProbe |
Recall | Speed | Use Case |
|---|---|---|---|
| 4 | ~30% | โกโกโก | Filtering, not primary search |
| 16 | ~77% | โกโก | Fast approximate search |
| 32 | ~90% | โก | Balanced |
| 64+ | ~95%+ | ๐ | High-recall requirements |
Tip
With real (structured) embeddings, recall at any given nProbe is significantly higher than with random data. Expect 90%+ recall at nProbe=16 with production embedding models.
Shard Threshold (shardThreshold)¶
When a shard's size reaches this threshold, it promotes from flat scan to HNSW.
- Default: 20,000 โ optimal for most workloads
- Lower values: earlier promotion, higher memory usage, potentially faster search in large shards
- Higher values: longer flat scan period, lower memory, simpler data path
Oversampling Factor (oversamplingFactor)¶
After HNSW promotion, the number of candidates retrieved per shard is k ร oversamplingFactor. Higher values improve recall at the cost of more candidates to merge.
- Default: 3 โ retrieves 30 candidates per shard for top-10 queries
- Increase to 5-10 if recall is insufficient
๐ฌ Adaptive Shard Promotion¶
The adaptive shard design is inspired by the observation from the original research:
"Scanning a flat, contiguous MemorySegment of SVASQ vectors using an unrolled 256-bit FMA loop utilizes aggressive CPU pre-fetchers. Panama can evaluate roughly 1,000 vectors in < 1 microsecond."
For small partitions (< 20K vectors), a flat SIMD scan over contiguous memory is 5-10ร faster than HNSW pointer-chasing. Only when partitions grow large enough for the O(log n) advantage to kick in does HNSW become worthwhile.
graph LR
Add["add(vector)"] --> Check{"size โฅ threshold?"}
Check -->|No| Flat["Append to\nflat buffer"]
Check -->|Yes| Promote["๐ Promote"]
Promote --> Cal["Calibrate SVASQ\nfrom flat buffer"]
Cal --> Build["Build HNSW\n(bulk insert)"]
Build --> Free["Free flat buffer\n(reclaim heap)"]
Thread Safety During Promotion¶
Promotion holds the write-lock exclusively. The sequence ensures correctness:
- In-flight flat scans complete before promotion runs (they hold read-locks)
- New searches arriving during promotion block on the read-lock
- After promotion, a
volatileflag enables a lock-free fast path for all subsequent searches - The
volatilewrite establishes a happens-before edge, guaranteeing the HNSW index is visible to all threads
๐งฌ FWHT Order of Operations¶
When combining FWHT with IVF, the order matters:
Ingestion:
1. Find nearest centroid c (using original vector in absolute space)
2. Compute residual r = x - c
3. Apply FWHT to r (not to x โ FWHT before centroid assignment breaks clustering)
4. Quantize to INT8
Search:
1. Find nProbe closest centroids
2. For each centroid c: compute q_res = q - c
3. Apply FWHT to q_res
4. Pre-multiply scale/offset (SVASQ query pushdown)
5. Scan the shard
๐ See Also¶
- Large-Scale Benchmarks โ Empirical sweeps for real embeddings and HNSW shard promotions.
- SVASQ Deep Dive โ How SVASQ quantization works in detail
- HNSW Explained โ How the graph search algorithm works
- ANN Search Primer โ Overview of all ANN algorithm families
- SVASQ + SpectorIndex Whitepaper โ Academic treatment
- Performance Tuning โ Practical tuning advice