Skip to content

๐Ÿ›๏ธ SpectorIndex: IVF-HNSW-SVASQ Architecture

The flagship adaptive vector index of the Spector search engine. SpectorIndex combines Inverted File partitioning, Hierarchical Navigable Small World graphs, and SVASQ residual quantization into a single index that scales from 10K to millions of vectors with excellent recall, fast ingestion, and minimal memory.


๐ŸŽฏ Design Goals

SpectorIndex was designed to solve the fundamental limitations of standalone HNSW:

Problem with HNSW SpectorIndex Solution
Slow ingestion (O(n log n)) IVF partitioning + flat buffer โ†’ 100K+ docs/s
High memory (graph edges) SVASQ INT8 residuals โ†’ 4ร— compression
Doesn't scale past ~10M IVF coarse search โ†’ only probe relevant partitions
No compression SVASQ with FWHT โ†’ near-lossless INT8

๐Ÿ—๏ธ Three-Layer Architecture

SpectorIndex combines three orthogonal techniques:

graph TD
    subgraph "Layer 1: IVF โ€” Coarse Partitioning"
        Q["๐Ÿ” Query"] --> KM["K-Means\nFind nProbe\nclosest centroids"]
        KM --> S1["Shard 1"]
        KM --> S2["Shard 2"]
        KM --> S3["Shard N"]
    end

    subgraph "Layer 2: Adaptive Shards"
        S1 --> F1["< 20K vectors:\nExact Flat Scan\n(SIMD, zero GC)"]
        S2 --> H1["โ‰ฅ 20K vectors:\nLocal HNSW Graph\n(SVASQ-quantized)"]
        S3 --> F2["< 20K vectors:\nExact Flat Scan"]
    end

    subgraph "Layer 3: SVASQ Residual Quantization"
        H1 --> V1["Residual: r = x - centroid\nFWHT rotation โ†’ INT8\n4ร— compression"]
    end

    F1 --> M["Global Merge\n(L2 distance, top-K)"]
    H1 --> M
    F2 --> M
    M --> R["๐ŸŽฏ Results"]

Layer 1: IVF (Inverted File)

K-Means clustering partitions the vector space into nCentroids Voronoi cells. At query time, only the nProbe closest cells are searched, reducing the effective search space by nCentroids / nProbe.

Layer 2: Adaptive Shards

Each Voronoi cell contains a SpectorShard โ€” an adaptive data structure that operates in one of two modes:

  • Flat mode (size < shardThreshold): Stores float32 residuals in a contiguous buffer. Search is an exact SIMD scan โ€” faster than HNSW for small partitions because there's no pointer-chasing overhead.
  • HNSW mode (size โ‰ฅ shardThreshold): A local SVASQ-quantized HNSW graph. The flat buffer is consumed during promotion and released to free heap memory.

Layer 3: SVASQ Residual Quantization

Vectors are stored as residuals (r = x โˆ’ centroid), then compressed with SVASQ: 1. Apply FWHT (Fast Walsh-Hadamard Transform) to spread variance 2. Quantize to INT8 with calibrated min/max per dimension 3. Store: [4-byte L2 norm | D bytes of INT8 codes]


๐Ÿ”„ Lifecycle

Phase 1: Training

SpectorIndex index = SpectorIndex.builder()
    .dimensions(768)
    .nCentroids(256)
    .nProbe(16)
    .shardThreshold(20_000)
    .build();

index.train(representativeVectors);  // K-Means++ โ†’ learn centroids

Training runs K-Means++ on a representative sample to learn nCentroids centroids. This is a one-time operation (typically < 10 seconds for 50K training vectors).

Phase 2: Ingestion

index.add("doc-1", 0, vector);  // ~100K-250K docs/s

For each vector: 1. Find nearest centroid (KMeans.nearestCentroid) 2. Compute residual: r = vector - centroid 3. Store in the centroid's shard (flat buffer, no graph construction) 4. If shard crosses threshold โ†’ automatic promotion to HNSW

ScoredResult[] results = index.search(queryVector, 10);
  1. Find nProbe closest centroids to query
  2. For each probed centroid: compute residual query q_res = query - centroid
  3. Search each shard with its residual query
  4. Global merge using L2 distance (translation-invariant)
  5. Return top-K

This is the most important architectural decision in SpectorIndex, and getting it wrong destroys recall.

The Problem

When searching across multiple shards, each shard returns results with scores computed in its own translated coordinate system (centered at its centroid). If you use cosine similarity on residuals, scores from different shards are not comparable:

Shard A (centroid c_A):  cosine(q - c_A, x - c_A)  โ†’ angle in c_A's space
Shard B (centroid c_B):  cosine(q - c_B, y - c_B)  โ†’ angle in c_B's space

A score of 0.95 from shard A and 0.93 from shard B might not reflect their true relative similarity to the query.

The Solution

L2 distance is translation-invariant:

\[\|(q - c) - (x - c)\|^2 = \|q - x\|^2\]

The centroid cancels out! So L2 on residuals gives the exact same distance as L2 on original vectors, regardless of which centroid's shard the vector resides in. This makes cross-shard scores directly comparable.

Important

SpectorIndex always uses EUCLIDEAN distance internally for residual search and global merge, regardless of the user's configured similarityFunction. The user's metric is used only for centroid routing (where it operates in absolute space). This is the same approach used by FAISS's IndexIVFFlat.

Mathematical Proof

For any two vectors \(q, x\) and any centroid \(c\):

\[\|(q - c) - (x - c)\|^2 = \sum_i ((q_i - c_i) - (x_i - c_i))^2 = \sum_i (q_i - x_i)^2 = \|q - x\|^2\]

This identity holds exactly in floating-point arithmetic (the centroid terms cancel algebraically before any rounding).


๐ŸŽ๏ธ Performance Characteristics

Ingestion: 100Kโ€“250K docs/s

SpectorIndex's ingestion is 28-160ร— faster than standalone HNSW because: - No graph construction during add (flat buffer append) - Residual computation is O(D) โ€” just subtraction - Memory-mapped flat arrays with sequential writes - Graph construction is deferred until shard promotion

Search: Sub-millisecond at Optimal Config

Real embeddings (Qwen3-embedding, 4096-dim, 10K vectors):

nCentroids nProbe % Searched Latency QPS Recall@10
128 4 3.1% 0.46ms 2,173 1.0000
128 8 6.3% 0.73ms 1,368 1.0000
64 4 6.3% 0.62ms 1,601 1.0000
64 8 12.5% 1.17ms 856 1.0000
32 4 12.5% 1.17ms 857 1.0000

Tip

With real embeddings, even nProbe=4 at 128 centroids gives perfect recall while searching only 3.1% of the data. Real embeddings have natural cluster structure that IVF exploits beautifully.

Memory: 4ร— Compression with SVASQ

After shard promotion, SVASQ quantization compresses stored residuals to ~(D + 4) bytes per vector โ€” approximately 4ร— compression versus float32.


โš™๏ธ Configuration Guide

Centroid Count (nCentroids)

The number of IVF partitions. More centroids = finer partitioning = better recall at low nProbe, but slower training.

Dataset Size Recommended nCentroids
10Kโ€“50K 32โ€“64
50Kโ€“500K 64โ€“256
500Kโ€“5M 256โ€“1024
5Mโ€“50M 1024โ€“4096

Rule of thumb: nCentroids โ‰ˆ โˆšN (square root of dataset size).

Probe Count (nProbe)

How many centroids to search at query time. The primary recall/speed knob.

nProbe Recall Speed Use Case
4 ~30% โšกโšกโšก Filtering, not primary search
16 ~77% โšกโšก Fast approximate search
32 ~90% โšก Balanced
64+ ~95%+ ๐ŸŒ High-recall requirements

Tip

With real (structured) embeddings, recall at any given nProbe is significantly higher than with random data. Expect 90%+ recall at nProbe=16 with production embedding models.

Shard Threshold (shardThreshold)

When a shard's size reaches this threshold, it promotes from flat scan to HNSW.

  • Default: 20,000 โ€” optimal for most workloads
  • Lower values: earlier promotion, higher memory usage, potentially faster search in large shards
  • Higher values: longer flat scan period, lower memory, simpler data path

Oversampling Factor (oversamplingFactor)

After HNSW promotion, the number of candidates retrieved per shard is k ร— oversamplingFactor. Higher values improve recall at the cost of more candidates to merge.

  • Default: 3 โ€” retrieves 30 candidates per shard for top-10 queries
  • Increase to 5-10 if recall is insufficient

๐Ÿ”ฌ Adaptive Shard Promotion

The adaptive shard design is inspired by the observation from the original research:

"Scanning a flat, contiguous MemorySegment of SVASQ vectors using an unrolled 256-bit FMA loop utilizes aggressive CPU pre-fetchers. Panama can evaluate roughly 1,000 vectors in < 1 microsecond."

For small partitions (< 20K vectors), a flat SIMD scan over contiguous memory is 5-10ร— faster than HNSW pointer-chasing. Only when partitions grow large enough for the O(log n) advantage to kick in does HNSW become worthwhile.

graph LR
    Add["add(vector)"] --> Check{"size โ‰ฅ threshold?"}
    Check -->|No| Flat["Append to\nflat buffer"]
    Check -->|Yes| Promote["๐Ÿ”„ Promote"]
    Promote --> Cal["Calibrate SVASQ\nfrom flat buffer"]
    Cal --> Build["Build HNSW\n(bulk insert)"]
    Build --> Free["Free flat buffer\n(reclaim heap)"]

Thread Safety During Promotion

Promotion holds the write-lock exclusively. The sequence ensures correctness:

  1. In-flight flat scans complete before promotion runs (they hold read-locks)
  2. New searches arriving during promotion block on the read-lock
  3. After promotion, a volatile flag enables a lock-free fast path for all subsequent searches
  4. The volatile write establishes a happens-before edge, guaranteeing the HNSW index is visible to all threads

๐Ÿงฌ FWHT Order of Operations

When combining FWHT with IVF, the order matters:

Ingestion: 1. Find nearest centroid c (using original vector in absolute space) 2. Compute residual r = x - c 3. Apply FWHT to r (not to x โ€” FWHT before centroid assignment breaks clustering) 4. Quantize to INT8

Search: 1. Find nProbe closest centroids 2. For each centroid c: compute q_res = q - c 3. Apply FWHT to q_res 4. Pre-multiply scale/offset (SVASQ query pushdown) 5. Scan the shard


๐Ÿ”— See Also