Skip to content

πŸ“Š Large-Scale Real-Embedding & Shard Promotion Benchmarks

This page presents the exhaustive experimental results and performance characteristics of SpectorIndex (IVF-HNSW-SVASQ).

To evaluate the system under realistic production workloads, we benchmarked the index using high-dimensional text embeddings from real-world datasets rather than synthetic structureless Gaussian noise. Additionally, we analyzed the performance and recall characteristics of our adaptive shard promotion system at a scale of 100,000 vectors.


πŸ”¬ Experimental Setup & System Context

All tests were performed locally under standard, repeatable conditions to isolate CPU and JVM execution metrics:

  • Hardware: 24-core Intel Core Ultra 9 285K, AVX2 256-bit SIMD instruction extensions.
  • Runtime Environment: Java 25 (OpenJDK 25.0.1), garbage collection managed via the ZGC (Z Garbage Collector), 12GB allocated heap (-Xmx12g).
  • Core Optimization: Panama Vector API (jdk.incubator.vector) enabled via JVM arguments to compile hardware-native SIMD instructions on the fly.
  • Embedding Model: qwen3-embedding (4,096 dimensions) via a local GPU-accelerated Ollama inference runner.
  • Dataset (Real-Embedding): 10,000 diverse sentences sampled from 8 distinct semantic topic categories (quantum mechanics, biotechnology, economics, history, creative arts, cybersecurity, environmental policy, medicine).
  • Queries: 100 fresh, out-of-distribution sentences sampled from the same topic categories.
  • Ground Truth: Absolute exact \(L2^2\) brute-force top-10 neighbors computed on uncompressed float32 vectors.

πŸ“ˆ Part 1: Real-Embedding Sweep (4096-dim Qwen3)

Real-world transformer embeddings naturally cluster into distinct, low-dimensional manifolds. Sentences about quantum mechanics group together; sentences about macroeconomics form another group. SpectorIndex exploits this structured geometry, yielding near-perfect recall at fraction-of-a-percent partition scans.

The sweeps evaluate different nCentroids (IVF Voronoi cells) and nProbe depths. All measurements represent average search latency and QPS over 500 query iterations.

1. nCentroids = 32

Vectors are divided into 32 cells (average ~312 vectors per partition).

nProbe % of Index Searched Avg Latency p50 (Median) p99 (Worst Case) Throughput (QPS) Recall@10 Ingest Latency
4 12.5% 1.167 ms 1.094 ms 1.828 ms 857 1.0000 555 ms (18,018/s)
8 25.0% 2.237 ms 2.236 ms 2.957 ms 447 1.0000 541 ms
16 50.0% 4.560 ms 4.567 ms 5.443 ms 219 1.0000 550 ms
32 100.0% 7.767 ms 7.781 ms 8.426 ms 129 1.0000 554 ms

2. nCentroids = 64

Vectors are divided into 64 cells (average ~156 vectors per partition).

nProbe % of Index Searched Avg Latency p50 (Median) p99 (Worst Case) Throughput (QPS) Recall@10 Ingest Latency
4 6.3% 0.624 ms 0.625 ms 0.923 ms 1,601 1.0000 1,012 ms (9,881/s)
8 12.5% 1.168 ms 1.141 ms 1.592 ms 856 1.0000 1,007 ms
16 25.0% 2.198 ms 2.233 ms 2.805 ms 455 1.0000 1,007 ms
32 50.0% 4.439 ms 4.502 ms 5.118 ms 225 1.0000 1,006 ms
64 100.0% 7.921 ms 7.893 ms 8.828 ms 126 1.0000 1,003 ms

3. nCentroids = 128

Vectors are divided into 128 cells (average ~78 vectors per partition).

nProbe % of Index Searched Avg Latency p50 (Median) p99 (Worst Case) Throughput (QPS) Recall@10 Ingest Latency
4 3.1% 0.455 ms 0.443 ms 0.651 ms 2,198 0.9980 1,965 ms (5,089/s)
8 6.3% 0.751 ms 0.719 ms 1.100 ms 1,332 0.9990 1,960 ms
16 12.5% 1.218 ms 1.152 ms 1.753 ms 821 1.0000 1,970 ms
32 25.0% 2.298 ms 2.273 ms 2.856 ms 435 1.0000 1,964 ms
64 50.0% 4.475 ms 4.455 ms 5.177 ms 223 1.0000 1,965 ms

4. nCentroids = 256

Vectors are divided into 256 cells (average ~39 vectors per partition).

nProbe % of Index Searched Avg Latency p50 (Median) p99 (Worst Case) Throughput (QPS) Recall@10 Ingest Latency
4 1.56% 0.538 ms 0.535 ms 0.642 ms 1,857 0.9950 3,873 ms (2,582/s)
8 3.13% 0.690 ms 0.676 ms 0.997 ms 1,450 1.0000 3,986 ms
16 6.25% 0.957 ms 0.942 ms 1.218 ms 1,045 1.0000 3,874 ms
32 12.50% 1.468 ms 1.425 ms 1.879 ms 681 1.0000 3,881 ms
64 25.00% 2.872 ms 2.836 ms 3.552 ms 348 1.0000 3,897 ms

Note

Note on nCentroids=256 sweeps: The data for nProbe=64 represents a highly comprehensive coverage of large-partition lookups. The 256 centroid partition sweeps show that searching less than 1.6% of the clusters (nProbe=4) still yields 99.5% recall at an incredibly low latency of 0.538ms.


πŸ’‘ Structural Recall Analysis: Synthetic vs. Real Data

The outstanding recall achieved on real text embeddings (99.5% - 100.0% even at highly aggressive probes) highlights a fundamental math concept: synthetic high-dimensional vectors are a poor model for real-world embeddings.

Synthetic high-dimensional data (like random Gaussian distributions) spreads uniformly across the entire hypersphere. There is no topic coherence, no clusters, and no structure. As a result, the true nearest neighbors of a random vector are randomly scattered across the Voronoi partitions of the index, requiring an exhaustive search (nProbe = ALL) to get reasonable recall.

In contrast, real embeddings (e.g., Sentence-BERT, CLIP, Qwen) occupy a much smaller semantic manifold. Vectors corresponding to similar concepts occupy the same spatial coordinate subspaces. The coarse K-Means centroids learn these clusters precisely. As a result, the nearest neighbors of a query sentence are mathematically guaranteed to reside in the exact same Voronoi cells or adjacent cellsβ€”achieving perfect search quality at extremely low probe depths.

Metric Random Gaussian (128-dim) Real Qwen3 (4096-dim)
Recall@10 (nCentroids=128, nProbe=4) 23.40% 99.80% (4.3Γ— increase)
Recall@10 (nCentroids=128, nProbe=8) 38.20% 99.90% (2.6Γ— increase)
Recall@10 (nCentroids=128, nProbe=32) 59.20% 1.0000 (1.7Γ— increase)

⚑ Part 2: Shard Promotion Benchmark (100K Scale)

To evaluate HNSW promotions at scale, we conducted a benchmark at 100,000 vectors comparing exhaustive Flat Shard mode (linear SIMD scan over float32 residuals) vs Promoted HNSW Shard mode (pre-calibrated 132-bit SVASQ quantized HNSW graph search inside each centroid's shard).

A total of 32 coarse centroids were used, resulting in an average of 3,125 vectors per shard. The promotion threshold shardThreshold was configured to 1,000, ensuring all 32 partitions promoted to HNSW graphs during ingestion.

Performance Summary (100K, 128-dim vectors)

nProbe Mode Avg Latency p50 (Median) p99 (Worst Case) Throughput (QPS) Recall@10 Ingestion Rate
4 Flat 0.388 ms 0.383 ms 0.671 ms 2,580 0.3260 632,911 docs/s
4 HNSW 0.418 ms 0.362 ms 0.962 ms 2,392 0.3230 7,638 docs/s
8 Flat 0.717 ms 0.709 ms 0.953 ms 1,394 0.5350 632,911 docs/s
8 HNSW 0.722 ms 0.694 ms 1.208 ms 1,386 0.5280 7,638 docs/s
16 Flat 1.462 ms 1.462 ms 1.704 ms 684 0.7760 632,911 docs/s
16 HNSW 1.719 ms 1.541 ms 3.716 ms 582 0.7670 7,638 docs/s
32 Flat 3.111 ms 3.077 ms 3.787 ms 321 1.0000 632,911 docs/s
32 HNSW 2.892 ms 2.724 ms 4.934 ms 346 0.9870 7,638 docs/s

πŸ› οΈ Shard Promotion Analysis

1. Recall Equivalence

The promoted HNSW shards achieve almost identical recall to the exhaustive float32 Flat Shards (e.g., 0.9870 HNSW vs 1.0000 Flat at nProbe = 32). This confirms that: - The translation of internal HNSW contiguous graph node indices (nodeIdx) to external global storeIndex values is correct. - Forcing SimilarityFunction.EUCLIDEAN for all residual operations inside the promoted HNSW index prevents mathematical similarity mismatches with the IVF boundaries.

2. Trade-Off: Ingestion vs. Search Speed

  • Ingestion: Flat Shards ingest at an astronomical 632K docs/sec because adding a vector requires only subtracting the centroid and appending to a float32 array. Quantized HNSW construction ingests at 7.6K docs/sec because it performs O(N log N) graph traversals and builds indexing structures on heap.
  • Shallow Searches (nProbe <= 16): Flat Shard mode remains slightly faster for small queries. Contiguous SIMD memory scans have zero graph traversal or pointer-chasing overhead, and the hardware prefetcher is highly efficient at low sizes.
  • Deep Searches (nProbe = 32): Promoted HNSW Shards win at deep lookups (where all centroids are searched), achieving 346 QPS (2.89ms) vs. 321 QPS (3.11ms) for Flat mode. As the search space increases, the graph's logarithmic traversal complexity bypasses exhaustive scans.

πŸ› οΈ Tuning Recommendations for SpectorIndex

Based on the empirical sweeps above, we recommend the following tuning strategies:

  1. Centroid Count Scaling: Maintain \(C \approx \sqrt{N}\) (e.g., 128 centroids for 10K–50K vectors, 512 centroids for 1M vectors) to balance coarse routing costs and partition sizing.
  2. Real-world Query Probe: Set nProbe between 8 and 16 for real embedding workloads. Unlike synthetic data where nProbe must be large, real embeddings achieve 99.9% - 100% recall with nProbe = 8, which cuts search latency and doubles query throughput.
  3. Adaptive Promotion Boundary: Use shardThreshold = 10_000 to promote shards to HNSW. At sizes below 10,000 vectors, contiguous SIMD scans over residuals remain faster than graph traversal.