Performance Tuning
Spector delivers sub-millisecond latency out of the box, but there's always room to optimize for your specific workload. This page covers benchmarks, tuning strategies, and the science of finding the right recall/latency/memory trade-off.
Benchmark Summary
All benchmarks were measured on a 24-core x86 machine (Windows 11, Intel Core Ultra 9 285K), AVX2 256-bit, Java 25, ZGC, using clustered vectors (realistic distribution). Numbers are actual measured results. To reproduce them on your hardware, run:
mvn -pl spector-bench exec:java
Note
Methodology: Benchmarks use 200 measurement iterations with 50 warmup iterations per scenario. Vectors are generated with realistic cluster structure (50 clusters with Gaussian noise). Documents contain 200–1500 words with paragraph structure. Recall is measured against brute-force ground truth. Your results may vary ±20% depending on CPU model, OS scheduling, background load, and thermal throttling.
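The warmup-then-measure procedure described above can be sketched as follows. This is a minimal illustration, not the spector-bench harness: the class name LatencyStats, the nearest-rank percentile rule, and the placeholder workload are all assumptions made for the example.

```java
import java.util.Arrays;

/** Sketch: run discarded warmup iterations, then report nearest-rank percentiles. */
final class LatencyStats {

    /** Nearest-rank percentile computed over a sorted copy of the samples. */
    static long percentile(long[] samplesNanos, double pct) {
        long[] sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Runs warmup iterations (not recorded), then records one sample per measured iteration. */
    static long[] measure(Runnable task, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    public static void main(String[] args) {
        // Placeholder workload (hypothetical); substitute a real search call here.
        long[] samples = measure(() -> Math.sqrt(System.nanoTime()), 50, 200);
        System.out.println("p50 = " + percentile(samples, 50) + " ns");
        System.out.println("p99 = " + percentile(samples, 99) + " ns");
    }
}
```

With 200 measured iterations, the p99 value is decided by the two or three slowest samples, which is one reason run-to-run variation can be large.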
SIMD Kernel Latency
| Dimension | Cosine P50 | Cosine P99 | Dot Product P50 | Dot Product P99 |
|---|---|---|---|---|
| 32 | 500 ns | 1,500 ns | 200 ns | 400 ns |
| 128 | <100 ns | 100 ns | 100 ns | 1,300 ns |
| 384 | ~100 ns | 100 ns | ~100 ns | 100 ns |
| 768 | ~100 ns | 100 ns | ~100 ns | 100 ns |
Note
Values at 384 dimensions and above sit at the System.nanoTime() resolution floor, so they cannot be resolved more finely with single-call timing. JMH confirms millions of ops/sec.
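When a single call is shorter than the timer can resolve, one common workaround is to time a whole batch of calls and divide by the batch size. The sketch below illustrates that idea only; the scalar dot product stands in for the real SIMD kernel, and the class and method names are hypothetical. For publishable numbers a harness such as JMH remains the better tool.

```java
import java.util.Arrays;

/** Sketch: average per-call cost from one timed batch instead of per-call timestamps. */
final class BatchTiming {

    /** Plain scalar dot product, used here as a stand-in for the SIMD kernel. */
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Average nanoseconds per call, measured over a single timed batch. */
    static double avgNanosPerCall(float[] a, float[] b, int batchSize) {
        float sink = 0f; // accumulate results so the JIT cannot discard the loop
        long start = System.nanoTime();
        for (int i = 0; i < batchSize; i++) {
            sink += dot(a, b);
        }
        long elapsed = System.nanoTime() - start;
        if (Float.isNaN(sink)) {
            System.out.println("unexpected NaN"); // keeps sink observably live
        }
        return (double) elapsed / batchSize;
    }

    public static void main(String[] args) {
        float[] a = new float[384];
        float[] b = new float[384];
        Arrays.fill(a, 0.5f);
        Arrays.fill(b, 0.25f);
        avgNanosPerCall(a, b, 100_000); // warmup pass, result ignored
        System.out.println("avg ns/call = " + avgNanosPerCall(a, b, 1_000_000));
    }
}
```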
Search Latency (128-dim, top-10, clustered vectors)
| Scale | Keyword (BM25) | Vector (HNSW) | Hybrid (RRF) |
|---|---|---|---|
| 10K docs | 0.19 ms / 3.79 ms p99 | 0.05 ms / 0.10 ms p99 | 0.17 ms / 0.37 ms p99 |
| 50K docs | 0.42 ms / 0.68 ms p99 | 0.09 ms / 0.19 ms p99 | 0.50 ms / 0.81 ms p99 |
| 100K docs | 0.98 ms / 1.39 ms p99 | 0.13 ms / 0.26 ms p99 | 1.01 ms / 1.22 ms p99 |
Search Throughput (queries/sec)
| Scale | Keyword | Vector | Hybrid |
|---|---|---|---|
| 10K | 5,194 | 18,824 | 5,828 |
| 50K | 2,406 | 10,980 | 1,988 |
| 100K | 1,019 | 7,556 | 994 |
Ingestion Throughput
| Dataset Size | Time | Rate | Memory |
|---|---|---|---|
| 10K | 2.5s | 3,931 docs/s | +19 MB |
| 50K | 15.1s | 3,308 docs/s | +93 MB |
| 100K | 38.2s | 2,618 docs/s | +187 MB |
Concurrency Scaling (50K docs, 384-dim, Hybrid Search)
| Threads | Throughput | Avg Latency | Scaling Factor |
|---|---|---|---|
| 1 | 3,739 ops/s | 0.26 ms | 1.0× |
| 4 | 10,317 ops/s | 0.37 ms | 2.8× |
| 8 | 11,812 ops/s | 0.58 ms | 3.2× |
| 16 | 14,022 ops/s | 1.00 ms | 3.7× |
Note
Concurrency scaling is measured with 384-dim vectors (production-realistic). 128-dim shows higher absolute throughput but the scaling factor is similar. Individual HNSW queries are sequential; scaling comes from serving multiple queries concurrently.
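The scaling factor in the table is throughput divided by single-thread throughput. Dividing that factor by the thread count gives a parallel efficiency, and Little's law (requests in flight = throughput × average latency) offers a quick consistency check on the measured rows. The helper below is an illustrative sketch with hypothetical names, not part of Spector.

```java
/** Sketch: derive scaling factor, parallel efficiency, and in-flight requests. */
final class ScalingMath {

    /** Throughput relative to the single-thread baseline. */
    static double scalingFactor(double throughput, double singleThreadThroughput) {
        return throughput / singleThreadThroughput;
    }

    /** Fraction of ideal linear scaling achieved at a given thread count. */
    static double efficiency(double scalingFactor, int threads) {
        return scalingFactor / threads;
    }

    /** Little's law: average requests in flight = ops/sec x average latency in seconds. */
    static double inFlight(double opsPerSec, double avgLatencyMs) {
        return opsPerSec * avgLatencyMs / 1000.0;
    }

    public static void main(String[] args) {
        // Rows copied from the concurrency table: threads, ops/s, average latency in ms.
        double[][] rows = {{1, 3739, 0.26}, {4, 10317, 0.37}, {8, 11812, 0.58}, {16, 14022, 1.00}};
        for (double[] row : rows) {
            double factor = scalingFactor(row[1], rows[0][1]);
            System.out.printf("threads=%d factor=%.2f efficiency=%.2f inFlight=%.1f%n",
                    (int) row[0], factor, efficiency(factor, (int) row[0]), inFlight(row[1], row[2]));
        }
    }
}
```

For the 16-thread row, throughput times latency comes to roughly 14 requests in flight, which is close to the thread count and suggests the throughput and latency columns are mutually consistent.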
Running Benchmarks
Full Benchmark Suite
mvn -pl spector-bench exec:java
Tip
Generates an HTML report at spector-bench/target/performance-report.html
Specific Benchmarks
# SIMD kernels only
mvn -pl spector-bench exec:java -Dexec.args="SimdKernelBenchmark"
# HNSW index operations
mvn -pl spector-bench exec:java -Dexec.args="HnswBenchmark"
# Concurrency scaling
mvn -pl spector-bench exec:java -Dexec.args="ConcurrencyBenchmark"
JSON Output for CI
Baseline Regression Detection
# Generate baseline
mvn -pl spector-bench exec:java -Dexec.args="--baseline"
# Compare against baseline
mvn -pl spector-bench exec:java -Dexec.args="--compare"
Tuning Strategies
Maximize Recall
Goal: recall@10 ≥ 95%
var config = SpectorConfig.DEFAULT
.withM(32) // More connections
.withEfConstruction(400) // Better graph quality
.withEfSearch(200); // Wider search beam
Trade-offs: 2× memory, ~3× build time, ~2× query latency.
Minimize Latency
Goal: p99 < 0.5ms
Trade-offs: Lower recall (~80% recall@10), but sub-millisecond guaranteed.
Maximize Throughput
Goal: Maximum queries/sec under concurrent load
var config = SpectorConfig.DEFAULT
.withM(16) // Balanced
.withEfSearch(50) // Not too high
.withGpu(true); // Batch processing
Key factors:
- Virtual threads handle concurrency automatically
- Keep efSearch moderate to reduce per-query work
- Enable GPU for batch workloads
- Use IVF-PQ for large datasets (reduced memory = better cache behavior)
Minimize Memory
Goal: Fit large datasets in limited RAM
var config = SpectorConfig.DEFAULT
.withM(8) // Fewer connections
.withEfConstruction(100);
// Use IVF-PQ for 32× vector compression
Memory per document (384-dim):
| Mode | Per Vector | 1M vectors |
|---|---|---|
| Float32 | ~1.8 KB | ~1.8 GB |
| INT8 | ~640 bytes | ~640 MB |
| IVF-PQ | ~288 bytes | ~288 MB |
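As a rough cross-check of the table, the raw payload of a vector is simply dimensions × bytes per component: 1,536 bytes for 384-dim Float32 and 384 bytes for INT8. The table's per-vector figures are higher than that; a plausible reading, which is an inference and not something stated here, is that the difference is per-vector index overhead such as graph links and IDs. The sketch below uses hypothetical helper names.

```java
/** Sketch: raw vector payload and dataset-level memory estimates. */
final class MemoryEstimate {

    /** Raw payload of one vector, excluding any index overhead. */
    static long rawVectorBytes(int dimensions, int bytesPerComponent) {
        return (long) dimensions * bytesPerComponent;
    }

    /** Total bytes for a dataset given a per-vector footprint. */
    static long totalBytes(long bytesPerVector, long vectorCount) {
        return bytesPerVector * vectorCount;
    }

    public static void main(String[] args) {
        System.out.println("Float32 raw: " + rawVectorBytes(384, 4) + " bytes");
        System.out.println("INT8 raw:    " + rawVectorBytes(384, 1) + " bytes");
        // Per-vector figure taken from the IVF-PQ row of the table above.
        System.out.println("IVF-PQ, 1M vectors: " + totalBytes(288, 1_000_000) / 1_000_000 + " MB");
    }
}
```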
Parameter Tuning Guide
HNSW: efSearch vs Recall vs Latency
Note
Recall values in the random column are measured with uniform random vectors (best case). Real embedding distributions with cluster structure may show lower recall at the same efSearch (see the clustered column), so increase efSearch to 100–200 for production workloads with real embeddings.
| efSearch | Recall@10 (random) | Recall@10 (clustered) | Avg Latency | Notes |
|---|---|---|---|---|
| 10 | ~70% | ~30-40% | 0.02 ms | Too low for most uses |
| 30 | ~85% | ~50-60% | 0.03 ms | Fast, moderate recall |
| 64 | ~90% | ~50-65% | 0.05 ms | Default |
| 100 | ~95% | ~70-80% | 0.10 ms | Good for production |
| 200 | ~98% | ~85-90% | 0.20 ms | High recall |
| 500 | ~99.5% | ~95%+ | 0.50 ms | Near-perfect |
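Recall@k in tables like the one above is the fraction of the exact top-k neighbours that the approximate search also returned. A minimal sketch of that measurement follows; it assumes dot-product scoring and uses hypothetical names, and it does not call any Spector API. The approximate result IDs would come from whatever index is being evaluated.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;

/** Sketch: recall@k of an approximate result against brute-force ground truth. */
final class RecallAtK {

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Exact top-k ids by descending dot-product score (brute force over all vectors). */
    static int[] bruteForceTopK(float[][] vectors, float[] query, int k) {
        Integer[] ids = new Integer[vectors.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Negate the score so that ascending sort order yields highest scores first.
        Arrays.sort(ids, Comparator.comparingDouble((Integer id) -> -dot(vectors[id], query)));
        int n = Math.min(k, ids.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = ids[i];
        }
        return top;
    }

    /** Fraction of the exact ids that also appear in the approximate result. */
    static double recall(int[] approx, int[] exact) {
        if (exact.length == 0) {
            return 1.0;
        }
        Set<Integer> found = new HashSet<>();
        for (int id : approx) {
            found.add(id);
        }
        int hits = 0;
        for (int id : exact) {
            if (found.contains(id)) {
                hits++;
            }
        }
        return (double) hits / exact.length;
    }
}
```

In practice the recall is averaged over many queries, since a single query gives only a coarse value (multiples of 1/k).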
IVF-PQ: nprobe vs Recall
| nprobe | Recall@10 | Relative Latency |
|---|---|---|
| 1 | ~40% | 1× |
| 4 | ~70% | 4× |
| 8 | ~85% | 8× |
| 16 | ~92% | 16× |
| 32 | ~97% | 32× |
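Because relative latency in this table grows in step with nprobe, a reasonable tuning approach is to pick the smallest nprobe whose recall meets the target. The sketch below hard-codes the approximate values from the table above purely for illustration; the class name is hypothetical, and real recall figures should come from measurements on your own data.

```java
/** Sketch: choose the smallest nprobe that meets a recall target. */
final class NprobePicker {

    // Approximate values copied from the table above; replace with your own measurements.
    static final int[] NPROBE = {1, 4, 8, 16, 32};
    static final double[] RECALL = {0.40, 0.70, 0.85, 0.92, 0.97};

    /** Returns the smallest listed nprobe reaching the target, or -1 if none does. */
    static int smallestNprobeFor(double targetRecall) {
        for (int i = 0; i < NPROBE.length; i++) {
            if (RECALL[i] >= targetRecall) {
                return NPROBE[i];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("nprobe for 90% recall: " + smallestNprobeFor(0.90));
    }
}
```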
SpectorIndex (IVF-HNSW-SVASQ): nCentroids vs nProbe
SpectorIndex uses IVF partitioning with adaptive HNSW shards. The two key parameters are:
- nCentroids: number of K-Means partitions (set at training time)
- nProbe: number of partitions searched at query time (adjustable)
Rule of thumb: nCentroids ≈ √N (square root of dataset size).
Real embedding results (Qwen3-embedding, 4096-dim, 10K vectors):
| nCentroids | nProbe | % Data Searched | Avg Latency | QPS | Recall@10 |
|---|---|---|---|---|---|
| 128 | 4 | 3.1% | 0.46ms | 2,173 | 1.0000 |
| 128 | 8 | 6.3% | 0.73ms | 1,368 | 1.0000 |
| 128 | 16 | 12.5% | 1.26ms | 792 | 1.0000 |
| 64 | 4 | 6.3% | 0.62ms | 1,601 | 1.0000 |
| 64 | 8 | 12.5% | 1.17ms | 856 | 1.0000 |
| 32 | 4 | 12.5% | 1.17ms | 857 | 1.0000 |
Tip
With real embeddings (not random vectors), SpectorIndex achieves perfect recall at nProbe=4 because real embeddings form natural semantic clusters that K-Means captures effectively. Start with nProbe=4 and only increase if your recall target isn't met.
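The two parameters can be related with simple arithmetic. The square-root rule gives a starting point for nCentroids, and the "% Data Searched" column matches nProbe / nCentroids, which assumes partitions of roughly equal size. The helper names below are hypothetical and the sketch does not use any SpectorIndex API.

```java
/** Sketch: starting values for IVF partition parameters. */
final class PartitionMath {

    /** Square-root rule of thumb for the number of K-Means partitions. */
    static int suggestedCentroids(long datasetSize) {
        return (int) Math.round(Math.sqrt((double) datasetSize));
    }

    /** Share of the dataset visited per query, assuming equally sized partitions. */
    static double percentSearched(int nProbe, int nCentroids) {
        return 100.0 * nProbe / nCentroids;
    }

    public static void main(String[] args) {
        System.out.println("Suggested nCentroids for 10K vectors: " + suggestedCentroids(10_000));
        System.out.println("nProbe=4 of 128 partitions: " + percentSearched(4, 128) + "% searched");
    }
}
```

For the 10K-vector dataset the rule suggests about 100 partitions, which is in the same range as the 64 and 128 used in the table.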
Note
For the complete empirical sweeps across multiple partition configurations (\(C \in \{32, 64, 128, 256\}\)) and detailed HNSW shard promotion benchmarks, see the dedicated Large-Scale Benchmarks deep dive.
Ingestion throughput (SpectorIndex vs standalone HNSW):
| Dataset Size | SpectorIndex | Standalone HNSW | Speedup |
|---|---|---|---|
| 10K | 130K docs/s | 4,677 docs/s | 28× |
| 50K | 140K docs/s | 2,483 docs/s | 56× |
| 100K | 150K docs/s | 1,535 docs/s | 98× |
| 500K | 246K docs/s | n/a | n/a |
| 1M | 128K docs/s | n/a | n/a |
Scaling Strategies
Vertical Scaling
- Add CPU cores: concurrent throughput scaling (up to ~3.7× at 16 threads measured)
- Add RAM: support larger capacity without IVF-PQ compression
- Add GPU: 4× brute-force search speedup at 100K+ vectors (data resident in VRAM)
Horizontal Scaling (Distributed Mode)
- Add nodes: linear throughput scaling per shard
- Rule of thumb: 100K–500K docs per shard
- See Distributed Mode for cluster setup
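The per-shard rule of thumb translates into a shard count by ceiling division. The sketch below is plain arithmetic with a hypothetical helper name; it does not reflect how Distributed Mode assigns shards.

```java
/** Sketch: shard count implied by a docs-per-shard target. */
final class ShardPlanner {

    /** Ceiling division: the fewest shards that keep every shard at or below the target size. */
    static long shardsFor(long totalDocs, long docsPerShard) {
        if (docsPerShard <= 0) {
            throw new IllegalArgumentException("docsPerShard must be positive");
        }
        return (totalDocs + docsPerShard - 1) / docsPerShard;
    }

    public static void main(String[] args) {
        long docs = 1_000_000;
        // The 100K to 500K range from the rule of thumb gives an upper and lower shard count.
        System.out.println("Shards at 500K docs/shard: " + shardsFor(docs, 500_000));
        System.out.println("Shards at 100K docs/shard: " + shardsFor(docs, 100_000));
    }
}
```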
JVM Tuning
Recommended JVM arguments for production:
java \
--add-modules jdk.incubator.vector \
--enable-native-access=ALL-UNNAMED \
-XX:+UseZGC \
-XX:+ZGenerational \
-Xmx4g \
-Xms4g \
-jar spector-node.jar
| Argument | Purpose |
|---|---|
| --add-modules jdk.incubator.vector | Required for SIMD acceleration |
| --enable-native-access=ALL-UNNAMED | Required for Panama FFM (GPU, mmap) |
| -XX:+UseZGC | Low-pause GC (vectors are off-heap) |
| -XX:+ZGenerational | Generational ZGC for better throughput (already the default since JDK 23; obsolete and ignored with a warning from JDK 24) |
| -Xmx4g -Xms4g | Fixed heap avoids resize pauses |
Tip
Since all vectors live off-heap, GC pressure is minimal. The heap primarily holds the HNSW graph structure and BM25 inverted index.
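A quick way to confirm that the arguments above took effect is to inspect the running JVM at startup. The sketch below uses only standard JDK APIs (ModuleLayer, ManagementFactory, Runtime); the class name is hypothetical and nothing in it is part of Spector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Sketch: report whether the recommended JVM settings are active. */
final class JvmSanityCheck {

    /** True when the incubating Vector API module was resolved via --add-modules. */
    static boolean vectorModuleResolved() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    /** Names of the active garbage collector beans (ZGC beans carry "ZGC" in their names). */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    /** Maximum heap size in bytes as seen by the JVM (reflects -Xmx). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.println("Vector API module resolved: " + vectorModuleResolved());
        System.out.println("Garbage collectors: " + collectorNames());
        System.out.println("Max heap (MB): " + maxHeapBytes() / (1024 * 1024));
    }
}
```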
See Also
- Configuration Guide: All parameters with ranges
- Core Concepts: How algorithms affect performance
- SpectorIndex Architecture: IVF-HNSW-SVASQ design and tuning
- Large-Scale Benchmarks: Empirical sweeps for real embeddings and shard promotions
- SVASQ Quantization: How SVASQ compression works
- GPU Acceleration: GPU-specific performance
- Distributed Mode: Scaling across nodes