๐ŸŽ๏ธ Performance Tuning

Spector delivers sub-millisecond latency out of the box, but there is still room to optimize for your specific workload. This page covers benchmarks, tuning strategies, and how to find the right recall/latency/memory trade-off.


Benchmark Summary

All benchmarks were measured on a 24-core x86 machine (Windows 11, Intel Core Ultra 9 285K), AVX2 256-bit, Java 25, ZGC, using clustered vectors (realistic distribution). The numbers are measured results; run mvn -pl spector-bench exec:java to reproduce them on your hardware.

Note

Methodology: Benchmarks use 200 measurement iterations with 50 warmup iterations per scenario. Vectors are generated with realistic cluster structure (50 clusters with Gaussian noise). Documents contain 200–1500 words with paragraph structure. Recall is measured against brute-force ground truth. Your results may vary by ±20% depending on CPU model, OS scheduling, background load, and thermal throttling.
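As a rough illustration of the recall measurement described in the methodology note, the sketch below compares an approximate top-k result with the exact (brute-force) top-k. The class and method names are hypothetical and not part of Spector; it assumes document ids are unique integers.

```java
import java.util.HashSet;
import java.util.Set;

final class RecallAtK {
    /** Fraction of the exact top-k ids that also appear in the approximate top-k. */
    static double recallAtK(int[] approxIds, int[] exactIds, int k) {
        if (k <= 0 || exactIds.length == 0) {
            throw new IllegalArgumentException("k must be positive and ground truth non-empty");
        }
        // Ground truth: the first k ids returned by the brute-force scan.
        Set<Integer> truth = new HashSet<>();
        for (int i = 0; i < Math.min(k, exactIds.length); i++) {
            truth.add(exactIds[i]);
        }
        // Count how many of the approximate top-k ids are in the ground truth.
        int hits = 0;
        for (int i = 0; i < Math.min(k, approxIds.length); i++) {
            if (truth.contains(approxIds[i])) {
                hits++;
            }
        }
        return (double) hits / truth.size();
    }
}
```

Averaging this value over all benchmark queries gives a recall@k figure comparable in spirit to the ones reported on this page.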

SIMD Kernel Latency

| Dimension | Cosine P50 | Cosine P99 | Dot Product P50 | Dot Product P99 |
|---|---|---|---|---|
| 32 | 500 ns | 1,500 ns | 200 ns | 400 ns |
| 128 | <100 ns | 100 ns | 100 ns | 1,300 ns |
| 384 | ~100 ns | 100 ns | ~100 ns | 100 ns |
| 768 | ~100 ns | 100 ns | ~100 ns | 100 ns |

Note

Values at 384 dimensions and above sit at the System.nanoTime() resolution floor, so a single-call timing cannot resolve them any finer. JMH confirms millions of ops/sec.
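To see what the resolution floor looks like on a given machine, a small probe like the one below can help. It is only a sketch (the class name is hypothetical): it records the smallest non-zero step between consecutive System.nanoTime() readings. On many Windows systems the clock advances in 100 ns ticks, which would be consistent with the 100 ns entries in the table, but that is an assumption about the benchmark host rather than something stated in the results.

```java
final class TimerGranularity {
    /** Smallest non-zero step observed between consecutive System.nanoTime() readings. */
    static long smallestObservedStepNanos(int samples) {
        long smallest = Long.MAX_VALUE;
        long previous = System.nanoTime();
        for (int i = 0; i < samples; i++) {
            long now = System.nanoTime();
            long delta = now - previous;
            // Ignore zero deltas: they mean the clock did not tick between two reads.
            if (delta > 0 && delta < smallest) {
                smallest = delta;
            }
            previous = now;
        }
        return smallest;
    }

    public static void main(String[] args) {
        System.out.println("smallest observed step: "
                + smallestObservedStepNanos(1_000_000) + " ns");
    }
}
```

Any kernel that finishes faster than this step will be reported as zero or one tick, which is why a harness such as JMH, which times many invocations together, is the better tool at this scale.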

๐Ÿ” Search Latency (128-dim, top-10, clustered vectors)

Scale Keyword (BM25) Vector (HNSW) Hybrid (RRF)
10K docs 0.19 ms / 3.79 ms p99 0.05 ms / 0.10 ms p99 0.17 ms / 0.37 ms p99
50K docs 0.42 ms / 0.68 ms p99 0.09 ms / 0.19 ms p99 0.50 ms / 0.81 ms p99
100K docs 0.98 ms / 1.39 ms p99 0.13 ms / 0.26 ms p99 1.01 ms / 1.22 ms p99
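Percentile figures such as the p99 values above are derived from raw per-query timings. The following is a minimal sketch of that process, assuming a nearest-rank percentile and the 50 warmup / 200 measured iterations described in the methodology note; the class is hypothetical and is not the actual spector-bench harness.

```java
import java.util.Arrays;

final class LatencyPercentiles {
    /** Nearest-rank percentile of the samples, with p in the range (0, 100]. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Runs unmeasured warmup iterations, then records one latency (ns) per measured iteration. */
    static long[] measure(Runnable query, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            query.run();
        }
        long[] nanos = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            query.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }
}
```

With 200 samples, the nearest-rank p99 is the 198th smallest value, so a couple of slow outliers (a GC pause, a scheduler hiccup) are enough to move it. That is one reason p99 can sit far above the typical latency at small scales.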

Search Throughput (queries/sec)

| Scale | Keyword | Vector | Hybrid |
|---|---|---|---|
| 10K | 5,194 | 18,824 | 5,828 |
| 50K | 2,406 | 10,980 | 1,988 |
| 100K | 1,019 | 7,556 | 994 |

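Ingestion Throughput

| Dataset Size | Time | Rate | Memory |
|---|---|---|---|
| 10K | 2.5 s | 3,931 docs/s | +19 MB |
| 50K | 15.1 s | 3,308 docs/s | +93 MB |
| 100K | 38.2 s | 2,618 docs/s | +187 MB |

Concurrency scaling (query throughput by thread count):

| Threads | Throughput | Avg Latency | Scaling Factor |
|---|---|---|---|
| 1 | 3,739 ops/s | 0.26 ms | 1.0× |
| 4 | 10,317 ops/s | 0.37 ms | 2.8× |
| 8 | 11,812 ops/s | 0.58 ms | 3.2× |
| 16 | 14,022 ops/s | 1.00 ms | 3.7× |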

Note

Concurrency scaling is measured with 384-dim vectors (production-realistic). 128-dim shows higher absolute throughput but the scaling factor is similar. Individual HNSW queries are sequential; scaling comes from serving multiple queries concurrently.
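The scaling factor in the concurrency table is the throughput at N threads divided by the single-thread throughput. The sketch below reproduces that arithmetic and adds a parallel-efficiency figure (scaling factor divided by thread count), which is not reported in the benchmark output and is included here only as an illustrative extra. Names are hypothetical.

```java
final class ScalingFactor {
    /** Throughput at N threads relative to the single-thread throughput. */
    static double speedup(double throughputAtN, double throughputAtOne) {
        return throughputAtN / throughputAtOne;
    }

    /** Share of ideal linear scaling actually achieved (1.0 means perfect scaling). */
    static double efficiency(double speedup, int threads) {
        return speedup / threads;
    }

    public static void main(String[] args) {
        double s16 = speedup(14_022, 3_739);   // 16-thread row vs 1-thread row
        System.out.printf("speedup=%.2f efficiency=%.2f%n", s16, efficiency(s16, 16));
    }
}
```

For the 16-thread row, 14,022 / 3,739 is roughly 3.75, or about 23% of ideal 16× scaling, which matches the flattening visible between 8 and 16 threads.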


Running Benchmarks

Full Benchmark Suite

mvn -pl spector-bench exec:java

Tip

Generates an HTML report at spector-bench/target/performance-report.html

Specific Benchmarks

# SIMD kernels only
mvn -pl spector-bench exec:java -Dexec.args="SimdKernelBenchmark"

# HNSW index operations
mvn -pl spector-bench exec:java -Dexec.args="HnswBenchmark"

# Concurrency scaling
mvn -pl spector-bench exec:java -Dexec.args="ConcurrencyBenchmark"

JSON Output for CI

mvn -pl spector-bench exec:java -Dexec.args="-rf json -rff results.json"

๐Ÿ“ Baseline Regression Detection

# Generate baseline
mvn -pl spector-bench exec:java -Dexec.args="--baseline"

# Compare against baseline
mvn -pl spector-bench exec:java -Dexec.args="--compare"
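How --compare decides that a run has regressed is not described on this page. Purely as an illustration of the idea, the sketch below flags a regression when the current latency exceeds the baseline by more than a relative tolerance. The rule, the class name, and the 20% tolerance (borrowed from the run-to-run variance quoted in the methodology note) are all assumptions.

```java
final class RegressionCheck {
    /**
     * True when the current latency exceeds the baseline by more than the
     * given relative tolerance (0.20 means 20%).
     */
    static boolean isRegression(double baselineMs, double currentMs, double tolerance) {
        return currentMs > baselineMs * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        // Example: baseline p99 of 0.26 ms with a 20% tolerance gives a 0.312 ms threshold.
        System.out.println(isRegression(0.26, 0.30, 0.20));
        System.out.println(isRegression(0.26, 0.35, 0.20));
    }
}
```

A tolerance narrower than the normal run-to-run variance tends to produce false alarms in CI, so it is usually worth measuring that variance on the CI host before choosing a threshold.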

๐ŸŽ›๏ธ Tuning Strategies

๐ŸŽฏ Maximize Recall

Goal: recall@10 โ‰ฅ 95%

var config = SpectorConfig.DEFAULT
    .withM(32)                  // More connections
    .withEfConstruction(400)    // Better graph quality
    .withEfSearch(200);         // Wider search beam

Trade-offs: 2× memory, ~3× build time, ~2× query latency.


Minimize Latency

Goal: p99 < 0.5 ms

var config = SpectorConfig.DEFAULT
    .withM(12)
    .withEfConstruction(100)
    .withEfSearch(30);

Trade-offs: lower recall (~80% recall@10) in exchange for sub-millisecond query latency.


Maximize Throughput

Goal: Maximum queries/sec under concurrent load

var config = SpectorConfig.DEFAULT
    .withM(16)               // Balanced
    .withEfSearch(50)        // Not too high
    .withGpu(true);          // Batch processing

Key factors:

  • Virtual threads handle concurrency automatically

  • Keep efSearch moderate to reduce per-query work

  • Enable GPU for batch workloads

  • Use IVF-PQ for large datasets (reduced memory = better cache behavior)
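Because individual queries run sequentially, throughput comes from issuing many of them at once. The sketch below shows one way a client could do that with virtual threads (Java 21+). The search function is a stand-in passed by the caller, since this page does not show Spector's query API; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ConcurrentQueries {
    /** Runs each query on its own virtual thread and returns results in input order. */
    static <R> List<R> searchAll(List<float[]> queries, Function<float[], R> search) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<R>> futures = new ArrayList<>();
            for (float[] query : queries) {
                Callable<R> task = () -> search.apply(query);
                futures.add(executor.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a query", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("query failed", e.getCause());
                }
            }
            return results;
        }
    }
}
```

Virtual threads remove the need to size a thread pool by hand, but CPU-bound searches still compete for physical cores, which is consistent with the sub-linear scaling in the benchmark table.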


Minimize Memory

Goal: Fit large datasets in limited RAM

var config = SpectorConfig.DEFAULT
    .withM(8)                // Fewer connections
    .withEfConstruction(100);
// Use IVF-PQ for 32× vector compression

Memory per document (384-dim):

| Mode | Per Vector | 1M vectors |
|---|---|---|
| Float32 | ~1.8 KB | ~1.8 GB |
| INT8 | ~640 bytes | ~640 MB |
| IVF-PQ | ~288 bytes | ~288 MB |
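A quick way to sanity-check these figures is to compute the raw vector payload and compare it with the table. The sketch below does that for the three modes; it assumes 4 bytes per float32 component, 1 byte per INT8 component, and the 32× ratio quoted for IVF-PQ. The gap between the raw payload and the table values is presumably per-document index and metadata overhead, which this sketch does not model. Names are hypothetical.

```java
final class VectorMemory {
    /** Raw payload of a float32 vector: 4 bytes per dimension. */
    static long float32Bytes(int dim) {
        return 4L * dim;
    }

    /** Raw payload of an INT8-quantized vector: 1 byte per dimension. */
    static long int8Bytes(int dim) {
        return dim;
    }

    /** Raw payload after compressing the float32 vector by the given ratio. */
    static long compressedBytes(int dim, int compressionRatio) {
        return float32Bytes(dim) / compressionRatio;
    }

    public static void main(String[] args) {
        int dim = 384;
        System.out.println("float32: " + float32Bytes(dim) + " bytes");
        System.out.println("int8:    " + int8Bytes(dim) + " bytes");
        System.out.println("pq 32x:  " + compressedBytes(dim, 32) + " bytes");
    }
}
```

For 384 dimensions the raw payloads are 1,536, 384 and 48 bytes, so in each mode roughly 240 to 256 bytes of the table's per-vector figure is something other than the vector itself. With M(8) that overhead may shrink, but the table does not break it down.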

Parameter Tuning Guide

HNSW: efSearch vs Recall vs Latency

Note

The "random" recall column below is measured with uniform random vectors (best case). Real embedding distributions with cluster structure may show lower recall at the same efSearch, as the "clustered" column illustrates; increase efSearch to 100–200 for production workloads with real embeddings.

| efSearch | Recall@10 (random) | Recall@10 (clustered) | Avg Latency | Notes |
|---|---|---|---|---|
| 10 | ~70% | ~30-40% | 0.02 ms | Too low for most uses |
| 30 | ~85% | ~50-60% | 0.03 ms | Fast, moderate recall |
| 64 | ~90% | ~50-65% | 0.05 ms | Default |
| 100 | ~95% | ~70-80% | 0.10 ms | Good for production |
| 200 | ~98% | ~85-90% | 0.20 ms | High recall |
| 500 | ~99.5% | ~95%+ | 0.50 ms | Near-perfect |
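One way to use this table is to pick the smallest efSearch whose recall meets a target. The sketch below hard-codes the clustered column, taking the lower end of each range as a conservative estimate, and scans it in order. The class and record names are hypothetical, and the numbers are only as good as the table; measuring on your own data is the real test.

```java
import java.util.List;
import java.util.OptionalInt;

final class EfSearchPicker {
    record Row(int efSearch, double recallLowerBound, double avgLatencyMs) {}

    // Clustered-vector column from the table, lower end of each recall range.
    static final List<Row> CLUSTERED = List.of(
            new Row(10, 0.30, 0.02),
            new Row(30, 0.50, 0.03),
            new Row(64, 0.50, 0.05),
            new Row(100, 0.70, 0.10),
            new Row(200, 0.85, 0.20),
            new Row(500, 0.95, 0.50));

    /** Smallest tabulated efSearch whose conservative recall meets the target, if any. */
    static OptionalInt smallestEfSearchFor(double targetRecall) {
        for (Row row : CLUSTERED) {
            if (row.recallLowerBound() >= targetRecall) {
                return OptionalInt.of(row.efSearch());
            }
        }
        return OptionalInt.empty();
    }
}
```

For a 0.85 target this returns 200, and for targets above 0.95 it returns an empty result, signalling that the table alone cannot answer the question.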

IVF-PQ: nprobe vs Recall

| nprobe | Recall@10 | Relative Latency |
|---|---|---|
| 1 | ~40% | 1× |
| 4 | ~70% | 4× |
| 8 | ~85% | 8× |
| 16 | ~92% | 16× |
| 32 | ~97% | 32× |

SpectorIndex (IVF-HNSW-SVASQ): nCentroids vs nProbe

SpectorIndex uses IVF partitioning with adaptive HNSW shards. The two key parameters are:

  • nCentroids: number of K-Means partitions (set at training time)
  • nProbe: number of partitions searched at query time (adjustable)

Rule of thumb: nCentroids ≈ √N (square root of dataset size).

Real embedding results (Qwen3-embedding, 4096-dim, 10K vectors):

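| nCentroids | nProbe | % Data Searched | Avg Latency | QPS | Recall@10 |
|---|---|---|---|---|---|
| 128 | 4 | 3.1% | 0.46 ms | 2,173 | 1.0000 |
| 128 | 8 | 6.3% | 0.73 ms | 1,368 | 1.0000 |
| 128 | 16 | 12.5% | 1.26 ms | 792 | 1.0000 |
| 64 | 4 | 6.3% | 0.62 ms | 1,601 | 1.0000 |
| 64 | 8 | 12.5% | 1.17 ms | 856 | 1.0000 |
| 32 | 4 | 12.5% | 1.17 ms | 857 | 1.0000 |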
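The √N rule of thumb and the "% Data Searched" column both reduce to simple arithmetic. The sketch below computes a suggested nCentroids for a dataset size and the fraction of partitions probed, assuming partitions are roughly equal in size (an assumption; real K-Means partitions are usually uneven). Class and method names are hypothetical.

```java
final class PartitionMath {
    /** Rule-of-thumb partition count: the square root of the dataset size, at least 1. */
    static int suggestedCentroids(long datasetSize) {
        return (int) Math.max(1L, Math.round(Math.sqrt((double) datasetSize)));
    }

    /** Fraction of partitions visited per query, capped at 1.0. */
    static double fractionSearched(int nProbe, int nCentroids) {
        return Math.min(1.0, (double) nProbe / nCentroids);
    }

    public static void main(String[] args) {
        System.out.println(suggestedCentroids(10_000));
        System.out.printf("%.1f%%%n", 100 * fractionSearched(4, 128));
    }
}
```

For 10K vectors the rule suggests about 100 partitions, close to the 128 used in the first rows of the table, and nProbe=4 over 128 partitions touches about 3.1% of them, matching the first row.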

Tip

With real embeddings (not random vectors), SpectorIndex achieves perfect recall at nProbe=4 because real embeddings form natural semantic clusters that K-Means captures effectively. Start with nProbe=4 and only increase if your recall target isn't met.

Note

For the complete, empirical sweeps across multiple partition configurations (\(C \in \{32, 64, 128, 256\}\)) and detailed HNSW shard promotion benchmarks, see the dedicated Large-Scale Benchmarks deep dive.

Ingestion throughput (SpectorIndex vs standalone HNSW):

| Dataset Size | SpectorIndex | Standalone HNSW | Speedup |
|---|---|---|---|
| 10K | 130K docs/s | 4,677 docs/s | 28× |
| 50K | 140K docs/s | 2,483 docs/s | 56× |
| 100K | 150K docs/s | 1,535 docs/s | 98× |
| 500K | 246K docs/s | n/a | n/a |
| 1M | 128K docs/s | n/a | n/a |

๐Ÿ“ Scaling Strategies

โฌ†๏ธ Vertical Scaling

  • Add CPU cores โ†’ Concurrent throughput scaling (up to ~3.7ร— at 16 threads measured)

  • Add RAM โ†’ Support larger capacity without IVF-PQ compression

  • Add GPU โ†’ 4ร— brute-force search speedup at 100K+ vectors (data resident in VRAM)

โžก๏ธ Horizontal Scaling (Distributed Mode)

  • Add nodes โ†’ Linear throughput scaling per shard

  • Rule of thumb: 100Kโ€“500K docs per shard

  • See Distributed Mode for cluster setup
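The per-shard rule of thumb translates into a shard count by ceiling division. A minimal sketch, with hypothetical names:

```java
final class ShardPlanner {
    /** Number of shards needed so that no shard holds more than docsPerShard documents. */
    static long shardsNeeded(long totalDocs, long docsPerShard) {
        if (totalDocs < 0 || docsPerShard <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and docsPerShard > 0");
        }
        return (totalDocs + docsPerShard - 1) / docsPerShard;
    }

    public static void main(String[] args) {
        long docs = 2_000_000;
        System.out.println("at 500K per shard: " + shardsNeeded(docs, 500_000));
        System.out.println("at 100K per shard: " + shardsNeeded(docs, 100_000));
    }
}
```

For 2M documents the rule gives between 4 shards (500K each) and 20 shards (100K each); where to land in that range depends on the latency and memory targets discussed earlier.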


JVM Tuning

Recommended JVM arguments for production:

java \
  --add-modules jdk.incubator.vector \
  --enable-native-access=ALL-UNNAMED \
  -XX:+UseZGC \
  -XX:+ZGenerational \
  -Xmx4g \
  -Xms4g \
  -jar spector-node.jar

| Argument | Purpose |
|---|---|
| --add-modules jdk.incubator.vector | Required for SIMD acceleration |
| --enable-native-access=ALL-UNNAMED | Required for Panama FFM (GPU, mmap) |
| -XX:+UseZGC | Low-pause GC (vectors are off-heap) |
| -XX:+ZGenerational | Generational ZGC for better throughput (already the default since JDK 23; JDK 24 and later ignore the flag with a warning, so it can be dropped on Java 25) |
| -Xmx4g -Xms4g | Fixed heap avoids resize pauses |

Tip

Since all vectors live off-heap, GC pressure is minimal. The heap primarily holds the HNSW graph structure and BM25 inverted index.
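To confirm that a running JVM actually picked up these arguments, a small startup check along the following lines can be useful. It is a sketch using only standard JDK APIs; the class name is hypothetical and Spector may already log similar information.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

final class JvmStartupCheck {
    /** True when the Vector API module was resolved at startup (via --add-modules). */
    static boolean vectorModuleResolved() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    /** Names of the active garbage collectors, e.g. to confirm that ZGC is in use. */
    static List<String> collectorNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .toList();
    }

    /** Maximum heap size in MiB as seen by the runtime (reflects -Xmx). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("vector module resolved: " + vectorModuleResolved());
        System.out.println("collectors: " + collectorNames());
        System.out.println("max heap: " + maxHeapMiB() + " MiB");
    }
}
```

If the vector module is reported as missing, the --add-modules flag was not applied to that JVM.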


See Also