âš¡ Performance & SIMD¶
Spector Memory is engineered for microsecond-scale latency. This page documents the benchmark results and the key performance techniques that make it possible.
Benchmark Summary¶
Measured on Intel Core Ultra 9 285K, Java 25, AVX2 256-bit (8 float lanes), ZGC:
| Benchmark | Result | Notes |
|---|---|---|
| SIMD L2 Distance (128-dim) | 0.8 µs/vector | 1.2M vectors/sec |
| SIMD L2 Distance (384-dim) | 1.5 µs/vector | 2.6M vectors/sec |
| SIMD L2 Distance (768-dim) | 2.2 µs/vector | 1.4M vectors/sec |
| SIMD L2 Distance (1024-dim) | 3.0 µs/vector | 1.0M vectors/sec |
| Reverse Index Lookup | 180 ns/lookup | O(1) packed-key ConcurrentHashMap |
| CognitiveScorer (10K × 128-dim) | 2.9 ms total | Full 6-phase pipeline |
| Batch Habituation (1K IDs) | 101 µs total | 100 ns per penalty computation |
| TierRouter.totalCount() | 17 ms / 100K calls | 170 ns per call |
| Full Pipeline (1K ingest + 100 recall) | < 50 ms/query | End-to-end latency |
| Real Embedding (qwen3-embedding 4096-dim) | 31 ms/embed | Via Ollama (network bound) |
Key Techniques¶
O(1) Reverse Index¶
Memory IDs are resolved in constant time using a packed-key ConcurrentHashMap<Long, String>:
// Pack (type, offset) into a single long — zero String concatenation
private static long reverseKey(MemoryType type, long offset) {
return ((long) type.ordinal() << 48) | (offset & 0x0000_FFFF_FFFF_FFFFL);
}
This yields 180 ns lookups at 50K entries.
SIMD Euclidean Distance¶
Quantized INT8 Euclidean distance uses the Java Vector API for hardware acceleration:
// Vectorized dequantization + L2 in a single SIMD pass
FloatVector vQuery = FloatVector.fromArray(SPECIES, queryVector, i);
ByteVector vQuantized = ByteVector.fromMemorySegment(SPECIES_BYTE, segment, offset + i, NATIVE);
FloatVector vFloat = vQuantized.castShape(SPECIES, 0); // INT8 → float32
FloatVector vDequant = vFloat.mul(vScale).add(vMin); // Affine dequantization
FloatVector vDiff = vQuery.sub(vDequant);
vSum = vDiff.fma(vDiff, vSum); // Fused multiply-add
This achieves 2.2 µs/vector at 768 dimensions (1.4M vectors/sec).
Batch Habituation¶
The habituation penalty module computes all penalties in a single batch call with amortized map access, processing 1K penalties in 101 µs total.
Inline Header Capture¶
ScoredRecord captures the CognitiveHeader inline during scoring, eliminating N×8 off-heap re-reads per recall query.
Direct TierRouter Access¶
totalCount() uses direct field access to typed store references rather than iteration, completing 100K calls in 17 ms (170 ns/call).
Parallel Tier Scanning¶
Each memory tier is scanned on a dedicated Virtual Thread via ConcurrentTasks.forkJoinAll():
gantt
title Parallel Recall: 5 concurrent scans
dateFormat X
axisFormat %L ms
section Working (100 records)
Scan :a1, 0, 1
section Episodic P1 (5K records)
Scan :a2, 0, 3
section Episodic P2 (3K records)
Scan :a3, 0, 2
section Semantic (200 headers)
Scan :a4, 0, 1
section Procedural (50 records)
Scan :a5, 0, 1
section Merge + Rank
Top-K :a6, 3, 4
Key insight: Episodic partitions use disjoint memory segments — each partition's mmap is a separate MemorySegment. This guarantees zero contention between virtual threads, enabling perfect parallel scaling.
Fallback: If parallel scanning fails (e.g., thread pool exhaustion), the pipeline falls back to sequential scanning with identical results.
Memory Footprint¶
| Component | Formula | 10K memories (768-dim) |
|---|---|---|
| Episodic partition | 64B header + N × (32B + vecBytes) | 64B + 10K × 800B = 7.8 MB |
| Working memory | capacity × (32B + vecBytes) | 100 × 800B = 78 KB |
| Semantic headers | capacity × 32B | 5K × 32B = 156 KB |
| Procedural store | capacity × (32B + vecBytes) | 500 × 800B = 390 KB |
| Forward index | ~120B per entry | 10K × 120B = 1.2 MB |
| Reverse index | ~60B per entry | 10K × 60B = 600 KB |
| Total | ~10.2 MB |
vs. Python Memory Layers
A Python memory system stores each memory as a Python object (~500-800 bytes overhead) plus the vector in NumPy (~3KB for 768-dim float32). Spector stores the same memory in 800 bytes (32B header + 768B INT8 vector) — a 5-10× reduction.
Test Suite¶
spector-core: 276 tests ✅ (includes 15 SIMD kernel verification tests)
spector-memory: 167 tests ✅ (includes performance benchmarks + index tests)
+ 10 Ollama real embedding E2E tests (gated by OLLAMA_LIVE=true)
Total: 443 tests, 0 failures
Running Benchmarks¶
# Run all memory tests (includes benchmark assertions)
mvn test -pl spector-memory
# Run only performance benchmarks
mvn test -pl spector-memory -Dtest=PerformanceBenchmarkTest
# Run Ollama real embedding E2E tests
OLLAMA_LIVE=true mvn test -pl spector-memory -Dtest=OllamaRealEmbeddingTest
Next Steps¶
- Off-Heap Panama Design — zero-GC architecture
- 6-Phase Scoring Pipeline — the SIMD hot-loop
- Architecture — system-level design