GPU Acceleration
Unlock massive parallel throughput with optional CUDA GPU acceleration. Spector loads GPU kernels via Panama FFM (Foreign Function & Memory), maintaining the zero-JNI philosophy. GPU shines for batch workloads; single queries are already sub-millisecond on CPU SIMD.
When to Use GPU
graph TD
Q["How many concurrent queries?"] --> Single["Single query<br/>Low concurrency"]
Q --> Batch["Batch queries<br/>High concurrency"]
Single --> CPU["CPU SIMD<br/>Best for HNSW traversal"]
Batch --> GPU["GPU CUDA<br/>4× speedup at 100K+ vectors"]
style CPU fill:#d4edda
style GPU fill:#d4edda
| Scenario | Recommendation |
|---|---|
| Batch search (multiple queries at once) | GPU |
| Large collections (>100K vectors) | GPU |
| High concurrency (many simultaneous users) | GPU |
| Brute-force similarity over IVF partitions | GPU |
| Single queries | CPU SIMD |
| Small datasets (<10K vectors) | CPU SIMD |
| Ultra-low latency (<0.1 ms) | CPU SIMD |
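
The table above can be read as a simple routing rule. The sketch below is illustrative only: `BackendChooser`, its `Backend` enum, and the 10K cutoff are hypothetical names and values lifted from the table's round numbers, not Spector's actual scheduling logic. Where the table's rows overlap (a single query against a large collection), this sketch follows the diagram and routes on concurrency first.

```java
// Hypothetical helper, not part of Spector's API.
final class BackendChooser {
    enum Backend { CPU_SIMD, GPU }

    // Assumed cutoff, taken from the "<10K vectors" row of the table.
    static final int SMALL_DATASET_VECTORS = 10_000;

    static Backend choose(int concurrentQueries, int collectionSize) {
        // Single queries stay on CPU: kernel launch overhead dominates.
        if (concurrentQueries <= 1) return Backend.CPU_SIMD;
        // Small datasets stay on CPU even when queries are batched.
        if (collectionSize < SMALL_DATASET_VECTORS) return Backend.CPU_SIMD;
        // Batched queries over a larger collection go to the GPU.
        return Backend.GPU;
    }
}
```

In practice the crossover point depends on the hardware, so treat the cutoff as a starting value to measure against rather than a fixed rule.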
Requirements
Hardware
- NVIDIA GPU with Compute Capability ≥ 7.0 (Volta or newer)
- Recommended: RTX 3060+ or A100/H100 for production workloads
Software
| Component | Version | Notes |
|---|---|---|
| CUDA Toolkit | 12.x | Runtime libraries required |
| NVIDIA Driver | 525+ | Must match CUDA version |
| JDK | 25+ | With Panama FFM support |
Installation (Linux)
# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-4
# Verify
nvidia-smi
nvcc --version
Verify Spector GPU Detection
Configuration
var config = SpectorConfig.DEFAULT
.withDimensions(384)
.withGpu(true)
.withGpuMemoryBudget(2048); // 2 GB
| Parameter | Default | Range | Description |
|---|---|---|---|
| gpuEnabled | false | n/a | Enable CUDA acceleration |
| gpuMemoryBudget | 256 MB | 256 MB – GPU max | Maximum device memory |
| gpuBatchWindow | 10 ms | 1–100 ms | Batching window for query collection |
| gpuMaxBatchSize | 1024 | 1–1024 | Max queries per kernel launch |
Tip
Set gpuMemoryBudget to ~70% of available GPU memory to leave room for other processes.
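
As a worked example of the 70% rule, the sketch below derives a budget in megabytes from the memory a card reports as available. `GpuBudget` and `recommendedBudgetMb` are hypothetical names used for illustration; the 256 MB floor comes from the parameter table above.

```java
// Hypothetical helper for illustration; not part of Spector's API.
final class GpuBudget {
    static final int MIN_BUDGET_MB = 256; // lower bound from the parameter table

    /** Returns roughly 70% of the given memory in MB, never below 256 MB. */
    static int recommendedBudgetMb(long availableDeviceMemoryMb) {
        long budget = availableDeviceMemoryMb * 70 / 100; // integer math, rounds down
        return (int) Math.max(MIN_BUDGET_MB, budget);
    }

    public static void main(String[] args) {
        // A card with 16 GB (16384 MB) available.
        System.out.println(recommendedBudgetMb(16_384)); // prints 11468
    }
}
```

The result could then be passed to `withGpuMemoryBudget(...)`, assuming that method takes megabytes as the `2048 // 2 GB` example suggests.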
GPU Kernels
Dot Product Kernel
Computes dot-product similarity between a query vector and a batch of document vectors.
| Property | Value |
|---|---|
| Input | query (float32[D]) + database (float32[N × D]) |
| Output | similarity scores (float32[N]) |
| Dimensions | Multiples of 32, range 32–2048 |
| Batch size | 1–1,000,000 vectors per invocation |
| Tolerance | ≤1e-5 absolute error vs CPU SIMD |
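
To make the input and output shapes concrete, here is a plain scalar reference for the same computation, plus a helper for the absolute-error check. This is a minimal sketch, not Spector's CPU SIMD or CUDA kernel: `DotProductReference` is a hypothetical class, and the row-major layout of `database` is an assumption.

```java
// Scalar reference for illustration; class and method names are hypothetical.
final class DotProductReference {
    /** scores[i] = dot(query, row i), with database stored row-major as N x D. */
    static float[] scores(float[] query, float[] database) {
        int d = query.length;
        if (d == 0 || database.length % d != 0) {
            throw new IllegalArgumentException(
                "database length must be a multiple of the query dimension");
        }
        int n = database.length / d;
        float[] out = new float[n];
        for (int i = 0; i < n; i++) {
            float acc = 0f;
            int base = i * d;
            for (int j = 0; j < d; j++) {
                acc += query[j] * database[base + j];
            }
            out[i] = acc;
        }
        return out;
    }

    /** True when every pair of scores differs by at most tol (absolute error). */
    static boolean withinTolerance(float[] a, float[] b, float tol) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++) {
            if (Math.abs(a[i] - b[i]) > tol) return false;
        }
        return true;
    }
}
```

A check such as `withinTolerance(gpuScores, cpuScores, 1e-5f)` mirrors the tolerance row of the table. The sketch does not enforce the multiple-of-32 dimension constraint.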
Cosine Similarity Kernel
Computes cosine similarity, caching vector norms so they are not recomputed for every query.
| Optimization | Benefit |
|---|---|
| Pre-computes norms | Cached across queries |
| Detects pre-normalized vectors | Skips norm computation |
| Falls back to dot product | For normalized inputs |
| Tolerance | ≤1e-6 vs CPU SIMD |
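
The optimizations in the table can be sketched on the CPU as follows. This is an illustration under stated assumptions rather than the kernel itself: `CosineReference` is a hypothetical class, the database is assumed row-major, and `UNIT_EPS` is an assumed threshold for treating a pair of vectors as already normalized.

```java
// Scalar sketch for illustration; names and the UNIT_EPS threshold are assumptions.
final class CosineReference {
    static final float UNIT_EPS = 1e-6f;

    /** L2 norm of the d-length row that starts at offset. */
    static float norm(float[] v, int offset, int d) {
        double sum = 0.0;
        for (int j = 0; j < d; j++) {
            sum += (double) v[offset + j] * v[offset + j];
        }
        return (float) Math.sqrt(sum);
    }

    /** Computes one norm per database row so it can be reused across queries. */
    static float[] rowNorms(float[] database, int d) {
        float[] norms = new float[database.length / d];
        for (int i = 0; i < norms.length; i++) {
            norms[i] = norm(database, i * d, d);
        }
        return norms;
    }

    /** Cosine scores using cached row norms; plain dot product for unit vectors. */
    static float[] scores(float[] query, float[] database, float[] cachedNorms) {
        int d = query.length;
        float qNorm = norm(query, 0, d);
        float[] out = new float[cachedNorms.length];
        for (int i = 0; i < out.length; i++) {
            float dot = 0f;
            for (int j = 0; j < d; j++) {
                dot += query[j] * database[i * d + j];
            }
            float denom = qNorm * cachedNorms[i];
            boolean unit = Math.abs(denom - 1f) <= UNIT_EPS;
            // Zero vectors have no direction, so their score is reported as 0.
            out[i] = unit ? dot : (denom == 0f ? 0f : dot / denom);
        }
        return out;
    }
}
```

Calling `rowNorms` once at load time and passing the result to every `scores` call is what makes the norms "cached across queries".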
Batch GPU Search
sequenceDiagram
participant Q1 as Query A (t=0ms)
participant Q2 as Query B (t=3ms)
participant Q3 as Query C (t=7ms)
participant GPU as GPU Kernel
Note over Q1,GPU: Batch window = 10ms
Q1->>GPU: Queued
Q2->>GPU: Queued
Q3->>GPU: Queued
Note over GPU: t=10ms: Window closes
GPU->>GPU: Single kernel for [A, B, C]
GPU-->>Q1: Top-K results for A
GPU-->>Q2: Top-K results for B
GPU-->>Q3: Top-K results for C
Properties:
- Each query receives its own independent top-K results
- Individual query errors don't fail the batch
- Achieves ≥2× throughput vs sequential execution for batch sizes >4
- Large batches are automatically partitioned to fit GPU memory
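
The batching window in the diagram can be approximated with a blocking queue. The sketch below is one possible shape, not Spector's implementation: `BatchWindow` and `collect` are hypothetical names, and the window here starts when `collect` is called rather than when the first query arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical micro-batcher for illustration only.
final class BatchWindow {
    /** Drains queries until the window elapses or maxBatch queries are collected. */
    static <T> List<T> collect(BlockingQueue<T> queue, long windowMs, int maxBatch) {
        List<T> batch = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(windowMs);
        while (batch.size() < maxBatch) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) break; // window closed
            try {
                T query = queue.poll(remaining, TimeUnit.NANOSECONDS);
                if (query == null) break; // no further arrivals before the deadline
                batch.add(query);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                break;
            }
        }
        return batch;
    }
}
```

With the defaults from the configuration table this would be called as `collect(queue, 10, 1024)`, and the returned list handed to a single kernel launch. A longer window collects larger batches at the cost of added latency per query.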
Memory Management
The GpuMemoryManager handles device memory via Panama FFM:
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
// Allocation tied to Arena lifecycle
try (Arena arena = Arena.ofConfined()) {
    MemorySegment deviceMem = gpuMemoryManager.allocateDevice(sizeBytes, arena);
    // Use device memory...
} // Automatically freed when arena closes
Key behaviors:
- Allocations are Arena-scoped with an explicit lifecycle
- Pinned host memory for efficient host↔device transfers
- Budget enforcement prevents over-allocation
- Device memory released within 100 ms of Arena close
- Metrics available via the monitoring API
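
Budget enforcement can be pictured as a counter that refuses reservations once the configured limit would be exceeded. The following is a minimal sketch of that idea; `BudgetTracker` is a hypothetical class and says nothing about how `GpuMemoryManager` is actually built.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical budget counter for illustration; not GpuMemoryManager itself.
final class BudgetTracker {
    private final long budgetBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    BudgetTracker(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    /** Reserves bytes if they fit in the budget; returns false instead of over-allocating. */
    boolean tryReserve(long bytes) {
        while (true) {
            long used = usedBytes.get();
            if (bytes < 0 || used + bytes > budgetBytes) return false;
            // Retry if another thread changed the counter in the meantime.
            if (usedBytes.compareAndSet(used, used + bytes)) return true;
        }
    }

    /** Returns previously reserved bytes to the budget. */
    void release(long bytes) {
        usedBytes.addAndGet(-bytes);
    }

    long used() {
        return usedBytes.get();
    }
}
```

A rejected reservation is the kind of condition that would lead to the CPU fallback path instead of a device allocation failure.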
Fallback Behavior
graph TD
A["GPU Kernel Call"] --> B{"GPU available?"}
B -->|No| C["CPU SIMD kernel<br/>(same interface)"]
B -->|Yes| D{"Kernel execution OK?"}
D -->|Error| E["Release device memory"]
E --> C
D -->|Success| F["Return GPU results"]
Note
No code changes required. The same method signature returns results regardless of whether GPU or CPU executed the computation. Fallback is automatic and transparent.
Fallback triggers:
- GPU not detected at startup
- CUDA driver not installed
- Insufficient GPU memory
- CUDA kernel execution error
- GPU memory budget exceeded
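
The flow in the diagram amounts to a try-GPU-then-CPU wrapper. The sketch below shows that control flow only; `GpuFallback` and its parameters are hypothetical, and modelling a kernel failure as a `RuntimeException` is an assumption.

```java
import java.util.function.Supplier;

// Hypothetical wrapper illustrating the fallback flow; not Spector's API.
final class GpuFallback {
    static <T> T run(boolean gpuAvailable,
                     Supplier<T> gpuPath,
                     Supplier<T> cpuPath,
                     Runnable releaseDeviceMemory) {
        if (!gpuAvailable) {
            return cpuPath.get(); // GPU missing: go straight to CPU SIMD
        }
        try {
            return gpuPath.get();
        } catch (RuntimeException kernelError) {
            releaseDeviceMemory.run(); // free device memory before falling back
            return cpuPath.get();
        }
    }
}
```

Because both paths return the same type, the caller cannot tell which backend produced the result, which is what makes the fallback transparent.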
Performance Characteristics
Single Query (CPU wins)
| Method | 100K vectors, 384-dim |
|---|---|
| CPU SIMD (AVX2) | ~0.05 ms |
| GPU (kernel launch overhead) | ~0.5–1 ms |
Batch Queries (GPU shines)
| Batch size (vectors) | CPU SIMD | GPU (resident) | GPU Speedup |
|---|---|---|---|
| 10K | 0.35 ms | 0.21 ms | 1.7× |
| 100K | 9.13 ms | 2.24 ms | 4.1× |
| 500K | 45.75 ms | 11.31 ms | 4.0× |
| 1M | 90.77 ms | 22.09 ms | 4.1× |
Important
GPU acceleration was benchmarked on an RTX 4060 Ti 16 GB with 384-dim vectors and the database persistently resident in VRAM. The one-time upload cost is ~464 ms for 1M vectors (1.5 GB). Per-query cost includes only uploading the query vector (~1.5 KB) and downloading results. Under these conditions the GPU delivers a consistent ~4× speedup for brute-force search at 100K vectors and above.
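
The sizes quoted in the note follow directly from the vector shape (4 bytes per float32 component). The sketch below reproduces that arithmetic; `VramFootprint` is a hypothetical helper, and the bandwidth figure is simply derived from the ~464 ms upload time stated above rather than measured separately.

```java
// Hypothetical helper reproducing the arithmetic behind the benchmark note.
final class VramFootprint {
    /** Bytes needed to keep `vectors` float32 vectors of `dims` components resident. */
    static long databaseBytes(long vectors, int dims) {
        return vectors * dims * Float.BYTES;
    }

    /** Bytes uploaded per query: one float32 vector. */
    static long queryBytes(int dims) {
        return (long) dims * Float.BYTES;
    }

    /** Effective transfer rate in GB/s (decimal) for a given upload time. */
    static double uploadGBps(long bytes, double millis) {
        return (bytes / 1e9) / (millis / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(databaseBytes(1_000_000, 384)); // prints 1536000000 (~1.5 GB)
        System.out.println(queryBytes(384));               // prints 1536 (~1.5 KB)
    }
}
```

Dividing the 1.5 GB upload by ~464 ms gives an effective rate of roughly 3.3 GB/s, which is why keeping the database resident matters: re-uploading it per query would dwarf the kernel time.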
Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| gpuAvailable: false | CUDA not installed | Install CUDA toolkit, verify with nvidia-smi |
| Slow GPU queries | Small batch sizes | Increase gpuBatchWindow or disable GPU |
| Out of GPU memory | Budget too low | Increase gpuMemoryBudget |
| CPU fallback always used | Native access not enabled | Add --enable-native-access=ALL-UNNAMED |
JVM Arguments for GPU
java --add-modules jdk.incubator.vector \
--enable-native-access=ALL-UNNAMED \
-jar spector-node.jar
See Also
- Core Concepts: SIMD kernels that GPU extends
- Performance Tuning: When to use GPU vs CPU
- Configuration Guide: GPU parameters
- Architecture Overview: Where GPU fits in the system