🎮 GPU Acceleration

Unlock massive parallel throughput with optional CUDA GPU acceleration. Spector loads GPU kernels via Panama FFM (Foreign Function & Memory), maintaining the zero-JNI philosophy. GPU shines for batch workloads; single queries are already sub-millisecond on CPU SIMD.


🎯 When to Use GPU

graph TD
    Q["How many concurrent queries?"] --> Single["Single query<br/>Low concurrency"]
    Q --> Batch["Batch queries<br/>High concurrency"]

    Single --> CPU["✅ CPU SIMD<br/>Best for HNSW traversal"]
    Batch --> GPU["✅ GPU CUDA<br/>4× speedup at 100K+ vectors"]

    style CPU fill:#d4edda
    style GPU fill:#d4edda

| Scenario | Recommendation |
| --- | --- |
| ✅ Batch search (multiple queries at once) | GPU |
| ✅ Large collections (>100K vectors) | GPU |
| ✅ High concurrency (many simultaneous users) | GPU |
| ✅ Brute-force similarity over IVF partitions | GPU |
| ⚡ Single queries | CPU SIMD |
| ⚡ Small datasets (<10K vectors) | CPU SIMD |
| ⚡ Ultra-low latency (<0.1 ms) | CPU SIMD |

📋 Requirements

Hardware

  • NVIDIA GPU with Compute Capability ≥ 7.0 (Volta or newer)

  • Recommended: RTX 3060+ or A100/H100 for production workloads

Software

| Component | Version | Notes |
| --- | --- | --- |
| CUDA Toolkit | 12.x | Runtime libraries required |
| NVIDIA Driver | 525+ | Must match CUDA version |
| JDK | 25+ | With Panama FFM support |

🐧 Installation (Linux)

# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-4

# Verify
nvidia-smi
nvcc --version

✅ Verify Spector GPU Detection

curl http://localhost:7070/api/v1/status

Example response:

{
  "gpuAvailable": true,
  "gpuInfo": "NVIDIA RTX 4090, 24GB, CUDA 12.4"
}
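The same check can be done from Java, for example in a startup probe. The following is a minimal sketch that assumes only the endpoint and the `gpuAvailable` field shown above; the class and method names are hypothetical, and the regex is a shortcut that avoids pulling in a JSON library.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spector.
final class GpuStatusCheck {
    // Matches "gpuAvailable": true|false in the status payload.
    private static final Pattern GPU_FLAG =
            Pattern.compile("\"gpuAvailable\"\\s*:\\s*(true|false)");

    /** Returns true only if the payload explicitly reports gpuAvailable = true. */
    static boolean parseGpuAvailable(String json) {
        Matcher m = GPU_FLAG.matcher(json);
        return m.find() && Boolean.parseBoolean(m.group(1));
    }

    /** Fetches the status document and reports whether the GPU was detected. */
    static boolean isGpuAvailable(String statusUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(statusUrl)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200 && parseGpuAvailable(response.body());
    }
}
```

Calling `GpuStatusCheck.isGpuAvailable("http://localhost:7070/api/v1/status")` against a running node returns `false` when the field is missing, which is the safe default for a probe.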


⚙️ Configuration

var config = SpectorConfig.DEFAULT
    .withDimensions(384)
    .withGpu(true)
    .withGpuMemoryBudget(2048);  // 2 GB

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| gpuEnabled | false | n/a | Enable CUDA acceleration |
| gpuMemoryBudget | 256 MB | 256 MB – GPU max | Maximum device memory |
| gpuBatchWindow | 10 ms | 1–100 ms | Batching window for query collection |
| gpuMaxBatchSize | 1024 | 1–1024 | Max queries per kernel launch |

Tip

Set gpuMemoryBudget to ~70% of available GPU memory to leave room for other processes.
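As a worked example of the 70% rule, the sketch below derives a budget in MB from the memory available on the device. The helper is hypothetical (it is not a Spector API), and it assumes the budget is expressed in MB, as the `2048` / "2 GB" example above suggests. It clamps to the documented 256 MB lower bound.

```java
// Hypothetical helper, not part of Spector.
final class GpuBudget {
    static final long MIN_BUDGET_MB = 256;  // documented default and lower bound

    /** Returns about 70% of the available device memory in MB, never below 256 MB. */
    static long recommendedBudgetMb(long availableDeviceMemoryMb) {
        long seventyPercent = availableDeviceMemoryMb * 70 / 100;  // integer math, rounds down
        return Math.max(MIN_BUDGET_MB, seventyPercent);
    }

    public static void main(String[] args) {
        // A 24 GB card (24576 MB) yields a budget of 17203 MB.
        System.out.println(recommendedBudgetMb(24 * 1024) + " MB");
    }
}
```

The result would then be passed to `withGpuMemoryBudget(...)`. On a device with less than roughly 366 MB free, the clamp exceeds 70%, so very small GPUs need a manual choice.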


🔬 GPU Kernels

Dot Product Kernel

Computes dot-product similarity between a query vector and a batch of document vectors.

| Property | Value |
| --- | --- |
| Input | query (float32[D]) + database (float32[N × D]) |
| Output | similarity scores (float32[N]) |
| Dimensions | Multiples of 32, range 32–2048 |
| Batch size | 1–1,000,000 vectors per invocation |
| Tolerance | ≤1e-5 absolute error vs CPU SIMD |
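To make the kernel contract concrete, here is a plain-Java scalar reference for the same computation, plus the kind of tolerance check the table describes. It is a sketch for validating results, not Spector's CPU SIMD path or its CUDA kernel. The class name and the row-major layout of the database array are assumptions.

```java
// Hypothetical scalar reference, not part of Spector.
final class DotProductReference {
    /** Scores every row of a row-major [n x dim] database against one query. */
    static float[] dotProducts(float[] query, float[] database, int dim) {
        if (dim < 32 || dim > 2048 || dim % 32 != 0) {
            throw new IllegalArgumentException("dim must be a multiple of 32 in 32..2048");
        }
        if (query.length != dim || database.length % dim != 0) {
            throw new IllegalArgumentException("query/database shape mismatch");
        }
        int n = database.length / dim;
        float[] scores = new float[n];
        for (int row = 0; row < n; row++) {
            int base = row * dim;
            float sum = 0f;
            for (int d = 0; d < dim; d++) {
                sum += query[d] * database[base + d];
            }
            scores[row] = sum;
        }
        return scores;
    }

    /** True when every pair of scores differs by at most tol in absolute terms. */
    static boolean withinTolerance(float[] expected, float[] actual, float tol) {
        if (expected.length != actual.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if (Math.abs(expected[i] - actual[i]) > tol) return false;
        }
        return true;
    }
}
```

A comparison such as `withinTolerance(cpuScores, gpuScores, 1e-5f)` mirrors the tolerance row above. Accumulation order differs between scalar, SIMD, and GPU code, which is why an exact match is not expected.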

Cosine Similarity Kernel

Computes cosine similarity with cached norm computation.

| Optimization | Benefit |
| --- | --- |
| Pre-computes norms | Cached across queries |
| Detects pre-normalized vectors | Skips norm computation |
| Falls back to dot product | For normalized inputs |
| Tolerance | ≤1e-6 vs CPU SIMD |
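The optimizations in this table can be sketched in plain Java as follows. This is an illustration of the idea, not the kernel's implementation: database norms are computed once and reused for every query, and when both sides are unit-length the division is skipped, so the score reduces to a dot product. The class name and the `1e-4` unit-norm threshold are assumptions.

```java
// Hypothetical illustration, not part of Spector.
final class CachedCosine {
    private static final float UNIT_EPS = 1e-4f;  // assumed threshold for "pre-normalized"

    private final float[] database;   // row-major [n x dim]
    private final int dim;
    private final float[] norms;      // computed once, reused for every query
    private final boolean preNormalized;

    CachedCosine(float[] database, int dim) {
        this.database = database;
        this.dim = dim;
        this.norms = new float[database.length / dim];
        boolean allUnit = true;
        for (int row = 0; row < norms.length; row++) {
            norms[row] = norm(database, row * dim, dim);
            if (Math.abs(norms[row] - 1f) > UNIT_EPS) allUnit = false;
        }
        this.preNormalized = allUnit;
    }

    static float norm(float[] v, int offset, int dim) {
        double sum = 0;
        for (int d = 0; d < dim; d++) sum += (double) v[offset + d] * v[offset + d];
        return (float) Math.sqrt(sum);
    }

    boolean isPreNormalized() { return preNormalized; }

    float[] scores(float[] query) {
        float qNorm = norm(query, 0, dim);
        // With unit-length inputs on both sides, cosine equals the dot product.
        boolean dotOnly = preNormalized && Math.abs(qNorm - 1f) <= UNIT_EPS;
        float[] out = new float[norms.length];
        for (int row = 0; row < out.length; row++) {
            float dot = 0f;
            for (int d = 0; d < dim; d++) dot += query[d] * database[row * dim + d];
            float denom = qNorm * norms[row];
            out[row] = dotOnly ? dot : (denom == 0f ? 0f : dot / denom);
        }
        return out;
    }
}
```

Zero-length vectors score `0` here to avoid dividing by zero; how the real kernel treats them is not stated in this guide.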

sequenceDiagram
    participant Q1 as Query A (t=0ms)
    participant Q2 as Query B (t=3ms)
    participant Q3 as Query C (t=7ms)
    participant GPU as 🎮 GPU Kernel

    Note over Q1,GPU: Batch window = 10ms
    Q1->>GPU: Queued
    Q2->>GPU: Queued
    Q3->>GPU: Queued
    Note over GPU: t=10ms: Window closes
    GPU->>GPU: Single kernel for [A, B, C]
    GPU-->>Q1: Top-K results for A
    GPU-->>Q2: Top-K results for B
    GPU-->>Q3: Top-K results for C

Properties:

  • Each query receives its own independent top-K results

  • Individual query errors don't fail the batch

  • Achieves ≥2× throughput vs sequential for batch sizes >4

  • Large batches are automatically partitioned to fit GPU memory
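The batching behavior described above can be sketched with standard `java.util.concurrent` classes. This is a simplified illustration, not Spector's batcher: the first query opens a window, later queries join it, and when the window closes (or the batch is full) everything is dispatched together. The class name is hypothetical, and the per-query `Function` stands in for the work a single batched kernel launch would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical illustration, not part of Spector.
final class WindowedBatcher<Q, R> implements AutoCloseable {
    private static final class Pending<Q, R> {
        final Q query;
        final CompletableFuture<R> result = new CompletableFuture<>();
        Pending(Q query) { this.query = query; }
    }

    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final Function<Q, R> kernel;
    private final long windowMillis;
    private final int maxBatchSize;
    private List<Pending<Q, R>> open = new ArrayList<>();

    WindowedBatcher(Function<Q, R> kernel, long windowMillis, int maxBatchSize) {
        this.kernel = kernel;
        this.windowMillis = windowMillis;
        this.maxBatchSize = maxBatchSize;
    }

    CompletableFuture<R> submit(Q query) {
        Pending<Q, R> pending = new Pending<>(query);
        List<Pending<Q, R>> full = null;
        synchronized (this) {
            final List<Pending<Q, R>> batch = open;
            batch.add(pending);
            if (batch.size() == 1) {
                // The first query opens the window; dispatch happens when it closes.
                timer.schedule(() -> flush(batch), windowMillis, TimeUnit.MILLISECONDS);
            }
            if (batch.size() >= maxBatchSize) full = batch;
        }
        if (full != null) flush(full);  // a full batch does not wait for the window
        return pending.result;
    }

    private void flush(List<Pending<Q, R>> batch) {
        synchronized (this) {
            if (open != batch) return;  // already dispatched because it filled up
            open = new ArrayList<>();
        }
        // Stand-in for one kernel launch over the whole batch.
        for (Pending<Q, R> p : batch) {
            try {
                p.result.complete(kernel.apply(p.query));
            } catch (RuntimeException e) {
                p.result.completeExceptionally(e);  // one bad query does not fail the rest
            }
        }
    }

    @Override
    public void close() {
        timer.shutdown();  // an already-scheduled window still runs before the thread exits
    }
}
```

Each caller gets its own `CompletableFuture`, which is what keeps results and errors independent per query. A longer window collects larger batches at the cost of up to `windowMillis` of added latency for the first query in each batch.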


💾 Memory Management

The GpuMemoryManager handles device memory via Panama FFM:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

// Allocation tied to Arena lifecycle
try (Arena arena = Arena.ofConfined()) {
    MemorySegment deviceMem = gpuMemoryManager.allocateDevice(sizeBytes, arena);
    // Use device memory...
} // Automatically freed when arena closes

Key behaviors:

  • ✅ Allocations are Arena-scoped with explicit lifecycle

  • ✅ Pinned host memory for efficient host↔device transfers

  • ✅ Budget enforcement prevents over-allocation

  • ✅ Device memory released within 100 ms of Arena close

  • ✅ Metrics available via monitoring API
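Budget enforcement with scoped release can be pictured as a counter that is checked before each allocation and decremented when the owning scope closes. The sketch below is an assumption about the general pattern, not the `GpuMemoryManager` source: it uses a plain `AutoCloseable` reservation in place of an Arena so that it runs without a GPU, and all names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration, not part of Spector.
final class DeviceBudget {
    private final long budgetBytes;
    private final AtomicLong reserved = new AtomicLong();

    DeviceBudget(long budgetBytes) { this.budgetBytes = budgetBytes; }

    /** Gives its bytes back on close, much like an Arena-scoped allocation. */
    final class Reservation implements AutoCloseable {
        private final long bytes;
        private boolean released;

        private Reservation(long bytes) { this.bytes = bytes; }

        @Override
        public void close() {
            if (!released) {            // closing twice must not release twice
                released = true;
                reserved.addAndGet(-bytes);
            }
        }
    }

    /** Reserves bytes against the budget, or throws if the budget would be exceeded. */
    Reservation reserve(long bytes) {
        if (bytes < 0) throw new IllegalArgumentException("bytes must be non-negative");
        while (true) {
            long current = reserved.get();
            if (current + bytes > budgetBytes) {
                throw new IllegalStateException("GPU memory budget exceeded: requested "
                        + bytes + " bytes, " + (budgetBytes - current) + " available");
            }
            // Compare-and-set keeps the check and the update atomic across threads.
            if (reserved.compareAndSet(current, current + bytes)) {
                return new Reservation(bytes);
            }
        }
    }

    long reservedBytes() { return reserved.get(); }
}
```

Used in a try-with-resources block, a reservation is returned to the budget as soon as the block exits, including when it exits with an exception.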


🔄 Fallback Behavior

graph TD
    A["GPU Kernel Call"] --> B{"GPU available?"}
    B -->|No| C["⚡ CPU SIMD kernel<br/>(same interface)"]
    B -->|Yes| D{"Kernel execution OK?"}
    D -->|Error| E["Release device memory"]
    E --> C
    D -->|Success| F["✅ Return GPU results"]

Note

No code changes required. The same method signature returns results regardless of whether GPU or CPU executed the computation. Fallback is automatic and transparent.

Fallback triggers:

  • GPU not detected at startup

  • CUDA driver not installed

  • Insufficient GPU memory

  • CUDA kernel execution error

  • GPU memory budget exceeded
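The flowchart and trigger list above amount to a small wrapper around the kernel call. The following sketch shows that control flow with hypothetical names; it is not Spector's dispatch code. The GPU and CPU kernels are passed in as suppliers with the same result type, which is what makes the fallback transparent to the caller.

```java
import java.util.function.Supplier;

// Hypothetical illustration, not part of Spector.
final class KernelFallback {
    /**
     * Runs the GPU kernel when a GPU is available. On any runtime failure,
     * releases device memory and returns the CPU result instead.
     */
    static <T> T run(boolean gpuAvailable,
                     Supplier<T> gpuKernel,
                     Runnable releaseDeviceMemory,
                     Supplier<T> cpuKernel) {
        if (!gpuAvailable) {
            return cpuKernel.get();       // GPU or driver not detected at startup
        }
        try {
            return gpuKernel.get();
        } catch (RuntimeException e) {    // e.g. out of memory or a kernel execution error
            releaseDeviceMemory.run();
            return cpuKernel.get();
        }
    }
}
```

Because both paths return the same type, callers cannot tell which one ran from the result alone; a production version would likely log or count fallbacks so that a permanently failing GPU is noticed.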


📊 Performance Characteristics

Single Query (CPU wins)

| Method | 100K vectors, 384-dim |
| --- | --- |
| ⚡ CPU SIMD (AVX2) | ~0.05 ms |
| 🎮 GPU (kernel launch overhead) | ~0.5–1 ms |

Batch Queries (GPU shines)

| Batch size (vectors) | CPU SIMD | GPU (resident) | GPU Speedup |
| --- | --- | --- | --- |
| 10K | 0.35 ms | 0.21 ms | 1.7× |
| 100K | 9.13 ms | 2.24 ms | 4.1× |
| 500K | 45.75 ms | 11.31 ms | 4.0× |
| 1M | 90.77 ms | 22.09 ms | 4.1× |

Important

GPU acceleration was benchmarked on an RTX 4060 Ti 16 GB with 384-dim vectors and the database persistently resident in VRAM. The one-time upload cost is ~464 ms for 1M vectors (1.5 GB). Per-query cost includes only uploading the query vector (~1.5 KB) and downloading results. The GPU delivers a consistent ~4× speedup for brute-force search at 100K vectors and above; at 10K vectors the gain is 1.7×.
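The figures in this note can be checked with a few lines of arithmetic. The sketch below recomputes the resident footprint and the speedup, and estimates how many searches it takes for the one-time upload to pay for itself. The break-even estimate is an illustration that assumes each table row is the time for one brute-force search over that many vectors; the class is hypothetical.

```java
// Hypothetical worked example, not part of Spector.
final class GpuBreakEven {
    /** Bytes needed to keep the given number of float32 vectors resident. */
    static long residentBytes(long vectors, int dim) {
        return vectors * dim * Float.BYTES;
    }

    static double speedup(double cpuMs, double gpuMs) {
        return cpuMs / gpuMs;
    }

    /** Number of searches after which the one-time upload cost is recovered. */
    static long breakEvenQueries(double uploadMs, double cpuMs, double gpuMs) {
        if (gpuMs >= cpuMs) {
            throw new IllegalArgumentException("GPU must be faster than CPU to break even");
        }
        return (long) Math.ceil(uploadMs / (cpuMs - gpuMs));
    }

    public static void main(String[] args) {
        // 1M vectors x 384 dims x 4 bytes = 1536000000 bytes, about 1.5 GB
        System.out.println(residentBytes(1_000_000, 384));
        // One 384-dim query vector is 1536 bytes, about 1.5 KB
        System.out.println(residentBytes(1, 384));
        // 464 ms upload / (90.77 ms - 22.09 ms) saved per search, rounded up = 7
        System.out.println(breakEvenQueries(464, 90.77, 22.09));
    }
}
```

Under that assumption, the upload is amortized after about 7 searches at 1M vectors, which is why keeping the database resident matters more than the raw kernel speed.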


🔧 Troubleshooting

| Symptom | Cause | Solution |
| --- | --- | --- |
| gpuAvailable: false | CUDA not installed | Install CUDA toolkit, verify nvidia-smi |
| Slow GPU queries | Small batch sizes | Increase gpuBatchWindow or disable GPU |
| Out of GPU memory | Budget too low | Increase gpuMemoryBudget |
| CPU fallback always used | Native access not enabled | Add --enable-native-access=ALL-UNNAMED |

JVM Arguments for GPU

java --add-modules jdk.incubator.vector \
     --enable-native-access=ALL-UNNAMED \
     -jar spector-node.jar

🔗 See Also