Test Framework & LLM Judge

Spector's testing strategy goes beyond traditional unit tests. The project implements an end-to-end (E2E) test framework with an LLM-as-Judge system: a language model semantically validates recall results, catching semantic drift and relevance degradation that deterministic assertions cannot detect.

Why LLM-Based Test Validation?

Semantic search systems face a fundamental testing challenge: correct behavior is subjective. When you query "PostgreSQL connection pool exhaustion" and get back a memory about "HikariCP timeout configuration," is that relevant? A traditional assertion checking for exact string matches would miss it. A human would consider it highly relevant.

The LLM Judge bridges this gap by asking a language model to evaluate whether test results are semantically relevant to the query — the same way a human reviewer would, but automated and repeatable.

flowchart LR
    subgraph Traditional["Traditional Testing"]
        T1["assertThat(results).isNotEmpty()"] --> T2["assertThat(results.get(0).id()).isEqualTo('db-001')"]
    end

    subgraph LLMJudge["LLM Judge Testing"]
        L1["llmAssertRecall(query, results)"] --> L2[".warnIfIrrelevant('database connection')"]
        L2 --> L3[".hasGoodRanking()"]
        L3 --> L4[".coversTopics('database', 'pool')"]
    end

    Traditional --> Brittle["❌ Brittle: breaks when<br/>model changes ranking"]
    LLMJudge --> Semantic["✅ Semantic: validates<br/>meaning, not IDs"]

Architecture

Module Structure

The testing infrastructure spans two Maven modules:

Module	Purpose
`spector-test-support`	Shared LLM Judge framework (module-agnostic)
`spector-memory` (test scope)	E2E test classes, seed data, context management

Component Architecture

graph TB
    subgraph TestClasses["E2E Test Classes (11)"]
        Ingestion["IngestionE2ETest"]
        Scoring["ScoringPipelineE2ETest"]
        Decay["DecayAndLtpE2ETest"]
        Graph["GraphE2ETest"]
        Inhibition["InhibitionE2ETest"]
        Persistence["PersistenceE2ETest"]
        Prospective["ProspectiveMemoryE2ETest"]
        Lifecycle["FullLifecycleE2ETest"]
        Negative["NegativeTestingE2ETest"]
        Profiles["CognitiveProfileE2ETest"]
        Agentic["AgenticConversationE2ETest"]
    end

    subgraph Infrastructure["Test Infrastructure"]
        Abstract["AbstractE2ETest<br/>Base class"]
        Context["E2EMemoryContext<br/>Singleton lifecycle"]
        SeedData["E2ESeedData<br/>Markdown parser"]
        Assertions["E2EAssertions<br/>Helper methods"]
        EntityExtractor["TestEntityExtractor<br/>Mock entities"]
    end

    subgraph JudgeFramework["LLM Judge Framework"]
        LlmAssertions["LlmAssertions<br/>Fluent API"]
        LlmTestJudge["LlmTestJudge<br/>Core engine"]
        JudgePrompts["JudgePromptTemplates<br/>3 prompt types"]
        JudgeVerdict["JudgeVerdict<br/>Structured result"]
        JudgeConfig["LlmJudgeConfig<br/>Environment config"]
    end

    subgraph Providers["LLM Providers"]
        Ollama["OllamaLlmProvider<br/>Ollama REST API"]
        EmbedAPI["TextGenerationProvider<br/>Interface"]
    end

    subgraph SeedFiles["Seed Data (16 files, 213 memories)"]
        S01["01-project-architecture"]
        S02["02-database-work"]
        S03["..."]
        S14["14-agentic-conversation"]
        S15["15-elevate-x-conversation"]
        S16["16-spector-engine-conversation"]
    end

    TestClasses --> Abstract
    Abstract --> Context
    Abstract --> LlmAssertions
    Context --> SeedData
    SeedData --> SeedFiles
    Context --> Ollama
    LlmAssertions --> LlmTestJudge
    LlmTestJudge --> JudgePrompts
    LlmTestJudge --> JudgeVerdict
    LlmTestJudge --> EmbedAPI
    JudgeConfig --> Context
    Ollama -.-> EmbedAPI

E2E Test Framework

Test Lifecycle

All E2E tests share a single SpectorMemory instance through the E2EMemoryContext singleton. The shared lifecycle works as follows:

Seed data is loaded once — 213 memories are embedded and ingested at the start of the test suite
Tests are independent — each test queries the shared memory and validates results
Ollama gating — tests are skipped unless OLLAMA_LIVE=true is set

sequenceDiagram
    participant Runner as JUnit 5
    participant Abstract as AbstractE2ETest
    participant Context as E2EMemoryContext
    participant Ollama as Ollama (embeddings)
    participant Memory as SpectorMemory
    participant Judge as LlmTestJudge

    Runner->>Abstract: @BeforeAll initContext()
    Abstract->>Abstract: Check OLLAMA_LIVE=true
    Abstract->>Context: E2EMemoryContext.get()
    Context->>Ollama: Create EmbeddingProvider
    Context->>Memory: Create SpectorMemory
    Context->>Context: Parse seed markdown files
    Context->>Ollama: Embed all 213 memories
    Context->>Memory: Ingest all memories

    opt LLM_JUDGE=true
        Context->>Judge: Create LlmTestJudge
    end

    Runner->>Abstract: Execute test methods
    Abstract->>Memory: memory.recall(query, options)
    Memory-->>Abstract: List of CognitiveResult

    opt LLM Judge enabled
        Abstract->>Judge: judgeRelevance(query, results, criteria)
        Judge->>Ollama: Generate judgment prompt
        Ollama-->>Judge: JSON verdict
        Judge-->>Abstract: JudgeVerdict
    end
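
The "load once, share everywhere" behaviour in the lifecycle above can be sketched with a lazily initialised singleton. This is only an illustration, not the actual E2EMemoryContext: the class name `SharedContextSketch`, the `LOADS` counter, and the loader parameter are hypothetical, and the loader stands in for parsing, embedding, and ingesting the seed files.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for a suite-wide context that loads seed data once. */
final class SharedContextSketch {
    // Counts how often the expensive setup ran; exists only for this demonstration.
    static final AtomicInteger LOADS = new AtomicInteger();

    private static SharedContextSketch instance;

    final List<String> memories;

    private SharedContextSketch(List<String> memories) {
        this.memories = memories;
    }

    /** Runs the loader on the first call only; later calls reuse the same instance. */
    static synchronized SharedContextSketch get(Supplier<List<String>> seedLoader) {
        if (instance == null) {
            LOADS.incrementAndGet();
            instance = new SharedContextSketch(List.copyOf(seedLoader.get()));
        }
        return instance;
    }

    public static void main(String[] args) {
        // The loader stands in for parsing, embedding, and ingesting the seed files.
        Supplier<List<String>> loader = () -> List.of("db-001", "db-002");
        SharedContextSketch first = get(loader);
        SharedContextSketch second = get(loader);
        System.out.println(first == second);  // both calls see the same instance
        System.out.println(LOADS.get());      // number of times the loader ran
    }
}
```

The `synchronized` accessor keeps the sketch safe if test classes are initialised from several threads; whether the real context needs that depends on how the suite is run.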

Seed Data Format

Seed memories are authored in Markdown with YAML front matter. The E2ESeedData parser reads all .md files from src/test/resources/e2e/memories/:

---
id: db-001
type: EPISODIC
source: OBSERVED
tags: database, postgresql, connection-pool
valence: -10
---
The PostgreSQL connection pool kept exhausting under load.
Increased HikariCP maximum pool size from 10 to 25 and added
connection timeout of 30 seconds. The root cause was a missing
connection release in the batch processing loop.

Supported fields:

Field	Required	Values	Default
`id`	✅	Unique string identifier	—
`type`	✅	`EPISODIC`, `SEMANTIC`, `PROCEDURAL`, `PROSPECTIVE`	—
`source`	❌	`OBSERVED`, `REFLECTED`, `IMAGINED`	`OBSERVED`
`tags`	❌	Comma-separated tag list	`[]`
`valence`	❌	Integer from -20 to 20	`0`
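
To make the field rules concrete, here is a minimal sketch of how such a file could be parsed. It is not the actual E2ESeedData implementation: `SeedMemorySketch` and its methods are hypothetical, and the sketch handles only flat `key: value` lines rather than full YAML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parsed seed memory; the fields mirror the table above. */
record SeedMemorySketch(String id, String type, String source,
                        List<String> tags, int valence, String body) {

    /** Parses one seed file: a '---' delimited header followed by the memory text. */
    static SeedMemorySketch parse(String markdown) {
        String[] lines = markdown.strip().split("\\R");
        if (!lines[0].strip().equals("---")) {
            throw new IllegalArgumentException("front matter must start with ---");
        }
        Map<String, String> fields = new LinkedHashMap<>();
        int i = 1;
        for (; i < lines.length && !lines[i].strip().equals("---"); i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                fields.put(lines[i].substring(0, colon).strip(),
                           lines[i].substring(colon + 1).strip());
            }
        }
        if (i == lines.length) {
            throw new IllegalArgumentException("front matter is not closed with ---");
        }
        String body = String.join("\n",
                Arrays.asList(lines).subList(i + 1, lines.length)).strip();

        // Optional fields fall back to the defaults listed in the table.
        List<String> tags = new ArrayList<>();
        for (String tag : fields.getOrDefault("tags", "").split(",")) {
            if (!tag.isBlank()) {
                tags.add(tag.strip());
            }
        }
        int valence = Integer.parseInt(fields.getOrDefault("valence", "0"));
        if (valence < -20 || valence > 20) {
            throw new IllegalArgumentException("valence out of range: " + valence);
        }
        return new SeedMemorySketch(required(fields, "id"), required(fields, "type"),
                fields.getOrDefault("source", "OBSERVED"), List.copyOf(tags), valence, body);
    }

    /** Rejects files that omit one of the two mandatory fields. */
    private static String required(Map<String, String> fields, String key) {
        String value = fields.get(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException("missing required field: " + key);
        }
        return value;
    }
}
```

Calling `SeedMemorySketch.parse(...)` on the `db-001` example yields three tags, a valence of -10, and the default source `OBSERVED`, since that file does set `source` only if the author writes it.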

Seed Data Categories

The test suite uses 16 seed data files organized by domain:

#	File	Memories	Domain
01	`project-architecture`	10	Java project structure, Spring Boot, microservices
02	`database-work`	13	PostgreSQL, HikariCP, migrations, query optimization
03	`deployment-cicd`	10	Docker, Kubernetes, GitHub Actions, CI/CD
04	`authentication-security`	8	OAuth2, JWT, security incidents
05	`known-facts`	14	Factual Java knowledge (patterns, frameworks)
06	`procedures`	6	Step-by-step runbooks
07	`entity-relationships`	8	People, teams, projects with entity links
08	`surprise-and-lateral`	10	Unexpected discoveries, lateral connections
09	`preferences-and-context`	9	User preferences, tool choices
10	`edge-cases`	20	Adversarial: duplicates, contradictions, multilingual
11	`temporal-sequences`	12	Time-ordered Redis migration chain
12	`ambiguous-queries`	11	Multi-meaning terms ("pool", "spring", "node")
13	`negative-evidence`	7	Anti-patterns, failed approaches
14	`agentic-conversation`	17	Real Promptly app development sessions
15	`elevate-x-conversation`	14	Real Elevate-X fitness app sessions
16	`spector-engine-conversation`	26	Real Spector engine debugging sessions

Total: 213 memories across all categories.

Test Classes

Class	Tests	Focus
`IngestionE2ETest`	4	Memory ingestion, embedding, storage verification
`ScoringPipelineE2ETest`	21	6-phase scoring, valence filtering, deduplication
`DecayAndLtpE2ETest`	5	Temporal decay, long-term potentiation
`GraphE2ETest`	7	Hebbian graph, entity-aware recall, co-activation
`InhibitionE2ETest`	7	Suppression, habituation, retrieval-induced forgetting
`PersistenceE2ETest`	5	WAL, disk persistence, crash recovery
`ProspectiveMemoryE2ETest`	3	Future intents, deadline tracking
`FullLifecycleE2ETest`	23	13-step lifecycle from ingestion to reflection
`NegativeTestingE2ETest`	18	Adversarial: gibberish, wrong domain, empty results
`CognitiveProfileE2ETest`	12	Profile auto-detection, scoring weight verification
`AgenticConversationE2ETest`	11	Agentic conversation recall, cross-domain isolation

LLM Judge Framework

How It Works

The LLM Judge follows a simple pipeline:

flowchart TB
    A["Test calls<br/>llmAssertRecall(query, results)"] --> B["LlmAssertions<br/>builds fluent chain"]
    B --> C{".warnIfIrrelevant()<br/>.isRelevantTo()<br/>.hasGoodRanking()<br/>.coversTopics()"}

    C --> D["JudgePromptTemplates<br/>builds structured prompt"]
    D --> E["LlmTestJudge<br/>calls LLM with low temperature"]
    E --> F["Ollama<br/>generates JSON verdict"]
    F --> G["Parse JSON<br/>extract relevant, confidence, reasoning"]
    G --> H{Verdict?}

    H -->|"RELEVANT ✅"| I["Log success"]
    H -->|"NOT_RELEVANT ❌"| J{"Assertion type?"}

    J -->|isRelevantTo| K["❌ Fail test<br/>with reasoning"]
    J -->|warnIfIrrelevant| L["⚠️ Log warning<br/>test continues"]

Prompt Engineering

Each of the three prompt types in JudgePromptTemplates follows the same structure. The prompt:

Sets the role: "You are a test validation judge"
Provides context: The query, relevance criteria, and truncated results
Defines leniency: "If at least 30% of results are relevant, judge as relevant"
Forces structured output: "Respond ONLY with this exact JSON format"
Limits scope: Max 10 results, 200 chars each to fit token budgets

Example prompt for relevance judgment:

You are a test validation judge. Your job is to determine whether
a set of memory recall results is relevant to a given query.

QUERY: "PostgreSQL connection pool exhaustion timeout"

RELEVANCE CRITERIA: Results should contain memories about database
connection issues

RESULTS:
- The PostgreSQL connection pool kept exhausting under load...
- Increased HikariCP maximum pool size from 10 to 25...
- Switched from Flyway to Liquibase for database migrations...

Respond ONLY with this exact JSON format, no other text:
{"relevant": true, "confidence": 0.85, "reasoning": "Brief explanation"}
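
A prompt like the one above could be assembled roughly as follows. This is a sketch under assumptions, not the real JudgePromptTemplates: the class and method names are hypothetical, and marking truncated results with "..." is inferred from the example rather than confirmed. The limits of 10 results and 200 characters come from the list above.

```java
import java.util.List;

final class RelevancePromptSketch {
    static final int MAX_RESULTS = 10;  // at most 10 results reach the judge
    static final int MAX_CHARS = 200;   // each result is cut to 200 characters

    /** Collapses whitespace and cuts long text, marking the cut with "...". */
    static String truncate(String text) {
        String oneLine = text.replaceAll("\\s+", " ").strip();
        return oneLine.length() <= MAX_CHARS
                ? oneLine
                : oneLine.substring(0, MAX_CHARS) + "...";
    }

    /** Assembles role, query, criteria, capped results, and the output contract. */
    static String build(String query, String criteria, List<String> results) {
        StringBuilder prompt = new StringBuilder();
        prompt.append("You are a test validation judge. Your job is to determine whether\n")
              .append("a set of memory recall results is relevant to a given query.\n\n")
              .append("QUERY: \"").append(query).append("\"\n\n")
              .append("RELEVANCE CRITERIA: ").append(criteria).append("\n\n")
              .append("RESULTS:\n");
        results.stream()
               .limit(MAX_RESULTS)
               .forEach(r -> prompt.append("- ").append(truncate(r)).append('\n'));
        prompt.append("\nRespond ONLY with this exact JSON format, no other text:\n")
              .append("{\"relevant\": true, \"confidence\": 0.85, ")
              .append("\"reasoning\": \"Brief explanation\"}\n");
        return prompt.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(
                "PostgreSQL connection pool exhaustion timeout",
                "Results should contain memories about database connection issues",
                List.of("The PostgreSQL connection pool kept exhausting under load.")));
    }
}
```

Capping both the number of results and their length keeps the prompt size bounded regardless of how much the recall returns, which matters most for the small local models used for judging.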

Response Parsing

The LlmTestJudge handles common LLM output quirks:

Thinking tags: Strips <think>...</think> blocks (qwen3 models)
Markdown fences: Extracts JSON from ```json blocks
Extra text: Uses regex to find the JSON object anywhere in the response
Retry logic: Up to 2 retries on parse failure before giving up gracefully
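
The clean-up steps above can be sketched with the standard library alone. Everything here is hypothetical (`VerdictSketch`, `VerdictParserSketch`, and returning an empty Optional as the "give up" outcome); the real LlmTestJudge may use a JSON library and a different fallback. The object regex assumes the verdict contains no nested braces, and because it searches anywhere in the text it also picks the object out of a fenced block.

```java
import java.util.Optional;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical structured verdict with the three fields from the prompt contract. */
record VerdictSketch(boolean relevant, double confidence, String reasoning) {}

final class VerdictParserSketch {
    private static final Pattern THINK = Pattern.compile("(?s)<think>.*?</think>");
    // First flat {...} object; assumes the verdict has no nested braces.
    private static final Pattern JSON_OBJECT = Pattern.compile("\\{[^{}]*\\}");
    private static final Pattern RELEVANT =
            Pattern.compile("\"relevant\"\\s*:\\s*(true|false)");
    private static final Pattern CONFIDENCE =
            Pattern.compile("\"confidence\"\\s*:\\s*([0-9]*\\.?[0-9]+)");
    private static final Pattern REASONING =
            Pattern.compile("\"reasoning\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Strips thinking blocks, locates the JSON object, and extracts its fields. */
    static Optional<VerdictSketch> parse(String raw) {
        String cleaned = THINK.matcher(raw).replaceAll("");
        Matcher object = JSON_OBJECT.matcher(cleaned);
        if (!object.find()) {
            return Optional.empty();
        }
        String json = object.group();
        Matcher relevant = RELEVANT.matcher(json);
        Matcher confidence = CONFIDENCE.matcher(json);
        if (!relevant.find() || !confidence.find()) {
            return Optional.empty();
        }
        Matcher reasoning = REASONING.matcher(json);
        return Optional.of(new VerdictSketch(
                Boolean.parseBoolean(relevant.group(1)),
                Double.parseDouble(confidence.group(1)),
                reasoning.find() ? reasoning.group(1) : ""));
    }

    /** Calls the model once, then up to maxRetries more times while parsing fails. */
    static Optional<VerdictSketch> judgeWithRetry(Supplier<String> llmCall, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Optional<VerdictSketch> verdict = parse(llmCall.get());
            if (verdict.isPresent()) {
                return verdict;
            }
        }
        return Optional.empty();  // caller decides how to treat a missing verdict
    }
}
```

With `maxRetries` set to 2, a model that answers with unparseable text twice and valid JSON on the third call still produces a verdict; a model that never produces JSON yields an empty result instead of an exception.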

Assertion Types

`isRelevantTo(criteria)`: Hard Assertion

llmAssertRecall(query, results)
    .isRelevantTo("Results must contain security-related memories");
// → Fails test if LLM judges NOT_RELEVANT

Use when: The semantic relationship is a hard business requirement. Example: security queries must never return cooking recipes.

`warnIfIrrelevant(criteria)`: Soft Warning

llmAssertRecall(query, results)
    .warnIfIrrelevant("Results should relate to database connection pooling");
// → Logs: ⚠ LLM Judge [llama3.1]: NOT_RELEVANT (confidence=0.75) — ...

Use when: Semantic quality is important but model-dependent ranking makes hard assertions flaky.

`hasGoodRanking()`: Ranking Quality

llmAssertRecall(query, results)
    .hasGoodRanking();
// → Warns if #5 is clearly more relevant than #1

Use when: Verifying that the scoring pipeline produces sensible ordering.

`coversTopics(topics...)`: Topic Coverage

llmAssertRecall(query, results)
    .coversTopics("database", "connection pool", "timeout");
// → Warns if expected topics are missing from results

Use when: Ensuring recall results span the expected knowledge domains.

Chaining

Every assertion method returns `this`, so calls can be chained:

if (isLlmJudgeEnabled()) {
    llmAssertRecall("AI safety guardrails workout generation", results)
        .isRelevantTo("Results must describe fitness safety validation")
        .hasGoodRanking()
        .coversTopics("safety", "workout", "calorie");
}
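
The difference between the hard and soft assertions, and the reason chaining works, can be shown with a small self-contained sketch. It is not the real LlmAssertions class: `FluentJudgeSketch`, its nested `Verdict` record, and the keyword-matching stub judge are all hypothetical, with the stub standing in for the LLM call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class FluentJudgeSketch {
    /** Hypothetical verdict: the relevance flag plus the judge's reasoning. */
    record Verdict(boolean relevant, String reasoning) {}

    private final List<String> results;
    private final BiFunction<List<String>, String, Verdict> judge;
    final List<String> warnings = new ArrayList<>();

    FluentJudgeSketch(List<String> results,
                      BiFunction<List<String>, String, Verdict> judge) {
        this.results = results;
        this.judge = judge;
    }

    /** Hard assertion: a NOT_RELEVANT verdict fails the test with the reasoning. */
    FluentJudgeSketch isRelevantTo(String criteria) {
        Verdict verdict = judge.apply(results, criteria);
        if (!verdict.relevant()) {
            throw new AssertionError("LLM judge: NOT_RELEVANT: " + verdict.reasoning());
        }
        return this;
    }

    /** Soft assertion: a NOT_RELEVANT verdict is recorded and the chain continues. */
    FluentJudgeSketch warnIfIrrelevant(String criteria) {
        Verdict verdict = judge.apply(results, criteria);
        if (!verdict.relevant()) {
            warnings.add("NOT_RELEVANT: " + verdict.reasoning());
        }
        return this;
    }

    public static void main(String[] args) {
        // Stub judge: "relevant" means some result mentions the criteria keyword.
        BiFunction<List<String>, String, Verdict> stub = (items, keyword) ->
                items.stream().anyMatch(item -> item.contains(keyword))
                        ? new Verdict(true, "keyword found")
                        : new Verdict(false, "no result mentions " + keyword);

        FluentJudgeSketch chain = new FluentJudgeSketch(
                List.of("HikariCP pool size raised to 25"), stub);
        chain.warnIfIrrelevant("Redis").isRelevantTo("HikariCP");
        System.out.println(chain.warnings);
    }
}
```

Because each method returns the same object, a soft check that records a warning does not stop a later hard check in the same chain from running.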

Configuration

Environment Variables

Variable	Default	Description
`OLLAMA_LIVE`	`false`	Gate for all E2E tests (embedding + recall)
`LLM_JUDGE`	`false`	Enable LLM-based semantic validation
`LLM_JUDGE_MODEL`	`llama3.1`	Ollama model for judging
`LLM_JUDGE_URL`	`http://localhost:11434`	Ollama server URL
`LLM_JUDGE_CONFIDENCE`	`0.6`	Minimum confidence threshold
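
As an illustration of how these settings and their defaults fit together, the sketch below resolves them from a map of variables. It is not the actual LlmJudgeConfig: the record name, its fields, and the use of `Boolean.parseBoolean` for the flags are assumptions; only the variable names and default values come from the table.

```java
import java.util.Map;

/** Hypothetical resolved configuration; defaults follow the table above. */
record JudgeConfigSketch(boolean ollamaLive, boolean judgeEnabled, String model,
                         String url, double minConfidence) {

    /** Reads each variable from the map, falling back to the documented default. */
    static JudgeConfigSketch fromEnv(Map<String, String> env) {
        return new JudgeConfigSketch(
                Boolean.parseBoolean(env.getOrDefault("OLLAMA_LIVE", "false")),
                Boolean.parseBoolean(env.getOrDefault("LLM_JUDGE", "false")),
                env.getOrDefault("LLM_JUDGE_MODEL", "llama3.1"),
                env.getOrDefault("LLM_JUDGE_URL", "http://localhost:11434"),
                Double.parseDouble(env.getOrDefault("LLM_JUDGE_CONFIDENCE", "0.6")));
    }

    public static void main(String[] args) {
        // Prints the configuration resolved from the current process environment.
        System.out.println(fromEnv(System.getenv()));
    }
}
```

Taking the variables as a `Map` rather than calling `System.getenv()` inside the method keeps the resolution logic testable without touching the real environment.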

Running Tests

# Run all E2E tests with Ollama embeddings (no LLM judge)
mvn test -pl spector-memory -DOLLAMA_LIVE=true

# Run with LLM judge enabled
mvn test -pl spector-memory -DOLLAMA_LIVE=true -DLLM_JUDGE=true

# Run specific test class
mvn test -pl spector-memory -DOLLAMA_LIVE=true \
    -Dtest=AgenticConversationE2ETest

# Run with custom model
mvn test -pl spector-memory -DOLLAMA_LIVE=true \
    -DLLM_JUDGE=true -DLLM_JUDGE_MODEL=qwen3:0.6b

Design Decisions

Why Non-Blocking by Default?

LLMs are non-deterministic: the same prompt can produce different verdicts across runs, especially with smaller models. Making every LLM assertion hard-fail by default would create flaky tests. Instead:

warnIfIrrelevant is the recommended default — it provides semantic signal in test logs without blocking CI/CD
isRelevantTo is reserved for invariants that should never be violated (e.g., cross-domain isolation)
The LLM_JUDGE_FAIL_ON_REJECT flag exists for strict validation environments

Why Ollama?

Local-first: No API keys, no network dependency, no cost
Reproducible: Pinning a specific model version keeps results comparable across runs, although individual verdicts can still vary
Fast: Small models (qwen3:0.6b) produce judgments in < 500ms
CI-friendly: Ollama runs as a sidecar container in CI pipelines

Why Separate Module?

spector-test-support is a standalone module rather than test-scoped code inside spector-memory because:

Reusability: Any Spector module can depend on it for LLM-based testing
Clean dependencies: The judge framework depends on spector-provider-api and spector-providers, not on spector-memory
Independent versioning: Test infrastructure evolves on its own schedule

Current Metrics

Metric	Value
Seed memory files	16
Total seed memories	213
E2E test classes	11
Total E2E tests	116
LLM assertions	19
Hard assertions (`isRelevantTo`)	4
Soft assertions (`warnIfIrrelevant`)	12
Topic coverage (`coversTopics`)	2
Ranking checks (`hasGoodRanking`)	1