spector-commons 📄
Ingestion utilities, text tokenizers, semantic chunkers, and document content extractors for Spector.
spector-commons handles the preprocessing phase of document ingestion. It parses raw file formats (HTML, PDF, plain text), extracts the core text content, and splits it into chunks on character-level, token-level, or streaming boundaries so that each chunk fits the model's context window before embeddings are generated.
🏗️ Core Architecture & Roles
- Semantic Chunkers (TextChunker/TokenChunker): Segment large text blocks into overlapping passages to maintain query context and respect model token limits.
- Streaming Chunkers (StreamingChunker): High-throughput chunking controller designed to ingest streams of tokens or characters with sliding context windows.
- Content Extraction (ContentExtractor/PdfDocumentReader): Pure Java, zero-dependency HTML parser and PDF decoder designed to extract structured text without heavy external libraries.
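The sliding-window idea behind the streaming role can be sketched as follows. This is a minimal illustration, not StreamingChunker's actual implementation: the class name SlidingWindowSketch, the method stream, and its parameters are hypothetical, and the sketch works on characters (UTF-16 code units) only. It emits a chunk whenever the window fills, then keeps the last `overlap` characters as context for the next chunk.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; not part of spector-commons.
final class SlidingWindowSketch {

    /**
     * Reads characters from {@code in} and passes windows of {@code windowSize}
     * characters to {@code sink}. Consecutive windows share {@code overlap} characters.
     */
    static void stream(Reader in, int windowSize, int overlap, Consumer<String> sink) {
        if (windowSize <= 0 || overlap < 0 || overlap >= windowSize) {
            throw new IllegalArgumentException("require 0 <= overlap < windowSize");
        }
        StringBuilder window = new StringBuilder(windowSize);
        char[] buf = new char[1024];
        int fresh = 0; // characters appended since the last emitted window
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    window.append(buf[i]);
                    fresh++;
                    if (window.length() == windowSize) {
                        sink.accept(window.toString());
                        // Drop everything except the trailing overlap.
                        window.delete(0, windowSize - overlap);
                        fresh = 0;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit the tail only if it holds text that was not already emitted.
        if (fresh > 0) {
            sink.accept(window.toString());
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        stream(new StringReader("abcdefghijk"), 4, 1, out::add);
        System.out.println(out);
    }
}
```

Because only one window is buffered at a time, memory use stays bounded by `windowSize` regardless of how long the input stream is.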
🚀 Key APIs
Token-level Overlapping Chunking
import java.util.List;

String text = "Large document content...";
int maxTokens = 256; // upper bound on tokens per chunk
int overlap = 32;    // tokens shared between consecutive chunks
List<Chunk> chunks = TokenChunker.chunk(text, maxTokens, overlap);
for (Chunk chunk : chunks) {
    System.out.printf("Chunk %d (%d tokens) -> %s%n", chunk.index(), chunk.tokenCount(), chunk.text());
}
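To make the effect of `maxTokens` and `overlap` concrete, here is a rough sketch of overlapping token windows. It is not TokenChunker's implementation: OverlapChunkSketch and SketchChunk are hypothetical names, and the sketch assumes whitespace-separated words as tokens, whereas the real tokenizer may split text differently. Each window starts `maxTokens - overlap` tokens after the previous one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not part of spector-commons.
final class OverlapChunkSketch {

    /** Stand-in for the library's Chunk type, with the same accessor names as the example above. */
    record SketchChunk(int index, int tokenCount, String text) {}

    static List<SketchChunk> chunk(String text, int maxTokens, int overlap) {
        if (maxTokens <= 0 || overlap < 0 || overlap >= maxTokens) {
            throw new IllegalArgumentException("require 0 <= overlap < maxTokens");
        }
        List<SketchChunk> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        // Assumption: tokens are whitespace-separated words.
        String[] tokens = trimmed.split("\\s+");
        int step = maxTokens - overlap; // distance between window starts
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + maxTokens, tokens.length);
            String body = String.join(" ", Arrays.copyOfRange(tokens, start, end));
            chunks.add(new SketchChunk(chunks.size(), end - start, body));
            if (end == tokens.length) {
                break; // the last window reached the end of the input
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (SketchChunk c : chunk("a b c d e f g h i j", 4, 1)) {
            System.out.printf("Chunk %d (%d tokens) -> %s%n", c.index(), c.tokenCount(), c.text());
        }
    }
}
```

With ten tokens, `maxTokens = 4`, and `overlap = 1`, the windows start at tokens 0, 3, and 6, so each chunk repeats the final token of the one before it.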