📊 Semantic Data Model: Deep Dive

The Semantic Data Model is Synaptiq's structured representation of an organization's data universe, enabling accurate, governed AI reasoning.


Schema Registry

Auto-Inference Pipeline

flowchart LR
    Source["Data Source<br/>JSON / YAML / CSV"] --> Sample["Document Sampling<br/>Statistical analysis"]
    Sample --> Infer["Type Inference<br/>String, Number, Date, Enum"]
    Infer --> Enrich["Enrichment<br/>Cardinality, nullability"]
    Enrich --> Register["Schema Registry<br/>Per-tenant storage"]
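The inference and enrichment stages can be illustrated with a minimal sketch. It assumes the sampled documents are flat dictionaries of scalar values; the function names, the ISO 8601 date check, and the `ENUM_MAX_DISTINCT` cutoff are illustrative assumptions, not Synaptiq's actual rules.

```python
from datetime import datetime

ENUM_MAX_DISTINCT = 10  # assumed cutoff for treating a string field as an enum


def _is_iso_date(text):
    """Return True if the string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Guess a field type from the non-null sampled values of one field."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "number"
    if all(isinstance(v, str) for v in values):
        if all(_is_iso_date(v) for v in values):
            return "date"
        distinct = len(set(values))
        # Repeated values drawn from a small set suggest an enum.
        if distinct <= ENUM_MAX_DISTINCT and distinct < len(values):
            return "enum"
    return "string"  # plain strings and mixed types fall back to string


def infer_schema(documents):
    """Infer type, cardinality, and nullability for every field seen in the sample."""
    field_names = sorted({key for doc in documents for key in doc})
    schema = {}
    for name in field_names:
        present = [doc[name] for doc in documents if doc.get(name) is not None]
        schema[name] = {
            "type": infer_type(present) if present else "string",
            "cardinality": len(set(present)),
            "nullable": len(present) < len(documents),
        }
    return schema


sample = [
    {"id": "p1", "price": 19.99, "category": "Home", "createdAt": "2024-01-05"},
    {"id": "p2", "price": 5, "category": "Home", "createdAt": "2024-02-11"},
    {"id": "p3", "price": 42.0, "category": "Sports", "createdAt": "2024-03-01", "rating": None},
]
schema = infer_schema(sample)
```

With this sample, `category` is inferred as an enum because its values repeat, while `id` stays a string because every value is unique. A production pipeline would likely need a larger sample before trusting either call.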

Entity Model

{
  "entities": [
    {
      "name": "Product",
      "fields": [
        { "name": "id", "type": "string", "primary": true },
        { "name": "name", "type": "string", "searchable": true },
        { "name": "price", "type": "number", "metric": "revenue" },
        { "name": "category", "type": "enum", "dimension": true,
          "values": ["Electronics", "Clothing", "Home", "Sports"] },
        { "name": "rating", "type": "number", "metric": "satisfaction" }
      ],
      "relationships": [
        { "target": "Order", "type": "one-to-many", "via": "productId" }
      ]
    }
  ]
}
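As a rough illustration of how a consumer might read this model, the sketch below loads the Product entity shown above and groups its fields by role. The `field_roles` helper is hypothetical and is not part of any Synaptiq API.

```python
import json

# The entity model shown above, embedded so the sketch is self-contained.
ENTITY_MODEL = json.loads("""
{
  "entities": [
    {
      "name": "Product",
      "fields": [
        { "name": "id", "type": "string", "primary": true },
        { "name": "name", "type": "string", "searchable": true },
        { "name": "price", "type": "number", "metric": "revenue" },
        { "name": "category", "type": "enum", "dimension": true,
          "values": ["Electronics", "Clothing", "Home", "Sports"] },
        { "name": "rating", "type": "number", "metric": "satisfaction" }
      ],
      "relationships": [
        { "target": "Order", "type": "one-to-many", "via": "productId" }
      ]
    }
  ]
}
""")


def field_roles(entity):
    """Group an entity's field names by the role flags set on each field."""
    roles = {"primary": [], "searchable": [], "dimensions": [], "metrics": {}}
    for field in entity["fields"]:
        if field.get("primary"):
            roles["primary"].append(field["name"])
        if field.get("searchable"):
            roles["searchable"].append(field["name"])
        if field.get("dimension"):
            roles["dimensions"].append(field["name"])
        if "metric" in field:
            # Map the field name to the business metric it feeds.
            roles["metrics"][field["name"]] = field["metric"]
    return roles


product = ENTITY_MODEL["entities"][0]
roles = field_roles(product)
```

Here `roles["metrics"]` maps `price` to `revenue` and `rating` to `satisfaction`, and `roles["dimensions"]` contains only `category`.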

Metrics & Dimensions

| Concept | Definition | Example |
| --- | --- | --- |
| Metric | Quantitative, aggregatable value | Revenue, Order Count, Avg Rating |
| Dimension | Qualitative, categorical attribute | Region, Category, Time Period |
| Measure | Computed from metrics | Profit Margin = (Revenue - Cost) / Revenue |
| Vocabulary | Domain-specific terms | "Churn" = inactive > 90 days |
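The four concepts can be made concrete with a small sketch: a metric summed per dimension value, a measure computed from two metrics, and a vocabulary term expressed as a rule. The row layout and function names are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_by_dimension(rows, dimension, metric):
    """Sum a metric for each distinct value of a dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[metric]
    return dict(totals)


def profit_margin(revenue, cost):
    """Measure: (Revenue - Cost) / Revenue, or None when revenue is zero."""
    if revenue == 0:
        return None
    return (revenue - cost) / revenue


def is_churned(last_active, today, threshold_days=90):
    """Vocabulary: a customer has churned when inactive for more than 90 days."""
    return (today - last_active) > timedelta(days=threshold_days)


rows = [
    {"region": "EU", "revenue": 100.0, "cost": 60.0},
    {"region": "EU", "revenue": 50.0, "cost": 30.0},
    {"region": "US", "revenue": 200.0, "cost": 150.0},
]
revenue = aggregate_by_dimension(rows, "region", "revenue")
cost = aggregate_by_dimension(rows, "region", "cost")
margins = {region: profit_margin(revenue[region], cost[region]) for region in revenue}
```

The measure is computed from the aggregated metrics rather than averaged per row, which keeps the margin weighted by revenue.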

Vector Search Integration

// Vector search index definition
{
  "type": "vectorSearch",
  "fields": [
    {
      "path": "embedding",
      "type": "vector",
      "numDimensions": 768,
      "similarity": "cosine"
    },
    {
      "path": "tenantId",
      "type": "filter"
    }
  ]
}
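A query against an index like this might look as follows, assuming it is a MongoDB Atlas Vector Search index queried through the `$vectorSearch` aggregation stage. The index name `vector_index`, the projected `text` field, and the `numCandidates` multiplier are hypothetical choices for this sketch; only `embedding`, `tenantId`, and the 768 dimensions come from the definition above.

```python
EMBEDDING_DIMENSIONS = 768  # matches numDimensions in the index definition


def build_vector_search_pipeline(query_vector, tenant_id, top_k=5,
                                 index_name="vector_index"):
    """Build an aggregation pipeline that searches one tenant's vectors."""
    if len(query_vector) != EMBEDDING_DIMENSIONS:
        raise ValueError(
            f"expected {EMBEDDING_DIMENSIONS} dimensions, got {len(query_vector)}"
        )
    return [
        {
            "$vectorSearch": {
                "index": index_name,          # hypothetical index name
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": top_k * 20,  # assumed oversampling factor
                "limit": top_k,
                # tenantId is indexed as a filter field, so it can pre-filter.
                "filter": {"tenantId": {"$eq": tenant_id}},
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,                    # hypothetical chunk text field
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.0] * 768, tenant_id="tenant-a")
```

The resulting list could be passed to a driver call such as `collection.aggregate(pipeline)`. Filtering on `tenantId` inside the search stage, rather than afterwards, keeps one tenant's documents out of another tenant's candidate set.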

Embedding Pipeline

flowchart LR
    Doc["📄 Document"] --> Chunk["✂️ Chunk<br/>1000 tokens"] --> Embed["🔒 Embed<br/>nomic-embed-text"] --> Store["💾 MongoDB<br/>Vector Store"]

| Setting | Default | Description |
| --- | --- | --- |
| Chunk size | 1000 tokens | Size of each document chunk |
| Chunk overlap | 200 tokens | Overlap between adjacent chunks |
| Embedding model | nomic-embed-text | Ollama embedding model |
| Dimensions | 768 | Vector dimensions |
| Similarity | Cosine | Similarity metric |
| Top-K | 5 | Number of results per query |
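To show how the chunk size and overlap defaults interact, here is a minimal sketch of overlapping chunking. It assumes the document has already been tokenized into a list; the tokenizer itself and the `chunk_tokens` name are not taken from the source.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=200):
    """Split a token list into chunks that share `overlap` tokens with the next chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(2500)]  # stand-in for real tokenizer output
chunks = chunk_tokens(tokens)
```

With the defaults, each chunk starts 800 tokens after the previous one, so a 2500-token document yields three chunks, the last of which is shorter than the rest.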

How the AI Uses the Schema

When a user asks a question, the semantic schema is injected into the system prompt:

You have access to the following data model:

Entity: Product
  - name (string, searchable)
  - price (number, metric: revenue)  
  - category (enum: Electronics, Clothing, Home, Sports)
  - rating (number, metric: satisfaction)

Entity: Order
  - orderId (string, primary)
  - customerId (string, FK → Customer)
  - total (number, metric: revenue)
  - status (enum: pending, shipped, delivered, returned)

Relationships:
  Customer → Orders (one-to-many)
  Order → Products (many-to-many)

This ensures the AI:

- ✅ Uses real field names, not hallucinated ones
- ✅ Applies correct aggregations (sum, avg, count)
- ✅ Respects data types (doesn't try to sum strings)
- ✅ Understands relationships for joins and drill-downs
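One way to picture the injection step is a small renderer that turns entity definitions into the prompt text shown above, plus a check that flags field names the schema does not contain. Both helpers are hypothetical sketches; the source does not describe how Synaptiq builds or validates the prompt.

```python
PRODUCT = {
    "name": "Product",
    "fields": [
        {"name": "name", "type": "string", "searchable": True},
        {"name": "price", "type": "number", "metric": "revenue"},
        {"name": "category", "type": "enum",
         "values": ["Electronics", "Clothing", "Home", "Sports"]},
        {"name": "rating", "type": "number", "metric": "satisfaction"},
    ],
}


def describe_field(field):
    """Render one field as a prompt line, e.g. '  - price (number, metric: revenue)'."""
    if field["type"] == "enum":
        notes = ["enum: " + ", ".join(field["values"])]
    else:
        notes = [field["type"]]
    if field.get("primary"):
        notes.append("primary")
    if field.get("searchable"):
        notes.append("searchable")
    if "metric" in field:
        notes.append(f"metric: {field['metric']}")
    return f"  - {field['name']} ({', '.join(notes)})"


def render_schema_prompt(entities):
    """Build the data-model section of a system prompt from entity definitions."""
    lines = ["You have access to the following data model:", ""]
    for entity in entities:
        lines.append(f"Entity: {entity['name']}")
        lines.extend(describe_field(field) for field in entity["fields"])
        lines.append("")
    return "\n".join(lines).rstrip()


def unknown_fields(entity, requested):
    """Return requested field names that the entity does not define."""
    known = {field["name"] for field in entity["fields"]}
    return [name for name in requested if name not in known]


prompt = render_schema_prompt([PRODUCT])
```

A check like `unknown_fields(PRODUCT, ["price", "cost"])` returns `["cost"]`, which could be used to reject a generated query that refers to a field missing from the schema.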