RAG Routing

One API endpoint. Ten supported retrieval backends. The router learns from your historical query traffic and dispatches each new question to the backend most likely to answer it correctly, at the lowest cost that still passes your quality bar.


The three architectures, conceptually

Most production RAG systems ship one retrieval architecture and call it done. We ship a router that picks across architecturally distinct stacks — the right choice is rarely the same for every query in your corpus.

Tier 1 · Flat-Vector RAG

FAST & CHEAP

embed → cosine top-k → stuff context → generate

Best for

Single-fact lookups, FAQ-shaped queries, "what is X?" questions on flat-chunked corpora.

Latency: < 300 ms p95
Cost: cents per query
Backends: Qdrant · Cloudflare · Vertex · MongoDB · Redis
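The Tier 1 flow (embed → cosine top-k → stuff context → generate) can be sketched in a few lines. This is a minimal illustration, not the production implementation: the 2-d vectors are made-up stand-ins for real embeddings, and `top_k` and `build_prompt` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=3):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

def build_prompt(question, hits):
    """'Stuff' the retrieved chunks into a single prompt for the generator."""
    context = "\n".join(f"- {text}" for _, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Toy 2-d embeddings (assumption); a real system would call an embedding model.
chunks = [
    ("Force majeure means an event beyond a party's control.", [0.9, 0.1]),
    ("Invoices are payable within 30 days.", [0.1, 0.9]),
    ("Termination requires 60 days' written notice.", [0.5, 0.5]),
]
hits = top_k([1.0, 0.0], chunks, k=2)
prompt = build_prompt("What is force majeure?", hits)
print(prompt)
```

The generate step is left out; `prompt` is what would be handed to the language model.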

Tier 2 · Hybrid + Rerank

BALANCED

BM25 lexical + dense vector → Reciprocal Rank Fusion → cross-encoder reranker → generate

Best for

Queries where lexical and semantic signals disagree — codes, names, acronyms, technical vocabulary, error strings.

Latency: ~ 800 ms
Cost: still low
Today: composable workflow node · auto-router roadmap
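To make the fusion step concrete, here is a small sketch of Reciprocal Rank Fusion over a lexical ranking and a dense ranking. It assumes the commonly used constant k = 60 and made-up document ids; the cross-encoder reranking step that follows fusion in Tier 2 is not shown.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, best first; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings from the two retrievers.
bm25_ranking = ["err-404", "faq-12", "guide-3"]
dense_ranking = ["guide-3", "err-404", "intro-1"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear high in both lists rise to the top, which is why fusion helps when the lexical and semantic signals disagree.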

Tier 3 · Page-Index + Agent

DEEP & DELIBERATE

hierarchical TOC tree built at ingest → agent walks tree → opens / reads sections → generate

Best for

Multi-hop reading of long structured documents — legal contracts, financial 10-Ks, technical PDFs where context spans non-adjacent sections.

Latency: multi-second
Cost: highest, but only when needed
Backends: PageIndex · RAPTOR · LightRAG · neo4j-hybrid
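The tree walk can be pictured with a toy example. This sketch is an assumption-heavy simplification: a greedy word-overlap score stands in for the agent's judgement, the `Node` type and `walk` function are hypothetical, and a real agent may open several sections rather than follow a single path.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    text: str = ""
    children: list = field(default_factory=list)

def overlap(question, title):
    """Stand-in relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(title.lower().split()))

def walk(node, question, path=()):
    """Descend greedily toward the best-matching child; return (path, leaf text)."""
    path = path + (node.title,)
    if not node.children:
        return path, node.text
    best = max(node.children, key=lambda child: overlap(question, child.title))
    return walk(best, question, path)

# Hypothetical TOC tree, as it might be built at ingest.
toc = Node("Master Agreement", children=[
    Node("Payment terms", children=[
        Node("Invoice schedule", "Invoices are issued monthly."),
        Node("Late payment fees", "Late payments accrue interest."),
    ]),
    Node("Termination", children=[
        Node("Termination for cause", "Either party may terminate on material breach."),
        Node("Termination notice period", "Termination requires 60 days' written notice."),
    ]),
])

path, section = walk(toc, "what is the termination notice period")
print(" > ".join(path))
print(section)
```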

How the router actually decides

Most published RAG routers classify the query upfront by complexity. Ours doesn't. We use learned routing: every successful query is stored with the backend that answered it, and new queries are matched against that history, first by exact hash and then by embedding similarity.

The lookup algorithm — what runs on every query

1. Hash the question with SHA-256, truncated to a 16-character key, and check the per-customer routing store in Cloudflare KV for an exact prior match. If it's been answered before, dispatch immediately to the backend that did best last time.
2. On miss, embed the question and cosine-search against the cached index of historical question embeddings. If the nearest neighbour's similarity exceeds 0.88, dispatch to its associated backend.
3. On no match above threshold, fall back to the customer's default backend for that corpus.
4. After the answer is rendered, the (question hash, backend, quality score) tuple is written back into the per-customer routing-history store, seeding future lookups.
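The lookup steps above can be sketched as a single function. This is an illustration under stated assumptions, not the shipped router: a plain dict stands in for the Cloudflare KV store, `toy_embed` returns made-up vectors, the question is hashed as-is with no normalisation, and the `"exact-hash"` label is hypothetical (`"learned-history"` and `"fallback"` mirror the response example).

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.88  # threshold stated in the text

def question_key(question):
    """SHA-256 of the question, truncated to a 16-character hex key."""
    return hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(question, embed, exact_store, history, default_backend):
    """Return (backend, match_source) for a question.

    exact_store: dict of question key -> backend (stand-in for the KV store)
    history:     list of (embedding, backend) pairs from past successful queries
    """
    # Step 1: exact match on the truncated hash.
    key = question_key(question)
    if key in exact_store:
        return exact_store[key], "exact-hash"
    # Step 2: nearest neighbour among historical question embeddings.
    query_vec = embed(question)
    best_sim, best_backend = 0.0, None
    for vec, backend in history:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_sim, best_backend = sim, backend
    if best_backend is not None and best_sim > SIMILARITY_THRESHOLD:
        return best_backend, "learned-history"
    # Step 3: nothing close enough, so use the corpus default.
    return default_backend, "fallback"

def toy_embed(question):
    """Made-up 2-d embeddings for the demo; a real system would call a model."""
    vectors = {"Which amendment overrides section 7.3?": [0.95, 0.05]}
    return vectors.get(question, [0.6, 0.6])

exact_store = {question_key("What is force majeure?"): "qdrant"}
history = [([1.0, 0.0], "pageindex"), ([0.0, 1.0], "qdrant")]

for q in ("What is force majeure?",
          "Which amendment overrides section 7.3?",
          "How do I reset my password?"):
    print(q, "->", route(q, toy_embed, exact_store, history, "cloudflare-v2"))
```

The write-back in step 4 happens after the answer is rendered and is not part of this function.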

Why "learned" instead of "classified"? Empirically the same query shape behaves differently on different corpora. "Compare X across Y" on legal contracts wants Tier 3 page-index traversal; the same shape on a flat FAQ corpus is fine on Tier 1. Letting the routing model learn that distinction per-corpus from historical evidence, rather than guessing from query syntax, is the design choice that actually shipped.

The ten backends we route between today

The router dispatches to one of ten named backends. Four of them are "Tier 3-shaped" (hierarchical or graph-enhanced); the other six are pure-vector engines we treat as Tier 1 with different operational tradeoffs.

pageindex: Hierarchical TOC tree + agentic traversal. The Tier 3 archetype.
raptor: Tree-traversal retrieval over recursively summarised document hierarchies (ICLR 2024).
neo4j-hybrid: Graph-enhanced retrieval combining vector embeddings with explicit entity / relationship structure.
lightrag: Flat-graph dual-mode retrieval (entity + community search), the HKU LightRAG approach.
qdrant: Self-hosted dense-vector engine for high-throughput, low-latency lookups.
cloudflare-v2: Vectorize at the edge, sub-300 ms p95 from Cloudflare's global network.
couchbase-byok: BYO Couchbase vector store for customers with existing operational dependencies.
vertex-ai-vector-search-v2: Google Cloud Vertex AI vector search for customers on Google's data stack.
mongodb-atlas: Atlas Vector Search for customers running document data on MongoDB.
redis-vector-search: Redis vector search for ultra-low-latency in-memory retrieval workloads.

Tier 2 (BM25 + dense fusion + cross-encoder reranker) ships in our workflow canvas as a composable node today. The auto-router will target it next, once the per-corpus routing data justifies it.

API surface — one endpoint, audit-grade transparency

The router is invisible to your caller. One request shape; the response includes the routing decision so you can audit which backend answered (and why).

# One endpoint. The router decides which backend to use.
curl -X POST https://api.divinci.app/v1/rag/query \
  -H "Authorization: Bearer $DIVINCI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What clauses in the 2024 amendment override section 7.3?",
    "corpus":   "legal-contracts-q4"
  }'
# Response — chunks the agent needs to ground the answer.
{
  "items": [
    {
      "content":  "Section 7.3 is superseded by …",
      "metadata": { "doc": "amendment-2024.pdf", "section": "II.4.b" },
      "score":    0.91
    }
    /* … */
  ],
  "routing": {
    "backend":      "pageindex",           // dispatched tier-3 page-index
    "match_source": "learned-history",     // arena · auto-fix · or fallback
    "similarity":   0.92,                  // ≥ 0.88 threshold
    "ttl_remaining":"23d 14h"              // freshness window before re-benchmark
  }
}

Today the routing metadata is logged internally and surfaced via the audit trail. Inline delivery in the response, as in the example above, is rolling out across Q3 2026.
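For callers who prefer Python to curl, the same request can be sketched with the standard library. The endpoint, headers, and body fields come from the curl example above; `build_request`, `query`, and `describe_routing` are hypothetical helper names, and the sketch assumes the live response is plain JSON whose `routing` block may be absent until inline delivery ships.

```python
import json
import urllib.request

API_URL = "https://api.divinci.app/v1/rag/query"  # endpoint from the curl example

def build_request(question, corpus, api_key):
    """Build the POST request without sending it."""
    body = json.dumps({"question": question, "corpus": corpus}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def query(question, corpus, api_key, timeout=30):
    """Send the request and return the parsed JSON response."""
    request = build_request(question, corpus, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)

def describe_routing(response):
    """Summarise the routing block, tolerating responses that omit it."""
    routing = response.get("routing")
    if routing is None:
        return "routing metadata not included in response"
    return f"{routing['backend']} via {routing['match_source']}"

# Example against a response shaped like the one above (no network call made).
sample = {
    "items": [{"content": "Section 7.3 is superseded by …", "score": 0.91}],
    "routing": {"backend": "pageindex", "match_source": "learned-history"},
}
print(describe_routing(sample))
```

In real use, `query(question, corpus, os.environ["DIVINCI_API_KEY"])` would replace the sample dictionary.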

How this differs from existing routers

RAG routing isn't a new idea — academic routers like Adaptive-RAG and Probing-RAG already classify queries by complexity. The differentiation is that Divinci routes across architecturally distinct retrieval stacks, learned from your own traffic, behind one managed endpoint.

Offering	What it routes between	Routing axis	Managed?
Divinci RAG Routing	10 backends (PageIndex, RAPTOR, LightRAG, neo4j, 6 vector engines)	Architecture · learned from history	Yes — single endpoint
LlamaIndex RouterRetriever	BYO retrievers	LLM/Pydantic selector	No — library you assemble
Adaptive-RAG (Jeong et al.)	no-retrieval / single-step / iterative	Depth · query complexity classifier	Research
Cloudflare AI Search (ex-AutoRAG)	One hybrid pipeline	No routing	Yes
AWS Bedrock Knowledge Bases	One hybrid pipeline	No routing	Yes
Azure AI Search Agentic Retrieval	Hybrid + separate agentic mode	User picks mode manually	Yes
VectifyAI PageIndex	Single architecture (hierarchical traversal)	No routing	OSS standalone

The honest weakness in our pitch: per-query RAG routing as a concept isn't new. We didn't invent routing. The genuine differentiation is the combination of (a) routing across architecturally distinct stacks rather than depth variants, (b) PageIndex / RAPTOR / LightRAG-style hierarchical traversal included as a first-class backend rather than a separate product, and (c) one managed endpoint instead of a library you assemble and operate yourself.

How routing preferences get seeded

Your routing model isn't pre-trained — it learns from your traffic. Three signals feed the routing-history store.

1. Arena selection. Run a query through RAG Arena across multiple backends, score the variants side-by-side, pick the winner. The (question, winning-backend) pair lands in the routing store.
2. Auto-fix outputs. When our auto-fix runs comparative retrievals on representative queries during ingestion or scheduled audits, the best-performing backend per query is written into the same store.
3. Production feedback. Successful queries (those that scored above your quality threshold via our online evaluation gate; see the regression-testing post) write their (question hash, backend) pair back into the routing store at request-time, with a 30-day TTL so the routing model stays fresh as your corpus evolves.
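The production-feedback write-back can be sketched as a small store with a quality gate and an expiry. This is a minimal in-memory stand-in, not the actual store: `RoutingStore` and its methods are hypothetical names, and only the 30-day TTL, the 16-character SHA-256 key, and the "scored above threshold" rule come from the text.

```python
import hashlib
import time

TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day TTL stated in the text

def question_key(question):
    """SHA-256 of the question, truncated to a 16-character hex key."""
    return hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]

class RoutingStore:
    """In-memory stand-in for the per-customer routing-history store."""

    def __init__(self):
        self._entries = {}  # key -> (backend, expires_at)

    def record(self, question, backend, score, threshold, now=None):
        """Write back only when the query scored above the quality threshold."""
        if score <= threshold:
            return False
        now = time.time() if now is None else now
        self._entries[question_key(question)] = (backend, now + TTL_SECONDS)
        return True

    def lookup(self, question, now=None):
        """Return the stored backend, or None if missing or expired."""
        now = time.time() if now is None else now
        key = question_key(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        backend, expires_at = entry
        if expires_at <= now:
            del self._entries[key]  # expired entries no longer steer routing
            return None
        return backend

store = RoutingStore()
store.record("Which amendment overrides section 7.3?", "pageindex",
             score=0.93, threshold=0.8)
print(store.lookup("Which amendment overrides section 7.3?"))
```

Expiring entries forces a question to be re-evaluated after the freshness window instead of being pinned to a backend that may no longer be the best fit for a changed corpus.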

Where this is genuinely production-grade vs. roadmap: Steps 1 and 2 ship today. Step 3's automatic feedback loop is partially shipped: successful queries write back, but Tier 2 (BM25 + RRF + reranker) is currently composed as a workflow node rather than auto-routed. We'll fold Tier 2 into the auto-router once the routing data shows clear win conditions for it.

When this matters most

A homogeneous corpus with uniform query shapes benefits little — pick one backend manually and you're done. The wedge is mixed corpora and mixed query shapes.

A legal team that asks both "what is the definition of force majeure in our standard contract?" (Tier 1, sub-300 ms) and "across our 47 vendor contracts, which ones have non-standard termination clauses and what are the patterns?" (Tier 3, multi-second page-index traversal) doesn't want to pick one backend. They want the simple question to come back fast and cheap, and the deep question to come back correctly even if it costs more — without operating two stacks.

That's the case where one managed endpoint routing across architecturally distinct backends earns its keep. If your traffic is uniform, you don't need it. If your traffic is mixed — most real enterprise corpora are — you do.

Deeper reading and adjacent products

The architecture deep-dive lives in our blog post The Future of RAG Systems: Beyond Simple Document Retrieval. The arena that powers Step 1 above is at RAG Arena & Dynamic Routing. Routing decisions are audit-anchored via the same release-manifest pattern we use across the platform — see Validating and Releasing Custom LMs in Regulated Fields. And if you want to know how we evaluate retrieval quality online (the signal that feeds Step 3 above), the regression-testing post is where to start.
