RAG Routing — One API, Many Architectures
One API endpoint. Ten supported retrieval architectures. The router learns from your historical query traffic and dispatches each new question to the backend most likely to answer it correctly — at the lowest cost that still passes your quality bar.
The three architectures, conceptually
Most production RAG systems ship one retrieval architecture and call it done. We ship a router that picks across architecturally distinct stacks — the right choice is rarely the same for every query in your corpus.
→ stuff context → generate
Best for: Single-fact lookups, FAQ-shaped queries, "what is X?" questions on flat-chunked corpora.
→ Reciprocal Rank Fusion → cross-encoder reranker → generate
Best for: Queries where lexical and semantic signals disagree, such as codes, names, acronyms, technical vocabulary, and error strings.
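To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists, one lexical and one dense. It is an illustration of the general technique, not Divinci's implementation: the function name, the document IDs, and the constant k=60 (a commonly used default) are all assumptions, and the cross-encoder reranking stage that follows fusion is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for one query.
bm25_hits = ["doc_err_0x80", "doc_faq", "doc_intro"]      # lexical ranking
dense_hits = ["doc_faq", "doc_overview", "doc_err_0x80"]  # semantic ranking

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document seen by only one retriever still survives into the candidate set that a reranker would then reorder.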
at ingest → agent walks tree → opens / reads sections → generate
Best for: Multi-hop reading of long structured documents, such as legal contracts, financial 10-Ks, and technical PDFs where context spans non-adjacent sections.
How the router actually decides
Most published RAG routers classify the query upfront by complexity. Ours doesn't. We use learned routing: every successful query is stored with the backend that answered it, and new queries are matched against that history by embedding similarity.
The lookup algorithm — what runs on every query
- Hash the question with SHA-256, truncated to a 16-character key, and check the per-customer routing store in Cloudflare KV for an exact prior match. If it's been answered before, dispatch immediately to the backend that did best last time.
- On miss, embed the question and cosine-search against the cached index of historical question embeddings. If the nearest neighbour's similarity exceeds 0.88, dispatch to its associated backend.
- On no match above threshold, fall back to the customer's default backend for that corpus.
- After the answer is rendered, the (question hash, backend, quality score) tuple is written back into the per-customer routing-history store, seeding future lookups.
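The first three steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the production router: plain dicts and lists stand in for the Cloudflare KV store and the cached embedding index, `toy_embed` is a hypothetical stand-in for a real embedding model, the 16-character key is assumed to be hex, and the `"exact"` label and the `vector-a` / `vector-default` backend names are invented for the example. The write-back step is omitted.

```python
import hashlib
import math

SIMILARITY_THRESHOLD = 0.88  # threshold quoted in the text


def question_key(question):
    """SHA-256 of the question, truncated to 16 hex characters (assumed hex)."""
    return hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(question, embed, exact_store, history, default_backend):
    """Return (backend, match_source, similarity) for one question."""
    # Step 1: exact prior match on the truncated hash.
    key = question_key(question)
    if key in exact_store:
        return exact_store[key], "exact", 1.0
    # Step 2: nearest neighbour among historical question embeddings.
    query_vec = embed(question)
    best_backend, best_sim = None, -1.0
    for vec, backend in history:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_backend, best_sim = backend, sim
    if best_backend is not None and best_sim > SIMILARITY_THRESHOLD:
        return best_backend, "learned-history", best_sim
    # Step 3: nothing close enough, so use the corpus default.
    return default_backend, "fallback", None


def toy_embed(text):
    """Hypothetical two-dimensional embedding, for demonstration only."""
    return [1.0, 0.0] if "contract" in text else [0.0, 1.0]


exact_store = {question_key("What is force majeure?"): "vector-a"}
history = [([0.9, 0.1], "pageindex")]

print(route("What is force majeure?", toy_embed, exact_store, history, "vector-default"))
print(route("Which contract clauses changed?", toy_embed, exact_store, history, "vector-default"))
print(route("hello", toy_embed, exact_store, history, "vector-default"))
```

A linear scan is enough to show the logic; a real index over a large history would presumably use an approximate nearest-neighbour structure instead.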
The ten backends we route between today
The router dispatches to one of ten named backends. Three of them are "Tier 3-shaped" (hierarchical or graph-enhanced); the others are pure-vector engines we treat as Tier 1 with different operational tradeoffs.
Tier 2 (BM25 + dense fusion + cross-encoder reranker) ships today as a composable node in our workflow canvas. The auto-router will target it next, once per-corpus routing data justifies it.
API surface — one endpoint, audit-grade transparency
The router is invisible to your caller. One request shape; the response includes the routing decision so you can audit which backend answered (and why).
# One endpoint. The router decides which backend to use.
curl -X POST https://api.divinci.app/v1/rag/query \
-H "Authorization: Bearer $DIVINCI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"question": "What clauses in the 2024 amendment override section 7.3?",
"corpus": "legal-contracts-q4"
}'
# Response — chunks the agent needs to ground the answer.
{
"items": [
{
"content": "Section 7.3 is superseded by …",
"metadata": { "doc": "amendment-2024.pdf", "section": "II.4.b" },
"score": 0.91
}
/* … */
],
"routing": {
"backend": "pageindex", // dispatched tier-3 page-index
"match_source": "learned-history", // arena · auto-fix · or fallback
"similarity": 0.92, // ≥ 0.88 threshold
"ttl_remaining":"23d 14h" // freshness window before re-benchmark
}
}
The routing metadata is currently logged internally and surfaced via the audit trail. Inline response delivery is rolling out across Q3 2026.
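For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a hedged sketch: the URL, headers, and body fields come from the curl example above, while the helper names `build_query_request` and `routing_summary` are hypothetical. Because inline routing metadata is still rolling out, the parser treats the `routing` block as optional.

```python
import json
import urllib.request

API_URL = "https://api.divinci.app/v1/rag/query"  # from the curl example


def build_query_request(question, corpus, api_key):
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"question": question, "corpus": corpus}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def routing_summary(response):
    """Summarise the routing block of a decoded response, if it is present."""
    routing = response.get("routing")
    if routing is None:
        return "routing metadata not included in this response"
    return (
        f"{routing.get('backend')} via {routing.get('match_source')} "
        f"(similarity {routing.get('similarity')})"
    )


req = build_query_request(
    "What clauses in the 2024 amendment override section 7.3?",
    "legal-contracts-q4",
    "test-key",  # placeholder, not a real credential
)

# A decoded response shaped like the sample above, with comments removed.
sample = {
    "items": [],
    "routing": {"backend": "pageindex", "match_source": "learned-history", "similarity": 0.92},
}
print(routing_summary(sample))
print(routing_summary({"items": []}))
```

Passing `req` to `urllib.request.urlopen` and decoding the body with `json.load` would send the query; that step is left out here so the sketch runs without network access or a real key.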
How this differs from existing routers
RAG routing isn't a new idea — academic routers like Adaptive-RAG and Probing-RAG already classify queries by complexity. The differentiation is that Divinci routes across architecturally distinct retrieval stacks, learned from your own traffic, behind one managed endpoint.
| Offering | What it routes between | Routing axis | Managed? |
|---|---|---|---|
| Divinci RAG Routing | 10 backends (PageIndex, RAPTOR, LightRAG, neo4j, 6 vector engines) | Architecture · learned from history | Yes — single endpoint |
| LlamaIndex RouterRetriever | BYO retrievers | LLM/Pydantic selector | No — library you assemble |
| Adaptive-RAG (Jeong et al.) | no-retrieval / single-step / iterative | Depth · query complexity classifier | Research |
| Cloudflare AI Search (ex-AutoRAG) | One hybrid pipeline | No routing | Yes |
| AWS Bedrock Knowledge Bases | One hybrid pipeline | No routing | Yes |
| Azure AI Search Agentic Retrieval | Hybrid + separate agentic mode | User picks mode manually | Yes |
| VectifyAI PageIndex | Single architecture (hierarchical traversal) | No routing | OSS standalone |
The honest weakness in our pitch: per-query RAG routing as a concept isn't new. We didn't invent routing. The genuine differentiation is the combination of (a) routing across architecturally distinct stacks rather than depth variants, (b) PageIndex / RAPTOR / LightRAG-style hierarchical traversal included as a first-class backend rather than a separate product, and (c) one managed endpoint instead of a library you assemble and operate yourself.
How routing preferences get seeded
Your routing model isn't pre-trained — it learns from your traffic. Three signals feed the routing-history store.
- Arena selection. Run a query through RAG Arena across multiple backends, score the variants side-by-side, pick the winner. The (question, winning-backend) pair lands in the routing store.
- Auto-fix outputs. When our auto-fix runs comparative retrievals on representative queries during ingestion or scheduled audits, the best-performing backend per query is written into the same store.
- Production feedback. Successful queries, meaning those that score above your quality threshold in our online evaluation gate (see the regression-testing post), write their (question hash, backend) pair back into the routing store at request time. Each entry carries a 30-day TTL so the routing model stays fresh as your corpus evolves.
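The production-feedback signal can be pictured as a gated write with an expiry. The sketch below is an in-memory stand-in written under assumptions, not the actual store: the class name `RoutingHistory`, the `raptor` backend identifier, and the 0.8 quality threshold are hypothetical, while the truncated SHA-256 key and the 30-day TTL follow the text.

```python
import hashlib
import time

TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day TTL quoted in the text


class RoutingHistory:
    """In-memory stand-in for a per-customer routing-history store."""

    def __init__(self, quality_threshold):
        self.quality_threshold = quality_threshold
        self._entries = {}  # question key -> (backend, expires_at)

    @staticmethod
    def key_for(question):
        """SHA-256 of the question, truncated to 16 hex characters."""
        return hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]

    def record(self, question, backend, quality_score, now=None):
        """Store the pair only if the answer scored above the quality gate."""
        if quality_score <= self.quality_threshold:
            return False
        now = time.time() if now is None else now
        self._entries[self.key_for(question)] = (backend, now + TTL_SECONDS)
        return True

    def lookup(self, question, now=None):
        """Return the stored backend, or None if missing or expired."""
        now = time.time() if now is None else now
        key = self.key_for(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        backend, expires_at = entry
        if now >= expires_at:
            del self._entries[key]  # expired: forget it so it can be re-learned
            return None
        return backend


store = RoutingHistory(quality_threshold=0.8)  # hypothetical threshold
store.record("Which contracts have odd termination clauses?", "pageindex", 0.93, now=0)
store.record("What is force majeure?", "raptor", 0.41, now=0)  # below the gate

print(store.lookup("Which contracts have odd termination clauses?", now=60))
print(store.lookup("What is force majeure?", now=60))
print(store.lookup("Which contracts have odd termination clauses?", now=TTL_SECONDS))
```

Passing `now` explicitly keeps the expiry logic testable without waiting; a key-value store with native expiry would make the manual check in `lookup` unnecessary.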
When this matters most
A homogeneous corpus with uniform query shapes benefits little — pick one backend manually and you're done. The wedge is mixed corpora and mixed query shapes.
A legal team that asks both "what is the definition of force majeure in our standard contract?" (Tier 1, sub-300 ms) and "across our 47 vendor contracts, which ones have non-standard termination clauses and what are the patterns?" (Tier 3, multi-second page-index traversal) doesn't want to pick one backend. They want the simple question to come back fast and cheap, and the deep question to come back correctly even if it costs more — without operating two stacks.
That's the case where one managed endpoint routing across architecturally distinct backends earns its keep. If your traffic is uniform, you don't need it. If your traffic is mixed — most real enterprise corpora are — you do.
Deeper reading and adjacent products
The architecture deep-dive lives in our blog post The Future of RAG Systems: Beyond Simple Document Retrieval. The arena behind the arena-selection signal above is at RAG Arena & Dynamic Routing. Routing decisions are audit-anchored via the same release-manifest pattern we use across the platform; see Validating and Releasing Custom LMs in Regulated Fields. And if you want to know how we evaluate retrieval quality online (the production-feedback signal above), the regression-testing post is where to start.