Routing inference across LLM providers without breaking latency
2025-09-12 / 2 min / llm / infra / cost / latency
An orchestration layer that picks the right provider per request. 28% lower provider/API spend against the prior single-provider baseline, normalised for request volume and token mix. p95 latency stayed sub-second. Caller code never changed.
Why single-provider setups stop scaling
Once an LLM product crosses meaningful traffic, a single-provider setup becomes the bottleneck for three reasons. Cost is set by a vendor who can re-price you on 30 days' notice. Throughput is rate-limited at the org level and you fight every other tenant for headroom. Quality on your specific workload is rarely uniform across models, and with one model serving everything you pay for the most capable one on requests that do not need it.
The naive answer is "put a router in front". That is the right answer. The implementation is where teams quietly burn weeks.
The orchestration layer
The layer was a thin Node.js service in front of OpenAI, Anthropic, and Gemini. Callers spoke a single internal request shape: prompt, tool definitions, expected output schema, latency budget, cost ceiling. Provider selection was opaque to callers.
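As a rough illustration of that request shape, here is a minimal TypeScript sketch. The field names, the `validateRequest` helper, and the example values are all assumptions made for illustration; the post only names the five concepts, not the real schema.

```typescript
// Hypothetical field names: the post lists the five concepts, not the actual schema.
type JsonSchema = Record<string, unknown>;

interface ToolDefinition {
  name: string;
  description: string;
  parameters: JsonSchema; // JSON Schema describing the tool's arguments
}

interface InferenceRequest {
  prompt: string;
  tools: ToolDefinition[];
  outputSchema: JsonSchema | null; // null when free-form text is acceptable
  latencyBudgetMs: number;         // end-to-end latency the caller can tolerate
  costCeilingUsd: number;          // maximum spend the caller accepts for this request
}

// Returns a list of problems; an empty list means the request can be routed.
function validateRequest(req: InferenceRequest): string[] {
  const problems: string[] = [];
  if (req.prompt.trim() === "") problems.push("prompt is empty");
  if (!(req.latencyBudgetMs > 0)) problems.push("latencyBudgetMs must be positive");
  if (!(req.costCeilingUsd > 0)) problems.push("costCeilingUsd must be positive");
  return problems;
}

const exampleRequest: InferenceRequest = {
  prompt: "Summarise the attached incident report in three bullet points.",
  tools: [],
  outputSchema: { type: "object", required: ["bullets"] },
  latencyBudgetMs: 800,
  costCeilingUsd: 0.01,
};

console.log(validateRequest(exampleRequest)); // prints []
```

Note that nothing in the shape names a provider or model, which is what keeps selection opaque and lets caller code stay unchanged.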
Selection ran a tiered policy. First, capability match: does this model support the tools and schema this caller needs? Second, rolling p95 latency per provider over the last five minutes. Third, cost per expected token. The routing decision itself had a strict 50 ms budget, and anything that took longer fell back to the default route.
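The tiered policy can be sketched roughly as follows. This is one possible reading, not the production implementation: it assumes the latency tier filters out candidates whose rolling p95 exceeds the caller's budget and the cost tier then picks the cheapest survivor. The `Candidate` fields, model labels, and numbers are placeholders.

```typescript
interface Candidate {
  provider: "openai" | "anthropic" | "gemini";
  model: string;                   // placeholder label, not a real model ID
  supportsTools: boolean;
  supportsSchema: boolean;
  rollingP95Ms: number;            // p95 over the last five minutes
  costPerExpectedTokenUsd: number;
}

interface RouteNeeds {
  needsTools: boolean;
  needsSchema: boolean;
  latencyBudgetMs: number;
}

const ROUTING_BUDGET_MS = 50;

function selectRoute(
  candidates: Candidate[],
  needs: RouteNeeds,
  defaultRoute: Candidate,
  now: () => number = Date.now, // injectable clock so the budget check is testable
): Candidate {
  const startedAt = now();
  // Tier 1: capability match.
  const capable = candidates.filter(
    (c) => (!needs.needsTools || c.supportsTools) && (!needs.needsSchema || c.supportsSchema),
  );
  // Tier 2: rolling p95 must fit the caller's latency budget (assumed interpretation).
  const fastEnough = capable.filter((c) => c.rollingP95Ms <= needs.latencyBudgetMs);
  // Tier 3: cheapest per expected token wins.
  const cheapestFirst = [...fastEnough].sort(
    (a, b) => a.costPerExpectedTokenUsd - b.costPerExpectedTokenUsd,
  );
  const choice = cheapestFirst[0];
  // No survivor, or the decision overran its budget: use the default route.
  const overBudget = now() - startedAt > ROUTING_BUDGET_MS;
  return choice === undefined || overBudget ? defaultRoute : choice;
}

const large: Candidate = { provider: "openai", model: "large", supportsTools: true, supportsSchema: true, rollingP95Ms: 900, costPerExpectedTokenUsd: 0.00003 };
const medium: Candidate = { provider: "anthropic", model: "medium", supportsTools: true, supportsSchema: true, rollingP95Ms: 600, costPerExpectedTokenUsd: 0.000008 };
const small: Candidate = { provider: "gemini", model: "small", supportsTools: false, supportsSchema: true, rollingP95Ms: 300, costPerExpectedTokenUsd: 0.000001 };

const picked = selectRoute([large, medium, small], { needsTools: true, needsSchema: true, latencyBudgetMs: 800 }, large);
console.log(picked.model); // prints "medium"
```

In this synchronous sketch the budget is checked after the fact. If the decision depends on asynchronous lookups such as a metrics store, a timer race against the default route would probably be the closer analogue.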
Failures within a tier fell through to a same-tier backup. Replays of the same request were idempotent against a content hash, so a 5xx never billed us twice.
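One way to get replay idempotency from a content hash is sketched below. The post does not say how the hash was computed or where results were kept, so the canonical serialisation, the SHA-256 choice, and the in-memory `ReplayLedger` class are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

// Serialise with sorted object keys so logically equal requests hash identically.
function canonicalize(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return `[${value.map((v) => canonicalize(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

function contentHash(request: unknown): string {
  return createHash("sha256").update(canonicalize(request)).digest("hex");
}

// Hypothetical in-memory ledger: a replayed request returns the recorded result
// instead of triggering a second billable provider call.
class ReplayLedger<T> {
  private readonly completed = new Map<string, T>();

  async run(request: unknown, call: () => Promise<T>): Promise<T> {
    const key = contentHash(request);
    if (this.completed.has(key)) return this.completed.get(key) as T;
    const result = await call(); // a thrown error records nothing, so a retry calls again
    this.completed.set(key, result);
    return result;
  }
}

const hashA = contentHash({ prompt: "hello", tools: [] });
const hashB = contentHash({ tools: [], prompt: "hello" });
console.log(hashA === hashB); // prints true
```

A real deployment would likely need a shared store with expiry rather than a per-process map, and would have to decide whether sampled (non-deterministic) requests should be deduplicated at all.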
Where the 28% savings came from
The 28% figure was provider/API spend only, measured against the prior single-provider baseline after normalising for request volume and token mix. Most of the savings did not come from picking the cheapest model. They came from getting off the most-expensive default on traffic that did not need it. Internal-tool calls, summarisation, structured extraction with strict schemas: these went to cheaper models with no measurable quality drop on the eval set. The expensive model held the long-tail reasoning workloads.
A smaller share came from killing duplicate calls during retries. In the routing logs, the idempotency hash caught roughly 4% of total traffic that would otherwise have been double-billed.
Failure mode to watch for
Cross-provider behaviour is not uniform on edge cases. JSON-mode strictness, tool-call argument shapes, refusal patterns, token counting: these all vary. If your callers depend on undocumented quirks of one provider's output, the router will turn those bugs into intermittent flakes that look like infrastructure problems.
Add a normalisation layer between the router and the providers, and pin known-good provider versions per tier. Otherwise you will spend the savings on debugging tickets that look like "sometimes it returns weird JSON".
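As a small illustration of what such a normalisation layer does, the sketch below handles one quirk from the list above: tool-call argument shapes. Some provider APIs return arguments as a JSON-encoded string and others as an already-parsed object. The `RawToolCall` variants are simplified stand-ins rather than real SDK types, and `normalizeToolCall` is a hypothetical helper.

```typescript
// The single shape callers see, whichever provider answered.
interface NormalizedToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Simplified stand-ins for provider responses (not real SDK types).
type RawToolCall =
  | { kind: "stringArgs"; name: string; arguments: string }                // JSON-encoded string
  | { kind: "objectArgs"; name: string; input: Record<string, unknown> };  // already parsed

function normalizeToolCall(raw: RawToolCall): NormalizedToolCall {
  if (raw.kind === "objectArgs") {
    return { name: raw.name, args: raw.input };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.arguments);
  } catch {
    // Fail loudly at the boundary instead of handing callers "weird JSON".
    throw new Error(`tool call ${raw.name}: arguments are not valid JSON`);
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    throw new Error(`tool call ${raw.name}: arguments must be a JSON object`);
  }
  return { name: raw.name, args: parsed as Record<string, unknown> };
}

const fromString = normalizeToolCall({ kind: "stringArgs", name: "get_weather", arguments: '{"city":"Oslo"}' });
const fromObject = normalizeToolCall({ kind: "objectArgs", name: "get_weather", input: { city: "Oslo" } });
console.log(fromString.args.city === fromObject.args.city); // prints true
```

The same pattern, one adapter per quirk with a hard failure at the boundary, would extend to JSON-mode strictness, refusal detection, and token counting.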
If your LLM product has crossed meaningful traffic and you are single-provider, the orchestration layer is the right next step.
