Understanding Intelligent Routing: How Forge Picks the Best Model
When you send a request to Forge with "model": "auto", you are activating the most sophisticated LLM routing engine we have seen in production. In under 5ms, Forge analyzes your request, classifies its intent, evaluates provider availability, and routes to the optimal model. This article explains how each component works.
The Cascading Intent Classifier
The first step in routing is understanding what the user is asking for. Forge's intent classifier operates in three tiers, each more powerful and more expensive than the last. The cascade exits as soon as a tier produces a confident classification.
Tier 1: Regex and Keyword Matching. The fastest tier checks for explicit signals in the prompt. Requests containing code blocks, specific programming language keywords, or structured output indicators are classified instantly. Requests asking for translations or summaries, or posing simple factual questions, hit predefined patterns. This tier resolves roughly 40% of requests in under 1ms.
Tier 2: Semantic Router. For requests that do not match explicit patterns, Forge generates an embedding using EmbeddingGemma-300M and compares it against a library of intent clusters. Each cluster maps to a task category (coding, creative writing, analysis, conversation, math, etc.) and a set of recommended models. This tier resolves another 45% of requests in 2-3ms.
Tier 3: LLM Fallback. The remaining 15% of ambiguous requests are classified by a fast, cheap LLM call. This adds latency but ensures that even novel request types are routed appropriately. The fallback model is configured to return only a classification label, keeping costs minimal.
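The cascade can be sketched as a simple early-exit pipeline. This is an illustrative sketch, not Forge's implementation: the regex patterns, the 0.80 confidence threshold, and the `semantic_route` / `llm_classify` stubs (standing in for the EmbeddingGemma-300M lookup and the cheap LLM call) are all assumptions.

```python
import re

# Tier 1 patterns: explicit signals resolved instantly (illustrative set)
TIER1_PATTERNS = {
    "coding": re.compile(r"```|\bdef\b|\bfunction\b|\bSELECT\b"),
    "translation": re.compile(r"\btranslate\b", re.IGNORECASE),
}

def semantic_route(prompt: str) -> tuple[str, float]:
    """Stub for the embedding-based intent-cluster lookup (Tier 2)."""
    if "poem" in prompt:
        return "creative_writing", 0.91
    return "conversation", 0.40   # low confidence -> fall through

def llm_classify(prompt: str) -> str:
    """Stub for the cheap LLM fallback that returns only a label (Tier 3)."""
    return "analysis"

def classify(prompt: str) -> tuple[str, str]:
    """Return (intent, tier), exiting at the first confident tier."""
    # Tier 1: regex/keyword matching (~40% of requests, <1 ms)
    for intent, pattern in TIER1_PATTERNS.items():
        if pattern.search(prompt):
            return intent, "tier1"
    # Tier 2: semantic router (~45% of requests, 2-3 ms)
    intent, score = semantic_route(prompt)
    if score >= 0.80:            # illustrative confidence threshold
        return intent, "tier2"
    # Tier 3: LLM fallback for the remaining ~15%
    return llm_classify(prompt), "tier3"
```

The early-exit structure is what keeps average latency low: the cheap tiers absorb most traffic, so the expensive LLM call only runs for genuinely ambiguous prompts.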
RouteLLM: The BERT Classifier
Once the intent is classified, Forge's RouteLLM component decides the quality tier. RouteLLM is a fine-tuned BERT classifier, compiled to ONNX for sub-5ms inference. It was trained on hundreds of thousands of prompt-response pairs labeled with quality assessments, learning to predict which requests genuinely need a frontier model and which will be served equally well by a smaller, cheaper alternative.
The key insight is that most requests do not need the most expensive model. A "What is the capital of France?" question returns the same answer from GPT-4o and GPT-4o-mini, but the cost difference is 10-20x. RouteLLM identifies these cases with high accuracy, routing simple requests to capable but inexpensive models. In production, this delivers approximately 85% cost savings compared to always using the frontier model, with less than 2% quality degradation as measured by automated evaluation suites.
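The decision RouteLLM makes reduces to thresholding a predicted probability that the request needs a frontier model. The sketch below is illustrative only: the 0.6 cutoff and the relative costs are assumptions (the cost gap is chosen within the 10-20x range cited above), not Forge's production numbers.

```python
FRONTIER_THRESHOLD = 0.6                       # assumed cutoff
COST = {"frontier": 1.00, "efficient": 0.07}   # assumed ~14x cost gap

def pick_tier(p_needs_frontier: float) -> str:
    """Map the classifier's predicted probability to a quality tier."""
    return "frontier" if p_needs_frontier >= FRONTIER_THRESHOLD else "efficient"

def relative_cost(scores: list[float]) -> float:
    """Average per-request cost of routed traffic vs. always-frontier."""
    return sum(COST[pick_tier(p)] for p in scores) / len(scores)
```

With a traffic mix dominated by simple requests, most traffic lands on the efficient tier, which is where the aggregate savings come from.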
Quality Router with ELO Scoring
Not all models are equal on all tasks, and their relative performance changes over time as providers release updates. Forge maintains an ELO scoring system that continuously evaluates model performance across task categories. When a new model version is released, its ELO scores start at the baseline and are updated as real request data flows through the system.
The Quality Router consults the ELO scores after RouteLLM determines the quality tier. Within that tier, it selects the model with the highest ELO score for the classified task category. This means Forge automatically adapts to provider improvements without manual configuration changes. If Anthropic releases a Claude update that performs better on coding tasks, the ELO system detects this within hours and adjusts routing accordingly.
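A per-category score of this kind can be maintained with the standard Elo update rule. This is a minimal sketch: the article only says scores start at a baseline and update from real traffic, so the K-factor, baseline value, and head-to-head framing below are assumptions.

```python
BASELINE = 1200   # assumed starting score for a new model version
K = 16            # assumed update step size

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """Update both ratings after one head-to-head comparison in a category."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1.0 - s_a) - (1.0 - e_a))
```

Because each comparison moves the score only a little, a genuinely improved model pulls ahead over hours of traffic rather than on a single lucky sample, which matches the adaptive behavior described above.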
The Seven-Level Priority Chain
After intent classification, quality routing, and ELO scoring narrow the field, the seven-level priority chain makes the final selection. Each level acts as a filter, and the first level that produces a definitive answer wins:
- Explicit Model Override: If the user specified a model directly (e.g., "model": "claude-opus-4-20250514"), that takes absolute priority.
- Subscription OAuth: Enterprise customers with dedicated provider agreements get routed to their contracted endpoints.
- Priority Parameter: The forge.priority field (speed, quality, balanced, cost) adjusts the weighting between cost, latency, and quality scores.
- Provider Health: Real-time health checks remove providers experiencing elevated error rates or latency spikes.
- Cost Optimization: Among healthy, qualified providers, the cheapest option for the required quality tier is preferred.
- Geographic Affinity: Data residency requirements and latency minimization based on user location.
- Load Balancing: Final tie-breaking distributes load across equivalent providers to prevent hot spots.
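The chain above can be sketched as short-circuiting checks followed by a weighted score. Everything concrete here is invented for illustration: the field names, weight values, and candidate records are hypothetical, and the geography/load terms are folded into a single tie-breaking score rather than modeled in full.

```python
CANDIDATES = [
    {"name": "alpha", "healthy": True,  "quality": 0.90, "cost": 1.00},
    {"name": "beta",  "healthy": True,  "quality": 0.70, "cost": 0.10},
    {"name": "gamma", "healthy": False, "quality": 0.95, "cost": 0.50},
]

# (w_cost, w_quality) per priority setting -- illustrative values
WEIGHTS = {"cost": (1.0, 0.2), "quality": (0.2, 1.0),
           "speed": (0.5, 0.5), "balanced": (0.6, 0.6)}

def choose(request: dict, candidates: list[dict]) -> str:
    # Level 1: explicit model override takes absolute priority
    if request.get("model") not in (None, "auto"):
        return request["model"]
    # Level 2: subscription OAuth routes to a contracted endpoint
    if request.get("contracted_endpoint"):
        return request["contracted_endpoint"]
    # Level 3: forge.priority adjusts the cost/quality weighting
    w_cost, w_quality = WEIGHTS.get(request.get("priority", "balanced"),
                                    WEIGHTS["balanced"])
    # Level 4: real-time health checks remove degraded providers
    healthy = [c for c in candidates if c["healthy"]]
    # Levels 5-7: prefer the cheapest qualified option, with geography
    # and current load folded in as small tie-breaking terms
    def score(c):
        return (w_quality * c["quality"] - w_cost * c["cost"]
                + (0.1 if c.get("in_region") else 0.0)
                - 0.01 * c.get("load", 0.0))
    return max(healthy, key=score)["name"]
```

Note how the unhealthy provider can never win regardless of its quality score, and how flipping the priority parameter changes which of the remaining providers is selected.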
Multi-Provider Failover
Even after the routing decision is made, Forge maintains a ranked fallback list. If the selected provider returns an error, times out, or exceeds its rate limit, Forge automatically retries with the next provider in the list. The retry happens transparently — your application receives a successful response without knowing that a failover occurred. The only evidence is in the Langfuse trace, which shows the original provider, the failure reason, and the fallback provider that ultimately served the request.
Forge uses the Bifrost sidecar for provider failover. Bifrost maintains persistent connections to all 14+ providers, monitors their health with sub-second granularity, and pre-warms connections so that failover adds less than 50ms of additional latency.
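Ranked-list failover of this shape is a small loop: try each provider in order, record why the previous one failed, and surface the failure trail to tracing. This is a minimal sketch, not Bifrost's code; `ProviderError` and the provider call signature are invented for illustration.

```python
class ProviderError(Exception):
    """Stand-in for an error, timeout, or rate-limit response."""

def complete(prompt: str, ranked_providers: list) -> tuple[str, list]:
    """Try providers in rank order; record failovers for the trace."""
    trace = []
    for provider in ranked_providers:
        try:
            return provider(prompt), trace
        except ProviderError as exc:
            # Failure reason is recorded here (surfaced in the Langfuse trace)
            trace.append({"provider": provider.__name__, "error": str(exc)})
    raise ProviderError("all providers exhausted")
```

The caller only ever sees a successful response or a final exhaustion error; the intermediate failures live in the trace, matching the transparent-failover behavior described above.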
The Result
The combination of cascading classification, RouteLLM cost optimization, ELO-scored quality routing, and multi-provider failover means that every request gets the best possible model for its task at the lowest possible cost with automatic resilience. You do not need to build provider switching logic, maintain model performance benchmarks, or implement retry policies. Forge handles all of it in the request path, at a latency overhead of less than 10ms for the entire routing stack.