Route Every Request to the Optimal Model
Cascading intent classification, ELO-scored quality routing, and 16+ algorithms ensure every request hits the right provider at the right price.
Cascading Intent Classifier
Every inbound request passes through a three-tier classification cascade. Each tier is cheaper and faster than the one after it, and a request escalates only when the current tier cannot classify its intent confidently.
Regex & Keyword Matching
<0.1ms
Lightning-fast pattern matching for known intents. Handles greetings, help commands, and structured queries without touching any ML model. Catches roughly 30% of inbound traffic at near-zero cost.
semantic-router Embedding Match
<2ms
Embeds the user message with EmbeddingGemma-300M and compares against a pre-computed route index. Handles the majority of remaining classification at embedding-only cost, with no LLM invocation required.
LLM Fallback Classifier
~50-200ms
For ambiguous or novel queries that escape Tiers 1 and 2, a lightweight LLM call determines intent. This fallback path processes less than 10% of traffic, keeping overall classification cost minimal.
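The three tiers above can be sketched as a single function. This is an illustrative outline, not Forge internals: the route index, confidence threshold, and the Tier-2/Tier-3 helpers are stubs standing in for the semantic-router embedding lookup and the LLM fallback call.

```python
import re

# Tier 1: regex/keyword patterns for known intents (assumed examples)
GREETING = re.compile(r"^\s*(hi|hello|hey|help)\b", re.I)

# Stand-in for the pre-computed embedding route index
ROUTE_INDEX = {"refund": "billing", "invoice": "billing", "deploy": "devops"}

def embedding_match(message: str):
    # Stub: keyword overlap as a proxy for cosine similarity against the index.
    for keyword, route in ROUTE_INDEX.items():
        if keyword in message.lower():
            return route, 0.9
    return None, 0.0

def llm_classify(message: str) -> str:
    # Stub for the Tier-3 lightweight LLM call.
    return "general"

def classify(message: str) -> str:
    # Tier 1: pattern match -- near-zero cost, catches ~30% of traffic
    if GREETING.match(message):
        return "greeting"
    # Tier 2: embedding similarity -- escalate only below the threshold
    route, score = embedding_match(message)
    if score >= 0.75:  # confidence threshold is an assumed value
        return route
    # Tier 3: LLM fallback for the <10% that remain ambiguous
    return llm_classify(message)
```

Requests that match at Tier 1 or 2 never touch an LLM, which is where the cascade's cost savings come from.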
RouteLLM BERT Classifier
At the core of Forge's auto-routing is a fine-tuned BERT model that predicts which LLM will produce the best response for a given prompt. Trained on hundreds of thousands of preference pairs with ELO scoring, it learns which models excel at different tasks: code generation, creative writing, analysis, summarization, and more.
The classifier runs locally via ONNX Runtime, adding less than 5ms of latency per request. By routing commodity queries to cost-efficient models and reserving premium models for complex tasks, teams consistently see an 85% reduction in LLM spend without measurable quality loss.
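Conceptually, the auto-routing decision reduces to a thresholded win-probability check. A minimal sketch, with `predict_premium_win_rate` as a stub standing in for the ONNX-hosted BERT classifier and both model names invented for illustration:

```python
def predict_premium_win_rate(prompt: str) -> float:
    # Stub: the real system runs a fine-tuned BERT via ONNX Runtime (<5ms).
    # Here, a few keyword markers approximate "this prompt is hard".
    hard_markers = ("prove", "refactor", "architecture", "derive")
    return 0.9 if any(m in prompt.lower() for m in hard_markers) else 0.2

def auto_route(prompt: str, threshold: float = 0.5) -> str:
    # Commodity queries go to the cost-efficient model; complex ones
    # are reserved for the premium model.
    if predict_premium_win_rate(prompt) >= threshold:
        return "premium-model"
    return "cost-efficient-model"
```

Tuning the threshold trades cost against quality: a higher threshold sends more traffic to the cheap model.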
Quality Router Flow
Incoming prompt arrives at Quality Router
BERT classifier predicts model-prompt affinity scores
ELO rankings weight candidates by historical quality
Cost and latency constraints filter the candidate pool
Top-ranked provider receives the request
Response quality feeds back into ELO scoring
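The flow above can be condensed into a small scoring function. Candidate fields, the ELO normalization constant, and the affinity-times-ELO weighting are illustrative assumptions, not Forge's actual formula:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    affinity: float       # BERT model-prompt affinity score (0..1)
    elo: float            # historical quality rating
    cost_per_1k: float    # $ per 1K tokens
    p50_latency_ms: float

def route(cands, max_cost=None, max_latency_ms=None) -> str:
    # Step 4: cost and latency constraints filter the candidate pool
    pool = [c for c in cands
            if (max_cost is None or c.cost_per_1k <= max_cost)
            and (max_latency_ms is None or c.p50_latency_ms <= max_latency_ms)]
    # Steps 2-3: affinity weighted by normalized ELO; step 5: top pick wins
    return max(pool, key=lambda c: c.affinity * (c.elo / 1500)).name
```

For example, an expensive high-ELO model wins unconstrained, but adding a cost ceiling flips the pick to the cheaper candidate; step 6 (feeding response quality back into ELO) then adjusts future rankings.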
7-Level Priority Chain
Forge evaluates routing signals in strict priority order. Higher-priority signals always override lower ones, giving you precise control when you need it and intelligent automation when you do not.
Subscription OAuth
Enterprise SSO-authenticated requests with pre-negotiated provider pools and guaranteed SLAs
Explicit Model
User specifies an exact model like claude-opus-4-20250514 or gpt-4o, bypassing all auto-routing
Quality Override
Request includes quality constraints (min ELO, max latency) that filter the candidate pool
Cost Budget
Request specifies a cost ceiling; router selects the best model within budget
Intent-Based
Cascading classifier maps intent to a pre-configured route with ideal provider
Auto-Select
RouteLLM BERT classifier picks the optimal model based on learned preferences and ELO scores
Default Fallback
System default model when all other signals are absent; configurable per tenant
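Strict-priority resolution means the first signal present wins and everything below it is ignored. A sketch of that chain, where the request field names and stubbed return values are assumptions rather than Forge's actual schema:

```python
def resolve_route(req: dict, default_model: str = "tenant-default") -> str:
    # 1. Subscription OAuth: pre-negotiated provider pool
    if "oauth_pool" in req:
        return req["oauth_pool"]
    # 2. Explicit model bypasses all auto-routing
    if req.get("model") not in (None, "auto"):
        return req["model"]
    # 3-4. Quality/cost constraints filter the candidate pool (stubbed)
    if "min_elo" in req or "max_cost" in req:
        return "constrained-candidate"
    # 5. Intent-based: cascade classifier mapped to a pre-configured route
    if "intent" in req:
        return f"route:{req['intent']}"
    # 6. Auto-select via the RouteLLM classifier (stubbed)
    if req.get("model") == "auto":
        return "auto-selected"
    # 7. Default fallback, configurable per tenant
    return default_model
```

Note that an OAuth pool overrides even an explicit model request, exactly as the priority order above specifies.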
LLMRouter: 16+ Routing Algorithms
Beyond the default quality-optimized routing, Forge supports over 16 routing strategies that you can select per-request, per-tenant, or per-agent. Mix and match strategies for experimentation, gradual rollouts, or cost control.
Cost-Optimized
Routes to the cheapest provider that meets quality thresholds
Quality-First
Picks the highest-ELO model regardless of cost
Latency-Optimized
Selects the fastest-responding provider in real time
Round-Robin
Distributes load evenly across configured providers
Weighted Random
Probabilistic routing based on configurable weights
Least-Connections
Routes to the provider with fewest active requests
Geographic
Routes based on user proximity to provider regions
Capability-Match
Selects providers with specific feature support
Token-Budget
Optimizes for maximum output within a token budget
Hybrid
Blends multiple strategies with tunable coefficients
A/B Split
Traffic splitting for model comparison experiments
Canary
Gradual rollout routing for new model deployments
Priority-Queue
Enterprise requests routed ahead of free-tier traffic
Fallback-Chain
Cascading provider list for guaranteed delivery
Context-Aware
Routes based on conversation history and memory state
Subscription-Tier
Routes to provider pools allocated per billing tier
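Per-request strategy selection can be pictured as a dispatch table mapping strategy names to selection functions. This is a toy sketch with three of the strategies above; provider names, weights, and costs are invented for illustration:

```python
import itertools
import random

PROVIDERS = ["alpha", "beta", "gamma"]
COST = {"alpha": 1.0, "beta": 0.2, "gamma": 0.5}  # $ per 1K tokens (assumed)
_rr = itertools.cycle(PROVIDERS)                  # round-robin cursor

STRATEGIES = {
    "round_robin": lambda req: next(_rr),
    "weighted_random": lambda req: random.choices(PROVIDERS, weights=[5, 3, 2])[0],
    "cost_optimized": lambda req: min(PROVIDERS, key=lambda p: COST[p]),
}

def route(req: dict) -> str:
    # The strategy can come from the request, the tenant, or the agent config;
    # here a per-request field with a cost-optimized default.
    return STRATEGIES[req.get("strategy", "cost_optimized")](req)
```

Because each strategy is just a function, mixing them for A/B splits or canary rollouts amounts to adding another entry to the table.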
Failover & Cost Optimization
Production workloads demand reliability. Forge ensures every request gets a response, even when providers experience outages, rate limits, or degraded performance.
Automatic Failover
When a provider returns an error or exceeds latency thresholds, Forge immediately retries the request on the next provider in the fallback chain. Retries are transparent to the caller and preserve streaming state.
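The fallback-chain behavior reduces to "try each provider in order until one succeeds." A minimal sketch (the real system also honors latency thresholds and preserves streaming state, which this omits):

```python
class ProviderError(Exception):
    """Raised when a provider fails or exceeds its latency threshold."""

def complete_with_failover(prompt, providers):
    # providers: ordered fallback chain of callables, each taking a prompt
    last_err = None
    for provider in providers:
        try:
            return provider(prompt)      # first success wins
        except ProviderError as err:
            last_err = err               # transparent retry on the next one
    raise last_err                       # entire chain exhausted
```

The caller sees a single response or a single error; the intermediate retries are invisible.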
Health-Aware Routing
Forge continuously monitors provider health with synthetic probes and real request telemetry. Unhealthy providers are automatically removed from the candidate pool and re-added once they recover.
Load Balancing
Distribute requests across providers using round-robin, weighted random, or least-connections strategies. Rate limits are tracked per-provider and per-tenant to prevent quota exhaustion.
Cost Optimization
RouteLLM routes 85% of requests to cost-efficient models without quality degradation, using a fine-tuned BERT classifier running in under 5ms via ONNX. Savings compound with semantic caching.
Stop overpaying for LLM calls
Set "model": "auto" and let Forge route every request to the best model at the best price. Customers see an average of 85% cost savings in their first week.
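In an OpenAI-compatible request body, that is the only change needed. The message content below is a placeholder; everything else in your existing request stays the same:

```python
import json

payload = {
    "model": "auto",  # let Forge's router pick the model per request
    "messages": [
        {"role": "user", "content": "Summarize this support ticket."},
    ],
}

# Serialized body sent to the Forge endpoint in place of a direct provider call
body = json.dumps(payload)
```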