Intelligent Routing

Route Every Request to the Optimal Model

Cascading intent classification, ELO-scored quality routing, and 16+ algorithms ensure every request hits the right provider at the right price.

Cascading Intent Classifier

Every inbound request passes through a three-tier classification cascade. Each successive tier is more capable but slower and more expensive, so requests escalate only when the current tier cannot confidently classify intent.

Tier 1

Regex & Keyword Matching

<0.1ms

Lightning-fast pattern matching for known intents. Handles greetings, help commands, and structured queries without touching any ML model. Catches roughly 30% of inbound traffic at near-zero cost.

Tier 2

semantic-router Embedding Match

<2ms

Embeds the user message with EmbeddingGemma-300M and compares against a pre-computed route index. Handles the majority of remaining classification at embedding-only cost, with no LLM invocation required.

Tier 3

LLM Fallback Classifier

~50-200ms

For ambiguous or novel queries that escape Tiers 1 and 2, a lightweight LLM call determines intent. This fallback path processes less than 10% of traffic, keeping overall classification cost minimal.
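The cascade above can be sketched in a few lines. This is an illustrative toy, not Forge's implementation: `embed()` stands in for EmbeddingGemma-300M, `route_index` for the pre-computed route index, and `llm_classify()` for the Tier 3 LLM call.

```python
import re

# Tier 1: regex / keyword patterns for known intents.
GREETING_RE = re.compile(r"^(hi|hello|hey|help)\b", re.IGNORECASE)

# Tier 2: pre-computed route index of (intent, reference embedding) pairs.
# Real embeddings are high-dimensional; these 2-d vectors are toys.
route_index = {
    "summarize": [0.9, 0.1],
    "code": [0.1, 0.9],
}

def embed(text):
    # Stand-in for EmbeddingGemma-300M: a toy keyword-count "embedding".
    return [text.count("summary") + text.count("tl;dr"),
            text.count("def ") + text.count("function")]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def llm_classify(text):
    # Tier 3 stand-in: in production this is a lightweight LLM call.
    return "general"

def classify(text, threshold=0.7):
    # Tier 1: pattern match (<0.1ms), no model invoked.
    if GREETING_RE.match(text):
        return "greeting"
    # Tier 2: embedding similarity against the route index (<2ms).
    v = embed(text)
    intent, score = max(
        ((name, cosine(v, ref)) for name, ref in route_index.items()),
        key=lambda p: p[1],
    )
    if score >= threshold:
        return intent
    # Tier 3: LLM fallback for ambiguous or novel queries (~50-200ms).
    return llm_classify(text)
```

The key property is that each tier returns early, so the expensive fallback only ever sees traffic the cheaper tiers could not resolve.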

RouteLLM BERT Classifier

At the core of Forge's auto-routing is a fine-tuned BERT model that predicts which LLM will produce the best response for a given prompt. Trained on hundreds of thousands of preference pairs with ELO scoring, it learns which models excel at different tasks: code generation, creative writing, analysis, summarization, and more.

The classifier runs locally via ONNX Runtime, adding less than 5ms of latency per request. By routing commodity queries to cost-efficient models and reserving premium models for complex tasks, teams typically see up to an 85% reduction in LLM spend without measurable quality loss.

85%
Cost Reduction
<5ms
ONNX Latency
ELO
Quality Scoring

Quality Router Flow

1

Incoming prompt arrives at Quality Router

2

BERT classifier predicts model-prompt affinity scores

3

ELO rankings weight candidates by historical quality

4

Cost and latency constraints filter the candidate pool

5

Top-ranked provider receives the request

6

Response quality feeds back into ELO scoring
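The six steps above reduce to a short scoring loop. This sketch uses hypothetical affinity scores, ELO values, and prices; the real BERT classifier and telemetry feeds are replaced by static dictionaries.

```python
# Hypothetical ELO ratings and per-1K-token prices for three models.
elo = {"model-a": 1250.0, "model-b": 1100.0, "model-c": 1000.0}
cost_per_1k = {"model-a": 0.015, "model-b": 0.003, "model-c": 0.001}

def route(prompt, affinity, max_cost=None):
    # Steps 2-3: blend classifier affinity with ELO-weighted quality.
    scored = {m: affinity[m] * (elo[m] / 1000.0) for m in affinity}
    # Step 4: drop candidates that violate the cost constraint.
    if max_cost is not None:
        scored = {m: s for m, s in scored.items()
                  if cost_per_1k[m] <= max_cost}
    # Step 5: the top-ranked provider receives the request.
    return max(scored, key=scored.get)

def update_elo(winner, loser, k=32.0):
    # Step 6: standard ELO update from a pairwise quality judgment.
    expected = 1.0 / (1.0 + 10 ** ((elo[loser] - elo[winner]) / 400.0))
    elo[winner] += k * (1.0 - expected)
    elo[loser] -= k * (1.0 - expected)
```

Note how a cost ceiling can flip the decision: the highest-affinity model wins unconstrained, but a tight budget hands the request to a cheaper candidate.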

7-Level Priority Chain

Forge evaluates routing signals in strict priority order. Higher-priority signals always override lower ones, giving you precise control when you need it and intelligent automation when you do not.

P1

Subscription OAuth

Enterprise SSO-authenticated requests with pre-negotiated provider pools and guaranteed SLAs

P2

Explicit Model

User specifies an exact model like claude-opus-4-20250514 or gpt-4o, bypassing all auto-routing

P3

Quality Override

Request includes quality constraints (min ELO, max latency) that filter the candidate pool

P4

Cost Budget

Request specifies a cost ceiling; router selects the best model within budget

P5

Intent-Based

Cascading classifier maps intent to a pre-configured route with ideal provider

P6

Auto-Select

RouteLLM BERT classifier picks the optimal model based on learned preferences and ELO scores

P7

Default Fallback

System default model when all other signals are absent; configurable per tenant
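The strict P1-P7 precedence can be expressed as a single first-match-wins function. The request field names here are hypothetical placeholders; the point is the ordering, where a higher-priority signal short-circuits everything below it.

```python
def resolve_route(req, tenant_default="default-model"):
    # First-match-wins over the 7-level priority chain.
    if req.get("subscription_oauth"):                 # P1: SSO pool
        return ("subscription_pool", req["subscription_oauth"])
    if req.get("model") and req["model"] != "auto":   # P2: explicit model
        return ("explicit", req["model"])
    if req.get("quality"):                            # P3: min ELO / max latency
        return ("quality_filtered", req["quality"])
    if req.get("max_cost") is not None:               # P4: cost ceiling
        return ("cost_budget", req["max_cost"])
    if req.get("intent"):                             # P5: cascade intent route
        return ("intent_route", req["intent"])
    if req.get("model") == "auto":                    # P6: RouteLLM auto-select
        return ("auto", None)
    return ("default", tenant_default)                # P7: tenant default
```

An explicit model name therefore always bypasses auto-routing, and `"auto"` only engages once every stronger signal is absent.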

LLMRouter: 16+ Routing Algorithms

Beyond the default quality-optimized routing, Forge supports over 16 routing strategies that you can select per-request, per-tenant, or per-agent. Mix and match strategies for experimentation, gradual rollouts, or cost control.

Cost-Optimized

Routes to the cheapest provider that meets quality thresholds

Quality-First

Picks the highest-ELO model regardless of cost

Latency-Optimized

Selects the fastest responding provider in real-time

Round-Robin

Distributes load evenly across configured providers

Weighted Random

Probabilistic routing based on configurable weights

Least-Connections

Routes to the provider with fewest active requests

Geographic

Routes based on user proximity to provider regions

Capability-Match

Selects providers with specific feature support

Token-Budget

Optimizes for maximum output within a token budget

Hybrid

Blends multiple strategies with tunable coefficients

A/B Split

Traffic splitting for model comparison experiments

Canary

Gradual rollout routing for new model deployments

Priority-Queue

Enterprise requests routed ahead of free-tier traffic

Fallback-Chain

Cascading provider list for guaranteed delivery

Context-Aware

Routes based on conversation history and memory state

Subscription-Tier

Routes to provider pools allocated per billing tier
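Selecting a strategy per-request amounts to a dispatch table. This sketch shows three of the strategies above; the provider names, connection counts, and weights are illustrative, not Forge configuration.

```python
import itertools
import random

providers = ["openai", "anthropic", "together"]
active = {"openai": 3, "anthropic": 1, "together": 5}    # open requests
weights = {"openai": 0.5, "anthropic": 0.3, "together": 0.2}
_rr = itertools.cycle(providers)

STRATEGIES = {
    # Round-Robin: even distribution across configured providers.
    "round-robin": lambda: next(_rr),
    # Least-Connections: provider with the fewest active requests.
    "least-connections": lambda: min(providers, key=active.get),
    # Weighted Random: probabilistic routing by configured weight.
    "weighted-random": lambda: random.choices(
        providers, weights=[weights[p] for p in providers]
    )[0],
}

def pick(strategy):
    return STRATEGIES[strategy]()
```

Because each strategy is just a callable, swapping one in per-request, per-tenant, or per-agent is a lookup rather than a code change.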

Failover & Cost Optimization

Production workloads demand reliability. Forge ensures every request gets a response, even when providers experience outages, rate limits, or degraded performance.

Automatic Failover

When a provider returns an error or exceeds latency thresholds, Forge immediately retries the request on the next provider in the fallback chain. Retries are transparent to the caller and preserve streaming state.
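The fallback-chain behavior reduces to a loop that escalates on any failure. `call_provider` here is a stand-in for the real provider client; latency-threshold breaches would surface as exceptions the same way errors do.

```python
def call_with_failover(chain, call_provider):
    # Try each provider in fallback order; return the first success.
    last_err = None
    for provider in chain:
        try:
            return provider, call_provider(provider)
        except Exception as err:   # provider error or latency breach
            last_err = err         # transparent retry on the next provider
    raise RuntimeError(f"all providers failed: {last_err}")
```

The caller only ever sees the successful response (or a final error once the chain is exhausted), which is what makes the retries transparent.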

Health-Aware Routing

Forge continuously monitors provider health with synthetic probes and real request telemetry. Unhealthy providers are automatically removed from the candidate pool and re-added once they recover.

Load Balancing

Distribute requests across providers using round-robin, weighted random, or least-connections strategies. Rate limits are tracked per-provider and per-tenant to prevent quota exhaustion.

Cost Optimization

RouteLLM routes 85% of requests to cost-efficient models without quality degradation, using a fine-tuned BERT classifier running in under 5ms via ONNX. Savings compound with semantic caching.

Stop overpaying for LLM calls

Set "model": "auto" and let Forge route every request to the best model at the best price. Customers typically see 85% cost savings in the first week.
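A minimal request sketch: only the "model": "auto" field comes from the docs above; the endpoint, headers, and message shape are assumed to follow the common OpenAI-compatible format and are placeholders.

```python
import json

payload = {
    "model": "auto",   # let Forge pick the provider and model
    "messages": [
        {"role": "user", "content": "Summarize this release note."},
    ],
}

# POST this body to your Forge endpoint with any OpenAI-compatible client.
body = json.dumps(payload)
```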