Routing

How Forge's intelligent routing works across providers.

Forge's routing system is the core intelligence layer that decides which LLM provider and model handles each request. Instead of hardcoding a single provider, you let Forge dynamically select the best option based on query complexity, cost constraints, latency requirements, and real-time model performance.

Cascading Intent Classifier

Every incoming request passes through a 3-tier intent classification system that determines query complexity and intent category:

  • Tier 1 — Regex/Keyword: Fast pattern matching for common queries (greetings, simple Q&A, code generation keywords). Sub-millisecond, handles 40-60% of traffic.
  • Tier 2 — Semantic Router: Uses EmbeddingGemma-300M to classify intent by embedding similarity against trained categories. Under 5ms via ONNX runtime.
  • Tier 3 — LLM Fallback: For ambiguous queries, a lightweight LLM classifies intent. Only triggered for 5-10% of requests.
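The cascade above can be sketched as follows. The pattern table, confidence threshold, and classifier hooks are illustrative assumptions, not part of the Forge API:

```python
import re

# Illustrative cascade: each tier handles what it can and defers
# ambiguous queries to the next. Patterns, the 0.8 threshold, and all
# names here are assumptions, not Forge internals.
TIER1_PATTERNS = {
    "greeting": re.compile(r"^(hi|hello|hey)\b", re.IGNORECASE),
    "code": re.compile(r"\b(function|class|def|refactor)\b", re.IGNORECASE),
}

def classify(query, semantic_classifier=None, llm_classifier=None):
    # Tier 1: sub-millisecond regex/keyword match.
    for intent, pattern in TIER1_PATTERNS.items():
        if pattern.search(query):
            return intent, "tier1"
    # Tier 2: embedding-similarity classification (EmbeddingGemma-300M
    # via ONNX in Forge); accept only confident matches.
    if semantic_classifier is not None:
        intent, confidence = semantic_classifier(query)
        if confidence >= 0.8:              # assumed confidence threshold
            return intent, "tier2"
    # Tier 3: lightweight-LLM fallback for the remaining ~5-10%.
    if llm_classifier is not None:
        return llm_classifier(query), "tier3"
    return "general", "default"
```

Because each tier only runs when the previous one declines to answer, the cheap tiers absorb most traffic and the LLM fallback stays rare.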

Quality Router with ELO Scoring

Once intent is classified, the Quality Router ranks available models using a multi-factor scoring system:

  • ELO Score: Continuously updated based on response quality, user feedback, and evaluation benchmarks.
  • Cost: Per-token pricing for each model, weighted by the user's cost sensitivity setting.
  • Latency: Real-time P50/P95 latency tracking per provider, per region.
  • Capability Match: Model capabilities (code, vision, function calling, long context) matched to request requirements.

The RouteLLM BERT classifier achieves 85% cost reduction compared to always routing to the most expensive model, with less than 5ms decision overhead using ONNX inference.
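One way to picture the multi-factor ranking is a single weighted score per model. The weights, normalization constants, and field names below are illustrative, not Forge's actual formula:

```python
# Hypothetical combination of the four factors above into one score;
# weights and field names are assumptions for illustration only.
def route_score(model, cost_sensitivity=0.5):
    if not model["capabilities_match"]:
        return float("-inf")                      # hard filter, not a soft penalty
    elo_norm = model["elo"] / 1500.0              # normalize around a 1500 baseline
    cost_penalty = model["cost_per_1k_tokens"] * cost_sensitivity
    latency_penalty = model["p95_latency_ms"] / 1000.0 * 0.05
    return elo_norm - cost_penalty - latency_penalty

models = [
    {"name": "big", "elo": 1600, "cost_per_1k_tokens": 0.03,
     "p95_latency_ms": 2000, "capabilities_match": True},
    {"name": "small", "elo": 1450, "cost_per_1k_tokens": 0.001,
     "p95_latency_ms": 400, "capabilities_match": True},
]

# A cost-insensitive caller gets the stronger model; a cost-sensitive
# one gets the cheaper model.
print(max(models, key=lambda m: route_score(m, 0.5))["name"])   # → big
print(max(models, key=lambda m: route_score(m, 10.0))["name"])  # → small
```

Note that the capability check is a hard filter rather than a weighted term: a model that cannot serve the request should never win on price.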

Provider Failover

Forge implements automatic failover through the Bifrost sidecar. When a provider returns an error (rate limit, timeout, server error), the request is automatically retried on the next-best provider. The failover chain is configurable:

{
  "model": "auto",
  "forge": {
    "routing": {
      "failover": true,
      "maxRetries": 3,
      "providers": ["openai", "anthropic", "google"],
      "timeout": 30000
    }
  }
}
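The retry loop behind this configuration can be sketched as below. ProviderError and call_provider stand in for Bifrost's real error types and transport; they are not Forge APIs:

```python
# Minimal sketch of the failover chain described above: try providers in
# order, falling through to the next-best on a retryable error.
class ProviderError(Exception):
    pass

def complete_with_failover(prompt, providers, max_retries=3, call_provider=None):
    errors = []
    for provider in providers[:max_retries]:
        try:
            return call_provider(provider, prompt)
        except ProviderError as exc:       # rate limit, timeout, server error
            errors.append((provider, str(exc)))
    raise ProviderError(f"all providers failed: {errors}")
```

With the config above, a rate-limited OpenAI call falls through to Anthropic, then Google, before surfacing an error to the caller.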

Cost Optimization

Control routing costs with the costSensitivity parameter:

{
  "forge": {
    "routing": {
      "costSensitivity": "high",    // Prefer cheaper models
      "maxCostPerRequest": 0.05,    // Hard cap at $0.05
      "preferredTier": "mid"        // Balance cost and quality
    }
  }
}

Cost sensitivity levels: "low" (best quality, higher cost), "medium" (balanced), "high" (minimize cost, acceptable quality).
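A minimal sketch of how these settings could be enforced, assuming per-1k-token pricing; the weights and helper names are illustrative, not documented Forge values:

```python
# Illustrative interpretation of the cost settings above.
COST_SENSITIVITY = {
    "low":    0.1,   # best quality, higher cost
    "medium": 0.5,   # balanced
    "high":   1.0,   # minimize cost, acceptable quality
}

def within_budget(model, max_cost_per_request, expected_tokens):
    # Enforce the maxCostPerRequest hard cap before ranking.
    estimated = model["cost_per_1k_tokens"] * expected_tokens / 1000
    return estimated <= max_cost_per_request
```

For example, a $0.03-per-1k-token model on an expected 2,000-token request estimates to $0.06 and is excluded by a $0.05 cap, while a $0.001-per-1k-token model passes.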

7-Level Routing Priority Chain

The routing system evaluates providers through seven priority levels:

  1. Explicit Model: If a specific model is requested, route directly.
  2. Feature Requirements: Filter models that support required features (vision, tools, streaming).
  3. Security Policy: Exclude providers that don't meet the security level.
  4. Subscription Tier: Gate premium models by user tier.
  5. Cost Budget: Apply cost constraints.
  6. Quality Score: Rank remaining models by ELO.
  7. Load Balance: Distribute across equivalent models to avoid rate limits.
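The seven levels amount to a short-circuit, a series of filters, and a final ranking. This sketch uses illustrative field names throughout:

```python
# Sketch of the priority chain: explicit model short-circuits, levels 2-5
# filter, level 6 ranks. All field names are assumptions for illustration.
def route(request, models):
    # 1. Explicit model: route directly, skipping the rest of the chain.
    if request.get("model") not in (None, "auto"):
        return [m for m in models if m["name"] == request["model"]]
    candidates = models
    # 2. Feature requirements (vision, tools, streaming, ...).
    candidates = [m for m in candidates
                  if request.get("features", set()) <= m["features"]]
    # 3. Security policy.
    candidates = [m for m in candidates
                  if m["security_level"] >= request.get("min_security", 0)]
    # 4. Subscription tier gating.
    candidates = [m for m in candidates
                  if m["tier"] in request.get("allowed_tiers", {m["tier"]})]
    # 5. Cost budget.
    candidates = [m for m in candidates
                  if m["cost"] <= request.get("max_cost", float("inf"))]
    # 6. Rank by ELO. (7. Load balancing would then distribute traffic
    # among near-equivalent top candidates.)
    return sorted(candidates, key=lambda m: -m["elo"])
```

Ordering matters: hard constraints (features, security, tier, budget) prune the pool before the soft quality ranking ever runs, so no amount of ELO can override a policy violation.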

Configuration Examples

Pin to a specific provider but allow failover:

{
  "model": "gpt-4o",
  "forge": {
    "routing": {
      "preferred": "openai",
      "failover": true,
      "fallbackModels": ["claude-sonnet-4-20250514", "gemini-2.0-flash"]
    }
  }
}

Use ensemble routing for critical decisions:

{
  "model": "auto",
  "forge": {
    "ensemble": {
      "enabled": true,
      "strategy": "best-of-n",
      "n": 3,
      "judge": "quality-score"
    }
  }
}
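The best-of-n strategy in this config can be sketched as a fan-out plus a judged selection. Here generate and judge are placeholders for the real model call and the quality-score judge:

```python
# Hypothetical best-of-n ensemble matching the config above: send the
# prompt to n models, score each response with a judge, keep the winner.
def best_of_n(prompt, models, n=3, generate=None, judge=None):
    candidates = [(model, generate(model, prompt)) for model in models[:n]]
    # Return (winning_model, winning_response) under the judge's scoring.
    return max(candidates, key=lambda pair: judge(pair[1]))
```

This trades n times the inference cost for higher answer quality, which is why the document reserves ensemble routing for critical decisions.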