Routing System Architecture
Tier 2 | Deep technical documentation for model routing | Hub: README.md | Full Architecture: ARCHITECTURE.md
Overview
The routing system selects a CLI/model for each task through a multi-stage pipeline. The order actually executed by composite-router-stages.ts:runPipeline is:
Task
→ Budget (filter: eliminate over-budget CLIs)
→ Scoring (parallel) (ConfidenceCascade, CapabilityMatch, KnnRouting, DistilledRule, ResourceStrategy, ZeroRouter, Preference)
→ QualityConstraint (constraint-first; can short-circuit, #1686)
→ CategoryOverride (CATEGORY_CHAIN_OVERRIDES per category; can short-circuit on sensitive categories, #2414/#2417)
→ TOPSIS (rank, with stage-score-adjusted profiles and performance-floor penalties, #1354/#1401)
→ LinUCB (bandit selection from ranked candidates)
→ PerfFloorOverride (reject LinUCB pick if CLI < 50% success at ≥20 samples; promote TOPSIS top, #1790)
→ Latency (record per-CLI latency for feedback loop)
→ Selected Model
The simpler legacy 5-stage diagram (“Budget → ZeroRouter → Preference → TOPSIS → LinUCB”) predates #755/#1350/#2414. The constraint and category-override stages can short-circuit routing without ever reaching TOPSIS, so omitting them gives the wrong mental model when debugging “why was my model rejected?” (#2947).
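The short-circuit behavior and the PerfFloorOverride rule can be sketched as follows. This is a minimal illustration, not the code in composite-router-stages.ts: `StageResult`, `Stage`, `runStages`, and `applyPerfFloor` are hypothetical names, and only the thresholds (below 50% success at 20 or more samples) come from the diagram.

```typescript
type CliName = 'claude' | 'gemini' | 'codex' | 'opencode';

// Hypothetical stage result: narrow the candidate list, or decide immediately.
type StageResult =
  | { kind: 'continue'; candidates: CliName[] }
  | { kind: 'decide'; cliName: CliName; reason: string };

type Stage = { name: string; run: (candidates: CliName[]) => StageResult };

interface PipelineOutcome {
  cliName: CliName | undefined;
  reason: string;
  stagesExecuted: string[];
}

// Run stages in order; a 'decide' result skips every later stage.
function runStages(stages: Stage[], initial: CliName[]): PipelineOutcome {
  let candidates = initial;
  const stagesExecuted: string[] = [];
  for (const stage of stages) {
    stagesExecuted.push(stage.name);
    const result = stage.run(candidates);
    if (result.kind === 'decide') {
      return { cliName: result.cliName, reason: result.reason, stagesExecuted };
    }
    candidates = result.candidates;
  }
  return { cliName: candidates[0], reason: 'top remaining candidate', stagesExecuted };
}

// PerfFloorOverride rule from the diagram: reject the bandit pick when its
// success rate is below 50% over at least 20 samples, and promote the TOPSIS top.
function applyPerfFloor(
  linucbPick: CliName,
  topsisTop: CliName,
  stats: { successes: number; samples: number },
): CliName {
  const belowFloor = stats.samples >= 20 && stats.successes / stats.samples < 0.5;
  return belowFloor ? topsisTop : linucbPick;
}

// Example: the category override decides, so the TOPSIS stage never runs.
const outcome = runStages(
  [
    { name: 'budget', run: (c) => ({ kind: 'continue', candidates: c.filter((x) => x !== 'codex') }) },
    { name: 'categoryOverride', run: () => ({ kind: 'decide', cliName: 'claude', reason: 'sensitive category' }) },
    { name: 'topsis', run: (c) => ({ kind: 'continue', candidates: c }) },
  ],
  ['claude', 'gemini', 'codex'],
);
```

When a model seems to have been rejected unexpectedly, checking which stages appear in `stagesExecuted` on the real `CompositeRoutingDecision` is the analogous first step.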
Use CompositeRouter.route(task); do NOT instantiate stage routers directly.
CompositeRouter Pipeline
Chains multiple routers in sequence for intelligent model selection.
interface ICompositeRouter {
route(task: CliTask): Promise<Result<CompositeRoutingDecision, CompositeRoutingError>>;
getStats(): CompositeRouterStats;
invalidateCaches(): void;
}
interface CompositeRoutingDecision {
readonly cliName: 'claude' | 'gemini' | 'codex' | 'opencode';
readonly reason: string;
readonly confidence: number;
readonly topsisScore?: number;
readonly linucbExploration?: number;
readonly alternatives: readonly ('claude' | 'gemini' | 'codex' | 'opencode')[];
readonly stagesExecuted: readonly string[];
}
Stage 1: Task Analysis
Profiles tasks before routing:
| Characteristic | Derived From | Impact |
|---|---|---|
| reasoningComplexity | Keywords (“design”, “architect”) | Boosts Claude quality score |
| contextRequired | 0.25 tokens/char + 500 tokens/file | Filters by context window |
| codeGeneration | Keywords (“implement”, “write”) | Boosts Codex score |
| budgetSensitive | Keywords (“quick”, “simple”) | Prioritizes Gemini |
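The derivations in the table can be sketched as below. The keyword lists contain only the examples given in the table (the real lists are presumably longer), and `SketchProfile` and `profileTask` are hypothetical names used for illustration.

```typescript
interface SketchProfile {
  reasoningComplexity: boolean;
  codeGeneration: boolean;
  budgetSensitive: boolean;
  contextRequired: number; // estimated tokens
}

// Only the example keywords from the table; assumed incomplete.
const REASONING_WORDS = ['design', 'architect'];
const CODEGEN_WORDS = ['implement', 'write'];
const BUDGET_WORDS = ['quick', 'simple'];

const hasAny = (text: string, words: string[]): boolean =>
  words.some((w) => text.toLowerCase().includes(w));

function profileTask(description: string, fileCount: number): SketchProfile {
  return {
    reasoningComplexity: hasAny(description, REASONING_WORDS),
    codeGeneration: hasAny(description, CODEGEN_WORDS),
    budgetSensitive: hasAny(description, BUDGET_WORDS),
    // 0.25 tokens per character plus 500 tokens per attached file.
    contextRequired: Math.ceil(description.length * 0.25 + fileCount * 500),
  };
}

// 22 characters and 2 files: ceil(22 * 0.25 + 2 * 500) = 1006 tokens.
const profile = profileTask('Implement a quick sort', 2);
```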
Stage 2: Budget Filter
Enforces token/cost/latency constraints:
interface BudgetConstraint {
readonly maxTokens?: number;
readonly maxCostUsd?: number;
readonly maxLatencyMs?: number;
}
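A minimal sketch of how such a constraint could filter candidates, assuming a hypothetical per-CLI estimate (`CliEstimate`) is available before routing. An unset limit is treated as "no constraint"; that reading of the optional fields is an assumption.

```typescript
type CliName = 'claude' | 'gemini' | 'codex' | 'opencode';

interface BudgetConstraint {
  readonly maxTokens?: number;
  readonly maxCostUsd?: number;
  readonly maxLatencyMs?: number;
}

// Hypothetical estimate of what a task would consume on a given CLI.
interface CliEstimate {
  cliName: CliName;
  tokens: number;
  costUsd: number;
  latencyMs: number;
}

// A limit that is undefined does not constrain the candidate.
function withinBudget(e: CliEstimate, b: BudgetConstraint): boolean {
  return (
    (b.maxTokens === undefined || e.tokens <= b.maxTokens) &&
    (b.maxCostUsd === undefined || e.costUsd <= b.maxCostUsd) &&
    (b.maxLatencyMs === undefined || e.latencyMs <= b.maxLatencyMs)
  );
}

function filterByBudget(estimates: CliEstimate[], budget: BudgetConstraint): CliName[] {
  return estimates.filter((e) => withinBudget(e, budget)).map((e) => e.cliName);
}

const survivors = filterByBudget(
  [
    { cliName: 'claude', tokens: 8000, costUsd: 0.12, latencyMs: 4000 },
    { cliName: 'gemini', tokens: 8000, costUsd: 0.02, latencyMs: 1500 },
  ],
  { maxCostUsd: 0.05 },
);
```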
Stage 3: TOPSIS Ranking
Multi-criteria decision for Pareto-optimal selection:
| Criterion | Weight | Direction | Description |
|---|---|---|---|
| Quality | 50% | Maximize | Reasoning + code generation |
| Cost | 30% | Minimize | $/token estimate |
| Latency | 20% | Minimize | Response time |
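For reference, textbook TOPSIS with the weights above looks like the sketch below. It uses vector normalization and Euclidean distances; whether topsis-router.ts normalizes the same way, and how it applies the stage-score adjustments and floor penalties, is not shown here. `Candidate` and its field names are hypothetical.

```typescript
type CliName = 'claude' | 'gemini' | 'codex' | 'opencode';

// Hypothetical per-CLI profile for one task.
interface Candidate {
  cliName: CliName;
  quality: number;
  costUsd: number;
  latencyMs: number;
}

const CRITERIA = [
  { key: 'quality', weight: 0.5, maximize: true },
  { key: 'costUsd', weight: 0.3, maximize: false },
  { key: 'latencyMs', weight: 0.2, maximize: false },
] as const;

const distance = (row: number[], ref: number[]): number =>
  Math.sqrt(row.reduce((s, v, j) => s + (v - ref[j]) ** 2, 0));

function topsis(candidates: Candidate[]): { cliName: CliName; score: number }[] {
  const weighted: number[][] = candidates.map(() => []);
  const best: number[] = [];
  const worst: number[] = [];
  for (const c of CRITERIA) {
    // Vector-normalize the column, then apply the criterion weight.
    const norm = Math.sqrt(candidates.reduce((s, x) => s + x[c.key] ** 2, 0)) || 1;
    const column = candidates.map((x) => (x[c.key] / norm) * c.weight);
    column.forEach((v, i) => weighted[i].push(v));
    // Ideal best/worst depend on whether the criterion is maximized.
    best.push(c.maximize ? Math.max(...column) : Math.min(...column));
    worst.push(c.maximize ? Math.min(...column) : Math.max(...column));
  }
  return candidates
    .map((x, i) => {
      const dBest = distance(weighted[i], best);
      const dWorst = distance(weighted[i], worst);
      // Closeness coefficient in [0, 1]; 1 means identical to the ideal point.
      const score = dBest + dWorst === 0 ? 0.5 : dWorst / (dBest + dWorst);
      return { cliName: x.cliName, score };
    })
    .sort((a, b) => b.score - a.score);
}

const ranked = topsis([
  { cliName: 'claude', quality: 0.9, costUsd: 0.03, latencyMs: 4000 },
  { cliName: 'gemini', quality: 0.7, costUsd: 0.005, latencyMs: 1500 },
  { cliName: 'codex', quality: 0.8, costUsd: 0.02, latencyMs: 3000 },
]);
```

A candidate that is best on every criterion scores 1 and one that is worst on every criterion scores 0; mixed profiles land in between according to the weights.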
Stage 4: LinUCB Learning
Contextual bandit learns from outcomes:
// 6D context vector
const context = {
taskComplexity: 0.8, // Normalized 0-1
contextLengthNormalized: 0.3, // Tokens / max context
isCodeTask: true,
isReasoningTask: false,
budgetUtilization: 0.2, // % of budget used
timePressure: 0.0, // Deadline proximity
};
// UCB score (pseudocode): expected reward plus an exploration bonus
// ucb = E[reward | context] + alpha * sqrt(uncertainty)
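In the standard disjoint LinUCB formulation, the expected reward is θᵀx and the uncertainty is xᵀA⁻¹x. The sketch below implements one arm that way, keeping A⁻¹ up to date with a Sherman-Morrison rank-one update. That linucb-bandit.ts uses this exact formulation, and that booleans are encoded as 0/1, are assumptions; `LinUcbArm` is a hypothetical name.

```typescript
type Vec = number[];
type Mat = number[][];

const dot = (a: Vec, b: Vec): number => a.reduce((s, v, i) => s + v * b[i], 0);
const matVec = (m: Mat, v: Vec): Vec => m.map((row) => dot(row, v));

class LinUcbArm {
  private aInv: Mat; // inverse of A = I + sum of x xᵀ
  private b: Vec; // sum of reward * x

  constructor(dim: number) {
    this.aInv = Array.from({ length: dim }, (_, i) =>
      Array.from({ length: dim }, (_, j) => (i === j ? 1 : 0)),
    );
    this.b = new Array<number>(dim).fill(0);
  }

  // ucb = θᵀx + alpha * sqrt(xᵀ A⁻¹ x), with θ = A⁻¹ b
  score(x: Vec, alpha: number): number {
    const theta = matVec(this.aInv, this.b);
    const aInvX = matVec(this.aInv, x);
    return dot(theta, x) + alpha * Math.sqrt(dot(x, aInvX));
  }

  // Sherman-Morrison: (A + x xᵀ)⁻¹ = A⁻¹ - (A⁻¹x)(A⁻¹x)ᵀ / (1 + xᵀA⁻¹x)
  update(x: Vec, reward: number): void {
    const aInvX = matVec(this.aInv, x);
    const denom = 1 + dot(x, aInvX);
    this.aInv = this.aInv.map((row, i) =>
      row.map((v, j) => v - (aInvX[i] * aInvX[j]) / denom),
    );
    this.b = this.b.map((v, i) => v + reward * x[i]);
  }
}

// The 6D context above, with booleans encoded as 1/0 (an assumption).
const x: Vec = [0.8, 0.3, 1, 0, 0.2, 0.0];
const arm = new LinUcbArm(6);
const before = arm.score(x, 1.0); // pure exploration bonus on a fresh arm
arm.update(x, 1); // observe a successful outcome
const after = arm.score(x, 1.0);
```

After one rewarded observation the expected-reward term rises and the exploration bonus for the same context shrinks, which is the explore/exploit trade-off that `alpha` controls.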
Task Router Interface
Routes tasks to optimal CLI based on capability matching.
interface ITaskRouter {
route(task: Task): Promise<Result<ICliAdapter, RoutingError>>;
routeWithDetails(task: Task): Promise<Result<RoutingDecision, RoutingError>>;
}
interface RoutingDecision {
readonly adapter: ICliAdapter;
readonly confidence: number; // 0-1 routing confidence
readonly reason: string; // Why this CLI was chosen
readonly alternatives: readonly ICliAdapter[];
readonly decisionTimeMs: number;
}
type CliName = 'claude' | 'gemini' | 'codex' | 'opencode';
type CliTransport = 'mcp' | 'subprocess';
Budget Router (IBudgetRouter)
Budget-constrained routing with PILOT pattern (arXiv:2508.21141).
interface IBudgetRouter {
getSessionBudget(): SessionBudget;
updateBudget(usage: { tokens?: number; costUsd?: number }): void;
resetBudget(): void;
checkBudget(task: CliTask, constraint?: BudgetConstraint): BudgetRoutingResult;
routeWithBudget(
task: CliTask,
budget?: BudgetConstraint
): Promise<Result<BudgetRoutingResult, BudgetExceededError>>;
executeWithBudget(
task: CliTask,
budget?: BudgetConstraint
): Promise<Result<CliResponse & { budgetAfter: SessionBudget }, CliError>>;
}
Budget Thresholds
| Level | Usage | Action |
|---|---|---|
| Info | 50% | Log usage |
| Warning | 75% | Warn user |
| Critical | 90% | Suggest task simplification |
| Hard | 100% | Reject task |
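The thresholds map onto a simple classification. This sketch assumes the thresholds are inclusive and that the higher of token usage and cost usage determines the level; neither detail is stated in the table. `BudgetLevel` and `budgetLevel` are hypothetical names.

```typescript
type BudgetLevel = 'ok' | 'info' | 'warning' | 'critical' | 'hard';

interface BudgetUsage {
  tokenBudget: number;
  tokensUsed: number;
  costBudgetUsd: number;
  costUsed: number;
}

// Fraction of a budget consumed; a non-positive budget counts as exhausted.
const fraction = (used: number, budget: number): number => (budget > 0 ? used / budget : 1);

function budgetLevel(u: BudgetUsage): BudgetLevel {
  // Assumption: whichever dimension is closer to its limit wins.
  const usage = Math.max(
    fraction(u.tokensUsed, u.tokenBudget),
    fraction(u.costUsed, u.costBudgetUsd),
  );
  if (usage >= 1) return 'hard'; // reject task
  if (usage >= 0.9) return 'critical'; // suggest task simplification
  if (usage >= 0.75) return 'warning'; // warn user
  if (usage >= 0.5) return 'info'; // log usage
  return 'ok';
}

const level = budgetLevel({
  tokenBudget: 1000000,
  tokensUsed: 500000,
  costBudgetUsd: 10,
  costUsed: 1,
});
```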
Session Budget
interface SessionBudget {
readonly tokenBudget: number; // Default: 1M tokens
readonly costBudgetUsd: number; // Default: $10
readonly tokensUsed: number;
readonly costUsed: number;
readonly resetAt: number; // Epoch ms
}
Circuit Breaker (ICircuitBreaker)
Prevents cascading failures with configurable thresholds.
interface ICircuitBreaker {
execute<T>(operation: () => Promise<T>): Promise<T>;
getState(): CircuitState; // 'closed' | 'open' | 'half_open'
recordFailure(category: FailureCategory): void;
recordSuccess(): void;
reset(): void;
getSnapshot(): CircuitBreakerSnapshot;
}
State Transitions
stateDiagram-v2
[*] --> Closed
Closed --> Open: failures >= threshold
Open --> HalfOpen: timeout elapsed
HalfOpen --> Closed: success
HalfOpen --> Open: failure
Configuration
circuitBreaker:
failureThreshold: 5 # Failures before open
successThreshold: 2 # Successes to close from half-open
timeout: 30000 # ms before half-open
rollingWindow: 60000 # ms for failure counting
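The state machine and the four settings fit together roughly as sketched below. `SketchBreaker` is a hypothetical, simplified class (no `execute`, no failure categories) with an injectable clock so the transitions can be exercised deterministically; it is not the code in circuit-breaker.ts.

```typescript
type CircuitState = 'closed' | 'open' | 'half_open';

interface BreakerConfig {
  failureThreshold: number;
  successThreshold: number;
  timeout: number; // ms before open -> half_open
  rollingWindow: number; // ms over which failures are counted
}

class SketchBreaker {
  private state: CircuitState = 'closed';
  private failures: number[] = []; // failure timestamps inside the window
  private halfOpenSuccesses = 0;
  private openedAt = 0;
  private readonly cfg: BreakerConfig;
  private readonly now: () => number;

  constructor(cfg: BreakerConfig, now: () => number = () => Date.now()) {
    this.cfg = cfg;
    this.now = now;
  }

  getState(): CircuitState {
    // Open -> HalfOpen once the timeout has elapsed.
    if (this.state === 'open' && this.now() - this.openedAt >= this.cfg.timeout) {
      this.state = 'half_open';
      this.halfOpenSuccesses = 0;
    }
    return this.state;
  }

  recordFailure(): void {
    const t = this.now();
    const current = this.getState();
    if (current === 'open') return; // already tripped
    if (current === 'half_open') {
      this.trip(t); // HalfOpen -> Open on any failure
      return;
    }
    this.failures = this.failures.filter((f) => t - f < this.cfg.rollingWindow);
    this.failures.push(t);
    if (this.failures.length >= this.cfg.failureThreshold) this.trip(t);
  }

  recordSuccess(): void {
    if (this.getState() !== 'half_open') return;
    this.halfOpenSuccesses += 1;
    // HalfOpen -> Closed after successThreshold consecutive successes.
    if (this.halfOpenSuccesses >= this.cfg.successThreshold) {
      this.state = 'closed';
      this.failures = [];
    }
  }

  private trip(at: number): void {
    this.state = 'open';
    this.openedAt = at;
  }
}

const defaults: BreakerConfig = {
  failureThreshold: 5,
  successThreshold: 2,
  timeout: 30000,
  rollingWindow: 60000,
};
```

With `successThreshold: 2`, a single success in half-open is not enough to close the circuit; the diagram's "HalfOpen --> Closed: success" edge fires on the second consecutive success.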
CLI Detection Cache (ICliDetectionCache)
Caches CLI health check results with TTL and invalidation.
interface ICliDetectionCache {
get(cliName: CliName): Promise<CliHealthResult | undefined>;
set(cliName: CliName, result: CliHealthResult): Promise<void>;
invalidate(cliName: CliName): void;
invalidateAll(): void;
getStats(): CacheStats;
onInvalidate(listener: (cliName: CliName) => void): () => void;
}
interface CliHealthResult {
readonly available: boolean;
readonly version?: string;
readonly checkedAt: number;
readonly error?: string;
}
Cache TTL Strategy
| Scenario | TTL | Rationale |
|---|---|---|
| Available | 5 minutes | Stable, reduce checks |
| Unavailable | 30 seconds | Retry quickly after failure |
| Version change | Immediate | Capabilities may differ |
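A freshness check that follows this table might look like the sketch below. How a version change is detected is not described in this document, so the `currentVersion` parameter is an assumption, as are the names `isFresh`, `AVAILABLE_TTL_MS`, and `UNAVAILABLE_TTL_MS`.

```typescript
interface CliHealthResult {
  readonly available: boolean;
  readonly version?: string;
  readonly checkedAt: number; // epoch ms
  readonly error?: string;
}

const AVAILABLE_TTL_MS = 5 * 60 * 1000; // 5 minutes
const UNAVAILABLE_TTL_MS = 30 * 1000; // 30 seconds

function isFresh(entry: CliHealthResult, now: number, currentVersion?: string): boolean {
  // A version change invalidates the entry immediately.
  if (
    currentVersion !== undefined &&
    entry.version !== undefined &&
    entry.version !== currentVersion
  ) {
    return false;
  }
  // Healthy results are trusted longer than failures, so failures are retried sooner.
  const ttl = entry.available ? AVAILABLE_TTL_MS : UNAVAILABLE_TTL_MS;
  return now - entry.checkedAt < ttl;
}

const healthy: CliHealthResult = { available: true, version: '1.2.0', checkedAt: 0 };
const stillFresh = isFresh(healthy, 4 * 60 * 1000); // 4 minutes later
```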
Token Counter (ITokenCounter)
Universal token counting across model providers.
interface ITokenCounter {
count(text: string): Promise<TokenCountResult>;
countMessages(messages: Message[]): Promise<TokenCountResult>;
getMaxTokens(): number;
getProvider(): TokenCounterProvider;
}
type TokenCounterProvider = 'tiktoken' | 'anthropic' | 'heuristic';
Provider Selection
| Provider | Accuracy | Speed | Use Case |
|---|---|---|---|
| tiktoken | High | Fast | OpenAI models |
| anthropic | Exact | Medium | Claude models |
| heuristic | ±10% | Instant | Quick estimates |
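Provider selection per the table can be sketched as a small lookup. `ModelFamily`, `pickProvider`, and `heuristicCount` are hypothetical names, and reusing the 0.25 tokens/char ratio from task analysis for the heuristic counter is an assumption.

```typescript
type TokenCounterProvider = 'tiktoken' | 'anthropic' | 'heuristic';

// Hypothetical coarse grouping of models for illustration.
type ModelFamily = 'openai' | 'claude' | 'other';

function pickProvider(family: ModelFamily, quickEstimate = false): TokenCounterProvider {
  if (quickEstimate) return 'heuristic'; // instant, roughly ±10%
  if (family === 'openai') return 'tiktoken';
  if (family === 'claude') return 'anthropic';
  return 'heuristic'; // no exact tokenizer known for this family
}

// Character-ratio estimate; 0.25 tokens/char is assumed, not confirmed.
function heuristicCount(text: string): number {
  return Math.ceil(text.length * 0.25);
}

const provider = pickProvider('claude');
const estimate = heuristicCount('Implement a sorting algorithm'); // 29 chars -> 8
```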
Capacity Monitor (ICapacityMonitor)
Tracks rate limits across model providers.
interface ICapacityMonitor {
updateFromHeaders(provider: string, headers: Headers): void;
getCapacity(provider: string): CapacityInfo | null;
onLowCapacity(callback: LowCapacityCallback): () => void;
setLowCapacityThreshold(threshold: number): void;
getTimeUntilReset(provider: string): number | null;
}
interface CapacityInfo {
readonly remainingTokens: number;
readonly remainingRequests: number;
readonly resetTime: Date | null;
readonly utilizationPercent: number;
}
Rate Limit Headers
| Provider | Token Header | Request Header |
|---|---|---|
| Anthropic | anthropic-ratelimit-* | anthropic-ratelimit-* |
| OpenAI | x-ratelimit-*-tokens | x-ratelimit-*-requests |
| Google | x-goog-api-* | x-goog-api-* |
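As an illustration of reading these headers, the sketch below extracts remaining capacity for Anthropic and OpenAI responses. It takes a plain lowercase-keyed record rather than a `Headers` object to stay self-contained; `parseCapacity` and `CapacitySketch` are hypothetical, the utilization formula is an assumption, and reset-time parsing and the Google headers are omitted.

```typescript
// Subset of CapacityInfo; resetTime is left out of this sketch.
interface CapacitySketch {
  remainingTokens: number;
  remainingRequests: number;
  utilizationPercent: number;
}

// Concrete header names published by each provider for token/request limits.
const HEADER_NAMES = {
  anthropic: {
    limit: 'anthropic-ratelimit-tokens-limit',
    tokens: 'anthropic-ratelimit-tokens-remaining',
    requests: 'anthropic-ratelimit-requests-remaining',
  },
  openai: {
    limit: 'x-ratelimit-limit-tokens',
    tokens: 'x-ratelimit-remaining-tokens',
    requests: 'x-ratelimit-remaining-requests',
  },
} as const;

type Provider = keyof typeof HEADER_NAMES;

function parseCapacity(
  provider: Provider,
  headers: Record<string, string>,
): CapacitySketch | null {
  const names = HEADER_NAMES[provider];
  const limit = Number(headers[names.limit]);
  const tokens = Number(headers[names.tokens]);
  const requests = Number(headers[names.requests]);
  // Missing or malformed headers yield NaN and are reported as "unknown".
  if (![limit, tokens, requests].every((n) => Number.isFinite(n)) || limit <= 0) return null;
  return {
    remainingTokens: tokens,
    remainingRequests: requests,
    // Assumed definition: share of the token limit already consumed.
    utilizationPercent: (1 - tokens / limit) * 100,
  };
}

const capacity = parseCapacity('openai', {
  'x-ratelimit-limit-tokens': '1000',
  'x-ratelimit-remaining-tokens': '250',
  'x-ratelimit-remaining-requests': '40',
});
```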
Work Balancer (IWorkBalancer)
Distributes parallel tasks across available CLIs.
interface IWorkBalancer {
balance(tasks: TaskProfile[]): Promise<BalanceResult>;
queueTask(task: TaskProfile): void;
getQueueDepth(): number;
clearQueue(): void;
}
interface BalanceResult {
assignments: Map<string, CliName>;
unassigned: string[];
reasoning: Record<string, ScoreBreakdown>;
}
Balancing Algorithm
- Capacity check: Filter CLIs with available capacity
- Task match: Score CLI capabilities vs task requirements
- Load balance: Distribute evenly with affinity hints
- Fallback: Queue tasks if all CLIs at capacity
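The four steps above can be sketched as a greedy assignment. The scoring formula (skill match plus an affinity bonus, minus relative load) and the `SketchTask` / `SketchCli` shapes are assumptions for illustration; the real balancer's `ScoreBreakdown` is not reproduced.

```typescript
type CliName = 'claude' | 'gemini' | 'codex' | 'opencode';

// Hypothetical simplified shapes.
interface SketchTask {
  id: string;
  needs: string; // required capability
  affinity?: CliName; // optional affinity hint
}
interface SketchCli {
  name: CliName;
  capacity: number; // max concurrent tasks
  skills: string[];
}

function balance(
  tasks: SketchTask[],
  clis: SketchCli[],
): { assignments: Map<string, CliName>; unassigned: string[] } {
  const load = new Map<CliName, number>();
  for (const c of clis) load.set(c.name, 0);
  const assignments = new Map<string, CliName>();
  const unassigned: string[] = [];

  for (const task of tasks) {
    let bestCli: SketchCli | undefined;
    let bestScore = -Infinity;
    for (const cli of clis) {
      const used = load.get(cli.name) ?? 0;
      if (used >= cli.capacity) continue; // 1. capacity check
      const match = cli.skills.includes(task.needs) ? 1 : 0; // 2. task match
      const affinity = task.affinity === cli.name ? 0.5 : 0;
      const score = match + affinity - used / cli.capacity; // 3. load balance
      if (score > bestScore) {
        bestScore = score;
        bestCli = cli;
      }
    }
    if (bestCli === undefined) {
      unassigned.push(task.id); // 4. fallback: caller queues these
      continue;
    }
    assignments.set(task.id, bestCli.name);
    load.set(bestCli.name, (load.get(bestCli.name) ?? 0) + 1);
  }
  return { assignments, unassigned };
}

const result = balance(
  [
    { id: 't1', needs: 'code' },
    { id: 't2', needs: 'code' },
    { id: 't3', needs: 'code' },
  ],
  [
    { name: 'claude', capacity: 1, skills: ['reasoning'] },
    { name: 'codex', capacity: 1, skills: ['code'] },
  ],
);
```

In this example the first task goes to the matching CLI, the second spills over to the only CLI with capacity left, and the third is left unassigned for queuing.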
Feedback Integration (IFeedbackIntegration)
Connects routing decisions to outcomes for closed-loop learning.
interface IFeedbackIntegration {
recordRoutingDecision(decision: CompositeRoutingDecision): string;
recordOutcome(routingId: string, outcome: TaskOutcome): void;
getRoutingStats(cliName: CliName): RoutingOutcomeStats;
exportFeedback(): FeedbackExport;
}
interface TaskOutcome {
readonly success: boolean;
readonly latencyMs: number;
readonly tokensUsed?: number;
readonly errorCategory?: string;
}
interface RoutingOutcomeStats {
readonly totalRoutings: number;
readonly successRate: number;
readonly avgLatencyMs: number;
readonly avgTokens: number;
}
Reward Computation
reward = success * 0.5 + (1 - retries / max) * 0.3 + coherence * 0.2;
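Written out as a function, the formula looks like this. The clamping of `retries` and `coherence`, and the handling of a zero retry limit, are defensive assumptions not stated in the formula; `computeReward` is a hypothetical name.

```typescript
function computeReward(
  success: boolean,
  retries: number,
  maxRetries: number,
  coherence: number, // expected in [0, 1]
): number {
  // Fewer retries is better; with no retry budget the term is treated as full credit.
  const retryScore = maxRetries > 0 ? 1 - Math.min(retries, maxRetries) / maxRetries : 1;
  const clampedCoherence = Math.min(1, Math.max(0, coherence));
  return (success ? 1 : 0) * 0.5 + retryScore * 0.3 + clampedCoherence * 0.2;
}

// Success on the second attempt (1 retry of 3 allowed) with coherence 0.5:
// 0.5 + (1 - 1/3) * 0.3 + 0.5 * 0.2 = 0.8
const reward = computeReward(true, 1, 3, 0.5);
```

Because success contributes half of the total, a failed task can score at most 0.5 regardless of retries or coherence.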
CLI Debugging
# Dry-run routing for a task
nexus-agents routing-audit "Implement a sorting algorithm" --format=json
# Output shows:
# - Task profile analysis
# - Budget filter results
# - TOPSIS scores per CLI
# - LinUCB selection with UCB scores
# - Feature importance analysis
# Show bandit statistics
nexus-agents routing-audit "task" --bandit-stats
Configuration
routing:
enableBudgetFilter: true # Stage 2 on/off
enableTopsisRanking: true # Stage 3 on/off
enableLinUCBSelection: true # Stage 4 on/off
budget:
tokenBudget: 1000000 # Session token limit
costBudgetUsd: 10.0 # Session cost limit
resetIntervalMs: 3600000 # 1 hour reset
topsis:
qualityWeight: 0.5
costWeight: 0.3
latencyWeight: 0.2
linucb:
alpha: 1.0 # Exploration parameter
Difficulty Estimation
Tier-routing difficulty estimation is done by the ZeroRouter (see “Source Files” below). The composite-router consumes decision.difficulty / decision.tier from ZeroRouter for fast/balanced/powerful tier selection.
History note (#2940): an alternate DAAOEstimator (VAE-inspired, arXiv:2509.11079) was prototyped under Issue #334 and exported from cli-adapters/index.ts, but #334 ended up being implemented via ZeroRouter, not DAAO. The DAAO surface was retired in #2940; see that issue for the full removal scope. If a true alternate difficulty estimator with different feature weights returns, reintroduce it alongside its wiring stage in the same PR.
Source Files
| File | Purpose |
|---|---|
| src/cli-adapters/composite-router.ts | Main routing pipeline |
| src/cli-adapters/budget-router.ts | Budget enforcement |
| src/cli-adapters/topsis-router.ts | Multi-criteria ranking |
| src/cli-adapters/linucb-bandit.ts | Contextual bandit |
| src/cli-adapters/zero-router.ts | Difficulty estimation |
| src/cli-adapters/circuit-breaker.ts | Fault tolerance |
| src/cli-adapters/cli-detection-cache.ts | Health check caching |
| src/context/token-counter.ts | Token counting |
| src/adapters/capacity-monitor.ts | Rate limit tracking |
| src/learning/feedback-integration.ts | Outcome learning |
| src/cli/routing-audit.ts | Debug CLI command |
Research Sources
| Technique | Paper | Paper-Reported Metrics (not measured on this system) |
|---|---|---|
| PILOT Budget Routing | arXiv:2508.21141 | Budget-constrained routing |
| TOPSIS Multi-Criteria | arXiv:2509.07571 | 31.46% cost reduction (paper benchmark) |
| IPR Quality Routing | arXiv:2509.06274 | 43.9% cost reduction (paper benchmark) |
| RouteLLM Preference | arXiv:2406.18665 | 2x cost reduction (paper benchmark) |
| SATER Confidence | arXiv:2510.05164 | 50%+ cost reduction, 80% latency reduction (paper) |
Related Documents
- Memory System: MEMORY_SYSTEM.md
- Agent System: AGENT_SYSTEM.md
- Full Architecture: ARCHITECTURE.md