Nexus-Agents Alignment Roadmap
Goal: Create the best software development agent swarm possible
Assessment Date: 2026-01-09 (ET) | Last Updated: 2026-04-18 (ET)
Architecture Decision: HYBRID APPROVED (5-0 unanimous)
Current Version: v2.33.2 | Fitness Score: 98/100 | SWE-bench Lite: 65.7% (23/35 Lite subset)
Historical Reviews: docs/archive/system-reviews-2026-01.md
Score Progression
| Date | Score | Milestone |
|---|---|---|
| 2026-01-09 | 6.4/10 | Initial 5-agent consensus |
| 2026-01-13 | 7.28/10 | Observability + docs automation |
| 2026-01-16 | 8.06/10 | v2.2.0 release, security complete |
| 2026-01-23 | 7.78/10 | Skills loader, deeper review |
| 2026-03-04 | 8.0/10 | v2.26.1, feedback loops, research synthesis |
| 2026-03-05 | 8.0/10 | Roadmap accuracy audit, reliability hardening |
| 2026-03-17 | 8.5/10 | v2.28.0, SWE-bench 65.7%, symbol extraction, codebase search, Augment analysis |
| 2026-04-18 | 8.5/10 | v2.33.2, modularization + benchmark extraction, research registry cleanup |
Trend: Baseline 6.4 -> steady 8.5 post-hardening. Phase 6 epic #1401 closed on 2026-04-18; all Phase 6 targets were met or exceeded except security_review (tracked in #1974).
North Star Strengths
These capabilities form the foundation — confirmed by multiple consensus agents:
| Strength | Evidence |
|---|---|
| Research-backed protocols | 42/43 techniques implemented, 1 rejected (registry ground truth) |
| Multi-criteria model routing | TOPSIS + LinUCB bandit + KNN + preference + cascade + tolerance |
| Memory system diversity | 8 backends + reflective MemR3 enhancement |
| Closed-loop learning | Weather report, feedback integration, self-refinement |
| Security posture | 113 pentest tests, Docker sandbox, rate limiting |
| Zero-credential architecture | OAuth 2.0/PKCE, SecretsVault, no stored creds |
| Strong type system | Result<T,E>, Zod validation, strict TypeScript |
| Multi-CLI orchestration | 4 CLIs (claude/gemini/codex/opencode), 10,000+ observed tasks |
Current State (2026-04-18)
System Metrics (as of 2026-04-18)
| Metric | Value |
|---|---|
| Version | v2.33.2 |
| Fitness score | 98/100 |
| Test suite | 24,200+ tests, 924 files |
| Coverage | 89.66% statements, 93.26% functions |
| MCP tools | 30 registered |
| Research techniques | 42 implemented, 1 rejected (43 total, from 176 papers) |
| Routing tasks observed | 10,000 across 4 CLIs |
| Overall success rate | 83.6% all-time / 86.5% 7-day recent-window |
| Expert types | 10 (code, architecture, security, testing, docs, devops, research, product, ux, infrastructure) |
CLI Performance (live weather report, 2026-04-18)
| CLI | Tasks | Success Rate | Best Category |
|---|---|---|---|
| Claude | 8,768 | 81.3% | Testing (100%, n=309), Planning (92%), Exploration (92%) |
| Gemini | 624 | 99.4% | Code Generation (100%), Exploration (100%, n=399) |
| Codex | 608 | 99.2% | Testing (99.5%), Code Generation (99.3%) |
| OpenCode | 0 | — | No recent samples (usage dropped after routing changes) |
Adaptive Routing Insights
Recommended mappings by weather report (high confidence only):
- Code generation -> Gemini (100% success, n=201)
- Testing -> Claude (100% success, n=309)
- Research -> Claude (90% success, n=122)
- Exploration -> Gemini (100% success, n=399)
- Code review -> Claude (74.6%, n=291) — trend declining
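A minimal sketch of how high-confidence mappings like these could be derived from per-CLI category counts. The `CategoryStat` shape, the `recommend` function, and the `minSamples` cutoff of 100 are illustrative assumptions, not the weather report's actual implementation. The sample rows reuse figures quoted in this roadmap; the Claude exploration success count is back-computed from 92.5% of 413.

```typescript
// Hypothetical shape for one (category, CLI) row of the weather report.
interface CategoryStat {
  category: string;
  cli: string;
  successes: number;
  samples: number;
}

// Keep the best CLI per category, ignoring rows with too few samples.
// The minSamples default is an assumed confidence cutoff.
function recommend(stats: CategoryStat[], minSamples = 100): Record<string, CategoryStat> {
  const best: Record<string, CategoryStat> = {};
  for (const s of stats) {
    if (s.samples < minSamples) continue; // low confidence: no recommendation
    const current = best[s.category];
    const rate = s.successes / s.samples;
    if (!current || rate > current.successes / current.samples) {
      best[s.category] = s;
    }
  }
  return best;
}

const picks = recommend([
  { category: "testing", cli: "claude", successes: 309, samples: 309 },
  { category: "exploration", cli: "gemini", successes: 399, samples: 399 },
  { category: "exploration", cli: "claude", successes: 382, samples: 413 },
  { category: "security_review", cli: "codex", successes: 2, samples: 3 }, // filtered out
]);
```

With these rows, `picks` holds an entry for testing and exploration but none for security_review, which matches the "high confidence only" rule.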
Phase Completion Summary
| Phase | Goal | Status | Key Deliverable |
|---|---|---|---|
| 1 | Observability Foundation | COMPLETE | OrchestrationObserver, routing dashboard |
| 2 | Learning Loop | COMPLETE | OutcomeFeedbackCollector, CompositeRouter |
| 3 | Dogfooding | COMPLETE | Claude Code PR review, SICA test gen |
| 4 | Security Hardening | COMPLETE | Docker sandbox, 113 pentest tests |
| 5 | Market Validation | PARTIAL | SWE-bench 65.7% (23/35 Lite subset); full 300-instance Lite run still open |
| 6 | Evidence-Based Hardening | CLOSED (#1401) | Targets hit or exceeded except claude security_review (→ #1974); remaining security/monitoring tracked individually |
Phase details and 8 historical system review transcripts: archive/system-reviews-2026-01.md
Open Gaps
Gap 1: No Production Benchmark Evidence (P1) — IN PROGRESS
Impact: 8/10 (reduced from 9/10) — Infrastructure ready, execution pending
The SWE-bench module has parallel execution (#1407), cross-run memory enrichment, partial git clones, and thread-safe JSONL output. First evaluation published: 65.7% resolution rate (23/35 on SWE-bench Lite). Test-aware prompting (FAIL_TO_PASS + test_patch in prompt) flipped django__django-14534 from failed to resolved. Analysis of 12 unresolved instances shows 100% file targeting accuracy but incorrect patch logic — all failures are “right location, wrong fix.”
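Test-aware prompting as described above could look roughly like the sketch below. `FAIL_TO_PASS`, `test_patch`, `problem_statement`, and `instance_id` are SWE-bench dataset fields; the `buildTestAwarePrompt` function, the prompt wording, and the example instance are hypothetical, and `FAIL_TO_PASS` is assumed to be already parsed into an array.

```typescript
// Subset of SWE-bench instance fields used by this sketch.
interface SweBenchInstance {
  instance_id: string;
  problem_statement: string;
  FAIL_TO_PASS: string[]; // assumed pre-parsed from the dataset's JSON string
  test_patch: string;
}

// Build a prompt that tells the agent which tests must flip to passing.
function buildTestAwarePrompt(inst: SweBenchInstance): string {
  return [
    `Issue (${inst.instance_id}):`,
    inst.problem_statement,
    "",
    "Tests that currently fail and must pass after your patch:",
    ...inst.FAIL_TO_PASS.map((t) => `- ${t}`),
    "",
    "Test patch applied before evaluation:",
    inst.test_patch,
  ].join("\n");
}

// Entirely made-up instance, for illustration only.
const examplePrompt = buildTestAwarePrompt({
  instance_id: "example__repo-1",
  problem_statement: "Widget ids are not propagated to labels.",
  FAIL_TO_PASS: ["tests/test_widgets.py::test_label_id"],
  test_patch: "diff --git a/tests/test_widgets.py b/tests/test_widgets.py",
});
```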
Completed:
- Parallel runner with `--concurrency=N` workers and per-slot isolated work directories
- Cross-run memory enrichment (SessionMemory records outcomes, enriches future prompts)
- Git clone optimization (partial clones, skip-fetch for existing repos)
- CLI agent executor with Claude CLI OAuth authentication
- Instance priority sorting by estimated difficulty (repo complexity + problem length + memory data)
- CLI executor timeout optimized from 10min to 5min (median solve time is 2-4min)
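The parallel runner idea (N workers, each with its own work directory) can be sketched as a small promise pool. This is an illustration under assumptions, not the module's code: `runWithSlots`, the `baseDir` path, and the `slot-<n>` directory naming are hypothetical.

```typescript
// Run items with at most `concurrency` in flight. Each worker slot owns one
// work directory, so parallel runs never share a checkout.
async function runWithSlots<T, R>(
  items: T[],
  concurrency: number,
  run: (item: T, workDir: string) => Promise<R>,
  baseDir = "/tmp/swe-bench", // hypothetical location
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because callbacks run on a single thread

  const worker = async (slot: number): Promise<void> => {
    const workDir = `${baseDir}/slot-${slot}`;
    while (next < items.length) {
      const index = next++;
      results[index] = await run(items[index], workDir);
    }
  };

  const slots = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: slots }, (_, slot) => worker(slot)));
  return results;
}
```

Results keep the input order regardless of which slot finished first, which keeps downstream JSONL output easy to compare across runs.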
Required: Execute SWE-bench Lite (300 instances), publish results, compare against leaderboard.
Gap 2: Swarm-Level Observability (P2) — MOSTLY RESOLVED
Impact: 3/10 (reduced from 7/10) — Core observability complete, visualization is a stretch goal
Resolved:
- `nexus-agents health` CLI command implemented with per-CLI performance breakdown
- All 5 swarm health metrics measured (utilization, efficiency, accuracy, regret, adaptation speed)
- JSONL tracing infrastructure with `query_trace` MCP tool
- #1403 closed: all success criteria met
New (2026-03-05): Wave dispatch events (wave.started, wave.completed) emitted to pipeline EventBus and captured by TraceWriter (5986290e). Queryable via query_trace tool.
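A rough sketch of the event flow described here: wave events go onto a bus, and a writer records one JSON object per line. Only the event names come from the roadmap. The `EventBus` and `TraceWriter` classes below are simplified stand-ins that share the names, and the payload fields and the `query` helper are assumptions.

```typescript
// Hypothetical payloads; only the event type strings appear in the roadmap.
type WaveEvent =
  | { type: "wave.started"; waveId: number; taskCount: number }
  | { type: "wave.completed"; waveId: number; succeeded: number; failed: number };

type Listener = (event: WaveEvent) => void;

class EventBus {
  private listeners: Listener[] = [];
  subscribe(listener: Listener): void {
    this.listeners.push(listener);
  }
  emit(event: WaveEvent): void {
    for (const listener of this.listeners) listener(event);
  }
}

// Keeps JSONL lines in memory; a real writer would append them to a file.
class TraceWriter {
  readonly lines: string[] = [];
  attach(bus: EventBus): void {
    bus.subscribe((event) => this.lines.push(JSON.stringify({ ts: Date.now(), ...event })));
  }
  // Stand-in for a trace query: parse each line and filter by event type.
  query(type: WaveEvent["type"]): Array<WaveEvent & { ts: number }> {
    return this.lines.map((line) => JSON.parse(line)).filter((record) => record.type === type);
  }
}

const bus = new EventBus();
const writer = new TraceWriter();
writer.attach(bus);
bus.emit({ type: "wave.started", waveId: 1, taskCount: 3 });
bus.emit({ type: "wave.completed", waveId: 1, succeeded: 3, failed: 0 });
```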
Remaining (stretch): Trace visualization UI for multi-wave dispatches, cross-session learning dashboard.
Gap 3: OpenCode CLI Reliability (P2) — STALLED (no usage)
Impact: 1/10 — OpenCode has 0 samples in the current 10k-task window. Routing changes effectively removed OpenCode from the active dispatch pool. Adapter fixes (#1402) were deployed but cannot be empirically validated without usage. If OpenCode is no longer a target CLI, this gap should be retired rather than monitored.
Status as of 2026-04-18: no action required until OpenCode either returns to active use or is formally removed.
Gap 4: Security Review Accuracy (P2) — PARTIALLY MITIGATED, STILL BELOW TARGET
Impact: 5/10 (reduced from 6/10). Fixes are deployed and claude's success rate rose from 30% to 53.9% (590 samples), but the trend is declining (confidence 1.0) and the 60% target is not met. Adaptive routing is deweighting claude for this category (adaptiveBonus −5.4). Codex, now primary for security_review, has only 3 samples (66.7%), too few to confirm that the new primary performs consistently.
- Security expert prompt improved with concrete JSON example + plaintext fallback guidance
- Routing realigned: primaryCli changed from claude to codex, bonus 15→7
- Failure classification now prioritizes message-based patterns (86+ keywords) over coarse error types
Tracked in #1974 for focused investigation: why claude sec_review is still declining and why codex sec_review is underutilized.
Gap 5: Exploration Success Rate (P3) — RESOLVED
Impact: 0/10 — Target exceeded. Exploration overall is 96.2% (812 samples). Claude exploration 92.5% (413 samples), Gemini 100% (399 samples). Gap closed.
Original target (>70%) is comfortably surpassed and claude is trending stable (not declining). No further action required.
Gap 6: Failure Classification (P2) — RESOLVED
Impact: 0/10 — Target exceeded. Current unknown rate is 0.2% (3/1645 failures), far below the 30% target. Current breakdown: execution 43.3%, timeout 28%, parse 16.4%, adapter_unavailable 4.4%, rate_limit 4.3%, validation 3.5%, unknown 0.2%.
Fixes deployed:
- Message-based classification with 100+ keyword patterns (was 86+)
- Added: SSL/TLS/proxy (→connection), ENOMEM (→crash), HTTP 5xx/bad gateway/service unavailable (→execution), max retries (→rate_limit), truncated/incomplete (→execution)
- Added broad catch-all execution patterns: error:, failed, failure, exception, not found, invalid, unable to, could not, unexpected, missing, unsupported
- Non-zero exit with error stderr now classified as EXECUTION_ERROR (#1402)
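The ordering idea behind these fixes (specific message patterns first, broad catch-alls last, exit code as a final fallback) can be sketched as follows. The keyword groups mirror the ones listed above, but the `RULES` table, the regexes, and `classifyFailure` are illustrative assumptions, and only a subset of the categories is shown.

```typescript
type FailureCategory = "connection" | "crash" | "rate_limit" | "execution" | "unknown";

// Ordered rules: the first matching pattern wins, so specific ones come first.
const RULES: Array<[RegExp, FailureCategory]> = [
  [/\b(ssl|tls|proxy)\b/i, "connection"],
  [/\benomem\b/i, "crash"],
  [/max retries/i, "rate_limit"],
  [/\bhttp 5\d\d\b|bad gateway|service unavailable|truncated|incomplete/i, "execution"],
  // Broad catch-alls, deliberately last.
  [/error:|failed|failure|exception|not found|invalid|unable to|could not|unexpected|missing|unsupported/i, "execution"],
];

function classifyFailure(message: string, exitCode = 0): FailureCategory {
  for (const [pattern, category] of RULES) {
    if (pattern.test(message)) return category;
  }
  // Non-zero exit with some stderr text still counts as an execution error.
  return exitCode !== 0 && message.trim() !== "" ? "execution" : "unknown";
}
```

For example, "SSL handshake failed" contains the catch-all keyword "failed" but is classified as connection because the SSL rule is checked first.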
Status as of 2026-04-18: Target met (0.2% << 30%). Gap closed.
Forward-Looking: Phase 6 Closure (2026-04-18)
Tracking: #1401 CLOSED. Follow-ups: #1974 (security expert investigation), #1975 (capability-instruction-tuning decision).
Registry ground truth (2026-04-18): 42 techniques implemented, 1 rejected out of 43 total. The previous figures of “32 implemented / 14 partial of 46 mapped” were from an older snapshot and no longer reflect the registry.
6.1: Evidence-Based Validation — HANDED OFF
- SWE-bench full-scale execution tracked in #1414 (full orchestration pipeline), #1574 (Verified + Pro context management), #1575 (Pro support with Scale Labs harness)
- Demo workflows + regression benchmarking in CI remain open as individual enhancement items; no active epic
6.2: Swarm Intelligence Observability — MOSTLY COMPLETE
- All 5 swarm health metrics computed in weather report (#1403 CLOSED)
- `nexus-agents health` CLI with per-CLI performance breakdown
- JSONL tracing infrastructure + `query_trace` MCP tool
- Wave dispatch events (`wave.started`/`wave.completed`) in pipeline EventBus
- Trace visualization UI for multi-wave dispatches (stretch)
- Cross-session learning dashboard (stretch)
6.3: Reliability Hardening — COMPLETE except one item
- Exploration timeout aligned with centralized config: 90s -> 180s (#1405 CLOSED)
- TOPSIS tolerance band (5%) for routing diversity (#1405)
- Failure classification: message-based patterns prioritized (#1401)
- Security expert prompt improved with example output (#1401)
- OpenCode model alias resolution from canonical registry (#1402)
- Expanded stderr error pattern detection for provider errors (#1402)
- Exploration success 57% -> 96.2% (target 70%, exceeded)
- Unknown failures 60.4% -> 0.2% (target <30%, exceeded)
- [~] OpenCode: 0 samples in current window (usage collapsed; gap stalled and a candidate for retirement, see Gap 3)
- Security expert claude 30% -> 53.9%, trend declining (target 60% not met) — tracked in #1974
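The TOPSIS tolerance band item above can be illustrated with a minimal sketch: score every candidate CLI with standard TOPSIS, then treat all candidates within 5% of the top score as eligible so routing does not always pick the same CLI. The criteria, weights, sample values, and the reading of "5%" as relative to the top score are assumptions. How the router chooses among eligible candidates is not specified here.

```typescript
interface Criterion {
  weight: number;
  benefit: boolean; // false means lower values are better (a cost)
}

// Standard TOPSIS: returns one closeness score in [0, 1] per row of `matrix`.
function topsis(matrix: number[][], criteria: Criterion[]): number[] {
  // 1. Vector-normalize each column, then apply the criterion weight.
  const norms = criteria.map(
    (_, j) => Math.sqrt(matrix.reduce((sum, row) => sum + row[j] * row[j], 0)) || 1,
  );
  const weighted = matrix.map((row) => row.map((v, j) => (v / norms[j]) * criteria[j].weight));

  // 2. Ideal and anti-ideal value per criterion.
  const column = (j: number): number[] => weighted.map((row) => row[j]);
  const ideal = criteria.map((c, j) => (c.benefit ? Math.max(...column(j)) : Math.min(...column(j))));
  const anti = criteria.map((c, j) => (c.benefit ? Math.min(...column(j)) : Math.max(...column(j))));

  // 3. Closeness = distance to anti-ideal / (distance to ideal + distance to anti-ideal).
  const dist = (row: number[], ref: number[]): number =>
    Math.sqrt(row.reduce((sum, v, j) => sum + (v - ref[j]) * (v - ref[j]), 0));
  return weighted.map((row) => {
    const dPlus = dist(row, ideal);
    const dMinus = dist(row, anti);
    return dPlus + dMinus === 0 ? 0 : dMinus / (dPlus + dMinus);
  });
}

// Indices of candidates whose score is within `tolerance` (relative) of the best.
function withinTolerance(scores: number[], tolerance = 0.05): number[] {
  const top = Math.max(...scores);
  return scores
    .map((score, index) => ({ score, index }))
    .filter((entry) => entry.score >= top * (1 - tolerance))
    .map((entry) => entry.index);
}

// Hypothetical candidates: [success rate, latency in seconds].
const scores = topsis(
  [[0.99, 30], [0.98, 31], [0.80, 90]],
  [{ weight: 0.7, benefit: true }, { weight: 0.3, benefit: false }],
);
const eligible = withinTolerance(scores);
```

In this example the first two candidates land inside the band and the third does not, so either of the first two could be dispatched.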
6.4: Research-Driven Improvements — SUBSTANTIALLY COMPLETE
Registry is clean: 42/43 techniques implemented, 1 rejected (latent-space-sharing). All techniques once marked partial have been promoted to implemented or rejected with rationale. See docs/research/registry/techniques.yaml for ground truth.
Remaining open decision: capability-instruction-tuning (arxiv-2502.17282) — tracked in #1975 as a single-scope decision ticket.
Composite Success Metrics
Target Score Trajectory
| Phase | Target | Key Indicator |
|---|---|---|
| Current | ~8.5/10 | All infra phases done |
| After Phase 6 | 9.0/10 | Market validated |
| North Star | 10/10 | Best-in-class evidence |
Swarm Health Metrics (via weather_report tool)
| Metric | Definition | Healthy Range | Current | Status |
|---|---|---|---|---|
| Agent Utilization | % of expert roles actively succeeding | 70-90% | 0% | Needs recalibration — see note below |
| Collaboration Efficiency | Successful / total delegate tasks | > 10% | 71.9% | Healthy |
| Routing Accuracy | % routed to good CLIs (within 5% of best/cat) | > 50% | 94.9% | Healthy |
| Weekly Regret | Avg gap vs best-possible rate/category | Decreasing | 0.113 | Improving (was 0.179 on 2026-03-05) |
| Adaptation Speed | Avg samples for best CLI/category to converge | < 50 | 526 | Above healthy range — bandit coverage hasn’t converged for low-traffic categories |
Note: Routing Accuracy and Adaptation Speed were recalculated on 2026-03-05. The previous values (25.4% accuracy, 249 samples for adaptation) came from flawed calculations: accuracy counted only the single best CLI per category (so roughly 25% is the expected value with 4 CLIs), and adaptation averaged all CLI/category pairs instead of the best CLI per category.
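To make the Routing Accuracy and Weekly Regret definitions concrete, here is a minimal sketch under stated assumptions: the 5% band is read as absolute percentage points, regret is sample-weighted, and the `RouteStat` shape and function names are hypothetical. The rows reuse exploration and testing figures from this roadmap; such a small subset will not reproduce the reported 94.9% and 0.113.

```typescript
interface RouteStat {
  category: string;
  cli: string;
  samples: number;
  successRate: number; // 0..1
}

// Best observed success rate per category.
function bestRateByCategory(stats: RouteStat[]): Record<string, number> {
  const best: Record<string, number> = {};
  for (const s of stats) {
    best[s.category] = Math.max(best[s.category] ?? 0, s.successRate);
  }
  return best;
}

// Share of tasks routed to a CLI within `band` of the category's best rate.
function routingAccuracy(stats: RouteStat[], band = 0.05): number {
  const best = bestRateByCategory(stats);
  const total = stats.reduce((n, s) => n + s.samples, 0);
  const good = stats
    .filter((s) => s.successRate >= best[s.category] - band)
    .reduce((n, s) => n + s.samples, 0);
  return total === 0 ? 0 : good / total;
}

// Sample-weighted average gap between the best rate and the rate actually obtained.
function averageRegret(stats: RouteStat[]): number {
  const best = bestRateByCategory(stats);
  const total = stats.reduce((n, s) => n + s.samples, 0);
  const gap = stats.reduce((sum, s) => sum + s.samples * (best[s.category] - s.successRate), 0);
  return total === 0 ? 0 : gap / total;
}

const sampleStats: RouteStat[] = [
  { category: "exploration", cli: "claude", samples: 413, successRate: 0.925 },
  { category: "exploration", cli: "gemini", samples: 399, successRate: 1.0 },
  { category: "testing", cli: "claude", samples: 309, successRate: 1.0 },
];
const accuracy = routingAccuracy(sampleStats);
const regret = averageRegret(sampleStats);
```

Counting every CLI inside the band as "good", rather than only the single best one, is what separates this calculation from the earlier flawed version.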
Appendix: Original Voting Agent Reasoning
From the initial 5-agent consensus assessment (2026-01-09):
Architect (7/10): “The architecture is sound for a multi-agent system. The gap is from ‘multi-agent system’ to ‘swarm with emergent collective intelligence.’ That’s the difference between 7 and 10.”
AI/ML (7/10): “The swarm has excellent static decision-making infrastructure but cannot yet learn and improve from its own experience.”
DevEx (6/10): “The technology is impressive. The experience needs work.”
Security (6/10): “The security posture is thoughtful but incomplete. The most critical immediate action is sandboxing agent execution.”
PM (6/10): “The gap is not technical capability - it is market proof. The swarm must demonstrate it is best, not just claim best-in-class architecture.”
Key insight that still applies: Technique coverage (97.7% implemented of 43 total registry techniques, as of 2026-04-18) is near-complete, but coverage means nothing without user validation. The path from 8.5 to 10 is evidence (SWE-bench, demo workflows, regression CI — tracked in #1414/#1574/#1575), not more architecture or more paper implementations.
Generated via CLAUDE.md Consensus Voting Protocol
Consolidation approved 3/3 (unanimous) on 2026-03-05
Historical system reviews archived to docs/archive/system-reviews-2026-01.md