Nexus-Agents Alignment Roadmap

Goal: Create the best software development agent swarm possible Assessment Date: 2026-01-09 (ET) | Last Updated: 2026-04-18 (ET) Architecture Decision: HYBRID APPROVED (5-0 unanimous) Current Version: v2.33.2 | Fitness Score: 98/100 | SWE-bench Lite: 65.7% (23/35 Lite subset) Historical Reviews: docs/archive/system-reviews-2026-01.md


Score Progression

DateScoreMilestone
2026-01-096.4/10Initial 5-agent consensus
2026-01-137.28/10Observability + docs automation
2026-01-168.06/10v2.2.0 release, security complete
2026-01-237.78/10Skills loader, deeper review
2026-03-048.0/10v2.26.1, feedback loops, research synthesis
2026-03-058.0/10Roadmap accuracy audit, reliability hardening
2026-03-178.5/10v2.28.0, SWE-bench 65.7%, symbol extraction, codebase search, Augment analysis
2026-04-188.5/10v2.33.2, modularization + benchmark extraction, research registry cleanup

Trend: Baseline 6.4 -> steady 8.5 post-hardening. Phase 6 epic #1401 closed on this date; remaining targets met or exceeded except security_review (tracked in #1974).


North Star Strengths

These capabilities form the foundation — confirmed by multiple consensus agents:

StrengthEvidence
Research-backed protocols42/43 techniques implemented, 1 rejected (registry ground truth)
Multi-criteria model routingTOPSIS + LinUCB bandit + KNN + preference + cascade + tolerance
Memory system diversity8 backends + reflective MemR3 enhancement
Closed-loop learningWeather report, feedback integration, self-refinement
Security posture113 pentest tests, Docker sandbox, rate limiting
Zero-credential architectureOAuth 2.0/PKCE, SecretsVault, no stored creds
Strong type systemResult<T,E>, Zod validation, strict TypeScript
Multi-CLI orchestration4 CLIs (claude/gemini/codex/opencode), 10,000+ observed tasks

Current State (2026-04-18)

System Metrics (as of 2026-04-18)

MetricValue
Versionv2.33.2
Fitness score98/100
Test suite24,200+ tests, 924 files
Coverage89.66% statements, 93.26% functions
MCP tools30 registered
Research techniques42 implemented, 1 rejected (43 total, from 176 papers)
Routing tasks observed10,000 across 4 CLIs
Overall success rate83.6% all-time / 86.5% 7-day recent-window
Expert types10 (code, architecture, security, testing, docs, devops, research, product, ux, infrastructure)

CLI Performance (live weather report, 2026-04-18)

CLITasksSuccess RateBest Category
Claude8,76881.3%Testing (100%, n=309), Planning (92%), Exploration (92%)
Gemini62499.4%Code Generation (100%), Exploration (100%, n=399)
Codex60899.2%Testing (99.5%), Code Generation (99.3%)
OpenCode0No recent samples (usage dropped after routing changes)

Adaptive Routing Insights

Recommended mappings by weather report (high confidence only):

  • Code generation -> Gemini (100% success, n=201)
  • Testing -> Claude (100% success, n=309)
  • Research -> Claude (90% success, n=122)
  • Exploration -> Gemini (100% success, n=399)
  • Code review -> Claude (74.6%, n=291) — trend declining

Phase Completion Summary

PhaseGoalStatusKey Deliverable
1Observability FoundationCOMPLETEOrchestrationObserver, routing dashboard
2Learning LoopCOMPLETEOutcomeFeedbackCollector, CompositeRouter
3DogfoodingCOMPLETEClaude Code PR review, SICA test gen
4Security HardeningCOMPLETEDocker sandbox, 113 pentest tests
5Market ValidationPARTIALSWE-bench 65.7% (23/35 Lite subset); full 300 still open
6Evidence-Based HardeningCLOSED (#1401)Targets hit or exceeded except claude security_review (→ #1974); remaining security/monitoring tracked individually

Phase details and 8 historical system review transcripts: archive/system-reviews-2026-01.md


Open Gaps

Gap 1: No Production Benchmark Evidence (P1) — IN PROGRESS

Impact: 8/10 (reduced from 9/10) — Infrastructure ready, execution pending

The SWE-bench module has parallel execution (#1407), cross-run memory enrichment, partial git clones, and thread-safe JSONL output. First evaluation published: 65.7% resolution rate (23/35 on SWE-bench Lite). Test-aware prompting (FAIL_TO_PASS + test_patch in prompt) flipped django__django-14534 from failed to resolved. Analysis of 12 unresolved instances shows 100% file targeting accuracy but incorrect patch logic — all failures are “right location, wrong fix.”

Completed:

  • Parallel runner with --concurrency=N workers and per-slot isolated work directories
  • Cross-run memory enrichment (SessionMemory records outcomes, enriches future prompts)
  • Git clone optimization (partial clones, skip-fetch for existing repos)
  • CLI agent executor with Claude CLI OAuth authentication
  • Instance priority sorting by estimated difficulty (repo complexity + problem length + memory data)
  • CLI executor timeout optimized from 10min to 5min (median solve time is 2-4min)

Required: Execute SWE-bench Lite (300 instances), publish results, compare against leaderboard.

Gap 2: Swarm-Level Observability (P2) — MOSTLY RESOLVED

Impact: 3/10 (reduced from 7/10) — Core observability complete, visualization is a stretch goal

Resolved:

  • nexus-agents health CLI command implemented with per-CLI performance breakdown
  • All 5 swarm health metrics measured (utilization, efficiency, accuracy, regret, adaptation speed)
  • JSONL tracing infrastructure with query_trace MCP tool
  • #1403 closed — all success criteria met

New (2026-03-05): Wave dispatch events (wave.started, wave.completed) emitted to pipeline EventBus and captured by TraceWriter (5986290e). Queryable via query_trace tool.

Remaining (stretch): Trace visualization UI for multi-wave dispatches, cross-session learning dashboard.

Gap 3: OpenCode CLI Reliability (P2) — STALLED (no usage)

Impact: 1/10 — OpenCode has 0 samples in the current 10k-task window. Routing changes effectively removed OpenCode from the active dispatch pool. Adapter fixes (#1402) were deployed but cannot be empirically validated without usage. If OpenCode is no longer a target CLI, this gap should be retired rather than monitored.

Status as of 2026-04-18: no action required until OpenCode either returns to active use or is formally removed.

Gap 4: Security Review Accuracy (P2) — PARTIALLY MITIGATED, STILL BELOW TARGET

Impact: 5/10 (partially reduced from 6/10) — Fixes deployed; claude success rose from 30% → 53.9% (590 samples) but trend is declining (confidence 1.0). Target of 60% not met. Adaptive routing is deweighting claude for this category (adaptiveBonus −5.4). Codex primary for security_review has only 3 samples (66.7%) — insufficient to confirm the new primary lands consistently.

  • Security expert prompt improved with concrete JSON example + plaintext fallback guidance
  • Routing realigned: primaryCli changed from claude to codex, bonus 15→7
  • Failure classification now prioritizes message-based patterns (86+ keywords) over coarse error types

Tracked in #1974 for focused investigation: why claude sec_review is still declining and why codex sec_review is underutilized.

Gap 5: Exploration Success Rate (P3) — RESOLVED

Impact: 0/10 — Target exceeded. Exploration overall is 96.2% (812 samples). Claude exploration 92.5% (413 samples), Gemini 100% (399 samples). Gap closed.

Original target (>70%) is comfortably surpassed and claude is trending stable (not declining). No further action required.

Gap 6: Failure Classification (P2) — RESOLVED

Impact: 0/10 — Target exceeded. Current unknown rate is 0.2% (3/1645 failures), far below the 30% target. Current breakdown: execution 43.3%, timeout 28%, parse 16.4%, adapter_unavailable 4.4%, rate_limit 4.3%, validation 3.5%, unknown 0.2%.

Fixes deployed:

  • Message-based classification with 100+ keyword patterns (was 86+)
  • Added: SSL/TLS/proxy (→connection), ENOMEM (→crash), HTTP 5xx/bad gateway/service unavailable (→execution), max retries (→rate_limit), truncated/incomplete (→execution)
  • Added broad catch-all execution patterns: error:, failed, failure, exception, not found, invalid, unable to, could not, unexpected, missing, unsupported
  • Non-zero exit with error stderr now classified as EXECUTION_ERROR (#1402)

Status as of 2026-04-18: Target met (0.2% << 30%). Gap closed.


Forward-Looking: Phase 6 Closure (2026-04-18)

Tracking: #1401 CLOSED. Follow-ups: #1974 (security expert investigation), #1975 (capability-instruction-tuning decision).

Registry ground truth (2026-04-18): 42 techniques implemented, 1 rejected out of 43 total. The previous figures of “32 implemented / 14 partial of 46 mapped” were from an older snapshot and no longer reflect the registry.

6.1: Evidence-Based Validation — HANDED OFF

  • SWE-bench full-scale execution tracked in #1414 (full orchestration pipeline), #1574 (Verified + Pro context management), #1575 (Pro support with Scale Labs harness)
  • Demo workflows + regression benchmarking in CI remain open as individual enhancement items; no active epic

6.2: Swarm Intelligence Observability — MOSTLY COMPLETE

  • All 5 swarm health metrics computed in weather report (#1403 CLOSED)
  • nexus-agents health CLI with per-CLI performance breakdown
  • JSONL tracing infrastructure + query_trace MCP tool
  • Wave dispatch events (wave.started/wave.completed) in pipeline EventBus
  • Trace visualization UI for multi-wave dispatches (stretch)
  • Cross-session learning dashboard (stretch)

6.3: Reliability Hardening — COMPLETE except one item

  • Exploration timeout aligned with centralized config: 90s -> 180s (#1405 CLOSED)
  • TOPSIS tolerance band (5%) for routing diversity (#1405)
  • Failure classification: message-based patterns prioritized (#1401)
  • Security expert prompt improved with example output (#1401)
  • OpenCode model alias resolution from canonical registry (#1402)
  • Expanded stderr error pattern detection for provider errors (#1402)
  • Exploration success 57% -> 96.2% (target 70%, exceeded)
  • Unknown failures 60.4% -> 0.2% (target <30%, exceeded)
  • [~] OpenCode: 0 samples in current window (usage collapsed; gap retired — see Gap 3)
  • Security expert claude 30% -> 53.9%, trend declining (target 60% not met) — tracked in #1974

6.4: Research-Driven Improvements — SUBSTANTIALLY COMPLETE

Registry is clean: 42/43 techniques implemented, 1 rejected (latent-space-sharing). All techniques once marked partial have been promoted to implemented or rejected with rationale. See docs/research/registry/techniques.yaml for ground truth.

Remaining open decision: capability-instruction-tuning (arxiv-2502.17282) — tracked in #1975 as a single-scope decision ticket.


Composite Success Metrics

Target Score Trajectory

PhaseTargetKey Indicator
Current~8.0/10All infra phases done
After Phase 69.0/10Market validated
North Star10/10Best-in-class evidence

Swarm Health Metrics (via weather_report tool)

MetricDefinitionHealthy RangeCurrentStatus
Agent Utilization% of expert roles actively succeeding70-90%0%Needs recalibration — see note below
Collaboration EfficiencySuccessful / total delegate tasks> 0.171.9%Healthy
Routing Accuracy% routed to good CLIs (within 5% of best/cat)> 50%94.9%Healthy
Weekly RegretAvg gap vs best-possible rate/categoryDecreasing0.113Improving (was 0.179 on 2026-03-05)
Adaptation SpeedAvg samples for best CLI/category to converge< 50526Above healthy range — bandit coverage hasn’t converged for low-traffic categories

Note: Routing Accuracy and Adaptation Speed metrics were corrected on 2026-03-05. Previous values (25.4% accuracy, 249 adaptation) used flawed calculations — accuracy counted only single-best CLI (expected ~25% with 4 CLIs), adaptation averaged all pairs instead of best-per-category.


Appendix: Original Voting Agent Reasoning

From the initial 5-agent consensus assessment (2026-01-09):

Architect (7/10): “The architecture is sound for a multi-agent system. The gap is from ‘multi-agent system’ to ‘swarm with emergent collective intelligence.’ That’s the difference between 7 and 10.”

AI/ML (7/10): “The swarm has excellent static decision-making infrastructure but cannot yet learn and improve from its own experience.”

DevEx (6/10): “The technology is impressive. The experience needs work.”

Security (6/10): “The security posture is thoughtful but incomplete. The most critical immediate action is sandboxing agent execution.”

PM (6/10): “The gap is not technical capability - it is market proof. The swarm must demonstrate it is best, not just claim best-in-class architecture.”

Key insight that still applies: Technique coverage (97.7% implemented of 43 total registry techniques, as of 2026-04-18) is near-complete, but coverage means nothing without user validation. The path from 8.5 to 10 is evidence (SWE-bench, demo workflows, regression CI — tracked in #1414/#1574/#1575), not more architecture or more paper implementations.


Generated via CLAUDE.md Consensus Voting Protocol Consolidation approved 3/3 (unanimous) on 2026-03-05 Historical system reviews archived to docs/archive/system-reviews-2026-01.md