Nexus-Agents Alignment Roadmap

Goal: A self-tuning governance substrate of composable building blocks — excellent at coding agents today, generalizing toward arbitrary goals — that closes the loop so the system measurably improves itself. Assessment Date: 2026-01-09 (ET) | Last Updated: 2026-05-31 (ET) Architecture Decision: HYBRID APPROVED (5-0 unanimous); Closed-Loop Direction RATIFIED (7-0 higher_order, 2026-05-31) Current Version: v2.92.3 | Historical Reviews: docs/archive/system-reviews-2026-01.md

Ratified Closed-Loop Direction (2026-05-31)

A 13-domain full-codebase fan-out review (14 agents) produced one structural finding, consistent across every domain: the autonomous loop is a “C”, not an “O”. The PRODUCER half is built and persists faithfully (OutcomeStore, FileAuditStorage, fitness-score, research synthesis, SwarmObserver, self-eval); the CONSUMER/tune half is unwired — signals are produced then printed, filed as issues, or written to stores nothing reads. Several integration bridges (ensureFeedbackSubscriber, onRetirement, firewall policyEnforcement stage, compiled-plan policy gates) already exist in code but are never invoked — so closing the loop is mostly a wiring task, not a build task.

Refined mission: nexus-agents is a governance substrate of composable building blocks — research, planning, voting/consensus, orchestration, QA, security — that compose into pipelines and feed every action into a durable logging substrate. It is built to close the loop: outcomes, audits, evaluations, and fitness signals flow back to tune routing, thresholds, and plans, so the system measurably improves itself rather than only reporting on itself. Shipped via npm as documented, exported primitives; excellent at coding-agent governance today and generalizing toward arbitrary goals; scaffolding that lets newer, more capable models operate safely under enforced rules.

Phased plan (tracked in epic #3143; each phase lands behind its own consensus_vote):

Phase	Focus	Issue
P0	Correctness edges + invoke already-written bridges + scheduled `improvement_review`	#3144
P1	Persistent + injectable OutcomeStore / singletons	#3145
P1	Unify + persist audit/outcome substrate (one TaskOutcome schema)	#3146
P2	SignalBus + bounded Tune stage (keystone)	#3147
P1/P2	Close the knowledge loop (self-eval→OutcomeStore, research→context)	#3148
P3	Publish composition primitives + `COMPOSITION_PATTERNS`	#3149
P4	Routing returns `(CLI, model)` + cost-enforcement stage	#3150
P5	Fully-autonomous orchestrator (SELECT→…→TUNE→REPEAT)	#3151

Binding conditions from the ratifying vote: phase-gating is the safeguard (this ratifies direction, not the P2/P5 designs); the P2 SignalBus must extend the existing pipeline EventBus, not fork it, and ship bounded/audited/kill-switched with bounded-effect tests before any auto-action; TaskOutcome unification is a breaking exported-contract change requiring the unanimous gate + migration path; “arbitrary goals” stays framing only until a concrete second domain is voted; P3 exports are driven by a real consumer, not shipped speculatively.

Score Progression

Date	Score	Milestone
2026-01-09	6.4/10	Initial 5-agent consensus
2026-01-13	7.28/10	Observability + docs automation
2026-01-16	8.06/10	v2.2.0 release, security complete
2026-01-23	7.78/10	Skills loader, deeper review
2026-03-04	8.0/10	v2.26.1, feedback loops, research synthesis
2026-03-05	8.0/10	Roadmap accuracy audit, reliability hardening
2026-03-17	8.5/10	v2.28.0, SWE-bench 65.7%, symbol extraction, codebase search, Augment analysis
2026-04-18	8.5/10	v2.33.2, modularization + benchmark extraction, research registry cleanup

Trend: Baseline 6.4 -> steady 8.5 post-hardening. Phase 6 epic #1401 closed on this date; remaining targets met or exceeded except security_review (tracked in #1974).

North Star Strengths

These capabilities form the foundation — confirmed by multiple consensus agents:

Strength	Evidence
Research-backed protocols	42/43 techniques implemented, 1 rejected (registry ground truth)
Multi-criteria model routing	TOPSIS + LinUCB bandit + KNN + preference + cascade + tolerance
Memory system diversity	8 backends + reflective MemR3 enhancement
Closed-loop learning	Weather report, feedback integration, self-refinement
Security posture	113 pentest tests, Docker sandbox, rate limiting
Zero-credential architecture	OAuth 2.0/PKCE, SecretsVault, no stored creds
Strong type system	Result<T,E>, Zod validation, strict TypeScript
Multi-CLI orchestration	4 CLIs (claude/gemini/codex/opencode), 10,000+ observed tasks

Current State (2026-04-18)

System Metrics (as of 2026-04-18)

Metric	Value
Version	v2.33.2
Fitness score	98/100
Test suite	24,200+ tests, 924 files
Coverage	89.66% statements, 93.26% functions
MCP tools	30 registered
Research techniques	42 implemented, 1 rejected (43 total, from 176 papers)
Routing tasks observed	10,000 across 4 CLIs
Overall success rate	83.6% all-time / 86.5% 7-day recent-window
Expert types	10 (code, architecture, security, testing, docs, devops, research, product, ux, infrastructure)

CLI Performance (live weather report, 2026-04-18)

CLI	Tasks	Success Rate	Best Category
Claude	8,768	81.3%	Testing (100%, n=309), Planning (92%), Exploration (92%)
Gemini	624	99.4%	Code Generation (100%), Exploration (100%, n=399)
Codex	608	99.2%	Testing (99.5%), Code Generation (99.3%)
OpenCode	0	—	No recent samples (usage dropped after routing changes)

Adaptive Routing Insights

Recommended mappings by weather report (high confidence only):

Code generation -> Gemini (100% success, n=201)
Testing -> Claude (100% success, n=309)
Research -> Claude (90% success, n=122)
Exploration -> Gemini (100% success, n=399)
Code review -> Claude (74.6%, n=291) — trend declining

Phase Completion Summary

Phase	Goal	Status	Key Deliverable
1	Observability Foundation	COMPLETE	OrchestrationObserver, routing dashboard
2	Learning Loop	COMPLETE	OutcomeFeedbackCollector, CompositeRouter
3	Dogfooding	COMPLETE	Claude Code PR review, SICA test gen
4	Security Hardening	COMPLETE	Docker sandbox, 113 pentest tests
5	Market Validation	PARTIAL	SWE-bench 65.7% (23/35 Lite subset); full 300 still open
6	Evidence-Based Hardening	CLOSED (#1401)	Targets hit or exceeded except claude security_review (→ #1974); remaining security/monitoring tracked individually

Phase details and 8 historical system review transcripts: archive/system-reviews-2026-01.md

Open Gaps

Gap 1: No Production Benchmark Evidence (P1) — IN PROGRESS

Impact: 8/10 (reduced from 9/10) — Infrastructure ready, execution pending

The SWE-bench module has parallel execution (#1407), cross-run memory enrichment, partial git clones, and thread-safe JSONL output. First evaluation published: 65.7% resolution rate (23/35 on SWE-bench Lite). Test-aware prompting (FAIL_TO_PASS + test_patch in prompt) flipped django__django-14534 from failed to resolved. Analysis of 12 unresolved instances shows 100% file targeting accuracy but incorrect patch logic — all failures are “right location, wrong fix.”

Completed:

Parallel runner with --concurrency=N workers and per-slot isolated work directories
Cross-run memory enrichment (SessionMemory records outcomes, enriches future prompts)
Git clone optimization (partial clones, skip-fetch for existing repos)
CLI agent executor with Claude CLI OAuth authentication
Instance priority sorting by estimated difficulty (repo complexity + problem length + memory data)
CLI executor timeout optimized from 10min to 5min (median solve time is 2-4min)

Required: Execute SWE-bench Lite (300 instances), publish results, compare against leaderboard.

Gap 2: Swarm-Level Observability (P2) — MOSTLY RESOLVED

Impact: 3/10 (reduced from 7/10) — Core observability complete, visualization is a stretch goal

Resolved:

nexus-agents health CLI command implemented with per-CLI performance breakdown
All 5 swarm health metrics measured (utilization, efficiency, accuracy, regret, adaptation speed)
JSONL tracing infrastructure with query_trace MCP tool
#1403 closed — all success criteria met

New (2026-03-05): Wave dispatch events (wave.started, wave.completed) emitted to pipeline EventBus and captured by TraceWriter (5986290e). Queryable via query_trace tool.

Remaining (stretch): Trace visualization UI for multi-wave dispatches, cross-session learning dashboard.

Gap 3: OpenCode CLI Reliability (P2) — STALLED (no usage)

Impact: 1/10 — OpenCode has 0 samples in the current 10k-task window. Routing changes effectively removed OpenCode from the active dispatch pool. Adapter fixes (#1402) were deployed but cannot be empirically validated without usage. If OpenCode is no longer a target CLI, this gap should be retired rather than monitored.

Status as of 2026-04-18: no action required until OpenCode either returns to active use or is formally removed.

Gap 4: Security Review Accuracy (P2) — PARTIALLY MITIGATED, STILL BELOW TARGET

Impact: 5/10 (partially reduced from 6/10) — Fixes deployed; claude success rose from 30% → 53.9% (590 samples) but trend is declining (confidence 1.0). Target of 60% not met. Adaptive routing is deweighting claude for this category (adaptiveBonus −5.4). Codex primary for security_review has only 3 samples (66.7%) — insufficient to confirm the new primary lands consistently.

Security expert prompt improved with concrete JSON example + plaintext fallback guidance
Routing realigned: primaryCli changed from claude to codex, bonus 15→7
Failure classification now prioritizes message-based patterns (86+ keywords) over coarse error types

Tracked in #1974 for focused investigation: why claude sec_review is still declining and why codex sec_review is underutilized.

Gap 5: Exploration Success Rate (P3) — RESOLVED

Impact: 0/10 — Target exceeded. Exploration overall is 96.2% (812 samples). Claude exploration 92.5% (413 samples), Gemini 100% (399 samples). Gap closed.

Original target (>70%) is comfortably surpassed and claude is trending stable (not declining). No further action required.

Gap 6: Failure Classification (P2) — RESOLVED

Impact: 0/10 — Target exceeded. Current unknown rate is 0.2% (3/1645 failures), far below the 30% target. Current breakdown: execution 43.3%, timeout 28%, parse 16.4%, adapter_unavailable 4.4%, rate_limit 4.3%, validation 3.5%, unknown 0.2%.

Fixes deployed:

Message-based classification with 100+ keyword patterns (was 86+)
Added: SSL/TLS/proxy (→connection), ENOMEM (→crash), HTTP 5xx/bad gateway/service unavailable (→execution), max retries (→rate_limit), truncated/incomplete (→execution)
Added broad catch-all execution patterns: error:, failed, failure, exception, not found, invalid, unable to, could not, unexpected, missing, unsupported
Non-zero exit with error stderr now classified as EXECUTION_ERROR (#1402)

Status as of 2026-04-18: Target met (0.2% << 30%). Gap closed.

Forward-Looking: Phase 6 Closure (2026-04-18)

Tracking: #1401 CLOSED. Follow-ups: #1974 (security expert investigation), #1975 (capability-instruction-tuning decision).

Registry ground truth (2026-04-18): 42 techniques implemented, 1 rejected out of 43 total. The previous figures of “32 implemented / 14 partial of 46 mapped” were from an older snapshot and no longer reflect the registry.

6.1: Evidence-Based Validation — HANDED OFF

SWE-bench full-scale execution tracked in #1414 (full orchestration pipeline), #1574 (Verified + Pro context management), #1575 (Pro support with Scale Labs harness)
Demo workflows + regression benchmarking in CI remain open as individual enhancement items; no active epic

6.2: Swarm Intelligence Observability — MOSTLY COMPLETE

All 5 swarm health metrics computed in weather report (#1403 CLOSED)
nexus-agents health CLI with per-CLI performance breakdown
JSONL tracing infrastructure + query_trace MCP tool
Wave dispatch events (wave.started/wave.completed) in pipeline EventBus
Trace visualization UI for multi-wave dispatches (stretch)
Cross-session learning dashboard (stretch)

6.3: Reliability Hardening — COMPLETE except one item

Exploration timeout aligned with centralized config: 90s -> 180s (#1405 CLOSED)
TOPSIS tolerance band (5%) for routing diversity (#1405)
Failure classification: message-based patterns prioritized (#1401)
Security expert prompt improved with example output (#1401)
OpenCode model alias resolution from canonical registry (#1402)
Expanded stderr error pattern detection for provider errors (#1402)
Exploration success 57% -> 96.2% (target 70%, exceeded)
Unknown failures 60.4% -> 0.2% (target <30%, exceeded)
[~] OpenCode: 0 samples in current window (usage collapsed; gap retired — see Gap 3)
Security expert claude 30% -> 53.9%, trend declining (target 60% not met) — tracked in #1974

6.4: Research-Driven Improvements — SUBSTANTIALLY COMPLETE

Registry is clean: 42/43 techniques implemented, 1 rejected (latent-space-sharing). All techniques once marked partial have been promoted to implemented or rejected with rationale. See docs/research/registry/techniques.yaml for ground truth.

Remaining open decision: capability-instruction-tuning (arxiv-2502.17282) — tracked in #1975 as a single-scope decision ticket.

Composite Success Metrics

Target Score Trajectory

Phase	Target	Key Indicator
Current	~8.0/10	All infra phases done
After Phase 6	9.0/10	Market validated
North Star	10/10	Best-in-class evidence

Swarm Health Metrics (via `weather_report` tool)

Metric	Definition	Healthy Range	Current	Status
Agent Utilization	% of expert roles actively succeeding	70-90%	0%	Needs recalibration — see note below
Collaboration Efficiency	Successful / total delegate tasks	> 0.1	71.9%	Healthy
Routing Accuracy	% routed to good CLIs (within 5% of best/cat)	> 50%	94.9%	Healthy
Weekly Regret	Avg gap vs best-possible rate/category	Decreasing	0.113	Improving (was 0.179 on 2026-03-05)
Adaptation Speed	Avg samples for best CLI/category to converge	< 50	526	Above healthy range — bandit coverage hasn’t converged for low-traffic categories

Note: Routing Accuracy and Adaptation Speed metrics were corrected on 2026-03-05. Previous values (25.4% accuracy, 249 adaptation) used flawed calculations — accuracy counted only single-best CLI (expected ~25% with 4 CLIs), adaptation averaged all pairs instead of best-per-category.

Appendix: Original Voting Agent Reasoning

From the initial 5-agent consensus assessment (2026-01-09):

Architect (7/10): “The architecture is sound for a multi-agent system. The gap is from ‘multi-agent system’ to ‘swarm with emergent collective intelligence.’ That’s the difference between 7 and 10.”

AI/ML (7/10): “The swarm has excellent static decision-making infrastructure but cannot yet learn and improve from its own experience.”

DevEx (6/10): “The technology is impressive. The experience needs work.”

Security (6/10): “The security posture is thoughtful but incomplete. The most critical immediate action is sandboxing agent execution.”

PM (6/10): “The gap is not technical capability - it is market proof. The swarm must demonstrate it is best, not just claim best-in-class architecture.”

Key insight that still applies: Technique coverage (97.7% implemented of 43 total registry techniques, as of 2026-04-18) is near-complete, but coverage means nothing without user validation. The path from 8.5 to 10 is evidence (SWE-bench, demo workflows, regression CI — tracked in #1414/#1574/#1575), not more architecture or more paper implementations.

Generated via CLAUDE.md Consensus Voting Protocol Consolidation approved 3/3 (unanimous) on 2026-03-05 Historical system reviews archived to docs/archive/system-reviews-2026-01.md