Lab framework self-improvement audit — proposals + applicable patches from M-001 through M-006 experience (queued 2026-05-29 with M-001/M-002 scope; scope-expanded 2026-05-30 to include M-003..M-006 + dual-repo + viz toggle + paper-search learnings)

Completed in 20h 9m

Code plan7/7 steps (100%)

Iterations9/22 (41%)

Goal

Metric:proposals_fully_specified→≥ 5(baseline )

Eval fixture: lab framework state at HEAD of git repo. Evidence base: ALL completed missions (M-001 pose-tracking failed, M-002 image-registration done, M-004 multimodal MIND-SSC done, M-005 real OpenNeuro longitudinal done) + the in-flight mission M-006 (first 3d-shape pillar, GS-SLAM viz-first) up to whatever state it has reached when M-003 starts. Plus framework artifacts that emerged during the 2026-05-29..30 window: dual-repo (private+public) workflow via tools/sync_to_public.py, tools/paper_search.py for researcher arxiv access, tools/recover_from_transcript.py disaster recovery, frontend/components/ExecutionView.tsx narrative↔raw toggle, LabStatusBanner, /about page, Vercel deployment with outputFileTracingIncludes, B-002/B-003/B-004 brainstorms.

Baseline artifact:

Achieved: 7.0000 ✓ → MEASURED-001

Approach

Read evidence base · cluster pain points · propose with patches · oracle-check

Audit-style writing mission. Agent ingests the FULL evidence base — M-001 through M-006 mission artifacts (postmortems, CHANGELOGs, attempts[] summaries, sandbox CLAUDE.mds) + framework artifacts (framework/, tools/, lab/missions/SCHEMA.md, validator, frontend lib/v2.ts schema readers, all components, sync_to_public.py, paper_search.py) + B-002/B-003/B-004 brainstorm outputs + the dual-repo + Vercel deployment context — clusters the recurring pain points into ≥5 concrete proposals with file:line evidence, and produces unified-diff patches for the top ≥3. Note the cross-pillar lessons: M-004's MIND-SSC numerical-fusion approach to grid_sample double-backward (pillar trap #5), M-005's real-data validation pattern that diverges from synthetic-fixture missions, M-006's first non-CLI 'visual oracle' which exposes a viz_kind dispatch gap in frontend. Director (human) reviews and applies via `git apply` + follow-up commits — agent does NOT touch framework files directly, only writes proposals + patches inside sandbox_dir.

References (21)

docs/v3-design.md — original v3 charter
lab/missions/M-001-postmortem.md — pose-tracking failure analysis (v2 mission, baseline)
lab/missions/M-002-postmortem.md — image-registration success caveats (v3 first success)
lab/missions/M-004-postmortem.md — multimodal MIND-SSC + grid_sample trap #5 lessons
lab/missions/M-005-postmortem.md — real OpenNeuro data validation, single-subject caveats
lab/missions/M-006.json (in-flight) — first non-image-registration v3 mission; viz_kind 'gaussian-splat-viewer' has no frontend dispatch yet (fallback to MetricsTable per SCHEMA.md)
algorithm/image-registration/CLAUDE.md — promoted gotchas from M-002+M-004+M-005 (the lessons we already extracted)
algorithm/infra/CLAUDE.md — infra pillar charter (this mission's home)
algorithm/3d-shape/missions/M-006/CLAUDE.md — pillar-traps already discovered (Trap #1, #1a, #2, #3, #4) — meta evidence of how a new pillar bootstraps
lab/brainstorms/B-002.json — image-registration brainstorm (precedent: numeric-only oracle)
lab/brainstorms/B-003.json — multimodal brainstorm (precedent: cross-pillar synthesis)
lab/brainstorms/B-004.json — M-006 selection brainstorm (precedent: numeric+visual dual-oracle, NEW pattern)
tools/sync_to_public.py — dual-repo private/public mirror with surface-scrub (framework-level, post-M-005)
tools/paper_search.py — researcher arxiv access tool (added post-R-1/R-2)
tools/recover_from_transcript.py — disaster recovery from session JSONL (added after git filter-repo incident)
frontend/components/ExecutionView.tsx — narrative↔raw toggle (UX response to 'execution log is alien' feedback)
frontend/next.config.ts — outputFileTracingIncludes for Vercel runtime fs reads
claudedocs/research_blueberry_v3_evolution_2026-05-29.md — v3 evolution research (the lens through which M-003 was originally scoped)
claudedocs/research_slam_landscape_2026-05-30.md — R-1, fed B-004
claudedocs/research_gs_slam_viz_2026-05-30.md — R-2, fed B-004
Anthropic engineering blog 'How we built a C compiler with the framework' — 4-pattern reference for self-improvement scope

Code plan

01Step 1 [survey]: read the EXPANDED evidence base. List every CHANGELOG.md entry across M-001..M-006 + every pillar root (image-registration, pose-tracking, infra, 3d-shape); the four postmortems (M-001/M-002/M-004/M-005); the v3-design.md; the validator + MISSION_PROMPT.md + sync_to_public.py + paper_search.py + recover_from_transcript.py; the frontend lib/v2.ts + lib/artifacts.ts + ExecutionView.tsx + LabStatusBanner.tsx + next.config.ts; per-pillar docker-compose.yml + Makefile + Dockerfile (note 3d-shape pillar bootstrap via M-006 Trap #1/#1a/#4); B-002/B-003/B-004 (note B-004 introduces numeric+visual dual-oracle pattern). Note the file:line of every observed friction point. Pay extra attention to: (a) what the dual-repo workflow exposed about private vs public boundaries, (b) M-005's real-data lesson vs synthetic-fixture earlier missions, (c) M-006's frontend viz_kind dispatch gap, (d) what worked across pillars vs what's pillar-specific. No files edited this turn. — ✅ DONE iter 2: surveyed 4 postmortems + M-006 (failed) spec/postmortem/CLAUDE.md(14 traps) + audit_oracle.py + SCHEMA.md; clustered 9 friction themes with file:line anchors (see iter-2 attempt summary); ≥3 mechanical patch candidates identified.
02Step 2 [cluster]: group friction points into themes (schema drift, container ergonomics, viz gaps, agent behavior, oracle design, budget heuristic, etc.). Pick the top 5-8 highest-leverage themes. Write a draft outline in proposals.md (just headings + 1-sentence problem statements). Files: proposals.md (draft). — ✅ DONE iter 3: wrote proposals.md skeleton — 7 `## Proposal N:` headings (P1 preflight gate, P2 viz_kind table, P3 determinism enforcement, P4 working_dir trap, P5 absolute-imports rule, P6 cross-pillar trap doc, P7 split iter budget), each with a 1-sentence PROBLEM + noted patch-target. Dropped T8/T9 (need human design / unverified).
03Step 3 [fill schema]: for each proposal, fill PROBLEM / EVIDENCE / PROPOSAL / IMPACT / EFFORT sections. ORACLE-GROUNDED: each section marker MUST match the oracle regex `(?:^[-*]\s+)?\*\*SECTION\*\*\s*:` — i.e. write `- **PROBLEM**:` (bullet) or `**PROBLEM**:` (bare), exact uppercase names. Headings MUST be `## Proposal N: <title>` (N a digit). EVIDENCE must cite concrete file:line — no generic claims. Files: proposals.md. — ✅ DONE iter 4: filled all 7 proposals' 5 sections + PATCH pointer with file:line citations; oracle reports proposals_fully_specified 7/7, zero missing sections.
04Step 4 [patches batch 1]: pick 1-2 proposals with mechanical fixes (renames, schema additions, single-file edits). Hand-write unified-diff patches at patches/proposal-N.patch by reading the target file, deciding the change, and emitting a diff with correct headers. Verify each patch with `git apply --check` (run from repo root via subprocess). Files: patches/proposal-*.patch. — ✅ DONE iter 5: authored patches/proposal-5.patch (P5 absolute-imports rule → framework/MISSION_PROMPT.md) + patches/proposal-2.patch (P2 viz_kind rows → lab/missions/SCHEMA.md), both pass `git apply --check` from repo root; oracle patches_applying 2/3.
05Step 5 [patches batch 2]: same as Step 4 for the harder proposals (multi-file edits, frontend changes). Aim for ≥3 total patches passing `git apply --check`. — ✅ DONE iter 6: authored patches/proposal-1.patch (P1 host pre-flight Step-0 block → framework/MISSION_PROMPT.md `queued → planning` section, non-overlapping with proposal-5's hunk); all 3 patches now pass `git apply --check` from repo root; oracle --strict reports proposals_fully_specified 7/7, patches_applying 3/3, pass_overall true (exit 0).
06Step 6 [self-oracle]: run `python algorithm/infra/tests/audit_oracle.py --mission M-003 --strict --json` FROM REPO ROOT (oracle resolves REPO_ROOT = parents[3] and runs `git apply --check` with cwd=repo root, so patch headers must be repo-root-relative `a/<path>`/`b/<path>` and context lines must match the CURRENT working-tree content of each target). Pass = summary.pass_overall true (≥5 fully_specified proposals AND ≥3 patches_applying). If any structural failure, fix the deficient section/patch and re-run. Repeat until pass. — ✅ DONE iter 7: oracle --strict re-confirmed green from repo root — proposals_fully_specified 7/7, patches_applying 3/3 (proposal-1/2/5 all applies:true), pass_overall true, exit 0; no fix loop needed. This is the eval producing metrics → executing → measuring.
07Step 7 [measure]: capture the oracle's JSON output into runs/MEASURED-001.json (wrap with run_id/mission/timestamp per v3 SCHEMA.md Shape B). Append final CHANGELOG entry. Files: runs/MEASURED-001.json + CHANGELOG.md. — ✅ DONE iter 8: wrote runs/MEASURED-001.json (Shape B: run_id MEASURED-001 / mission M-003 / timestamp / iteration 8 + primary_metric.proposals_fully_specified 7 + secondary_metrics patches_applying 3/proposals_count 7 + fixtures[] per proposal + nested oracle_output + pass_overall true); parses clean; frontend-resolvable via artifacts.ts primary_metric/secondary_metrics lift + missions/runs/ measured-by-construction. This is the executing→measuring eval already captured; the measure artifact closes Step 7, transition measuring → evaluating.

v3 metadata

Oracle

algorithm/infra/tests/audit_oracle.pydeterministic

$ python algorithm/infra/tests/audit_oracle.py --mission M-003 --strict --json

Sandbox

algorithm/infra/missions/M-003

Memory files (living)

algorithm/infra/missions/M-003/CLAUDE.md
algorithm/infra/missions/M-003/CHANGELOG.md

Pass tolerance

absolute ≤ 0 · relative ≤ 0%

Hard constraints (5)

proposals.md must have ≥5 numbered headings (## Proposal N: <title>)
every proposal must have all 5 sections: PROBLEM, EVIDENCE, PROPOSAL, IMPACT, EFFORT
every EVIDENCE section must cite ≥1 concrete file path or line reference from the expanded evidence base — M-001..M-006 mission artifacts (CHANGELOG, postmortem, attempts[]) OR framework artifacts (tools/, frontend/, framework/, lab/missions/SCHEMA.md, B-002..B-004 brainstorms, docs/) — generic AI-fluff is rejected by Director on review
≥3 proposals must include a unified-diff patch at patches/proposal-N.patch that passes `git apply --check`
patches must NOT modify files outside the framework (framework/, tools/, lab/missions/SCHEMA.md, frontend/, algorithm/*/CLAUDE.md, etc.) — algorithm/<pillar>/missions/<id>/ is off-limits

Execution

budget 13/22

File change matrix+7 ~8 · 9 files · 9 attempts

File	1	2	3	4	5	6	7	8	9
algorithm/infra/missions/M-003/CHANGELOG.md	·	·	~	~	~	~	·	~	~
algorithm/infra/missions/M-003/proposals.md	·	·	+	~	·	·	·	·	·
algorithm/infra/missions/M-003/patches/proposal-2.patch	·	·	·	·	+	·	·	·	·
algorithm/infra/missions/M-003/patches/proposal-5.patch	·	·	·	·	+	·	·	·	·
algorithm/infra/missions/M-003/patches/proposal-1.patch	·	·	·	·	·	+	·	·	·
algorithm/infra/missions/M-003/runs/MEASURED-001.json	·	·	·	·	·	·	·	~	·
MEASURED-001	·	·	·	·	·	·	·	+	·
algorithm/infra/missions/M-003/POSTMORTEM.md	·	·	·	·	·	·	·	·	+
lab/missions/M-003-postmortem.md	·	·	·	·	·	·	·	·	+

+ created ~ modified

iter 9 · evaluating → done · evaluate-target-met-close-missionjust now
Executed the evaluating → done transition. Re-ran the deterministic structural oracle `python algorithm/infra/tests/audit_oracle.py --mission M-003 --strict --json` from repo root as the evaluation gate (v3-C oracle obsession): re-confirmed green against the CURRENT working tree — proposals_count 7, proposals_fully_specified 7/7 (zero missing_sections), patches_count 3, patches_applying 3/3 (proposal-1.patch/proposal-2.patch/proposal-5.patch all applies:true, error:null), summary.pass_overall true, exit 0. Read the achieved metric from runs/MEASURED-001.json primary_metric.proposals_fully_specified (7) and compared to goal.target_value (5) under goal.target_op 'ge' → 7 ≥ 5 → target met. Set result {achieved_metric_value 7, target_met true, measured_artifact 'MEASURED-001', postmortem_path 'lab/missions/M-003-postmortem.md', honest_caveats[4]}. Wrote the postmortem to lab/missions/M-003-postmortem.md (canonical) and the v3-F sandbox copy algorithm/infra/missions/M-003/POSTMORTEM.md (identical minus the Next-mission-candidates section). Appended the final CHANGELOG entry. Status → done; budget 14 → 13. Mission deliverables complete and on disk: proposals.md (7 proposals, 5 sections each), patches/proposal-{1,2,5}.patch (3/3 git-apply-clean), runs/MEASURED-001.json (Shape B). Note: patches remain PROPOSED, not applied — only the Director applies them via `git apply` + follow-up commits per the E.1 infra-safety pattern; the structural oracle verdict (section completeness + git-apply-check) is not a semantic judgment of proposal merit, which is the Director's review gate. P3/P4/P6/P7 are actionable but were left patch-less (multi-file / human-design) by design.
Modified: lab/missions/M-003-postmortem.md, algorithm/infra/missions/M-003/POSTMORTEM.md, algorithm/infra/missions/M-003/CHANGELOG.md
Created: lab/missions/M-003-postmortem.md, algorithm/infra/missions/M-003/POSTMORTEM.md
iter 8 · measuring → evaluating · execute-step7-measure-build-artifactjust now
Executed Step 7 [measure] — built the measured artifact runs/MEASURED-001.json by wrapping the audit oracle's JSON output as SCHEMA.md Shape B. Top-level provenance: run_id 'MEASURED-001' (matches filename basename), mission 'M-003', timestamp, iteration 8, oracle 'algorithm/infra/tests/audit_oracle.py', oracle_exit_code 0. primary_metric.proposals_fully_specified = 7 (the goal metric; goal.target_op 'ge' goal.target_value 5). secondary_metrics: patches_applying 3, proposals_count 7, patches_count 3. fixtures[] carries one record per proposal (number/title/fully_specified/missing_sections/patch/patch_applies) so the frontend has per-fixture rows; the full untouched oracle verdict is preserved under oracle_output (proposals[], patches[] proposal-1/2/5 all applies:true, summary.pass_overall true). Top-level pass_overall true, blocking_failure null. Verified the file parses with json.load (run_id MEASURED-001, pass True, primary 7). Frontend resolution path confirmed: frontend/lib/artifacts.ts:108 lifts primary_metric+secondary_metrics into the metrics table, and artifacts.ts:103,137 infers measured:true from the missions/M-003/runs/ path (also set explicitly). Only sandbox files touched — runs/MEASURED-001.json (new) + the sandbox CHANGELOG entry — plus this spec; all permitted, v3-D-clean (≤3 files). Per the v2 measuring→evaluating mapping (read metrics file produced, build MEASURED-XXX artifact, set status evaluating), Step 7 is DONE and status advances measuring → evaluating. Budget 15 → 14. Next iter (evaluating → done): read primary_metric.proposals_fully_specified (7) and compare to goal.target_value (5) under target_op 'ge' → 7 ≥ 5 target met; set result {achieved_metric_value 7, target_met true, measured_artifact 'MEASURED-001', postmortem_path 'lab/missions/M-003-postmortem.md', honest_caveats [...]}, write the postmortem to lab/missions/M-003-postmortem.md + the v3-F sandbox copy POSTMORTEM.md, set status done.
Modified: algorithm/infra/missions/M-003/runs/MEASURED-001.json, algorithm/infra/missions/M-003/CHANGELOG.md
Created: MEASURED-001
iter 7 · executing → measuring · execute-step6-self-oracle-confirmjust now
Executed Step 6 [self-oracle] — re-ran the structural oracle `python algorithm/infra/tests/audit_oracle.py --mission M-003 --strict --json` from repo root to confirm the full pass contract before measuring. Verdict re-confirmed green: proposals_count 7, proposals_fully_specified 7/7 (zero missing_sections on any proposal), patches_count 3, patches_applying 3/3 — proposal-1.patch (847B), proposal-2.patch (897B), proposal-5.patch (764B) all report applies:true/error:null against the current working tree; summary.pass_overall true; --strict exit 0. No deficient section or patch surfaced, so the Step-6 fix/re-run loop was a no-op this turn. The oracle is read-only (it only shells `git apply --check`), so no repo files were edited — only the spec + sandbox CHANGELOG, both permitted, v3-D-clean (≤3 files). Per the v2 mapping, the oracle run is the eval that produces this mission's metrics, so this is the executing → measuring transition (Step 6 marked DONE; Step 7 [measure] remains). Budget 16 → 15. Next iter (measuring → evaluating) wraps this exact oracle JSON into runs/MEASURED-001.json as SCHEMA.md Shape B (run_id/mission/timestamp/iteration), then the evaluating turn compares proposals_fully_specified (7) ≥ goal.target_value (5) to close the mission done.
iter 6 · executing → executing · execute-step5-patches-batch-2just now
Executed Step 5 [patches batch 2] — authored the third applying patch to clear the ≥3 floor. patches/proposal-1.patch (P1): an insertion-only hunk adding a 'Step 0 — host pre-flight (greenfield missions)' block to the `### queued → planning` section of framework/MISSION_PROMPT.md — on the first transition, if the spec declares host_requirements (GPU, multiprocess-CUDA/IPC, disk, dataset), probe each and transition straight to failed with a preflight-failed attempt rather than burning research iterations proving a host incompatibility (addresses M-006 spending 19/20 iters on plumbing then dying on a WSL2 CUDA-IPC wall, M-006-postmortem.md:34,40). The hunk lives in the `queued → planning` block, non-overlapping with proposal-5.patch's `planning → executing` hunk, so both MISSION_PROMPT.md patches plus proposal-2.patch (SCHEMA.md) pass `git apply --check` independently against the unmodified tree. Generated insertion-only (ctx=3) from the live target bytes so context matches the current working tree exactly. Self-checked with `python algorithm/infra/tests/audit_oracle.py --mission M-003 --strict --json` from repo root: proposals_fully_specified 7/7, patches_applying 3/3, pass_overall true, exit 0 — the full structural pass contract is met. Files touched: patches/proposal-1.patch + CHANGELOG.md (2, v3-D-clean ≤3). Step 5 marked DONE in code_plan; Steps 6-7 remain (self-oracle re-confirmation, then wrap the oracle JSON into runs/MEASURED-001.json), so status stays executing. Budget 17 → 16. Next iter executes Step 7 [measure] — build runs/MEASURED-001.json (Shape B: run_id/mission/timestamp/iteration wrapping the oracle output), the executing → measuring transition.
Modified: algorithm/infra/missions/M-003/patches/proposal-1.patch, algorithm/infra/missions/M-003/CHANGELOG.md
Created: algorithm/infra/missions/M-003/patches/proposal-1.patch
iter 5 · executing → executing · execute-step4-patches-batch-1just now
Executed Step 4 [patches batch 1] — authored the two most-mechanical single-file patches and verified each passes `git apply --check` from repo root (oracle's exact check). patches/proposal-5.patch (P5): inserts a one-paragraph 'Entrypoint imports must be absolute' rule into framework/MISSION_PROMPT.md right after the 'Do NOT run Docker yet' line in the `planning → executing` block — addresses M-002 iter-14 burning a turn converting nine `from .x import` to absolute (M-002-postmortem.md:46-47). patches/proposal-2.patch (P2): adds two rows — `gaussian-splat-viewer` and `audit-proposals` — to the lab/missions/SCHEMA.md viz_kind dispatch table before the `_(any other)_` fallback row, closing the documented schema/frontend drift (SCHEMA.md table omitted both kinds that M-006.json:15 and M-003.json:16 already emit). Both hunks were generated insertion-only (ctx=3) from each target file's live bytes so context matches the current working tree exactly; both confirmed APPLIES CLEANLY. Oracle now: proposals_fully_specified 7/7, patches_applying 2/3 — pass_overall False only on the patch count, exactly as expected for batch 1. Files touched: patches/proposal-5.patch + patches/proposal-2.patch + CHANGELOG.md (3, v3-D-clean). Step 4 marked DONE in code_plan; Steps 5-7 remain (one more patch to reach ≥3, then self-oracle + measure), so status stays executing. Budget 18 → 17. Next iter executes Step 5 [batch 2]: author P1 (pre-flight Step-0 block) in MISSION_PROMPT.md's `queued → planning` section — a hunk non-overlapping with P5's so both apply independently against the unmodified tree — and/or P3 (SCHEMA.md determinism wording), to clear patches_applying ≥3.
Modified: algorithm/infra/missions/M-003/patches/proposal-5.patch, algorithm/infra/missions/M-003/patches/proposal-2.patch, algorithm/infra/missions/M-003/CHANGELOG.md
Created: algorithm/infra/missions/M-003/patches/proposal-5.patch, algorithm/infra/missions/M-003/patches/proposal-2.patch
iter 4 · executing → executing · execute-step3-fill-proposal-sectionsjust now
Executed Step 3 [fill schema] — filled all 7 proposals' five oracle-required sections (PROBLEM/EVIDENCE/PROPOSAL/IMPACT/EFFORT) plus a PATCH pointer, one file touched (proposals.md rewrite, sandbox-permitted, v3-D-clean ≤3 files). Every EVIDENCE section cites concrete file:line from the iter-2 survey: P1 host pre-flight gate (M-006.json execute-failed iters 4/9/11/13/15/17/19 + M-006-postmortem.md:25-34,34,40 + 3d-shape M-006/CLAUDE.md:52 Trap#14); P2 viz_kind dispatch rows (SCHEMA.md table omits gaussian-splat-viewer M-006.json:15 and audit-proposals M-003.json:16; gap noted M-003 CLAUDE.md:48); P3 enforce determinism (SCHEMA.md:170 HARD req vs M-004 SITK Mattes-MI ~3.8% jitter M-004-postmortem.md:30 vs M-005 SSD byte-stable M-005-postmortem.md:23); P4 compose working_dir vs Makefile (M-002-postmortem.md:52-54); P5 absolute-imports rule (M-002 iter14 converted 9 imports, M-002-postmortem.md:46-47); P6 cross-pillar trap doc for grid_sample double-backward (M-002-postmortem.md:48-51, M-004-postmortem.md:19, M-005-postmortem.md:21); P7 split setup vs research iter budget (M-006-postmortem.md:25-34, M-001-postmortem.md:51-53, M-003.json:87-88). Self-checked with `python algorithm/infra/tests/audit_oracle.py --mission M-003 --json`: proposals_fully_specified 7/7, zero missing sections — clears the ≥5 floor. P6 and P7 deliberately carry 'no patch — needs human curation / multi-file co-change' so they don't dilute the ≥3-independently-appliable-patch requirement; the patch slate for Steps 4-5 is the 3 mechanical single-file edits P5/P2/P1(or P3). Step 3 marked DONE in code_plan; Steps 4-7 remain (patches + self-oracle + measure), so status stays executing. Budget 19 → 18.
Modified: algorithm/infra/missions/M-003/proposals.md, algorithm/infra/missions/M-003/CHANGELOG.md
iter 3 · executing → executing · execute-step2-cluster-write-proposals-skeletonjust now
Executed Step 2 [cluster] — wrote the proposals.md draft skeleton inside the sandbox (one file touched, sandbox-permitted, ≤3 files per v3-D). Clustered the 9 surveyed friction themes (iter-2 attempt) into 7 numbered proposals, one `## Proposal N: <title>` heading apiece with a single-sentence `- **PROBLEM**:` statement and a noted patch-candidate target file: P1 mandatory iter-1 host/capability pre-flight gate (→ framework/MISSION_PROMPT.md; evidence M-006 burned 19/20 iters on plumbing then died on a WSL2 CUDA-IPC wall a smoke test would have caught, M-006-postmortem.md:34,40); P2 register missing viz_kind values gaussian-splat-viewer + audit-proposals in the dispatch table (→ SCHEMA.md:219-225); P3 enforce the deterministic:true oracle claim rather than trust it (→ SCHEMA.md:170 + pillar CLAUDE.md; M-004 SITK Mattes-MI ~3.8% jitter, M-004-postmortem.md:30); P4 document the compose working_dir=/workspace vs pillar-relative Makefile trap (→ MISSION_PROMPT.md or pillar CLAUDE.md; M-002-postmortem.md:52-54); P5 pin the absolute-imports-in-entrypoint rule into the mission prompt (→ MISSION_PROMPT.md; M-002-postmortem.md:46-47); P6 promote cross-pillar gotchas like grid_sample double-backward to a shared trap doc (→ docs/ + pillar pointer); P7 split the flat max_iterations into setup vs research sub-budgets (→ SCHEMA.md; M-006 + M-001-postmortem.md:51-53). Picked 7 to leave buffer above the oracle's ≥5 fully-specified floor; deferred T8 (core-modify escape hatch — needs human design, no mechanical patch) and T9 (stream.jsonl — unverified vs framework/mission.ps1). Step 2 marked DONE in code_plan; Steps 3-7 remain, so status stays executing. Budget 20 → 19. Next iter executes Step 3 [fill schema], filling each proposal's 5 sections with the oracle-required `- **NAME**:` markers and file:line evidence (proposals.md only — one file, v3-D-clean).
Modified: algorithm/infra/missions/M-003/proposals.md, algorithm/infra/missions/M-003/CHANGELOG.md
Created: algorithm/infra/missions/M-003/proposals.md
iter 2 · planning → executing · execute-step1-survey-evidence-basejust now
Executed Step 1 [survey] — read-only ingestion of the expanded evidence base (no repo files edited; only the spec + sandbox CHANGELOG/code_plan touched, all permitted). Read: 4 postmortems (M-001 pose-tracking failed AMOTA 0.4028<0.471; M-002/M-004/M-005 image-registration done), the in-flight-now-FAILED M-006 spec + postmortem + sandbox CLAUDE.md (14 enumerated traps), the structural oracle algorithm/infra/tests/audit_oracle.py, and lab/missions/SCHEMA.md. Clustered 9 friction themes, each with concrete file:line anchors so Steps 2-3 are mechanical: (T1) NO iter-1 host/capability pre-flight gate — M-006 spent 19/20 iters almost entirely on env plumbing (M-006.json attempts iter 4/9/11/13/15/17/19 all execute-failed-*) then died on a WSL2 CUDA-IPC wall (M-006-postmortem.md:25-34, sandbox CLAUDE.md Trap #14 line 52) that the postmortem itself says a single iter-1 smoke test would have caught (M-006-postmortem.md:34,40). (T2) viz_kind dispatch gap — SCHEMA.md:219-225 table lists only registration-before-after/optimizer-convergence/tracker-bev/compiler-bench; M-006's gaussian-splat-viewer (M-006.json:15) and M-003's audit-proposals (M-003.json:16) both silently fall back to MetricsTable; documented-incomplete in M-003 CLAUDE.md:48. (T3) deterministic:true asserted but unenforced — SCHEMA.md:170 calls it a HARD requirement / spec-reject-if-false, yet M-004's SITK Mattes-MI oracle jitters ~3.8% run-to-run because thread count isn't pinned (M-004-postmortem.md:30), violating v3-C byte-determinism; contrast M-005 SSD byte-stable (M-005-postmortem.md:23). Fix both: pin SetGlobalDefaultNumberOfThreads(1) + soften the schema's 'byte-for-byte' claim to 'verdict-stable + provenance-pinned'. (T4) compose working_dir=/workspace vs pillar-relative Makefile breaks make data/make eval as written — every M-002 measure needed the cwd-corrected -w invocation (M-002-postmortem.md:52-54). (T5) package-relative imports (from .warp import ...) fail under the oracle's top-level import_module('register'); M-002 iter 14 burned a remediation turn converting 9 imports to absolute (M-002-postmortem.md:46-47). (T6) grid_sample double-backward (pillar gotcha #5) recurs cross-mission — M-002 iter-15 failure (M-002-postmortem.md:48-51), defeated via GN-Hessian in M-004 (M-004-postmortem.md:19) and stayed clear in M-005 (M-005-postmortem.md:21); pinned only in image-registration/CLAUDE.md, not a cross-pillar trap doc the new 3d-shape pillar could consult. (T7) flat max_iterations conflates env-setup iters with research iters — M-006 exhausted budget on plumbing; M-001 spent its budget on 3 UKF PD-remediation rounds and never reached the planned Step-7/8 sweeps (M-001-postmortem.md:51-53). (T8) 'DO NOT modify core' constraint had no escape hatch — M-006's only in-principle fix (single-process launcher) was forbidden by the rule (M-006-postmortem.md:27, Trap #14), arguing for a sanctioned 'concurrency-substrate adapter' exception category. (T9) stream.jsonl pipeline — SCHEMA.md:233-239 documents mission.ps1 tee-ing stream-json, to verify against framework/mission.ps1 next. The top-5 (T1-T5) map cleanly to ≥3 mechanical git-apply-able patches in the allowed target set (SCHEMA.md, framework/MISSION_PROMPT.md, frontend/, algorithm/<pillar>/CLAUDE.md, docs/) per sandbox CLAUDE.md:158-167. Step 1 marked DONE in code_plan; more steps remain (Step 2 [cluster] writes the proposals.md heading skeleton), so status advances planning → executing. Budget 21 → 20.
iter 1 · queued → planning · elaborate-code-plan17h ago
Read the M-003 operating manual (algorithm/infra/missions/M-003/CLAUDE.md), infra pillar charter, the mission spec, and the structural oracle (algorithm/infra/tests/audit_oracle.py). Confirmed the pass contract: proposals.md needs >=5 headings of form `## Proposal N: <title>` each carrying all 5 markers PROBLEM/EVIDENCE/PROPOSAL/IMPACT/EFFORT matched by regex `(?:^[-*]\s+)?\*\*NAME\*\*\s*:`; patches/proposal-N.patch needs >=3 files that pass `git apply --check` run from repo root (so diffs must be repo-root-relative a/ b/ paths whose context matches current working-tree content). Elaborated code_plan steps 3 and 6 with these exact oracle constraints so subsequent executing turns are mechanical. Sandbox confirmed empty (only CLAUDE.md + CHANGELOG.md + empty patches/ runs/). Next turn (planning->executing) begins Step 1 survey of the expanded M-001..M-006 + framework evidence base.

Switch to raw stream above to see the agent's tool calls and reasoning verbatim.

Results

MEASURED-001

measured

just now

Metrics

Metric	Value
proposals_fully_specified	7
target_value	5
target_op	ge
patches_applying	3
patches_count	3
proposals_count	7
min_patches_required	3

Notes

Structural audit oracle (proposals.md section-completeness + git-apply-check per patch). 7/7 proposals fully specified (target ge 5), 3/3 patches apply cleanly (min 3). pass_overall true, --strict exit 0.

Honest caveats

Pass is a STRUCTURAL oracle verdict — proposals.md section-marker completeness + `git apply --check` per patch — not a semantic judgment of proposal merit; Director (human) review is the real quality gate.
Evidence file:line citations were accurate at audit time, but framework files drift, so anchors like M-006-postmortem.md:34 may shift before the patches are applied.
Three applying patches is the MINIMUM floor, not a claim that only three proposals are actionable — P3/P4/P6/P7 are actionable too but needed multi-file or human-design changes left for the Director.
Single deterministic structural-oracle run (no statistical noise applies), but no independent second oracle cross-checks the section/patch verdicts; patches are PROPOSED only and pass `--check` against the tree as of this mission.