Whasuk Lee

An Agent in a Lab: A Chronicle of LenaLab

LenaLab · 2026-06-03 · 3 verified, 1 rejected
Tags: visual-odometry, rgb-d, slam

How a verification-first harness let an AI agent author computer-vision algorithms from scratch — and what it got right, wrong, and never quite figured out.

— Written 2026-06-03, covering work done 2026-06-02.


The premise

LenaLab is built on one idea borrowed from Anthropic's harness-engineering work: never let the agent that produces a result be the one that decides whether it's any good.

So the lab has two halves that don't trust each other. A solver — a Claude agent — writes a visual-odometry algorithm from scratch in a sealed sandbox. A verifier — pure, deterministic Python with no model anywhere in it — runs that code on a held-out trajectory the agent never saw and measures the error with closed-form geometry. "It ran" is never success. The only thing that earns a ✅ is a held-out number under a bar that was fixed before the agent started.

The solver ⟂ verifier split

Everything below is what happened when we actually ran it. Each episode follows the same shape: what got built → how it improved on the last one → what broke → the lesson. The numbers are pulled straight from the run registries, not rounded for flattery.


Episode 0 — Building the bench (13:08)

Commit 74377be — "verification-first computer-vision research lab"

Before the agent could do anything, the lab itself had to exist. The key decision was to not write a new verifier: the whole verification spine (the state-machine loop, the held-out evaluator, the crash-resumable registry, the token+experiment budget, the single-GPU lease, the Docker job-runner) was imported from a prior project, "Touchstone." LenaLab added only the vision domain: the dataset provider, a classical reference algorithm, the grader, and the expert prompts.

The discipline that made the rest possible was the calibration gate: before the agent is allowed a single autonomous turn, the verifier must VERIFY a known-good run and REJECT a deliberately broken one. If it can't tell those apart, it's a rubber stamp, and the lab refuses to open.
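
The gate logic can be sketched in a few lines. This is a minimal illustration, not the lab's code: the function names, the verdict strings, and the strict "under the bar" comparison are assumptions. The numbers in the usage lines are the fr1_room calibration figures reported later in this chronicle.

```python
from typing import Callable

# Assumed verdict labels; the text only says the verifier must VERIFY and REJECT.
VERIFIED, REJECTED = "VERIFIED", "REJECTED"


def grade(error_m: float, bar_m: float) -> str:
    """Deterministic verdict: only a held-out error under the bar is verified."""
    return VERIFIED if error_m < bar_m else REJECTED


def calibration_gate(evaluate: Callable[[str], float], bar_m: float,
                     known_good: str, known_broken: str) -> bool:
    """Open the lab only if the grader separates the two calibration runs."""
    good = grade(evaluate(known_good), bar_m)
    broken = grade(evaluate(known_broken), bar_m)
    return good == VERIFIED and broken == REJECTED


# Usage with the SLAM calibration numbers from Episode 4 (bar 0.347 m).
errors = {"reference_slam": 0.23, "degenerate_control": 1.02}
gate_open = calibration_gate(errors.__getitem__, 0.347,
                             "reference_slam", "degenerate_control")

# A grader that returns the same score for everything is a rubber stamp.
rubber_stamp_open = calibration_gate(lambda _run: 0.1, 0.347, "a", "b")
print(gate_open, rubber_stamp_open)
```

The second call is the case the gate exists for: a grader that passes everything verifies the good run but also verifies the broken one, so the lab stays closed.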

What could be better, even here: the gate proves the grader isn't blind, but it can't prove the held-out split isn't correlated with the training split. That limitation never went away — it's honestly noted in the design doc and it's still true.


Episode 1 — First light, and the cost of perfectionism (≈03:17–03:44, salvaged)

Algorithm archived: agent_authored_vo_tum_v1.py

The first real trial: a sandboxed agent was asked to write monocular visual odometry for the TUM fr1_xyz sequence — recover the camera's path from a stream of plain grayscale frames, graded against a motion-capture ground truth it couldn't see. The bar was set by running a classical ORB reference (0.089 m) and allowing ×1.5 → 0.134 m.

The agent wrote 360 lines on its own. It chose a genuinely sophisticated design: PnP-centric pose estimation against a maintained 3-D landmark map, with deferred initialization — it waited until the camera baseline was wide enough for a reliable two-view triangulation, then back-filled the earlier frames. Reprojection-pruned map, optical-flow tracking, graceful pose-holding when a frame failed.

The result, when finally graded: VERIFIED at 0.124 m. It cleared the bar — though, tellingly, it was worse than the simple classical reference (0.124 vs 0.089). Good monocular tracking early, then the classic end-of-sequence drift.
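
The arithmetic behind that verdict, as a small worked example. The three numbers come from the text above; the variable names are ours, and the strict less-than comparison is an assumption about how "under the bar" is applied.

```python
REFERENCE_ATE_M = 0.089   # classical ORB reference on fr1_xyz (from the text)
BAR_MULTIPLIER = 1.5      # tolerance over the reference (from the text)
AGENT_V1_ATE_M = 0.124    # Episode 1 held-out result (from the text)

bar_m = REFERENCE_ATE_M * BAR_MULTIPLIER            # about 0.1335, quoted as 0.134
verified = AGENT_V1_ATE_M < bar_m                   # clears the fixed bar
beats_reference = AGENT_V1_ATE_M < REFERENCE_ATE_M  # but not the reference

print(f"bar={bar_m:.4f} m verified={verified} beats_reference={beats_reference}")
```

The two booleans are deliberately separate: the bar decides the verdict, while the comparison to the reference is only commentary.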

Monocular VO v1 estimate vs ground truth

Red is the agent's path, black is ground truth. It hugs the truth through the middle of the run, then the X–Y panel shows it failing to capture the lateral back-and-forth and drifting off — exactly the monocular weakness the next episode set out to fix.

The first failure

The live run printed RESULT: FAILED — and it was neither a hang nor a bad algorithm. The agent kept refining to chase the tight bar and hit its 40-turn authoring limit. The SDK raised, and the harness threw away 27 minutes of working code as a failure.

Lesson 1 — a budget limit must never discard a valid artifact. We shipped resilient_sdk_author: if a session ends early but left a runnable entry file, the evaluator grades it anyway. A turn cap is a spending limit, not a verdict. This single fix is what made every later run survivable.
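
A minimal sketch of that salvage pattern follows. It is not the source of resilient_sdk_author, which is not shown here: the exception class, the entry file name solution.py, and the syntax check used as a stand-in for "runnable" are all assumptions.

```python
from pathlib import Path
from typing import Callable, Optional


class TurnLimitExceeded(RuntimeError):
    """Hypothetical stand-in for whatever the SDK raises at the turn cap."""


def resilient_author(run_session: Callable[[Path], None], workdir: Path,
                     entry_name: str = "solution.py") -> Optional[Path]:
    """Run an authoring session and return the entry file if one survives.

    An early end to the session is treated as a spending limit, not a
    verdict: whatever entry file was left behind is handed on for grading.
    """
    try:
        run_session(workdir)
    except TurnLimitExceeded:
        pass  # budget exhausted; the artifact may still be valid

    entry = workdir / entry_name
    if not entry.is_file():
        return None
    try:
        # Cheap proxy for "runnable": the file must at least parse.
        compile(entry.read_text(encoding="utf-8"), str(entry), "exec")
    except (SyntaxError, ValueError):
        return None
    return entry
```

The caller grades the returned path exactly as it would after a clean session, and only reports a failure when the function returns None.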


Episode 2 — The clean re-run, and beating the reference (13:37)

Commit dcb3057 · algorithm: agent_authored_vo_tum_v2.py

With the resilient author in place and the budget raised to 80 turns, we re-ran the same task. This time the agent took a different and better tack: goodFeaturesToTrack + optical-flow tracking, a wider keyframe baseline (skip every other frame), a SIFT fallback, and keyframe interpolation.

Result: VERIFIED automatically at 0.052 m — no manual salvage, and this time better than the classical reference (0.052 vs 0.089). A background watchdog guarded the unattended run and fired zero kills.

Monocular VO v2 estimate vs ground truth

The wider keyframe baseline keeps the estimate (red) locked to ground truth (black) far longer — compare the tighter X–Z agreement to Episode 1's drift.

How it improved on Episode 1

  • 0.124 m → 0.052 m (2.4× more accurate), and crossed from worse-than-reference to better.
  • Manual salvage → fully automatic — the harness fix turned a fragile run into a hands-off one.
  • The wider keyframe baseline directly attacked the end-of-sequence drift that hurt v1.

What could still be better: it's monocular, so scale is unobservable — the grader had to align with Sim(3) and give away the scale for free. The trajectory is correct only up to an unknown stretch factor. And it was scored on the same sequence it tuned against. Both of those became the agenda for the next episode.


Episode 3 — Depth, generalization, and an expensive blind spot (16:53 → 18:08)

Commits c526166, 2c6caa8, dc4e01f · reference + agent_authored_vo_rgbd_v1.py

This was the most ambitious step, and it fixed the two weaknesses head-on:

  1. Depth was exposed (RGB-D). The provider now materializes the TUM depth channel, so the agent can recover metric scale — real metres, no Sim(3) freebie. The grader switched to SE(3) alignment: if you don't actually use depth, your scale is wrong and you fail.
  2. Generalization grading. The agent's code is now scored on held-out sequences it never authored against (fr1_desk, while it developed on fr1_xyz), with ground truth isolated outside the input directory. The grader also began reporting RPE (drift) and a scale-error diagnostic, not just ATE.
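
Why the switch from Sim(3) to SE(3) matters can be shown with a closed-form alignment. The sketch below uses the standard Umeyama least-squares fit; it is an illustration under our own assumptions, not the lab's grader, and the synthetic trajectory is made up for the demonstration. The lab's exact scale_err definition is not given in the text, so the fitted scale is only printed, not turned into a diagnostic.

```python
import numpy as np


def align(est: np.ndarray, gt: np.ndarray, with_scale: bool):
    """Fit gt ~ s * R @ est + t in the least-squares sense (Umeyama, 1991).

    est, gt: (N, 3) arrays of matched positions.
    with_scale=True gives a Sim(3) fit; False pins s = 1 (SE(3)).
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # keep R a proper rotation
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / len(est)
    s = float(np.trace(np.diag(D) @ S) / var_e) if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    return s, R, t


def ate_rmse(est: np.ndarray, gt: np.ndarray, with_scale: bool) -> float:
    """Absolute trajectory error (RMSE) after the chosen alignment."""
    s, R, t = align(est, gt, with_scale)
    aligned = s * (est @ R.T) + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))


# Synthetic ground truth, and an estimate with the right shape at half scale.
u = np.linspace(0.0, 2.0 * np.pi, 50)
gt = np.column_stack([np.cos(u), np.sin(2.0 * u), 0.3 * u])
theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
est = 0.5 * gt @ Rz.T + np.array([1.0, -2.0, 0.5])

sim3_ate = ate_rmse(est, gt, with_scale=True)    # scale is absorbed by the fit
se3_ate = ate_rmse(est, gt, with_scale=False)    # scale error shows up as ATE
fitted_scale = align(est, gt, with_scale=True)[0]
print(sim3_ate, se3_ate, fitted_scale)
```

Under Sim(3) the half-scale trajectory aligns essentially perfectly, which is the "freebie" the monocular runs received. Under SE(3) the same trajectory carries a large error, so an algorithm that ignores depth cannot hide behind the alignment.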

The calibration first: reference RGB-D PnP scored 0.057 m on the unseen scene with scale_err 0.077 (near-metric — depth works); the degenerate control blew up to 0.70 m and was correctly REJECTED. A ~12× discrimination margin, far wider than the monocular gate's.

The live agent result: VERIFIED at 0.033 m (SE(3), metric), RPE 0.010, scale_err 0.032 — near-perfect absolute scale, on a scene it had never touched, beating the classical RGB-D reference (0.057 m). The agent built a multi-strategy pipeline: SIFT → 3D-2D PnP RANSAC as the primary, KLT optical-flow as a fallback, keyframe recovery, depth for metric scale.

This is the strongest and most honest result in the project: metric (no scale gift) and generalizing to an unseen scene.

Graded with SE(3) — no scale freebie — on fr1_desk, a scene the agent never authored against. The estimate tracks ground truth in absolute metres (scale error just 3 %).

How it improved on Episode 2

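|              | Monocular v2             | Agent RGB-D     |
|--------------|--------------------------|-----------------|
| Held-out ATE | 0.052 m                  | 0.033 m         |
| Alignment    | Sim(3), scale given away | SE(3), metric   |
| Scored on    | same sequence            | unseen sequence |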

The second failure — and it was costly

The first live RGB-D attempt FAILED after ~1.17 million tokens. The cause was almost absurd: the RGB-D dataset names contained a colon (vo-rgbd-dev:fr1_xyz), and a colon breaks Docker's host:container volume-mount syntax. Every sandbox run errored — the agent was authoring completely blind, never once seeing its code execute, for over a million tokens. Local-mode calibration had passed cleanly because it uses no -v mounts at all, so nothing caught it before the live, billed run.

Lesson 2 — validate the path you'll actually run, not a convenient proxy. Fix: mount-safe dataset names and a Docker-mode reference dry-run (not just local) before any agent session. The blind spot was that our cheap test exercised a different code path than the expensive real one.
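
The naming half of that fix can be sketched as follows. This is an assumption-laden illustration: the helper names, the allowed character set, and the example paths are ours. The only facts relied on are that a docker run -v value is colon-separated (host path, container path, optional options such as ro) and that the offending name was vo-rgbd-dev:fr1_xyz.

```python
import re


def mount_safe_name(name: str) -> str:
    """Replace every character outside [A-Za-z0-9_.-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", name)


def volume_arg(host_dir: str, container_dir: str, read_only: bool = True) -> str:
    """Build a `docker run -v` value, refusing paths that contain a colon."""
    for path in (host_dir, container_dir):
        if ":" in path:
            raise ValueError(f"colon in mount path breaks -v parsing: {path!r}")
    return f"{host_dir}:{container_dir}" + (":ro" if read_only else "")


safe = mount_safe_name("vo-rgbd-dev:fr1_xyz")
print(safe)
print(volume_arg(f"/data/{safe}", "/input"))   # hypothetical host and container paths
```

Raising at argument-construction time moves the failure to before any agent session starts, which is where the Docker-mode dry-run is meant to catch it.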


Episode 4 — SLAM: the reference triumphs, the agent does not (19:14 → 20:54)

Commits 0818dc5, 6af0522 · agent_authored_vo_slam_v1.py (untracked)

The frontier: full SLAM with loop closure. On a sequence that revisits the same room (fr1_room), pure frame-to-frame odometry drifts badly; the fix is to detect when you've returned to a known place and optimize the whole pose graph to snap the loop shut.

The reference works, and proves loop closure is necessary. We built an RGB-D front-end → keyframes → geometrically-verified loop detection → a self-contained SE(3) pose-graph optimizer. On fr1_room:

| Configuration                      | Held-out ATE |
|------------------------------------|--------------|
| VO-only (no loop closure)          | 0.86 m       |
| Degenerate control                 | 1.02 m       |
| Reference SLAM (with loop closure) | 0.23 m       |

The bar was 0.347 m; VO-only and the degenerate control both fail it, while loop-closure SLAM clears it — a clean experimental demonstration that loop closure isn't optional on this sequence (~73% drift reduction).
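
The mechanism is easy to see in a toy. The sketch below is a translation-only 2-D pose graph solved by linear least squares; it is not the reference's SE(3) optimizer (which handles rotations and is nonlinear), and the circular path and constant per-step drift are invented for the illustration.

```python
import numpy as np


def solve_translation_graph(n_poses: int, edges) -> np.ndarray:
    """Least-squares positions for a translation-only pose graph.

    edges: list of (i, j, z), where z is a measurement of x_j - x_i.
    Pose 0 is pinned at the origin to remove the gauge freedom.
    """
    A = np.zeros((len(edges), n_poses - 1))
    b = np.zeros((len(edges), 2))
    for row, (i, j, z) in enumerate(edges):
        if j > 0:
            A[row, j - 1] += 1.0
        if i > 0:
            A[row, i - 1] -= 1.0
        b[row] = z
    free, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.vstack([np.zeros(2), free])


# A closed circular path: the last pose revisits the first.
n = 41
angles = np.linspace(0.0, 2.0 * np.pi, n)
truth = np.column_stack([np.cos(angles), np.sin(angles)])
truth -= truth[0]                       # express truth relative to pose 0

drift = np.array([0.01, 0.0])           # assumed constant per-step odometry bias
odometry = [(k, k + 1, truth[k + 1] - truth[k] + drift) for k in range(n - 1)]
loop = [(n - 1, 0, np.zeros(2))]        # loop closure: "back where I started"

vo_only = solve_translation_graph(n, odometry)
with_loop = solve_translation_graph(n, odometry + loop)


def rmse(est: np.ndarray) -> float:
    return float(np.sqrt(((est - truth) ** 2).sum(axis=1).mean()))


print(f"VO-only RMSE {rmse(vo_only):.4f}, with loop closure {rmse(with_loop):.4f}")
```

Without the loop edge the solve simply integrates the biased odometry, so the error grows with every step. The single loop edge forces the accumulated discrepancy to be spread across all edges, which pulls the whole path back toward the truth. The real problem adds rotations, which is where an optimizer can diverge.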

SLAM: loop closure works (left); the agent diverged (right)

Left: the reference SLAM (green) snaps the loop shut and stays on ground truth (black), while plain VO-only (red) drifts away. Right: the live agent's authored SLAM — its pose graph diverged, flinging the trajectory across ±600 m while the true path (black) is barely a dot at the origin. The verifier scored it 412 m and rejected it.

The third failure — an honest negative result

Then we let a live agent try to author SLAM from scratch. It wrote a structurally sound 352-line program: SIFT matching, loop detection, a pose graph. And it was REJECTED — its trajectory scored 412 m. The pose-graph optimization diverged, blowing a 3-metre-scale path up by two orders of magnitude.

The important part: the verifier caught it. A broken SLAM scored 412 m and was rejected — nothing false was accepted. That is the lab working exactly as designed.

But it was compounded by my own harness misconfiguration, and this is the sharpest lesson:

  • The hang-watchdog killed containers at 480 s.
  • The agent's SLAM took ~507 s inside the container (slower than its 153 s host run — fewer cores in the sandbox).
  • The grader's own timeout was 600 s.

So 480 < 507 < 600: every in-container test was killed before it finished. The agent never once saw its own output, so it could never observe the divergence, so it could never debug it. We had built a lab where the scientist's experiments were confiscated mid-run.

Lesson 3 — the safety budget must let the agent see its own results. A watchdog that fires before the grader's own timeout doesn't protect the run, it lobotomizes it. Corrected to 900 s (> the 600 s grader). More generally: per-iteration cost has to be low enough that the agent can actually loop — observe, hypothesize, retry — or it isn't doing science, it's guessing once.
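
The ordering constraint is simple enough to assert at startup. A minimal sketch, with hypothetical function names; the second values (480, 507, 600, 900) are the ones reported above.

```python
def check_timeouts(watchdog_s: float, grader_s: float) -> None:
    """Raise if the watchdog can kill a run the grader would still accept."""
    if watchdog_s <= grader_s:
        raise ValueError(
            f"watchdog ({watchdog_s} s) fires before the grader timeout "
            f"({grader_s} s); in-container runs may never report a result"
        )


def outcome(runtime_s: float, watchdog_s: float, grader_s: float) -> str:
    """Which limit, if any, ends a run of the given length first."""
    limit, name = min((watchdog_s, "killed_by_watchdog"),
                      (grader_s, "grader_timeout"))
    return name if runtime_s > limit else "finished"


print(outcome(507, watchdog_s=480, grader_s=600))   # the Episode 4 configuration
print(outcome(507, watchdog_s=900, grader_s=600))   # the corrected configuration
check_timeouts(watchdog_s=900, grader_s=600)        # passes; 480 vs 600 would raise
```

Running the check before any agent session would have turned the silent confiscation into an immediate configuration error.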

Lesson 4 — some problems don't fit in one session. SLAM-from-scratch is materially harder than VO/RGB-D: the agent got the structure right but the optimizer diverged. Loop closure is demonstrated in the repo via the working reference; the live agent SLAM is left as an open frontier, not dressed up as a success.


The arc, in one table

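| # | Trial           | What the agent built                  | Held-out result       | vs reference    | Verdict        |
|---|-----------------|---------------------------------------|-----------------------|-----------------|----------------|
| 1 | Monocular VO v1 | PnP + landmark map, deferred init     | 0.124 m (Sim3)        | worse (0.089)   | ✅ (salvaged)  |
| 2 | Monocular VO v2 | Optical flow, wide keyframe baseline  | 0.052 m (Sim3)        | better          | ✅ (automatic) |
| 3 | RGB-D VO        | SIFT→PnP RANSAC + KLT, metric depth   | 0.033 m (SE3, unseen) | better (0.057)  | ✅             |
| 4 | SLAM            | SIFT + loop detection + pose graph    | 412 m (diverged)      | (ref: 0.23)     | ❌ REJECTED    |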

The trajectory of the agent mirrors the trajectory it was estimating: confident progress, then a hard turn at the frontier.


What the lab learned about itself

Three of the four failures were harness bugs, not algorithm bugs — a turn limit that ate working code, a colon that blinded the agent, a watchdog that confiscated its experiments. That's the real finding of harness engineering: most of what stops a capable agent isn't its intelligence, it's the scaffolding around it. Every one of those was a wrong assumption baked into the harness about what the agent needed to succeed.

The verifier, by contrast, never failed. It rejected the broken SLAM, rejected every degenerate control, refused to give monocular runs free scale, and scored RGB-D on scenes the agent had never seen. The half of the lab with no AI in it is the half we trust.

What to improve next

  1. Close the failure-memory loop. The lab has a failed_approaches.md, but it's nearly empty — the four lessons above live only in prose. The agent that attempts SLAM next starts from zero, unaware the optimizer diverged last time. Wiring rejected results back into the next session's context is the highest-leverage fix, and it's the same "structured handoff across context resets" that harness-engineering prescribes.
  2. Record every run, including rejected ones. The SLAM run left no structured record at all: its registry has zero rows, and the 412 m result exists only in this narrative. The run that most needed durable memory produced none.
  3. Make iteration cheap enough for SLAM. The agent needs to run, see, and retry within budget. Until a single SLAM iteration is fast enough to loop on, the frontier stays closed.
  4. Stress-test the scaffolding. As the model improves, some guardrails (menu clamps, heavy deferred-init patterns) may already be over-engineering. Harness components encode assumptions about what the model can't do; those assumptions expire.
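
One possible shape for the first item, as a sketch only. The file name failed_approaches.md is the lab's; the entry format, the function names, and the idea of prepending the entries to the next session's prompt are assumptions about a fix that does not exist yet.

```python
from pathlib import Path


def record_rejection(memory: Path, trial: str, error_m: float,
                     bar_m: float, note: str) -> None:
    """Append one rejected result to the failure memory as a bullet line."""
    line = f"- {trial}: REJECTED at {error_m:g} m (bar {bar_m:g} m). {note}\n"
    with memory.open("a", encoding="utf-8") as fh:
        fh.write(line)


def failure_context(memory: Path, max_entries: int = 20) -> str:
    """Return the most recent entries, ready to paste into a session prompt."""
    if not memory.is_file():
        return ""
    lines = memory.read_text(encoding="utf-8").splitlines()
    entries = [ln for ln in lines if ln.startswith("- ")]
    return "\n".join(entries[-max_entries:])
```

With something like this in place, the 412 m SLAM rejection would be written once by the verifier side and read back at the start of the next SLAM attempt, instead of living only in this narrative.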

The lab is honest by construction: a result counts only when code it can't see says so. That's why the proudest entry in this chronicle (0.033 m, metric, on an unseen scene) and the most instructive one (412 m, rejected) are recorded with exactly the same candor.