Blueberry Lab

About

Throw a measurable goal at it. Seven AI personas hold a brainstorm, one idea gets promoted to an experiment, the lab writes real code, runs it inside Docker, and accumulates measured artifacts. An ongoing test of whether a one-person lab can produce paper-grade tools.

v3 framework — 4 disciplines

Four patterns lifted from Anthropic's automated C-compiler build case study. With all four enforced, the research loop ships work that holds up.

  • L1 — Obsess over the test oracle. Every experiment pins an external reference implementation (SimpleITK, etc.) as its oracle; the agent only passes by matching it. No grading yourself with the eval code you also wrote.
  • L2 — One iteration = one task. Each turn is exactly one state transition. No bundling several jobs into one step.
  • L3 — Greenfield only. Each experiment begins in an empty sandbox at algorithm/<pillar>/missions/M-NNN/. No brownfield patches on existing code. M-001 (brownfield) failed, while the completed greenfield experiments (M-002, M-004, M-005) all succeeded. That gap is the evidence.
  • L4 — Living memory. Every experiment maintains a CLAUDE.md (the running ops manual) and a CHANGELOG.md (per-turn notes). When a new trap is found, the agent writes it down. The next experiment doesn't step in it.
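
The greenfield and living-memory disciplines can be sketched in a few lines. This is a minimal illustration, not the lab's actual code: the helper names create_sandbox and log_turn, the seed text written into the files, and the mission id M-999 are all hypothetical. Only the path layout algorithm/<pillar>/missions/M-NNN/ and the file names CLAUDE.md and CHANGELOG.md come from the list above.

```python
from pathlib import Path
import tempfile


def create_sandbox(root: Path, pillar: str, mission_id: str) -> Path:
    """Create an empty mission directory; refuse to reuse a non-empty one."""
    sandbox = root / "algorithm" / pillar / "missions" / mission_id
    if sandbox.exists() and any(sandbox.iterdir()):
        # Greenfield only: never patch on top of existing files.
        raise FileExistsError(f"{sandbox} is not empty")
    sandbox.mkdir(parents=True, exist_ok=True)
    # Seed the two living-memory files (seed contents are an assumption).
    (sandbox / "CLAUDE.md").write_text(f"# {mission_id} ops manual\n", encoding="utf-8")
    (sandbox / "CHANGELOG.md").write_text(f"# {mission_id} changelog\n", encoding="utf-8")
    return sandbox


def log_turn(sandbox: Path, turn: int, note: str) -> None:
    """Append one per-turn note to CHANGELOG.md."""
    with (sandbox / "CHANGELOG.md").open("a", encoding="utf-8") as fh:
        fh.write(f"- turn {turn}: {note}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        box = create_sandbox(Path(tmp), "image-registration", "M-999")
        log_turn(box, 1, "hypothetical trap: written down for the next experiment")
        print((box / "CHANGELOG.md").read_text(encoding="utf-8"))
```

Calling create_sandbox a second time with the same mission id raises FileExistsError, which is one way to make the "no brownfield" rule mechanical rather than a convention.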

Two mechanisms

              BRAINSTORM                                    EXPERIMENT
Input         Problem + measurable metric                   Promoted brainstorm card
Mechanism     7 personas in parallel + Director synthesis   Bounded loop, 1 turn = 1 transition
Output        lab/brainstorms/B-NNN.json                    runs/MEASURED-*.json + sidecar PNG
Duration      5–10 minutes                                  Hours (CPU)
Human gates   1 (declare the metric)                        2 (approve plan, verify done)
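
The brainstorm mechanism (seven personas in parallel, then a Director synthesis) follows a plain fan-out and reduce shape. The sketch below shows only that shape and is full of assumptions: personas are stubbed as plain functions rather than model calls, the card fields (persona, idea, score), the placeholder scores, and the "pick the highest score" Director rule are all hypothetical, and the real B-NNN.json schema is not described on this page.

```python
from concurrent.futures import ThreadPoolExecutor
import json


def make_persona(name, score):
    """Build a stub persona; a real one would presumably call a model."""
    def persona(problem, metric):
        return {"persona": name, "idea": f"{name} idea for {problem}", "score": score}
    return persona


def director(cards):
    """Hypothetical synthesis rule: promote the highest-scoring card."""
    return max(cards, key=lambda card: card["score"])


def run_brainstorm(problem, metric, personas, synthesize):
    """Fan out to all personas in parallel, then let the Director pick one."""
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:
        # pool.map returns results in the same order as the personas list.
        cards = list(pool.map(lambda p: p(problem, metric), personas))
    return {"problem": problem, "metric": metric,
            "cards": cards, "promoted": synthesize(cards)}


if __name__ == "__main__":
    # Seven stub personas with arbitrary placeholder scores.
    personas = [make_persona(f"persona-{i}", score=i / 10) for i in range(1, 8)]
    result = run_brainstorm("example problem", "example metric", personas, director)
    print(json.dumps(result["promoted"], indent=2))
```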

Experiment state machine

queued
  └─→ planning ──→ executing ──→ measuring ──→ evaluating
                                                  ├─→ done    (target met)
                                                  ├─→ failed  (budget exhausted)
                                                  └─→ planning  (retry within budget)

Each turn is exactly one transition. When iteration_budget_remaining hits zero, the experiment is forced to failed. M-001 ended this way and is recorded as a failure (AMOTA 0.471 → 0.4228); v3 was introduced afterward.
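
The diagram and the budget rule translate into a small state machine. The following is a sketch under stated assumptions, not the lab's implementation: the Experiment class and its step method are hypothetical names, the budget is assumed to be charged once per evaluation, and a met target is assumed to win over an exhausted budget on the same turn. Only the state names and iteration_budget_remaining come from the text above.

```python
from dataclasses import dataclass, field

# Allowed transitions, copied from the diagram.
TRANSITIONS = {
    "queued": {"planning"},
    "planning": {"executing"},
    "executing": {"measuring"},
    "measuring": {"evaluating"},
    "evaluating": {"done", "failed", "planning"},
    "done": set(),
    "failed": set(),
}


@dataclass
class Experiment:
    mission_id: str
    iteration_budget_remaining: int
    state: str = "queued"
    history: list = field(default_factory=list)

    def step(self, target_met: bool = False) -> str:
        """Perform exactly one state transition and return the new state."""
        if self.state in ("done", "failed"):
            raise RuntimeError(f"{self.mission_id} is already {self.state}")
        if self.state == "evaluating":
            # Assumption: one unit of budget is spent per evaluation.
            self.iteration_budget_remaining -= 1
            if target_met:
                nxt = "done"
            elif self.iteration_budget_remaining <= 0:
                nxt = "failed"      # budget exhausted
            else:
                nxt = "planning"    # retry within budget
        else:
            # Every non-evaluating state has exactly one successor.
            (nxt,) = TRANSITIONS[self.state]
        self.history.append((self.state, nxt))
        self.state = nxt
        return nxt


# Usage: a hypothetical mission that never meets its target.
exp = Experiment("M-999", iteration_budget_remaining=2)
while exp.state not in ("done", "failed"):
    exp.step(target_met=False)
print(exp.state, exp.history)
```

Because step performs a single transition per call, a driver loop like the one above maps one loop iteration to one turn, which keeps the "one iteration = one task" rule visible in the control flow.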

Active pillars

  • image-registration — differentiable image registration. Three experiments complete: M-002 (synthetic affine), M-004 (multi-modal MIND-SSC), M-005 (real OpenNeuro T1 longitudinal).
  • pose-tracking — v2 frozen. M-001 (CenterPoint + AB3DMOT, AMOTA 0.471) is the failure case and the control group for v3 validation.
  • infra — meta pillar. Experiments that audit and improve the lab framework itself. M-003 queued.

Honest caveats

  • One-person lab on personal hardware (RTX 3080 Mobile + 16 GB RAM). Large-scale training is out of reach.
  • All measurements are single-run; no statistical significance is claimed. Robustness is judged from ratio comparisons, not from significance tests.
  • Benchmark scores are relative to SITK. No head-to-head against ANTs / VoxelMorph or standard suites like DIR-Lab.
  • v3 was validated across 5 experiments. Its generality across other domains is unproven.
  • This site is a git-push snapshot. No live streaming (Tier 0).

Acknowledgments

  • Anthropic engineering blog — the loop pattern and the automated C-compiler build case study. The v3 framework's four lessons are taken directly from there.
  • OpenNeuro ds007328 (Petrovskiy 2024, CC0) — the real T1 brain MRI dataset that M-005 ran on.
  • SimpleITK — the oracle for every experiment. The reason the agent can't grade itself.
  • 42dot AD-Perception job posting — the source of the 7-persona pillar layout (v1; v2 / v3 borrow only the structure).

Explore

  • Experiments — five experiments in detail (goal, attempts, measured results, viz)
  • Brainstorms — per-persona idea cards + Director synthesis
  • Members — the 7 personas + Director + Frontend lead

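In one line

Blueberry Lab is an autonomous perception research lab run by one person. Throw a measurable goal at it and seven AI personas hold a meeting; once one of their ideas is promoted to an experiment, the lab writes real code, runs it in Docker, and accumulates measured results.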