About
Blueberry Lab
Throw a measurable goal at it. Seven AI personas hold a brainstorm, one idea gets promoted to an experiment, the lab writes real code, runs it inside Docker, and accumulates measured artifacts. An ongoing test of whether a one-person lab can produce paper-grade tools.
v3 framework — 4 disciplines
Four patterns lifted from Anthropic's automated C-compiler build case study. With all four enforced, the research loop ships work that holds up.
- L1 — Obsess over the test oracle. Every experiment pins an external reference implementation (SimpleITK, etc.) as its oracle; the agent only passes by matching it. No grading yourself with the eval code you also wrote.
- L2 — One iteration = one task. Each turn is exactly one state transition. No bundling several jobs into one step.
- L3 — Greenfield only. Each experiment begins in an empty sandbox at algorithm/<pillar>/missions/M-NNN/. No brownfield patches on existing code. M-001 (brownfield) failed where M-002 through M-005 (greenfield) all succeeded; that gap is the evidence.
- L4 — Living memory. Every experiment maintains a CLAUDE.md (the running ops manual) and a CHANGELOG.md (per-turn notes). When a new trap is found, the agent writes it down. The next experiment doesn't step in it.
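The greenfield and living-memory rules above can be sketched as a small setup helper. This is a minimal illustration, not the lab's actual tooling: the function names `init_mission` and `record_trap`, the seed text written into each file, and the three-digit zero padding of the mission number are all assumptions. Only the directory layout and the two file names come from the description above.

```python
from pathlib import Path


def init_mission(root: Path, pillar: str, mission_number: int) -> Path:
    """Create an empty mission sandbox and seed its two memory files.

    Hypothetical helper: the path layout follows the text above, the rest
    is assumed for illustration.
    """
    mission_dir = root / "algorithm" / pillar / "missions" / f"M-{mission_number:03d}"
    # Greenfield rule: refuse to start in a directory that already has content.
    if mission_dir.exists() and any(mission_dir.iterdir()):
        raise FileExistsError(f"{mission_dir} is not empty; greenfield only")
    mission_dir.mkdir(parents=True, exist_ok=True)
    # Living memory: the ops manual and the per-turn changelog.
    (mission_dir / "CLAUDE.md").write_text(
        "# Ops manual\n\n## Known traps\n", encoding="utf-8"
    )
    (mission_dir / "CHANGELOG.md").write_text("# Changelog\n", encoding="utf-8")
    return mission_dir


def record_trap(mission_dir: Path, turn: int, trap: str) -> None:
    """Append a newly found trap to both memory files."""
    with (mission_dir / "CLAUDE.md").open("a", encoding="utf-8") as manual:
        manual.write(f"- {trap}\n")
    with (mission_dir / "CHANGELOG.md").open("a", encoding="utf-8") as log:
        log.write(f"- turn {turn}: found trap: {trap}\n")
```

Calling `init_mission` a second time on the same mission number raises `FileExistsError`, because the seeded files already make the directory non-empty. That is the greenfield check doing its job.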
Two mechanisms
| | BRAINSTORM | EXPERIMENT |
|---|---|---|
| Input | Problem + measurable metric | Promoted brainstorm card |
| Mechanism | 7 personas in parallel + Director synthesis | Bounded loop, 1 turn = 1 transition |
| Output | lab/brainstorms/B-NNN.json | runs/MEASURED-*.json + sidecar PNG |
| Duration | 5–10 minutes | Hours (CPU) |
| Human gates | 1 (declare the metric) | 2 (approve plan, verify done) |
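The BRAINSTORM column can be sketched as a fan-out followed by a synthesis step. Everything in this sketch beyond the table is an assumption: the persona and director callables stand in for whatever model calls the lab really makes, the JSON field names are invented for illustration, and `run_brainstorm` is a hypothetical name. Only the parallel-personas-then-Director shape and the `lab/brainstorms/B-NNN.json` output path come from the table.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List

# A persona takes (problem, metric) and returns an idea card as a dict.
Persona = Callable[[str, str], dict]
# The director takes all cards and returns a synthesis dict.
Director = Callable[[List[dict]], dict]


def run_brainstorm(
    problem: str,
    metric: str,
    personas: Dict[str, Persona],
    director: Director,
    out_dir: Path,
    number: int,
) -> Path:
    """Run all personas in parallel, synthesize, and write B-NNN.json."""
    with ThreadPoolExecutor(max_workers=len(personas)) as pool:
        futures = {
            name: pool.submit(fn, problem, metric) for name, fn in personas.items()
        }
        # Collect cards in the personas' insertion order.
        cards = [
            {"persona": name, **future.result()} for name, future in futures.items()
        ]
    record = {
        "id": f"B-{number:03d}",
        "problem": problem,
        "metric": metric,
        "cards": cards,
        "synthesis": director(cards),  # assumed shape; the real schema is not given
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"B-{number:03d}.json"
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

Threads are used here on the assumption that each persona is I/O-bound (waiting on a model response), which is the usual case for this kind of fan-out.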
Experiment state machine
queued
  └─→ planning ──→ executing ──→ measuring ──→ evaluating
                                                 ├─→ done (target met)
                                                 ├─→ failed (budget exhausted)
                                                 └─→ planning (retry within budget)

Each turn is exactly one transition. When iteration_budget_remaining hits zero, the experiment is forced to failed. M-001 failed this way, honestly (AMOTA 0.471 → 0.4228); v3 was introduced afterward.
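The state machine above can be sketched as a single `step` function that performs exactly one transition per call. The state names and `iteration_budget_remaining` come from the description. The assumption that every turn spends one unit of budget, and the names `State`, `Experiment`, and `step`, are illustrative and may not match how the lab actually accounts for iterations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    QUEUED = "queued"
    PLANNING = "planning"
    EXECUTING = "executing"
    MEASURING = "measuring"
    EVALUATING = "evaluating"
    DONE = "done"
    FAILED = "failed"


# The unconditional part of the diagram.
NEXT = {
    State.QUEUED: State.PLANNING,
    State.PLANNING: State.EXECUTING,
    State.EXECUTING: State.MEASURING,
    State.MEASURING: State.EVALUATING,
}
TERMINAL = {State.DONE, State.FAILED}


@dataclass
class Experiment:
    state: State = State.QUEUED
    iteration_budget_remaining: int = 10


def step(exp: Experiment, target_met: Optional[bool] = None) -> State:
    """Advance the experiment by exactly one transition."""
    if exp.state in TERMINAL:
        raise ValueError(f"experiment already finished: {exp.state.value}")
    if exp.state is State.EVALUATING and target_met is None:
        raise ValueError("evaluating needs a target_met verdict")
    # Budget exhausted: forced to failed, whatever the current state.
    if exp.iteration_budget_remaining <= 0:
        exp.state = State.FAILED
        return exp.state
    exp.iteration_budget_remaining -= 1  # assumption: one turn costs one unit
    if exp.state is State.EVALUATING:
        if target_met:
            exp.state = State.DONE
        elif exp.iteration_budget_remaining > 0:
            exp.state = State.PLANNING  # retry within budget
        else:
            exp.state = State.FAILED
    else:
        exp.state = NEXT[exp.state]
    return exp.state
```

Keeping the budget check inside `step` means no caller can skip it: an experiment with a budget of 2 reaches `executing` and is then forced to `failed` on its third turn.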
Active pillars
- image-registration — differentiable image registration. Three experiments complete: M-002 (synthetic affine), M-004 (multi-modal MIND-SSC), M-005 (real OpenNeuro T1 longitudinal).
- pose-tracking — v2 frozen. M-001 (CenterPoint + AB3DMOT, AMOTA 0.471) is the failure case and the control group for v3 validation.
- infra — meta pillar. Experiments that audit and improve the lab framework itself. M-003 queued.
Honest caveats
- One-person lab on personal hardware (RTX 3080 Mobile + 16 GB RAM). Large-scale training is out of reach.
- All measurements are single-run; no statistical significance is claimed. Ratio comparisons carry the robustness judgment.
- Benchmark scores are relative to SITK. No head-to-head against ANTs / VoxelMorph or standard suites like DIR-Lab.
- v3 was validated across 5 experiments. Its generality across other domains is unproven.
- This site is a git-push snapshot. No live streaming (Tier 0).
Acknowledgments
- Anthropic engineering blog — the loop pattern and the automated C-compiler build case study. The v3 framework's four lessons are taken directly from there.
- OpenNeuro ds007328 (Petrovskiy 2024, CC0) — the real T1 brain MRI dataset that M-005 ran on.
- SimpleITK — the oracle for every experiment. The reason the agent can't grade itself.
- 42dot AD-Perception job posting — the source of the 7-persona pillar layout (v1; v2 / v3 borrow only the structure).
Explore
- Experiments — five experiments in detail (goal, attempts, measured results, viz)
- Brainstorms — per-persona idea cards + Director synthesis
- Members — the 7 personas + Director + Frontend lead
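In one line
Blueberry Lab is an autonomous perception research lab run by one person. Throw it a measurable goal and seven AI personas hold a meeting; promote one of their ideas to an experiment and the lab writes real code, runs it in Docker, and accumulates measured results.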