A Beat, and Its Undoing: The Blueberry Chronicle
Blueberry was an earlier self-running perception lab, run on a CLI loop. This is the story of its one real win, and of how the lab's own verifier dismantled it. It is recorded because the dismantling is the point.
— Written 2026-06-01.
The premise
Blueberry ran missions on a loop: form a hypothesis, build it, measure it against an external baseline, and — the part that mattered — let an independent evaluator re-run the oracle and decide, so the agent that wrote the code never graded itself. Same idea Touchstone later generalized; Blueberry is where it was first stress-tested, on geometric computer vision.
The lab's one research line with real headroom was differentiable classical vision: take a converged classical solver, make it differentiable through the implicit function theorem, and refine on top of a frozen front-end. The question was whether that beats the classical method it refines.
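For readers who want the implicit-function-theorem step spelled out, here is the standard identity it relies on. The symbols are introduced here for illustration and are not taken from the lab's code: $x^\ast(\theta)$ is the converged solver output, $\theta$ stands for whatever the solver depends on (for instance the frozen front-end's correspondences or weights), and $g$ is the optimality condition the solver satisfies at convergence.

$$
g\bigl(x^\ast(\theta), \theta\bigr) = 0
\quad\Longrightarrow\quad
\frac{\partial x^\ast}{\partial \theta}
= -\left(\frac{\partial g}{\partial x}\right)^{-1} \frac{\partial g}{\partial \theta}
\Bigg|_{x = x^\ast(\theta)}
$$

This holds when $\partial g / \partial x$ is invertible at the solution. The practical consequence is that gradients can be taken at the converged point without back-propagating through every solver iteration.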
The win — M-011
A differentiable IRLS (Cauchy) essential-matrix refiner, run on frozen SuperPoint+LightGlue correspondences, was compared apples-to-apples against cv2's RANSAC pose on the full MegaDepth-1500 benchmark. It won:
+0.0279 AUC@5 over RANSAC (also +0.018 @10, +0.0098 @20), differentiable end-to-end on all 1500 pairs. The lab's first beat of an external baseline.
It was honest even in victory: the gain lived in 228 of 1500 pairs (228 better, 42 worse, the remaining 1230 unchanged). It was real, but modest. Still, a ceiling moved for the first time. It was tempting to stop here and call it a result.
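To make "IRLS (Cauchy)" concrete, here is a minimal sketch of iteratively reweighted least squares with Cauchy weights. It is not the lab's refiner: it fits a 2D line rather than an essential matrix, and the function names, the `scale` parameter, and the iteration count are all assumptions chosen for illustration. The mechanism is the same, though: solve a weighted least-squares problem, recompute residuals, down-weight the large ones, and repeat.

```python
import numpy as np

def cauchy_weights(residuals, scale):
    # IRLS weight for the Cauchy loss rho(r) = (scale^2 / 2) * log(1 + (r / scale)^2):
    # w(r) = rho'(r) / r = 1 / (1 + (r / scale)^2)
    return 1.0 / (1.0 + (residuals / scale) ** 2)

def irls_line_fit(x, y, scale=1.0, iters=20):
    """Fit y ~ a*x + b with Cauchy-weighted IRLS (toy stand-in, hypothetical helper)."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    # Seed with ordinary least squares, then refine.
    params = np.linalg.lstsq(A, y, rcond=None)[0]
    for _ in range(iters):
        r = A @ params - y
        sw = np.sqrt(cauchy_weights(r, scale))
        # Weighted least squares: scale rows of A and y by sqrt(weight).
        params = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)[0]
    return params

if __name__ == "__main__":
    x = np.linspace(0.0, 10.0, 50)
    y = 2.0 * x + 1.0
    y[::10] += 30.0  # inject a few gross outliers
    print(irls_line_fit(x, y))
```

Each pass gives gross outliers a weight near zero, so the fit is pulled back toward the inliers. In the refiner described above the seed would be the RANSAC pose rather than an ordinary least-squares fit.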
The undoing — M-012, then M-013
The lab didn't stop. Two follow-ups, each a falsifiable prediction, each graded by the independent evaluator on a held-out measurement:
M-012 — does it generalize? Same refiner, same exact hyperparameters and seed, swapped only the benchmark to ScanNet-1500 (indoor). The prediction: it clears a +0.01 bar again. It did not.
+0.0053 AUC@5 indoors — positive, but a fifth of the outdoor gain and below the bar. On 1235 of 1500 pairs the refinement fell back to the RANSAC seed (zero contribution). The win was domain-conditional, not universal.
M-013 — does it beat a strong baseline? The sharpest doubt, from an independent audit: RANSAC is a weak baseline. So M-013 changed exactly one thing — the baseline became MAGSAC++, the standard strong robust estimator — and re-ran the identical comparison with zero re-tuning.
−0.1154 AUC@5. The refiner lost, decisively. MAGSAC++ (~0.57) sits well above the differentiable refiner (~0.46), which sits just above plain RANSAC (~0.43). The M-011 beat was real only against a weak baseline.
So the agenda bet behind the whole line was retired, honestly: the differentiable refiner is not competitive with a strong classical estimator. The lab has no result that beats a strong external baseline. That is written down, not buried.
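The grading rule behind these verdicts can be sketched in a few lines: the bar is fixed before measurement, and the verdict depends only on the delta the independent evaluator reports. The class and function names are hypothetical, and the bar of 0.0 for M-013 is an assumption (the chronicle states a +0.01 bar only for the RANSAC comparisons). The deltas are the AUC@5 figures reported above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prediction:
    mission: str
    bar: float  # minimum AUC@5 gain required, fixed before measurement

def grade(prediction: Prediction, measured_delta: float) -> str:
    """Return the verdict for a delta reported by the independent evaluator."""
    return "confirmed" if measured_delta >= prediction.bar else "falsified"

# Deltas are the AUC@5 numbers from this chronicle.
# The 0.0 bar for M-013 is an assumption: any gain over MAGSAC++ would count.
missions = [
    (Prediction("M-011", bar=0.01), +0.0279),
    (Prediction("M-012", bar=0.01), +0.0053),
    (Prediction("M-013", bar=0.0), -0.1154),
]

verdicts = {p.mission: grade(p, delta) for p, delta in missions}
print(verdicts)
```

Nothing in `grade` lets the producer adjust the bar after seeing the number, which is what keeps a result like +0.0053 from being rounded up into a win.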
The audit — the lab grading itself
Between the win and the undoing, three independent reviewers were pointed at the lab's own harness. They found that the verification gates it advertised were, at that moment, inert — the flagship M-011 had actually reached "done" self-graded, the exact failure the design claimed to prevent. The finding was recorded as a falsified hypothesis about the lab itself, and the gates were made to actually fire before anything else proceeded. The follow-ups above ran through the fixed gates.
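As an illustration of what a gate that "actually fires" means, here is a minimal sketch. It is not the lab's harness. Every name in it (`GateError`, `Verdict`, `mark_done`, the ID strings) is hypothetical. The point is only that reaching "done" should be impossible without a verdict from someone other than the producer.

```python
from dataclasses import dataclass
from typing import Optional

class GateError(Exception):
    """Raised when a mission tries to reach 'done' without independent grading."""

@dataclass(frozen=True)
class Verdict:
    evaluator_id: str  # who re-ran the oracle
    passed: bool       # what they decided

def mark_done(producer_id: str, verdict: Optional[Verdict]) -> str:
    # An inert gate would skip both checks and return "done" unconditionally.
    if verdict is None:
        raise GateError("no evaluator verdict on record")
    if verdict.evaluator_id == producer_id:
        raise GateError("verdict was issued by the producer itself")
    return "done" if verdict.passed else "rejected"

if __name__ == "__main__":
    print(mark_done("agent-a", Verdict("evaluator-b", True)))
    try:
        mark_done("agent-a", Verdict("agent-a", True))
    except GateError as err:
        print("blocked:", err)
```

Under this sketch, a self-graded mission like the original M-011 run would raise instead of silently reaching "done".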
What the chronicle is
| Mission | Claim | Baseline | Δ AUC@5 | Verdict |
|---|---|---|---|---|
| M-011 | diff-refiner beats RANSAC on two-view pose | cv2-RANSAC (weak) | +0.0279 | ✅ confirmed |
| M-012 | …and it generalizes indoors | cv2-RANSAC, ScanNet | +0.0053 (below bar) | ✗ falsified |
| M-013 | …and it beats a strong baseline | MAGSAC++ | −0.1154 | ✗ falsified |
One win, taken apart by two honest negatives and a self-audit. The differentiable engine turned out to be a parity / transfer tool, not a tool that wins against the best classical methods — a sober, useful thing to know, and the reason the lab pivoted away from "beat classical at inference" toward "use differentiability where a frozen method can't adapt at all."
A result counts only when something the producer can't influence says so — and that applies hardest to the producer's own best result. The proudest entry here (+0.0279) and the one that retired the whole bet (−0.1154) are recorded with the same candor.