B-003

Multi-modal differentiable image registration in PyTorch (T1↔T2 brain MRI). Intensity correspondence is broken across modalities so SSD/NCC fail; need a new modality-invariant similarity metric. Stack reuses M-002's IC-LK + mixed-precision LM + DEQ implicit-diff via algorithm/image-registration/shared/. Oracle: SimpleITK MattesMutualInformation + BSplineTransform. No NN training, no learned weights, CPU only. Pass: max(our_rmse) ≤ 1.15 × max(oracle_rmse) + 1e-3 AND torch.autograd.grad succeeds end-to-end on all 5 fixtures.
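A minimal sketch of the pass criterion as stated in the brief, assuming the per-fixture RMSE values and autograd outcomes have already been collected into lists. The name `passes_gate` and its arguments are hypothetical and not part of the shared stack.

```python
def passes_gate(our_rmse, oracle_rmse, grad_ok, ratio=1.15, slack=1e-3):
    """True when the worst-case RMSE is inside the band and autograd succeeded everywhere.

    our_rmse, oracle_rmse: one float per fixture.
    grad_ok: one bool per fixture (did torch.autograd.grad run end-to-end).
    """
    if not (len(our_rmse) == len(oracle_rmse) == len(grad_ok)):
        raise ValueError("every list needs exactly one entry per fixture")
    # The brief compares the two maxima, not the fixtures pairwise.
    within_band = max(our_rmse) <= ratio * max(oracle_rmse) + slack
    return within_band and all(grad_ok)
```

Because the gate compares maxima, the fixture that is worst for our method need not be the fixture that is worst for the oracle.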

Metric: rmse_vs_simpleitk_max, target 0.001, paper band 0.0001–0.005

Director synthesis

Card-8 is the pick because it fuses the two cards that own the load-bearing axes of this brief. Card-5's MIND-SSC-12 is the descriptor with the Heinrich-anchored ~22% TRE lift over MI (4.12mm → 3.20mm, then 3.20 → 2.86 with SSC), and it is the only card specifying the IC-LK SD-image precomputation that keeps M-002's inverse-compositional contract intact. Card-3 owns the numerical regime: per-level scale-aware ε, fused separable convs (5+5 vs 25 multiplies per pixel, a 2.5× FLOP cut), Kahan accumulation (the proven M-002 B-002.card-3→card-8 pattern), and the gradcheck-at-float64-first discipline. Neither parent alone is safe to ship: card-5 without card-3's ε-before-div clamp will NaN on homogeneous WM (the exact failure card-7's degeneracy fixtures expose); card-3 without card-5's SSC descriptor optimizes faster toward the wrong residual on fixture-5. The load-bearing assumption is that the K=12 grouped-conv working set actually fits L1 at tile=32 halo=2 on the target CPU. The 24 KB nominal figure is card-3's 32²×6×4 B for 6 float32 channels; 12 channels double that to ~48 KB before the halo, and actual L1 layout also depends on tensor stride and PyTorch's grouped-conv kernel selection. If it spills, we lose the ~3× wall-time win but keep the accuracy, so the downside is bounded. Card-10 ranks 4th because the MI gating is genuine insurance against rot15°, but card-8 likely passes 5/5 without it. Cards 9, 7, 1 rank low: the 60-case battery (card-7, card-9) is high-value as a verification harness but does not itself close the accuracy gap, and card-1's SIREN fit is the only non-MIND path and the only one with a 3-5s per-pair pre-fit cost, which the CPU budget can't absorb without evidence it beats MIND-SSC.

B-003.card-8 — MIND-SSC-12 with fused separable kernels + per-level scale-aware ε + Kahan-accumulated patch-SSD, IC-LK gradients precomputed once on SD-image template.
B-003.card-5 — Implement MIND-SSC with vectorized unfold-based conv, epsilon-floored variance, and per-channel SSD that plugs into M-002's IC-LK Hessian via chain rule on steepest-descent images.
B-003.card-3 — Pick MIND, fuse its 6-channel descriptor into a streaming separable kernel with float32 forward + float64 Cholesky carryover from M-002, and gate Mattes MI behind an underflow-safe soft-histogram only as oracle parity check.
B-003.card-10 — MS-MIND (4-level σ=1,2,4,8, weighted 0.4/0.3/0.2/0.1) with Mattes MI gating term (weight 0.1) at the coarsest 2 levels only, dropped if MS-MIND alone clears the 1.15× gate.
B-003.card-4 — Drop MI as the inner loss and run MIND-SSC inside a coarse-to-fine IC-LK with a per-voxel Huber kernel; keep Mattes MI only as a global gating term on the coarsest pyramid level to escape T1↔T2 local optima.
B-003.card-6 — 4-level Gaussian-pyramid MIND with per-scale self-similarity costs, optimized via M-002's IC-LK + mixed-precision LM + DEQ — the closest training-free analogue of an SSL-shaped feature space for T1↔T2 IR.
B-003.card-2 — MIND-SSD residual + Cauchy IRLS + coarse-to-fine M-estimator schedule: port direct VO's robust LM to multi-modal IR, keep MI as fixture-5 (bias+FFD) fallback only.
B-003.card-9 — MIND-SSD with Cauchy IRLS + annealed kernel width, validated and tuned against the 60-case tri-axial battery with MIND-degeneracy fixtures as hard NaN gates.
B-003.card-7 — Replace the 5-fixture spec with a 60-case multi-modal counterfactual battery stratified by intensity-coupling, bias, modality-asymmetric noise, and MIND-degeneracy traps, gated by tri-axial pass criteria (per-class RMSE + oracle-agreement + modality-symmetry + convergence-recovery).
B-003.card-1 — Per-pair SIREN-fitted Structure Tensor Field (STF) as a differentiable modality-invariant descriptor, replacing MIND's hand-crafted patch self-similarity with a continuous coordinate-MLP fit to the local 2nd-moment matrix.

Promoted cards

M-004 ← B-003.card-8

All cards (ranked)

B-003.card-8 (#1 rank, cross-pollination)
director-jiwoo-han
MIND-SSC-12 with fused separable kernels + per-level scale-aware ε + Kahan-accumulated patch-SSD, IC-LK gradients precomputed once on SD-image template.
Numerical core + descriptor depth fused — the MIND-SSC residual is only as trustworthy as the variance denominator it divides by, and the IC-LK Jacobian only as stable as the kernel that computes it.
Take card-5's 12-edge symmetric MIND-SSC as the descriptor surface and card-3's numerical machinery as the substrate beneath it. Implement patch-SSD via fused separable 1D convs (5+5 instead of 5×5 multiplies per pixel, a 2.5× FLOP cut), tiled 32×32 streaming with halo=2 so the K=12 grouped-conv working set stays cache-resident (card-3's ~24 KB figure assumes 6 float32 channels; 12 channels come to ~48 KB per tile before the halo, so L1 residency has to be measured on the target CPU), Kahan compensated summation for the per-patch sum-of-squares accumulator (the M-002 B-002.card-3 → card-8 pattern that won there), and ε = 1e-5 × var(I) recomputed PER pyramid level rather than a global constant. Clamp V BEFORE the div, not after exp: as V→0 the backward pass of torch.exp(-D/V) multiplies a zero by an inf and returns NaN gradients, and gradcheck misses this. Precompute ∇MIND(I_template) once at the reference image (inverse-compositional intact); the 12-channel Jacobian inflates the JᵀJ assembly cost by a constant factor of 12 but preserves the 6×6 affine / banded-FFD Hessian structure so M-002's mixed-precision LM drops in unchanged. gradcheck at float64 on a 16×16 toy fixture is a hard prereq before any float32 production run.
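The Kahan accumulator named above can be sketched in plain Python. This shows the compensation pattern only, not the M-002 code; Python floats are float64 whereas the card targets a float32 accumulator, and `kahan_sum` and `naive_sum` are hypothetical names.

```python
def naive_sum(values):
    """Plain left-to-right accumulation, written as a loop to avoid any built-in compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Kahan compensated summation: carry the rounding error of each add into the next one."""
    total = 0.0
    comp = 0.0                      # running estimate of the low-order bits lost so far
    for v in values:
        y = v - comp                # re-inject the previously lost bits
        t = total + y               # low-order bits of y may be rounded away here
        comp = (t - total) - y      # recover what was rounded away
        total = t
    return total

# One large term followed by many terms that are each below half an ulp of it.
values = [1.0] + [1e-16] * 100_000
lost = naive_sum(values)            # the small terms are dropped one by one
kept = kahan_sum(values)            # the small terms are accumulated in comp
```

In this toy case the naive loop never moves off 1.0, while the compensated loop recovers the contribution of the small terms; the same effect is what the card relies on when summing many small squared residuals into one scalar.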
Combines: B-003.card-5 B-003.card-3
Novelty: Card-5 is correct about descriptor choice but silent on the variance-denominator numerical regime; card-3 is correct about fused/tiled/Kahan but applied to plain MIND, missing the SSC accuracy lift. The coupling neither owns alone: the per-level scale-aware ε is what makes the MIND-SSC 12-channel grouped-conv safe to autograd through — without it, Kahan accuracy is wasted on a NaN-producing div, and without MIND-SSC the fused kernel optimizes the wrong descriptor.
Expected gain: max RMSE 0.80-0.90× oracle on fixtures 1-4, 0.70-0.85× on fixture 5 (MIND's bias-invariance dominates Mattes' bias correction). Wall-time 2.5-3× faster than naive MIND-SSC from kernel fusion. Zero NaN events in 60-case battery (card-7's MIND-degeneracy fixtures pass by construction).
Effort: 3.5 days (3d from card-5 + 0.5d to lift card-3's fused/tiled/Kahan layer in; the numerical work is mostly drop-in once interfaces match).
Papers (4)
- Heinrich MICCAI 2012 (MIND)
- Heinrich TMI 2013 (MIND-SSC, Fig.2)
- Baker & Matthews IJCV 2004 (IC-LK)
- Higham 2002 (Kahan summation accuracy)
B-003.card-5 (#2 rank)
Soyoung Choi
Implement MIND-SSC with vectorized unfold-based conv, epsilon-floored variance, and per-channel SSD that plugs into M-002's IC-LK Hessian via chain rule on steepest-descent images.
Modality-invariant local self-similarity descriptors as the only honest way to align T1 to T2 without learning, while remaining fully autograd-friendly and IC-LK-compatible.
Build a single mind_ssc(I, patch_radius=1, neigh_radius=2, sigma=None) -> Tensor[B,12,H,W] op. We drop the full 36-dim descriptor to a symmetric 12-edge SSC graph following Heinrich 2013 TMI Fig.2 (the 6 face-neighbors plus 6 symmetric cross-terms), which captures ~85% of full SSC accuracy at 2× memory instead of 6×. Internals:
(1) precompute K offset vectors as a static int tensor;
(2) compute patch-SSD via F.conv2d with a (2P+1)² box kernel applied to (I − shift(I, off_k))², one conv per edge, batched across K with grouped conv so it's a single kernel launch;
(3) variance V = mean_k D_k + eps, with eps = 1e-5 * I.var() (scale-aware, not a constant 1e-6, because flat ventricles will otherwise wipe gradients);
(4) MIND_k = exp(-D_k / V) clamped to [1e-8, 1] before any downstream div.
For similarity: SSD between MIND(I_fixed) and MIND(warp(I_moving, phi)) summed over channels, divided by K. This is equivalent to Heinrich's L1 form up to a constant Jacobian, but L2 plays much nicer with LM's Gauss-Newton Hessian. IC-LK integration is the load-bearing piece: in M-002 the steepest-descent images are ∇I_template · ∂W/∂p; here we replace ∇I_template with ∇MIND(I_template) computed once at the reference frame (that's the whole point of inverse-compositional: descriptor and its spatial grad are precomputed, never recomputed inside the iteration). Spatial gradient of MIND is autograd-safe because exp(-D/V) is C^∞ and we never hit the singular V=0 branch thanks to the eps floor. The Hessian H = JᵀJ then assembles per-channel and sums: 12 channels means 12× the FLOPs of mono SSD, but the JᵀJ structure (6×6 for affine, 6×6 per control point for FFD via locality) is unchanged, so M-002's mixed-precision LM solver drops in unmodified. Pyramid: build MIND at each level independently (do NOT downsample MIND; downsample I then recompute, otherwise neighborhood semantics break).
Use 4-level Gaussian pyramid, sigma=1.0 per level — basin of attraction for MIND is roughly 1/3 of intensity-SSD's, so coarse levels carry more weight than in M-002. For DEQ inner loop on FFD, fixed-point operator is identical to M-002 (preconditioned gradient step on control points); only the residual function swaps. Validation against oracle SITK Mattes-MI BSpline uses M-002's image-RMSE-on-warped-grid protocol — fair because both methods produce a phi, and RMSE on the resampled moving image is metric-agnostic.
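The four internals of the op described above can be sketched in 2D as follows. This is an illustration under stated assumptions, not the card's implementation: the 12 offsets in `OFFSETS` are placeholders (the card does not list them), replicate padding at the borders and the absolute floor `min_var` are choices made here, and the card's `sigma` argument is omitted.

```python
import torch
import torch.nn.functional as F

# Placeholder 2D offsets (dy, dx) within neigh_radius=2; NOT Heinrich's SSC graph.
OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1),
           (-2, 0), (2, 0), (0, -2), (0, 2),
           (-2, -2), (-2, 2), (2, -2), (2, 2)]

def mind_ssc_sketch(img, patch_radius=1, neigh_radius=2, eps_scale=1e-5, min_var=1e-12):
    """img: (B, 1, H, W) float tensor -> (B, 12, H, W) self-similarity descriptor."""
    _, _, H, W = img.shape
    r, K = neigh_radius, len(OFFSETS)
    padded = F.pad(img, (r, r, r, r), mode="replicate")
    # (I - shift(I, off_k))^2, one channel per offset; each channel gets a different shift.
    sq = torch.cat(
        [(img - padded[..., r + dy:r + dy + H, r + dx:r + dx + W]) ** 2 for dy, dx in OFFSETS],
        dim=1,
    )
    # Box-filter every channel independently in one call: weight (K, 1, k, k), groups=K.
    k = 2 * patch_radius + 1
    weight = torch.full((K, 1, k, k), 1.0 / (k * k), dtype=img.dtype, device=img.device)
    D = F.conv2d(F.pad(sq, (patch_radius,) * 4, mode="replicate"), weight, groups=K)
    # Scale-aware floor, then clamp V BEFORE dividing (min_var is an assumption for flat images).
    V = (D.mean(dim=1, keepdim=True) + eps_scale * img.var()).clamp_min(min_var)
    return torch.exp(-D / V).clamp(1e-8, 1.0)
```

For the pyramid, this op would be called on each downsampled image rather than on a downsampled descriptor, in line with the note above.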
Expected gain: Heinrich 2012 Table 2 reports MIND beats MI by ~22% TRE on inter-modal CT-MR (4.12mm → 3.20mm); MIND-SSC adds another ~10% (Heinrich 2013 TMI Table II, 3.20 → 2.86). On our T1↔T2 with bias field + 6px FFD (fixture 5) I expect our_rmse ≈ 0.85-0.95x oracle_rmse, comfortably inside the 1.15x pass band. Fixtures 1-2 should be near-tied with oracle; fixture 5 is where MIND-SSC pulls ahead.
Effort: Day 1: mind_ssc op + unit tests (offset correctness, eps behavior on flat region, autograd gradcheck at float64). Day 2: pyramid integration + IC-LK steepest-descent images for affine. Day 3: FFD + bending energy hookup, oracle harness, all 5 fixtures green. Day 4 buffer: tune eps scale, MIND-SSC vs MIND ablation table for the writeup. Total 3 days + 1 buffer.
Papers (4)
- Heinrich et al. MIND, MedIA 2012 §3.1
- Heinrich et al. MIND-SSC, IEEE TMI 2013
- Heinrich Foveated MIND, Sensors 2019
- Baker & Matthews IC-LK, IJCV 2004 §3 (for SD-image derivation)
Domain notes
Three traps that will silently fail CI: (1) torch.exp(-D/V) with V→0 produces inf, then NaN in grad even with later clamping — clamp V BEFORE the div, not after exp; gradcheck won't catch this because it perturbs around non-flat points. (2) grouped conv for K=12 edges needs weight shape (12,1,kH,kW) with groups=12 and input repeated to 12 channels — easy to off-by-one and end up computing K identical SSDs. (3) align_corners=True must match M-002 grid_sample exactly; if MIND is computed on a grid offset by 0.5px from where warp samples, you get a constant ~0.3px bias that masquerades as a regularizer tuning problem for a full day. Also: do NOT add MIND-SSC and Foveated MIND simultaneously in v1 — Foveated is +15% but adds a non-local sum that breaks the unfold vectorization; defer to a follow-up card if M-004 has slack.
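Trap (1) can be reproduced on two scalars. The sketch below assumes nothing beyond stock PyTorch autograd; `mind_response` is a hypothetical helper and the floor value 1e-6 is arbitrary.

```python
import torch

def mind_response(D, V, eps, clamp_first):
    """exp(-D/V) with the variance floor applied either before the div or after the exp."""
    if clamp_first:
        return torch.exp(-D / V.clamp_min(eps))
    return torch.exp(-D / V).clamp(1e-8, 1.0)

D = torch.tensor([1.0, 1.0], dtype=torch.float64)
V_late = torch.tensor([0.0, 1.0], dtype=torch.float64, requires_grad=True)   # first entry is "flat"
V_early = V_late.detach().clone().requires_grad_(True)

out_late = mind_response(D, V_late, 1e-6, clamp_first=False)
out_late.sum().backward()     # forward values are finite, the gradient at V=0 is not
mind_response(D, V_early, 1e-6, clamp_first=True).sum().backward()
```

The late clamp leaves a finite forward value, which is why the failure is easy to miss: the zero upstream gradient is multiplied by an infinite local derivative of the division, and the result is NaN.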
B-003.card-3 (#3 rank)
Seungwoo Yoo
Pick MIND, fuse its 6-channel descriptor into a streaming separable kernel with float32 forward + float64 Cholesky carryover from M-002, and gate Mattes MI behind an underflow-safe soft-histogram only as oracle parity check.
Inner-loop numerical precision, fused descriptor kernels, and EfficientViT-style memory-traffic accounting for multi-modal similarity.
From a numerical-precision lens, Mattes MI is the trap I want to avoid as the primary metric: Parzen-windowed soft histograms with ~64 bins over 256² voxels produce per-bin counts in the 1e-3 to 1e-6 range, and the joint log(p_xy / p_x p_y) term amplifies any float32 underflow into NaN gradients, exactly the failure mode I burned three days on at Tesla with INT8-quantized depth heads. MIND (Heinrich 2012) is the cleaner autograd citizen: descriptor d_i(x) = exp(-|patch_i - patch_center|² / V(x)) is a sequence of separable 1D convs, pointwise square, local mean, exp, and a final L1/L2 normalization, and every op has a stable Jacobian if we floor V(x) with epsilon=1e-5 * mean(V). Concretely I propose: (1) Compute the 6-channel MIND descriptor once per pyramid level in float32 using a fused separable kernel: 5×5 Gaussian patch as two 1D convs (5+5 = 10 vs 25 mults per pixel, a 2.5× FLOP reduction), square-and-sum via einsum, then a single exp call. (2) Stream the descriptor: do NOT materialize the full 256²×6 tensor (1.5 MB per image, 3 MB pair, blows L2). Instead tile into 32×32 spatial blocks with halo=2 for the conv, compute SSD between fixed/moving descriptors block-wise, accumulate into a single scalar loss with Kahan compensated summation (reuse the pattern from B-002.card-3 that became card-8's winner). (3) Keep the M-002 mixed-precision LM exactly as-is: float32 residual/Jacobian forward, float64 normal-equation JᵀJ + λ diag(JᵀJ), Cholesky in float64. The MIND loss is a differentiable scalar, so it slots directly into the existing IC-LK update without metric-specific reformulation. (4) For the bias+FFD fixture (#5) MIND is intrinsically bias-invariant because exp(-D/V) is contrast-normalized per-voxel. This is the killer feature vs MI, which needs bias-field pre-correction or N4 to hit oracle parity.
(5) Mattes MI implementation deferred to a 50-LOC sanity oracle (NOT the optimization metric) just to confirm our MIND-driven warps achieve RMSE within the 1.15× envelope on fixtures 1-4 where MI is the SITK reference.
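Step (1)'s separable trick can be checked numerically. The sketch below uses two ordinary `F.conv2d` calls rather than a fused kernel, with a 5-tap Gaussian at sigma=1.0 chosen only for illustration; all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def gaussian_1d(size=5, sigma=1.0, dtype=torch.float64):
    """Normalized 1D Gaussian taps centered on the middle sample."""
    x = torch.arange(size, dtype=dtype) - (size - 1) / 2
    g = torch.exp(-0.5 * (x / sigma) ** 2)
    return g / g.sum()

def smooth_dense(img, g):
    """One 2D conv with the outer-product kernel: size*size multiplies per output pixel."""
    n = g.numel()
    return F.conv2d(img, torch.outer(g, g).view(1, 1, n, n), padding=n // 2)

def smooth_separable(img, g):
    """Vertical pass then horizontal pass: size+size multiplies per output pixel."""
    n = g.numel()
    tmp = F.conv2d(img, g.view(1, 1, n, 1), padding=(n // 2, 0))
    return F.conv2d(tmp, g.view(1, 1, 1, n), padding=(0, n // 2))

img = torch.rand(1, 1, 32, 32, dtype=torch.float64)
g = gaussian_1d()
max_diff = (smooth_dense(img, g) - smooth_separable(img, g)).abs().max().item()
mults_dense, mults_separable = g.numel() ** 2, 2 * g.numel()   # 25 and 10
```

The two paths agree to rounding error at float64, and the multiply count per pixel drops from 25 to 10.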
Expected gain: MIND ~250 LOC vs MI ~400 LOC (38% less code, fewer numerical edge cases). Fused separable kernel: 2.5× FLOP reduction on patch SSD (10 vs 25 mults per pixel), ~3× wall-time on CPU vs naive 5×5 conv. Tiled streaming: peak memory 32²×6×4B = 24 KB working set vs 1.5 MB dense (fits L1, ~2× speedup from cache). Bias+FFD fixture (#5): expect RMSE ≤ 0.7× oracle MI without any bias correction (MIND's structural advantage). Autograd 5/5 finite at epsilon floor 1e-5.
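The working-set arithmetic above can be reproduced directly. `tile_bytes` is a hypothetical helper; the halo and 12-channel variants are included only to show how the figure moves, and none of this models real cache behavior.

```python
def tile_bytes(tile=32, halo=0, channels=6, bytes_per_value=4):
    """Bytes needed to hold one descriptor tile, with the halo added on every side."""
    side = tile + 2 * halo
    return side * side * channels * bytes_per_value

dense_bytes = 256 * 256 * 6 * 4              # full 256x256, 6-channel float32 descriptor
card3_tile = tile_bytes()                    # 32*32*6*4, the 24 KB figure
card3_with_halo = tile_bytes(halo=2)         # 36*36*6*4
twelve_channel = tile_bytes(channels=12)     # same tile with a 12-channel descriptor
```

The 24 KB figure covers the 6-channel tile without its halo; adding the halo or doubling the channel count pushes it to roughly 30 KB and 48 KB respectively.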
Effort: 0.5d MIND fused kernel + separable conv, 0.5d tiled streaming + Kahan accumulator, 0.5d epsilon-floor sweep on V(x) across 5 fixtures, 0.5d MI sanity-oracle (50 LOC), 0.5d integration into M-002 LM + DEQ + pyramid. Total 2.5 days, within the M-002 budget reuse envelope.
Papers (5)
- Heinrich et al., 'MIND', MedIA 2012
- Mattes et al., 'Parzen MI', TMI 2003
- Liu et al., 'EfficientViT', CVPR 2023 — memory-traffic accounting
- Yoo et al., 'Q-ViT', ICCV 2021 — mixed-precision calibration patterns
- Frantar et al., 'GPTQ', ICLR 2023 — Cholesky stability under low-precision Hessians
Domain notes
Epsilon floor on V(x) is the single most important hyperparameter — set it to 1e-5 * mean(V) per pyramid level, NOT a global constant, because V scales with intensity range and pyramid-level smoothing changes V by ~4× per octave. Verify gradient finite-ness by torch.autograd.gradcheck on a 16×16 toy at float64 before trusting float32 production path. Do NOT add INT8 anywhere — autograd graph stays float; my Tesla-era quantization instincts are an anti-pattern here. DEQ fixed-point: MIND's local-normalization makes the Jacobian better-conditioned than MI (no log-singularities), so DEQ Anderson acceleration should converge in ~30% fewer iterations on fixture #5.
B-003.card-10 (#4 rank, cross-pollination)
director-jiwoo-han
MS-MIND (4-level σ=1,2,4,8, weighted 0.4/0.3/0.2/0.1) with Mattes MI gating term (weight 0.1) at the coarsest 2 levels only, dropped if MS-MIND alone clears the 1.15× gate.
Coarse-to-fine basin selection done right — multi-scale descriptor pyramid + sparse coarse-level MI gating is a strictly better basin-finder than either alone.
Card-6 (MS-MIND coarse-to-fine weighted SSD) and card-4 (MIND-SSC + Huber + MI gating at coarse levels for rot15° basin selection) agree on the structural insight that fine-level MIND is correct but coarse-level basin selection is where multi-modal IR breaks. Card-6 widens the basin via Gaussian-pyramid descriptor coarsening; card-4 widens it via a non-local dense MI term used ONLY where the dense Hessian cost is amortized (small images at coarse levels). The combination: build MS-MIND per card-6 (rebuilding MIND at each pyramid level, NOT downsampling MIND — semantics break), add Mattes MI at weight 0.1 at levels 3-4 only per card-4, drop the MI gating if 5/5 fixtures pass without it per card-4's honest caveat. This is the cheapest insurance against the rot15° local-min that pure MS-MIND papers acknowledge (Heinrich's TRE table shows MIND struggles at >12° rotation init). Card-4's Huber stays per-level with δ = 1.345·MAD.
Combines: B-003.card-6 B-003.card-4
Novelty: Card-6 lacks an answer for the rot15° basin (V-JEPA-style coarse weighting widens but does not guarantee correct basin); card-4 uses MI gating but at single-scale MIND-SSC, missing the multi-scale SSD weighting that drives card-6's iteration-count win. Coupling: MS-MIND provides the smooth coarse landscape on which the MI gating term needs only weight 0.1 (not card-4's higher implied weight) to be decisive, and the optional-drop discipline is the honest fallback neither parent specifies cleanly.
Expected gain: max RMSE 0.85-0.95× oracle on fixtures 1-4; 1.00-1.10× on fixture 5; 30-40% fewer LM iterations from coarse-to-fine basin widening (card-6); recovery from rot15° init improves from ~75% (pure MIND) to >95% with MI gating (card-4 anchor).
Effort: 3 days (2.5d card-6 + 0.5d to bolt on card-4's coarse-level MI gating, which is ~80 LOC of an already-built MI residual).
Papers (3)
- Heinrich MICCAI 2012 (MIND multi-scale)
- Mattes TMI 2003 (B-spline MI parzen)
- Studholme PR 1999 (multi-resolution MI registration)
B-003.card-4 (#5 rank)
Jaehyun Lee
Drop MI as the inner loss and run MIND-SSC inside a coarse-to-fine IC-LK with a per-voxel Huber kernel; keep Mattes MI only as a global gating term on the coarsest pyramid level to escape T1↔T2 local optima.
Multi-modal IR as a coarse-to-fine factor graph: deformation params are latent nodes, descriptor patches are edges, and the modality choice is a per-edge robust kernel — exactly how LIVO-SLAM fuses LiDAR/visual/inertial.
From SLAM I have a strong prior that Mutual Information is the wrong primary cost for an iterative least-squares solver: Parzen/Mattes MI is non-local, non-decomposable per-voxel, and its Hessian is dense and poorly conditioned, which is exactly why LIVO-style systems use point-to-plane / photometric residuals as the inner cost and reserve information-theoretic terms for loop-closure scoring. Translating that to M-004: use MIND-SSC (Heinrich 2013) as the residual because it is (a) modality-invariant by construction via self-similarity, (b) decomposable per-voxel so it plugs straight into M-002's IC-LK + mixed-precision LM with a sparse block-diagonal Jacobian, and (c) differentiable in PyTorch via local patch SSDs. DEQ implicit-diff still works because the fixed point is over the warp, not the descriptor. Build a 4-level Gaussian pyramid (σ = 4,2,1,0.5 voxels); at the coarsest two levels add a Mattes-MI gating term (low weight, ~0.1) purely as a basin selector for the +rot15° fixture where MIND is locally ambiguous around symmetric ventricles. This mirrors how VINS-Fusion uses IMU pre-integration as a coarse prior before visual BA refines. Wrap every MIND residual in a Huber kernel (δ ≈ 1.345·MAD of residuals per level) so that T1↔T2 tissue mis-correspondences (e.g. CSF appearing bright in T2 but dark in T1, fat in the bias-field fixture) are down-weighted rather than poisoning the Gauss-Newton step; this is the same Cauchy/Huber trick we use for moving objects in visual SLAM. For the FFD fixture, treat the B-spline control grid as a sparse pose graph: each control point is a node, its 4³ support region defines edges to data factors, and the bending-energy regularizer becomes a between-factor with isotropic Gaussian prior. This gives a banded Hessian that LM solves in O(N) with M-002's existing Cholesky path.
Per-fixture metric weighting (the multi-sensor covariance analogue): estimate residual variance σ² per level after one warm-up iteration and rescale — identity/+trans fixtures will collapse σ fast and let MIND dominate; +bias+FFD will keep MI weight non-trivial longer. Honest caveat: if MIND-SSC alone passes 5/5, drop the MI gating entirely — extra metrics are extra failure modes, and SLAM taught me that the simplest residual that converges is the right one.
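The per-level Huber kernel can be written as IRLS weights. This is a minimal sketch that takes δ = 1.345·MAD literally as the card writes it (many references scale MAD by 1.4826 first); the function names and the sample residuals are made up for illustration.

```python
import statistics

def mad(residuals):
    """Median absolute deviation about the median."""
    med = statistics.median(residuals)
    return statistics.median([abs(r - med) for r in residuals])

def huber_weights(residuals, k=1.345):
    """IRLS weights for a Huber kernel: 1 inside delta, delta/|r| outside."""
    delta = k * mad(residuals)
    if delta == 0.0:
        return [1.0] * len(residuals)        # degenerate level: fall back to plain L2
    return [1.0 if abs(r) <= delta else delta / abs(r) for r in residuals]

residuals = [-1.0, -0.5, 0.0, 0.5, 1.0, 10.0]    # made-up values with one gross outlier
weights = huber_weights(residuals)
```

Inliers keep full weight while the outlier is scaled down in proportion to its size, so it still contributes a bounded pull on the Gauss-Newton step.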
Expected gain: Expect max RMSE ≈ 0.85-0.95× oracle on identity/+trans/+rot, ≈ 1.05-1.10× on +FFD and +bias+FFD (well within 1.15× budget). MIND-SSC alone should give ~2-3× faster convergence than Mattes MI in the inner loop (sparse vs dense Hessian) and remove the ~20% local-minimum failure rate I'd predict for pure MI on +rot15°. Autograd 5/5 is straightforward — MIND is just local SSDs.
Effort: 2.5-3 days: 0.5d MIND-SSC implementation (6-neighborhood SSC patches, vectorized), 0.5d pyramid + Huber wiring into M-002's LM, 1d FFD-as-factor-graph regularizer with banded solve, 0.5d MI gating term (reuse oracle's Parzen if exposed, else 64-bin soft-histogram), 0.5d fixture sweep + ablation (MIND-only vs MIND+MI-gate vs MI-only) to justify the final config.
Papers (7)
- Heinrich et al. 2012 (MIND, MedIA)
- Heinrich et al. 2013 (MIND-SSC, MICCAI)
- Rueckert et al. 1999 (FFD B-spline IR)
- Mattes et al. 2003 (PET-CT MI)
- Triggs et al. 1999 (Bundle Adjustment — robust kernels)
- Qin et al. 2018 (VINS-Fusion — tight coupling)
- Shan et al. 2020 (LVI-SAM — multi-modal factor graph)
Domain notes
Three SLAM-specific transfers worth flagging. (1) Coarse-to-fine is not optional for MI — single-resolution Mattes MI on +rot15° will land in a ventricle-symmetry local optimum maybe 30% of the time; this is the direct analogue of why we never do single-scale direct visual odometry. (2) The FFD-as-pose-graph view means the bending-energy regularizer should be tuned by L-curve, not guessed — same protocol as tuning IMU bias random-walk σ in VIO; under-regularize and you get folding (negative Jacobian determinant, the IR equivalent of a non-PSD information matrix). Add a determinant>0 barrier or projected LM step. (3) Per-fixture metric weighting echoes sensor-covariance scheduling in LIO: identity/+trans need almost no MI (MIND is sufficient and faster), +bias+FFD needs MI longer because intensity-derived MIND patches degrade under multiplicative bias — consider a bias-field pre-estimation step (N4-style log-polynomial, ~10 params) as a separate factor before FFD refinement, decoupling intensity from geometry exactly like we decouple gyro bias from pose in pre-integration. Honest concession to the brief: MI is not used in modern SLAM front-ends for a reason; if M-002's DEQ implicit-diff struggles with MI's dense Hessian, MIND-only is the correct fallback and I would defend that choice over a clever-but-fragile MI scheme.
B-003.card-6 (#6 rank)
Donghyun Park
4-level Gaussian-pyramid MIND with per-scale self-similarity costs, optimized via M-002's IC-LK + mixed-precision LM + DEQ — the closest training-free analogue of an SSL-shaped feature space for T1↔T2 IR.
SSL teaches that the metric IS the representation; with no training allowed, my honest contribution is a hand-designed multi-scale self-similarity descriptor (MIND-pyramid) that operationalizes SSL's augmentation-invariance prior classically.
SSL methods (DINO, V-JEPA, MAE) share one structural commitment: build a representation in which two views of the same content collapse to the same code while different content stays apart. Without training, I cannot learn such a code, but MIND (Heinrich 2012) is its purest hand-designed shadow — at each voxel, MIND encodes the patch's self-similarity to its neighborhood, which is intrinsically modality-invariant because a T1 patch and a T2 patch of the same anatomy share local structure even when intensities invert. My M-004 proposal is a multi-scale MIND pyramid (MS-MIND), the classical analogue of multi-crop/multi-view consistency. Concretely: (1) build a 4-level Gaussian pyramid (sigma = 1, 2, 4, 8 voxels, downsampled by 2 each level) for both T1 and T2; (2) compute a 6-neighbor MIND descriptor at each level with patch radius 1 and search radius 1, normalized by local variance; (3) define the similarity as a weighted sum of per-level SSD-on-MIND-descriptors, weights (0.4, 0.3, 0.2, 0.1) coarse-to-fine, which mirrors V-JEPA's coarse-context-predicts-fine-target inductive bias; (4) plug this into M-002's shared IC-LK Jacobian machinery — MIND is differentiable w.r.t. the warp, so analytical Jacobians are straightforward; (5) for FFD and bias+FFD fixtures, wrap the cost in M-002's mixed-precision LM and use the DEQ fixed-point step for the deformable level. The contrastive framing is honest but bounded: at test-time, the optimization itself acts like a single-pair InfoNCE — the warp parameter is updated to make matched voxels' MIND descriptors agree (positive) while the MIND construction implicitly suppresses agreement with neighbors (negative, via local normalization). This is test-time optimization on one pair, not training; no weights are learned, satisfying the constraint. 
For the bias+FFD fixture I add a MAE-flavored mask-aware variant: voxels where pyramid coefficients indicate severe bias-field corruption are downweighted and their MIND cost is reconstructed from surrounding context — connects directly to yuna-kang B-002.card-7's occlusion stressor. Against SITK's MattesMI-BSpline oracle, MS-MIND should match or beat single-scale MIND on identity/trans/rot (where MIND already wins on T1↔T2), and meaningfully close the gap on FFD and bias+FFD where global MI loses spatial specificity.
Expected gain: vs single-scale MIND: TRE -15-25% on FFD fixture, -20-30% on bias+FFD; vs SITK MattesMI oracle: within 1.2× TRE on all 5 fixtures (single-scale MIND is ~1.8× on bias+FFD). Convergence: 30-40% fewer LM iterations due to coarse-to-fine basin-of-attraction widening.
Effort: 2.5 days — 0.5d MIND core + pyramid builder, 1d analytical Jacobians and M-002 IC-LK integration, 0.5d mask-aware MAE variant for bias+FFD, 0.5d fixture sweep + oracle comparison.
Papers (6)
- MIND (Heinrich 2012)
- Self-similarity Context (Heinrich 2013 SSC)
- DINO (Caron 2021)
- V-JEPA (Bardes 2024)
- MAE (He 2022)
- Mattes MI (Mattes 2003)
Domain notes
The fit is genuinely awkward: SSL's power comes from learned invariances over millions of images, and MIND is a fixed, hand-crafted prior that cannot adapt to anatomy. I am not claiming MS-MIND IS SSL — I am claiming it is the most defensible classical projection of SSL's structural commitments (multi-view consistency, local-context reconstruction, representation-defines-metric) onto a no-training IR problem. The lens adds value in three concrete places: (a) justifying multi-scale over single-scale MIND on principled grounds rather than empirical hyperparameter sweep, (b) motivating the mask-aware MAE variant for bias+FFD instead of generic robust losses, (c) framing test-time warp optimization as single-pair contrastive inference, which gives a clean story for why MS-MIND outperforms MI on T1↔T2 specifically (MI is a global histogram statistic, MS-MIND is a local representation-space metric — exactly the shift SSL made over classical computer vision). If yuna-kang's mask-aware loss lands, our cards compose multiplicatively on the bias+FFD fixture.
B-003.card-2 (#7 rank)
Hyunsu Kim
MIND-SSD residual + Cauchy IRLS + coarse-to-fine M-estimator schedule: port direct VO's robust LM to multi-modal IR, keep MI as fixture-5 (bias+FFD) fallback only.
Multi-modal IR is isomorphic to direct VO's robust photometric BA — keep the warp Jacobian, swap the 'photometric residual' into a modality-invariant descriptor space, and apply Huber/Cauchy to the long-tailed residuals; this lets us reuse M-002's IC-LK stack almost as-is.
Primary: MIND (Heinrich MICCAI 2012). The self-similarity descriptor d(x) ∈ R^6 (6-neighborhood patch SSD normalized by local variance σ(x)) on both images produces residual r(x;p) = d_T(x) - d_M(W(x;p)) in SSD shape — zero refactor of M-002's IC-LK + mixed-precision LM + DEQ implicit-diff (~250 LOC, autograd natural, Jacobian = ∂d_M/∂x · ∂W/∂p chain-rule clean). My direct-VO contribution layers three things on top: (1) Cauchy IRLS kernel — MIND residual is long-tailed (modality gap + FFD boundary + bias residue), same shape as the association cost distribution we observed in CVPR 2024 LiDAR-camera. w_i = 1/(1+(r_i/c)²), c = 1.4826·MAD(r) per-level. Cauchy is redescending (stronger outlier handling than Huber) — better for the bias-field-heavy fixture-5. (2) Coarse-to-fine M-estimator schedule — pyramid level L uses c_L = c_0·2^L (loose at coarse, tight at fine), the same 'annealed kernel width' trick from our TPAMI 2023 direct BA. Progressively narrows the convex basin to avoid local minima. (3) Mahalanobis-gated trust region — LM update normalized by residual covariance Σ = diag(σ_d²): Δp = -(JᵀWJ + λ·diag(JᵀWJ))⁻¹ JᵀWr, W = diag(w_i/σ_d²(x_i)). Mirrors how LiDAR-camera tracker incorporates measurement uncertainty into gating. Per-fixture forecast: identity/translation/rotation within 1.05× oracle (rigid in descriptor space ≈ SSD); B-spline FFD ~1.10× (bending energy via shared/regularizers); bias+FFD hardest, 1.20× risk. Safety net: for fixture-5 only, add mini-Mattes MI (~120 LOC, 32-bin Parzen) as a fallback IF MIND-only doesn't clear 1.15×. First submission MIND-only.
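The three formulas above (Cauchy weight, annealed width, weighted LM update) can be sketched for a single scalar parameter. Treating level 0 as the finest pyramid level is an assumption made here, and all names and sample numbers are hypothetical.

```python
def cauchy_weight(r, c):
    """w = 1 / (1 + (r/c)^2)."""
    return 1.0 / (1.0 + (r / c) ** 2)

def annealed_width(c0, level):
    """c_L = c_0 * 2**L: loose at coarse levels, tight at fine (level 0 assumed finest)."""
    return c0 * 2 ** level

def weighted_lm_step_1d(jac, res, weights, lam):
    """Scalar case of dp = -(J^T W J + lam * diag(J^T W J))^-1 J^T W r."""
    jtwj = sum(w * j * j for w, j in zip(weights, jac))
    jtwr = sum(w * j * r for w, j, r in zip(weights, jac, res))
    return -jtwr / (jtwj * (1.0 + lam))

jac = [1.0, 1.0, 1.0, 1.0]
res = [0.1, -0.1, 0.2, 5.0]                       # made-up residuals, last one is an outlier
c = annealed_width(0.25, level=0)
robust = [cauchy_weight(r, c) for r in res]
step_l2 = weighted_lm_step_1d(jac, res, [1.0] * 4, lam=0.0)
step_cauchy = weighted_lm_step_1d(jac, res, robust, lam=0.0)
```

With unit weights the outlier dominates the update; with Cauchy weights its contribution is nearly removed, which is the behaviour the card wants on the bias-heavy fixture. In the card the weights are detached from the autograd graph, which this pure-Python sketch does not model.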
Expected gain: max(our_rmse) ≈ 1.08-1.12 × max(oracle_rmse) (Heinrich MICCAI 2012 Table 2: MIND-SSD ~7% better than Mattes MI on inter-modal, parity on intra-modal). Cauchy IRLS expected to drop fixture-5 RMSE additional 15-20% (TPAMI 2023 Fig.8 Cauchy vs L2). Autograd 5/5 — MIND is conv2d + SSD + normalize, all differentiable; IRLS weights detached.
Effort: 4 days total. Day 1: MIND descriptor (6-neighborhood patch SSD + local variance normalize, 90 LOC) + unit test. Day 2: Cauchy IRLS + MAD scale estimator plugged into M-002 LM (60 LOC). Day 3: coarse-to-fine schedule + Mahalanobis weighting + 5-fixture integration (50 LOC + harness). Day 4: bias+FFD debug; optional mini-Mattes MI fallback (~120 LOC buffer).
Papers (4)
- Heinrich et al., 'MIND: Modality independent neighbourhood descriptor', MICCAI 2012 / MedIA 2012
- Mattes et al., 'PET-CT image registration using mutual information', TMI 2003
- Kim et al., 'Direct Bundle Adjustment for MOT', TPAMI 2023 — Cauchy IRLS schedule
- Kim et al., 'Sparse-to-Dense LiDAR-Camera 3D MOT', CVPR 2024 — Mahalanobis gating
Domain notes
Direct VO's long-tailed photometric residual is caused by (a) occlusion, (b) lighting change, (c) non-Lambertian surfaces, which map 1:1 to multi-modal IR's (a) modality gap, (b) bias field, (c) FFD boundary. So the robust kernel recipe ports cleanly. One trap: MIND's σ(x) denominator approaches 0 in homogeneous regions, which is numerically unstable, so an ε=1e-3 floor is required (Heinrich eq.3 footnote). Also: IC-LK precomputes the Jacobian on the reference image, so the MIND descriptor must also be precomputed on the reference, maintaining the inverse-compositional structure. The DEQ implicit-diff layer wraps only the outer optimization; the IRLS inner loop is detached as a fixed point.
B-003.card-9 (#8 rank, cross-pollination)
director-jiwoo-han
MIND-SSD with Cauchy IRLS + annealed kernel width, validated and tuned against the 60-case tri-axial battery with MIND-degeneracy fixtures as hard NaN gates.
Robustness + adversarial eval as a closed loop — the right kernel choice (Cauchy vs Huber) and annealing schedule cannot be picked from a paper, only from the failure modes your battery exposes.
Card-2 is the right robustness recipe (Cauchy redescending beats Huber for the bias-field outlier regime fixture-5 exposes, per Kim TPAMI 2023 Fig.8) but its c_L = c_0·2^L annealing constant is a free parameter that any single fixture can overfit. Card-7's 4-axis Cartesian battery (transform × bias × asymmetric noise × intensity inversion) plus the 3 MIND-degeneracy fixtures (homogeneous WM, single-voxel-variance, constant slab) is the only honest tuning surface. Concretely: implement MIND-SSD + Cauchy IRLS + Mahalanobis-weighted LM, then sweep c_0 ∈ {0.5, 1.0, 2.0} × MAD on the 60-case battery and accept the value that minimizes the worst-case (largest) oracle ratio across all 60 (NOT the mean; the worst case is what the 1.15× gate measures). Card-7's gate 3 (modality-symmetry T1→T2 ∘ T2→T1 ≤ 0.3px) catches Cauchy-vs-Huber asymmetric bias the 5-fixture spec misses entirely. MI-as-Mini-Mattes stays as card-2's fixture-5-only fallback, gated by battery telemetry, not gut feel.
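A hedged sketch of the two pieces this recipe couples: detached Cauchy IRLS weights with the c_L = c_0·2^L width, and a selector that keeps the c_0 whose worst-case oracle ratio is smallest. The function names and the 1.4826 MAD-to-sigma factor are assumptions, not taken from card-2 or M-002.

```python
import torch


def cauchy_irls_weights(residual, c0=1.0, level=0, mad_to_sigma=1.4826):
    """Detached Cauchy weights w = 1 / (1 + (r/c)^2) with c = c0 * 2**level * scale.

    Hypothetical helper; mad_to_sigma assumes roughly Gaussian inlier noise.
    """
    r = residual.detach()  # weights are constants to autograd
    mad = (r - r.median()).abs().median()
    scale = mad_to_sigma * mad.clamp_min(1e-12)  # avoid c = 0 on flat residuals
    c = c0 * (2.0 ** level) * scale
    return 1.0 / (1.0 + (r / c) ** 2)


def select_c0(ratios_by_c0):
    """ratios_by_c0: {c0: [our_rmse / oracle_rmse per case]}.

    Returns the c0 with the smallest worst-case (maximum) ratio.
    """
    return min(ratios_by_c0, key=lambda c0: max(ratios_by_c0[c0]))
```

For example, `select_c0({0.5: [1.0, 1.0, 1.3], 1.0: [1.12, 1.12, 1.12]})` returns `1.0`, even though `0.5` has the lower mean. That difference is the point of selecting on the worst case.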
Combines: B-003.card-2 B-003.card-7
Novelty: Card-2 picks Cauchy from a paper; card-7 builds a battery but specifies no consumer. The coupling: the battery's worst-case-across-60 metric is the ONLY honest selector for the kernel-width annealing constant, and the MIND-degeneracy fixtures are the ONLY NaN gate that catches Cauchy's specific failure mode (redescending → zero weight → singular normal equations on homogeneous regions). Neither parent expresses this — card-2 would silently fail on a homogeneous-WM patient scan in production; card-7 would catch it but have no recipe to fix it.
Expected gain: max RMSE 1.05-1.10× oracle across 60 cases (vs card-2 alone targeting 1.08-1.12× on 5); ~15-20% extra headroom on fixture-5 from Cauchy (anchored to Kim TPAMI 2023 Fig.8); near-zero false-pass rate from card-7's tri-axial gates catching modality-coupling failures that fixture spec misses.
Effort: 5 days (4d card-2 + 2.5d card-7, with 1.5d overlap since the battery is the tuning instrument for the robustness sweep).
Papers (4)
- Heinrich MICCAI 2012 (MIND Table 2)
- Kim TPAMI 2023 (Cauchy IRLS Fig.8)
- Black & Anandan IJCV 1996 (annealed robust estimation)
- Holland & Welsch 1977 (IRLS Cauchy weights)
B-003.card-7 #9 rank
Yuna Kang
Replace the 5-fixture spec with a 60-case multi-modal counterfactual battery stratified by intensity-coupling, bias, modality-asymmetric noise, and MIND-degeneracy traps, gated by tri-axial pass criteria (per-class RMSE + oracle-agreement + modality-symmetry + convergence-recovery).
Eval-design discipline from closed-loop world-model validation: fixtures must adversarially probe modality-coupling failures, not just sample easy transforms.
Build the fixture battery as a 4-axis Cartesian product collapsed to ~60 cases:
(a) Transform class {identity, translation [1,3,8 px], rotation [5,15,25 deg], FFD [low,mid,high control-point density], composite}, 12 levels;
(b) Bias field magnitude {none, weak 10%, moderate 30%, severe 60%} via low-order polynomial multiplicative field, since MI is bias-robust in theory but MIND is locally normalized, and we want to see the divergence empirically;
(c) Modality-asymmetric noise {clean/clean, T1-noisy/T2-clean, T1-clean/T2-noisy, both-noisy} with Rician noise to mimic MR physics;
(d) Pathology-mimicking intensity inversion traps: synthetic CSF-like hyperintense blobs in T2 with matched T1-hypointense counterparts (the genuine T1/T2 inversion that breaks SSD/NCC and stresses MI binning).
Plus 3 dedicated MIND-degeneracy fixtures: homogeneous WM block (flat-region variance epsilon trigger), single-voxel-variance patch, and constant-intensity slab. The agent's MIND must return a finite gradient and degrade to identity-warp behavior, not NaN.
Pass gates are tri-axial:
(1) per-class hard RMSE gate carried over from M-002 (identity <0.1 px, translation <0.5 px, rotation <1.0 px, FFD <1.5 px, composite <2.0 px), max-ratio across classes <3x to prevent one-class hero solutions;
(2) oracle-agreement: for each fixture, both our MIND-SSD pipeline and SITK Mattes-MI-BSpline must agree on RMSE rank ordering across noise levels (Kendall tau >= 0.7). Disagreement fixtures get flagged as 'hard cases' and logged, not failed, because oracle disagreement is signal, not noise;
(3) modality-symmetry: T1->T2 transform composed with T2->T1 transform must yield identity within 0.3 px RMSE on a 10-fixture subset. This catches asymmetric similarity weighting that single-direction tests miss entirely;
(4) convergence-recovery from M-002: perturb the converged warp by {1 px translation, 5 px translation, 0.1 rad rotation} and require re-convergence to within 110% of original RMSE in >=95% of cases.
Report a 4x4 heatmap (transform-class x noise-asymmetry) of MIND-SSD-vs-Mattes-MI RMSE delta so hard regions are visually obvious. Implementation: deterministic seeds, single procedural T1/T2 phantom generator (concentric-ellipsoid white/gray/CSF model with physics-motivated intensity LUTs), fixtures cached as .npz so eval is reproducible and CPU-bounded under 4 min total.
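The bias and noise axes of the battery can be sketched as two seeded corruption functions. This is a minimal illustration in NumPy, chosen because the fixtures are cached as .npz. The function names, the second-degree polynomial basis, and reading "magnitude" as peak deviation from 1 are all assumptions.

```python
import numpy as np


def polynomial_bias(shape, magnitude, rng):
    """Multiplicative low-order polynomial field with values in [1 - magnitude, 1 + magnitude].

    Hypothetical helper: degree-2 basis and peak normalization are assumptions.
    """
    h, w = shape
    y, x = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    terms = np.stack([x, y, x * y, x**2, y**2])
    coeffs = rng.uniform(-1.0, 1.0, size=len(terms))
    field = np.tensordot(coeffs, terms, axes=1)  # (h, w)
    peak = np.abs(field).max()
    if peak > 0:
        field = field / peak  # peak deviation becomes exactly 1
    return 1.0 + magnitude * field


def add_rician_noise(img, sigma, rng):
    """Magnitude of a complex signal with i.i.d. Gaussian noise on both channels."""
    real = img + rng.normal(0.0, sigma, img.shape)
    imag = rng.normal(0.0, sigma, img.shape)
    return np.sqrt(real**2 + imag**2)


# Usage: a "moderate 30%" bias with T1-noisy / T2-clean asymmetry, fixed seed.
rng = np.random.default_rng(0)
t1 = np.full((32, 32), 0.5)
t1_corrupt = add_rician_noise(t1 * polynomial_bias(t1.shape, 0.30, rng), 0.05, rng)
```

Passing the generator explicitly keeps each fixture reproducible from its seed.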
Expected gain: Estimated 4-6x failure-mode coverage vs the 5-fixture spec (60 vs 5 cases, with ~15 cases targeting modality-coupling failures invisible to single-modality eval). Anchored to GAIA-1 ablation methodology where counterfactual stratification surfaced ~3x more failure modes than uniform sampling, and to nuPlan's stratified scenario battery showing rank-correlation gates catch ~40% of silent regressions that aggregate-RMSE misses. MIND-degeneracy fixtures alone are expected to catch 100% of variance-epsilon NaN bugs that would otherwise surface only in clinical edge cases.
Effort: 2.5 days (1 day phantom generator + fixture caching, 1 day tri-axial gate harness + oracle-agreement Kendall-tau scoring, 0.5 day modality-symmetry + convergence-recovery loops).
Papers (6)
- Heinrich et al. MIND 2012 (MICCAI)
- Mattes et al. PET-CT MI 2003 (TMI)
- Rueckert FFD 1999 (TMI)
- Klein et al. elastix evaluation 2009 (TMI)
- GAIA-1 Hu et al. 2023 (ablation methodology only, not generative)
- nuPlan Caesar et al. 2021 (stratified eval methodology)
Domain notes
No generative models, no diffusion, no RSSM — fixtures are 100% procedural via deterministic phantom synthesis. Oracle disagreement is intentionally treated as diagnostic signal, not failure, because the whole point of multi-modal IR is that T1/T2 intensity coupling is non-monotonic and any single similarity metric will have blind spots. Modality-symmetry check is the multi-modal analogue to the closed-loop consistency check in world-model eval — if forward and reverse alignment don't compose to identity, the similarity metric is implicitly biased toward one modality. Convergence-recovery carries over from M-002 unchanged because optimization-landscape robustness is modality-agnostic.
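One way to express the modality-symmetry check is to treat each recovered transform as a callable on pixel coordinates and measure how far the round trip lands from the starting grid. This is a sketch under my own assumptions: the names `symmetry_rmse` and `rigid`, the (x, y) point layout, and RMSE taken as the root of the mean squared Euclidean displacement.

```python
import math

import torch


def symmetry_rmse(forward, backward, height, width):
    """RMSE (in px) between x and backward(forward(x)) over the pixel grid.

    forward / backward: callables mapping (N,2) float64 points (x, y) to (N,2).
    """
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float64),
        torch.arange(width, dtype=torch.float64),
        indexing="ij",
    )
    pts = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=1)
    round_trip = backward(forward(pts))
    return torch.sqrt(((round_trip - pts) ** 2).sum(dim=1).mean())


def rigid(theta, tx, ty):
    """Hypothetical stand-in for a recovered transform: rotate about the origin, then shift."""
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s], [s, c]], dtype=torch.float64)
    shift = torch.tensor([tx, ty], dtype=torch.float64)
    return lambda p: p @ rot.T + shift


# Usage: a reverse transform that is off by 0.5 px in y fails a 0.3 px gate.
err = symmetry_rmse(rigid(0.0, 2.0, -1.0), rigid(0.0, -2.0, 1.5), 64, 64)
passes_gate = bool(err <= 0.3)
```

An FFD warp fits the same interface as long as it is wrapped as a function from points to points.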
B-003.card-1 #10 rank
Minseo Park
Per-pair SIREN-fitted Structure Tensor Field (STF) as a differentiable modality-invariant descriptor, replacing MIND's hand-crafted patch self-similarity with a continuous coordinate-MLP fit to the local 2nd-moment matrix.
Coordinate-MLP as a learned-on-the-fly modality-invariant feature field: fit a tiny SIREN per image pair (no pretraining, no labels) whose latent code at coordinate x captures local geometric structure (gradient orientation manifold) rather than intensity, then register by SSD on these latents.
MIND (Heinrich MedIA 2012) and the broader self-similarity family (Shechtman & Irani CVPR 2007) work because local geometric structure (edges, ridges, junctions) is preserved across T1/T2 even when the intensity mapping inverts. The structure tensor S(x)=∇I∇Iᵀ smoothed by Gaussian G_σ is the classical continuous form of this signal (Förstner 1986; Knutsson 1989), and its eigen-decomposition gives a 3-DOF rotation-equivariant local descriptor (λ₁, λ₂, θ) that is intensity-bias invariant after normalization. My proposal: for each fixed/moving pair, fit two small SIRENs f_φ, f_ψ : R² → R⁵ (Sitzmann NeurIPS 2020, ω₀=30) for ~200 Adam steps each, supervised to reproduce a normalized structure-tensor field computed once on each image: specifically the unit-norm (λ₁−λ₂)/(λ₁+λ₂+ε) anisotropy, sin2θ, cos2θ, plus a coherence-gated mask channel. The SIREN acts as a band-limited continuous interpolator (better than bilinear under warping; this is exactly the differentiable-warp argument from NeRF/Mildenhall ECCV 2020 §4 and the implicit-grid claim of my own Neural SDF ECCV 2022 §3.2), so when M-002's IC-LK pyramid warps coordinates x → x+u(x), we evaluate f_φ and f_ψ at the warped coords and minimize SSD on the 5-vector. Bias field becomes near-invisible because the tensor is normalized; B-spline FFD is handled by the existing Jacobian path. Implicit-diff in the DEQ wrapping (M-002) is unaffected: φ and ψ are pre-fit once per pair, and both SIRENs stay frozen during LM iterations. Crucially this is NOT learned weights in the forbidden sense: there is no training set, no pretrained checkpoint, no labels, just per-pair overfitting of two MLPs to a closed-form classical feature, taking ~3-5s CPU at 128×128. Compared to MIND's discrete 6-neighborhood patch SSD, STF gives sub-pixel smooth gradients (SIREN's analytic ∂f/∂x is well-defined and bounded, unlike MIND's quantized patch shifts which Heinrich §3.4 admits cause LM stalls).
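The closed-form target the SIRENs would be fit to can be sketched directly in PyTorch. This is an illustration under assumptions, not the card's implementation: Sobel gradients, a separable Gaussian with σ=1.5, and the small `tiny` term inside the square root are my choices, and the coherence-gated mask channel is omitted. The eigenvalue difference and sum come from the 2×2 tensor entries, so no explicit eigen-decomposition is needed.

```python
import torch
import torch.nn.functional as F


def structure_tensor_features(img, sigma=1.5, eps=1e-6, tiny=1e-24):
    """img: (B,1,H,W) -> (B,3,H,W): anisotropy, sin(2θ), cos(2θ).

    Hypothetical helper; filter choices and `tiny` are assumptions.
    """
    dt, dev = img.dtype, img.device
    sobel = torch.tensor([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]],
                         dtype=dt, device=dev) / 8.0
    kx = sobel.view(1, 1, 3, 3)
    ky = sobel.t().contiguous().view(1, 1, 3, 3)
    padded = F.pad(img, (1, 1, 1, 1), mode="replicate")
    ix, iy = F.conv2d(padded, kx), F.conv2d(padded, ky)

    r = int(3 * sigma)
    t = torch.arange(-r, r + 1, dtype=dt, device=dev)
    g = torch.exp(-t**2 / (2 * sigma**2))
    g = g / g.sum()

    def smooth(a):
        # Separable Gaussian: horizontal pass, then vertical pass.
        a = F.pad(a, (r, r, r, r), mode="replicate")
        a = F.conv2d(a, g.view(1, 1, 1, -1))
        return F.conv2d(a, g.view(1, 1, -1, 1))

    jxx, jxy, jyy = smooth(ix * ix), smooth(ix * iy), smooth(iy * iy)
    # λ1 - λ2 of the smoothed 2x2 tensor; `tiny` keeps sqrt differentiable at 0.
    diff = torch.sqrt((jxx - jyy) ** 2 + 4 * jxy**2 + tiny)
    trace = jxx + jyy  # λ1 + λ2
    aniso = diff / (trace + eps)
    return torch.cat([aniso, 2 * jxy / diff, (jxx - jyy) / diff], dim=1)
```

Scaling the image by a constant k scales every tensor entry by k², so all three channels are unchanged up to the ε terms. This is the bias-invariance the card relies on, here shown only for a global gain.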
Expected gain: On fixtures 1-3 expect parity with MIND (~0.8-1.0 px RMSE, oracle Mattes ~0.6 px). On fixture 4 (FFD) expect 10-20% improvement over discrete MIND from sub-pixel-smooth gradients (cf. SIREN's order-of-magnitude derivative-fidelity gain over ReLU MLPs, Sitzmann NeurIPS 2020 Fig. 3). On fixture 5 (bias + FFD) tensor normalization should match MIND within noise. Honest: unlikely to beat raw MIND on fixtures 1-2 where MIND is already at oracle floor.
Effort: Day 1: structure-tensor closed form + 5-channel target, sanity check on fixture 1. Day 2: SIREN fit loop (reuse M-002 LM scaffold, ~80 LOC), per-pair caching, coord-grid eval. Day 3: wire into IC-LK similarity slot, autograd finite-grad check, all 5 fixtures + tolerance gate. Day 4 buffer: ω₀ / σ sweep if fixture 4 fails the 15% bound. Total ~350 LOC, mid-range between MIND (~250) and MI (~400).
Papers (6)
- Heinrich et al., 'MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration', MedIA 2012, §2-3 (baseline self-similarity descriptor, what we replace with continuous form)
- Sitzmann et al., 'Implicit Neural Representations with Periodic Activation Functions (SIREN)', NeurIPS 2020, §3 + Fig.3 (band-limited fit, analytic derivatives — justifies sub-pixel gradient claim)
- Förstner & Gülch, 'A fast operator for detection and precise location of distinct points, corners and centres of circular features', ISPRS 1987 (structure tensor as rotation-equivariant local descriptor)
- Mildenhall et al., 'NeRF', ECCV 2020, §4 (differentiable continuous warping of coordinate fields — same evaluation pattern at warped x)
- Park (self), 'Neural SDF', ECCV 2022, §3.2 (per-shape overfit of small coord-MLP yields smoother gradients than discrete grids for LM-style optimization — directly transferable scaffolding)
- Shechtman & Irani, 'Matching local self-similarities across images and videos', CVPR 2007 (theoretical basis for why self-similarity transfers across modalities)
Domain notes
Fit honesty: my lens (implicit reps) genuinely adds something here only because of the sub-pixel derivative argument — if the evaluator decides FFD fixture isn't gradient-limited, this card collapses to 'a fancier MIND' and we should just ship MIND (acknowledging B-002.card-6/7 pattern). Risk 1: SIREN ω₀ tuning on CPU may be flaky for sparse-structure regions (ventricles); coherence mask should gate these. Risk 2: 200-step pre-fit cost (~3-5s) may dominate the registration budget — if so, drop SIREN and evaluate the structure tensor directly on a bilinear-upsampled grid, losing the sub-pixel claim but keeping bias-invariance. Risk 3: structure tensor is rotation-equivariant not invariant; for fixture 3 (15° rotation) the (sin2θ, cos2θ) channels co-rotate with the warp, which is correct under IC-LK's coordinate Jacobian but easy to get sign-wrong — needs an explicit unit test. Not in scope: 3D / volumetric (fixtures are 2D per spec), no 3DGS connection (overkill for 2D MRI slices).
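For reference, the per-pair pre-fit discussed in this card could look roughly like the sketch below. It uses the sine-layer initialization from the SIREN paper with ω₀=30 and ~200 Adam steps as stated above. The layer sizes, the learning rate, the default-initialized final linear layer, and all names are my assumptions.

```python
import math

import torch
import torch.nn as nn


class SineLayer(nn.Module):
    """Linear layer followed by sin(omega0 * x), with SIREN-style weight init."""

    def __init__(self, in_f, out_f, omega0=30.0, first=False):
        super().__init__()
        self.omega0 = omega0
        self.linear = nn.Linear(in_f, out_f)
        bound = 1.0 / in_f if first else math.sqrt(6.0 / in_f) / omega0
        with torch.no_grad():
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega0 * self.linear(x))


def make_siren(out_dim=5, hidden=64, depth=3, omega0=30.0):
    """Hypothetical R^2 -> R^out_dim coordinate MLP (sizes are assumptions)."""
    layers = [SineLayer(2, hidden, omega0, first=True)]
    layers += [SineLayer(hidden, hidden, omega0) for _ in range(depth - 1)]
    layers.append(nn.Linear(hidden, out_dim))  # default init, a simplification
    return nn.Sequential(*layers)


def fit_siren(model, coords, target, steps=200, lr=1e-4):
    """Overfit the model to one feature field, then freeze its parameters."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()
        loss.backward()
        opt.step()
    for p in model.parameters():
        p.requires_grad_(False)  # frozen during registration
    return model
```

After `fit_siren` returns, gradients still flow to the query coordinates, which is what the registration loop needs when it evaluates the field at warped positions.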

All cards (ranked)

1. MIND-SSC-12 with fused separable kernels + per-level scale-aware ε + Kahan-accumulated patch-SSD, IC-LK gradients precomputed once on SD-image template.
2. Implement MIND-SSC with vectorized unfold-based conv, epsilon-floored variance, and per-channel SSD that plugs into M-002's IC-LK Hessian via chain rule on steepest-descent images.
3. Pick MIND, fuse its 6-channel descriptor into a streaming separable kernel with float32 forward + float64 Cholesky carryover from M-002, and gate Mattes MI behind an underflow-safe soft-histogram only as oracle parity check.
4. MS-MIND (4-level σ=1,2,4,8, weighted 0.4/0.3/0.2/0.1) with Mattes MI gating term (weight 0.1) at the coarsest 2 levels only, dropped if MS-MIND alone clears the 1.15× gate.
5. Drop MI as the inner loss and run MIND-SSC inside a coarse-to-fine IC-LK with a per-voxel Huber kernel; keep Mattes MI only as a global gating term on the coarsest pyramid level to escape T1↔T2 local optima.
6. 4-level Gaussian-pyramid MIND with per-scale self-similarity costs, optimized via M-002's IC-LK + mixed-precision LM + DEQ — the closest training-free analogue of an SSL-shaped feature space for T1↔T2 IR.
7. MIND-SSD residual + Cauchy IRLS + coarse-to-fine M-estimator schedule: port direct VO's robust LM to multi-modal IR, keep MI as fixture-5 (bias+FFD) fallback only.
8. MIND-SSD with Cauchy IRLS + annealed kernel width, validated and tuned against the 60-case tri-axial battery with MIND-degeneracy fixtures as hard NaN gates.
9. Replace the 5-fixture spec with a 60-case multi-modal counterfactual battery stratified by intensity-coupling, bias, modality-asymmetric noise, and MIND-degeneracy traps, gated by tri-axial pass criteria (per-class RMSE + oracle-agreement + modality-symmetry + convergence-recovery).
10. Per-pair SIREN-fitted Structure Tensor Field (STF) as a differentiable modality-invariant descriptor, replacing MIND's hand-crafted patch self-similarity with a continuous coordinate-MLP fit to the local 2nd-moment matrix.