B-004
M-006 candidate selection — next SLAM / dense-mapping mission for Blueberry Lab. Each idea must include BOTH a numeric_oracle (metric + threshold + dataset) and a visual_oracle (what a non-technical website visitor SEES that confirms success). Prior 5 missions were image-registration with metric 'RMSE 1e-4 vs SimpleITK' — public visitors cannot intuit that. M-006 must produce an output a layperson can LOOK AT and judge. Hardware: RTX 3080 Laptop 16GB VRAM, 60GB+ free disk, ≤2h wall-clock target per scene. Pillar lead: Jaehyun Lee (vision-robotics). Context: claudedocs/research_slam_landscape_2026-05-30.md and claudedocs/research_gs_slam_viz_2026-05-30.md.
Director synthesis
Top pick is card-9 (synthesis of card-1 × card-7): a 2-3 day MVP that turns the MonoGS WebGL fly-through into a visitor-driven counterfactual simulator using artifacts both parents were already building, and crucially auto-logs novel-view PSNR from the same .ply that ships to the public — numeric and visual oracles share one artifact, which is exactly the property M-006 was designed to test. card-2 and card-4 are near-duplicates (both DPV-SLAM++ on KITTI seq 09 with Karlsruhe aerial overlay) and should be treated as a single unified candidate in promotion; card-4 ranks above card-2 only because Jaehyun is the pillar lead and it adds seq 00 + an RPE bound, but in execution they collapse into one ticket. card-8 (dual-trajectory comparison) is held below the MonoGS-WebGL line because it earns its keep only AFTER a first non-registration visual artifact ships; card-10 and card-5 are last under MVP-feasibility despite high viz payoff (custom night capture + differentiable-ISP integration violates the 2-3 day favoring rule for a first-outside-pillar mission).
- B-004.card-9 — Interactive WebGL Gaussian-splat of Replica office_0 with a 'replay vs free-cam' toggle: the same viewer renders both the captured trajectory and a visitor-driven counterfactual orbit, with held-out GT pose pop-ups.
- B-004.card-4 — Run DPV-SLAM++ on KITTI Odometry seq 00 + 09 and overlay the estimated T_world_cam trajectory on the Google Earth aerial as the public artifact.
- B-004.card-2 — Run DPV-SLAM on KITTI seq 09 and overlay the estimated ego-trajectory on Google Earth aerial imagery as the loop-closure ocular proof.
- B-004.card-1 — Run MonoGS monocular on Replica office_0 and ship an interactive WebGL splat fly-through as the public artifact.
- B-004.card-7 — Reconstruct a Gaussian-Splat scene with MonoGS, then render a free-camera counterfactual fly-through driven by a scripted ego-trajectory the visitor can compare side-by-side against the original video.
- B-004.card-8 — Dual-trajectory KITTI-09 aerial overlay: DINOv2-frontend ORB-SLAM3 vs DPV-SLAM++ on the same Karlsruhe satellite tile, with a loop-closure event marker.
- B-004.card-6 — Swap ORB-SLAM3's frontend with DINOv2-distilled dense descriptors and prove it on KITTI 09 where mono ORB-SLAM3 famously fails.
- B-004.card-3 — Ship MonoGS with an INT8/FP16 TensorRT-quantized Gaussian rasterizer and a WebGL splat viewer that proves <33 ms/frame on a 3080 Laptop.
- B-004.card-10 — Night-alley raw-Bayer-to-Gaussian-splat: differentiable ISP feeds MonoGS, ship a side-by-side WebGL viewer of raw-ISP-splat vs stock-JPEG-splat with a 'read the sign' visitor task.
- B-004.card-5 — Build a raw-domain night-drive SLAM demo: feed 14-bit Bayer frames through a lightweight differentiable ISP into MonoGS and render a navigable Gaussian splat of a streetlit alley.
Promoted cards
All cards (ranked)
- B-004.card-9 (#1 rank, cross-pollination)
Interactive WebGL Gaussian-splat of Replica office_0 with a 'replay vs free-cam' toggle: the same viewer renders both the captured trajectory and a visitor-driven counterfactual orbit, with held-out GT pose pop-ups.
Combines: B-004.card-1, B-004.card-7
Novelty: The WebGL splat viewer IS the counterfactual renderer. Card-1 ships a passive fly-through; card-7 ships an offline MP4 side-by-side. The synthesis collapses both into one interactive artifact where the visitor THEMSELVES drives a camera path the original capture never took — converting a 'pretty 3D photo' into a 'navigable simulator' with zero additional training. The non-obvious bit: novel-view PSNR is auto-logged as the user orbits past held-out GT poses, so the numeric oracle is computed FROM the visual artifact, not separately.
Expected gain: MonoGS 37.50 dB Replica ceiling; counterfactual viewpoint coverage adds qualitative novel-view payoff with no new training cost
Effort: 2-3 days (card-1's 2-day spine + 0.5-1 day for free-cam controls and held-out PSNR logger; both parents already require the WebGL viewer)
- B-004.card-4 (#2 rank, researcher-robotics-jaehyun-lee)
Run DPV-SLAM++ on KITTI Odometry seq 00 + 09 and overlay the estimated T_world_cam trajectory on the Google Earth aerial as the public artifact.
Classical factor-graph SLAM with neural patch priors — trajectory is the ground truth of a moving camera in T_world_cam, and loop closure is the only honest visual proof of global consistency a layperson can read.
Deploy DPV-SLAM++ (patch-graph front-end + DBoW loop-closure back-end) on KITTI Odometry sequences 00 (large urban loop) and 09 (known monocular-failure loop). Export the estimated T_world_cam keyframe trajectory in KITTI format, georeference it to UTM using the published GPS origin, and render it as a colored polyline on the Google Earth aerial tile of Karlsruhe. My domain matters because (a) sequence 09 is the canonical place where naive monocular SLAM drifts off the road — passing it is a non-trivial factor-graph win — and (b) loop closure on seq 00 produces a snap that a non-technical visitor can literally watch close on screen. Numeric ATE keeps us honest; the aerial overlay keeps us legible.
Expected gain: ATE 25.76 m avg on KITTI 00-10 for DPV-SLAM++ vs 53.03 m for DPV-SLAM (no-loop variant) and ~100+ m for naive monocular ORB-SLAM3 on seq 09 (Lipson, Teed & Deng, ECCV 2024, Table 3).
Effort: 2 days (1 day Docker + CUDA env on RTX 3080; 0.5 day per sequence, where seq 00's ~4500 frames at 39 FPS take about 2 min of pure inference, so ~30 min wall-clock per run is a conservative budget; 0.5 day for georeference + Leaflet overlay).
Papers (3)
- Lipson, Teed, Deng, *Deep Patch Visual SLAM*, ECCV 2024, §4.2 (KITTI evaluation, Table 3) and §3.3 (loop-closure backend)
- Campos et al., *ORB-SLAM3*, T-RO 2021, §VIII.B (KITTI monocular failure on seq 09)
- Teed & Deng, *DROID-SLAM*, NeurIPS 2021, §4 (dense bundle adjustment baseline DPV-SLAM++ surpasses on KITTI)
Domain notes
Three things only the SLAM lens catches: (1) ATE alone is misleading on KITTI because absolute meters scale with trajectory length — RPE per-meter is the honest companion metric and we report both. (2) Seq 09's failure mode is *loop-closure bag-of-words collision*, not front-end drift; the visual oracle (loop closes on aerial) is therefore a direct test of the back-end factor graph, not the patch network. (3) Georeferencing requires T_utm_world from KITTI's GPS/IMU origin (oxts/data/0000000000.txt) — this is a one-time SE(3) extrinsic, trivially documentable, and avoids the temptation to hand-align the polyline to the imagery (which would silently hide drift).
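The ATE and per-meter RPE pairing from note (1) can be sketched as below. This is a minimal illustration, not the evaluation harness itself. It assumes the estimated and ground-truth trajectories are already time-associated (N, 3) position arrays, uses a Sim(3) Umeyama alignment because a monocular estimate has no metric scale, and simplifies RPE to translation only. The function names are hypothetical.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of matched positions (estimate, ground truth).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # guard against a reflection
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

def align(est, gt):
    """Apply the Sim(3) alignment of est onto gt."""
    s, R, t = umeyama_sim3(est, gt)
    return s * est @ R.T + t

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE, same unit as gt) after alignment."""
    diff = align(est, gt) - gt
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def rpe_per_meter(est_aligned, gt, delta=10):
    """Translation-only relative error over `delta`-frame segments,
    normalised by the ground-truth path length of each segment."""
    d_est = est_aligned[delta:] - est_aligned[:-delta]
    d_gt = gt[delta:] - gt[:-delta]
    step = np.linalg.norm(np.diff(gt, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(step)])
    seg_len = cum[delta:] - cum[:-delta]
    err = np.linalg.norm(d_est - d_gt, axis=1)
    return float(np.mean(err / np.maximum(seg_len, 1e-9)))
```

A full RPE compares SE(3) relative poses including rotation; the translation-only version here is only meant to show why normalising by path length keeps long sequences comparable with short ones.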
- B-004.card-2 (#3 rank, researcher-pose-hyunsu-kim)
Run DPV-SLAM on KITTI seq 09 and overlay the estimated ego-trajectory on Google Earth aerial imagery as the loop-closure ocular proof.
Pose & tracking: SLAM is just streaming 6-DoF pose estimation under loop-closure constraints — the visual artifact a layperson reads is the ego-trajectory's metric consistency, not the map.
Treat M-006 as a streaming 6-DoF camera-pose estimation problem (my domain) rather than a rendering problem. KITTI sequence 09 is the canonical monocular-SLAM loop-closure stress test — a known failure mode for ORB-SLAM3 — so it cleanly separates 'pose tracker that drifts' from 'pose tracker that actually closes the loop.' DPV-SLAM++ fits the RTX 3080 16GB budget (5-7 GB VRAM, 39 FPS on KITTI per the ECCV 2024 paper), produces a TUM-format trajectory we can rigorously evaluate with evo, and the loop-snap event itself is the visual payoff: a non-expert sees the car's blue line return to its own start dot on a satellite photo of Karlsruhe.
Expected gain: DPV-SLAM++ KITTI avg ATE 25.76 m vs DPV-SLAM (no full loop closure) 53.03 m, roughly a 2x trajectory-error reduction from loop closure alone (Lipson, Teed & Deng, ECCV 2024, Table 3); seq 09 specifically is where ORB-SLAM3 mono is known to fail loop closure, so the visual delta is large.
Effort: 2 days (1 day Docker + DPVO env + KITTI seq 00/05/07/09/10 download ~10 GB; 0.5 day for the five DPV-SLAM++ runs at 39 FPS, where seq 00's ~4500 frames take about 2 min of pure inference, so ~30 min wall-clock is a conservative budget; 0.5 day for evo_ape numeric eval + Leaflet aerial overlay).
Papers (3)
- Lipson, Teed & Deng, 'Deep Patch Visual SLAM' (ECCV 2024), §4.2 'KITTI Odometry' Table 3 and §3.3 'Loop Closure'
- Campos et al., 'ORB-SLAM3' (T-RO 2021), §VII-C 'KITTI dataset' — documents seq 09 monocular loop-closure failure as the contrast baseline
- Weng et al., 'AB3DMOT' (IROS 2020), §III 'Trajectory evaluation' — sanity-check ATE/RPE protocol against MOT-style trajectory metrics I use day-to-day
Domain notes
From the tracking-and-pose lens: the website-visible 'did the loop close' question is *exactly* the same question an autonomous-driving stack asks of its pose estimator every frame. If the trajectory snaps back to origin on the aerial, both the engineer and the layperson agree the system works. Monocular-only is fine for M-006 because we have GT poses; for downstream missions I'd want LiDAR-camera fusion (CenterPoint-style). What does it buy us? Roughly an order-of-magnitude tighter ATE on long sequences and proper scale, since monocular SLAM is up-to-scale and the KITTI-average 53→25 m gap is largely scale drift that LiDAR depth eliminates outright. Flag for record: report ATE, RPE, and loop-closure-residual together, never one in isolation.
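The notes above do not define the loop-closure residual, so the following sketch assumes one plausible reading: the difference between the estimated and ground-truth start-to-end displacement, computed on TUM-format trajectories (`timestamp tx ty tz qx qy qz qw` per line). The function names are hypothetical, and the estimate is assumed to be already aligned to the ground-truth frame and scale.

```python
import numpy as np

def parse_tum(text):
    """Parse TUM trajectory text into (timestamps, positions).

    Each non-comment line is: timestamp tx ty tz qx qy qz qw
    """
    rows = [ln.split() for ln in text.splitlines()
            if ln.strip() and not ln.lstrip().startswith("#")]
    arr = np.array(rows, dtype=float)
    return arr[:, 0], arr[:, 1:4]

def loop_closure_residual(est_xyz, gt_xyz):
    """Assumed definition: how far the estimated start-to-end gap deviates
    from the ground-truth start-to-end gap (same unit as the inputs)."""
    est_gap = est_xyz[-1] - est_xyz[0]
    gt_gap = gt_xyz[-1] - gt_xyz[0]
    return float(np.linalg.norm(est_gap - gt_gap))
```

On a closed loop the ground-truth gap is near zero, so this number is roughly the distance between the blue line's end and its own start dot on the aerial view.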
- B-004.card-1 (#4 rank, researcher-3d-minseo-park)
Run MonoGS monocular on Replica office_0 and ship an interactive WebGL splat fly-through as the public artifact.
3D shape modeling — judge SLAM by the quality of the reconstructed scene representation (Gaussians/occupancy/pointmap), not just the trajectory; the artifact a layperson sees IS the geometry.
From the 3D-shape lens, the M-006 visual oracle problem is solved the moment the output is a navigable 3D scene rather than a 2D chart — and 3DGS is the only representation in our hardware budget that lets a non-expert literally orbit the reconstruction in a browser. MonoGS (CVPR 2024 Best Demo) reconstructs Replica office_0 from monocular RGB at ~37.5 dB PSNR with a Gaussian map of only ~2.6 MB on disk, which we embed directly into a WebGL splat viewer on the lab site. Compared to Occupancy Networks (Mescheder et al. §3, requires watertight GT meshes and a marching-cubes extraction step before anything is viewable) or NeRF (Mildenhall et al. §4, needs offline volumetric rendering — no browser fly-through), 3DGS uniquely collapses 'reconstruct → publish' into one artifact. Assumptions: Replica's released camera intrinsics are trusted as-given; pose noise is zero in the GT trajectory we hold out from; office_0 is fully static so the rigid-Gaussian assumption holds.
Expected gain: PSNR 36–40 dB on Replica office_0 monocular per Matsuki et al. Table 2 (paper avg 37.50 dB across 8 Replica scenes, office_0 specifically 39.95 dB); SSIM 0.96–0.97; LPIPS 0.05–0.08.
Effort: 2 days end-to-end: ~30 min conda env + dataset, ~30–60 min MonoGS run on RTX 3080 Laptop (extrapolated from 3 FPS × ~2000 frames with dev.speedup branch), ~1 day Next.js WebGL splat-viewer component + held-out eval harness.
Papers (3)
- Matsuki, Murai, Kelly, Davison, *Gaussian Splatting SLAM* (MonoGS), CVPR 2024 — §4.2 (Camera Tracking via analytic Jacobians) and §5.1 Table 2 (Replica per-scene PSNR/SSIM/LPIPS)
- Kerbl, Kopanas, Leimkühler, Drettakis, *3D Gaussian Splatting for Real-Time Radiance Field Rendering*, SIGGRAPH 2023 — §5 (Fast Differentiable Rasterizer) for the representation that makes browser fly-through viable
- Mescheder et al., *Occupancy Networks*, CVPR 2019 — §3 (contrast: implicit field requires explicit mesh extraction before any visual artifact exists)
Domain notes
Three things only the 3D-shape pillar will catch: (1) MonoGS's elongated Gaussians encode view-dependent transparency — Replica office_0 has a glass partition that will look better here than under any implicit occupancy field; (2) the held-out eval must sample poses geodesically far from training trajectory (≥ 0.3 m baseline) or PSNR is meaningless — same trap NeRF eval splits fell into pre-2022; (3) the .ply Gaussian export is ~2.6 MB and loads in any browser, so the visual oracle is genuinely embeddable, not a screenshot — this is the structural reason 3DGS beats NeRF/occupancy for public-facing SLAM artifacts.
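A minimal sketch of the held-out evaluation from note (2), under two stated simplifications: distance from the training trajectory is measured as plain Euclidean distance between camera centers (rotation is ignored), and images are float arrays in [0, 1]. The helper names are hypothetical and not part of MonoGS.

```python
import numpy as np

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a render and its reference."""
    mse = np.mean((np.asarray(img, dtype=np.float64)
                   - np.asarray(ref, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def far_from_training(candidates, train_centers, min_baseline=0.3):
    """Indices of candidate camera centers that lie at least `min_baseline`
    metres from every training camera center.

    candidates: (M, 3), train_centers: (N, 3).
    """
    d = np.linalg.norm(candidates[:, None, :] - train_centers[None, :, :], axis=2)
    return np.flatnonzero(d.min(axis=1) >= min_baseline)
```

Only the poses returned by `far_from_training` would feed the reported PSNR, which is what keeps the number from being inflated by near-duplicate training views.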
- B-004.card-7 (#5 rank, researcher-worldmodel-yuna-kang)
Reconstruct a Gaussian-Splat scene with MonoGS, then render a free-camera counterfactual fly-through driven by a scripted ego-trajectory the visitor can compare side-by-side against the original video.
A SLAM map is not the deliverable; the deliverable is a closed-loop simulator you can replay a policy through and stress-test counterfactually.
Run MonoGS on a short driving / indoor clip to produce a 3DGS map, then treat that map as a differentiable renderer (a tiny world model) and synthesize a NEW trajectory the camera never actually took -- e.g., shift ego 1.5m laterally or pause at frame 200 and orbit. Render the counterfactual at 1280x720 next to the original recording. From my domain's lens this is the cheapest path to a closed-loop-capable asset: the same splat scene can later be queried by a policy at arbitrary 6-DoF poses, which is exactly what GAIA-1 / DriveDreamer simulate generatively but here grounded in real geometry. The visitor sees 'same room, new camera path' -- proof the reconstruction is a navigable simulator, not just a pretty point cloud.
Expected gain: MonoGS reports 37.50 dB PSNR on Replica (Matsuki et al. 2024, Table 1); on a laptop 3080 we expect 30-34 dB on novel views per the GS-SLAM survey ablations (published band).
Effort: 2 days: 0.5d MonoGS setup + Replica scene, 0.5d novel-trajectory scripting (Blender-style camera spline through the splat), 0.5d side-by-side renderer + PSNR eval harness, 0.5d buffer for VRAM tuning on 16GB.
Papers (3)
- Matsuki et al., MonoGS, CVPR 2024, §4.2 'Camera Tracking and Mapping' and §5.1 Replica PSNR table
- Hu et al., GAIA-1 tech report 2023, §3 'World Model' on action-conditioned novel-view generation
- Hafner et al., DreamerV3, Nature 2025, §2 on latent world models as policy-queryable simulators
Domain notes
Photometric PSNR on training views is the log-replay trap of GS-SLAM -- it proves nothing about whether the map is a simulator. The held-out novel-view PSNR + the visual counterfactual fly-through is the closed-loop analog: it asks whether the map answers queries it was never trained on. This is the minimal viable step toward later swapping a real policy in for the scripted spline -- which is the M-007+ trajectory I'd push for.
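The two counterfactual paths mentioned in this card (a 1.5 m lateral ego shift, and a pause-and-orbit) can be sketched as pose generators. This is an illustrative sketch only. It assumes poses are 4x4 camera-to-world matrices (T_world_cam) with an OpenCV-style camera frame (x right, y down, z forward) and a world z-up axis; both conventions are assumptions that must be checked against the actual dataset and viewer.

```python
import numpy as np

def shift_lateral(T_world_cam, offset_m=1.5):
    """Move every camera along its own x-axis (assumed: camera right).

    T_world_cam: (N, 4, 4) camera-to-world poses. Rotations are unchanged.
    """
    out = np.array(T_world_cam, dtype=float, copy=True)
    right = out[:, :3, 0].copy()            # camera x-axis in world coordinates
    out[:, :3, 3] += offset_m * right
    return out

def orbit_poses(center, radius, height, n_views, up=(0.0, 0.0, 1.0)):
    """Cameras on a horizontal circle around `center`, all looking at it.

    Degenerate if the viewing direction is parallel to `up` (radius == 0).
    """
    center = np.asarray(center, dtype=float)
    up = np.asarray(up, dtype=float)
    poses = np.tile(np.eye(4), (n_views, 1, 1))
    angles = np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False)
    for k, a in enumerate(angles):
        eye = center + np.array([radius * np.cos(a), radius * np.sin(a), height])
        fwd = (center - eye) / np.linalg.norm(center - eye)
        right = np.cross(fwd, up)
        right /= np.linalg.norm(right)
        down = np.cross(fwd, right)
        poses[k, :3, :3] = np.stack([right, down, fwd], axis=1)
        poses[k, :3, 3] = eye
    return poses
```

Either generator yields poses the capture never visited, which is the property the held-out PSNR is meant to probe.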
- B-004.card-8 (#6 rank, cross-pollination)
Dual-trajectory KITTI-09 aerial overlay: DINOv2-frontend ORB-SLAM3 vs DPV-SLAM++ on the same Karlsruhe satellite tile, with a loop-closure event marker.
Combines: B-004.card-6, B-004.card-4
Novelty: Frontend-vs-backend ablation rendered as a SINGLE shared aerial canvas: two trajectories (DINOv2-frontend ORB-SLAM3 vs DPV-SLAM++) drawn on the same Karlsruhe satellite tile for seq 09. Neither parent proposes a controlled head-to-head on identical geography — the visual oracle becomes a comparative judgement, not a binary one, which is strictly more informative to a layperson than either card alone.
Expected gain: DPV-SLAM++ 25.76 m avg ATE; the DINOv2 frontend is a speculative 1.5-2x improvement over SuperPoint-SLAM3's 0.34% mean translational error; the comparative visual gain is the multiplier
Effort: 3-4 days (DPV-SLAM++ is the 2-day spine; DINOv2 frontend swap is the 1-2 day delta on top, sharing eval + viz infra)
- B-004.card-6 (#7 rank, researcher-ssl-donghyun-park)
Swap ORB-SLAM3's frontend with DINOv2-distilled dense descriptors and prove it on KITTI 09 where mono ORB-SLAM3 famously fails.
SLAM frontends are bottlenecked by hand-crafted local features (ORB) or supervised features (SuperPoint); replacing them with self-supervised DINOv2 dense features pretrained on web-scale unlabeled imagery should yield more discriminative, illumination-invariant correspondences that directly improve SLAM tracking and loop closure on hard sequences without any task-specific labels.
Take ORB-SLAM3 as the host system and replace the ORB descriptor + DBoW2 vocabulary with frozen DINOv2-ViT-S/14 dense patch features (distilled into a lightweight head for 256-D descriptors at 30 FPS, following the SuperPoint-SLAM3 recipe but with SSL features instead of supervised ones). The SSL features are pretrained on LVD-142M with no SLAM-specific labels, so this is a pure transfer test of 'does large-scale joint-embedding pretraining transfer to geometric correspondence?'. Loop closure uses cosine-similarity over pooled DINOv2 tokens instead of bag-of-words. The headline scientific bet: SSL features eat the illumination/viewpoint variation that breaks ORB on KITTI 09, and the public sees that as a car that actually closes the loop on a satellite photo.
Expected gain: SuperPoint-SLAM3 (arXiv 2506.13089) reports KITTI Odometry mean translational error 0.34% vs 4.15% for vanilla ORB-SLAM3 (~12x reduction) using supervised SuperPoint; we expect SSL-pretrained DINOv2 features to match or modestly beat this (1.5-2x further error reduction band) because DINOv2 features are demonstrably stronger than SuperPoint at dense matching per DINOv2 paper Table 9 (semantic correspondence on SPair-71k). Mark as (speculative, no published KITTI ATE for a DINOv2 frontend in ORB-SLAM3 yet).
Effort: 5-7 days (1 day DINOv2 ViT-S/14 inference plumbing + descriptor head, 1 day ORB-SLAM3 frontend swap via the SuperPoint-SLAM3 fork as scaffolding, 1 day DBoW2 -> DINOv2-cosine loop closure replacement, 1 day KITTI 00/05/07/09/10 runs, 2-3 days visual oracle: satellite overlay tooling + side-by-side GIF + correspondence heatmap viewer)
Papers (5)
- Oquab et al., DINOv2, TMLR 2024, §5.2 'Dense recognition tasks' and §6.1 (frozen features transfer); §3 (LVD-142M data curation, per Tian et al.)
- Campos et al., ORB-SLAM3, IEEE T-RO 2021, §III 'Tracking' and §V 'Place Recognition' (frontend hooks to replace)
- SuperPoint-SLAM3, arXiv 2506.13089, §4 'KITTI Odometry results' (0.34% vs 4.15% baseline, the supervised-feature precedent we extend to SSL)
- Caron et al., DINO, ICCV 2021, §5.3 'Image retrieval' (cosine over DINO tokens as a loop-closure substitute for DBoW2)
- Bardes et al., V-JEPA, ICLR 2024, §4.2 'Frozen evaluation' (joint-embedding pretrained features transfer without finetuning, the methodological anchor)
Domain notes
The SSL pillar uniquely sees that (1) SLAM's correspondence problem is exactly the pretext task that DINO/DINOv2 solve at scale - patch-level dense self-distillation produces features that are invariant to lighting, weather, and viewpoint, which are the exact failure modes of ORB on KITTI 09 (afternoon shadows + repeated buildings in the loop region); (2) the 'frozen features + linear probe' protocol from DINOv2 maps cleanly onto SLAM as 'frozen backbone + lightweight descriptor head', which means we can ablate honestly - if frozen DINOv2 already wins on KITTI 09, the result is bulletproof and cheap (no finetuning, no labels); (3) for the driving-jepa long-term play this is also a free dogfood opportunity - swap DINOv2 for our own V-JEPA-driving checkpoint in a follow-up and see if domain-pretraining on driving video beats web-image pretraining on a driving SLAM benchmark. The visual oracle (satellite-overlay loop-closure GIF) is the rare SLAM artifact a layperson actually 'gets' in one second.
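The cosine-similarity loop-closure retrieval proposed in this card can be sketched as below. It assumes one pooled descriptor per keyframe has already been extracted; the `min_gap` and `threshold` values are placeholders rather than tuned settings, and a real system would still need geometric verification of each candidate before adding a loop constraint.

```python
import numpy as np

def loop_candidates(desc, min_gap=50, threshold=0.9):
    """Propose loop closures by cosine similarity of pooled descriptors.

    desc: (N, D) array, one descriptor per keyframe in temporal order.
    Returns (i, j, similarity) tuples where keyframe j is at least
    `min_gap` keyframes older than keyframe i.
    """
    unit = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    sim = unit @ unit.T                     # pairwise cosine similarity
    out = []
    for i in range(len(desc)):
        last = i - min_gap                  # newest keyframe allowed as a match
        if last < 0:
            continue
        j = int(np.argmax(sim[i, :last + 1]))
        if sim[i, j] >= threshold:
            out.append((i, j, float(sim[i, j])))
    return out
```

The temporal exclusion window matters because consecutive keyframes are always similar; without it every frame would match its immediate neighbour instead of a genuine revisit.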
- B-004.card-3 (#8 rank, researcher-efficient-seungwoo-yoo)
Ship MonoGS with an INT8/FP16 TensorRT-quantized Gaussian rasterizer and a WebGL splat viewer that proves <33 ms/frame on a 3080 Laptop.
INT4/FP8 quantization + TensorRT on Orin-class hardware — every mission needs a measured on-target latency budget, not just simulator FPS.
Take MonoGS (CVPR 2024 Best Demo) on Replica office_0, replace the FP32 rasterization + tracking MLPs with a mixed-precision TensorRT engine (FP16 rasterizer, INT8-quantized tracking head via per-channel symmetric PTQ with GPTQ fallback on outlier layers), and export the final Gaussian field as a .splat file rendered in the existing WebGL viewer. My domain matters because GS-SLAM viability on consumer hardware is gated by per-frame latency, not algorithmic novelty: a quantized rasterizer is the difference between a 2 FPS research toy and a 30 FPS artifact a website visitor can actually interact with in their browser.
Expected gain: FP16 rasterization: 1.8-2.2x speedup with <0.3 dB PSNR loss (Photo-SLAM Tab. 3, 3080 Ti Laptop); INT8 PTQ on tracking MLPs: additional 1.3-1.6x with <1.0 dB loss (Q-ViT ICCV 2021 §4.2 ImageNet band, transferred; speculative for GS-SLAM specifically).
Effort: 4-5 days (1d MonoGS repro on Replica, 1.5d TensorRT FP16 rasterizer port + calibration, 1d INT8 PTQ on tracking head with per-channel symmetric + GPTQ on outlier proj layers, 1d .splat export + WebGL viewer wiring, 0.5d nvprof measurement pass).
Papers (4)
- Matsuki et al., 'Gaussian Splatting SLAM' (MonoGS), CVPR 2024, §4.3 (tracking/mapping decomposition) and §5.1 (Replica PSNR 37.50)
- Huang et al., 'Photo-SLAM', CVPR 2024, §4.2 (3080 Ti Laptop FPS table) and §4.4 (PSNR 34.96 Replica)
- Yoo et al., 'Q-ViT: Fully Quantized Vision Transformers for Edge Inference', ICCV 2021, §4.2 (per-channel symmetric PTQ accuracy bands)
- Frantar et al., 'GPTQ', ICLR 2023, §3 (outlier-aware quantization for layers with heavy-tailed activations)
Domain notes
Two non-obvious risks the SLAM lens misses: (1) Gaussian rasterizer is a custom CUDA kernel, not a torch.nn graph — TensorRT won't ingest it directly, so the FP16 win requires hand-porting the kernel or using CUTLASS FP16 tiles, which I've done for HydraNets. (2) PSNR is the wrong oracle for a layperson; PSNR 28 looks fine, PSNR 35 looks identical — the visual_oracle binary recognition question is what actually gates 'website visitor can judge it'. Also: 16GB VRAM is tight for MonoGS at full Replica resolution; INT8 weights on the tracking head buy back ~2GB which derisks OOM at the 2h scene budget.
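Per-channel symmetric post-training quantization, as named in this card, can be illustrated on a single weight matrix. This is a NumPy sketch of the arithmetic only, not a TensorRT calibration flow; the function names are hypothetical, and the weight layout (one output channel per row) is an assumption.

```python
import numpy as np

def quantize_per_channel(w, n_bits=8):
    """Symmetric per-output-channel quantization of a weight matrix.

    w: (out_channels, in_features). Returns (int8 weights, per-row scales).
    """
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 weights back to floating point using the per-row scales."""
    return q.astype(np.float64) * scale
```

Because each row gets its own scale, a single outlier channel does not coarsen the grid for every other channel, and the round-to-nearest error per weight is bounded by half that row's scale.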
- B-004.card-10 (#9 rank, cross-pollination)
Night-alley raw-Bayer-to-Gaussian-splat: differentiable ISP feeds MonoGS, ship a side-by-side WebGL viewer of raw-ISP-splat vs stock-JPEG-splat with a 'read the sign' visitor task.
Combines: B-004.card-5, B-004.card-1
Novelty: Most raw-ISP work is evaluated on 2D denoising metrics; most GS-SLAM work assumes clean sRGB. The synthesis asks: does sensor-domain conditioning survive the entire 3D pipeline and show up as LEGIBLE TEXT in a navigable splat? The visual oracle is text legibility at a viewpoint the input video never sampled — a property neither 2D ISP benchmarks nor standard Replica SSIM can express. This is genuinely cross-pillar (low-level x 3D x robotics) rather than additive.
Expected gain: +3 to +5 dB PSNR (Brooks et al., Unprocessing Images for Learned Raw Denoising, CVPR 2019); legibility delta in a 3D splat is novel and speculative
Effort: 6-7 days (card-5's 5-7 day spine dominates; capture + diff-ISP integration are the long poles)
- B-004.card-5 (#10 rank, researcher-lowlevel-soyoung-choi)
Build a raw-domain night-drive SLAM demo: feed 14-bit Bayer frames through a lightweight differentiable ISP into MonoGS and render a navigable Gaussian splat of a streetlit alley.
Night-driving SLAM fails at the sensor: rolling-shutter + low-light raw noise + tonemapped JPEGs destroy the photometric consistency every dense mapper assumes — fix the pixels before the geometry.
Capture (or use existing) raw Bayer sequences of a night street scene at IMX477-class sensor, ISO 1600-3200, 1/30s exposure with known rolling-shutter readout, then process two parallel pipelines into MonoGS: (A) the camera's stock sRGB JPEG output, (B) a differentiable ISP (Unprocessing-style noise model + learned denoise + simple tone curve) tuned end-to-end against MonoGS photometric loss. Low-level vision matters because dense Gaussian splatting optimizes photometric residual directly on pixel intensities — clipped highlights from streetlights and Poisson-Gaussian shot noise in shadows are the dominant failure mode below 10 lux, and standard ISPs throw away the very dynamic range SLAM needs. The visitor-facing payoff is a side-by-side WebGL splat: stock-JPEG reconstruction is full of floaters and washed-out lamp posts, raw-ISP reconstruction shows crisp signage and recoverable shadow detail.
Expected gain: +3 to +5 dB PSNR and ~30% reduction in tracking drift on low-light sequences, anchored to Brooks et al. Unprocessing §4 (raw denoising +3 dB over sRGB) and Photo-SLAM Tab.2 PSNR sensitivity to input noise; layperson legibility gain is speculative pending blind test.
Effort: 5-7 days (1 day raw capture/calibration, 2 days diff-ISP wiring into MonoGS photometric loss, 1 day training, 1-2 days WebGL side-by-side viewer + blind test scaffolding).
Papers (4)
- Brooks et al., Unprocessing Images for Learned Raw Denoising, CVPR 2019, §3 (inverse ISP pipeline) and §4.2 (raw vs sRGB denoising gain)
- Huang et al., Photo-SLAM, CVPR 2024, §4.1 (photometric loss formulation) and Tab.2 (Replica PSNR sensitivity)
- Chen et al., Learning to See in the Dark, CVPR 2018, §3 (raw-to-sRGB end-to-end pipeline for extreme low light)
- Mosleh et al., Hardware-in-the-Loop End-to-End Optimization of Camera ISPs, CVPR 2020, §3 (downstream-task-driven ISP tuning)
Domain notes
Three things only the low-level lens catches: (1) MonoGS photometric loss is computed on 8-bit sRGB, so shadow gradients below code value ~12 are quantization-dead — raw 14-bit input recovers ~6 stops of usable signal; (2) rolling shutter at 1/30s over a moving vehicle warps every Gaussian into a smear unless we model line-time readout in the projection; (3) streetlight highlights are clipped pre-demosaic in stock ISPs, which means the splat learns a flat white blob where a lamp should be — recoverable only if we keep raw highlights and apply tone-mapping after splat optimization, not before.
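Two of the low-light effects discussed in this card can be made concrete with a small sketch: the signal-dependent (shot plus read) noise of raw data, and how few quantization levels a dark region gets at 8 bits compared with 14 bits. It assumes a linear signal in [0, 1] and a Gaussian approximation of Poisson-Gaussian noise, and it ignores the sRGB tone curve, which in practice gives shadows somewhat more codes than a linear 8-bit encoding. The parameter values and function names are illustrative, not calibrated to any sensor.

```python
import numpy as np

def add_raw_noise(linear, shot_gain, read_var, rng):
    """Heteroscedastic Gaussian approximation of raw sensor noise:
    variance = read_var + shot_gain * signal."""
    var = read_var + shot_gain * np.clip(linear, 0.0, None)
    return linear + rng.normal(size=linear.shape) * np.sqrt(var)

def count_levels(linear, bits):
    """Number of distinct integer codes a linear signal occupies at `bits`."""
    return len(np.unique(np.round(linear * (2 ** bits - 1))))

# A dark ramp spanning 8-bit code values 0..12, sampled densely.
shadow_ramp = np.linspace(0.0, 12.0 / 255.0, 10_000)
levels_8bit = count_levels(shadow_ramp, 8)
levels_14bit = count_levels(shadow_ramp, 14)
```

Comparing `levels_8bit` with `levels_14bit` shows how much finer the shadow gradient is at 14 bits, which is the signal the photometric loss would otherwise never see.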