B-001

Push pose-tracking AMOTA on nuScenes-mini mini_val from 0.471 to ≥ 0.55

Metric: AMOTA; baseline 0.471; target 0.55; paper band 0.55–0.7

Director synthesis

Card-2 (Hyunsu Kim, Mahalanobis + CTRV + per-class hyperparams) is #1 because it has the strongest gain/effort ratio of any card on the table — 1.5 days for a credible +0.05 to +0.09 AMOTA — and it directly attacks the diagnosed failure mode (small-class association in AB3DMOT) with a well-precedented recipe (Chiu WACV 2021 Tab.2). It is also the substrate every other tracking-side improvement (cards 3, 4, 6, 8, 10) implicitly assumes; landing it first de-risks the rest of the program and likely closes ~70% of the gap to the 0.55 target on its own.

B-001.card-2 — Replace AB3DMOT's center-distance Hungarian with a class-aware 3D-GIoU + velocity-aware Mahalanobis cost, lower min_hits to 1 for two-wheelers, and add a one-step constant-turn-rate-velocity (CTRV) Kalman to reclaim bicycle/motorcycle AMOTA.
B-001.card-3 — Per-class score-threshold + Hungarian-cost temperature sweep on the detector logits, calibrated against mini_val, to recover bicycle/motorcycle recall without inflating FPs that break AMOTA's recall ladder
B-001.card-6 — Add a frozen DINOv2 ViT-B/14 appearance embedding on multi-view RGB crops as a second cost term in Hungarian association
B-001.card-8 — DINOv2-conditioned RSSM rollouts: gate occluded-object re-association by the joint likelihood p(traj | world-model) · p(appearance | DINOv2 prototype), eliminating the IDS/FRAG cliff at max_age.
B-001.card-4 — Replace per-object constant-velocity Kalman with an ego-motion-compensated IMM (CV + CTRV + CA) filter operating in T_world_obj with explicit covariance propagation through T_world_ego
B-001.card-10 — Joint per-class calibration of {score threshold, NMS, IMM mode prior, Mahalanobis gate} via a single coordinate-descent sweep on mini_train, so detector recall lift and ego-aware filter gating co-adapt instead of fighting each other.
B-001.card-7 — Replace constant-velocity Kalman with a GenAD-style latent world model that rolls out occluded-object futures 1-2s ahead, scoring re-association against generative trajectory priors instead of dropping at max_age=2
B-001.card-1 — Complete sparse bicycle/motorcycle point clusters with a class-conditioned occupancy decoder before feeding boxes to AB3DMOT, so detection recall (and therefore AMOTA) on small classes recovers.
B-001.card-9 — Raw-domain RGB-conditioned class-conditional OccNet: complete sparse two-wheeler LiDAR clusters by cross-attending to Choi's differentiable-ISP linear-sensor features at the frustum projection, producing tighter boxes than either LiDAR-only OccNet or RGB-only frustum lifts.
B-001.card-5 — Fuse a raw-domain camera detector for small/dim classes (bicycle, pedestrian) into AB3DMOT via late detection-level fusion, with a lightweight differentiable ISP tuned for detection loss on twilight/shaded frames

Promoted cards

M-001 ← B-001.card-2

All cards (ranked)

B-001.card-2 (#1 rank)
Hyunsu Kim
Replace AB3DMOT's center-distance Hungarian with a class-aware 3D-GIoU + velocity-aware Mahalanobis cost, lower min_hits to 1 for two-wheelers, and add a one-step constant-turn-rate-velocity (CTRV) Kalman to reclaim bicycle/motorcycle AMOTA.
pose-tracking native: tracker quality is dominated by association cost design and motion model, not detector swaps, especially on small-recall classes like bicycle/motorcycle
On mini_val the aggregate is bottlenecked by bicycle (0.075) and motorcycle (0.337); these classes have few hits, sharp heading changes, and tight boxes where center-distance gating both over-rejects true matches and over-associates clutter. Swap the association cost to class-conditioned 3D-GIoU plus a Mahalanobis term over the CenterPoint-predicted velocity residual (CenterPoint already emits per-detection vx,vy), switch the EKF dynamics from constant-velocity to CTRV (the standard upgrade Chiu et al. and AB3DMOT-v2 ship), and tune per-class {min_hits, max_age, score_thr}: for bicycle/motorcycle drop min_hits to 1 and raise max_age to 3, while keeping car/truck at the current 3/2. This is pure tracker-side work that does not touch the detector, so it composes with anything else the lab tries; on nuScenes full val, Probabilistic 3D MOT (Chiu et al.) reported +0.06-0.09 AMOTA over AB3DMOT vanilla from exactly this combination.
Expected gain: +0.05 to +0.09 AMOTA aggregate (Chiu et al. WACV 2021 Table 2: 0.561 -> 0.626 on nuScenes val going from AB3DMOT center-dist+CV to Mahalanobis+CTRV; bicycle/motorcycle gained 0.10-0.15 AMOTA in their per-class breakdown). On mini_val with only 2 scenes, expect noisier but directionally similar; pushing 0.471 -> ~0.54-0.56 is realistic.
Effort: 1.5 days (4h: 3D-GIoU + Mahalanobis cost matrix; 4h: CTRV EKF with proper Jacobian; 2h: per-class hyperparam sweep on mini_val; 2h: validation + IDS/FRAG regression check)
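A minimal sketch of the CTRV prediction step this card proposes, assuming a planar state (x, y, yaw, speed, yaw rate). The function name and argument layout are hypothetical and not AB3DMOT's API; the EKF Jacobian and process noise listed in the effort breakdown are not shown.

```python
import math

def ctrv_predict(x, y, yaw, v, yaw_rate, dt, eps=1e-6):
    """Propagate a CTRV state (position, heading, speed, turn rate) by dt seconds."""
    if abs(yaw_rate) < eps:
        # Straight-line limit: avoids dividing by a near-zero turn rate.
        x_new = x + v * math.cos(yaw) * dt
        y_new = y + v * math.sin(yaw) * dt
    else:
        r = v / yaw_rate  # signed turn radius
        x_new = x + r * (math.sin(yaw + yaw_rate * dt) - math.sin(yaw))
        y_new = y + r * (math.cos(yaw) - math.cos(yaw + yaw_rate * dt))
    yaw_new = yaw + yaw_rate * dt
    # Speed and turn rate are held constant by the model.
    return x_new, y_new, yaw_new, v, yaw_rate

# Illustrative values: a bicycle at 5 m/s turning left at 0.5 rad/s, one 0.5 s step.
state = ctrv_predict(x=0.0, y=0.0, yaw=0.0, v=5.0, yaw_rate=0.5, dt=0.5)
```

The explicit straight-line branch matters in practice: most tracks have a turn rate near zero most of the time, and the circular-arc formula is numerically unstable there.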
Papers (4)
- Chiu et al., 'Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving,' WACV 2021, §3.2 (Mahalanobis association) and §3.3 (per-class hyperparameters)
- Weng et al., 'AB3DMOT,' IROS 2020, §III-B (Hungarian + 3D-IoU baseline we are replacing)
- Yin et al., 'CenterPoint,' CVPR 2021, §3.3 (velocity head we exploit in the Mahalanobis term)
- Pang et al., 'SimpleTrack,' ECCV 2022 workshop, §4 (3D-GIoU > center-distance ablation, +2.1 AMOTA)
Domain notes
Critical: report AMOTA, AMOTP, AND IDS together after the change — Mahalanobis gating can silently inflate IDS if the velocity covariance is mis-scaled. Also: with only 41 bicycle GT in mini_val, a single missed track flips AMOTA by ~0.05, so confirm any gain on at least nuScenes val before declaring victory.
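To make the covariance-scaling warning above concrete, here is a small sketch of a Mahalanobis gate on a 2D velocity residual. The helper names, the residual, and the covariance values are hypothetical; the gate constant is the 99% chi-square quantile for 2 degrees of freedom, which is one common choice rather than something the card specifies.

```python
CHI2_99_2DOF = 9.21  # 99% quantile of chi-square with 2 degrees of freedom

def mahalanobis_sq_2d(residual, cov):
    """Squared Mahalanobis distance r^T S^-1 r for a 2-vector and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must have a positive determinant")
    rx, ry = residual
    # Closed-form inverse of a 2x2 matrix applied to the residual.
    return (d * rx * rx - (b + c) * rx * ry + a * ry * ry) / det

def passes_gate(residual, cov, gate=CHI2_99_2DOF):
    """True when the residual is statistically compatible with the track."""
    return mahalanobis_sq_2d(residual, cov) <= gate

residual = (1.5, 0.5)                      # m/s, detection velocity minus track velocity
tight = ((0.25, 0.0), (0.0, 0.25))         # assumed velocity covariance
loose = ((1.0, 0.0), (0.0, 1.0))           # same covariance inflated 4x

rejected = not passes_gate(residual, tight)   # squared distance 10.0, outside the gate
accepted = passes_gate(residual, loose)       # squared distance 2.5, inside the gate
```

The same residual flips from rejected to accepted when the covariance is inflated, which is why a mis-scaled velocity covariance changes association behaviour without any visible error.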
B-001.card-3 (#2 rank)
Seungwoo Yoo
Per-class score-threshold + Hungarian-cost temperature sweep on the detector logits, calibrated against mini_val, to recover bicycle/motorcycle recall without inflating FPs that break AMOTA's recall ladder
efficient-vision: per-class score-threshold and NMS calibration as a quantization-adjacent precision/recall sweep
AMOTA integrates over recall, so the current single global score>=0.10 cutoff is leaving the bicycle/motorcycle classes recall-starved (0.146 recall on 41 GT) while car/truck are already saturated. Treat the detector head exactly like a post-training-quantized classifier: sweep per-class thresholds (e.g. car 0.20, ped 0.15, bicycle 0.05, motorcycle 0.05) and a softmax/sigmoid temperature on the class logits using mini_train as a calibration split, plus a class-specific Mahalanobis gate widening for the AB3DMOT association cost on the two weak classes. This is the same calibration-set methodology used to pick GPTQ clipping ranges and INT8 activation percentiles — cheap, deterministic, and it directly trades the precision/recall point that AMOTA averages over. No retraining, no engine rebuild.
Expected gain: +0.03 to +0.06 AMOTA, driven mostly by bicycle/motorcycle recall lift; anchored to AB3DMOT (Weng et al., IROS 2020, Tab. 2) showing 0.02-0.05 AMOTA swings from score-threshold tuning alone, and CenterPoint (Yin et al., CVPR 2021, §4.3) per-class NMS radius ablation showing similar magnitude on rare classes
Effort: 6-10 hours (grid is ~6 thresholds x 7 classes x 1 temperature x 2 gate radii, each eval on mini_val is seconds since detections are cached)
Papers (3)
- Weng et al., AB3DMOT, IROS 2020, §III-B (score thresholding and birth/death)
- Yin et al., CenterPoint, CVPR 2021, §4.3 (per-class NMS and score thresholds)
- Frantar et al., GPTQ, ICLR 2023, §3 (calibration-set driven per-channel clipping - methodological analogue)
Domain notes
Latency-free: this runs entirely on cached detections, zero ms impact on the Orin engine. Watch for AMOTA's recall-denominator quirk - dropping the global threshold below 0.05 inflates FPs on car/ped and can actually drop AMOTA; per-class is the only safe knob.
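The per-class cutoff described in this card can be sketched as a filter over cached detections. The threshold values are the examples quoted in the card; the detection schema (dicts with "name" and "score" keys) is a hypothetical stand-in for whatever the cached detection format actually is.

```python
GLOBAL_THR = 0.10  # the current single global cutoff

# Example per-class values from the card; unlisted classes fall back to GLOBAL_THR.
PER_CLASS_THR = {"car": 0.20, "pedestrian": 0.15, "bicycle": 0.05, "motorcycle": 0.05}

def filter_detections(detections, per_class_thr=PER_CLASS_THR, default_thr=GLOBAL_THR):
    """Keep detections whose score meets the threshold for their own class."""
    return [
        d for d in detections
        if d["score"] >= per_class_thr.get(d["name"], default_thr)
    ]

cached = [
    {"name": "bicycle", "score": 0.07},   # kept: below the global cutoff, above 0.05
    {"name": "car", "score": 0.15},       # dropped: below the stricter car cutoff
    {"name": "truck", "score": 0.12},     # kept: falls back to the global cutoff
]
kept = filter_detections(cached)
```

Because the filter runs on cached detections, each point of the sweep only needs a re-run of the tracker and the evaluation.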
B-001.card-6 (#3 rank)
Donghyun Park
Add a frozen DINOv2 ViT-B/14 appearance embedding on multi-view RGB crops as a second cost term in Hungarian association
self-supervised joint-embedding representation learning for appearance re-identification
For each 3D detection, project the box to all 6 cameras, take the largest-area crop (resize to 224x224), and extract the CLS token from a frozen DINOv2 ViT-B/14 pretrained on LVD-142M — no nuScenes finetuning, just inference. Maintain an EMA appearance prototype per track (momentum 0.9, V-JEPA-style) and fuse a cosine-distance cost with the existing center-distance cost via a convex combination (alpha ~ 0.4 on appearance, tune on mini_train). DINOv2 features are dense, semantically clustered, and instance-discriminative out of the box, so they directly attack the re-id-after-occlusion failure mode (IDS 33, FRAG 22) without any training and recover tracks across the typical 0.5-2s nuScenes gaps where center-distance gating fails.
Expected gain: +0.04 to +0.07 AMOTA — anchored to DINOv2 §5.2 (frozen-feature instance retrieval matches or beats supervised ResNet-50 by 4-8 points mAP on Oxford/Paris) and DINO §5.3 (k-NN classification 78.3% on ImageNet from frozen ViT-S/8, indicating strong instance-level discrimination usable for re-id without training)
Effort: 1.5 days (4h: box-to-image projection + crop pipeline across 6 cams; 2h: DINOv2 inference wrapper with batched crops; 2h: EMA prototype + cost fusion in AB3DMOT; 4h: alpha sweep on mini_train and AMOTA eval on mini_val)
Papers (4)
- Oquab et al., DINOv2, TMLR 2024 §5.2 (instance retrieval with frozen features)
- Caron et al., DINO, ICCV 2021 §5.3 (k-NN evaluation showing instance discrimination)
- Bardes et al., V-JEPA, ICLR 2024 §4.1 (EMA target encoder, motivates EMA appearance prototype per track)
- He et al., MoCo, CVPR 2020 §3.1 (momentum-updated queue — same EMA principle for stable per-track embeddings)
Domain notes
Pretrain -> freeze -> use; do NOT finetune on nuScenes — the dataset is too small (mini has ~10 scenes) and would destroy the LVD-142M-learned invariances. DINOv2 over MAE here because the downstream (re-id) is semantic discrimination, not pixel reconstruction. For bicycle recall 0.146 the appearance prior also helps marginally — but the real fix for bicycle is detector-side, not tracker-side. Multi-view: pick max-area crop rather than averaging across cams.
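A sketch of the per-track EMA prototype and the convex cost fusion from this card, using plain Python lists in place of DINOv2 CLS tokens. Function names are hypothetical, and the sketch assumes the center-distance cost has already been normalized to roughly the same [0, 1] range as cosine distance, which the card does not state.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def ema_update(prototype, embedding, momentum=0.9):
    """Blend a new embedding into the track prototype, then re-normalize."""
    mixed = [momentum * p + (1.0 - momentum) * e for p, e in zip(prototype, embedding)]
    return l2_normalize(mixed)

def cosine_distance(a, b):
    """1 - cosine similarity; 0 for identical directions, 1 for orthogonal ones."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def fused_cost(center_cost, prototype, embedding, alpha=0.4):
    """Convex combination: alpha on appearance, (1 - alpha) on center distance."""
    return alpha * cosine_distance(prototype, embedding) + (1.0 - alpha) * center_cost

# Toy 2-D embeddings; real CLS tokens from ViT-B/14 are 768-dimensional.
proto = ema_update([1.0, 0.0], [0.0, 1.0])
cost = fused_cost(center_cost=0.5, prototype=[1.0, 0.0], embedding=[0.0, 1.0])
```

Re-normalizing after each EMA step keeps the prototype on the unit sphere, so cosine distance stays comparable across tracks of different ages.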
B-001.card-8 (#4 rank, cross-pollination)
director-jiwoo-han
DINOv2-conditioned RSSM rollouts: gate occluded-object re-association by the joint likelihood p(traj | world-model) · p(appearance | DINOv2 prototype), eliminating the IDS/FRAG cliff at max_age.
Generative trajectory priors conditioned on a frozen self-supervised appearance prototype — re-identification across occlusion is a joint likelihood over (latent dynamics, appearance manifold), not an either/or cost.
Condition Kang's lightweight RSSM not only on map+ego but on the frozen DINOv2 CLS prototype from Park's EMA track memory, so the K=8 latent rollouts are appearance-aware multimodal futures rather than purely kinematic ones. At re-detection time, score each candidate with a joint cost: Mahalanobis in RSSM latent (Kang) + cosine to DINOv2 prototype (Park), normalized via learned temperature on mini_train. The prototype anchors the world model's multimodality — bicycles and motorcycles, which Kang's RSSM tends to over-disperse due to sparse training, get pulled toward their appearance-consistent mode. This attacks the dominant failure on mini_val (IDS=33, FRAG=22) without retraining either the detector or the SSL backbone.
Combines: B-001.card-6 B-001.card-7
Novelty: Neither card alone fuses appearance into the dynamics prior — Park uses DINOv2 as a static association cost, Kang uses RSSM rollouts as a pure motion prior. Conditioning the generative future on a frozen SSL prototype is non-obvious because it crosses representation-learning and world-modeling stacks that normally live in different teams; the resulting prior is appearance-disambiguated motion, which is what occlusion re-id actually requires.
Expected gain: +0.06 to +0.10 AMOTA (sum of IDS/FRAG attack surfaces of both parents, with sub-additive overlap; dominant on bicycle/motorcycle/pedestrian)
Effort: 4-5 days (RSSM stub from card-7 + DINOv2 prototype from card-6, plus joint-cost calibration)
Papers (4)
- Oquab DINOv2 TMLR 2024 §5.2
- Hafner DreamerV3 §3
- Yang GenAD §3.2
- Bardes V-JEPA §4.1
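The joint likelihood in card-8 can be sketched as a sum of negative log terms. The exponential forms, the temperature defaults, and the candidate tuples below are illustrative assumptions; the card only says that the two terms are combined and normalized by a learned temperature.

```python
def joint_cost(maha_sq, cos_dist, temp_traj=1.0, temp_app=0.1):
    """Negative log of p(traj) * p(appearance), up to additive constants.

    Assumes p(traj) ~ exp(-0.5 * maha_sq / temp_traj) and
    p(appearance) ~ exp(-cos_dist / temp_app); both forms are illustrative.
    """
    if temp_traj <= 0 or temp_app <= 0:
        raise ValueError("temperatures must be positive")
    return 0.5 * maha_sq / temp_traj + cos_dist / temp_app

def best_candidate(candidates, **temps):
    """Return the (track_id, cost) pair with the lowest joint cost."""
    scored = [(tid, joint_cost(m, c, **temps)) for tid, m, c in candidates]
    return min(scored, key=lambda pair: pair[1])

# (track_id, squared Mahalanobis distance in latent space, cosine distance to prototype)
candidates = [("a", 2.0, 0.30), ("b", 4.0, 0.05)]
winner = best_candidate(candidates)
```

In this toy case track "a" is kinematically closer but track "b" wins on the joint cost, which is the disambiguation behaviour the card is after.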
B-001.card-4 (#5 rank)
Jaehyun Lee
Replace per-object constant-velocity Kalman with an ego-motion-compensated IMM (CV + CTRV + CA) filter operating in T_world_obj with explicit covariance propagation through T_world_ego
vision-robotics: ego-motion-aware state estimation in the world frame
AB3DMOT's vanilla CV Kalman in the global frame leaks ego-pose uncertainty into the object track — when the ego yaws through a turn, the predicted T_world_obj drifts even for static cars, opening the center-distance gate and causing ID switches at intersections. Swap the single CV model for an Interacting Multiple Model filter with CV (parked / cruising), CTRV (turning vehicles, bicycles), and CA (braking) sub-filters, and explicitly propagate ego-pose covariance Σ_T_world_imu from nuScenes ego_pose into the predicted measurement covariance so the Mahalanobis gate tightens on confident frames and widens during fast yaw. State stays [x,y,z,θ,l,w,h,vx,vy,vz,ω] in world frame with the IMM mixing handling motion-class transitions; gate becomes Mahalanobis on the innovation, not raw center distance. This directly attacks both IDS (33) and FRAG (22) because the CTRV sub-filter survives the 1-2 frame occlusions during turns where CV diverges.
Expected gain: +0.04 to +0.08 AMOTA (IMM-CTRV vs CV on nuScenes: Chiu et al. 'Probabilistic 3D Multi-Object Tracking for Autonomous Driving' §4.3 reports +5.1 AMOTA on full val; speculative that mini_val behaves similarly, but turn-heavy scenes 0103/0916 should over-index)
Effort: 1.5-2 days
Papers (4)
- Chiu et al., 'Probabilistic 3D Multi-Object Tracking for Autonomous Driving', IROS 2021, §3.2 (Mahalanobis gating) and §4.3 (ablation table)
- Weng et al., 'AB3DMOT', IROS 2020, §III-B (state model — the thing we are replacing)
- Bar-Shalom et al., 'Estimation with Applications to Tracking and Navigation', Ch.11 (IMM derivation)
- Shan et al., 'LVI-SAM', ICRA 2021, §III-C (ego-motion covariance propagation pattern we mirror)
Domain notes
Coordinate frames: keep filter state in T_world_obj, propagate via predict() in world frame, but compute innovation covariance as H Σ_pred H^T + R_det + J_ego Σ_T_world_imu J_ego^T where J_ego is the Jacobian of the detection-to-world transform w.r.t. ego pose. nuScenes ego_pose has no published covariance — bootstrap Σ_T_world_imu as diag(0.1m, 0.1m, 0.05m, 0.01rad, 0.01rad, 0.02rad)^2 from the localization paper's stated accuracy, treat as a tunable.
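The innovation covariance in the note above can be sketched in a planar reduction, assuming an ego pose of (x, y, yaw) and a position-only measurement. The helper names are hypothetical, and mapping the card's six-element diagonal onto three planar entries (0.1 m, 0.1 m, 0.02 rad) is an assumption made for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def add(*mats):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(m[i][j] for m in mats) for j in range(len(mats[0][0]))]
            for i in range(len(mats[0]))]

def ego_jacobian(px, py, ego_yaw):
    """d(world xy) / d(ego x, ego y, ego yaw) for p_world = R(yaw) p_ego + t."""
    s, c = math.sin(ego_yaw), math.cos(ego_yaw)
    return [[1.0, 0.0, -s * px - c * py],
            [0.0, 1.0, c * px - s * py]]

def innovation_cov(P_pos, R_det, sigma_ego, px, py, ego_yaw):
    """S = P_pos + R_det + J Sigma_ego J^T, with H already applied to the state covariance."""
    J = ego_jacobian(px, py, ego_yaw)
    ego_term = matmul(matmul(J, sigma_ego), transpose(J))
    return add(P_pos, R_det, ego_term)

sigma_ego = [[0.1 ** 2, 0.0, 0.0], [0.0, 0.1 ** 2, 0.0], [0.0, 0.0, 0.02 ** 2]]
P_pos = [[0.04, 0.0], [0.0, 0.04]]     # assumed predicted position covariance
R_det = [[0.01, 0.0], [0.0, 0.01]]     # assumed detector noise
S_far = innovation_cov(P_pos, R_det, sigma_ego, px=40.0, py=0.0, ego_yaw=0.0)
S_near = innovation_cov(P_pos, R_det, sigma_ego, px=5.0, py=0.0, ego_yaw=0.0)
```

With these numbers the lateral variance is about 0.70 m² at 40 m and about 0.07 m² at 5 m: yaw uncertainty is multiplied by range, so the gate widens most for distant objects.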
B-001.card-10 (#6 rank, cross-pollination)
director-jiwoo-han
Joint per-class calibration of {score threshold, NMS, IMM mode prior, Mahalanobis gate} via a single coordinate-descent sweep on mini_train, so detector recall lift and ego-aware filter gating co-adapt instead of fighting each other.
Calibration-as-control: per-class detector thresholds and IMM mode-probability priors are a single joint precision/recall surface, not two independent knobs.
Yoo's per-class threshold sweep and Lee's IMM filter share a hidden coupling: lowering the bicycle threshold to 0.05 floods the IMM with low-SNR detections that the CTRV/CA modes will over-trust unless mode priors are simultaneously retuned. Cast {score_thr_c, NMS_c, IMM_prior_c, gate_c} as a 4·C-dim vector and run coordinate descent on AMOTA over mini_train, exploiting AMOTA's recall-ladder structure (no gradient needed, ~50 evaluations). This is methodologically analogous to joint quantization-scale + activation-clipping calibration in GPTQ. Cheap, no retraining, but only meaningful when both parents land first as substrate.
Combines: B-001.card-3 B-001.card-4
Novelty: Each parent treats its knobs as independent; the cross-pollination point is that the IMM's mode-probability prior is itself a per-class hyperparameter that interacts with the detector's recall regime, and calibrating them jointly is non-obvious because the two lived in different pillars (efficient-vision vs. robotics-state-estimation).
Expected gain: +0.02 to +0.04 AMOTA on top of card-3 + card-4 stack (recovers the interaction term they leave on the table)
Effort: 1 day (after card-3 and card-4 land)
Papers (4)
- Weng AB3DMOT §III-B
- Chiu IROS 2021 §4.3
- Yin CenterPoint §4.3
- Frantar GPTQ §3
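The gradient-free sweep in card-10 can be sketched as coordinate descent over discrete grids. The objective below is a hypothetical toy with an explicit coupling term; in the real setting it would be an AMOTA evaluation on mini_train, and the parameters would be per-class.

```python
def coordinate_descent(objective, grids, start, sweeps=3):
    """Maximize a black-box objective by optimizing one parameter at a time.

    grids: dict name -> list of candidate values; start: dict name -> initial value.
    """
    best = dict(start)
    best_score = objective(best)
    for _ in range(sweeps):
        improved = False
        for name, values in grids.items():
            for value in values:
                trial = dict(best, **{name: value})
                score = objective(trial)
                if score > best_score:
                    best, best_score, improved = trial, score, True
        if not improved:
            break  # converged: a full sweep changed nothing
    return best, best_score

def toy_objective(p):
    """Stand-in for AMOTA: the best gate depends on the chosen score threshold."""
    return -100.0 * (p["score_thr"] - 0.05) ** 2 - (p["gate"] - (3.0 + 10.0 * p["score_thr"])) ** 2

grids = {"score_thr": [0.05, 0.10, 0.20], "gate": [3.0, 3.5, 4.0, 5.0]}
best, best_score = coordinate_descent(toy_objective, grids, start={"score_thr": 0.20, "gate": 3.0})
```

Coordinate descent can stall in a local optimum when the coupling is strong, so a second run from a different starting point is a cheap sanity check.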
B-001.card-7 (#7 rank)
Yuna Kang
Replace constant-velocity Kalman with a GenAD-style latent world model that rolls out occluded-object futures 1-2s ahead, scoring re-association against generative trajectory priors instead of dropping at max_age=2
world models + closed-loop simulation: action-conditioned latent dynamics for future-aware tracking
Train a lightweight action-conditioned latent dynamics head (DreamerV3-style RSSM, ~5M params) on nuScenes train-split agent trajectories conditioned on map raster + ego action, producing multimodal 2s future rollouts in BEV latent space. During tracking, when a track misses detections, instead of killing it at max_age=2, propagate it via K=8 sampled latent rollouts and keep the track alive for up to 10 frames (1s @ 10Hz); on re-detection, score association by likelihood under the generative trajectory prior (Mahalanobis in latent space) rather than CV Kalman gating. This directly attacks the 22 FRAG / 33 IDS failure mode because occlusion-induced gaps are exactly where CV prediction diverges and where multimodal future priors recover the correct branch.
Expected gain: +0.04 to +0.08 AMOTA (speculative for nuScenes-mini; anchored to GenAD CVPR 2024 Tab.3 showing ~7% mAP gain on motion-forecasting-aided perception, and DriveDreamer ECCV 2024 §4.3 reporting 12% FDE reduction under occlusion — FRAG/IDS are the dominant AMOTA loss terms here, so a ~30-50% FRAG reduction plausibly closes most of the 0.471->0.55 gap)
Effort: 6-7 days (2d data prep on nuScenes train trajectories, 2d RSSM training, 1-2d integration into AB3DMOT association step, 1d eval sweep on mini_val)
Papers (4)
- Hafner et al. DreamerV3 (Nature 2025) §3 'World Model Learning' — RSSM latent dynamics formulation
- Yang et al. GenAD (CVPR 2024) §3.2 'Instance-Centric Scene Tokenization' and Tab.3 forecasting-aided perception gains
- Wang et al. DriveDreamer (ECCV 2024) §4.3 occlusion-robust future prediction ablation
- Hu et al. GAIA-1 (Wayve tech report 2023) §4 multimodal generative rollouts for counterfactual evaluation
Domain notes
Critical: do the rollouts in LATENT space, not pixel/BEV-image space — DreamerV3 §3 is explicit that latent dynamics beat pixel prediction for downstream control/association tasks at this compute budget. max_age=2 is a symptom of a missing motion prior, not a hyperparameter to tune — raising max_age without a better prior just trades FRAG for FP.
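A sketch of how a re-detection could be scored against K sampled rollouts for a coasting track. This swaps in plain Euclidean distance in BEV for the latent-space Mahalanobis score the card describes, and the rollouts are hand-written lists rather than RSSM samples; the function name and the gate value are hypothetical.

```python
import math

def best_rollout_match(rollouts, detection_xy, step, gate_m=2.0):
    """Pick the sampled rollout whose position at `step` is closest to a re-detection.

    rollouts: list of K trajectories, each a list of (x, y) per future frame.
    Returns (rollout_index, distance), or None when no rollout falls inside the gate.
    """
    best = None
    for k, traj in enumerate(rollouts):
        if step >= len(traj):
            continue  # this rollout does not extend to the requested frame
        dist = math.dist(traj[step], detection_xy)
        if dist <= gate_m and (best is None or dist < best[1]):
            best = (k, dist)
    return best

# Two hypothetical futures for an occluded cyclist: straight ahead and a left turn.
rollouts = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    [(1.0, 0.0), (1.8, 0.6), (2.2, 1.5)],
]
match = best_rollout_match(rollouts, detection_xy=(2.3, 1.4), step=2)
miss = best_rollout_match(rollouts, detection_xy=(10.0, 10.0), step=2)
```

Here the turning rollout explains the re-detection far better than the straight one, which is the case where a single constant-velocity prediction would tend to lose the track.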
B-001.card-1 (#8 rank)
Minseo Park
Complete sparse bicycle/motorcycle point clusters with a class-conditioned occupancy decoder before feeding boxes to AB3DMOT, so detection recall (and therefore AMOTA) on small classes recovers.
Tracking is shape association over time — small classes fail because CenterPoint's per-frame BEV detector sees ~5-15 LiDAR points on bicycles/motorcycles, which is below the geometric prior threshold where shape becomes recoverable from a single sweep.
Run the existing CenterPoint pillar02 detector at a lowered score threshold (0.05) to harvest low-confidence bicycle/motorcycle/pedestrian proposals, crop the LiDAR points inside each proposal box (typically 5-30 points for cyclists at >20m), and pass them through a small class-conditioned Occupancy Network decoder (Mescheder et al. §4.1) pretrained on ShapeNet bicycle/motorcycle CAD aligned to nuScenes scale. The completed implicit shape gives a tight oriented bounding box via marching-cubes + PCA, which both rescores the proposal (occupancy confidence) and refines the box dimensions/yaw fed to AB3DMOT — recovering the recall the per-frame detector is bleeding on the two classes that drag the macro-average. This is exactly the partial-point-cloud-completion regime my CVPR 2020 LOF paper targeted: at <30 points the implicit field is the only representation that doesn't collapse to a degenerate box.
Expected gain: +0.03 to +0.06 AMOTA on the full mini_val macro-average (speculative on AMOTA itself; anchored to Mescheder et al. CVPR 2019 §5.2 reporting IoU 0.571 -> 0.778 on cars from 300-point partial inputs, and Yuan et al. PCN 3DV 2018 §4.2 showing ~40% recall lift on sub-50-point objects).
Effort: 3-4 days (1 day ShapeNet->nuScenes scale alignment + class subset extraction, 1 day OccNet decoder fine-tune on cropped LiDAR boxes, 1 day proposal-rescoring + box-refinement pipeline wired into AB3DMOT input, 0.5 day mini_val eval).
Papers (3)
- Mescheder et al., Occupancy Networks, CVPR 2019, §3.1 (occupancy decoder) and §4.1 (single-view completion setup)
- Yuan et al., PCN: Point Completion Network, 3DV 2018, §4.2 (sparse-input recall)
- Park et al., Latent Occupancy Fields for Partial Point Cloud Completion, CVPR 2020, §3.2 (conditioning on <100 points)
Domain notes
Assumptions: nuScenes camera+LiDAR calibration trusted (mini_val ego-pose noise is sub-cm so ignore it); bicycles/motorcycles treated as rigid for single-frame completion. Trailer AMOTA is NaN because mini_val has 0 trailer GT, so no change on this card can move it. OccNet is the right tool here vs. 3DGS because 3DGS assumes dense multi-view init (Kerbl §3.1).
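The PCA half of the "marching-cubes + PCA" box fit in this card can be sketched in bird's-eye view. The input is assumed to be 2D points sampled from the completed surface (the marching-cubes step is not shown), and the function name is hypothetical.

```python
import math

def pca_box_bev(points):
    """Fit an oriented BEV box (cx, cy, length, width, yaw) to 2D points via PCA."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the principal axis of the 2x2 covariance matrix.
    yaw = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(yaw), math.sin(yaw)
    # Project onto the principal axes and measure extents there.
    along = [(p[0] - mx) * c + (p[1] - my) * s for p in points]
    across = [-(p[0] - mx) * s + (p[1] - my) * c for p in points]
    length = max(along) - min(along)
    width = max(across) - min(across)
    # Re-center on the midpoint of the extents rather than the point mean.
    mid_a = 0.5 * (max(along) + min(along))
    mid_c = 0.5 * (max(across) + min(across))
    cx = mx + mid_a * c - mid_c * s
    cy = my + mid_a * s + mid_c * c
    return cx, cy, length, width, yaw

box = pca_box_bev([(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)])
diag = pca_box_bev([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
```

PCA recovers the box axis but not its direction, so the yaw is ambiguous by 180 degrees; a velocity estimate or the detector's heading would be needed to resolve front from back.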
B-001.card-9 (#9 rank, cross-pollination)
director-jiwoo-han
Raw-domain RGB-conditioned class-conditional OccNet: complete sparse two-wheeler LiDAR clusters by cross-attending to Choi's differentiable-ISP linear-sensor features at the frustum projection, producing tighter boxes than either LiDAR-only OccNet or RGB-only frustum lifts.
Sensor-physics-aware shape completion — the prior that completes a 15-point bicycle cluster should be conditioned on the raw-domain camera evidence of that same instance, not on LiDAR geometry alone.
Take Park's OccNet completion head and condition its latent on a per-instance feature pooled from Choi's diff-ISP-fronted detector at the LiDAR-cluster frustum, in linear sensor space (Brooks unprocessing) so twilight/shaded bicycles still produce signal. The completion network sees ~15 LiDAR points plus a raw-domain RGB context vector, which disambiguates pose for sparse clusters where geometric prior alone is multi-modal (bicycle vs. motorcycle frame, heading flip). Marching-cubes + PCA on the resulting implicit field gives the oriented box, fed to AB3DMOT at score 0.05. The raw-domain conditioning is the key: sRGB-conditioned completion underperforms on dim frames, which is exactly where the small-class recall problem lives on nuScenes.
Combines: B-001.card-1 B-001.card-5
Novelty: Shape completion and differentiable ISP are usually disjoint research stacks — completion lives in 3D geometry, ISP lives in computational photography. Conditioning an implicit shape decoder on raw-domain camera features specifically because that's where the small-class signal survives the ISP cliff is non-obvious; the parents independently target small-class recall but neither fuses the two evidence channels at the latent-of-shape level.
Expected gain: +0.05 to +0.09 AMOTA (super-additive on bicycle/motorcycle in twilight subset of mini_val; sub-additive on car/ped where both parents saturate)
Effort: 5-6 days (Park's completion head + Choi's diff-ISP front + frustum cross-attention glue)
Papers (4)
- Mescheder OccNet CVPR 2019 §3.1
- Brooks Unprocessing §3
- Tseng SIGGRAPH 2019 §5
- Yuan PCN 3DV 2018 §4.2
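The cross-attention glue in card-9 can be illustrated with a single-head scaled dot-product attention, where LiDAR point features act as queries and camera features at the frustum act as keys and values. This sketch has no learned projections, and all feature vectors are made-up toy values; it shows the mechanism only, not the proposed architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

lidar_feats = [[1.0, 0.0]]                 # one query per LiDAR point (toy value)
cam_keys = [[0.0, 0.0], [0.0, 0.0]]        # two camera tokens with identical keys
cam_vals = [[2.0, 0.0], [0.0, 2.0]]
context = cross_attend(lidar_feats, cam_keys, cam_vals)
```

With identical keys the attention weights are uniform, so the output is the mean of the two value vectors; in the real model the keys would differ and the weights would pick out the camera evidence that matches each point.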
B-001.card-5 (#10 rank)
Soyoung Choi
Fuse a raw-domain camera detector for small/dim classes (bicycle, pedestrian) into AB3DMOT via late detection-level fusion, with a lightweight differentiable ISP tuned for detection loss on twilight/shaded frames
low-level + physics-based vision (raw imaging, differentiable ISP, sensor-perception co-optimization)
The 6 camera streams are unused, yet bicycles (41 GT) and small pedestrians (1088 GT) are exactly where LiDAR sparsity hurts and where camera evidence is strongest. I would unprocess the nuScenes 8-bit sRGB JPEGs back to a linear sensor space (Brooks-style inverse ISP: inverse-tone, inverse-WB, inverse-CCM, re-Bayer with Poisson-Gaussian noise) and train a small RetinaNet/FCOS head on that linear tensor, then plug a lightweight differentiable ISP (black-level, WB, gamma, local tone) in front whose params are optimized against the detector's classification+box loss on mini_train twilight/shaded frames. The resulting 2D detections are lifted to 3D frustums, gated by LiDAR points, and merged with CenterPoint detections before AB3DMOT association, so small/far bicycles and pedestrians that CenterPoint misses get recovered as new tracks, raising recall (the dominant AMOTA term).
Expected gain: +0.04 to +0.08 AMOTA (speculative for mini_val; anchored to Tseng SIGGRAPH'19 reporting ~30% relative detection-AP gain from ISP hyperparam co-opt §5, and Mosleh CVPR'20 §4 showing 4-7 AP from HW-in-loop ISP tuning; small-object recall gains of that magnitude typically translate to 0.04-0.08 AMOTA on small-class-heavy splits)
Effort: 3-4 days (1 day unprocessing pipeline + raw cache for mini_val/train, 1.5 days train small raw-domain 2D detector + diff-ISP block, 0.5 day frustum lift + LiDAR gating, 1 day fuse-with-CenterPoint and re-run AB3DMOT)
Papers (4)
- Brooks et al., Unprocessing Images for Learned Raw Denoising, CVPR 2019, §3 (inverse ISP pipeline) and §4 (noise model)
- Tseng et al., Hyperparameter Optimization in Black-box Image Processing, SIGGRAPH 2019, §5 (downstream-task ISP tuning)
- Mosleh et al., Hardware-in-the-Loop End-to-End ISP Optimization, CVPR 2020, §4 (detection-driven ISP gains)
- Chen et al., Learning to See in the Dark, CVPR 2018, §3 (raw-domain low-light pipeline)
Domain notes
nuScenes ships 8-bit sRGB JPEGs from a Basler acA1600-60gc (1600x900, ~12-bit native, exposure varies 10-30ms across the rig); true raw is unavailable, so we MUST unprocess. Risk: mini_val is tiny so AMOTA gain has high variance; report per-class AMOTA for bicycle and pedestrian separately as the load-bearing numbers.
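Two stages of the unprocessing step can be sketched per pixel: inverting the transfer curve and dividing out white-balance gains. This uses the standard sRGB decoding curve as a stand-in for the inverse tone and gamma stages, the gain values are placeholders rather than calibrated camera parameters, and the inverse CCM, re-mosaicing, and noise injection from the card are omitted.

```python
def srgb_to_linear(c):
    """Invert the sRGB transfer function for one channel value in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4

def unprocess_pixel(rgb8, wb_gains=(2.0, 1.0, 1.6)):
    """Map an 8-bit sRGB pixel to approximate linear, pre-white-balance values."""
    linear = [srgb_to_linear(v / 255.0) for v in rgb8]
    # Undo white balance by dividing out per-channel gains (placeholder values).
    return [lin / g for lin, g in zip(linear, wb_gains)]

white = unprocess_pixel((255, 255, 255))
shadow = unprocess_pixel((20, 20, 20))
```

Dark pixels come out very small after linearization, which shows how little of the 8-bit range is left for shaded regions and why the card treats twilight frames as the hard case.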
