Class-aware Mahalanobis + CTRV in AB3DMOT for AMOTA ≥ 0.55

pose-tracking · Hyunsu Kim · from B-001.card-2

Goal

Metric: AMOTA ≥ 0.55 (baseline 0.471)

Eval fixture: nuScenes v1.0-mini, mini_val (scene-0103 + scene-0916)

Baseline artifact: centerpoint-ab3dmot-nusc-mini-MEASURED-001

Achieved: 0.4028 ✗ → centerpoint-ab3dmot-nusc-mini-MEASURED-002

Approach

Class-aware Mahalanobis + CTRV in AB3DMOT

Replace the center-distance cost in AB3DMOT's Hungarian association with a class-conditioned blend of 3D-GIoU and a velocity-aware Mahalanobis term (using CenterPoint's per-detection vx, vy). Switch the Kalman filter's motion model from constant velocity (CV) to CTRV (constant turn rate and velocity). Tune hyperparameters per class: for bicycle/motorcycle, drop min_hits to 1 and raise max_age to 3; keep car/truck at min_hits=3, max_age=2. Tracker-side only: no detector retrain, and the change composes with any later detector/SSL/world-model upgrade.
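A minimal sketch of the per-pair association cost described above, assuming the forms given in code-plan Steps 2 and 3: a center-distance pre-gate that returns a 1e6 sentinel, then w_giou·0.5·(1 − GIoU) + w_vel·(1 − exp(−d²/2)). This is dependency-free Python, not the code in run_tracking.py; the function names, the argument layout, and passing GIoU in as a precomputed number are illustrative assumptions. The predicted velocity is read from either a CV state [x, y, vx, vy] or a CTRV state [x, y, theta, v, omega], mirroring the layout-agnostic helper the card describes.

```python
import math

# (w_giou, w_vel) per class, values copied from code-plan Step 3.
CLASS_WEIGHTS = {
    "car": (0.4, 0.6), "truck": (0.4, 0.6), "bus": (0.4, 0.6),
    "pedestrian": (0.6, 0.4), "bicycle": (0.7, 0.3), "motorcycle": (0.7, 0.3),
}
DEFAULT_WEIGHTS = (0.5, 0.5)
GATE_SENTINEL = 1e6                   # marks pairs rejected by the distance pre-gate
VEL_COV = ((1.0, 0.0), (0.0, 1.0))    # m^2/s^2, the plan's diagonal default


def predicted_velocity(state):
    """(vx, vy) from a CV state [x, y, vx, vy] or a CTRV state [x, y, theta, v, omega]."""
    if len(state) == 4:
        return state[2], state[3]
    if len(state) == 5:
        theta, v = state[2], state[3]
        return v * math.cos(theta), v * math.sin(theta)
    raise ValueError(f"unexpected state length {len(state)}")


def velocity_cost(det_vel, state, cov=VEL_COV):
    """1 - exp(-d^2/2), with d^2 the Mahalanobis distance of the velocity residual."""
    pvx, pvy = predicted_velocity(state)
    rx, ry = det_vel[0] - pvx, det_vel[1] - pvy
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("cov must be positive definite")
    # r^T cov^-1 r, using cov^-1 = [[d, -b], [-c, a]] / det for a 2x2 matrix
    d2 = (d * rx * rx - (b + c) * rx * ry + a * ry * ry) / det
    return 1.0 - math.exp(-0.5 * d2)


def pair_cost(cls, det_xy, det_vel, state, giou, gate):
    """Blended cost in [0, 1], or GATE_SENTINEL if the centers are farther apart than gate."""
    if math.hypot(det_xy[0] - state[0], det_xy[1] - state[1]) > gate:
        return GATE_SENTINEL
    w_giou, w_vel = CLASS_WEIGHTS.get(cls, DEFAULT_WEIGHTS)
    return w_giou * 0.5 * (1.0 - giou) + w_vel * velocity_cost(det_vel, state)


ctrv_state = [0.0, 0.0, 0.0, 2.0, 0.0]   # heading 0, speed 2 m/s, no turn
print(pair_cost("car", (0.2, 0.0), (2.0, 0.0), ctrv_state, giou=1.0, gate=2.0))
print(pair_cost("bicycle", (0.0, 0.0), (3.0, 0.0), ctrv_state, giou=0.0, gate=2.0))
```

Because each class's weights sum to 1 and both terms lie in [0, 1], a real match cost can never be confused with the 1e6 gate sentinel, which is what lets the existing `cost < 1e6` acceptance check stay unchanged.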

References (4)

Chiu et al., 'Probabilistic 3D Multi-Modal MOT', WACV 2021 §3.2 (Mahalanobis association) §3.3 (per-class hyperparameters); Table 2 reports 0.561 → 0.626 AMOTA on nuScenes val going from AB3DMOT center-dist+CV to Mahalanobis+CTRV.
Weng et al., 'AB3DMOT', IROS 2020 §III-B (Hungarian + 3D-IoU baseline we are replacing).
Yin et al., 'CenterPoint', CVPR 2021 §3.3 (velocity head we exploit in the Mahalanobis term).
Pang et al., 'SimpleTrack', ECCV 2022 workshop §4 (3D-GIoU > center-distance ablation, +2.1 AMOTA).

Code plan

01✓ DONE (iter 2) — Step 0 (NEW — infra): Parametrize the Makefile run id. Change `RUN_ID := MEASURED-001` to `RUN_ID ?= MEASURED-001` so `make track RUN_ID=MEASURED-002` writes to runs/MEASURED-002/ without clobbering the baseline. Because this mission changes ONLY tracker post-processing (not the detector), seed runs/MEASURED-002/ by copying the baseline detections: `cp runs/MEASURED-001/detections.pkl runs/MEASURED-002/` and touch the .marker.detect so `make track` does not retrigger the ~5 min `make detect`. File: algorithm/pose-tracking/Makefile.
02✓ DONE (iter 3) — Step 1: In algorithm/pose-tracking/src/measured/run_tracking.py add `cost_3d_giou(box_a, box_b)`, a generalized 3D IoU between two oriented BEV boxes (x,y,l,w,theta) extruded in z by (z,h). Compute the BEV intersection via shapely or a rotated-rectangle SAT/clip routine, multiply by the z-overlap to get the 3D intersection, divide by the 3D union (vol_a + vol_b − intersection) to get IoU, then subtract the GIoU penalty (enclosing_volume − union)/enclosing_volume. Return value in [-1,1]; the cost contribution is 0.5*(1 - giou). Add a small pytest-style self-check in `__main__` guarded behind `--selftest` (identical boxes → giou≈1, disjoint → giou<0). Implemented self-contained (math-only) via Sutherland–Hodgman clipping; no shapely dependency added. Selftest passes locally: identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833.
03✓ DONE (iter 4) — Step 2: Add `cost_mahalanobis_velocity(det, track, cov)` in run_tracking.py — residual r = [det.vx - track.v*cos(theta), det.vy - track.v*sin(theta)] (predicted velocity from CTRV state), Mahalanobis d² = rᵀ cov⁻¹ r with cov = diag([1.0, 1.0]) m²/s² (module constant VEL_COV, tunable). Normalize to a bounded cost via 1 - exp(-d²/2) so it composes additively with the GIoU term. Implemented: added module constant VEL_COV=[[1,0],[0,1]], cost_mahalanobis_velocity(det, track, cov=None) using numpy for Σ⁻¹ and the 1-exp(-d²/2) squash, plus a _track_pred_velocity(track) helper that reads vx,vy straight from the filter state and is agnostic to CV (dim_x==4, [x,y,vx,vy]) vs the upcoming CTRV (dim_x==5, [x,y,theta,v,omega] → vx=v·cosθ, vy=v·sinθ) layout, so Step 4 needs no change here. py_compile passes.
04✓ DONE (iter 5) — Step 3: Replace `step_per_class()`'s center-distance cost with a class-conditioned blend `w_giou*0.5*(1-giou) + w_vel*(1-exp(-d²/2))`, gated by GATE_M (keep the distance gate as a cheap pre-filter to set 1e6 for far pairs). Add module dict CLASS_WEIGHTS = {car:(0.4,0.6), truck:(0.4,0.6), bus:(0.4,0.6), pedestrian:(0.6,0.4), bicycle:(0.7,0.3), motorcycle:(0.7,0.3)} (w_giou,w_vel); default (0.5,0.5). Thread `cls` into step_per_class signature. Implemented: added CLASS_WEIGHTS module dict (weights sum to 1 so blended cost stays in [0,1) and the 1e6 gate sentinel remains separable — the existing `cost[r,c] < 1e6` match-acceptance check is unchanged); changed step_per_class signature to (detections, tracks, gate, cls); inside the cost loop the center-distance is now only a cheap pre-gate (>gate → 1e6, `continue`), and for surviving pairs cost = w_giou*0.5*(1-cost_3d_giou(t_box,d)) + w_vel*cost_mahalanobis_velocity(d,t), where t_box is built from the track's (x,y from kf, z/l/w/h/theta auxiliaries); updated the single call site to pass cls. py_compile OK and the dependency-free --selftest still passes (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833).
05✓ DONE (iter 6) — Step 4: Replace the CV KalmanFilter in Track.__init__ with a CTRV model. Use filterpy.kalman.UnscentedKalmanFilter (already importable) with state [x, y, theta, v, omega], dim_z=3 measuring [x,y,theta]; define fx (CTRV transition with dt=0.5s, ω→0 limit handled) and hx. Keep z, w, l, h, conf as constant auxiliaries updated on measurement. Update predict()/update()/to_dict() to read v,omega and emit vx=v*cos(theta), vy=v*sin(theta). Keep a CV fallback flag in case UKF is unstable on mini. Implemented: added module flag USE_CTRV=True (CV fallback) + DT=0.5; _ctrv_fx (closed-form turn with the |ω|<1e-6 straight-line limit), _ctrv_hx, plus circular-angle handling for the heading dim (residual_x/residual_z=_angle_residual wrapping index 2, x_mean_fn/z_mean_fn=_angle_mean via atan2) so the UKF mean/innovation don't break across the ±π wrap. _make_ctrv_ukf seeds x=[x,y,theta, v0, 0] with v0 = vx·cosθ+vy·sinθ (signed speed along heading from CenterPoint's velocity head), P=diag(10,10,1,10,1), R=diag(1,1,0.5), Q=diag(.01,.01,.01,.1,.1) (let v,ω adapt), MerweScaledSigmaPoints(α=1e-3,β=2,κ=0). Track now branches CTRV vs CV in __init__/update and exposes layout-agnostic accessors pos()/velocity()/heading() (CTRV state is a 1-D UKF vector x[i]; CV is a 2-D column vector x[i,0]); to_dict and step_per_class's t_box now read t.pos()/t.heading() and _track_pred_velocity delegates to track.velocity(). py_compile OK; dependency-free --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833). UKF runtime path exercises only inside Docker at Step 6 (filterpy/numpy absent on this host).
06✓ DONE (iter 7) — Step 5: Replace global MIN_HITS/MAX_AGE with per-class dict CLASS_HYPER = {bicycle:(1,3,0.05), motorcycle:(1,3,0.05), pedestrian:(3,2,0.10), car:(3,2,0.15), truck:(3,2,0.15), bus:(3,2,0.15), trailer:(3,2,0.15)} = (min_hits, max_age, score_thr). Apply score_thr as a detection pre-filter, min_hits at track-confirmation, max_age at track-deletion (replace the hardcoded `t.misses <= MAX_AGE` and `t.hits < MIN_HITS`). Implemented: added module dict CLASS_HYPER={bicycle/motorcycle:(1,3,0.05), pedestrian:(3,2,0.10), car/truck/bus/trailer:(3,2,0.15)} = (min_hits,max_age,score_thr) plus DEFAULT_HYPER=(MIN_HITS,MAX_AGE,0.0) fallback for unlisted (BEV-only static) classes. Wired all three: score_thr is a per-class detection pre-filter applied in main() (cls_dets=[d for d in cls_dets if d['conf']>=score_thr]) BEFORE step_per_class so weak detections never seed/sustain a track; min_hits replaces the global MIN_HITS at the track-emission gate (`if t.hits < min_hits: continue`); max_age replaces the global MAX_AGE at deletion inside step_per_class (looked up via CLASS_HYPER.get(cls,DEFAULT_HYPER)[1], return filter now `t.misses <= max_age`). py_compile OK; dependency-free --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833). This was the last coding-only step — Step 6 is the first Docker-touching run/eval step.
07✓ DONE (iter 9) — Step 5b (FIX — coding, added iter 8 after Step-6 runtime failure): The CTRV UKF crashed at runtime in `make track` with `numpy.linalg.LinAlgError: 3-th leading minor of the array is not positive definite` from filterpy's `sigma_points()` Cholesky `sqrt((λ+n)·P)` on the heading dim. Root cause: `MerweScaledSigmaPoints(n=5, alpha=1e-3, beta=2, kappa=0)` (run_tracking.py:164) gives λ+n≈5e-6, an extremely tight sigma spread (√≈0.0022); combined with the custom atan2 circular angle-mean (`_angle_mean`), the reconstructed posterior covariance loses positive-definiteness numerically. Fix: (a) widen the sigma spread to a numerically stable value — `alpha=0.3` (or 1.0), keep beta=2, kappa=0 so λ+n is O(1); (b) defensively symmetrize + jitter-regularize P in `predict()`/`update()` after each UKF step: `self.kf.P = (self.kf.P + self.kf.P.T)/2 + 1e-6*np.eye(5)` to guarantee PD before the next sigma_points() call; (c) keep USE_CTRV=True (CV fallback remains a one-flag escape hatch if it still misbehaves). Implemented all three: alpha 1e-3→0.3 in _make_ctrv_ukf (with an explanatory comment deriving λ+n); added Track._regularize_P() = (P+Pᵀ)/2 + 1e-6·I called after every UKF predict() and update() (CTRV path only, gated on USE_CTRV); USE_CTRV stays True. Verified py_compile OK and the dependency-free geometry --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833). The UKF PD fix itself exercises only inside Docker at Step 6 (filterpy/numpy absent on this host). Next: Step 6 re-run `make track`+`make eval RUN_ID=MEASURED-002`.
08✓ DONE (iter 11) — Step 5c (FIX — coding, added iter 10 after Step-6 re-run STILL crashed non-PD): The Step-5b fix (alpha 1e-3→0.3 + 1e-6·I jitter) did NOT resolve the crash — identical `numpy.linalg.LinAlgError: 3-th leading minor not positive definite` at filterpy sigma_points() Cholesky on the heading dim, on the first predict() of an already-updated track. Refined root cause: with alpha=0.3,n=5,kappa=0 the sigma-point SPREAD is fine (λ+n=0.45>0) but the central COVARIANCE weight W_c[0]=λ/(λ+n)+(1−α²+β)=−4.55/0.45+2.91=−7.2 is large and NEGATIVE; combined with the custom atan2 circular angle-mean (_angle_mean) this reconstructs a posterior P with genuinely negative eigenvalues (≪ −1e-6), which the tiny +1e-6·I jitter in _regularize_P cannot repair. Two-part fix: (a) set alpha=1.0 in _make_ctrv_ukf → λ=0, λ+n=5, W_c[0]=0/5+(1−1+2)=+2.0 (POSITIVE central weight, spread √5≈2.24) — removes the dominant source of indefiniteness; (b) replace _regularize_P's jitter with a true PD PROJECTION via symmetric eigen-decomposition: `w,V=np.linalg.eigh((P+P.T)/2); w=np.maximum(w,1e-3); self.kf.P=(V*w)@V.T` so P is guaranteed PD (min eigenvalue ≥1e-3) regardless of UKF round-off, and call it on the seed P too. Keep USE_CTRV=True. ESCAPE HATCH: if the Step-6 re-run after 5c STILL crashes, the very next iteration flips USE_CTRV=False (the known-stable CV model that produced the 0.471 baseline) to secure a real measurement of the GIoU+Mahalanobis+per-class-hyperparam changes rather than burning the remaining budget on the motion model. File: algorithm/pose-tracking/src/measured/run_tracking.py. 
Implemented: (a) alpha 0.3→1.0 in _make_ctrv_ukf (λ=0, λ+n=5, spread √5≈2.24, central weight W_c[0]=+2.0 positive — removes the dominant indefiniteness source; updated the derivation comment); (b) rewrote Track._regularize_P to a true PD projection — symmetrize, w,V=np.linalg.eigh(P), w=np.maximum(w,1e-3), P=(V*w)@V.T — guaranteeing min-eigenvalue≥1e-3 regardless of round-off (replaces the insufficient +1e-6·I jitter); (c) also PD-project the seed P by calling self._regularize_P() right after _make_ctrv_ukf in Track.__init__. USE_CTRV stays True. Verified python -m py_compile OK and the dependency-free geometry --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833). The UKF PD path itself only exercises inside Docker at Step 6 (filterpy/numpy absent on this host). Next: Step 6 re-run docker compose track + eval RUN_ID=MEASURED-002.
09✓ DONE (iter 12) — Step 6 (RUN/EVAL, Docker): ran `docker compose run --rm pose-tracking run_tracking.py` (track) then `run_eval.py` (eval) on the seeded MEASURED-002 detections (the host has no `make`). After three failed UKF attempts, the Step-5c PD-projection fix HELD: the track step completed with no LinAlgError and wrote tracks.json + bev-frames.json. Eval initially failed on a Git-Bash MSYS path-mangling artifact (the literal arg `/data` was rewritten to `C:/Program Files/Git/data/...` before reaching the container); re-running with `MSYS_NO_PATHCONV=1 MSYS2_ARG_CONV_EXCL='*'` let nuscenes-devkit TrackingEval complete in 26.4s and write metrics.json. Captured: AMOTA=0.3757, AMOTP=0.9540, IDS=79, FRAG=43. NOTE: AMOTA 0.376 is BELOW the 0.471 baseline. The per-class breakdown shows bicycle and motorcycle at AMOTA 0.000 with enormous FP (3506 / 2569) and FAF (4328 / 3172), i.e. the score_thr=0.05 two-wheeler floor floods the tracker with false positives; car/truck/bus are healthy (0.65/0.67/0.63). This is the regression signal the evaluating→planning retry must address (raise the two-wheeler score_thr, e.g. to 0.10–0.15). A real measurement was secured regardless. This is the first Docker-touching step; everything above is coding-only.
10✓ DONE (iter 15) — Step 6b (RETRY FIX — coding, added iter 14 after MEASURED-002 regression): AMOTA=0.3757 came in below the 0.471 baseline, driven entirely by bicycle/motorcycle AMOTA=0.000 with massive FP (3506/2569) and FAF (4328/3172) — the Step-5 two-wheeler score_thr=0.05 floor floods the tracker with weak detections that min_hits=1 then promotes into spurious tracks. Fix in algorithm/pose-tracking/src/measured/run_tracking.py: in CLASS_HYPER raise bicycle and motorcycle score_thr 0.05→0.15 (match car/truck/bus) so weak two-wheeler detections are pre-filtered before seeding tracks, and raise their min_hits 1→2 to require a second confirming hit (keep max_age=3 for intersection occlusion gaps). Coding-only; verify py_compile + dependency-free --selftest still PASS. Then re-run Step 6 (docker compose track + eval RUN_ID=MEASURED-002) and re-measure. If two-wheeler classes recover but overall AMOTA still < 0.55, fall through to the Step-7 sweep. Implemented: CLASS_HYPER bicycle/motorcycle (1,3,0.05)→(2,3,0.15) — score_thr 0.05→0.15 and min_hits 1→2, max_age held at 3; updated the explanatory comment to record the regression and rationale. car/truck/bus/trailer/pedestrian unchanged. Verified python -m py_compile OK and the dependency-free geometry --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833). Next: re-run Step 6 (docker compose track + eval RUN_ID=MEASURED-002) and re-measure.
11✓ DONE (iter 19) — Step 6c (RETRY FIX, coding; added iter 18 after the MEASURED-002 re-measure gave AMOTA=0.4228 < 0.471 baseline < 0.55 target): The Step-6b two-wheeler fix recovered motorcycle (0.000->0.283) and collapsed the FP flood (~6000->704 total FP), but overall AMOTA is still BELOW the center-distance CV baseline. Decisive signal: RECALL fell to 0.4992 vs the baseline's 0.668. AMOTA averages MOTAR_r = max(0, 1 − (IDS_r + FP_r + FN_r − (1 − r)·P)/(r·P)) over a confidence-driven sweep of recall thresholds r (P = number of GT boxes), so thresholds the tracker never reaches add nothing to the average. The hard per-class score_thr pre-filter (car/truck/bus/pedestrian=0.15, two-wheelers=0.15 after 6b) deletes low-confidence TPs BEFORE the tracker, truncating the recall sweep at ~0.50 and capping AMOTA in exactly the high-recall region where the baseline earned its 0.471. Fix in algorithm/pose-tracking/src/measured/run_tracking.py CLASS_HYPER: drop the hard confidence floor on the healthy/high-GT classes, i.e. car/truck/bus/trailer/pedestrian score_thr 0.15->0.0 (no hard pre-filter; let detection confidence drive the AMOTA recall sweep as it did in the baseline) and DEFAULT_HYPER score_thr->0.0; keep two-wheelers at a MODEST score_thr 0.15->0.10 with min_hits=2/max_age=3 (enough to keep the FP flood suppressed, but lower than 0.15 to restore recall headroom). Rely on min_hits/max_age for FP/ID-churn suppression rather than a confidence floor. Coding-only; verify py_compile + dependency-free --selftest still PASS, then re-run Step 6 (docker compose track + eval RUN_ID=MEASURED-002) and re-measure. BUDGET NOTE: only 2 iterations remain after this retry. The path planning->executing (code) and executing->measuring (run+eval) consumes both, leaving the measuring->evaluating->done/failed tail to bump against budget=0. Recall headroom is the single highest-leverage knob, so spend the remaining budget on it rather than on the broader Step-7 3x3 sweep, which no longer fits.
Implemented: single edit to CLASS_HYPER — car/truck/bus/trailer/pedestrian score_thr 0.15->0.0 (no hard pre-filter; confidence now drives the AMOTA recall sweep as in the baseline) and bicycle/motorcycle 0.15->0.10 (min_hits=2/max_age=3 held, so FP suppression rides on the track lifecycle not a confidence floor); DEFAULT_HYPER was already (MIN_HITS,MAX_AGE,0.0). Rewrote the CLASS_HYPER comment to record the RECALL-collapse root cause and rationale. Verified python -m py_compile OK and the dependency-free geometry --selftest still PASS (missing-deps guard reports 'No module named filterpy'; identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833). The runtime tracking path only executes inside Docker at the next step (filterpy/numpy absent on this host). Step 6c was the only unfinished coding step; next is the Step-6 re-run (docker compose track + eval RUN_ID=MEASURED-002), the first Docker-touching step.
12 Step 7: If AMOTA < 0.55 and budget remains, run a small per-class sweep: bicycle/motorcycle gate ∈ {1.5,2.0,2.5} × w_giou ∈ {0.5,0.7,0.9} (3×3), re-run `make track` + `make eval` for each, and keep the best. Record the grid and the best config in the postmortem.
13 Step 8: `python src/measured/build_artifact.py --run-dir runs/MEASURED-002 --out runs/centerpoint-ab3dmot-nusc-mini-MEASURED-002.json`. Add explicit fields `parent_mission: M-001` and `upstream_simulated_artifact` to the artifact (extend build_artifact.py if it does not already accept them). This artifact becomes result.measured_artifact.
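Step 1's GIoU can be sketched with only the standard library, clipping one BEV rectangle against the other with Sutherland–Hodgman. This is an illustration under stated assumptions, not the repository's implementation: boxes are hypothetical tuples (x, y, z, l, w, h, theta) with z at the box center and l along the heading, and the enclosing volume is taken as the convex-hull area of the eight BEV corners times the combined z extent (the card does not say how the enclosing volume is built). The Step 1 self-test values came from boxes not listed in the card, so this sketch makes no claim to reproduce them.

```python
import math


def bev_corners(x, y, l, w, theta):
    """Four BEV corners in counter-clockwise order; l runs along the heading."""
    c, s = math.cos(theta), math.sin(theta)
    pts = []
    for dl, dw in ((1, 1), (-1, 1), (-1, -1), (1, -1)):
        px, py = dl * l / 2.0, dw * w / 2.0
        pts.append((x + c * px - s * py, y + s * px + c * py))
    return pts


def _side(a, b, p):
    """Positive if p lies left of the directed edge a->b (2-D cross product)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def clip_polygon(subject, clipper):
    """Sutherland-Hodgman: clip convex `subject` by convex CCW `clipper`."""
    out = list(subject)
    for i in range(len(clipper)):
        if not out:
            break
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, out = out, []
        for j in range(len(inp)):
            p, q = inp[j - 1], inp[j]          # previous and current vertex
            sp, sq = _side(a, b, p), _side(a, b, q)
            if sq >= 0:                        # q inside this clip edge
                if sp < 0:                     # entering: add crossing point first
                    t = sp / (sp - sq)
                    out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
                out.append(q)
            elif sp >= 0:                      # leaving: add crossing point only
                t = sp / (sp - sq)
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out


def polygon_area(pts):
    """Shoelace area; fewer than 3 vertices means zero area."""
    if len(pts) < 3:
        return 0.0
    acc = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and _side(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]


def giou_3d(box_a, box_b):
    """Generalized 3D IoU in [-1, 1] for boxes (x, y, z, l, w, h, theta)."""
    xa, ya, za, la, wa, ha, ta = box_a
    xb, yb, zb, lb, wb, hb, tb = box_b
    ca, cb = bev_corners(xa, ya, la, wa, ta), bev_corners(xb, yb, lb, wb, tb)
    z_overlap = max(0.0, min(za + ha / 2, zb + hb / 2) - max(za - ha / 2, zb - hb / 2))
    inter = polygon_area(clip_polygon(ca, cb)) * z_overlap
    union = la * wa * ha + lb * wb * hb - inter
    z_extent = max(za + ha / 2, zb + hb / 2) - min(za - ha / 2, zb - hb / 2)
    enclosing = polygon_area(convex_hull(ca + cb)) * z_extent
    if union <= 0 or enclosing <= 0:
        return 0.0                             # degenerate boxes: treat as uninformative
    return inter / union - (enclosing - union) / enclosing


def cost_giou(box_a, box_b):
    """Association cost in [0, 1] as in Step 1."""
    return 0.5 * (1.0 - giou_3d(box_a, box_b))


print(giou_3d((0, 0, 0, 2, 2, 2, 0), (1, 0, 0, 2, 2, 2, 0)))   # half-overlapping cubes
```

For the half-overlapping 2 m cubes, the intersection is 4 m³ and the union 12 m³, and the enclosing box equals the union, so GIoU reduces to the plain IoU of 1/3.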
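The CTRV pieces from Step 4 can be sketched as plain functions: the closed-form turn with its straight-line limit for |ω| < 1e-6, the measurement model on [x, y, theta], and the wrapped residual and atan2 circular mean for the heading. This is a sketch only. It uses Python lists, whereas filterpy's UnscentedKalmanFilter (through its fx, hx, x_mean_fn and residual_x arguments) works with numpy arrays, and that wiring is not shown; the function names are hypothetical.

```python
import math

DT = 0.5  # s, the step used in code-plan Step 4


def wrap_angle(a):
    """Map an angle to [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def ctrv_fx(x, dt=DT, eps=1e-6):
    """CTRV transition for state [px, py, theta, v, omega]."""
    px, py, theta, v, omega = x
    if abs(omega) < eps:
        # straight-line limit of the closed form as omega -> 0
        px += v * math.cos(theta) * dt
        py += v * math.sin(theta) * dt
    else:
        px += v / omega * (math.sin(theta + omega * dt) - math.sin(theta))
        py += v / omega * (math.cos(theta) - math.cos(theta + omega * dt))
    return [px, py, wrap_angle(theta + omega * dt), v, omega]


def ctrv_hx(x):
    """Measurement model: the detector observes [px, py, theta]."""
    return [x[0], x[1], x[2]]


def angle_residual(a, b, idx=2):
    """Element-wise a - b, wrapping the heading component at index idx."""
    r = [ai - bi for ai, bi in zip(a, b)]
    r[idx] = wrap_angle(r[idx])
    return r


def angle_mean(sigmas, weights, idx=2):
    """Weighted mean of sigma points, with a circular (atan2) mean on the heading."""
    n = len(sigmas[0])
    mean = [sum(w * s[k] for s, w in zip(sigmas, weights)) for k in range(n)]
    sin_sum = sum(w * math.sin(s[idx]) for s, w in zip(sigmas, weights))
    cos_sum = sum(w * math.cos(s[idx]) for s, w in zip(sigmas, weights))
    mean[idx] = math.atan2(sin_sum, cos_sum)
    return mean


print(ctrv_fx([0.0, 0.0, 0.0, 2.0, 0.0]))        # straight line: moves 1 m along x
print(ctrv_fx([0.0, 0.0, 0.0, 1.0, math.pi]))    # quarter turn in 0.5 s
print(angle_mean([[0, 0, 3.1], [0, 0, -3.1]], [0.5, 0.5]))
```

The last line shows why the circular mean matters: a naive average of headings 3.1 and −3.1 rad gives 0 (pointing the opposite way), while the atan2 mean returns ±π.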
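The per-class lifecycle in Step 5 comes down to three lookups into CLASS_HYPER: score_thr filters detections before association, max_age prunes stale tracks, and min_hits gates which tracks are emitted. A minimal sketch with the Step 5 values; detections and tracks are modelled as plain dicts (the real code uses Track objects with `hits`/`misses` attributes), and the global MIN_HITS=3 and MAX_AGE=2 behind DEFAULT_HYPER are assumptions, since the card does not state them.

```python
MIN_HITS, MAX_AGE = 3, 2  # assumed global defaults (not stated in the card)

# (min_hits, max_age, score_thr), Step 5 values; later steps retuned the thresholds.
CLASS_HYPER = {
    "bicycle": (1, 3, 0.05), "motorcycle": (1, 3, 0.05),
    "pedestrian": (3, 2, 0.10),
    "car": (3, 2, 0.15), "truck": (3, 2, 0.15), "bus": (3, 2, 0.15), "trailer": (3, 2, 0.15),
}
DEFAULT_HYPER = (MIN_HITS, MAX_AGE, 0.0)  # unlisted classes: no confidence floor


def prefilter(cls, dets):
    """Drop detections below the class score floor before they reach association."""
    score_thr = CLASS_HYPER.get(cls, DEFAULT_HYPER)[2]
    return [d for d in dets if d["conf"] >= score_thr]


def prune(cls, tracks):
    """Keep tracks that have missed at most max_age consecutive frames."""
    max_age = CLASS_HYPER.get(cls, DEFAULT_HYPER)[1]
    return [t for t in tracks if t["misses"] <= max_age]


def emitted(cls, tracks):
    """Only tracks with at least min_hits associated detections are reported."""
    min_hits = CLASS_HYPER.get(cls, DEFAULT_HYPER)[0]
    return [t for t in tracks if t["hits"] >= min_hits]


dets = [{"conf": 0.04}, {"conf": 0.20}]
print(len(prefilter("bicycle", dets)), len(prefilter("barrier", dets)))
```

With min_hits=1 every surviving two-wheeler detection becomes an emitted track on its first frame, which is consistent with the FP flood the card later measured at score_thr=0.05.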
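The sigma-point arithmetic behind Steps 5b and 5c can be checked directly. This sketch uses the standard scaled-unscented-transform weight formulas (the card quotes the λ and W_c[0] ones): λ = α²(n + κ) − n, W_m[0] = λ/(λ + n), W_c[0] = W_m[0] + (1 − α² + β), and W_i = 1/(2(λ + n)) for each outer point. With n = 5, β = 2, κ = 0 it reproduces the card's numbers: α = 1e-3 gives a spread of about 0.0022 and a central covariance weight near −1e6, α = 0.3 gives W_c[0] ≈ −7.2, and α = 1 gives W_c[0] = +2.0 with spread √5 ≈ 2.24. The function name is hypothetical.

```python
import math


def merwe_weights(n, alpha, beta=2.0, kappa=0.0):
    """Return (lambda, spread, Wm0, Wc0, Wi) for scaled sigma points."""
    lam = alpha ** 2 * (n + kappa) - n
    spread = math.sqrt(lam + n)            # scale applied to the Cholesky factor of P
    wm0 = lam / (lam + n)                  # central mean weight
    wc0 = wm0 + (1.0 - alpha ** 2 + beta)  # central covariance weight
    wi = 0.5 / (lam + n)                   # weight of each of the 2n outer points
    return lam, spread, wm0, wc0, wi


for alpha in (1e-3, 0.3, 1.0):
    lam, spread, wm0, wc0, wi = merwe_weights(5, alpha)
    print(f"alpha={alpha:g}: lambda+n={lam + 5:.3g} spread={spread:.4f} Wc[0]={wc0:.4g}")
```

The mean weights always sum to 1 (W_m[0] + 2n·W_i = 1), so a large negative W_c[0] does not bias the mean but can push the reconstructed covariance indefinite, which matches the Step 5c diagnosis.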
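Step 5c's eigenvalue-floor projection is short enough to show in full. A sketch assuming numpy; the function name is hypothetical, and 1e-3 is the floor the card chose. The usage lines illustrate why the earlier +1e-6·I jitter was insufficient: a symmetric matrix with an eigenvalue of −1 stays indefinite after jitter, while the projection makes it Cholesky-factorizable.

```python
import numpy as np


def project_pd(P, min_eig=1e-3):
    """Symmetrize P and floor its eigenvalues at min_eig."""
    S = (P + P.T) / 2.0            # remove round-off asymmetry
    w, V = np.linalg.eigh(S)       # S = V diag(w) V^T
    w = np.maximum(w, min_eig)     # clip negative or tiny eigenvalues
    return (V * w) @ V.T           # rebuild V diag(w) V^T


P_bad = np.array([[1.0, 2.0], [2.0, 1.0]])               # eigenvalues 3 and -1
print(np.linalg.eigvalsh(P_bad + 1e-6 * np.eye(2)))      # jitter leaves one negative
P_fixed = project_pd(P_bad)
np.linalg.cholesky(P_fixed)                              # succeeds: P_fixed is PD
```

A matrix that is already well conditioned, such as the seed P=diag(10,10,1,10,1), passes through unchanged, so calling the projection after every predict/update costs one small eigendecomposition and does not perturb healthy states.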

Execution

budget 0/20

File change matrix · +12 ~10 · 7 files · 21 attempts

File	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21
algorithm/pose-tracking/Makefile	·	~	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·
algorithm/pose-tracking/src/measured/run_tracking.py	·	·	~	~	~	~	~	·	~	·	~	·	·	·	~	·	·	·	~	·	·
algorithm/pose-tracking/runs/MEASURED-002/bev-frames.json	·	·	·	·	·	·	·	·	·	·	·	+	·	·	·	+	·	·	·	+	·
algorithm/pose-tracking/runs/MEASURED-002/metrics.json	·	·	·	·	·	·	·	·	·	·	·	+	·	·	·	+	·	·	·	+	·
algorithm/pose-tracking/runs/MEASURED-002/tracks.json	·	·	·	·	·	·	·	·	·	·	·	+	·	·	·	+	·	·	·	+	·
algorithm/pose-tracking/runs/centerpoint-ab3dmot-nusc-mini-MEASURED-002.json	·	·	·	·	·	·	·	·	·	·	·	·	+	·	·	·	+	·	·	·	·
lab/missions/M-001-postmortem.md	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	·	+

+ created · ~ modified

iter 21 · measuring → failed · budget-exhausted-finalize-failed
Budget exhausted (iteration_budget_remaining=0, all 20/20 iterations consumed) while status was still 'measuring', so per the Step-1 budget rule the mission finalizes as failed. Read the real metrics.json (nuscenes-devkit TrackingEval, iter-20 re-run) to fix the authoritative achieved value: AMOTA=0.4027750112338069. Compared vs goal target_op=ge target_value=0.55 — NOT met, and below the 0.471 baseline too. Across three distinct real measurements the tracker never beat baseline (0.3757 -> 0.4228 -> 0.4028); the recall ceiling (~0.50 vs baseline 0.668) was traced to the new GIoU+Mahalanobis association cost and/or the CTRV motion model dropping true positives, NOT the score_thr pre-filter (the iter-19 Step-6c hypothesis was disproven — dropping the floor moved recall only 0.4992->0.5045 and AMOTA regressed). car/truck/bus stayed healthy (~0.66/0.66/0.63); two-wheelers remained weak on the tiny mini_val support. Populated result with target_met=false, the real achieved value, the measured artifact id, the postmortem path, and honest caveats; wrote lab/missions/M-001-postmortem.md. The planned Step-7 sweep and Step-8 artifact-field cleanup were never reached (budget spent on 3 UKF PD-remediation rounds + 3 measure/evaluate cycles). Terminal write; no further transitions.
Created: lab/missions/M-001-postmortem.md
iter 20 · executing → measuring · step-6-rerun-track-eval-step6c
Re-ran Step 6 (the run/eval step) end-to-end with the iter-19 Step-6c recall-headroom fix applied (CLASS_HYPER: car/truck/bus/trailer/pedestrian score_thr 0.15->0.0, bicycle/motorcycle 0.15->0.10 with min_hits=2/max_age=3). Verified the Step-6c edit IS live in source (run_tracking.py:81 CLASS_HYPER shows the 0.0 / 0.10 floors), confirmed Docker is real on this host (Docker 28.2.2 / Compose v2.37.1), and that metrics.json still carried the stale iter-16 0.4228 numbers. Ran `docker compose run --rm pose-tracking python .../run_tracking.py` then `run_eval.py` directly on the seeded MEASURED-002 detections.pkl (host has no `make`; used MSYS_NO_PATHCONV=1 MSYS2_ARG_CONV_EXCL='*' to stop Git-Bash mangling the `/data` container arg). The CTRV UKF again ran cleanly (Step-5c PD projection still holding — no LinAlgError) and track wrote tracks.json + bev-frames.json; nuscenes-devkit TrackingEval completed in 29.6s and overwrote metrics.json. REAL MEASURED RESULT: AMOTA=0.4028, AMOTP=0.9970, RECALL=0.5045, MOTA=0.4040, IDS=70, FRAG=44, TP=2179, FP=429, FN=1420. HONEST FINDING: the Step-6c recall hypothesis was WRONG — dropping the confidence floor barely moved recall (0.4992->0.5045) and AMOTA actually REGRESSED 0.4228->0.4028. Per-class: car 0.656, truck 0.660, bus 0.627 stay healthy; pedestrian unchanged at 0.308 (recall 0.479 didn't move despite score_thr->0.0); motorcycle fell 0.283->0.141, bicycle rose 0.000->0.025. The recall ceiling is therefore NOT the score_thr pre-filter but lives in the GIoU+Mahalanobis association cost and/or the CTRV motion model losing true positives relative to the center-distance CV baseline. This is the 3rd distinct measurement (0.3757 -> 0.4228 -> 0.4028); none of the three beat the 0.471 baseline, and all are far below the 0.55 target. This eval was the final metric-producing run, so per the run/eval rule status moves executing -> measuring. 
BUDGET: this consumes the last iteration (1 -> 0); the next turn will hit the budget-exhausted terminal rule and must finalize the mission as `failed` (achieved 0.4028 < target 0.55) with a failure postmortem and a populated result block.
Created: algorithm/pose-tracking/runs/MEASURED-002/tracks.json, algorithm/pose-tracking/runs/MEASURED-002/bev-frames.json, algorithm/pose-tracking/runs/MEASURED-002/metrics.json
iter 19 · planning → executing · step-6c-recall-headroom-fix
Executed Step 6c (coding only — no Docker), the recall-headroom remediation for the MEASURED-002 re-measure (AMOTA=0.4228 < 0.471 baseline < 0.55 target). Root cause from iter-16/iter-18: the Step-6b two-wheeler fix recovered motorcycle (0.000->0.283) and collapsed the FP flood (~6000->704), but RECALL fell to 0.4992 vs the baseline's 0.668 — the hard per-class score_thr=0.15 pre-filter deletes low-confidence TPs BEFORE the tracker, truncating the confidence-driven AMOTA recall sweep at ~0.50, exactly the high-recall region where the baseline earned its 0.471. Single edit to src/measured/run_tracking.py CLASS_HYPER: dropped the hard confidence floor on the healthy/high-GT classes — car/truck/bus/trailer/pedestrian score_thr 0.15->0.0 (let detection confidence drive the recall sweep as in the baseline) — and relaxed the two-wheelers 0.15->0.10 while holding min_hits=2/max_age=3, so FP/ID-churn suppression now rides on the track lifecycle rather than a confidence floor. DEFAULT_HYPER was already (MIN_HITS,MAX_AGE,0.0). Rewrote the CLASS_HYPER comment to record the RECALL-collapse signal and rationale. Verified: python -m py_compile OK and the dependency-free geometry --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833; the missing-deps guard confirms filterpy/numpy are absent on this host, so the runtime tracking path only exercises inside Docker). Step 6c was the only unfinished coding step; the next step is the Step-6 re-run (docker compose track + eval RUN_ID=MEASURED-002), the first Docker-touching run/eval, so status moves planning -> executing. BUDGET: 1 iteration remains after this turn — the next iteration must run track+eval AND, given budget will then hit 0, the measuring/evaluating tail will bump against the budget-exhausted failure rule unless target is met; this recall-headroom knob is the highest-leverage single move available. 
The seeded runs/MEASURED-002/ detections.pkl + upstream markers remain valid and are reused as-is.
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
iter 18 · evaluating → planning · evaluating-retry
Compared the real measured AMOTA=0.4228 (read from runs/centerpoint-ab3dmot-nusc-mini-MEASURED-002.json; target_op=ge vs target_value=0.55) — target NOT met, and still below the 0.471 baseline. Budget remains (3) and this is only the 2nd distinct measurement (0.3757 -> 0.4228, not stalled 3x at one value), so per the state machine this is an evaluating->planning retry rather than a failure. Root-caused the residual gap from the iter-16 metrics: the Step-6b two-wheeler fix worked directionally (AMOTA 0.3757->0.4228, motorcycle 0.000->0.283, total FP ~6000->704), but RECALL collapsed to 0.4992 (baseline 0.668). Since AMOTA averages accuracy over a confidence-driven recall sweep, the hard per-class score_thr=0.15 pre-filter removes low-confidence TPs and caps the achievable recall at ~0.50, truncating the AMOTA integral exactly where the baseline earned its score. Adjustment Z (queued as code_plan Step 6c): drop the hard confidence floor on the healthy/high-GT classes (car/truck/bus/trailer/pedestrian + DEFAULT_HYPER score_thr 0.15->0.0) so confidence drives the recall sweep as in the baseline, and lower two-wheelers 0.15->0.10 (keep min_hits=2/max_age=3) to restore recall headroom while holding the FP flood down. With only 2 iterations left, this single recall-headroom knob is the highest-leverage move and replaces the now-unaffordable Step-7 3x3 sweep. Reverted status evaluating->planning so the next iteration revises and executes the fix; the seeded runs/MEASURED-002/ detections.pkl + markers remain valid and are reused as-is. No source files edited this turn (coding happens next, in planning->executing).
iter 17 · measuring → evaluating · rebuild-measured-002-artifact
Read the iter-16 re-run metrics from runs/MEASURED-002/metrics.json (the real nuscenes-devkit TrackingEval output after the Step-6b two-wheeler fix) and REBUILT the dashboard artifact, since the iter-13 artifact still carried the stale pre-fix 0.3757 numbers (metrics.json was overwritten in place by the iter-16 re-run). Ran `python src/measured/build_artifact.py --run-dir runs/MEASURED-002 --out runs/centerpoint-ab3dmot-nusc-mini-MEASURED-002.json --run-id centerpoint-ab3dmot-nusc-mini-MEASURED-002 --upstream-simulated centerpoint-ab3dmot-nusc-mini-001` on this host (build_artifact.py imports no torch/numpy at module scope; the git/hardware reproducibility fields degrade to None off-container, which is cosmetic). The artifact now carries the current measured metrics: AMOTA=0.4228, AMOTP=1.0854, MOTA=0.4054, RECALL=0.4992, IDS=69, FRAG=40, plus the full per-class breakdown (bicycle 0.000, motorcycle 0.283, pedestrian 0.308, car 0.651, truck 0.669, bus 0.627, trailer null). Known cosmetic debt deferred to Step 8: the artifact's hardcoded notes/tracker block still describe the AB3DMOT center-distance baseline rather than this mission's GIoU+Mahalanobis+CTRV+per-class changes, and the parent_mission:M-001 field is not yet added — both are refined in Step 8 only once a config that meets target is found. The artifact exists on disk and is the candidate result.measured_artifact. Per the one-transition rule, the target comparison (achieved 0.4228 vs target ≥0.55) happens next in evaluating → planning/done/failed. Status moves measuring → evaluating.
Created: algorithm/pose-tracking/runs/centerpoint-ab3dmot-nusc-mini-MEASURED-002.json
iter 16 · executing → measuring · step-6-rerun-track-eval-success
Re-ran Step 6 (the run/eval step) end-to-end with the iter-15 Step-6b fix applied (CLASS_HYPER bicycle/motorcycle score_thr 0.05->0.15, min_hits 1->2, max_age held 3), and PRODUCED REAL METRICS. Invoked `docker compose run --rm pose-tracking run_tracking.py` then `run_eval.py` directly on the seeded MEASURED-002 detections.pkl (host has no `make`; used MSYS_NO_PATHCONV=1 MSYS2_ARG_CONV_EXCL='*' to stop Git-Bash mangling the `/data` container arg). The CTRV UKF again ran cleanly (Step-5c PD projection holding) and track wrote tracks.json + bev-frames.json; nuscenes-devkit TrackingEval completed in 22.7s and overwrote metrics.json. Measured: AMOTA=0.4228, AMOTP=1.0854, RECALL=0.4992, MOTA=0.4054, IDS=69, FRAG=40 (TP=2111, FP=704, FN=1489). The Step-6b two-wheeler fix WORKED directionally: AMOTA rose 0.3757->0.4228, motorcycle recovered 0.000->0.283, and the FP flood collapsed (~6000 -> 704 total FP) — bicycle is still AMOTA=0.000 (only 41 GT, RECALL=0.220, FP=396) and pedestrian sits at 0.308, while car/truck/bus stay healthy (0.651/0.669/0.627). Net is still BELOW the 0.471 baseline and the 0.55 target. This eval was the final metric-producing run, so per the run/eval rule status moves executing -> measuring. Next: read metrics.json and rebuild the MEASURED-002 artifact, then the evaluating-state target comparison.
Created: algorithm/pose-tracking/runs/MEASURED-002/tracks.json, algorithm/pose-tracking/runs/MEASURED-002/bev-frames.json, algorithm/pose-tracking/runs/MEASURED-002/metrics.json
iter 15 · planning → executing · step-6b-two-wheeler-score-thr-fix
Executed Step 6b (coding only — no Docker), the remediation for the MEASURED-002 regression (AMOTA=0.3757 < 0.471 baseline). Root cause from iter-12's per-class breakdown: bicycle AMOTA=0.000 (FP=3506, FAF=4328) and motorcycle AMOTA=0.000 (FP=2569, FAF=3172) flooded the tracker with false positives, while car/truck/bus were healthy (0.651/0.669/0.627). The Step-5 two-wheeler floor score_thr=0.05 + min_hits=1 admitted far too many weak detections that the lifecycle then promoted into spurious tracks. Single targeted edit to src/measured/run_tracking.py CLASS_HYPER: bicycle (1,3,0.05)→(2,3,0.15) and motorcycle (1,3,0.05)→(2,3,0.15) — i.e. score_thr 0.05→0.15 (match car/truck/bus) to pre-filter weak detections before they seed tracks, and min_hits 1→2 to require a second confirming hit; max_age held at 3 to preserve survival across intersection occlusion gaps. car/truck/bus/trailer/pedestrian and the DEFAULT_HYPER fallback are unchanged. Rewrote the CLASS_HYPER comment to record the regression and rationale. Verified: python -m py_compile OK and the dependency-free geometry --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833) — the runtime tracking path only executes inside the Docker container at the next step since filterpy/numpy are absent on this host (selftest confirmed 'missing runtime deps: filterpy', geometry path needs none). Step 6b was the only unfinished coding step; the next step is the Step-6 re-run (docker compose track + eval RUN_ID=MEASURED-002), the first Docker-touching step, so status moves planning → executing. The seeded runs/MEASURED-002/ detections.pkl + upstream markers remain valid and are reused as-is on the re-run.
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
iter 14 · evaluating → planning · evaluating-retry
Compared the real measured AMOTA=0.3757 (read from runs/centerpoint-ab3dmot-nusc-mini-MEASURED-002.json, target_op=ge vs target_value=0.55) — target NOT met, and in fact BELOW the 0.471 baseline. Budget remains (7) and this is the first/only measurement (not stalled 3× at one value), so per the state machine this is an evaluating→planning retry rather than a failure. Root cause is unambiguous in the per-class breakdown captured at iter-12: bicycle AMOTA=0.000 (FP=3506, FAF=4328) and motorcycle AMOTA=0.000 (FP=2569, FAF=3172) are flooding the tracker with false positives, while car/truck/bus are healthy (0.651/0.669/0.627). The cause is the Step-5 two-wheeler score_thr=0.05 floor, which admits far too many weak detections that the min_hits=1 / max_age=3 lifecycle then promotes into spurious tracks. Adjustment Z for the next planning→executing turn (queued as code_plan Step 6b): raise bicycle/motorcycle score_thr 0.05→0.15 (match car/truck) so weak two-wheeler detections are pre-filtered before they can seed tracks, and revert their min_hits 1→2 to require a second confirming hit (keep max_age=3 for the occlusion gaps). This is a single, well-motivated knob change targeted at the two classes that zeroed out; the broader Step-7 3×3 gate×w_giou sweep stays available if 6b alone does not clear 0.55. Reverted status evaluating→planning so the next iteration revises and executes the fix. The seeded runs/MEASURED-002/ detections.pkl + markers remain valid and are reused as-is. No source files edited this turn (coding happens next, in planning→executing).
iter 13 · measuring → evaluating · build-measured-002-artifact · just now
Read runs/MEASURED-002/metrics.json (the real nuscenes-devkit TrackingEval output from iter-12) and assembled the dashboard artifact via build_artifact.py — ran `python src/measured/build_artifact.py --run-dir runs/MEASURED-002 --out runs/centerpoint-ab3dmot-nusc-mini-MEASURED-002.json --run-id centerpoint-ab3dmot-nusc-mini-MEASURED-002 --upstream-simulated centerpoint-ab3dmot-nusc-mini-001`. Wrote centerpoint-ab3dmot-nusc-mini-MEASURED-002.json carrying the measured metrics (AMOTA=0.3757, AMOTP=0.9540, IDS=79, FRAG=43, MOTA=0.3667, RECALL=0.6683) plus the full per-class breakdown and reproducibility block. Note: build_artifact.py ran on this host directly (it imports no torch/numpy at module scope — hardware/git fields degrade to None off-container, which is cosmetic). The artifact's hardcoded notes/tracker block still describe the AB3DMOT center-distance baseline rather than this mission's GIoU+Mahalanobis+CTRV+per-class changes; refining those strings (and adding the parent_mission field) is deferred to Step 8 after a config that actually meets target is found. Artifact now exists on disk and becomes the candidate result.measured_artifact. Per the one-transition rule, the target comparison itself (achieved 0.3757 vs target ≥0.55) happens next in evaluating → planning/done/failed. Status moves measuring → evaluating.
Created: algorithm/pose-tracking/runs/centerpoint-ab3dmot-nusc-mini-MEASURED-002.json
iter 12 · executing → measuring · step-6-track-eval-success · just now
Executed Step 6 (the first Docker-touching run/eval) end-to-end and PRODUCED REAL METRICS. Ran `docker compose run --rm pose-tracking run_tracking.py` then `run_eval.py` on the seeded MEASURED-002 detections.pkl (host has no `make`, so invoked docker compose directly). Two outcomes worth recording: (1) the CTRV UKF that crashed non-PD in iters 8 & 10 finally ran cleanly — the Step-5c PD-projection fix (alpha=1.0 → positive central weight W_c[0]=+2.0, plus eigh-based eigenvalue-floor projection of P with min-eig≥1e-3 on seed + after every predict/update) HELD: track completed with no LinAlgError and wrote tracks.json + bev-frames.json. (2) eval first failed on a host-side Git-Bash MSYS path-conversion artifact — the literal container arg `/data` was rewritten to `C:/Program Files/Git/data/v1.0-mini/v1.0-mini` before reaching docker, raising 'Database version not found'; this was NOT a code bug. Re-ran the identical command with `MSYS_NO_PATHCONV=1 MSYS2_ARG_CONV_EXCL='*'` and nuscenes-devkit TrackingEval completed in 26.4s, writing metrics.json. Measured: AMOTA=0.3757, AMOTP=0.9540, IDS=79, FRAG=43, RECALL=0.668. This is BELOW the 0.471 baseline (a regression), driven entirely by the two-wheeler classes: bicycle AMOTA=0.000 (FP=3506, FAF=4328) and motorcycle AMOTA=0.000 (FP=2569, FAF=3172) — the per-class score_thr=0.05 floor set in Step 5 is flooding those tracks with false positives, while car (0.651), truck (0.669), bus (0.627) are healthy. The evaluating-state comparison and any planning→retry (raise two-wheeler score_thr toward 0.10–0.15, possibly tighten their gate) come in later iterations per the one-transition rule. The eval was the final metric-producing run, so status moves executing → measuring. Next: read metrics.json and build the MEASURED-002 artifact.
Created: algorithm/pose-tracking/runs/MEASURED-002/tracks.json, algorithm/pose-tracking/runs/MEASURED-002/bev-frames.json, algorithm/pose-tracking/runs/MEASURED-002/metrics.json
iter 11 · planning → executing · step-5c-ukf-pd-projection-fix · just now
Executed Step 5c (coding only — no Docker), the second remediation for the CTRV UKF non-PD crash that iter-8 (alpha 1e-3) and iter-10 (alpha 0.3 + 1e-6·I jitter) both failed to fix. Refined diagnosis confirmed: alpha=0.3 left the central covariance weight W_c[0]=λ/(λ+n)+(1−α²+β)≈−7.2 large and negative, so the atan2 circular angle-mean reconstructed a posterior P with eigenvalues ≪−1e-6 that the tiny jitter could not repair. Three source edits to src/measured/run_tracking.py: (a) MerweScaledSigmaPoints alpha 0.3→1.0 in _make_ctrv_ukf so λ=0, λ+n=5 (spread √5≈2.24) and W_c[0]=+2.0 (positive central weight), removing the dominant source of indefiniteness; updated the inline derivation comment. (b) Rewrote Track._regularize_P from the insufficient symmetrize+1e-6·I jitter to a true PD projection via symmetric eigendecomposition — P=(P+Pᵀ)/2; w,V=np.linalg.eigh(P); w=np.maximum(w,1e-3); self.kf.P=(V*w)@V.T — guaranteeing min-eigenvalue≥1e-3 before every sigma_points() Cholesky regardless of UKF round-off. (c) Applied the projection to the seed covariance too by calling self._regularize_P() immediately after _make_ctrv_ukf in Track.__init__. Kept USE_CTRV=True; the escape hatch remains — if the Step-6 re-run STILL crashes non-PD, the next iteration flips USE_CTRV=False to the known-stable CV model to secure a real measurement. Verified python -m py_compile OK and the dependency-free geometry --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833) — the UKF PD path only exercises inside the Docker container at Step 6 since filterpy/numpy are absent on this host. Step 5c was the only unfinished coding step; the next step (Step 6) is the first Docker-touching run/eval (docker compose run track + eval RUN_ID=MEASURED-002), so status moves planning → executing.
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
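The Step-5c projection can be sketched as a standalone function. It follows the `_regularize_P` body quoted above but is written as a pure function over a NumPy array rather than a `Track` method, and the function name is illustrative. It also shows why an eigenvalue floor repairs matrices that a jitter term cannot: an eigenvalue of -1 is lifted to 1e-3 regardless of its magnitude.

```python
import numpy as np


def regularize_P(P: np.ndarray, min_eig: float = 1e-3) -> np.ndarray:
    """Symmetrize P and clip its eigenvalues to at least min_eig."""
    P = 0.5 * (P + P.T)           # remove asymmetric round-off
    w, V = np.linalg.eigh(P)      # P = V diag(w) V^T for symmetric P
    w = np.maximum(w, min_eig)    # floor negative or tiny eigenvalues
    return (V * w) @ V.T          # V * w scales column j of V by w[j]


if __name__ == "__main__":
    P = np.array([[1.0, 2.0], [2.0, 1.0]])      # eigenvalues 3 and -1
    Q = regularize_P(P)
    print(np.linalg.eigvalsh(Q).min())          # approximately 1e-3
    np.linalg.cholesky(Q)                       # no longer raises
    P0 = np.diag([10.0, 10.0, 1.0, 10.0, 1.0])  # Step-4 seed covariance
    print(np.allclose(regularize_P(P0), P0))    # True: already above the floor
```

Because the seed covariance diag(10,10,1,10,1) is already well above the floor, applying the projection in `Track.__init__` leaves it unchanged. It only acts when round-off has pushed P toward indefiniteness.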
iter 10 · executing → planning · execute-failed-step-6-track-ukf-still-not-pd · 21m ago
Re-ran Step 6's track half (`docker compose run --rm pose-tracking python .../run_tracking.py` on the seeded MEASURED-002 detections.pkl, since the host has no `make`). The Step-5b UKF PD fix did NOT hold: identical crash `numpy.linalg.LinAlgError: 3-th leading minor of the array is not positive definite`, raised at filterpy UKF.predict → compute_process_sigmas → MerweScaledSigmaPoints.sigma_points → scipy cholesky of (λ+n)·P on the heading (theta) dimension, on the first predict() of an already-updated track. Verified the 5b edits ARE in the source (alpha=0.3 at run_tracking.py:169, _regularize_P with +1e-6·I at :410-416 wired after predict/update at :421/:427), so the fix was applied but is insufficient. Refined diagnosis: alpha=0.3 fixes the sigma-point SPREAD (λ+n=0.45>0) but leaves the central COVARIANCE weight W_c[0]=λ/(λ+n)+(1−α²+β)=−4.55/0.45+2.91≈−7.2 large and NEGATIVE; with the custom atan2 circular angle-mean this reconstructs a posterior P whose eigenvalues go well below −1e-6, so the tiny jitter cannot restore positive-definiteness before the next sigma_points() Cholesky. No metrics produced. Per the run/eval error rule, reverted status executing→planning and queued a concrete remediation as code_plan 'Step 5c (FIX)': (a) alpha 0.3→1.0 so W_c[0]=+2.0 (positive central weight, λ+n=5), (b) replace the jitter in _regularize_P with a true PD projection via eigh + eigenvalue floor 1e-3 (also applied to the seed P), keeping USE_CTRV=True with an explicit escape hatch — if the next re-run still crashes, flip USE_CTRV=False to the known-stable CV model to secure a real measurement. The seeded detections.pkl + upstream markers in runs/MEASURED-002/ remain valid and are reused as-is. No source files edited this turn (coding happens next, in planning→executing).
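The arithmetic behind this diagnosis can be checked with the scaled-sigma-point formulas that filterpy's `MerweScaledSigmaPoints` uses: λ = α²(n+κ) − n, W_m[0] = λ/(λ+n), W_c[0] = W_m[0] + (1 − α² + β), and a shared weight 1/(2(λ+n)) for the other 2n points. The helper below is a hypothetical stand-alone calculator, not code from run_tracking.py. Here n=5 matches the CTRV state [x,y,theta,v,omega].

```python
def merwe_weights(n: int, alpha: float, beta: float = 2.0, kappa: float = 0.0):
    """Scaled unscented-transform constants for 2n+1 sigma points.

    Returns (lambda + n, Wm[0], Wc[0], Wi), where Wi = 1 / (2(lambda + n)) is the
    shared weight of the 2n non-central points.
    """
    lam = alpha ** 2 * (n + kappa) - n
    wm0 = lam / (n + lam)
    wc0 = wm0 + (1.0 - alpha ** 2 + beta)
    wi = 1.0 / (2.0 * (n + lam))
    return n + lam, wm0, wc0, wi


if __name__ == "__main__":
    for alpha in (1e-3, 0.3, 1.0):  # iter-8, iter-9/10 and iter-11 settings
        s, wm0, wc0, wi = merwe_weights(5, alpha)
        print(f"alpha={alpha:g}  lambda+n={s:.3g}  spread={s ** 0.5:.3g}  Wc[0]={wc0:.3g}")
```

For α=0.3 this gives λ+n=0.45 and W_c[0]≈−7.2, and for α=1.0 it gives λ+n=5 and W_c[0]=+2.0, matching the figures above. For the original α=1e-3, the central covariance weight is about −1e6, which is even more strongly negative.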
iter 9 · planning → executing · step-5b-ukf-pd-fix · 46m ago
Executed Step 5b (coding only — no Docker), the remediation for iter-8's UKF positive-definiteness crash. Two source edits to src/measured/run_tracking.py: (1) widened the sigma-point spread in _make_ctrv_ukf — MerweScaledSigmaPoints alpha 1e-3 → 0.3 (keeping beta=2, kappa=0) so λ+n = alpha²·(n+kappa) goes from ≈5e-6 (spread √≈0.0022, the value that drove the heading-dim Cholesky non-PD when combined with the atan2 circular angle-mean) to O(1) (spread √≈0.67), with an inline comment deriving the math; (2) added Track._regularize_P() that symmetrizes and jitter-regularizes the covariance — self.kf.P = (P+Pᵀ)/2 + 1e-6·I — and wired it to run after every UKF predict() and update() on the CTRV path (gated on USE_CTRV so the CV fallback is untouched), guaranteeing P is PD before the next sigma_points() Cholesky. Kept USE_CTRV=True (the CV fallback remains a one-flag escape hatch). Verified: python -m py_compile OK, and the dependency-free geometry --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833). The PD fix itself only exercises inside the Docker container at Step 6 since filterpy/numpy are absent on this host. Step 5b was the only unfinished coding step; the next step (Step 6) is the first Docker-touching run/eval (`make track`+`make eval RUN_ID=MEASURED-002`), so status moves planning → executing and the next iteration will run it.
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
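The Step-5b repair could not hold once W_c[0] was strongly negative. Adding 1e-6·I only shifts every eigenvalue by 1e-6, so any eigenvalue below −1e-6 stays negative. Below is a minimal sketch using a hypothetical 5×5 covariance whose heading entry is −0.05. That value is illustrative, not a logged one, and `symmetrize_jitter` and `is_pd` are illustrative names.

```python
import numpy as np


def symmetrize_jitter(P: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Step-5b style repair: symmetrize, then add eps * I."""
    return 0.5 * (P + P.T) + eps * np.eye(P.shape[0])


def is_pd(P: np.ndarray) -> bool:
    """True if a Cholesky factorization of P succeeds."""
    try:
        np.linalg.cholesky(P)
        return True
    except np.linalg.LinAlgError:
        return False


if __name__ == "__main__":
    # Hypothetical posterior whose heading (index 2) variance went negative.
    P_bad = np.diag([10.0, 10.0, -0.05, 10.0, 1.0])
    P_rep = symmetrize_jitter(P_bad)
    print(np.linalg.eigvalsh(P_rep).min())  # still about -0.05
    print(is_pd(P_rep))                     # False
```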
iter 8 · executing → planning · execute-failed-step-6-track-ukf-not-pd · 1h ago
Executed Step 6's first half (`make track RUN_ID=MEASURED-002`, run directly as `docker compose run --rm pose-tracking python .../run_tracking.py` since the host has no `make`; the MEASURED-002 run dir was already seeded with the reused baseline detections.pkl + upstream markers). The CTRV UKF crashed at runtime: `numpy.linalg.LinAlgError: 3-th leading minor of the array is not positive definite`, raised inside filterpy's UKF.predict -> compute_process_sigmas -> MerweScaledSigmaPoints.sigma_points -> scipy cholesky of (lambda+n)*P, failing on the heading (theta) dimension. Diagnosed root cause: MerweScaledSigmaPoints(n=5, alpha=1e-3, beta=2, kappa=0) at run_tracking.py:164 yields lambda+n~=5e-6 -- an extremely tight sigma spread (sqrt~=0.0022) that, combined with the custom atan2 circular angle-mean (_angle_mean) reconstructing the posterior covariance, drives P numerically non-positive-definite, so the next predict's Cholesky fails. No metrics were produced. Per the run/eval error rule I reverted status executing->planning so the next iteration can apply the fix. Queued a concrete remediation step into approach.code_plan as 'Step 5b (FIX)': widen alpha to ~0.3 (lambda+n becomes O(1)), defensively symmetrize+jitter-regularize P (P=(P+P^T)/2 + 1e-6*I) after each UKF predict/update so PD is guaranteed before the next sigma_points() call, and keep USE_CTRV=True with the CV fallback as an escape hatch. The seeded detections.pkl/markers in runs/MEASURED-002/ remain valid and are reused as-is on the next track attempt. No source files were edited this turn (coding happens next, in planning->executing).
iter 7 · executing → executing · step-5-per-class-hyperparameters · 2h ago
Executed Step 5 (coding only — no Docker), the final coding step. Replaced the two global track-lifecycle constants (MIN_HITS=3, MAX_AGE=2) with a per-class CLASS_HYPER dict mapping each class to (min_hits, max_age, score_thr): bicycle/motorcycle=(1,3,0.05) so a short, frequently-occluded two-wheeler track confirms on a single hit and survives the 1.5s gaps it hits at intersections; pedestrian=(3,2,0.10) and car/truck/bus/trailer=(3,2,0.15) keep the conservative AB3DMOT 3/2 confirmation/deletion with a higher confidence floor to suppress false-positive ID churn. Added DEFAULT_HYPER=(MIN_HITS,MAX_AGE,0.0) so the BEV-only static classes (barrier/traffic_cone) and construction_vehicle fall back to the old behaviour. Wired all three knobs: (a) score_thr is a per-class detection pre-filter in main() — cls_dets is filtered to conf>=score_thr BEFORE step_per_class, so low-score detections never seed or sustain a track; (b) min_hits replaces the hardcoded MIN_HITS at the track-emission gate; (c) max_age replaces the hardcoded MAX_AGE at deletion inside step_per_class (looked up once per call via CLASS_HYPER.get(cls,DEFAULT_HYPER)[1]; the return filter is now t.misses<=max_age). Verified: python -m py_compile OK and the dependency-free --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833) — runtime/UKF paths only execute inside Docker since filterpy/numpy are absent on this host. All coding steps (0-5) are now done; the next step is Step 6, the first Docker-touching run/eval (`make track RUN_ID=MEASURED-002` then `make eval RUN_ID=MEASURED-002`), so status stays `executing` and the next iteration will execute it and transition to `measuring`.
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
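The min_hits / max_age half of Step 5 can be sketched as a single per-frame lifecycle update. This is a simplified reconstruction. The `Track` dataclass, the `step_lifecycle` function and the "a new track starts with one hit" convention are assumptions. The emission gate here checks only `hits >= min_hits`; the real tracker may add further conditions, such as requiring a match in the current frame. The deletion rule `misses <= max_age` follows the log.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Hyper = Tuple[int, int, float]  # (min_hits, max_age, score_thr)


@dataclass
class Track:
    cls: str
    hits: int = 1     # a new track is seeded by one detection (assumption)
    misses: int = 0   # consecutive frames without a matched detection


def step_lifecycle(tracks: List[Track], matched: Set[int],
                   class_hyper: Dict[str, Hyper], default: Hyper):
    """Update hit/miss counters, delete stale tracks, return (alive, emitted)."""
    for i, t in enumerate(tracks):
        if i in matched:
            t.hits += 1
            t.misses = 0
        else:
            t.misses += 1
    alive = [t for t in tracks if t.misses <= class_hyper.get(t.cls, default)[1]]
    emitted = [t for t in alive if t.hits >= class_hyper.get(t.cls, default)[0]]
    return alive, emitted


if __name__ == "__main__":
    hyper = {"bicycle": (1, 3, 0.05), "car": (3, 2, 0.15)}  # Step-5 values
    bike, car = Track("bicycle"), Track("car", hits=5, misses=2)
    alive, emitted = step_lifecycle([bike, car], set(), hyper, (3, 2, 0.0))
    print([t.cls for t in alive], [t.cls for t in emitted])  # ['bicycle'] ['bicycle']
```

With Step-5's (1, 3) setting, an unmatched bicycle track is both kept and emitted, while a car track on its third miss is deleted. This permissive two-wheeler lifecycle is what iter 12 later linked to the bicycle/motorcycle false-positive flood.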
iter 6 · executing → executing · step-4-ctrv-ukf · 2h ago
Executed Step 4 (coding only — no Docker). Swapped the 4-state constant-velocity KalmanFilter for a 5-state CTRV model driven by filterpy's UnscentedKalmanFilter: state [x,y,theta,v,omega], measuring [x,y,theta]. Added module flag USE_CTRV=True (a CV fallback — flip to False to restore the old filter if the UKF is unstable on mini) and DT=0.5s. _ctrv_fx uses the closed-form constant-turn solution with a |omega|<1e-6 straight-line limit to dodge the 1/omega singularity; _ctrv_hx observes [x,y,theta]. Crucially handled the heading wrap-around: residual_x/residual_z=_angle_residual normalize the theta component to (-pi,pi], and x_mean_fn/z_mean_fn=_angle_mean compute the sigma-point heading mean via atan2(Σw·sinθ, Σw·cosθ) so the UKF unscented transform and innovation don't blow up across ±pi. _make_ctrv_ukf seeds the state from the detection pose with v0 = vx·cosθ + vy·sinθ (signed speed projected onto heading, from CenterPoint's velocity head) and omega0=0; P=diag(10,10,1,10,1), R=diag(1,1,0.5), Q=diag(.01,.01,.01,.1,.1) so speed and yaw-rate can adapt. Refactored Track to be motion-model-agnostic: __init__ and update() branch CTRV vs CV, and three new accessors pos()/velocity()/heading() hide that the CTRV state is a 1-D UKF vector (x[i]) while the CV state is a 2-D column vector (x[i,0]). to_dict() and step_per_class's t_box now read t.pos()/t.heading()/t.velocity(), and _track_pred_velocity delegates to track.velocity() (it previously assumed 2-D indexing, which the UKF's 1-D state would have broken). Verified: python -m py_compile OK, and the dependency-free --selftest still PASS (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833) — the UKF runtime path only executes inside the Docker container at Step 6 since filterpy/numpy are absent on this host. Step 5 (per-class hyperparameters) is still coding-only and Step 6 is the first Docker step, so status stays `executing`. Next: Step 5 (per-class CLASS_HYPER dict for min_hits/max_age/score_thr).
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
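The CTRV pieces described above can be sketched with plain NumPy. This is a reconstruction from the log's description, not the file's `_ctrv_fx` / `_angle_residual` / `_angle_mean`. It makes three assumptions:

- the heading wrap uses atan2(sin, cos);
- theta sits at index 2 in both the state [x,y,theta,v,omega] and the measurement [x,y,theta];
- `seed_speed` is a hypothetical helper for the v0 projection.

Functions of this shape are what filterpy's `UnscentedKalmanFilter` accepts through its `fx`, `x_mean_fn`, `z_mean_fn`, `residual_x` and `residual_z` arguments.

```python
import numpy as np


def wrap_angle(a):
    """Map an angle into [-pi, pi] via atan2(sin, cos)."""
    return np.arctan2(np.sin(a), np.cos(a))


def ctrv_fx(x, dt):
    """Propagate a CTRV state [x, y, theta, v, omega] forward by dt seconds."""
    px, py, theta, v, omega = x
    if abs(omega) < 1e-6:                  # straight-line limit avoids 1/omega
        px += v * np.cos(theta) * dt
        py += v * np.sin(theta) * dt
    else:                                  # closed-form constant-turn arc
        px += v / omega * (np.sin(theta + omega * dt) - np.sin(theta))
        py += v / omega * (np.cos(theta) - np.cos(theta + omega * dt))
    return np.array([px, py, wrap_angle(theta + omega * dt), v, omega])


def angle_residual(a, b, idx=2):
    """a - b with the heading component wrapped (residual_x / residual_z)."""
    y = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    y[idx] = wrap_angle(y[idx])
    return y


def angle_mean(sigmas, Wm, idx=2):
    """Weighted sigma-point mean with a circular mean for the heading column."""
    x = np.dot(Wm, sigmas)
    x[idx] = np.arctan2(np.dot(Wm, np.sin(sigmas[:, idx])),
                        np.dot(Wm, np.cos(sigmas[:, idx])))
    return x


def seed_speed(vx, vy, theta):
    """Signed speed along the heading from CenterPoint's (vx, vy)."""
    return vx * np.cos(theta) + vy * np.sin(theta)


if __name__ == "__main__":
    print(ctrv_fx(np.array([0.0, 0.0, 0.0, 2.0, 0.0]), 0.5))  # moves 1 m along +x
    sig = np.array([[0, 0, 3.1, 0, 0], [0, 0, -3.1, 0, 0]], dtype=float)
    print(angle_mean(sig, np.array([0.5, 0.5]))[2])  # near +/-pi, not the naive 0
```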
iter 5 · executing → executing · step-3-class-conditioned-cost-blend · 2h ago
Executed Step 3 (coding only — no Docker). Replaced step_per_class's center-distance association cost with the class-conditioned blend w_giou*0.5*(1-giou) + w_vel*(1-exp(-d²/2)), wiring together the Step 1 GIoU and Step 2 velocity-Mahalanobis terms. Added module dict CLASS_WEIGHTS={car/truck/bus:(0.4,0.6), pedestrian:(0.6,0.4), bicycle/motorcycle:(0.7,0.3)} with default (0.5,0.5); weights sum to 1 so the blended cost stays in [0,1) and the 1e6 distance-gate sentinel remains separable from real costs, leaving the `cost[r,c] < 1e6` match-acceptance check correct without modification. Threaded `cls` through the step_per_class signature (detections, tracks, gate, cls) and updated the single call site to pass it. The center-distance is now only a cheap pre-gate: pairs with dist>gate get 1e6 and `continue` (skipping the GIoU/Mahalanobis math entirely); surviving pairs build a track box dict {x,y from kf state, z/l/w/h/theta from the track auxiliaries} and score via cost_3d_giou(t_box, d) + cost_mahalanobis_velocity(d, t). Verified: python -m py_compile OK, and the dependency-free --selftest still passes (identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833) — filterpy/numpy are absent in this host env, which is fine since the runtime tracking path only executes inside the Docker container at Step 6. Steps 4 (CTRV UKF) and 5 (per-class hyperparameters) remain coding-only and Step 6 is the first Docker step, so status stays `executing`. Next: Step 4 (swap CV KalmanFilter → CTRV UnscentedKalmanFilter).
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
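The gated, class-weighted association can be sketched in vectorized form. The sketch makes four assumptions:

- Rows are tracks and columns are detections.
- The GIoU, squared-Mahalanobis and center-distance matrices are precomputed. The real loop instead skips the expensive terms for gated pairs.
- scipy's `linear_sum_assignment` stands in as the Hungarian solver. The log does not say which solver run_tracking.py calls.
- `blended_cost` and `associate` are illustrative names.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE_COST = 1e6  # sentinel for pairs rejected by the center-distance pre-gate
CLASS_WEIGHTS = {  # (w_giou, w_vel); each pair sums to 1
    "car": (0.4, 0.6), "truck": (0.4, 0.6), "bus": (0.4, 0.6),
    "pedestrian": (0.6, 0.4),
    "bicycle": (0.7, 0.3), "motorcycle": (0.7, 0.3),
}
DEFAULT_WEIGHTS = (0.5, 0.5)


def blended_cost(giou, d2, cls):
    """w_giou * 0.5 * (1 - giou) + w_vel * (1 - exp(-d2 / 2)); never exceeds 1."""
    w_giou, w_vel = CLASS_WEIGHTS.get(cls, DEFAULT_WEIGHTS)
    return w_giou * 0.5 * (1.0 - giou) + w_vel * (1.0 - np.exp(-d2 / 2.0))


def associate(center_dist, giou, d2, gate, cls):
    """Match rows (tracks) to columns (detections); return accepted (r, c) pairs."""
    cost = np.full(center_dist.shape, GATE_COST)
    ok = center_dist <= gate                     # cheap pre-gate
    cost[ok] = blended_cost(giou[ok], d2[ok], cls)
    rows, cols = linear_sum_assignment(cost)     # optimal one-to-one assignment
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] < GATE_COST]


if __name__ == "__main__":
    dist = np.array([[0.5, 10.0], [10.0, 0.8]])
    giou = np.array([[0.8, -0.5], [-0.5, 0.6]])
    d2 = np.array([[0.1, 5.0], [5.0, 0.2]])
    print(associate(dist, giou, d2, gate=2.0, cls="car"))  # [(0, 0), (1, 1)]
```

Because every real cost is at most 1, a solver-selected pair that still carries the 1e6 sentinel is unambiguously a gated pair. That is why the `< 1e6` acceptance check keeps working after the blend.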
iter 4 · executing → executing · step-2-mahalanobis-velocity · 3h ago
Executed Step 2 (coding only — no Docker). Added the velocity-aware Mahalanobis association cost to src/measured/run_tracking.py. New module constant VEL_COV=[[1,0],[0,1]] (m²/s², isotropic, tunable in the Step 7 sweep). cost_mahalanobis_velocity(det, track, cov=None) forms the residual r=[det.vx-track_vx, det.vy-track_vy], computes d²=rᵀΣ⁻¹r via numpy's linalg.inv, and squashes to a bounded cost 1-exp(-d²/2) ∈ [0,1) so it adds cleanly to the 0.5*(1-giou) GIoU term in step_per_class (Step 3). Added a _track_pred_velocity(track) helper that reads the track's predicted velocity directly from the filter state and is agnostic to motion model: CV (dim_x==4, [x,y,vx,vy]) returns x[2],x[3]; CTRV (dim_x==5, [x,y,theta,v,omega]) returns v·cosθ,v·sinθ — so the upcoming Step 4 CTRV swap needs no edit here. Kept numpy-based (runtime path already imports numpy); the --selftest geometry path is untouched and still dependency-free. Verified with python -m py_compile: COMPILE OK. Steps 3-5 remain coding-only and Step 6 is the first Docker step, so status stays `executing`. Next: Step 3 (blend GIoU+velocity cost in step_per_class with per-class weights).
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
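The velocity term and its motion-model-agnostic velocity read-out can be sketched as follows. The sketch makes four assumptions:

- The track is passed as its raw state vector.
- Dispatching on the vector's length stands in for checking filterpy's `dim_x`.
- Detections are dicts with `vx`/`vy` keys.
- `np.linalg.solve` is used instead of the `linalg.inv` mentioned in the log. It is mathematically equivalent and avoids forming the inverse.

Function names are illustrative.

```python
import numpy as np

VEL_COV = np.eye(2)  # isotropic 1 m^2/s^2, as in the log


def track_pred_velocity(state):
    """Predicted (vx, vy) from a CV [x,y,vx,vy] or CTRV [x,y,theta,v,omega] state."""
    s = np.asarray(state, dtype=float).ravel()   # accepts 1-D or column vectors
    if s.size == 4:                              # constant velocity
        return s[2], s[3]
    if s.size == 5:                              # CTRV: project speed onto heading
        theta, v = s[2], s[3]
        return v * np.cos(theta), v * np.sin(theta)
    raise ValueError(f"unsupported state size {s.size}")


def cost_mahalanobis_velocity(det, state, cov=None):
    """Bounded cost 1 - exp(-d2/2), with d2 = r^T cov^-1 r on the velocity residual."""
    cov = VEL_COV if cov is None else cov
    tvx, tvy = track_pred_velocity(state)
    r = np.array([det["vx"] - tvx, det["vy"] - tvy])
    d2 = float(r @ np.linalg.solve(cov, r))      # same value as r @ inv(cov) @ r
    return 1.0 - np.exp(-d2 / 2.0)


if __name__ == "__main__":
    ctrv_state = [0.0, 0.0, 0.0, 3.0, 0.0]       # heading 0, speed 3 m/s
    print(cost_mahalanobis_velocity({"vx": 4.0, "vy": 0.0}, ctrv_state))  # 1 - exp(-0.5)
```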
iter 3 · executing → executing · step-1-3dgiou · 3h ago
Executed Step 1 (coding only — no Docker). Added the 3D generalized-IoU cost and supporting geometry helpers to src/measured/run_tracking.py: cost_3d_giou(box_a, box_b) returns giou ∈ [-1,1] for two oriented boxes (x,y,z,l,w,h,theta); the additive cost used later by step_per_class is 0.5*(1-giou). BEV intersection is computed exactly via Sutherland–Hodgman clipping of one rotated rectangle by the other (_bev_corners, _signed_area, _poly_area, _ensure_ccw, _clip_polygon) multiplied by the 1-D z-overlap; the GIoU penalty uses the axis-aligned enclosing box (standard rotated-GIoU approximation). Deliberately kept it self-contained (math only) rather than adding shapely, so the container image is unchanged and no dependency-approval gate is triggered. Added module-level `import math` and a _selftest() reachable via `python run_tracking.py --selftest`, and widened the missing-deps import guard so --selftest runs even without numpy/filterpy/scipy (the geometry path needs none of them). Verified locally: selftest PASS — identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833, with monotonic ordering disjoint<partial<identical. More coding steps remain (Steps 2-5 are still coding-only, Step 6 is the first Docker step), so status stays `executing`. Next: Step 2 (cost_mahalanobis_velocity).
Modified: algorithm/pose-tracking/src/measured/run_tracking.py
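The dependency-free 3D GIoU can be sketched in pure Python. The sketch assumes:

- boxes are dicts with keys x, y, z, l, w, h, theta;
- z is the box center (the nuScenes convention);
- the enclosing volume is axis-aligned in BEV, as the log states.

The function names are illustrative and are not claimed to match the file's `_clip_polygon` and related helpers. With the example boxes below (4×2×1.5 m, the disjoint one centred at (100, 100)), the printed values happen to equal the four selftest numbers logged in this entry. The original selftest boxes are not recorded here.

```python
import math


def bev_corners(x, y, l, w, theta):
    """Corners of an oriented BEV rectangle, counter-clockwise."""
    c, s = math.cos(theta), math.sin(theta)
    local = [(l / 2, w / 2), (-l / 2, w / 2), (-l / 2, -w / 2), (l / 2, -w / 2)]
    return [(x + c * dx - s * dy, y + s * dx + c * dy) for dx, dy in local]


def poly_area(poly):
    """Shoelace area of a simple polygon (absolute value)."""
    return 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1])))


def clip_polygon(subject, clipper):
    """Sutherland-Hodgman: clip `subject` by the convex, CCW polygon `clipper`."""
    def inside(p, a, b):  # p lies on or left of the directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def cross_point(p, q, a, b):  # where segment p-q meets the line through a-b
        d1 = (q[0] - p[0], q[1] - p[1])
        d2 = (b[0] - a[0], b[1] - a[1])
        t = (d2[0] * (a[1] - p[1]) - d2[1] * (a[0] - p[0])) / (d2[0] * d1[1] - d2[1] * d1[0])
        return (p[0] + t * d1[0], p[1] + t * d1[1])

    out = list(subject)
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        if not out:  # fully clipped away: no overlap
            break
        inp = out
        out = []
        prev = inp[-1]
        for cur in inp:
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    out.append(cross_point(prev, cur, a, b))
                out.append(cur)
            elif inside(prev, a, b):
                out.append(cross_point(prev, cur, a, b))
            prev = cur
    return out


def giou_3d(a, b):
    """Generalized 3D IoU in [-1, 1] for two oriented boxes."""
    pa = bev_corners(a["x"], a["y"], a["l"], a["w"], a["theta"])
    pb = bev_corners(b["x"], b["y"], b["l"], b["w"], b["theta"])
    inter_poly = clip_polygon(pa, pb)
    inter_bev = poly_area(inter_poly) if len(inter_poly) >= 3 else 0.0
    a_lo, a_hi = a["z"] - a["h"] / 2, a["z"] + a["h"] / 2
    b_lo, b_hi = b["z"] - b["h"] / 2, b["z"] + b["h"] / 2
    inter = inter_bev * max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    union = a["l"] * a["w"] * a["h"] + b["l"] * b["w"] * b["h"] - inter
    xs = [p[0] for p in pa + pb]
    ys = [p[1] for p in pa + pb]
    enclosing = (max(xs) - min(xs)) * (max(ys) - min(ys)) * (max(a_hi, b_hi) - min(a_lo, b_lo))
    return inter / union - (enclosing - union) / enclosing


if __name__ == "__main__":
    box = dict(x=0.0, y=0.0, z=0.0, l=4.0, w=2.0, h=1.5, theta=0.0)
    cases = [("identical", box), ("disjoint", {**box, "x": 100.0, "y": 100.0}),
             ("partial", {**box, "x": 2.0}), ("rotated90", {**box, "theta": math.pi / 2})]
    for name, other in cases:
        print(f"{name}={giou_3d(box, other):.4f}")
    # identical=1.0000, disjoint=-0.9985, partial=0.3333, rotated90=0.0833
```

One consequence of the axis-aligned enclosing box: two identical boxes rotated by, say, 45° score below 1, because the enclosure is larger than the boxes themselves. This is the usual trade-off of the rotated-GIoU approximation.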
iter 2 · planning → executing · step-0-makefile-infra · 4h ago
Executed Step 0 (infra, coding/filesystem only — no Docker). Made the Makefile's run id overridable: `RUN_ID := MEASURED-001` → `RUN_ID ?= MEASURED-001`, and parametrized the previously-hardcoded `artifact` target output path to `centerpoint-ab3dmot-nusc-mini-$(RUN_ID).json` so `make track/eval/artifact RUN_ID=MEASURED-002` writes to its own run dir without clobbering the MEASURED-001 baseline. Then seeded the new run dir for this tracker-only mission: created runs/MEASURED-002/, copied runs/MEASURED-001/detections.pkl into it (2.1 MB CenterPoint outputs reused verbatim — no detector retrain), and touched the upstream markers in dependency order (.marker.data, .marker.weights, .marker.detect) so a later `make track RUN_ID=MEASURED-002` resolves the detect prerequisite as up-to-date and skips the ~5 min inference + ~10 min data download. Note: the baseline MEASURED-001 dir had no marker files on disk, so I created fresh empty markers rather than copying. No track/eval markers were created, so the tracking + eval steps will still run once the code is changed. Next: Step 1 (cost_3d_giou in run_tracking.py).
Modified: algorithm/pose-tracking/Makefile
iter 1 · queued → planning · elaborate-code-plan · 4h ago
Read run_tracking.py and the Makefile to ground the 8 abstract steps into concrete file changes. Key decisions: (a) the change is tracker-side-only, so I reuse runs/MEASURED-001/detections.pkl rather than rerunning the ~5 min CenterPoint inference, and parametrize the Makefile's hardcoded RUN_ID (RUN_ID ?= MEASURED-001) so MEASURED-002 writes to its own run dir without clobbering the baseline (added as Step 0). (b) The current Track uses a 4-state CV filterpy.KalmanFilter and step_per_class uses a scalar center-distance gate with global MIN_HITS=3/MAX_AGE=2 — so the CTRV swap (Step 4) moves to an UnscentedKalmanFilter with [x,y,theta,v,omega], the cost (Steps 1-3) threads `cls` through step_per_class and blends GIoU+velocity-Mahalanobis behind the existing distance pre-gate, and per-class hyperparameters (Step 5) replace the two module-level constants. (c) Step 6 is the first Docker-touching step (make track + make eval RUN_ID=MEASURED-002); all edits in Steps 0-5 are coding-only. Verified lead hyunsu-kim, baseline artifact MEASURED-001, and pillar pose-tracking all exist on disk.

Results

2 artifacts produced · centerpoint-ab3dmot-nusc-mini-MEASURED-002 is the result of record

centerpoint-ab3dmot-nusc-mini-MEASURED-002

measured

13h ago

Metrics

Metric	Value
AMOTA	0.4228
AMOTP	1.085
IDS	69
FRAG	40
MOTA	0.4054
MOTP	0.4174
RECALL	0.4992

Notes

MEASURED on real nuScenes v1.0-mini val (mini_val split: scene-0103 + scene-0916). CenterPoint pretrained ckpt (centerpoint_02pillar_second_secfpn_circlenms_4x8_cyclic_20e_nus) + AB3DMOT (Weng IROS 2020) with default per-class center-distance gates. All metrics are direct outputs of nuscenes-devkit TrackingEval (config: tracking_nips_2019). Compare to simulated counterpart `centerpoint-ab3dmot-nusc-mini-001` for delta analysis.

AMOTA · per class (6) · ↑ better · worst first

Class	AMOTA
bicycle	0.0000
motorcycle	0.2828
pedestrian	0.3076
bus	0.6271
car	0.6509
truck	0.6688

centerpoint-ab3dmot-nusc-mini-MEASURED-001

measured

16h ago

Metrics

Metric	Value
AMOTA	0.4710
AMOTP	0.9774
IDS	33
FRAG	22
MOTA	0.4684
MOTP	0.3370
RECALL	0.5629

Notes

AMOTA · per class (6) · ↑ better · worst first

Class	AMOTA
bicycle	0.0750
motorcycle	0.3373
bus	0.5033
pedestrian	0.5804
car	0.6642
truck	0.6657

Honest caveats

Single-fixture eval: only 2 nuScenes mini_val scenes (scene-0103 + scene-0916), so AMOTA on this support is high-variance, especially for classes with few ground-truth boxes (bicycle ~41 GT, motorcycle).
The 0.4228 -> 0.4028 dip from the Step-6c score_thr change is within plausible run-to-run/threshold-sensitivity noise on 2 scenes and should not be over-read as a clean causal signal.
The measured artifact JSON was last rebuilt at iteration 17 (AMOTA 0.4228); the live metrics.json reflects the final iteration-20 re-run (AMOTA 0.4028), which is the authoritative achieved value reported here.
Root cause of the recall loss (RECALL ≈0.50 in the iteration-17 artifact, vs 0.5629 for the MEASURED-001 baseline and 0.668 in the first MEASURED-002 run at iteration 12) is attributed to the new GIoU+Mahalanobis cost and/or the CTRV motion model, but it was NOT isolated by an ablation. The planned Step-7 sweep and the CV-vs-CTRV and cost-vs-baseline ablations were never run because the budget was exhausted.