BRIDGE (Behavior Ranking In Diffusion-Generated Embeddings) is an offline-RL stack I built around the question “what does a latent-diffusion policy actually need from its conditioning channel?”. The pipeline is a β-VAE skill encoder feeding a latent DDPM (with min-SNR-γ), an XQL batch-constrained critic, and a sample-and-rank deployment policy. The headline result on D4RL Franka-Kitchen partial-v2 is 75.0% task completion, +16.7 pts over BCQ, with zero per-seed variance across 100 evaluation episodes.
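
For readers who have not met min-SNR-γ before, the following is a minimal sketch of the loss weighting it refers to, written in plain Python so it runs without PyTorch. It assumes an ε-prediction objective, a linear β schedule, T = 1000 and γ = 5; none of these settings are stated here, and the function names are hypothetical rather than taken from the BRIDGE codebase.

```python
from typing import List


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linear noise schedule (an assumed schedule, not confirmed for BRIDGE)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def snr_schedule(betas: List[float]) -> List[float]:
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t), with alpha_bar_t the running product of (1 - beta)."""
    snrs = []
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


def min_snr_weights(snrs: List[float], gamma: float = 5.0) -> List[float]:
    """Per-timestep loss weight for epsilon-prediction: min(SNR, gamma) / SNR."""
    return [min(snr, gamma) / snr for snr in snrs]


betas = linear_betas(num_steps=1000)
snrs = snr_schedule(betas)
weights = min_snr_weights(snrs, gamma=5.0)

# Low-noise steps (large SNR) are down-weighted; high-noise steps keep weight 1.
print(f"w[0]={weights[0]:.6f}  w[-1]={weights[-1]:.1f}")
```

The clamp keeps the nearly noise-free timesteps from dominating the gradient, which is the usual motivation for this weighting.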

Demo

The grid below shows six independent evaluation rollouts on the same partial-kitchen seed. The top row shows the winning configurations, built on the two Pareto-optimal recipes (state-decoder auxiliary head, L2-normalized goal-qpos conditioning). The bottom row shows ablated and refuted variants; they are included so you can watch the policy fail in the ways the design space predicts it should fail.

  • GQ-v2 goalqpos-N + SD: winner, ret 1049
  • GQ-hybrid + SD: winner, ret 1043
  • taskmask + SD: winner, ret 1042
  • baseline (no SD): ablated, ret 526
  • sparse-H20-β0.5: refuted, ret 417
  • steps=500, T=500: refuted, ret 417

Synopsis

Built BRIDGE, an offline reinforcement-learning stack composed of a β-VAE skill encoder, a latent diffusion model (DDPM with min-SNR-γ), a batch-constrained Q-critic (XQL), and a sample-and-rank deployment policy. Designed and ablated 12 architectural variants on D4RL/kitchen/partial-v2, identifying two Pareto-optimal recipes:

  • SD (state-decoder auxiliary head): 3.00 / 4 tasks with zero per-seed variance.
  • GQ-v2 (L2-normalized goal-qpos conditioning): the only recipe that surfaces 4-task completions.
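
The last stage of the stack described above, sample-and-rank deployment, can be sketched roughly as follows: draw several candidate latents from the generative policy, score each with the critic, and act on the best one. This is a toy illustration in plain Python. The Gaussian sampler and distance-based critic are stand-ins for the latent DDPM and the XQL critic, and every name and the candidate count are assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Latent = List[float]
State = Sequence[float]


def sample_and_rank(
    state: State,
    sample_latent: Callable[[State], Latent],
    q_value: Callable[[State, Latent], float],
    num_candidates: int = 16,
) -> Tuple[Latent, float]:
    """Draw candidates from the policy and return the one the critic scores highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [sample_latent(state) for _ in range(num_candidates)]
    scores = [q_value(state, z) for z in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best], scores[best]


# Toy stand-ins (hypothetical): a Gaussian "policy" and a critic that
# prefers latents close to a fixed target.
rng = random.Random(0)
TARGET = [0.5, -0.25]


def toy_sampler(state: State) -> Latent:
    return [rng.gauss(0.0, 1.0) for _ in TARGET]


def toy_critic(state: State, z: Latent) -> float:
    return -sum((a - b) ** 2 for a, b in zip(z, TARGET))


state = [0.0, 0.0, 0.0]
best_latent, best_score = sample_and_rank(state, toy_sampler, toy_critic, num_candidates=32)
print(best_latent, best_score)
```

In a real deployment the chosen latent would then be decoded into an action chunk by the skill decoder; that step is omitted here.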

Authored a modular, registry-driven codebase (conditioning heads, auxiliary losses, chunk-reweighting, bootstrap targets all swappable from a config), 354 passing unit tests, and full MLflow tracking. Along the way, diagnosed and patched a fork-after-CUDA DataLoader deadlock that had been silently stalling parallel jobs. Documented the campaign in a 12-section forensic report and a Manim animated walkthrough that takes the viewer from encoder → ELBO → composite kitchen loss → downstream propagation.
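
As an illustration of what "registry-driven" can mean in practice, here is a minimal sketch of a decorator-based registry that lets a config entry select a conditioning head by name. The `Registry` class, the head names and the config layout are all hypothetical; the actual BRIDGE registry may be organized differently.

```python
import math
from typing import Any, Callable, Dict, List


class Registry:
    """Maps string names (as they would appear in a config) to builder functions."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._builders: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(builder: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._builders:
                raise KeyError(f"{self.kind} {name!r} is already registered")
            self._builders[name] = builder
            return builder
        return decorator

    def build(self, name: str, **kwargs: Any) -> Any:
        if name not in self._builders:
            raise KeyError(f"unknown {self.kind} {name!r}; available: {sorted(self._builders)}")
        return self._builders[name](**kwargs)


CONDITIONING_HEADS = Registry("conditioning head")


@CONDITIONING_HEADS.register("identity")
def build_identity() -> Callable[[List[float]], List[float]]:
    return lambda goal: list(goal)


@CONDITIONING_HEADS.register("goal_qpos_l2")
def build_goal_qpos_l2(eps: float = 1e-8) -> Callable[[List[float]], List[float]]:
    """L2-normalize the goal vector before it is used for conditioning."""
    def head(goal: List[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in goal))
        return [x / (norm + eps) for x in goal]
    return head


# Swapping the head is a one-line config change (hypothetical config layout).
config = {"conditioning_head": {"name": "goal_qpos_l2", "kwargs": {"eps": 1e-8}}}
spec = config["conditioning_head"]
head = CONDITIONING_HEADS.build(spec["name"], **spec["kwargs"])
print(head([3.0, 4.0]))
```

The same pattern would extend to auxiliary losses, chunk-reweighting schemes and bootstrap targets by creating one registry per component type.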

What landed (highlights)

  • +16.7 success-rate points over BCQ baseline (58.3 → 75.0 %) on D4RL Franka-Kitchen partial-v2; zero per-seed variance across 100 evaluation episodes.
  • Implemented β-VAE + latent DDPM + XQL critic + sample-and-rank deployment as a modular, registry-driven Python package (PyTorch, uv, MLflow, Manim).
  • Ran an exhaustive 12-variant architectural ablation; documented the refuted levers (chunk reweighting, skill-prior regularizer, classifier-free guidance, priority-ordered conditioning) alongside the two Pareto keepers.
  • Showed empirically that disambiguating the conditioning channel eliminates rare-event opportunism, a generalizable result about conditioning geometry in latent-diffusion policies.
  • Production engineering: 354 passing tests, fork-after-CUDA DataLoader fix, signature-keyed cache invalidation, MLflow per-seed tagging for reproducible filtering.
  • Authored a pedagogical Manim animation explaining the architecture end-to-end (encoder → ELBO → composite kitchen loss → downstream propagation).
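
One plausible reading of the "signature-keyed cache invalidation" item above is that each cached artifact is stored with a hash of the config fields that produced it, so a change to any of those fields forces a recompute while unrelated changes still hit the cache. The sketch below illustrates that idea with an in-memory store. The field names, values and class are hypothetical and not taken from the BRIDGE codebase.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Iterable, Tuple


def config_signature(config: Dict[str, Any], keys: Iterable[str]) -> str:
    """Stable short hash over only the config fields that affect the artifact."""
    relevant = {key: config[key] for key in keys}
    payload = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class SignatureCache:
    """Keeps one value per artifact name and recomputes when the signature changes."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, Any]] = {}

    def get_or_compute(self, name: str, signature: str, compute: Callable[[], Any]) -> Any:
        entry = self._entries.get(name)
        if entry is not None and entry[0] == signature:
            return entry[1]  # cache hit: same inputs as last time
        value = compute()  # cache miss or stale entry
        self._entries[name] = (signature, value)
        return value


# Hypothetical fields that would influence cached skill latents.
ENCODER_KEYS = ("latent_dim", "beta", "chunk_len")
config = {"latent_dim": 16, "beta": 0.5, "chunk_len": 20, "lr": 3e-4}

cache = SignatureCache()
calls = []


def encode_dataset() -> str:
    calls.append(1)
    return f"latents#{len(calls)}"


first = cache.get_or_compute("latents", config_signature(config, ENCODER_KEYS), encode_dataset)
config["lr"] = 1e-4  # not part of the signature, so the cache is reused
second = cache.get_or_compute("latents", config_signature(config, ENCODER_KEYS), encode_dataset)
config["beta"] = 1.0  # part of the signature, so the entry is recomputed
third = cache.get_or_compute("latents", config_signature(config, ENCODER_KEYS), encode_dataset)
print(first, second, third)
```

An on-disk variant would use the signature as part of the file name instead of a dictionary key, but the invalidation logic stays the same.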

Stack

PyTorch · Diffusion models · Offline RL · D4RL / Minari / gymnasium-robotics · MLflow · uv · CUDA / multiprocessing · Reproducible experiment design · Architectural ablation methodology