Trajectories and open problems
Multimodal learning · vision · diffusion. 10+ pubs at CVPR · NeurIPS · ICLR · ECCV. Visiting scholar at CMU LTI (2020).
Iruda (LLM dialogue) · Santa TOEIC (vision tutor) · wrtn + crack (multimodal agents) · now WoRV (robotics FM).
Korea-based RFM team — full stack: data · policy · world models. Returns in Ch8.
CH 1 · What "foundation" means in robotics
CH 2 · How do we tokenize action?
CH 3 · How do we collect and grow action data?
CH 4 · How does a VLM become a VLA?
CH 5 · Speed and reasoning, at the same time
CH 6 · Can a model learn the world's dynamics?
CH 7 · Does "knowing" yield zero-shot policies?
CH 8 · How we're working on this from Korea
What does "foundation" actually mean in robotics?
Robotics has been "about to break out" for decades. The honest case for this wave is that four enablers are arriving in parallel.
PaliGemma, DINOv3, SigLIP, CLIP — pretrained backbones now exist that can be fine-tuned for embodied tasks instead of trained from scratch.
∆ action grounding still doesn't inherit cleanly
Open hardware (ALOHA, UMI) + open datasets (OpenX 1M+ episodes, DROID 76k) + synthetic pipelines (Cosmos-Transfer) — the field finally has a data stack.
∆ still ~10⁶× short of LLM token counts
The 2024-26 wave of physical platforms gives the field a shared body to train on — Figure F.03, Unitree G1/H1, 1X NEO, Tesla Optimus, Rainbow Robotics RB-Y1.
∆ cost & dexterity still bottleneck deployment
$5B+ raised across Physical Intelligence, Skild, Figure, Wayve in 2024-25. National programs in KR, China, US, EU. Top NLP researchers explicitly pivoting.
∆ signal-to-noise still messy; many bets won't survive
Already proven in NLP / Vision — next slide: what breaks when you copy this to robotics?
When we say a robot "generalizes," we always mean along one of these three axes.
New verbs in a known environment. "fold the towel" vs "roll the towel."
flagship — RT-2 (semantic reasoning)
Same task in unseen rooms, lighting, objects. π0.5 hits 94% follow-rate in new homes.
flagship — π0.5 (OOD home)
Brand-new hardware, zero-shot. DreamZero adapts to a YAM arm with 30 min of play.
flagship — DreamZero · GR00T N1
Unlike internet text, action data doesn't grow on its own. RT-1 ~130k episodes vs LLM ~trillions of tokens — a 5-6 order-of-magnitude gap.
So we need data tricks to scale — Ch3 is entirely about this problem.
130k episodes, discretized actions · web knowledge → actions · 7B open, OpenX 970k · flow-matching action expert · S2 7B + S1 80M @ 200 Hz · open humanoid stack · unseen home, 94% follow · joint video + action generation
Fan's framework: "The Great Parallel" — robotics is replaying the GPT trajectory (pretrain → align → reason → auto-research). The three concrete unlocks he names for the pretrain stage are exactly the spine of this talk.
Text, image, audio — you already know how to tokenize. What about action?
| Modality | Canonical tokenizer | Example model |
|---|---|---|
| Text | BPE / SentencePiece | GPT, Llama |
| Image | ViT patches / VQ-VAE | CLIP, LLaVA, Chameleon |
| Audio | Mel-spec / EnCodec | Whisper, AudioLM |
| Video | Tubelets / latent frames | VideoMAE, Sora |
| ACTION | — today's question — | RT-1, π0, DreamZero |
We are adding one more row to a table the Week 10 deck already filled in.
Image ↔ text tokens trained with contrastive loss — the canonical proof that any modality fits the "tokenize & predict" recipe.
Looks like a simple continuous vector. Hides embodiment dependence — the bill we pay later.
Cross-entropy "just works" on 50k tokens. Joint angles live in ℝⁿ; we must invent the alphabet.
The same command on the same scene yields a different outcome — mass, friction, contact, noise.
10¹³ tokens vs 10⁶ teleop episodes: there is no internet for tying shoelaces.
Mistakes have physical cost — safety, supervision, evaluation all become first-class problems.
"No internet for robot actions" · 02:08–03:32
Slice each continuous dim into 256 bins, treat each bin as a vocab token, predict with cross-entropy — same loop as an LLM.
flagship — RT-1, RT-2, OpenVLA
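A minimal sketch of the bin-discretization idea above, assuming a fixed per-dimension range and uniform bins. The range, the 3-DoF example action and the function names are illustrative assumptions, not the exact scheme of RT-1, RT-2 or OpenVLA. The resulting integer ids are what a model would predict with cross-entropy.

```python
NUM_BINS = 256  # bins per action dimension, as stated on the slide

def action_to_tokens(action, low, high, num_bins=NUM_BINS):
    """Map each continuous dimension to an integer bin id in [0, num_bins - 1]."""
    tokens = []
    for x, lo, hi in zip(action, low, high):
        x = min(max(x, lo), hi)               # clip to the assumed range
        frac = (x - lo) / (hi - lo)           # normalize to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens

def tokens_to_action(tokens, low, high, num_bins=NUM_BINS):
    """Decode each bin id back to the center of its bin."""
    return [lo + (t + 0.5) * (hi - lo) / num_bins
            for t, lo, hi in zip(tokens, low, high)]

# Hypothetical 3-DoF action with an assumed symmetric range per dimension.
low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
action = [0.25, -1.0, 1.0]
tokens = action_to_tokens(action, low, high)
decoded = tokens_to_action(tokens, low, high)
print(tokens, decoded)
```

The round trip loses at most half a bin width per dimension, which is the precision price of turning regression into classification.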
Keep the action continuous; bolt a separate diffusion / flow head on top of the VLM that denoises a whole action chunk at once.
flagship — Diffusion Policy, ACT, π0
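A toy sketch of the continuous-chunk idea above, using flow matching with a straight-line path between noise and the action chunk. The linear path, the Euler sampler, the step count and the oracle velocity are assumptions chosen for illustration. A real action head would be a network conditioned on VLM features, and π0's exact conventions may differ.

```python
import random

def make_training_pair(action_chunk, rng):
    """Build one flow-matching example: (noisy chunk, time, target velocity)."""
    noise = [rng.gauss(0.0, 1.0) for _ in action_chunk]
    t = rng.random()                                   # time in [0, 1)
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action_chunk)]
    target_v = [a - n for n, a in zip(noise, action_chunk)]
    return x_t, t, target_v

def sample_chunk(velocity_fn, dim, rng, num_steps=10):
    """Euler-integrate dx/dt = v(x, t) from noise (t=0) to a chunk (t=1)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical flattened action chunk (e.g. 4 future values of one joint).
demo_chunk = [0.1, 0.2, 0.3, 0.4]

def oracle_velocity(x, t):
    """Stand-in for a trained head: exact velocity toward one known chunk."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(demo_chunk, x)]

rng = random.Random(0)
x_t, t, target_v = make_training_pair(demo_chunk, rng)
sampled = sample_chunk(oracle_velocity, len(demo_chunk), rng)
print(sampled)
```

The whole chunk is denoised jointly in a handful of integration steps, rather than emitted one token at a time.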
EfficientNet → tokens → action tokens · 02:07–05:17
"Diffusion policy = sequence of future actions" · 20:02–21:05
(1) chunk → (2) DCT → (3) quantize → (4) flatten low-freq first → (5) BPE compress
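Steps (1) to (4) of the pipeline above can be sketched in a few lines. This is an illustration under stated assumptions, not the FAST reference tokenizer: the quantization scale, the hand-written orthonormal DCT and the toy two-dimensional chunk are all hypothetical choices. Step (5), BPE, is left out. It would run over the integer stream produced here.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of one action dimension over the chunk."""
    n = len(signal)
    return [math.sqrt((1 if k == 0 else 2) / n)
            * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                  for i, x in enumerate(signal))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(c * math.sqrt((1 if k == 0 else 2) / n)
                * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

def encode_chunk(chunk, scale=10.0):
    """chunk: list of timesteps, each a list of dims -> flat integer tokens."""
    dims = list(zip(*chunk))                               # (1) per-dim series
    coeffs = [dct(list(d)) for d in dims]                  # (2) DCT
    quant = [[round(c * scale) for c in cs] for cs in coeffs]  # (3) quantize
    # (4) flatten low frequencies first, interleaving the dimensions
    return [quant[d][k] for k in range(len(chunk)) for d in range(len(dims))]

def decode_tokens(tokens, num_dims, scale=10.0):
    horizon = len(tokens) // num_dims
    coeffs = [[tokens[k * num_dims + d] / scale for k in range(horizon)]
              for d in range(num_dims)]
    return [list(step) for step in zip(*(idct(cs) for cs in coeffs))]

# Hypothetical smooth chunk: 8 timesteps, a ramp and a constant dimension.
chunk = [[0.1 * t, 0.5] for t in range(8)]
tokens = encode_chunk(chunk)
recon = decode_tokens(tokens, num_dims=2)
print(tokens)
```

For smooth motion most high-frequency coefficients round to zero, so the tail of the token list is highly repetitive, which is the kind of redundancy a BPE pass can compress.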
| Modality | Canonical tokenizer | Example model |
|---|---|---|
| Text | BPE / SentencePiece | GPT, Llama |
| Image | ViT patches / VQ-VAE | CLIP, LLaVA |
| Audio | Mel-spec / EnCodec | Whisper |
| Video | Tubelets / latent frames | Sora |
| ACTION | (A) Bin discretization · 256/dim | RT-1, RT-2, OpenVLA |
| | (B) Continuous chunk + diffusion / flow | ACT, Diffusion Policy, π0 |
| | (C) FAST · DCT → BPE re-discretize | π0-FAST |
Backbone (Transformer / VLM) × Action head (Discrete token / Diffusion / Flow / DiT) — 7 known combinations
If actions don't grow on the internet, how do we scale them?
Plus OpenX-Embodiment as the union of nearly everyone's lab data — see slide 29.
Open-source bimanual leader-follower rig, under $20k. Paired with ACT (Action Chunking Transformer).
Wheeled base + ALOHA arms → whole-home tasks (cooking, laundry, elevator).
v2 hardware + sim · Unleashed = diffusion + lots of teleop → shoelace tying, gear insertion.
GoPro + handheld gripper → robot-compatible data collected anywhere (home, restaurant, outdoors). Cuts the per-episode cost by an order of magnitude.
Same UMI team productized it: a $200 wearable glove + Memo humanoid. 2,000+ gloves shipped, 10M household-chore episodes from 500 homes.
3,670 hours of first-person video from 923 participants across 74 cities, 9 countries. The biggest single ego-video pool.
Paired ego + exo third-person views of the same activity — supplies the alignment needed for body / hand transfer.
Aria glasses + bimanual robot, co-trained on paired human + robot data.
arXiv:2410.24221
Pretrain a VLA on human video, fine-tune on a tiny robot set.
arXiv:2507.12440
Unitree H1 shadows humans from 3rd-person video. Boxing, piano, table tennis.
arXiv:2406.10454
GPU-accelerated physics sim. Thousands of parallel envs, randomized scenes / lighting / textures.
100k procedurally-generated kitchen tasks — appliances, layouts, AI textures.
World model generates trajectories — 3 months → 36 hours of human data for GR00T N1.5.
Not a new collection method — a convention that aligns 60+ labs' formats so they're trainable together.
Almost every SOTA jump came from new data, not a bigger model.
Next chapter — how do we actually pour this data into a model?
If a VLM outputs text, what does it take to make it output an action?
Same vision encoder. Same LM backbone. Different decoder.
LLaVA, PaLI, PaliGemma, Chameleon — the row in last week's table.
RT-2, OpenVLA, π0, GR00T — the only structural change is the right-hand block.
The paper that made co-fine-tuning on web data + robot data the default.
Robot picks the "extinct animal" toy · never trained on the word "dinosaur" (rt2_teaser.mp4)
The moment academia could reproduce a SOTA VLA on its own GPUs.
7B params · LoRA fine-tunes on a single 24 GB GPU · multi-embodiment generalist
OpenVLA is the first direct payoff of the OpenX union. 970k episodes from 22 embodiments — the dataset Ch3 ended on — trained in one pass on one open model.
Same problem, different action head: diffusion vs. block-wise transformer vs. dual-system.
Weights + code + dataset recipes · LoRA-finetune on one GPU.
Empty cell — if it comes from a university, the incentive is to release.
2025-26 shift — openpi (Feb '25) + π0.5 PyTorch (Sep '25) + GR00T Isaac repo (N1→N1.7) + DreamZero, InternVLA-A1, LingBot all open in 2026 H1. Industry releasing weights for ecosystem leverage.
Paper or blog + demo videos; weights are not released even when the tech report is public (π0.6/0.7). Vertically integrated bet: Helix, GENE-26.5.
VLM limits + robotics-native limits = the four open problems of 2026.
Threading a needle, plugging in a USB connector, inserting a key: sub-millimeter force-modulated contact is where smooth VLA rollouts fall off a cliff.
"Make breakfast" — 30 sub-tasks, recovery from a dropped egg, no global plan to fall back on.
VLAs drift after ~30s of autonomous rollout.
Trained on Franka + ALOHA → deployed on a new arm with different DoF, gripper, joint limits.
Zero-shot collapse is near-universal.
RT-2 runs at ~1 Hz; OpenVLA ~5 Hz.
Reactive contact and dynamic motion need 30–200 Hz. Big VLM = smart but slow.
The field splits along the Kahneman line. Both halves get their own chapter.
Take the action head off the LM's critical path. A small, fast expert runs at 30–200 Hz; the VLM only steers it.
Externalize reasoning. Give the VLM a world to roll forward, a plan to follow, a video to imagine.
Y-axis: chronology · X-axis: architectural lineage (CNN → Transformer → VLM → Diffusion / DiT → Flow → Latent action → Hierarchical)
Each x-column is a family of decoder choices. Same vision input, completely different action head — that's why "VLA" isn't a single recipe.
DDPM you just learned, re-cast as a policy — and the dual-system pattern that became consensus in 2026.
(a) Explicit: regression / GMM · (b) Implicit: energy-based · (c) Diffusion: learn the gradient field
Observation chunk → εθ(O, A, k) denoises a Tp-step action chunk · CNN (FiLM) or Transformer
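For reference, the denoiser εθ(O, A, k) above plugs into the usual DDPM-style update and noise-prediction loss. The form below follows the standard diffusion-policy formulation. The coefficients α, γ, σ depend on the noise schedule and are not given on the slide, so treat them as placeholders.

```latex
% A^k is the action chunk at denoising iteration k; A^0 is the clean chunk.
\begin{aligned}
A^{k-1} &= \alpha \left( A^{k} - \gamma\, \varepsilon_\theta(O, A^{k}, k)
           + \mathcal{N}(0, \sigma^{2} I) \right), \\
\mathcal{L} &= \mathrm{MSE}\!\left( \varepsilon^{k},\;
           \varepsilon_\theta(O, A^{0} + \varepsilon^{k}, k) \right).
\end{aligned}
```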
System 2 (low-freq plan, 1/n × step) drives System 1 (high-freq motor, every step)
| Model | Rate |
|---|---|
| RT-2 55B end-to-end | ~1 Hz |
| OpenVLA 7B AR | ~5 Hz |
| Diffusion Policy | ~10 Hz |
| ACT chunked | 50 Hz |
| π0 flow-matching | ~50 Hz |
| GR00T N1 DiT expert | ~50 Hz |
| FiS-VLA shared params | 117.7 Hz |
| Helix S2 7B + S1 80M | 200 Hz |
A 20-year-old cognitive-science partition, now reified in silicon — in robotics and in LLMs.
Two modes communicate through a shared intent signal — S2 sets goals, S1 executes.
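A toy sketch of the S2-sets-goals, S1-executes loop described above. Everything here is a stand-in: the "planner" just reads a goal, the "controller" is a proportional step, and the 1-in-25 update ratio is an arbitrary assumption, not the rate of any system in the table.

```python
def slow_planner(observation):
    """Stand-in for System 2: returns an intent (here, a target position)."""
    return observation["goal"]

def fast_controller(position, intent, gain=0.2):
    """Stand-in for System 1: one proportional step toward the current intent."""
    return gain * (intent - position)

def run(num_steps=200, s2_every=25):
    position, intent, s2_calls = 0.0, 0.0, 0
    for step in range(num_steps):
        if step % s2_every == 0:
            # The (hypothetical) task goal flips halfway through the episode.
            obs = {"goal": 1.0 if step < num_steps // 2 else -1.0}
            intent = slow_planner(obs)        # low-frequency update
            s2_calls += 1
        position += fast_controller(position, intent)   # runs every step
    return position, s2_calls

position, s2_calls = run()
print(position, s2_calls)
```

The fast loop never waits for the slow one. It keeps acting on the most recent intent, which is what takes the large model off the control-rate critical path.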
Fully autonomous Airbnb bedroom cleanup · 5× speed
Helix grocery put-away — Figure 02 humanoids collaborate · Feb 2025
Both robots run the same network — no role-specific finetuning, no per-robot models.
Both S1 and S2 run on an embedded GPU on the robot — no cloud.
No paper, no weights — the architecture description is from the blog text. Figure 03 follow-up streamed live in May 2026.
arxiv:2503.14734
research.nvidia.com/labs/gear/gr00t-n1_5
arxiv:2602.15922 · dreamzero0.github.io
arxiv:2506.01953 · fast-in-slow.github.io
deepmind.google/blog/gemini-robotics-15
Can a model learn the dynamics of the world — and use them?
The same partition appeared three times: model-based RL · recurrent controller · V·M·C decomposition.
Model-based RL: train the model from real interaction, then plan / value-iterate inside the model. Same agent learns real + imagined experience.
Sutton, ML Workshop 1990
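The real-plus-imagined learning loop above can be sketched as tabular Dyna-Q. The five-state chain environment, the hyperparameters and the random tie-breaking are illustrative assumptions, not details from the 1990 paper.

```python
import random

N_STATES, GOAL = 5, 4            # hypothetical 1-D chain: states 0..4
ACTIONS = (-1, +1)               # move left / right

def env_step(s, a):
    s2 = min(max(s + a, 0), GOAL)
    return (1.0 if s2 == GOAL else 0.0), s2

def dyna_q(episodes=30, planning_steps=10, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}                                   # learned (s, a) -> (r, s')

    def update(s, a, r, s2):
        best_next = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != GOAL:
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:                                # greedy with random tie-break
                a = max(ACTIONS, key=lambda b: (q[(s, b)], rng.random()))
            r, s2 = env_step(s, a)               # real interaction
            update(s, a, r, s2)                  # learn from real experience
            model[(s, a)] = (r, s2)              # fit the (deterministic) model
            for _ in range(planning_steps):      # learn from imagined experience
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return q

q = dyna_q()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

The same `update` function consumes both real and model-generated transitions, which is the "same agent learns real + imagined experience" point.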
Differentiable world model: gradients of future reward flow through M back into π. This is exactly the modern policy-via-rollout pattern.
Schmidhuber 1990, TR FKI-126/147
Same partition, modernized: VAE for V, MDN-RNN for M, tiny CMA-ES controller. Agent never touches the real env during RL.
arXiv:1803.10122 · worldmodels.github.io
Wu et al. trained a real A1 quadruped to walk from scratch in ~1 hour — physical-world WM-based RL, no sim. The earliest concrete bridge from V·M·C to robotics.
What "WM" actually points to in 2025-26 robotics papers — three branches.
Small, plannable WMs in latent space — not pixels back.
Frozen DINOv2 features + small dynamics head trained on (z, a) → ẑ'. At test-time, optimize action sequence by gradient descent on planning loss to goal latent.
dino-wm.github.io
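A toy version of the test-time planning step described above: optimize an action sequence by gradient descent so the predicted final latent reaches a goal latent. The linear dynamics z' = z + B a, the matrix B and the hand-derived gradient are assumptions for illustration. A learned dynamics head would need autodiff, and this is not the DINO-WM code.

```python
def rollout(z0, actions, B):
    """Toy latent dynamics z' = z + B a, standing in for a learned head."""
    z = list(z0)
    for a in actions:
        z = [z[i] + sum(B[i][j] * a[j] for j in range(len(a)))
             for i in range(len(z))]
    return z

def plan(z0, z_goal, B, horizon=5, lr=0.1, iters=200):
    dim_a = len(B[0])
    actions = [[0.0] * dim_a for _ in range(horizon)]
    for _ in range(iters):
        z_final = rollout(z0, actions, B)
        err = [zf - zg for zf, zg in zip(z_final, z_goal)]
        # For this linear model, d/da_t ||z_H - z_goal||^2 = 2 B^T err
        # and is identical for every timestep t.
        grad = [2 * sum(B[i][j] * err[i] for i in range(len(err)))
                for j in range(dim_a)]
        actions = [[a[j] - lr * grad[j] for j in range(dim_a)] for a in actions]
    return actions

# Hypothetical 2-D latent and dynamics matrix.
B = [[0.5, 0.0], [0.0, 0.5]]
z0, z_goal = [0.0, 0.0], [1.0, -2.0]
actions = plan(z0, z_goal, B)
z_final = rollout(z0, actions, B)
print(z_final)
```

Only the action sequence is optimized. The encoder and the dynamics stay frozen during planning.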
github.com/lucas-maes/le-wm
Sora·Veo·Wan·Cosmos·Genie are video generators: prompt → video. They become WMs only with action conditioning (Slide 56, 60).
Sora·Veo·Wan are video generators. WMs predict next state given action. Two true 2026 WMs:
Carry the geometry — don't re-infer it every frame.
Explicit 3D Gaussians + per-Gaussian motion over time. Real-time render, novel-view eval — the geometric substrate Lyra 2 generates and Cosmos Transfer renders on.
hustvl/4DGaussians
Structured inputs (depth / segmentation / edge) → photoreal video via frozen Cosmos-1 + per-modality ControlNets. The standard 2025-26 sim-to-real data pipe.
research.nvidia.com/labs/dir/cosmos1
One image / prompt → navigable 3D Gaussian scene — a geometry-native WM.
"World model" = 3D reconstruction as a feed-forward foundation. N views → cameras, depth, geometry in one pass.
N views → cameras + depth + pointmaps + tracks in one feed-forward pass. The foundation.
What's new vs VGGT: Register Attention (lighter), Matching + Point losses (training-only).
Universal feed-forward metric 3D — real-world units, not just relative scale.
Without an action input, it's just video gen. With one — it's a robotics WM. This is the bridge to Ch 7.
Three input streams — video, action, text — into one AR transformer. Counterfactual rollouts: "what if I steer left?"
wayve.ai/thinking/introducing-gaia1
Initial frame → WM → synthetic robot videos. Pseudo-actions â auto-extracted — used for new behaviors and unseen environments.
developer.nvidia.com/blog/r2d2
Three core capabilities of an actionable world model — foresight, imagination-planning, data amplification.
Predict next state under candidate actions — the policy can reason about contact, dynamics, and physical regularities language-only pretraining never captures.
MPC / search inside the WM. Use the imagined future to compare candidate behaviors before acting.
Trained WM = generator of new (obs, action) pairs — replace expensive teleop with imagined rollouts.
Roll a policy inside the WM — no real-robot time. DeepMind's Gemini-in-Veo: 1,600+ real evals replaced. WorldEval (Midea) is the academic counterpart.
2026 trend (NTU MARS): WMs no longer auxiliary — VLA-RFT, WMPO, RynnVLA-002 co-evolve policy + WM in one loop.
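The "MPC / search inside the WM" capability listed above can be sketched as random-shooting MPC. The one-dimensional world model, the quadratic cost and the candidate count are hypothetical stand-ins chosen to keep the example small.

```python
import random

def world_model(state, action):
    """Hypothetical learned dynamics: a 1-D point that moves by the action."""
    return state + action

def cost(state, goal):
    return (state - goal) ** 2

def plan_first_action(state, goal, rng, num_candidates=64, horizon=5):
    best_seq, best_cost = None, float("inf")
    for _ in range(num_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:                    # imagined rollout, no real-robot time
            s = world_model(s, a)
            total += cost(s, goal)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq[0]                   # execute only the first action, replan

rng = random.Random(0)
state, goal = 0.0, 3.0
for _ in range(20):
    state = world_model(state, plan_first_action(state, goal, rng))
print(state)
```

Candidate behaviors are compared entirely in imagination. Only the winning first action touches the (here simulated) robot before the next replanning step.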
Does knowing the world’s dynamics turn into zero-shot policies?
Linear in dollars. Every new frame = another teleop hour. OpenVLA, DROID, OpenX all live here.
Sub-linear in dollars. 2k labeled hours teach the IDM → the IDM labels 70k unlabeled web hours → first foundation policy in Minecraft.
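A toy sketch of the IDM labeling trick above: fit an inverse dynamics model on a small labeled set, then pseudo-label a much larger unlabeled one. The 1-D "video", the three-action set and the nearest-centroid IDM are assumptions for illustration, not the VPT architecture.

```python
import random

rng = random.Random(0)
ACTIONS = (-1, 0, +1)                       # hypothetical discrete action set

def make_video(length):
    """Simulate a 1-D 'video' (positions) together with its hidden actions."""
    frames, actions = [0.0], []
    for _ in range(length):
        a = rng.choice(ACTIONS)
        actions.append(a)
        frames.append(frames[-1] + a + rng.gauss(0.0, 0.1))   # noisy dynamics
    return frames, actions

def fit_idm(frames, actions):
    """Inverse dynamics model: mean frame delta per action (nearest centroid)."""
    sums = {a: [0.0, 0] for a in ACTIONS}
    for prev, nxt, a in zip(frames, frames[1:], actions):
        sums[a][0] += nxt - prev
        sums[a][1] += 1
    return {a: s / n for a, (s, n) in sums.items() if n > 0}

def label_with_idm(idm, frames):
    """Pseudo-label every transition of an unlabeled video."""
    return [min(idm, key=lambda a: abs((nxt - prev) - idm[a]))
            for prev, nxt in zip(frames, frames[1:])]

small_frames, small_actions = make_video(60)      # small labeled set
big_frames, hidden_actions = make_video(2000)     # large "unlabeled" set
idm = fit_idm(small_frames, small_actions)
pseudo = label_with_idm(idm, big_frames)
accuracy = sum(p == h for p, h in zip(pseudo, hidden_actions)) / len(pseudo)
print(accuracy)
```

The pseudo-labeled pairs would then feed ordinary behavior cloning, so labeling cost scales with the small set rather than with total video hours.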
Drop the hand-defined action label. Learn a latent action with VQ-VAE on inter-frame deltas.
Quantize inter-frame deltas → discrete latent z, no human label.
Train VLA to predict next latent action from raw video. Zero robot data.
Map latent z to real joint actions with a small labeled set.
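The first and last steps above can be sketched on toy data. Note the substitution: a tiny 1-D k-means codebook stands in for the VQ-VAE, and the "video" is a scalar position trace with three hidden motion modes. All of it is hypothetical. The middle step, pretraining a VLA to predict the latent ids, is not shown.

```python
import random

def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means, used here as a stand-in for a VQ-VAE codebook."""
    lo, hi = min(values), max(values)
    codebook = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]  # even init
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - codebook[i]))
            buckets[idx].append(v)
        codebook = [sum(b) / len(b) if b else c
                    for b, c in zip(buckets, codebook)]
    return codebook

def quantize(delta, codebook):
    """Index of the nearest code: the discrete latent action z."""
    return min(range(len(codebook)), key=lambda i: abs(delta - codebook[i]))

rng = random.Random(0)
# Unlabeled 'video': positions whose hidden motion is one of three modes.
frames = [0.0]
for _ in range(300):
    frames.append(frames[-1] + rng.choice((-1.0, 0.0, 1.0)) + rng.gauss(0.0, 0.05))

deltas = [b - a for a, b in zip(frames, frames[1:])]       # inter-frame deltas
codebook = kmeans_1d(deltas, k=3)                          # learn the codes
latent_actions = [quantize(d, codebook) for d in deltas]   # no human labels

# A small labeled set (hypothetical) maps each latent id to a real action.
labeled = [(-1.0, "left"), (0.0, "stay"), (1.0, "right")]
latent_to_action = {quantize(d, codebook): name for d, name in labeled}
print(codebook, latent_to_action)
```

The codebook is discovered from motion alone. Labels are needed only at the end, to name a handful of codes.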
🇰🇷 First authors + both advisors (Kimin Lee · Minjoon Seo) at KAIST.
Three architectural styles — the bifurcation point splits Slides 66 (canonical) and 68 (hybrid).
Recover the missing label using a tiny labeled set.
Learn the action as a VQ-VAE latent. Drop hand labels.
Stop separating them. Generate future video AND action together.
Single-Backbone (b). One generative model over both video and action. AR diffusion. DreamZero, Cosmos-Policy.
MoT-Style (c). Video expert + action expert, joint attention. Video-pred as auxiliary loss. InternVLA-A1, LingBot-VA.
DreamZero overview · 14B AR video diffusion + action
vs VLA 27.4%
vs from-scratch ≈0%
vs π0.5 33% · GR00T N1.6 31%
DreamZero didn't appear out of nowhere — a year+ of single-backbone work made it scalable. Cosmos-Policy (Jan 2026) was NVIDIA's own one-month-prior step.
14B normally implies seconds-per-step. Flash distillation + KV-cache reuse lands at robot-usable rate — not in the π0 zone, but enough for tabletop manip.
YAM teddy transfer · 30 min of play, new robot · one model
The same generative prior that gives zero-shot tasks also gives fast embodiment adaptation.
MoT (Mixture-of-Transformers) — one backbone, three experts: understanding, generation, action. 2-3B params, 692M pretraining frames.
arXiv:2601.02456
AR diffusion over future frames + policy execution in shared latent space. SOTA on all 6 real-robot tasks vs π0.5; +8-9% at horizon 3.
arXiv:2601.21998
No future-video generation. Just a dual branch with learnable latent queries between perception and action. Pretrained on 200k h ego video + 15k h robot demos: the Ch3 ego pool, finally cashed.
Teacher hidden states → compressors → routed adapters → student text-embedding space. Knowledge transfer between WAMs without touching the backbone.
Treat objects as first-class addressable entities inside the world model. Lets the policy refer to “the red cup” the way a VLA refers to a token — not a pixel patch.
“Test-time future imagination is unnecessary.” Skip the rollout, get SOTA at 4× speed — directly challenges DreamZero / Being-H0.7’s core premise.
FFDC-WAM threads the needle with conditional imagination.
Decouple: predict subgoal / latent, then act. VPT · UniPi · LAPA
One model, video × action jointly. UVA · UWM · DreamZero
Expert fusion / joint attention. LingBot-VA · GE-Act · BagelVLA
VLA + foresight / video pred. as aux. GR-1 · InternVLA-A1 · UniVLA
JEPA-style, no pixel gen. V-JEPA 2 · VLA-JEPA · FLARE · Being-H0.7
How we're working on all of this from Korea
maum.ai's physical-AI division.
We build foundation models that work in the physical world.
(Taxonomy reused from Ch7 slide 70 — we build a Korean stack on the same line.)
Data → Models → Deployment. End-to-end stack from teleop to B2B install.
ALOHA / UMI / egocentric video pipelines + Cosmos-Transfer synth factory.
CANVAS · HuggingFace maum-ai/CANVAS-S · maum-ai/COMMAND
Ch6 / Ch7 lines implemented in-house — our WAM × dual-system bet.
internal · targeted public release in 2026 H2
CostNav measures navigation in real economic cost. I'm last author.
arXiv:2511.20216 · worv-ai.github.io/CostNav
maum-ai/CANVAS · maum-ai/CostNav · open-world-agents/D2E (next slide).
Three projects we're (co-)leading right now — data, world models, eval infra.
Generalist Inverse Dynamics Model — trained on desktop game data (Brotato, Minecraft, BF6, …) → transfers to real-world manipulation (Meta-World, LIBERO) and navigation (CANVAS).
github.com/worv-ai/D2E · HF: open-world-agents/Generalist-IDM-1B
Camera-controllable progressive AR video transformer — 3,000 minutes of gameplay frames with camera trajectories. The "WM as controllable simulator" research line.
github.com/cvlab-kaist/WorldCam
Unified VLA eval across 18 benchmarks × 13 model servers. 47× faster LIBERO via batched parallel eval. Decouples model from environment.
github.com/allenai/vla-evaluation-harness · v0.2.0
Specific customers are confidential — verticals only