HUFS LAI · 2026-05-27 Week 13 · Advanced 1 — Multimodal AI

Robotics
Foundation
Models

Trajectories and open problems

Yunsung Lee Head of Research · WoRV @ maum.ai
Intro 02 / 77

About Me

01 · Academic

Korea Univ MS · CMU Visiting

Multimodal learning · vision · diffusion. 10+ pubs at CVPR · NeurIPS · ICLR · ECCV. Visiting scholar at CMU LTI (2020).

02 · Industry journey

ScatterLab → Riiid → Wrtn → maum.ai

Iruda (LLM dialogue) · Santa TOEIC (vision tutor) · wrtn + crack (multimodal agents) · now WoRV (robotics FM).

03 · Now

Head of Research · WoRV @ maum.ai

World model for Robotics and Vehicle control

Korea-based RFM team — full stack: data · policy · world models. Returns in Ch8.

Intro 03 / 77

Today 8 chapters

Chapter 1 — Robotics Foundation Models? 04 / 77
01

Robotics Foundation Models?

What does "foundation" actually mean in robotics?

Ch 1 — RFM? 05 / 77

Why robotics AI — now

four forces converging · first time at once

Robotics has been "about to break out" for decades. The honest case for this wave is that four enablers are arriving in parallel.

01
FROM NLP / VISION SPILLOVER

Foundation models are ready

PaliGemma, DINOv3, SigLIP, CLIP — pretrained backbones now exist that can be fine-tuned for embodied tasks instead of trained from scratch.

Examples: RT-2 ← PaLI-X · OpenVLA ← Llama-2 · π0 ← PaliGemma

action grounding still doesn't inherit cleanly

02
REAL-WORLD ROBOT DATA

Data infrastructure exists

Open hardware (ALOHA, UMI) + open datasets (OpenX 1M+ episodes, DROID 76k) + synthetic pipelines (Cosmos-Transfer) — the field finally has a data stack.

Datasets: OpenX-Embodiment · DROID · BridgeData V2 · RoboCasa

still ~10⁶× short of LLM token counts

03
REAL HARDWARE AT REAL VOLUME

Humanoid + arm platforms shipped

The 2024-26 wave of physical platforms gives the field a shared body to train on — Figure F.03, Unitree G1/H1, 1X NEO, Tesla Optimus, Rainbow Robotics RB-Y1.

Platforms: Figure · 1X · Unitree · Tesla · Apptronik · Rainbow

cost & dexterity still bottleneck deployment

04
CAPITAL + TALENT REALLOCATION

Industry pull is real

$5B+ raised across Physical Intelligence, Skild, Figure, Wayve in 2024-25. National programs in KR, China, US, EU. Top NLP researchers explicitly pivoting.

Signal: Physical Intelligence ($400M) · Skild ($300M) · Figure ($675M)

signal-to-noise still messy; many bets won't survive

still unsolved
data wall · generalization across tasks & environments · embodiment-portability  — that's why we spend the next 70 min on the field's 2024-26 attempts.
Ch 1 — RFM? 06 / 77

What "foundation" means in NLP / Vision

Bommasani et al. · Stanford CRFM · 2021
Diagram: Web text · Images · Audio · Video → FOUNDATION MODEL → QA · Translation · Code gen · Captioning

One model → many downstream tasks

  • Large-scale self-supervised pretraining
  • Downstream tasks need only light fine-tuning / zero-shot
  • Capability emergence backed by scaling laws

Already proven in NLP / Vision — next slide: what breaks when you copy this to robotics?

Ch 1 — RFM? 07 / 77

What "foundation" means for robots

Diagram: Web text + image · Egocentric video · Teleop demonstrations · Simulated rollouts → ROBOT FOUNDATION MODEL → Pick · Place, Fold · Wipe, Navigate, Long-horizon assembly
Input: multimodal observation (vision + proprioception + language)
Output: physical action — both must generalize
Ch 1 — RFM? 08 / 77

Three axes of generalization

When we say a robot "generalizes," we always mean along one of these three axes.

AXIS 1

Task

new verb · same room

New verbs in a known environment. "fold the towel" vs "roll the towel."

flagship — RT-2 (semantic reasoning)

AXIS 2

Environment

same verb · new room

Same task in unseen rooms, lighting, objects. π0.5 hits 94% follow-rate in new homes.

flagship — π0.5 (OOD home)

AXIS 3

Embodiment

same brain · new body

Brand-new hardware, zero-shot. DreamZero adapts to a YAM arm with 30 min of play.

flagship — DreamZero · GR00T N1

Most recent papers bet on one of these three axes. No model generalizes across all three yet — that's the destination of "robotics foundation."
Ch 1 — RFM? 09 / 77

Why "scaling" alone isn't obvious here

Data scale comparison · log axis
Chart (log axis, 10⁴ to 10¹³): single lab ~10⁴ episodes · OpenX ~10⁶ episodes · ImageNet ~10⁷ images · LLM ~10¹³ tokens · 5-6 orders gap

"We don't have an internet of actions."

Unlike internet text, action data doesn't grow on its own. RT-1 ~130k episodes vs LLM ~trillions of tokens — a 5-6 order-of-magnitude gap.

So we need data tricks to scale — Ch3 is entirely about this problem.

Ch 1 — RFM? 10 / 77

The 2023 — 2026 unlock

2022
RT-1

130k episodes,
discretized actions

2023
RT-2

web knowledge
→ actions

2024
OpenVLA

7B open,
OpenX 970k

2024-10
π0

flow-matching
action expert

2025-02
Helix

S2 7B + S1 80M
@ 200 Hz

2025-03
GR00T N1

open humanoid
stack

2025-04
π0.5

unseen home
94% follow

2026-02
DreamZero

joint video +
action generation

Three years on one page — action tokenization → web knowledge transfer → open stack → dual system → OOD generalization → joint video × action generation.
Ch 1 — RFM? 11 / 77

The core thesis: Jim Fan's roadmap, this talk's evidence

aligned with Sequoia AI Ascent · May 2026
Jim Fan · NVIDIA · Sequoia AI Ascent 2026 · "Robotics: Endgame"

Fan's framework: "The Great Parallel" — robotics is replaying the GPT trajectory (pretrain → align → reason → auto-research). The three concrete unlocks he names for the pretrain stage are exactly the spine of this talk.

Jim Fan · May 2026 · paraphrased "Robotics' Endgame is on the GPT-parallel roadmap. Three unlocks get us through the pretrain stage."
His 3 unlocks — this talk's spine
Sensorized human data · egocentric video · UMI → Ch 3 · data scaling
Neural simulators · DreamDojo · NVIDIA → Ch 6 · world models
Video-first WAMs · DreamZero · vision/action as first-class → Ch 7 · world action models
+ 2 layers we add
+ S1/S2 dual-system convergence · Helix · π0.5 · GR00T → Ch 5 · diffusion + dual
+ Korea-built RFM stack · data, model, deploy · in-house → Ch 8 · WoRV @ maum.ai
Ch 1 — RFM? 12 / 77

Roadmap — 7 questions, 7 chapters

CH 2 Action modality How do we tokenize action?
CH 3 Data scaling How do we collect and grow that action data?
CH 4 VLAs How does a VLM become a VLA?
CH 5 Diffusion + Dual Speed and reasoning, simultaneously
CH 6 World Models Can a model learn the world's dynamics?
CH 7 World Action Models Does "knowing" turn into zero-shot policies?
CH 8 WoRV @ maum.ai How we're working on this from Korea — and who to talk to about joining
Further reading — for broader coverage of the field, see NTU MARS et al. arXiv:2605.00080 (2026-04).
Chapter 2 — A New Modality: Action 13 / 77
02

A New Modality: Action

Text, image, audio — you already know how to tokenize. What about action?

Ch 2 — Action Modality 14 / 77

Recap — the modalities you already know

Week 10 deck · Multimodal LLM fusion patterns
Modality | Canonical tokenizer | Example model
Text | BPE / SentencePiece | GPT, Llama
Image | ViT patches / VQ-VAE | CLIP, LLaVA, Chameleon
Audio | Mel-spec / EnCodec | Whisper, AudioLM
Video | Tubelets / latent frames | VideoMAE, Sora
ACTION | today's question | RT-1, π0, DreamZero

We are adding one more row to a table the Week 10 deck already filled in.

Original figure · CLIP · Radford et al. 2021
CLIP — contrastive pre-training + zero-shot prediction (Radford et al. 2021 Fig 1)

Image ↔ text tokens trained with contrastive loss — the canonical proof that any modality fits the "tokenize & predict" recipe.

Ch 2 — Action Modality 15 / 77

What is an "action" in a robot?

One control step · 7-DoF arm + 1 gripper
Diagram: Franka Emika Panda 7-DoF arm, base to gripper, Joint 0 through Joint 6 labeled on each axis · a = (q1, q2, q3, q4, q5, q6, q7, grip) ∈ ℝ⁸ · 7 joints + 1 gripper

A continuous vector — but embodiment-bound

  • Joint angles (Franka) vs end-effector pose (RT-1) vs motor torques (Atlas)
  • 7-DoF single arm → 14-DoF bimanual ALOHA → 35-DoF Helix upper body
  • Grippers add a 1-D switch; dexterous hands add 20+ extra DoF
  • The "action space" is not portable — every robot speaks its own language

Looks like a simple continuous vector. Hides embodiment dependence — the bill we pay later.

Ch 2 — Action Modality 16 / 77

Why action is harder than text

1 · DISCRETENESS

Text: discrete by birth.
Action: continuous.

Cross-entropy "just works" on 50k tokens. Joint angles live in ℝⁿ; we must invent the alphabet.

2 · DETERMINISM

Text: next-token softmax.
Action: real physics.

The same command on the same scene yields a different outcome — mass, friction, contact, noise.

3 · DATA SOURCE

Text: scrape the internet.
Action: teleop, one hour at a time.

10¹³ tokens vs 10⁶ teleop episodes — no internet for tying shoelaces.

4 · REVERSIBILITY

Text: regenerate the answer.
Action: drop the cup, it breaks.

Mistakes have physical cost — safety, supervision, evaluation all become first-class problems.

Clip · Chelsea Finn @ YC AI Startup School

"No internet for robot actions" · 02:08–03:32

The whole rest of the talk is a response to cell #3. Ch3 fixes data; Ch5 fixes speed; Ch6–7 fix evaluation.
Ch 2 — Action Modality 17 / 77

Two tokenization strategies

STRATEGY A

Bin discretization

Slice each continuous dim into 256 bins, treat each bin as a vocab token, predict with cross-entropy — same loop as an LLM.

Diagram: q ∈ ℝ → bin #137 (256 bins / dim) → <act_137> · one vocab token per dim · P(aₐ|s) ↔ softmax
  • Reuses LLM loss, optimizer, decoder verbatim
  • Quantization error; per-dim factorization

flagship — RT-1, RT-2, OpenVLA
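A minimal sketch of Strategy A, assuming every action dimension is normalized to [-1, 1] and sliced into 256 equal-width bins. The helper names `to_bins` and `from_bins` and the [-1, 1] range are illustrative assumptions, not the exact scheme of any one paper. It shows where the quantization error in the bullet above comes from: decoding returns the bin center, so the round trip is off by at most half a bin width.

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as on the slide

def to_bins(action, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map each continuous dim in [low, high] to an integer bin id in [0, n_bins - 1]."""
    a = np.clip(np.asarray(action, dtype=float), low, high)
    ids = np.floor((a - low) / (high - low) * n_bins).astype(int)
    return np.minimum(ids, n_bins - 1)  # a == high would otherwise land in bin n_bins

def from_bins(ids, low=-1.0, high=1.0, n_bins=N_BINS):
    """Decode bin ids back to the bin-center value."""
    width = (high - low) / n_bins
    return low + (np.asarray(ids) + 0.5) * width

action = np.array([0.124, -0.038, 0.211, 0.0, 1.0])
ids = to_bins(action)          # one integer token id per dimension
recon = from_bins(ids)         # lossy reconstruction
max_err = float(np.abs(recon - action).max())
```

With 256 bins over a range of width 2, the worst-case round-trip error is 2 / 256 / 2 ≈ 0.0039 per dimension. Each dimension is still tokenized independently, which is the per-dim factorization the slide flags.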

STRATEGY B

Continuous head

Keep the action continuous; bolt a separate diffusion / flow head on top of the VLM that denoises a whole action chunk at once.

Diagram: noise → DENOISE εθ(aᵗ, t, o) → aₘ, aₘ₊₁, ···, aₘ₊ₖ (k-step chunk)
  • Captures multimodal action distributions (no mode collapse)
  • Slower per call — but predicts a whole chunk at once

flagship — Diffusion Policy, ACT, π0

This is the #1 design choice behind every modern VLA. The next three slides take one example of each — and a 2025 hybrid that re-discretizes.
Ch 2 — Action Modality 18 / 77

Discrete — RT-1's action vocabulary

11 dims × 256 bins per dim
Diagram: CONTINUOUS (+0.124, -0.038, +0.211, +0.000, +0.046, -0.155, +0.302, +0.018, -0.071, +0.094, +1.000) → 256-bin quant. → DISCRETE TOKENS (<act_159> <act_123> <act_182> <act_128> <act_134> <act_108> <act_206> <act_130> <act_119> <act_140> <act_255>) → RT-1 35M transformer · next-token CE
Clip · RT-1 supplementary video

EfficientNet → tokens → action tokens · 02:07–05:17

  • 11 dims = 7 arm (x, y, z, roll, pitch, yaw, gripper opening) + 3 base (x, y, yaw) + 1 mode switch (arm / base / terminate)
  • 256 bins per dim → 11 tokens emitted per control step
  • Loss = next-token cross-entropy — identical to an LLM
  • Opens the door to VLM → VLA (Ch4)
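A small sketch of the loss bullet above: once actions are bin ids, the training signal is ordinary cross-entropy over a 256-way softmax, one term per action dimension. The function name `action_token_loss` is hypothetical, the logits are placeholders rather than outputs of a real model, and the target ids are the ones shown in the slide's diagram.

```python
import numpy as np

def action_token_loss(logits, target_ids):
    """Mean cross-entropy over the action tokens of one control step.

    logits: (n_dims, n_bins) unnormalized scores; target_ids: (n_dims,) bin ids.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-picked.mean())

n_dims, n_bins = 11, 256
targets = np.array([159, 123, 182, 128, 134, 108, 206, 130, 119, 140, 255])

# An untrained model with uniform logits pays ln(256) nats per token.
uniform_loss = action_token_loss(np.zeros((n_dims, n_bins)), targets)

# A model that puts nearly all mass on the right bin pays close to zero.
confident = np.zeros((n_dims, n_bins))
confident[np.arange(n_dims), targets] = 50.0
confident_loss = action_token_loss(confident, targets)
```

Nothing here is robot-specific, which is the point of the slide: the loss does not care whether token 159 is a word piece or a slice of an arm coordinate.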
Ch 2 — Action Modality 19 / 77

Continuous — action chunks

Predict k future actions in one shot
Diagram: PER-STEP: predict only aₘ · model runs every 33 ms · 30 Hz. CHUNKED: predict aₘ … aₘ₊ₖ · model runs once per chunk · "open loop" replay. VLM / DP diffusion or flow: a_{t:t+k} = εθ(z, t, o)
  • ACT — bimanual CVAE-Transformer trained with an L1 reconstruction loss over action chunks
  • Diffusion Policy — same DDPM you saw in Week 12, but the target is an action sequence, not pixels
  • Chunking absorbs human teleop noise + handles multimodal action distributions
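A toy sketch of chunked execution, assuming a hypothetical `policy(step)` callable that returns a list of future actions. It only counts model calls, to make the per-step vs chunked contrast concrete; `dummy_policy` is a stand-in, not a learned model.

```python
def run_chunked(policy, horizon, chunk_size):
    """Execute `horizon` control steps, calling the policy once per chunk.

    `policy(step)` must return at least `chunk_size` actions; the actions
    inside a chunk are replayed open loop, with no new observation.
    """
    executed, calls, step = [], 0, 0
    while step < horizon:
        chunk = policy(step)[:chunk_size]
        calls += 1
        for action in chunk[: horizon - step]:   # truncate the final chunk
            executed.append(action)
            step += 1
    return executed, calls

def dummy_policy(step, k=8):
    # Stand-in for a learned model: labels each action with its time index.
    return [step + i for i in range(k)]

actions, n_calls = run_chunked(dummy_policy, horizon=30, chunk_size=8)
_, n_calls_per_step = run_chunked(dummy_policy, horizon=30, chunk_size=1)
```

For 30 control steps, the chunked run calls the model 4 times and the per-step run calls it 30 times. The price is that corrections can only arrive at chunk boundaries.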
Clip · Russ Tedrake @ Princeton Robotics Seminar

"Diffusion policy = sequence of future actions" · 20:02–21:05

DDPM → policy. Same noise schedule, same εθ objective. Only the target tensor changes — from pixels to an action chunk.
Ch 2 — Action Modality 20 / 77

FAST — a compression-based tokenizer

Pertsch et al. Jan 2025 · Fig 2
FAST tokenization method (paper Fig 2) — 5-step pipeline: normalized action chunk → DCT → quantize → sparse matrix → flatten → BPE-compressed tokens

(1) chunk → (2) DCT → (3) quantize → (4) flatten low-freq first → (5) BPE compress

Quantize & Drop: DCT coefficients are scaled and rounded. High-frequency components (noise) collapse to zero and are omitted (dropped) prior to BPE encoding, yielding 5× compression.

Continuous → back to discrete, in a smarter basis.

  • DCT = Discrete Cosine Transform (same idea as JPEG image compression). Energy concentrates in the low-frequency components — the high-frequency tail is mostly noise.
  • BPE = Byte-Pair Encoding (same tokenizer family as GPT). Greedy merge of frequent integer pairs → compact vocabulary.
  • Result: ~5× shorter action sequences than RT-1-style 256-bin discretization → faster train + faster inference, same accuracy.
Big picture: FAST is Strategy C — the 2025 hybrid that quietly returns to discrete tokens, but in a basis where each token carries real information about the trajectory.
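A rough NumPy sketch of steps (1) to (4) of the pipeline above, under stated assumptions: the DCT is an orthonormal DCT-II built by hand, `scale=10.0` is an arbitrary illustrative value, and the BPE step (5) is omitted. The function names are hypothetical and this is not the released FAST tokenizer, only the same idea in miniature.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D: coeffs = D @ x and x = D.T @ coeffs."""
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # time index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    d[0, :] /= np.sqrt(2.0)
    return d

def encode(chunk, scale=10.0):
    """chunk: (T, D) normalized actions -> flat integer sequence, low frequencies first."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk        # DCT along time, per dimension
    quantized = np.round(coeffs * scale).astype(int)   # scale and round
    return quantized.reshape(-1)                       # row-major: frequency 0 of all dims, then 1, ...

def decode(tokens, n_steps, n_dims, scale=10.0):
    """Invert encode() up to the rounding error."""
    coeffs = tokens.reshape(n_steps, n_dims).astype(float) / scale
    return dct_matrix(n_steps).T @ coeffs

t = np.linspace(0.0, 1.0, 50)
chunk = np.stack([np.sin(2 * np.pi * t), t], axis=1)   # smooth 50-step, 2-D trajectory
tokens = encode(chunk)
recon = decode(tokens, *chunk.shape)
zero_fraction = float(np.mean(tokens == 0))
max_error = float(np.abs(recon - chunk).max())
```

On a smooth trajectory like this one, most of the rounded coefficients are zero, which is what a BPE pass would then compress. A larger `scale` keeps more high-frequency detail at the cost of longer token sequences.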
Ch 2 — Action Modality 21 / 77

Takeaway — action is the new token

Same table as slide 14 — with ACTION filled in
Modality | Canonical tokenizer | Example model
Text | BPE / SentencePiece | GPT, Llama
Image | ViT patches / VQ-VAE | CLIP, LLaVA
Audio | Mel-spec / EnCodec | Whisper
Video | Tubelets / latent frames | Sora
ACTION (A) | Bin discretization · 256/dim | RT-1, RT-2, OpenVLA
ACTION (B) | Continuous chunk + diffusion / flow | ACT, Diffusion Policy, π0
ACTION (C) | FAST · DCT → BPE re-discretize | π0-FAST
Action is now a modality that LM, diffusion, and flow can all consume. Next chapter — where do we get the data?
7 action-head architectures · Survey Fig 3
Survey Fig 3 — 7 sensorimotor architecture types for VLA (Transformer/VLM × Discrete/Diffusion/Flow)

Backbone (Transformer / VLM) × Action head (Discrete token / Diffusion / Flow / DiT) — 7 known combinations

Chapter 3 — Data Scaling 22 / 77
03

Data Scaling.

If actions don't grow on the internet, how do we scale them?

Ch 3 — Data Scaling 23 / 77

The action data wall

Robot-learning datasets · log scale
Chart (log scale, 10³ to 10⁷ episodes): RoboMimic ~6k · BridgeV2 60k · DROID 76k · RoboCasa 100k · OpenX 1M+ ep · LLM training tokens ~10¹³ for comparison · six orders of magnitude gap · still ~10⁶× short

5-6 orders of magnitude short.

  • Every jump on this chart = a new collection method
  • Lab teleop < cross-lab union < in-the-wild < human video < sim
  • The rest of the chapter = four strategies to close the gap

Plus OpenX-Embodiment as the union of nearly everyone's lab data — see slide 29.

Ch 3 — Data Scaling 24 / 77

Strategy 1 — Better hardware (ALOHA family)

cheaper teleop → more data
ALOHA bimanual teleop hero — 6 dexterous tasks back-to-back · Zhao et al. 2023
ALOHA
2023 · Stanford

Open-source bimanual leader-follower rig, under $20k. Paired with ACT (Action Chunking Transformer).

Mobile ALOHA
2024 · Stanford

Wheeled base + ALOHA arms → whole-home tasks (cooking, laundry, elevator).

ALOHA 2 · ALOHA Unleashed
2024 · DeepMind

v2 hardware + sim · Unleashed = diffusion + lots of teleop → shoelace tying, gear insertion.

price ↓  =  data ↑  ·  cheap rigs let any lab contribute.
Ch 3 — Data Scaling 25 / 77

Strategy 2 — Handheld in-the-wild (UMI → Sunday)

no robot needed at collection time
UMI
Chi et al. · RSS 2024
arXiv:2402.10329

GoPro + handheld gripper → robot-compatible data collected anywhere (home, restaurant, outdoors). Cuts the per-episode cost by an order of magnitude.

Sunday Glove · Memo
Sunday Robotics · Nov 2025
launch

Same UMI team productized it: a $200 wearable glove + Memo humanoid. 2,000+ gloves shipped, 10M household-chore episodes from 500 homes.

Narrative bridge — Tony Zhao (ALOHA · Mobile ALOHA) + Cheng Chi (UMI) → spun out as Sunday. Same people, lower friction at every step: $20k rig → $400 GoPro rig → $200 glove.
Ch 3 — Data Scaling 26 / 77

Strategy 3 — Egocentric video (Ego4D · Ego-Exo4D)

massive · unlabeled · first-person
Ego4D
Grauman et al. · CVPR 2022
arXiv:2110.07058

3,670 hours of first-person video from 923 participants across 74 cities, 9 countries. The biggest single ego-video pool.

Ego-Exo4D
Grauman et al. · CVPR 2024
arXiv:2311.18259

Paired ego + exo third-person views of the same activity — supplies the alignment needed for body / hand transfer.

Human video has no action labels — but it is the only pool that already exists at scale. The next slide (EgoMimic / EgoVLA / HumanPlus) is about how we turn pixels into policy. Forward-ref → Ch7 LAPA recovers latent actions directly from these frames.
Ch 3 — Data Scaling 27 / 77

Strategy 3+ · Video → robot

embodiment bridge, not just supervision
EgoMimic
Georgia Tech · ICRA 2025

Aria glasses + bimanual robot, co-trained on paired human + robot data.

arXiv:2410.24221
EgoVLA
NVIDIA + UCSD · 2025

Pretrain a VLA on human video, fine-tune on a tiny robot set.

arXiv:2507.12440
HumanPlus
Stanford · CoRL 2024

Unitree H1 shadows humans from 3rd-person video. Boxing, piano, table tennis.

arXiv:2406.10454
Not just more supervision — an embodiment bridge. Human pixels transferred to robot policy, then to humanoid bodies.
Ch 3 — Data Scaling 28 / 77

Strategy 4 — Synthetic (sim + world-model dreams)

no humans in the loop
NVIDIA "From Dreams to Reality" — DreamGen / GR00T-Dreams synthetic trajectories
Isaac Lab

GPU-accelerated physics sim. Thousands of parallel envs, randomized scenes / lighting / textures.

RoboCasa

100k procedurally-generated kitchen tasks — appliances, layouts, AI textures.

DreamGen · GR00T-Dreams
arXiv:2505.12705

World model generates trajectories. NVIDIA reports producing the synthetic training data for GR00T N1.5 in about 36 hours, versus roughly 3 months of manual human data collection.

Forward-ref → Ch6 World Models · Ch7 WAMs reinterpret generation itself as the data factory.
Ch 3 — Data Scaling 29 / 77

OpenX-Embodiment — the union

22 embodiments · 60 datasets from 34 labs · 1M+ episodes
OpenX 22-embodiment task montage — same skill, different bodies · Open X-Embodiment Collaboration 2023

"Let's pool what we have."

Not a new collection method — a convention that aligns 60 existing datasets from 34 labs into one format so they're trainable together.

embodiments 22 · episodes 1M+ · institutions 21 · skills 527
Spawned the dataset wave around it: DROID (76k traj, OpenVLA's training pool), BridgeData V2, ALOHA Unleashed, RoboCasa. Same field-wide pressure to grow the pool.
Ch 3 — Data Scaling 30 / 77

Recap — data > model

almost every SOTA jump came from a new data source
Diagram: ACTION DATA ← strategy 1 Better hardware (ALOHA · Mobile · Unleashed) · strategy 2 Handheld device (UMI · Sunday Glove) · strategy 3 Ego video (Ego4D · EgoMimic · HumanPlus) · strategy 4 Synthetic (Isaac · RoboCasa · DreamGen) · OpenX as the union

Almost every SOTA jump
came from new data,
not a bigger model.

  • RT-1 → RT-2 = web data (not bigger ViT)
  • OpenVLA = OpenX 970k (not new architecture)
  • π0.5 OOD homes = co-training with 22k web episodes
  • GR00T N1.5 = DreamGen synthetic (3 mo → 36 hr)

Next chapter — how do we actually pour this data into a model?

Chapter 4 — VLAs: Vision-Language-Action Models 31 / 77
04

VLAs.

If a VLM outputs text, what does it take to make it output an action?

Ch 4 — VLAs 32 / 77

Definition — swap the head, get a VLA

Same vision encoder. Same LM backbone. Different decoder.

A · Vision-Language Model (VLM)
Diagram: image + text → Vision encoder (SigLIP / ViT) → LM backbone (Llama / PaLI) → text tokens · image + text → text

LLaVA, PaLI, PaliGemma, Chameleon — the row in last week's table.

B · Vision-Language-Action Model (VLA)
Diagram: image + text → Vision encoder (SigLIP / ViT) → LM backbone (Llama / Gemma) → action tokens / chunk / flow · image + text → ACTION

RT-2, OpenVLA, π0, GR00T — the only structural change is the right-hand block.

Ch 4 — VLAs 33 / 77

RT-2 — web knowledge → robot actions

The paper that made co-fine-tuning on web data + robot data the default.

Official project-page video · robotics-transformer2.github.io

Robot picks the "extinct animal" toy · never trained on the word "dinosaur" (rt2_teaser.mp4)

Backbone
PaLI-X / PaLM-E
PaLI-X at 5B · 55B, PaLM-E at 12B; web pre-training is preserved by co-fine-tuning the whole model, not by freezing it.
Action head
Actions as text tokens
256-bin discretization, integers in the LM's own vocab.
Trick
Co-fine-tune
VQA + caption + robot trajectories in one batch → emergent semantic generalization.
Result: the robot can act on concepts it never saw in robot data — "pick the extinct animal" works because the LM knew the word and the action head spoke the same token language.
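A tiny sketch of "actions as text tokens", assuming the action has already been discretized into 256-bin ids and is emitted as a space-separated string of integers. The helper names and the exact string layout are illustrative assumptions, not RT-2's code.

```python
def bins_to_text(bin_ids):
    """Render integer bin ids as a space-separated string the LM can emit."""
    if not all(0 <= b <= 255 for b in bin_ids):
        raise ValueError("bin ids must lie in [0, 255]")
    return " ".join(str(b) for b in bin_ids)

def text_to_bins(text):
    """Parse the LM's output string back into integer bin ids."""
    return [int(tok) for tok in text.split()]

example = [1, 128, 91, 241, 5, 101, 127]
as_text = bins_to_text(example)        # "1 128 91 241 5 101 127"
round_trip = text_to_bins(as_text)
```

Because the action string is just more text, VQA answers, captions, and robot trajectories can share one output vocabulary and one batch, which is what makes co-fine-tuning straightforward.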
Ch 4 — VLAs 34 / 77

OpenVLA — opening the door

The moment academia could reproduce a SOTA VLA on its own GPUs.

Project site · openvla.github.io

7B params · LoRA fine-tunes on a single 24 GB GPU · multi-embodiment generalist

7B
parameters · open weights
970k
OpenX episodes
Recipe
SigLIP + DINOv2 → Llama‑2 7B → bin tokens
Same head style as RT-2 (256-bin discretization), but every component is open and swappable.
← Bridge back to Ch3 slide 29

OpenVLA is the first direct payoff of the OpenX union. 970k episodes from 22 embodiments — the dataset Ch3 ended on — trained in one pass on one open model.

Ch 4 — VLAs 35 / 77

Generalist policies — three takes on the same year

Same problem, different action head: diffusion vs. block-wise transformer vs. dual-system.

RDT-1B

Oct 2024
Diffusion transformer over bimanual ALOHA. 1B params. Multi-modal action distribution → cleaner mode separation than bin tokens.
Tsinghua · arXiv:2410.07864

Octo

May 2024
Block-wise transformer + diffusion action head. Trained on 800k OpenX trajectories — runs on 9 embodiments out of the box.
Berkeley · arXiv:2405.12213

GR00T N1

Mar 2025
Dual system: Eagle VLM (S2) + DiT action head (S1). First open humanoid foundation model — data pyramid web → human ego → teleop.
NVIDIA GEAR · arXiv:2503.14734
Three labs, within a year of each other, each picking a different action head — VLA is not a single recipe. Each design choice (diffusion / block transformer / dual-system) becomes a whole chapter of follow-up work (Ch5).
Ch 4 — VLAs 36 / 77

Open vs closed VLA landscape

2026 H1 · 22 models
Columns: Open weights · Closed / API-only. Rows: Academic · Industry / product.
★ Open + Academic · "your semester project" · 4 models
OpenVLA '24 Stanford; RDT-1B '24 Tsinghua; Octo '24 Berkeley; LAPA '24 KAIST
Closed + Academic · rare · ~0 models
Open + Industry · "open core" is winning · 10 models
π0 '24 PI · openpi; π0-FAST '25 PI; π0.5 '25 PI · Sep PyTorch; GR00T N1 '25 NVIDIA; GR00T N1.5 '25 NVIDIA; GR00T N1.6 '26 NVIDIA; GR00T N1.7 '26 NVIDIA; DreamZero '26 NVIDIA; InternVLA-A1 '26 Shanghai AI Lab; LingBot-VLA '26 Ant Group
Closed + Industry · product moat / tech report only · 8 models
RT-1 '22 Google; RT-2 '23 Google; Helix '25 Figure; Gemini Rob 1.5 '25 DeepMind; π0.6 '25-11 PI · report only; π0.7 '26-04 PI · report only; GEN-1 '26 Generalist AI; GENE-26.5 '26 Genesis AI
2026 shift
Open column went from 4 → 14 models in 18 months. Closed bet now splits into vertically integrated (Helix, GENE-26.5) vs tech-report-only (π0.6, π0.7).
Ch 4 — VLAs 37 / 77

What VLAs are still bad at

VLM limits + robotics-native limits = the four open problems of 2026.

FAIL 01   Precise contact

Threading a needle, plugging a USB, inserting a key —
sub-millimeter force-modulated contact is where smooth VLA rollouts fall off a cliff.

VLM tokens have no haptic channel ·
tactile sensing not yet in the input pipe.
FAIL 02   Long horizon

"Make breakfast" — 30 sub-tasks, recovery from a dropped egg, no global plan to fall back on.
VLAs drift after ~30s of autonomous rollout.

No explicit planner · error compounds
across token-by-token rollouts.
FAIL 03   Novel embodiment

Trained on Franka + ALOHA → deployed on a new arm with different DoF, gripper, joint limits.
Zero-shot collapse is near-universal.

Action vocabulary is embodiment-specific ·
DreamZero needs 30 min of YAM plays to adapt.
FAIL 04   Speed

RT-2 runs at ~1 Hz; OpenVLA ~5 Hz.
Reactive contact and dynamic motion need 30–200 Hz. Big VLM = smart but slow.

1 Hz vs 200 Hz · the smarter the model,
the slower it serves — Ch5's whole motivation.
Ch 4 — VLAs 38 / 77

Two responses — faster S1, smarter S2

The field splits along the Kahneman line. Both halves get their own chapter.

VLA today smart but slow · contact-fragile · embodiment-locked
split along Kahneman line
S1

Faster motor expert

Take the action head off the LM's critical path. A small, fast expert runs at 30–200 Hz; the VLM only steers it.

  • Diffusion Policy · denoise an action chunk
  • π0 flow-matching expert · 50 Hz
  • Helix S1 · 80M params @ 200 Hz
  • FAST tokens · shorter sequences, same head
→ CHAPTER 5
S2

Smarter high-level brain

Externalize reasoning. Give the VLM a world to roll forward, a plan to follow, a video to imagine.

  • Embodied chain-of-thought · Gemini Robotics ER 1.5
  • World models as simulators · rollout-as-reasoning
  • World Action Models · joint video + action generation
  • DreamZero · the canonical WAM
→ CHAPTERS 6 & 7
Ch 4 — VLAs 39 / 77

VLA timeline — 14 milestone models

Survey Fig 2 · bookmark for Ch5–7
Timeline of major VLA models — Kawaharazuka et al. Survey Fig 2

Y-axis: chronology · X-axis: architectural lineage (CNN → Transformer → VLM → Diffusion / DiT → Flow → Latent action → Hierarchical)

Read the chart left–right.

Each x-column is a family of decoder choices. Same vision input, completely different action head — that's why "VLA" isn't a single recipe.

  • Open-weight track — OpenVLA · Octo · RDT-1B · GR00T N1/N1.5 are your candidate list for hands-on work.
  • Backbone shrinks over time — PaLI-X 55B → Llama-2 7B → PaliGemma 3B → Eagle 2B. Smaller, faster, robot-tuned.
  • Right edge = today — Hierarchical (π0.5) + Latent action (LAPA) + DiT (GR00T) are the live frontier.
The two models in this talk that aren't yet on the survey chart: Helix (Figure AI, 2025-02 — closed) and DreamZero (NVIDIA GR00T N2, 2026 — covered in Ch7 as a WAM, not a VLA).
Chapter 5 — Diffusion + Dual System 40 / 77
05

Smart & Fast

DDPM you just learned, re-cast as a policy — and the dual-system pattern that became consensus in 2026.

Ch 5 — Diffusion + Dual 41 / 77

Diffusion as policy — original figures, RSS 2023

Chi et al. RSS 2023 · Fig 1
Diffusion Policy Fig 1 — Explicit / Implicit / Diffusion policy comparison

(a) Explicit: regression / GMM · (b) Implicit: energy-based · (c) Diffusion: learn the gradient field

Chi et al. RSS 2023 · Fig 2
Diffusion Policy Fig 2 — observation horizon → action chunk; CNN + Transformer variants

Observation chunk → εθ(O, A, k) denoises a Tp-step action chunk · CNN (FiLM) or Transformer

In one line
The same εθ that denoises image pixels now denoises an action chunk conditioned on observation history — sample from a multi-modal action distribution instead of regressing to the mean.
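A minimal sketch of that one-liner as a training objective, assuming the standard DDPM forward process with a linear β schedule. The schedule values, the 16 × 7 chunk shape, and the placeholder observation are illustrative assumptions; `oracle` stands in for a trained network and cheats by using the clean chunk. The argument order follows the slide's εθ(O, A, k).

```python
import numpy as np

def make_alpha_bar(n_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_k) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def add_noise(chunk, k, alpha_bar, eps):
    """Forward process: the noisy action chunk at diffusion step k."""
    return np.sqrt(alpha_bar[k]) * chunk + np.sqrt(1.0 - alpha_bar[k]) * eps

def denoise_loss(eps_model, obs, chunk, k, alpha_bar, rng):
    """MSE between the true noise and the model's prediction of it."""
    eps = rng.standard_normal(chunk.shape)
    noisy = add_noise(chunk, k, alpha_bar, eps)
    return float(np.mean((eps - eps_model(obs, noisy, k)) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
chunk = rng.uniform(-1, 1, size=(16, 7))   # 16-step chunk of 7-D actions
obs = np.zeros(4)                          # placeholder observation features

def oracle(obs, noisy, k):
    # Inverts the forward process with the clean chunk in hand (not learnable in practice).
    return (noisy - np.sqrt(alpha_bar[k]) * chunk) / np.sqrt(1.0 - alpha_bar[k])

loss_oracle = denoise_loss(oracle, obs, chunk, 50, alpha_bar, rng)
loss_zero = denoise_loss(lambda o, a, k: np.zeros_like(a), obs, chunk, 50, alpha_bar, rng)
```

Swap the chunk tensor for an image tensor and this is the image-diffusion objective; only the target's shape and the conditioning signal differ.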
Ch 5 — Diffusion + Dual 42 / 77

Why "fast" matters — control frequency

FiS-VLA architecture · Chen et al. 2025 · Fig 2
FiS-VLA architecture (paper Fig 2) — System 2 (low-frequency planner) drives System 1 (high-frequency motor expert)

System 2 (low-freq plan, 1/n × step) drives System 1 (high-freq motor, every step)

Reported inference rate · log Hz
Model | Configuration | Rate
RT-2 | 55B end-to-end | ~1 Hz
OpenVLA | 7B AR | ~5 Hz
Diffusion Policy | - | ~10 Hz
ACT | chunked | 50 Hz
π0 | flow-matching | ~50 Hz
GR00T N1 | DiT expert | ~50 Hz
FiS-VLA | shared params | 117.7 Hz
Helix | S2 7B + S1 80M | 200 Hz
Contact: 30-50 Hz · Force-control: 200 Hz+. Smart-single-loop models all fall below; red rows = dual-system.
Ch 5 — Diffusion + Dual 43 / 77

Dual system — the Kahneman analogy

A 20-year-old cognitive-science partition, now reified in silicon — in robotics and in LLMs.

HUMAN · Kahneman 2002
S2 · Prefrontal · slow
  • speed: slow · deliberate
  • effort: serial · effortful
  • nature: reasoning · planning
  • e.g. solving 17 × 24
S1 · Motor · fast
  • speed: fast · automatic
  • effort: parallel · effortless
  • nature: reflex · skill
  • e.g. driving on an empty road

Two modes communicate through a shared intent signal — S2 sets goals, S1 executes.

ROBOT · Helix · π0 · GR00T
Diagram: image · instruction · scene → VLM BACKBONE (PaliGemma / Gemma-2B / 7B) · S2 · 5-9 Hz → latent z embedding → ACTION EXPERT (diffusion / flow / 80M visuomotor) · S1 · 50-200 Hz → joint torques · EE pose. S2 thinks slowly about the goal · S1 executes the motion at 20× the rate
model | S2 · slow | S1 · fast
Helix | 7B VLM 7–9 Hz | 80M expert 200 Hz
π0 / π0.5 | PaliGemma 3B | 300M flow expert
GR00T N1 | VLM planner | DiT diffusion
Thinking, Fast and Slow — Daniel Kahneman (2011), Farrar, Straus and Giroux book cover
Origin · cognitive science
Thinking, Fast and Slow
Kahneman 2002 Nobel lecture & 2011 best-seller. The S1 / S2 partition that the whole field is now copying.
Diagram: prompt → S2 · thinking (<think> ... long chain-of-thought </think> · 10-60 s · ~k tokens) → S1 · answer (final tokens · fast · concise)
Same pattern · non-robot AI
LLM reasoning — "think then answer"
OpenAI o1 / DeepSeek-R1 / Claude Extended Thinking all separate a long S2 reasoning trace from a fast S1 answer.
arXiv:2412.16720 (o1) · arXiv:2501.12948 (R1)
Ch 5 — Diffusion + Dual 44 / 77

π0 & π0.5 — flow-matching action expert in unseen homes

π0 architecture · Black et al. 2024 · Fig 1
pi0 architecture: pi-dataset + OXE + internet pre-training → SigLIP 400M + Gemma 2.6B pre-trained VLM → 300M action expert → 14/18/7-DoF embodiments
VLM + action expert mixture-of-experts. Flow matching: a single learned velocity field vθ(at, t, z) — one short ODE solve at inference vs T DDPM steps.
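A toy sketch of the "one short ODE solve", assuming time runs from 0 (pure noise) to 1 (action chunk) and is integrated with 10 forward-Euler steps. The direction convention, the step count, and `oracle_velocity` are assumptions for illustration; the oracle points straight at a known target and stands in for a learned vθ.

```python
import numpy as np

def integrate_flow(velocity, noise, z, n_steps=10):
    """Forward-Euler solve of da/dt = v(a, t, z) from t = 0 (noise) to t = 1."""
    a = np.array(noise, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        a = a + dt * velocity(a, t, z)
    return a

rng = np.random.default_rng(0)
target = rng.uniform(-1, 1, size=(50, 7))   # stand-in for the "true" action chunk
noise = rng.standard_normal(target.shape)

def oracle_velocity(a, t, z):
    # On a straight noise-to-data path, the velocity points at the target.
    return (target - a) / (1.0 - t)

chunk = integrate_flow(oracle_velocity, noise, z=None, n_steps=10)
```

With a straight-line path, a handful of Euler steps is enough to land on the target, which is the intuition behind trading a long DDPM denoising chain for a short ODE solve.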
π0.5 · never-seen homes

Fully autonomous Airbnb bedroom cleanup · 5× speed

94%
language-follow rate, unseen homes
100+
never-seen eval environments
10×
inference speedup over DP-style DDPM
Ch 5 — Diffusion + Dual 45 / 77

Helix — Figure AI's 7B over 80M, two robots, one set of weights

Two robots · full upper-body 35-DoF · shared weights

Helix grocery put-away — Figure 02 humanoids collaborate · Feb 2025

Frequency split · built from Helix blog text
Diagram: S2 · 7B VLM (scene + language + intent) · 7-9 Hz → SINGLE LATENT z → S1 · 80M visuomotor transformer (closed-loop dual-arm motor control) · 200 Hz · ~25× faster than S2 · bandwidth makes the difference
single set of weights

Both robots run the same network — no role-specific finetuning, no per-robot models.

on-board inference

Both S1 and S2 run on an embedded GPU on the robot — no cloud.

closed-source

No paper, no weights — the architecture description is from the blog text. Figure 03 follow-up streamed live in May 2026.
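A schematic sketch of the frequency split described on this slide, assuming a synchronous loop in which S1 runs every tick and S2 refreshes the latent every `s1_hz // s2_hz` ticks. Helix has no public code, so the function, the 8 Hz figure picked from the 7-9 Hz range, and the lambda stand-ins are all hypothetical; a real system may well run the two networks asynchronously.

```python
def dual_rate_loop(s2_policy, s1_policy, n_ticks, s1_hz=200, s2_hz=8):
    """Run S1 every tick and refresh the S2 latent every s1_hz // s2_hz ticks."""
    ticks_per_s2 = s1_hz // s2_hz
    latent, s2_calls, actions = None, 0, []
    for tick in range(n_ticks):
        if tick % ticks_per_s2 == 0:              # slow path: re-read scene + instruction
            latent = s2_policy(tick)
            s2_calls += 1
        actions.append(s1_policy(latent, tick))   # fast path: act on the latest latent
    return actions, s2_calls

# Hypothetical stand-ins for the two networks.
actions, s2_calls = dual_rate_loop(
    s2_policy=lambda tick: f"z@{tick}",
    s1_policy=lambda z, tick: (z, tick),
    n_ticks=200,                                  # one second at 200 Hz
)
```

In one simulated second, S1 emits 200 actions while S2 is consulted 8 times, so each latent is reused for 25 consecutive motor steps.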

Ch 5 — Diffusion + Dual 46 / 77

GR00T N1 → N1.5 → N2 — one company, one year, three generations

MAR 2025 · GEN 1

N1

first open humanoid stack
Diagram: VLM S2 planner (2B) → DiT action expert (diffusion) · data: real + sim + human video
  • S2 VLM + S1 diffusion transformer — the GR00T template
  • Open weights — first downloadable generalist humanoid policy

arXiv:2503.14734

SEP 2025 · GEN 1.5

N1.5

synthetic data via GR00T-Dreams
Diagram: Cosmos / GR00T-Dreams → synthetic data (action-labelled video) → N1.5 training · 3 mo human data ↓ 36 hours
  • Same dual-system arch, retrained on world-model-generated data
  • Novel-object generalization — pick objects never seen in teleop

research.nvidia.com/labs/gear/gr00t-n1_5

FEB 2026 · GEN 2 (DreamZero)

N2 a.k.a. DreamZero

video × action joint generation
Diagram: 14B AR VIDEO DIFFUSION → joint frames & actions (video + action)
  • Dual-system collapses into one — generation = policy
  • Cross-embodiment in 30 min of plays — we revisit in Ch 7

arXiv:2602.15922 · dreamzero0.github.io

One company, one year — open dual-system → world-model-generated data → joint video×action generation. Data generation time fell from roughly 3 months of human collection to 36 hours of synthesis.
Ch 5 — Diffusion + Dual 47 / 77

2026 frontier — Fast-in-Slow + Gemini Robotics 1.5

Fast-in-Slow

share parameters
FiS-VLA · Chen et al. NeurIPS 2025
Diagram: S2 · VLM (scene + language + reasoning) · ~29 Hz, with S1 nested inside S2 (selected layers shared · action expert reuses VLM features) · 117.7 Hz · 1:4 ratio · S2 thinks every 4 S1 ticks
  • No latent vector handoff — S1 reuses S2's intermediate features directly
  • Reaches 117.7 Hz on a single NVIDIA 4090 with chunk size 8

arXiv:2506.01953 · fast-in-slow.github.io

Gemini Robotics 1.5

externalize reasoning
Google DeepMind · Sep 2025 + follow-ups
Diagram: ER 1.5 · embodied reasoner (multi-step plan · web tools · chain-of-thought thinking) → EXPLICIT TEXT PLAN → Gemini Robotics 1.5 · VLA actor (plan + image → action chunk · embodiment-agnostic) → Aloha · bi-arm · humanoid
  • Communication channel is natural language — not a latent vector
  • Plans are inspectable, debuggable, and use external tools

deepmind.google/blog/gemini-robotics-15

2026 trend
Beyond separate S1 & S2 — share parameters (FiS-VLA) or externalize the reasoning channel (Gemini Robotics 1.5). Two opposite answers to "make the bridge thicker."
Chapter 6 — World Models 48 / 77
06

World Models.

Can a model learn the dynamics of the world — and use them?

Ch 6 — World Models 49 / 77

Origin — not a 2018 invention

The same partition appeared three times: model-based RL · recurrent controller · V·M·C decomposition.

1990 · ML Workshop

Dyna

Sutton
"Integrated Architectures for
Learning, Planning, and Reacting"
Dyna diagram: real env (s, a) → s', r · learned model M̂(s, a) → ŝ', r̂ · policy learned with Q-learning from both real and imagined experience

Model-based RL: train the model from real interaction, then plan / value-iterate inside the model. Same agent learns real + imagined experience.

Sutton, ML Workshop 1990
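A minimal tabular Dyna-Q sketch of the loop described above: act in the real environment, update Q, store the transition in a learned model, then replay imagined transitions from that model. The 5-state chain, the hyperparameters, and all names are assumptions chosen for illustration, not taken from the 1990 paper.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 5, 4      # toy chain 0..4, reward 1 on reaching state 4 (assumption)
ACTIONS = (-1, +1)         # step left / step right

def env_step(s, a):
    """Real environment: deterministic chain with a terminal goal."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def dyna_q(episodes=30, planning_steps=10, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(s, a)]
    model = {}               # learned model: (s, a) -> (s', r)

    def update(s, a, r, s2):
        future = 0.0 if s2 == GOAL else gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + future - Q[(s, a)])

    def choose(s):
        if rng.random() < eps:
            return rng.choice(ACTIONS)
        best = max(Q[(s, b)] for b in ACTIONS)
        return rng.choice([b for b in ACTIONS if Q[(s, b)] == best])  # random tie-break

    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = choose(s)
            s2, r = env_step(s, a)            # real experience
            update(s, a, r, s2)               # direct RL update
            model[(s, a)] = (s2, r)           # model learning
            for _ in range(planning_steps):   # imagined experience from the model
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return Q

Q = dyna_q()
greedy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)]
```

The same `update` function serves both real and imagined transitions, which is the point of the architecture: one learner, two sources of experience.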

1990 · FKI-126 / 147 TR

RNN world model
+ controller

Schmidhuber
"Making the World Differentiable"
+ "An On-line Algorithm…"
Diagram: controller RNN π → action → world model RNN M̂, trained by backprop through time · ∂ loss / ∂ a is differentiable, so the controller learns by propagating reward gradients through M̂

Differentiable world model: gradients of future reward flow through M back into π. This is exactly the modern policy-via-rollout pattern.

Schmidhuber 1990, TR FKI-126/147
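The gradient path can be made concrete with a scalar toy. This is a sketch under strong assumptions: a fixed linear world model s' = m_s·s + m_a·a stands in for the RNN M̂, a one-parameter controller a = k·s stands in for π, and the loss is the sum of squared future states. The backward pass is hand-written backpropagation through time.

```python
def rollout_loss_and_grad(k, s0=1.0, horizon=5, m_s=1.0, m_a=0.5):
    """Unroll the controller through the world model, then backprop to k."""
    # Forward: imagine a trajectory inside the (assumed, already learned) model.
    states = [s0]
    for _ in range(horizon):
        s = states[-1]
        a = k * s                          # controller
        states.append(m_s * s + m_a * a)   # world model prediction
    loss = sum(s * s for s in states[1:])

    # Backward: propagate d(loss)/d(state) back through the model, step by step.
    grad_k, g_next = 0.0, 0.0
    for t in reversed(range(horizon)):
        g = 2.0 * states[t + 1] + g_next   # total gradient reaching s_{t+1}
        grad_k += g * m_a * states[t]      # path s_{t+1} <- a_t <- k
        g_next = g * (m_s + m_a * k)       # path s_{t+1} <- s_t (direct and via a_t)
    return loss, grad_k

k = 0.0
initial_loss, _ = rollout_loss_and_grad(k)
for _ in range(200):
    _, grad = rollout_loss_and_grad(k)
    k -= 0.1 * grad                        # plain gradient descent on the controller
final_loss, _ = rollout_loss_and_grad(k)
```

In this toy the controller gain converges toward k = -2, the value that makes the modelled next state zero; no reward signal from a real environment is used after the model is fixed.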

2018 · arXiv:1803.10122

World Models · V·M·C

Ha & Schmidhuber
Figure: deep V·M + tiny C → policy inside a dream · V = VAE (→ z) · M = MDN-RNN, p(z'|z,a,h) · C = CMA-ES controller, 867 params · policy trained inside M's dream

Same partition, modernized: VAE for V, MDN-RNN for M, tiny CMA-ES controller. Agent never touches the real env during RL.

arXiv:1803.10122 · worldmodels.github.io

unifying
recipe
learn M̂(s, a) → ŝ' · use it to improve π.
Every WM since — Dreamer, V-JEPA, DreamDojo, DreamZero — is a re-mix of these three boxes with bigger backbones and richer data.
Ch 6 — World Models 50 / 77

The Dreamer family — one idea, five years

Hafner et al. · latent-imagination policy learning
2020 · arXiv:1912.01603
Dreamer V1
Dreamer V1: dataset of experience → learned latent dynamics → value and action learned by latent imagination
RSSM latent dynamics + actor-critic in imagination. Continuous control SOTA (DMC).
2021 · arXiv:2010.02193
Dreamer V2
Dreamer V2 RSSM: image x_t → encoder → stochastic z_t (32 categoricals × 32 classes) + deterministic h_t → predicted reward r̂ + reconstruction x̂
Discrete latents (32 categoricals × 32 classes). First world-model agent to reach human-level performance on the Atari 200M benchmark.
2023 · arXiv:2301.04104
Dreamer V3
Dreamer V3 unified configuration vs tuned experts across 8 benchmarks (Atari, ProcGen, DMLab, Minecraft, Atari100k, Proprio Control, Visual Control, BSuite). V3 matches or beats tuned baselines with one set of hyperparameters. Plus Minecraft Diamond return curve.
One unified config matches or beats tuned experts across 8 benchmark families — that's the V3 thesis. Diamond from scratch is the bonus.
Historical anchor · CoRL 2022

DayDreamer — Dreamer on real robots

Wu et al. trained a real A1 quadruped to walk from scratch in ~1 hour — physical-world WM-based RL, no sim. The earliest concrete bridge from V·M·C to robotics.

World model = imagination simulator the policy trains inside.

  • One family, three generations — same V·M·C recipe, scaled
  • RL framing dominated through 2023
  • DayDreamer proves transfer to a physical robot — bridge to Slide 61
Ch 6 — World Models 51 / 77

Two axes — methodology × purpose

Same word, different cells
Entertainment / Games
Robotics / Physical AI
Latent dynamics
(RNN/RSSM)
Dreamer V3 (Minecraft)
2023 diamond from scratch — in-game RL
DayDreamer
CoRL'22 quadruped, 1 h real-world walk
JEPA
(latent predict)
not a games line
V-JEPA 2 · LeWorldModel · DINO-WM
2025-26 action-conditioned representation prediction (Slides 53-54)
A-cond video WM
(obs,a)→obs'
Genie 1 · 2 · 3 · Oasis · Hunyuan-GameCraft · Mirage · MS Muse / WHAMM
2024-26 playable: keyboard/mouse = action
DreamDojo · DreamZero · GAIA-1 · GR00T-Dreams
2026 robot action conditioning (Slides 55-56, 60) · Sora·Veo·Wan = data sources, not WMs
3D / spatial
(geometry)
Genie 3 (long-horizon consistency)
2025 blurs into the robotics column —
4D-GS · Cosmos Transfer · Lyra 2 · VGGT / VGGT-Ω · MapAnything
2025-26 3D scenes + geometric foundation models (Slides 57-59)
Today we focus on the right column. The left column is moving fast too — and its data · architecture often crosses over (Genie 3 → sim asset, GameCraft engines → robot synth).
Ch 6 — World Models 52 / 77

Three modern (robotics-relevant) families

What "WM" actually points to in 2025-26 robotics papers — three branches.

Family 1

JEPA · latent prediction

predict representations, not pixels
  • V-JEPA 2 (Meta, 2025)
  • LeWorldModel (2026)
  • DINO-WM (ICLR 2025)
Bet: abstraction & efficiency. Don't waste compute reconstructing pixels you'll never plan over.
→ Slides 53-54
Family 2

Action-conditioned video WMs

(obs, action) → next pixels
  • DreamDojo (NVIDIA, 2026)
  • DreamZero (NVIDIA, 2026)
  • GR00T-Dreams / DreamGen
  • GAIA-1 · (Sora·Veo·Wan as data sources only)
Bet: predict the next pixels given an action — that's what makes it a WM, not video gen.
→ Slides 55-56 · 60
Family 3

3D & spatial WMs

explicit geometry
  • 4D Gaussian Splatting
  • Cosmos Transfer
  • Lyra 2 (NVIDIA, 2026)
  • VGGT / VGGT-Ω / MapAnything
Bet: robots live in 3D. Carry geometry, don't re-infer it every frame.
→ Slides 57-59
Ch 6 — World Models 53 / 77

Family 1 · V-JEPA 2 (Meta, 2025)

Predict masked representations, not pixels — 1.2 B params
V-JEPA 2 diagram: context (visible) → Context Encoder · target (masked) → Target Encoder (EMA · stop-grad) · predictor in latent space, conditioned on action a_t for the V-JEPA 2-AC variant · L1 loss in latent space, no pixel reconstruction · downstream → action understanding (Something-Something v2) · robot manipulation

"Don't predict pixels you'll never plan over."

  • 1.2 B params · trained on internet video
  • SOTA on action understanding & anticipation
  • Action-conditioned variant for manipulation
LeCun's long-standing pitch finally shipped at scale — representation prediction beats pixel prediction on efficiency.
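Two pieces of the recipe on this slide, the EMA target encoder and the L1 loss in latent space, can be sketched in a few lines of NumPy. Everything here is a toy assumption (linear encoders, identity predictor, random inputs, τ = 0.99); it shows the mechanics, not V-JEPA 2's architecture or training code.

```python
import numpy as np

def ema_update(target_w, online_w, tau=0.99):
    """Target encoder slowly tracks the context encoder; no gradient flows into it."""
    return tau * target_w + (1.0 - tau) * online_w

def latent_l1(pred_latent, target_latent):
    """L1 distance between predicted and target representations (no pixels involved)."""
    return float(np.mean(np.abs(pred_latent - target_latent)))

rng = np.random.default_rng(0)
online_w = rng.normal(size=(16, 8))   # toy linear context encoder (assumption)
target_w = online_w.copy()            # target encoder starts as a copy
predictor = np.eye(8)                 # toy predictor, identity for the demo

frames = rng.normal(size=(4, 16))     # stand-in for flattened video patches
visible = frames.copy()
visible[:, 8:] = 0.0                  # mask half of each input for the context branch

pred = (visible @ online_w) @ predictor
target = frames @ target_w            # treated as a constant (stop-grad)
loss = latent_l1(pred, target)

online_w_after = online_w + 0.1       # pretend one optimizer step moved the online weights
target_w_after = ema_update(target_w, online_w_after)
```

With τ = 0.99 the target weights move only 1% of the way toward the updated online weights per step, which is what keeps the prediction target stable.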
Ch 6 — World Models 54 / 77

Family 1 · 2026 JEPA-line — DINO-WM + LeWorldModel

Small, plannable WMs in latent space, with no pixels decoded back.

DINO-WM

NYU + Meta · Zhou et al. · ICLR 2025
arXiv:2411.04983
DINO-WM architecture: past frames o_{t-k}..o_t → DINOv2 encoder → latents z_{t-k}..z_t → dynamics head p_θ with action a_t → ẑ_{t+1}; test-time actions optimized via planning loss vs goal z_g

Frozen DINOv2 features + small dynamics head trained on (z, a) → ẑ'. At test-time, optimize action sequence by gradient descent on planning loss to goal latent.

dino-wm.github.io
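A minimal sketch of test-time planning by gradient descent on an action sequence, in the spirit of the description above. Assumptions: the frozen encoder is skipped (latents are given directly), the learned dynamics head is replaced by a fixed linear map ẑ' = A z + B a, and the gradient is derived by hand. None of the names come from the DINO-WM codebase.

```python
import numpy as np

def plan_actions(A, B, z0, z_goal, horizon=10, steps=200, lr=0.05):
    """Optimize an action sequence so the imagined final latent matches z_goal."""
    actions = np.zeros((horizon, B.shape[1]))
    for _ in range(steps):
        # Forward: roll the latent dynamics out under the current actions.
        z = z0
        for t in range(horizon):
            z = A @ z + B @ actions[t]
        # Backward: gradient of ||z_H - z_goal||^2 with respect to each action.
        g = 2.0 * (z - z_goal)
        grad = np.zeros_like(actions)
        for t in reversed(range(horizon)):
            grad[t] = B.T @ g
            g = A.T @ g
        actions -= lr * grad
    # Evaluate the final plan.
    z = z0
    for t in range(horizon):
        z = A @ z + B @ actions[t]
    return actions, float(np.sum((z - z_goal) ** 2))

A = 0.9 * np.eye(2)            # toy latent dynamics (assumption)
B = 0.5 * np.eye(2)            # toy action effect (assumption)
z_start = np.array([1.0, 0.0])
z_goal = np.array([0.0, 1.0])
plan, final_cost = plan_actions(A, B, z_start, z_goal)
```

With a real learned dynamics head the hand-written backward pass would be replaced by automatic differentiation; the loop structure stays the same.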

LeWorldModel

Maes · LeCun · Balestriero · Mar 2026
arXiv:2603.19312
LeWorldModel vs PLDM / DINO-WM / Dreamer / TD-MPC comparison: addresses 6→1 hyperparameter, anti-collapse, end-to-end, task-agnostic, reconstruction-free
LeWorldModel architecture: two encoders process o_t and o_{t+1}; predictor takes (z_t, a_t) → ẑ_{t+1}; MSE loss between ẑ_{t+1} and z_{t+1}; SIGReg regularization on latents (Sketched Isotropic Gaussian Regularization) prevents collapse via random projection normality tests
48× faster planning
15M params total
1 hyperparameter

github.com/lucas-maes/le-wm

why JEPA
Small + plannable in latent space — cheapest WM-for-planning option when you don't need pixels back. LeWorldModel proves end-to-end JEPA is now stable.
Ch 6 — World Models 55 / 77

Family 2 · Video gen as a data & sim source

not WMs themselves — but the substrate for them

Sora·Veo·Wan·Cosmos·Genie are video generators: prompt → video. They become WMs only with action conditioning (Slide 56, 60).

OpenAI · 2024-10 · v2 2025-10
Sora 2
Text-to-video w/ native audio. Consumer app shut Apr 2026 — API only.
DeepMind · v3.1 2025
Veo 3 / 3.1
1080p / 4K, 8 s clip, native audio & dialogue. Frames-to-video for control.
NVIDIA · 2025-01 · arXiv:2501.03575
Cosmos Predict
Physical-AI WFM — pretrained for sim-to-real & robot data gen.
DeepMind · 2025-08
Genie 3
24 fps / 720p / multi-minute consistency — long-horizon scene memory.
Historical anchor · ICLR'24 Outstanding  ·  arXiv:2310.06114 UniSim (Yilun Du et al.) — first paper to call video generation an interactive real-world simulator. The robotics WM ports (DreamDojo, GAIA-1, etc.) wrap these backbones with action conditioning — next slide.
Ch 6 — World Models 56 / 77

Family 2 · 2026 robotics WMs — DreamDojo + DreamZero

action-conditioned · not video gen

Sora·Veo·Wan are video generators. WMs predict next state given action. Two true 2026 WMs:

DreamDojo

NVIDIA · Feb 2026
arXiv:2602.06949 · dreamdojo-world.github.io
DreamDojo: Human-Video Pretraining → DreamDojo → Robot Post-Training (GR-1, G1, AgiBot, YAM) → Autoregressive Distillation → Applications (Unseen Env, Live Teleop, Policy Eval, Model-based Planning)
  • Pretrain on human videos (EgoDex, In-lab, DreamDojo-HV) → post-train on robot data — cross-embodiment WM
  • Action-conditioned: (obs, action) → next obs — supports policy eval, MPC, unseen-env deploy

DreamZero

NVIDIA · Feb 2026
arXiv:2602.15922 · dreamzero0.github.io · 14B WAM
  • Joint video × action generation — one 14B AR model emits next frames AND next actions in lockstep
  • Open weights · cross-embodiment in 30 min plays · revisited in Ch 7 as the canonical WAM
important distinction
Video gen models (Sora·Veo·Wan·Cosmos) make beautiful pixels but do not take an action input. They become WMs only when wrapped with explicit action conditioning — Slide 60 covers that bridge.
Ch 6 — World Models 57 / 77

Family 3 · 3D & spatial world models

Carry the geometry — don't re-infer it every frame.

4D Gaussian Splatting

HUST · Wu et al. · CVPR 2024
arXiv:2310.08528
4D Gaussian Splatting coarse-to-fine pipeline: Random Point Cloud → 3D Gaussian Initialization at Iter 3000 → 4D Gaussian Joint Optimization at Iter 20000

Explicit 3D Gaussians + per-Gaussian motion over time. Real-time render, novel-view eval — the geometric substrate Lyra 2 generates and Cosmos Transfer renders on.

hustvl/4DGaussians

Cosmos Transfer 1

NVIDIA · Mar 2025
arXiv:2503.14492
Cosmos Transfer architecture: simulated world + depth/segmentation/etc sensor modalities + text prompt → frozen Cosmos-1 foundation model + per-modality ControlNets with spatiotemporal control maps → output world

Structured inputs (depth / segmentation / edge) → photoreal video via frozen Cosmos-1 + per-modality ControlNets. The standard 2025-26 sim-to-real data pipe.

research.nvidia.com/labs/dir/cosmos1

data pipe
Sim engine generates structure. Cosmos Transfer paints pixels. Policy trains on the pixels. Lyra 2 (Slide 58) generates the 3D scene itself.
Ch 6 — World Models 58 / 77

Family 3 · NVIDIA Lyra 2

One image / prompt → navigable 3D Gaussian scene — a geometry-native WM.

Lyra 2.0 teaser — text/image → navigable 3D Gaussian-splat scene (user pans through GUI)

A scene-generation foundation model.

  • Built on WAN 2.1 video backbone — distilled into 3D-GS
  • Output is 3D Gaussian splats, not video — truly navigable
  • Plugs straight into Isaac Sim for robot rollouts
Why this matters for robotics: generates the environment the policy will train on — not just a 5-second clip. Synthetic data & evaluation in one shot.
research.nvidia.com/labs/sil/projects/lyra2 · github.com/nv-tlabs/lyra
Ch 6 — World Models 59 / 77

Family 3 · Geometric Foundation Models

"World model" = 3D reconstruction as a feed-forward foundation. N views → cameras, depth, geometry in one pass.

VGGT

FAIR + Oxford
CVPR'25 Best
arXiv:2503.11651
VGGT architecture: N input images → DINO + concat + camera token → Global Attention + Frame Attention (×L) → Camera Head + DPT → cameras, depth maps, point maps, tracks in one forward pass

N views → cameras + depth + pointmaps + tracks in one feed-forward pass. The foundation.

VGGT-Ω

Wang · Vedaldi et al.
CVPR'26 Oral
arXiv:2605.15195
VGGT-Omega architecture: introduces Register Attention as alternative to Global/Frame Attention, plus training-only Matching Loss and Point Loss; same I/O as VGGT but lighter compute and richer supervision

What's new vs VGGT: Register Attention (lighter), Matching + Point losses (training-only).

30% memory
15× data
+77% Sintel

MapAnything

Meta + CMU
3DV 2026
arXiv:2509.13414
MapAnything method: Visual Input N + optional Geometric Inputs (Ray Directions, Pose, Depth) → Multi-Modal Encoders with shared weights → Multi-View Transformer → MLP scaling factor + DPT Head + Pose Head → metric 3D scene

Universal feed-forward metric 3D — real-world units, not just relative scale.

JEPA carries latents. Video gen carries pixels. This line carries geometry itself — what WoRV uses today for data & eval.
Ch 6 — World Models 60 / 77

Action-conditioned world models

Without an action input, it's just video gen. With one — it's a robotics WM. This is the bridge to Ch 7.

GAIA-1 — driving WM

Wayve · Sep 2023 · 9B params
arXiv:2309.17080
GAIA-1 schematic: input video → image encoder, action input (speed, steering) → action encoder, text input → text encoder, all three streams → world model with autoregressive prediction → output tokens → video decoder → output video

Three input streams — video, action, text — into one AR transformer. Counterfactual rollouts: "what if I steer left?"

wayve.ai/thinking/introducing-gaia1

GR00T-Dreams / DreamGen

NVIDIA GEAR · May 2025
arXiv:2505.12705
GR00T-Dreams / DreamGen: initial frame → video world model → synthetic generated videos for robot learning with automatically extracted pseudo-actions â_{1:H}, used for contact-rich data augmentation, new behavior generalization, new environment generalization

Initial frame → WM → synthetic robot videos. Pseudo-actions â auto-extracted — used for new behaviors and unseen environments.

developer.nvidia.com/blog/r2d2

bridge to Ch 7
Once you can condition on action, you can also generate action. That's a WAM (World Action Model) — DreamZero (Slide 66).
Ch 6 — World Models 61 / 77

Why WMs matter for robotics

framework · NTU MARS survey arxiv:2605.00080

Three core capabilities of an actionable world model — foresight, imagination-planning, data amplification.

1
foresight

Anticipate consequences before executing

Predict next state under candidate actions — the policy can reason about contact, dynamics, and physical regularities language-only pretraining never captures.

Examples · LingBot-VA · SayDream · MOTUS · TC-IDM
2
planning

Imagine rollouts & pick the best

MPC / search inside the WM. Use the imagined future to compare candidate behaviors before acting.

DayDreamer (real A1 in 1h) · Dreamer V3 (Minecraft Diamond) · DINO-WM (latent planning) · CosmosPolicy
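"Imagine rollouts & pick the best" can be sketched as random-shooting MPC: sample candidate action sequences, roll each one out inside the world model, execute only the first action of the cheapest rollout, then replan. The 2-D point dynamics, the cost, and all names are illustrative assumptions; a real system would call a learned model here.

```python
import numpy as np

def world_model(state, action):
    """Toy stand-in for a learned model: a 2-D point nudged by the action."""
    return state + 0.1 * action

def random_shooting_mpc(state, goal, rng, horizon=3, n_candidates=512):
    """Return the first action of the imagined rollout with the lowest cost."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, state.shape[0]))
    best_cost, best_first = np.inf, None
    for seq in candidates:
        s, cost = state, 0.0
        for a in seq:
            s = world_model(s, a)                  # imagined step, no real robot
            cost += float(np.sum((s - goal) ** 2)) # running distance-to-goal cost
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

rng = np.random.default_rng(0)
state = np.zeros(2)
goal = np.array([0.3, -0.2])
initial_dist = float(np.linalg.norm(state - goal))
for _ in range(20):
    action = random_shooting_mpc(state, goal, rng)
    state = world_model(state, action)   # in this toy the model also plays the real env
final_dist = float(np.linalg.norm(state - goal))
```

Replanning at every step is what makes this MPC rather than open-loop planning: errors in one imagined rollout are corrected by the next.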
3
data

Synthesize trajectories at scale

Trained WM = generator of new (obs, action) pairs — replace expensive teleop with imagined rollouts.

GR00T-Dreams: 3 months → 36 h · DreamDojo · DreamGen · CosmosPredict
+

Zero-shot policy evaluation

Roll a policy inside the WM — no real-robot time. DeepMind's Gemini-in-Veo: 1,600+ real evals replaced. WorldEval (Midea) is the academic counterpart.

Now core to the learning loop

2026 trend (NTU MARS): WMs no longer auxiliary — VLA-RFT, WMPO, RynnVLA-002 co-evolve policy + WM in one loop.

NTU MARS survey landscape: temporal evolution of representative WM works for robotic policy learning — 'World Model for Policy' branch (UniPi, Gen2Act, VidMan, VPP, GR-1, UVA, UWA, FLARE, Vidar, WorldVLA, RynnVLA-002, DreamVLA, TriVLA, UniVLA, VideoPolicy, VideoVLA, UD-VLA, GE-ACT, Motus, F1, InternVLA-A1, Video2ACT, LVP, MimicVideo, CosmosPolicy, DreamZero, GigaWorld-Policy, Fast-WAM, LingBot-VA, BagelVLA, LDA-1B, FRAPPE, WoG, VLA-JEPA, JEPA-VLA, HALO, CoWVLA, TC-IDM, Say-Dream-ACT) + 'World Model as Simulator' branch (IRASim, GPC, World-Env, Ctrl-World, World in World, WorldEval, World4RL, VLA-RFT, DiWA, DreamPlan, WMPO, RISE, Giga-Brain-0.5M, WorldVLA-Loop, PlayWorld, VLAW, WoVR), color-coded by style
The landscape — NTU MARS survey

~60 models in 18 months across 2 branches, 7 styles.

  • For Policy: IDM → Single-Backbone → MoE/MoT → Unified VLA → Latent WM
  • As Simulator: validation → RL env → policy co-evolution
why this chapter is longest One trained WM addresses foresight · planning · data · eval · co-opt — five of the most expensive things in robotics.
Chapter 7 — World Action Models 62 / 77
07

World Action Models.

Does knowing the world’s dynamics turn into zero-shot policies?

Ch 7 — WAM 63 / 77

The IDM idea — label video backwards

Baker et al. · OpenAI VPT · 2022
(a) Supervised · teleop / contractor

Pay humans to label every frame

Figure: human contractor (270k h) records frame → action (key + mouse: W+SPACE, dx=4, dy=-2) → BC policy π(a|o). Cost: human-hour per label. Caps at ≈10⁵ frames.

Linear in dollars. Every new frame = another teleop hour. OpenVLA, DROID, OpenX all live here.

(b) IDM · infer action from 2 frames

Recover the missing label from raw video

Figure: tiny labeled set (2k h, contractor) trains the IDM → web video (70k h, no labels, YouTube Minecraft) → frames o_t, o_{t+1} → IDM π_inv(a | o_t, o_{t+1}) → pseudo-label <â = W+SPACE, dx=4> → BC on web → VPT (diamond!)

Sub-linear in dollars. 2k labeled hours teach the IDM → the IDM labels 70k unlabeled web hours → first foundation policy in Minecraft.
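The pseudo-labeling recipe in a toy form: fit an inverse dynamics model on a small labeled set, then use it to label a much larger unlabeled pool. The assumptions are loud: observations are 1-D positions, the three actions shift the position by -1, 0, or +1, and the IDM is a nearest-centroid classifier on the frame difference. VPT's IDM is a neural network over video; only the data flow matches.

```python
import numpy as np

rng = np.random.default_rng(0)
EFFECT = np.array([-1.0, 0.0, 1.0])   # hypothetical action effects: left, stay, right

def simulate(n):
    """Generate (o_t, o_{t+1}, action) triples from the toy environment."""
    o = rng.normal(size=n)
    a = rng.integers(0, 3, size=n)
    o_next = o + EFFECT[a] + 0.1 * rng.normal(size=n)
    return o, o_next, a

def fit_idm(o, o_next, a):
    """IDM training: one centroid of (o_{t+1} - o_t) per action."""
    delta = o_next - o
    return np.array([delta[a == k].mean() for k in range(3)])

def idm_label(centroids, o, o_next):
    """IDM inference: pick the action whose centroid is closest to the observed delta."""
    delta = (o_next - o)[:, None]
    return np.argmin(np.abs(delta - centroids[None, :]), axis=1)

# Small labeled set (plays the role of the contractor data).
o_l, on_l, a_l = simulate(60)
centroids = fit_idm(o_l, on_l, a_l)

# Large "web" pool: labels exist in the simulator but are used only for scoring.
o_u, on_u, a_u = simulate(5000)
pseudo_labels = idm_label(centroids, o_u, on_u)
accuracy = float((pseudo_labels == a_u).mean())
```

A behavior-cloning policy would then be trained on `(o_u, pseudo_labels)` exactly as if the labels had been collected by hand.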

port to robotics
Same recipe runs in robot land: UniPi (NeurIPS'23) · GR-1 (ICLR'24, CALVIN 94.9) · VPP (ICML'25) · NovaFlow ('26).
Ch 7 — WAM 64 / 77

LAPA — latent actions from video

Ye et al. · ICLR 2025 · 🇰🇷 KAIST-led

Drop the hand-defined action label. Learn a latent action with VQ-VAE on inter-frame deltas.

Step 1

Latent action

VQ-VAE on (o_t, o_{t+1})

Quantize inter-frame deltas → discrete latent z, no human label.

Step 2

Latent VLA pretrain

predict next z · web video

Train VLA to predict next latent action from raw video. Zero robot data.

Step 3

Align to real action

tiny robot finetune → a

Map latent z to real joint actions with a small labeled set.
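Steps 1 and 3 above can be sketched with a plain vector-quantization codebook. This is a simplified stand-in, not LAPA's method: k-means replaces VQ-VAE training, observations are 2-D points rather than frames, Step 2 (latent VLA pretraining) is skipped, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
MOVES = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])  # hidden true actions

def make_pairs(n):
    a = rng.integers(0, 4, size=n)
    o_t = rng.normal(size=(n, 2))
    o_next = o_t + MOVES[a] + 0.05 * rng.normal(size=(n, 2))
    return o_t, o_next, a

def dists(deltas, codebook):
    return np.linalg.norm(deltas[:, None, :] - codebook[None, :, :], axis=-1)

def quantize(deltas, codebook):
    """Map each inter-frame delta to the index of its nearest code (latent action z)."""
    return np.argmin(dists(deltas, codebook), axis=1)

def fit_codebook(deltas, k=4, iters=10):
    """k-means with farthest-point init, used here in place of VQ-VAE training."""
    codes = [deltas[0]]
    for _ in range(k - 1):
        far = np.argmax(np.min(dists(deltas, np.array(codes)), axis=1))
        codes.append(deltas[far])
    codebook = np.array(codes)
    for _ in range(iters):
        z = quantize(deltas, codebook)
        codebook = np.array([deltas[z == j].mean(axis=0) for j in range(k)])
    return codebook

# Step 1: learn latent actions from unlabeled (o_t, o_{t+1}) pairs.
o_t, o_next, _ = make_pairs(2000)
codebook = fit_codebook(o_next - o_t)

# Step 3: align latent codes to real actions with a tiny labeled set.
o_l, on_l, a_l = make_pairs(40)
z_l = quantize(on_l - o_l, codebook)
code_to_action = {int(j): int(np.bincount(a_l[z_l == j]).argmax()) for j in np.unique(z_l)}

# Check: decode latent actions on fresh pairs and compare with the hidden truth.
o_e, on_e, a_e = make_pairs(1000)
decoded = np.array([code_to_action.get(int(z), -1) for z in quantize(on_e - o_e, codebook)])
accuracy = float((decoded == a_e).mean())
```

Only the 40 labeled pairs in Step 3 ever see a real action label; the codebook itself is learned from deltas alone.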

Why it matters — the IDM idea, without hand-defined labels

+6.22%
over OpenVLA SOTA
30×
more pretrain-efficient
  • Generalizes across embodiments: the latent actions are learned without action labels, so they are not tied to any one robot's joint space
  • Sets up the WAM intuition: video carries action information implicitly
  • Directly cited by DreamZero, InternVLA-A1, LingBot-VA as the latent-action ancestor

🇰🇷 First authors + both advisors (Kimin Lee · Minjoon Seo) at KAIST.

Ch 7 — WAM 65 / 77

From IDM / LAPA to joint video × action modelling

Three architectural styles — the bifurcation point splits Slides 66 (canonical) and 68 (hybrid).

STEP 1 · 2022

IDM

video → action

Recover the missing label using a tiny labeled set.

VPT · UniPi · GR-1
STEP 2 · 2024

LAPA

video → latent z → action

Learn the action as a VQ-VAE latent. Drop hand labels.

LAPA · UniSim-latent
STEP 3 · 2026

WAM

video ↔ action (joint)

Stop separating them. Generate future video AND action together.

DreamZero (canonical) + 12 hybrids
Step 3 architecture variants · NTU MARS survey
Three WAM architecture styles from NTU survey: (a) IDM-Style = Video Generation Model → Inverse Dynamics Model → action; (b) Single-Backbone-Style = shared backbone emits both observation tokens and action tokens; (c) MoT-Style = separate Video Expert and Action Expert with Joint Attention

(3) Canonical WAM

→ slide 66

Single-Backbone (b). One generative model over both video and action. AR diffusion. DreamZero, Cosmos-Policy.

(3a) VLA + WM hybrid

→ slide 68

MoT-Style (c). Video expert + action expert, joint attention. Video-pred as auxiliary loss. InternVLA-A1, LingBot-VA.

Ch 7 — WAM 66 / 77

DreamZero — “World Action Models are Zero-shot Policies”

NVIDIA GEAR · 2026-02

DreamZero overview · 14B AR video diffusion + action

62.2%
AgiBot · seen

vs VLA 27.4%

39.5%
AgiBot · unseen

vs from-scratch ≈0%

49%
DROID · unseen verbs

vs π0.5 33% · GR00T N1.6 31%

Why-it-works: VLA = semantic prior · WAM = physical prior. Generating the future video conditions the action on a real rollout.
Single-backbone video × action lineage · 2024-26, 7 steps
2024 · ICLR
GR-1
ByteDance
2025 · ICML
VPP
Tsinghua
2025 · RSS
UVA
Stanford
2025 · RSS
UWM
UW WEIRD
2026 · ICLR
GE-Act
AgiBot
2026-01 · arXiv
Cosmos-Policy
NVIDIA · 2B
2026-02
DreamZero
NVIDIA · 14B

DreamZero didn't appear out of nowhere — a year+ of single-backbone work made it scalable. Cosmos-Policy (Jan 2026) was NVIDIA's own one-month-prior step.

Ch 7 — WAM 67 / 77

DreamZero deployment — 7 Hz at 14B · 30 min to a new embodiment

(a) System · running a 14B WAM at robot rate
Figure: 14B AR video diffusion + action (multi-step denoise) → distill → DreamZero-Flash (single-step distillation · causal cache · KV reuse) → 7 Hz control rate. Context, control frequency for big VLMs: RT-2 · 1 Hz · OpenVLA · 5 Hz · DreamZero · 7 Hz · π0 · 50 Hz · Helix S1 · 200 Hz

14B normally implies seconds-per-step. Flash distillation + KV-cache reuse lands at robot-usable rate — not in the π0 zone, but enough for tabletop manip.
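To make "robot-usable rate" concrete, each control frequency on this slide can be turned into a per-step latency budget (1000 / Hz, in milliseconds). The rates are the ones quoted above; the helper name `step_budget_ms` is an assumption for illustration.

```python
# Control rates quoted on the slide, in Hz.
RATES_HZ = {"RT-2": 1, "OpenVLA": 5, "DreamZero": 7, "π0": 50, "Helix S1": 200}

def step_budget_ms(rate_hz: float) -> float:
    """Time available for one full inference step at a given control rate."""
    return 1000.0 / rate_hz

budgets = {name: round(step_budget_ms(hz), 1) for name, hz in RATES_HZ.items()}

for name, ms in budgets.items():
    print(f"{name:>10}: {ms:7.1f} ms per step")
```

At 7 Hz the whole 14B forward pass, including denoising, has to fit in roughly 143 ms, versus 20 ms at the 50 Hz quoted for π0.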

(b) Cross-embodiment · 30 min on YAM

YAM teddy transfer · 30 min of plays, new robot · one model

Model + deployment, advancing together.

The same generative prior that gives zero-shot tasks also gives fast embodiment adaptation.

Ch 7 — WAM 68 / 77

Concurrent VLA × WM hybrids — early 2026

same intuition · different recipes

InternVLA-A1

2026-01-05
Shanghai AI Lab + Humanoid Robot (Shanghai) Co. · 42 authors · lead Jia Zeng / Jiangmiao Pang
+26.7%
dynamic tasks
75.1%
avg / 12 real tasks

MoT (Mixture-of-Transformers) — one backbone, three experts: understanding, generation, action. 2-3B params, 692M pretraining frames.

arXiv:2601.02456

Causal World Modelling aka LingBot-VA

2026-01-29
Ant Group / Robbyant · Li / Zhang / Luo et al. · corresp. Yinghao Xu
LingBot-VA framework: Language Model + alternating Video Model (generates next frames Ô) + Action Model (predicts next actions Â), with shared latent space and async inference. Task prompt 'Unpack delivery', initial observation O_0, sequence O_1→O_2→O_3 interleaved with A_1, A_2, A_3.
92.9%
LIBERO · easy
91.6%
LIBERO · hard

AR diffusion over future frames + policy execution in shared latent space. SOTA on all 6 real-robot tasks vs π0.5; +8-9% at horizon 3.

arXiv:2601.21998
Same intuition, different design. Plus 12+ more in the NTU MARS survey · GE-Act · Motus (+45% over π0.5) · BagelVLA · FRAPPE · STARRY · WAV.
Ch 7 — WAM 69 / 77

WAM follow-ups — mid 2026 · four directions in ∼3 months

post-DreamZero
simplify architecture

Being-H0.7

2026-04-30
Peking U / BeingBeyond · Zongqing Lu

No future-video generation. Just a dual branch with learnable latent queries between perception and action. Pretrained on 200k h ego video + 15k h robot demos — the Ch3 ego pool, finally cashed.

99.2%
LIBERO
62.1%
RoboCasa
49.2%
GR1
arXiv:2605.00078 · ← Ch3 forward-ref
transfer efficiency

CKT-WAM

2026-05-07
Tsinghua + LivsynRobotics + Shanghai AI Lab

Teacher hidden states → compressors → routed adapters → student text-embedding space. Knowledge transfer between WAMs without touching the backbone.

86.1%
LIBERO-Plus
1.17%
trainable params
arXiv:2605.06247
object addressing

OA-WAM

2026-05
Object-Addressable WAM

Treat objects as first-class addressable entities inside the world model. Lets the policy refer to “the red cup” the way a VLA refers to a token — not a pixel patch.

97.8%
LIBERO
preprint · 2026-05
counter-narrative
do we need imagination?

Fast-WAM

2026-03

“Test-time future imagination is unnecessary.” Skip the rollout, get SOTA at 4× speed, directly challenging DreamZero / Being-H0.7’s core premise.

190 ms
inference
4×
faster

FFDC-WAM threads the needle with conditional imagination.

Ch 7 — WAM 70 / 77

WMs vs WAMs vs VLAs — 5-architecture view

NTU MARS survey · arXiv:2605.00080
Pattern 1

IDM-style

IDM-style architecture

Decouple: predict subgoal / latent, then act. VPT · UniPi · LAPA

Pattern 2

Single-backbone

Single-backbone architecture

One model, video × action jointly. UVA · UWM · DreamZero

Pattern 3

MoE / MoT

Mixture of Transformers / Experts architecture

Expert fusion / joint attention. LingBot-VA · GE-Act · BagelVLA

Pattern 4

Unified-VLA

Unified VLA architecture

VLA + foresight / video pred. as aux. GR-1 · InternVLA-A1 · UniVLA

Pattern 5

Latent-space

Latent / JEPA-style architecture

JEPA-style, no pixel gen. V-JEPA 2 · VLA-JEPA · FLARE · Being-H0.7

Every model in today’s talk lives in one of these 5 boxes. Patterns 2 and 3 are where the action is in 2026 — canonical WAM (slide 66) and VLA × WM hybrid (slide 68).
Chapter 8 — WoRV @ maum.ai 71 / 77
08

WoRV @ maum.ai

How we're working on all of this from Korea

Ch 8 — WoRV 72 / 77
WoRV

WoRV @ maum.ai

World model for Robotics and Vehicle control

maum.ai's physical-AI division.
We build foundation models that work in the physical world.

@ Pangyo IT Center #2
Member of the national WFM consortium: Korea Physical AI program (≈$25M scope, MSIT/IITP-supervised, 2026-)
Our bet — 2026 taxonomy

We bet on the WAM × dual-system line.

VLA
semantic-heavy
WM
world-heavy
WAM
★ WoRV position

(Taxonomy reused from Ch7 slide 70 — we build a Korean stack on the same line.)

Ch 8 — WoRV 73 / 77

What we do — three stacks

Data → Models → Deployment. End-to-end stack from teleop to B2B install.

PILLAR 01

Data Infrastructure teleop rigs · video pipelines · synthetic factory

CANVAS data pipeline: 48 hours and 219 kilometers of human-annotated navigation data collected via own teleop rigs
219 km
CANVAS nav data
48 h
human-annotated

ALOHA / UMI / egocentric video pipelines + Cosmos-Transfer synth factory.

CANVAS · HuggingFace maum-ai/CANVAS-S · maum-ai/COMMAND

PILLAR 02

Robotics Foundation Models in-house VLA · WAM · dual-system

VLA
backbone
WAM
action world model
S1+S2
dual system

Ch6 / Ch7 lines implemented in-house — our WAM × dual-system bet.

internal · targeted public release in 2026 H2

PILLAR 03

Tailored B2B from eval to deploy

-$27
CANVAS / run
-$35
LiDAR+GPS / run

CostNav measures navigation in real economic cost. I'm last author.

arXiv:2511.20216 · worv-ai.github.io/CostNav

Public HuggingFace orgs: maum-ai/CANVAS · maum-ai/CostNav · open-world-agents/D2E (next slide).
Ch 8 — WoRV 74 / 77

Selected research — recent & ongoing collabs

Three projects we're (co-)leading right now — data, world models, eval infra.

D2E

ICLR 2026 · arXiv:2510.05684
MAUM.AI × Stanford × SNU
D2E (Desktop-to-Embodied): Generalist Inverse Dynamics Model trained on desktop game data, then transferred to robotics including manipulation and navigation

Generalist Inverse Dynamics Model — trained on desktop game data (Brotato, Minecraft, BF6, …) → transfers to real-world manipulation (Meta-World, LIBERO) and navigation (CANVAS).

github.com/worv-ai/D2E · HF: open-world-agents/Generalist-IDM-1B

WorldCam

2026 · cvlab-kaist.github.io/WorldCam
Adobe × KAIST × MAUM.AI
WorldCam architecture: progressive autoregressive video transformer conditioned on camera poses with memory mechanisms; ingests gameplay frames + camera trajectories

Camera-controllable progressive AR video transformer — 3,000 minutes of gameplay frames with camera trajectories. The "WM as controllable simulator" research line.

github.com/cvlab-kaist/WorldCam

VLA-Eval-Harness

2026 · allenai/vla-evaluation-harness
AI2 × SNU × MAUM.AI
VLA-Evaluation-Harness: 47× speedup on LIBERO tasks via batch parallel evaluation, decoupling models from environments through standardized abstraction

Unified VLA eval across 18 benchmarks × 13 model servers. 47× faster LIBERO via batched parallel eval. Decouples model from environment.

github.com/allenai/vla-evaluation-harness · v0.2.0

collab footprint
D2E (SNU + Stanford + MAUM) · WorldCam (Adobe + KAIST + MAUM) · VLA-Eval-Harness (AI2 + SNU + MAUM). Always hiring — see next 2 slides.
Ch 8 — WoRV 75 / 77

Deployment — 5 verticals

Specific customers are confidential — verticals only

01

Agriculture

02

Construction

03

Maritime

04

Defense

05

Logistics &
Manufacturing

The model runs in 5 real industries — that fact alone is the point. Specific partners and impact figures stay verbal — confidentiality policy.
Ch 8 — WoRV 76 / 77

We are hiring

— people to solve this with us
Research · 2 positions
01 Research Scientist / Fellow: VLA · WM · WAM
02 Research Intern (World Model): min 2-month commitment
Engineering · 4 positions
03 VLA Engineer: training · post-training
04 AI SW Engineer: production stack
05 Robotics SW Engineer: on-robot integration
06 Simulation Engineer: Isaac · Cosmos · WM rollouts
Business & PM · 2 positions + talent pool
07 Project Manager: B2B vertical leads
08 Technical Sales & Strategy: vertical GTM
09 Talent Pool · rolling recruitment: always-open registration
recruitment.worv.maum.ai
worv_hr@maum.ai
KR military alt-service: Research track · KR military alt-service: Industrial track · @ Pangyo IT Center #2
Q & A 77 / 77

Thank you.

Q & A
15 minutes
Contact
EMAIL sung@maum.ai
WEB alohays.github.io
TEAM WoRV @ maum.ai (Pangyo)
HIRING recruitment.worv.maum.ai
Further reading
"World Model for Robot Learning: A Comprehensive Survey"
Hou et al. · NTU MARS + Abbeel + Malik + Wu + Du · 2026-04
arXiv:2605.00080
Bookmark — for everything we didn't have time to cover.
Deck: alohays.github.io/talks/hufs-2026-rfm
My page (Yunsung Lee personal page): alohays.github.io
Hiring: recruitment.worv.maum.ai