HUFS LAI · 2026-05-27 Week 13 · Advanced 1 — Multimodal AI

Robotics Foundation Models

Trajectories and open problems

Yunsung Lee · Head of Research · WoRV @ maum.ai

Intro 02 / 77

About Me

01 · Academic

Korea Univ MS · CMU Visiting

Multimodal learning · vision · diffusion. 10+ pubs at CVPR · NeurIPS · ICLR · ECCV. Visiting scholar at CMU LTI (2020).

02 · Industry journey

ScatterLab → Riiid → Wrtn → maum.ai

Iruda (LLM dialogue) · Santa TOEIC (vision tutor) · wrtn + crack (multimodal agents) · now WoRV (robotics FM).

03 · Now

Head of Research · WoRV @ maum.ai

World model for Robotics and Vehicle control

Korea-based RFM team — full stack: data · policy · world models. Returns in Ch8.

Intro 03 / 77

Today 8 chapters

CH 1

Robotics Foundation Models?

What "foundation" means in robotics

CH 2

A New Modality: Action

How do we tokenize action?

CH 3

Data Scaling

How do we collect and grow action data?

CH 4

VLAs — Vision · Language · Action

How does a VLM become a VLA?

CH 5

Diffusion Policy + Dual System

Speed and reasoning, at the same time

CH 6

World Models

Can a model learn the world's dynamics?

CH 7

World Action Models

Does "knowing" yield zero-shot policies?

CH 8

WoRV @ maum.ai

How we're working on this from Korea

Chapter 1 — Robotics Foundation Models? 04 / 77

01

Robotics Foundation Models?

What does "foundation" actually mean in robotics?

Ch 1 — RFM? 05 / 77

Why robotics AI — now

four forces converging · first time at once

Robotics has been "about to break out" for decades. The honest case for this wave is that four enablers are arriving in parallel.

01

FROM NLP / VISION SPILLOVER

Foundation models are ready

PaliGemma, DINOv3, SigLIP, CLIP — pretrained backbones now exist that can be fine-tuned for embodied tasks instead of trained from scratch.

Examples: RT-2 ← PaLI-X · OpenVLA ← Llama-2 · π0 ← PaliGemma

Caveat: action grounding still doesn't inherit cleanly

02

REAL-WORLD ROBOT DATA

Data infrastructure exists

Open hardware (ALOHA, UMI) + open datasets (OpenX 1M+ episodes, DROID 76k) + synthetic pipelines (Cosmos-Transfer) — the field finally has a data stack.

Datasets: OpenX-Embodiment · DROID · BridgeData V2 · RoboCasa

Caveat: still ~10⁶× short of LLM token counts

03

REAL HARDWARE AT REAL VOLUME

Humanoid + arm platforms shipped

The 2024-26 wave of physical platforms gives the field a shared body to train on — Figure F.03, Unitree G1/H1, 1X NEO, Tesla Optimus, Rainbow Robotics RB-Y1.

Platforms: Figure · 1X · Unitree · Tesla · Apptronik · Rainbow

Caveat: cost & dexterity still bottleneck deployment

04

CAPITAL + TALENT REALLOCATION

Industry pull is real

$5B+ raised across Physical Intelligence, Skild, Figure, Wayve in 2024-25. National programs in KR, China, US, EU. Top NLP researchers explicitly pivoting.

Signal: Physical Intelligence ($400M) · Skild ($300M) · Figure ($675M)

Caveat: signal-to-noise still messy; many bets won't survive

still unsolved

data wall · generalization across tasks & environments · embodiment-portability — that's why we spend the next 70 min on the field's 2024-26 attempts.

Ch 1 — RFM? 06 / 77

What "foundation" means in NLP / Vision

Bommasani et al. · Stanford CRFM · 2021

One model → many downstream tasks

Large-scale self-supervised pretraining
Downstream tasks need only light fine-tuning / zero-shot
Capability emergence backed by scaling laws

Already proven in NLP / Vision — next slide: what breaks when you copy this to robotics?

Ch 1 — RFM? 07 / 77

What "foundation" means for robots

Input: multimodal observation (vision + proprioception + language)

Output: physical action — both must generalize

Ch 1 — RFM? 08 / 77

Three axes of generalization

When we say a robot "generalizes," we always mean along one of these three axes.

AXIS 1

Task

new verb · same room

New verbs in a known environment. "fold the towel" vs "roll the towel."

flagship — RT-2 (semantic reasoning)

AXIS 2

Environment

same verb · new room

Same task in unseen rooms, lighting, objects. π0.5 hits 94% follow-rate in new homes.

flagship — π0.5 (OOD home)

AXIS 3

Embodiment

same brain · new body

Brand-new hardware with little or no new data. DreamZero adapts to a YAM arm with 30 min of play.

flagship — DreamZero · GR00T N1

Most recent papers bet on one of these three axes. No model generalizes across all three yet — that's the destination of "robotics foundation."

Ch 1 — RFM? 09 / 77

Why "scaling" alone isn't obvious here

Data scale comparison · log axis

"We don't have an internet of actions."

Unlike internet text, action data doesn't grow on its own. RT-1 ~130k episodes vs LLM ~trillions of tokens — a 5-6 order-of-magnitude gap.

So we need data tricks to scale — Ch3 is entirely about this problem.
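The size of the gap can be checked with a back-of-the-envelope sketch. The episode count and the LLM token scale come from these slides; the steps-per-episode figure is an illustrative assumption, and the 11 tokens per control step borrows RT-1's discretization from Ch2. Different assumptions move the answer, so read it as an order-of-magnitude estimate only.

```python
import math

# From the slides: RT-1 has ~130k episodes; LLM corpora are on the order of 10^13 tokens.
RT1_EPISODES = 130_000
LLM_TOKENS = 1e13

# Assumptions for illustration only (not from the slides):
STEPS_PER_EPISODE = 50   # hypothetical average episode length in control steps
TOKENS_PER_STEP = 11     # RT-1 emits 11 action tokens per control step (see Ch2)

def orders_of_magnitude_gap(episodes, steps_per_episode, tokens_per_step, llm_tokens):
    """Return log10(llm_tokens / robot_action_tokens)."""
    robot_tokens = episodes * steps_per_episode * tokens_per_step
    return math.log10(llm_tokens / robot_tokens)

gap = orders_of_magnitude_gap(RT1_EPISODES, STEPS_PER_EPISODE, TOKENS_PER_STEP, LLM_TOKENS)
print(f"gap is about {gap:.1f} orders of magnitude")
```

Under these assumptions the result lands inside the 5-6 order range quoted above; longer episodes or a smaller LLM corpus would shrink it.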

Ch 1 — RFM? 10 / 77

The 2022-2026 unlock

2022

RT-1

130k episodes,
discretized actions

2023

RT-2

web knowledge
→ actions

2024

OpenVLA

7B open,
OpenX 970k

2024-10

π0

flow-matching
action expert

2025-02

Helix

S2 7B + S1 80M
@ 200 Hz

2025-03

GR00T N1

open humanoid
stack

2025-04

π0.5

unseen home
94% follow

2026-02

DreamZero

joint video +
action generation

Three years on one page — action tokenization → web knowledge transfer → open stack → dual system → OOD generalization → joint video × action generation.

Ch 1 — RFM? 11 / 77

The core thesis — Jim Fan's roadmap, this talk's evidence

aligned with Sequoia AI Ascent · May 2026

Jim Fan · NVIDIA · Sequoia AI Ascent 2026 · "Robotics: Endgame"

Fan's framework: "The Great Parallel" — robotics is replaying the GPT trajectory (pretrain → align → reason → auto-research). The three concrete unlocks he names for the pretrain stage are exactly the spine of this talk.

Jim Fan · May 2026 · paraphrased "Robotics' Endgame is on the GPT-parallel roadmap. Three unlocks get us through the pretrain stage."

His 3 unlocks — this talk's spine

① Sensorized human data (egocentric video · UMI) → Ch 3 · data scaling

② Neural simulators (DreamDojo · NVIDIA) → Ch 6 · world models

③ Video-first WAMs (DreamZero · vision/action as first-class) → Ch 7 · world action models

+ 2 layers we add

+ S1/S2 dual-system convergence (Helix · π0.5 · GR00T) → Ch 5 · diffusion + dual

+ Korea-built RFM stack (data, model, deploy, all in-house) → Ch 8 · WoRV @ maum.ai

Ch 1 — RFM? 12 / 77

Roadmap — 7 questions, 7 chapters

CH 2 Action modality How do we tokenize action?

CH 3 Data scaling How do we collect and grow that action data?

CH 4 VLAs How does a VLM become a VLA?

CH 5 Diffusion + Dual Speed and reasoning, simultaneously

CH 6 World Models Can a model learn the world's dynamics?

CH 7 World Action Models Does "knowing" turn into zero-shot policies?

CH 8 WoRV @ maum.ai How we're working on this from Korea — and who to talk to about joining

Further reading — for broader coverage of the field, see NTU MARS et al. arXiv:2605.00080 (2026-04).

Chapter 2 — A New Modality: Action 13 / 77

02

A New Modality: Action

Text, image, audio — you already know how to tokenize. What about action?

Ch 2 — Action Modality 14 / 77

Recap — the modalities you already know

Week 10 deck · Multimodal LLM fusion patterns

Modality	Canonical tokenizer	Example model
Text	BPE / SentencePiece	GPT, Llama
Image	ViT patches / VQ-VAE	CLIP, LLaVA, Chameleon
Audio	Mel-spec / EnCodec	Whisper, AudioLM
Video	Tubelets / latent frames	VideoMAE, Sora
ACTION	— today's question —	RT-1, π0, DreamZero

We are adding one more row to a table the Week 10 deck already filled in.

Original figure · CLIP · Radford et al. 2021

CLIP — contrastive pre-training + zero-shot prediction (Radford et al. 2021 Fig 1)

Image ↔ text tokens trained with contrastive loss — the canonical proof that any modality fits the "tokenize & predict" recipe.

Ch 2 — Action Modality 15 / 77

What is an "action" in a robot?

One control step · 7-DoF arm + 1 gripper

Franka Emika Panda 7-DoF arm with Joint 0 through Joint 6 labeled on each axis

A continuous vector — but embodiment-bound

Joint angles (Franka) vs end-effector pose (RT-1) vs motor torques (Atlas)
7-DoF single arm → 14-DoF bimanual ALOHA → 35-DoF Helix upper body
Grippers add a 1-D switch; dexterous hands add 20+ extra DoF
The "action space" is not portable — every robot speaks its own language

Looks like a simple continuous vector. Hides embodiment dependence — the bill we pay later.

Ch 2 — Action Modality 16 / 77

Why action is harder than text

1 · DISCRETENESS

Text: discrete by birth.
Action: continuous.

Cross-entropy "just works" on 50k tokens. Joint angles live in ℝⁿ; we must invent the alphabet.

2 · DETERMINISM

Text: next-token softmax.
Action: real physics.

The same command on the same scene yields a different outcome — mass, friction, contact, noise.

3 · DATA SOURCE

Text: scrape the internet.
Action: teleop, one hour at a time.

10¹³ tokens vs 10⁶ teleop episodes — no internet for tying shoelaces.

4 · REVERSIBILITY

Text: regenerate the answer.
Action: drop the cup, it breaks.

Mistakes have physical cost — safety, supervision, evaluation all become first-class problems.

Clip · Chelsea Finn @ YC AI Startup School

"No internet for robot actions" · 02:08–03:32

The whole rest of the talk is a response to cell #3. Ch3 fixes data; Ch5 fixes speed; Ch6–7 fix evaluation.

Ch 2 — Action Modality 17 / 77

Two tokenization strategies

STRATEGY A

Bin discretization

Slice each continuous dim into 256 bins, treat each bin as a vocab token, predict with cross-entropy — same loop as an LLM.

Reuses LLM loss, optimizer, decoder verbatim
Quantization error; per-dim factorization

flagship — RT-1, RT-2, OpenVLA
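Strategy A can be sketched in a few lines of NumPy. This is a minimal illustration, not any model's actual tokenizer: the per-dimension limits of [-1, 1] and the sample action are made up, and uniform bins are only one possible binning scheme.

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as on the slide

def discretize(action, low, high, n_bins=N_BINS):
    """Map each continuous dim to an integer bin id in [0, n_bins - 1]."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)   # rescale to [0, 1]
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)

def undiscretize(bins, low, high, n_bins=N_BINS):
    """Map bin ids back to the centre of each bin."""
    return low + (bins + 0.5) / n_bins * (high - low)

# Hypothetical 7-D arm action with per-dim limits of [-1, 1].
low, high = -np.ones(7), np.ones(7)
action = np.array([0.10, -0.52, 0.33, 0.0, 0.99, -1.0, 1.0])
tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
max_error = np.abs(recovered - action).max()
```

The round trip loses at most half a bin width per dimension, which is the quantization error listed above; each dimension is also tokenized independently, which is the per-dim factorization.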

STRATEGY B

Continuous head

Keep the action continuous; bolt a separate diffusion / flow head on top of the VLM that denoises a whole action chunk at once.

Captures multimodal action distributions (no mode collapse)
Slower per call — but predicts a whole chunk at once

flagship — Diffusion Policy, ACT, π0

This is the #1 design choice behind every modern VLA. The next three slides take one example of each — and a 2025 hybrid that re-discretizes.

Ch 2 — Action Modality 18 / 77

Discrete — RT-1's action vocabulary

11 dims × 256 bins per dim

Clip · RT-1 supplementary video

EfficientNet → tokens → action tokens · 02:07–05:17

11 dims = 7 arm (x, y, z, roll, pitch, yaw, gripper opening) + 3 base (x, y, yaw) + 1 mode switch (arm / base / terminate)
256 bins per dim → 11 tokens emitted per control step
Loss = next-token cross-entropy — identical to an LLM
Opens the door to VLM → VLA (Ch4)

Ch 2 — Action Modality 19 / 77

Continuous — action chunks

Predict k future actions in one shot

ACT: bimanual CVAE-Transformer trained with an L1 reconstruction loss over the action chunk
Diffusion Policy — same DDPM you saw in Week 12, but the target is an action sequence, not pixels
Chunking absorbs human teleop noise + handles multimodal action distributions

Clip · Russ Tedrake @ Princeton Robotics Seminar

"Diffusion policy = sequence of future actions" · 20:02–21:05

DDPM → policy. Same noise schedule, same ε_θ objective. Only the target tensor changes — from pixels to an action chunk.
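The "only the target tensor changes" point can be made concrete with a minimal NumPy sketch of the ε-prediction objective applied to an action chunk. Everything here is illustrative: the step count, the linear noise schedule, the chunk shape, and `zero_eps_model` (a hypothetical stand-in for a trained ε_θ) are assumptions, not the Diffusion Policy implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 100                                 # number of diffusion steps (illustrative)
betas = np.linspace(1e-4, 0.02, K)      # linear noise schedule (illustrative values)
alpha_bars = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta)

def noisy_chunk(a0, k, eps):
    """Forward process q(a_k | a_0): scale the clean chunk and add Gaussian noise."""
    return np.sqrt(alpha_bars[k]) * a0 + np.sqrt(1.0 - alpha_bars[k]) * eps

def ddpm_loss(eps_model, a0, obs):
    """One Monte Carlo sample of the epsilon-prediction MSE for an action chunk."""
    k = rng.integers(0, K)
    eps = rng.standard_normal(a0.shape)
    a_k = noisy_chunk(a0, k, eps)
    return np.mean((eps_model(obs, a_k, k) - eps) ** 2)

# Hypothetical stand-in for eps_theta(O, A, k); a real policy would be a CNN or Transformer.
def zero_eps_model(obs, a_k, k):
    return np.zeros_like(a_k)

a0 = rng.uniform(-1, 1, size=(16, 7))   # 16-step chunk of 7-D actions (made-up data)
obs = rng.standard_normal(32)           # placeholder observation features
loss = ddpm_loss(zero_eps_model, a0, obs)
```

Swapping `a0` for an image tensor would give the image-generation objective unchanged; the observation enters only as conditioning for the noise predictor.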

Ch 2 — Action Modality 20 / 77

FAST — a compression-based tokenizer

Pertsch et al. Jan 2025 · Fig 2

FAST tokenization method (paper Fig 2) — 5-step pipeline: normalized action chunk → DCT → quantize → sparse matrix → flatten → BPE-compressed tokens

(1) chunk → (2) DCT → (3) quantize → (4) flatten low-freq first → (5) BPE compress

Quantize & Drop: DCT coefficients are scaled and rounded. High-frequency components (noise) collapse to zero and are omitted (dropped) prior to BPE encoding, yielding 5× compression.

Continuous → back to discrete, in a smarter basis.

DCT = Discrete Cosine Transform (same idea as JPEG image compression). Energy concentrates in the low-frequency components — the high-frequency tail is mostly noise.
BPE = Byte-Pair Encoding (same tokenizer family as GPT). Greedy merge of frequent integer pairs → compact vocabulary.
Result: ~5× shorter action sequences than RT-1-style 256-bin discretization → faster train + faster inference, same accuracy.

Big picture: FAST is Strategy C — the 2025 hybrid that quietly returns to discrete tokens, but in a basis where each token carries real information about the trajectory.
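A rough sketch of steps (1) to (4), under stated assumptions: this is not the released FAST tokenizer, the `scale` value and the test trajectory are made up, the DCT is written out by hand to stay dependency-free, and the final BPE step is omitted.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # time index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    c[0, :] /= np.sqrt(2.0)
    return c

def encode(chunk, scale=10.0):
    """(T, D) normalized action chunk -> flat int array, lowest frequencies first."""
    coeffs = dct_matrix(chunk.shape[0]) @ chunk   # row k holds frequency k for every dim
    quantized = np.round(coeffs * scale).astype(np.int64)
    return quantized.reshape(-1)                  # row-major: freq 0 of all dims, then freq 1, ...

def decode(tokens, n_steps, n_dims, scale=10.0):
    """Invert encode() up to quantization error."""
    coeffs = tokens.reshape(n_steps, n_dims) / scale
    return dct_matrix(n_steps).T @ coeffs

# Made-up smooth 50-step, 2-D trajectory.
t = np.linspace(0.0, 1.0, 50)
chunk = np.stack([np.sin(2 * np.pi * t), t ** 2], axis=1)
tokens = encode(chunk)
zero_fraction = np.mean(tokens == 0)
recon_error = np.abs(decode(tokens, 50, 2) - chunk).max()
```

For a smooth trajectory most high-frequency coefficients round to zero, so the flattened sequence ends in a long run of zeros. That redundancy is what a BPE pass would then compress.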

Ch 2 — Action Modality 21 / 77

Takeaway — action is the new token

Same table as slide 14 — with ACTION filled in

Modality	Canonical tokenizer	Example model
Text	BPE / SentencePiece	GPT, Llama
Image	ViT patches / VQ-VAE	CLIP, LLaVA
Audio	Mel-spec / EnCodec	Whisper
Video	Tubelets / latent frames	Sora
ACTION	(A) Bin discretization · 256/dim	RT-1, RT-2, OpenVLA
	(B) Continuous chunk + diffusion / flow	ACT, Diffusion Policy, π0
	(C) FAST · DCT → BPE re-discretize	π0-FAST

Action is now a modality that LM, diffusion, and flow can all consume. Next chapter — where do we get the data?

7 action-head architectures · Survey Fig 3

Survey Fig 3 — 7 sensorimotor architecture types for VLA (Transformer/VLM × Discrete/Diffusion/Flow)

Backbone (Transformer / VLM) × Action head (Discrete token / Diffusion / Flow / DiT) — 7 known combinations

Chapter 3 — Data Scaling 22 / 77

03

Data Scaling.

If actions don't grow on the internet, how do we scale them?

Ch 3 — Data Scaling 23 / 77

The action data wall

Robot-learning datasets · log scale

5-6 orders of magnitude short.

Every jump on this chart = a new collection method
Lab teleop < cross-lab union < in-the-wild < human video < sim
The rest of the chapter = four strategies to close the gap

Plus OpenX-Embodiment as the union of nearly everyone's lab data — see slide 29.

Ch 3 — Data Scaling 24 / 77

Strategy 1 — Better hardware (ALOHA family)

cheaper teleop → more data

ALOHA bimanual teleop hero — 6 dexterous tasks back-to-back · Zhao et al. 2023

ALOHA

2023 · Stanford

Open-source bimanual leader-follower rig, under $20k. Paired with ACT (Action Chunking Transformer).

Mobile ALOHA

2024 · Stanford

Wheeled base + ALOHA arms → whole-home tasks (cooking, laundry, elevator).

ALOHA 2 · ALOHA Unleashed

2024 · DeepMind

v2 hardware + sim · Unleashed = diffusion + lots of teleop → shoelace tying, gear insertion.

price ↓ = data ↑ · cheap rigs let any lab contribute.

Ch 3 — Data Scaling 25 / 77

Strategy 2 — Handheld in-the-wild (UMI → Sunday)

no robot needed at collection time

UMI

Chi et al. · RSS 2024

arXiv:2402.10329

GoPro + handheld gripper → robot-compatible data collected anywhere (home, restaurant, outdoors). Cuts the per-episode cost by an order of magnitude.

Sunday Glove · Memo

Sunday Robotics · Nov 2025

launch

Same UMI team productized it: a $200 wearable glove + Memo humanoid. 2,000+ gloves shipped, 10M household-chore episodes from 500 homes.

Narrative bridge: Tony Zhao (ALOHA · Mobile ALOHA) + Cheng Chi (UMI) → spun out as Sunday. Same people, lower friction at every step: $20k rig → $400 GoPro rig → $200 glove.

Ch 3 — Data Scaling 26 / 77

Strategy 3 — Egocentric video (Ego4D · Ego-Exo4D)

massive · unlabeled · first-person

Ego4D

Grauman et al. · CVPR 2022

arXiv:2110.07058

3,670 hours of first-person video from 923 participants across 74 cities, 9 countries. The biggest single ego-video pool.

Ego-Exo4D

Grauman et al. · CVPR 2024

arXiv:2311.18259

Paired ego + exo third-person views of the same activity — supplies the alignment needed for body / hand transfer.

Human video has no action labels — but it is the only pool that already exists at scale. The next slide (EgoMimic / EgoVLA / HumanPlus) is about how we turn pixels into policy. Forward-ref → Ch7 LAPA recovers latent actions directly from these frames.

Ch 3 — Data Scaling 27 / 77

Strategy 3+ — Video → robot

embodiment bridge, not just supervision

EgoMimic

Georgia Tech · ICRA 2025

Aria glasses + bimanual robot, co-trained on paired human + robot data.

arXiv:2410.24221

EgoVLA

NVIDIA + UCSD · 2025

Pretrain a VLA on human video, fine-tune on a tiny robot set.

arXiv:2507.12440

HumanPlus

Stanford · CoRL 2024

Unitree H1 shadows humans from 3rd-person video. Boxing, piano, table tennis.

arXiv:2406.10454

Not just more supervision — an embodiment bridge. Human pixels transferred to robot policy, then to humanoid bodies.

Ch 3 — Data Scaling 28 / 77

Strategy 4 — Synthetic (sim + world-model dreams)

no humans in the loop

NVIDIA "From Dreams to Reality" — DreamGen / GR00T-Dreams synthetic trajectories

Isaac Lab

GPU-accelerated physics sim. Thousands of parallel envs, randomized scenes / lighting / textures.

RoboCasa

100 kitchen tasks with 100k+ generated trajectories in procedurally generated kitchens: appliances, layouts, AI-generated textures.

DreamGen · GR00T-Dreams

arXiv:2505.12705

World model generates trajectories: GR00T N1.5's synthetic training data took about 36 hours to generate, versus roughly 3 months of manual human data collection.

Forward-ref → Ch6 World Models · Ch7 WAMs reinterpret generation itself as the data factory.

Ch 3 — Data Scaling 29 / 77

OpenX-Embodiment — the union

22 embodiments · 60 datasets from 34 labs · 1M+ episodes

OpenX 22-embodiment task montage: same skill, different bodies · Open X-Embodiment Collaboration 2023

"Let's pool what we have."

Not a new collection method but a convention: it aligns 60 existing datasets from 34 labs into one format so they're trainable together.

embodiments

22

episodes

1M+

institutions

21

skills

527

Spawned the dataset wave around it: DROID (76k traj, OpenVLA's training pool), BridgeData V2, ALOHA Unleashed, RoboCasa. Same field-wide pressure to grow the pool.

Ch 3 — Data Scaling 30 / 77

Recap — data > model

almost every SOTA jump came from a new data source

Almost every SOTA jump
came from new data,
not a bigger model.

RT-1 → RT-2 = web data (not bigger ViT)
OpenVLA = OpenX 970k (not new architecture)
π0.5 OOD homes = co-training with 22k web episodes
GR00T N1.5 = DreamGen synthetic (3 mo → 36 hr)

Next chapter — how do we actually pour this data into a model?

Chapter 4 — VLAs: Vision-Language-Action Models 31 / 77

04

VLAs.

If a VLM outputs text, what does it take to make it output an action?

Ch 4 — VLAs 32 / 77

Definition — swap the head, get a VLA

Same vision encoder. Same LM backbone. Different decoder.

A · Vision-Language Model (VLM)

LLaVA, PaLI, PaliGemma, Chameleon: the image row in the Week 10 modality table.

B · Vision-Language-Action Model (VLA)

RT-2, OpenVLA, π0, GR00T — the only structural change is the right-hand block.

Ch 4 — VLAs 33 / 77

RT-2 — web knowledge → robot actions

The paper that made co-fine-tuning on web data + robot data the default.

Official project-page video · robotics-transformer2.github.io

Robot picks the "extinct animal" toy · never trained on the word "dinosaur" (rt2_teaser.mp4)

Backbone

PaLI-X / PaLM-E

PaLI-X 5B · 55B and PaLM-E 12B variants; web pre-training is preserved by co-fine-tuning rather than freezing.

Action head

Actions as text tokens

256-bin discretization, integers in the LM's own vocab.

Trick

Co-fine-tune

VQA + caption + robot trajectories in one batch → emergent semantic generalization.

Result: the robot can act on concepts it never saw in robot data — "pick the extinct animal" works because the LM knew the word and the action head spoke the same token language.
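The "same token language" idea can be sketched as a plain string round trip: bin ids are written out as integers that the LM emits like any other text, then parsed back before de-quantization. The function names and the example bin ids are hypothetical, and how the integers map onto actual vocabulary entries depends on the backbone's tokenizer, which is not modelled here.

```python
def action_bins_to_text(bins):
    """Render integer bin ids as a space-separated string an LM can emit as ordinary tokens."""
    return " ".join(str(int(b)) for b in bins)

def text_to_action_bins(text, n_dims, n_bins=256):
    """Parse the LM's output string back into bin ids, validating count and range."""
    bins = [int(tok) for tok in text.split()]
    if len(bins) != n_dims:
        raise ValueError(f"expected {n_dims} values, got {len(bins)}")
    if any(b < 0 or b >= n_bins for b in bins):
        raise ValueError("bin id out of range")
    return bins

# Illustrative 8-D action: made-up bin ids, one per action dimension.
target = action_bins_to_text([1, 128, 91, 241, 5, 101, 127, 217])
parsed = text_to_action_bins(target, n_dims=8)
```

During co-fine-tuning such a string would simply be the answer text for a robot example, sitting in the same batch as VQA answers and captions.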

Ch 4 — VLAs 34 / 77

OpenVLA — opening the door

The moment academia could reproduce a SOTA VLA on its own GPUs.

Project site · openvla.github.io

7B params · LoRA fine-tunes on a single 24 GB GPU · multi-embodiment generalist

7B

parameters · open weights

970k

OpenX episodes

Recipe

SigLIP + DINOv2 → Llama‑2 7B → bin tokens

Same head style as RT-2 (256-bin discretization), but every part is open and swap-in/swap-out.

← Bridge back to Ch3 slide 29

OpenVLA is the first direct payoff of the OpenX union. 970k episodes from 22 embodiments — the dataset Ch3 ended on — trained in one pass on one open model.

Ch 4 — VLAs 35 / 77

Generalist policies — three takes on the same year

Same problem, different action head: diffusion vs. block-wise transformer vs. dual-system.

RDT-1B

Oct 2024

Diffusion transformer over bimanual ALOHA. 1B params. Multi-modal action distribution → cleaner mode separation than bin tokens.

Tsinghua · arXiv:2410.07864

Octo

May 2024

Block-wise transformer + diffusion action head. Trained on 800k OpenX trajectories — runs on 9 embodiments out of the box.

Berkeley · arXiv:2405.12213

GR00T N1

Mar 2025

Dual system: Eagle VLM (S2) + DiT action head (S1). First open humanoid foundation model — data pyramid web → human ego → teleop.

NVIDIA GEAR · arXiv:2503.14734

Three labs within ten months of each other (May 2024 to Mar 2025), each picking a different action head: VLA is not a single recipe. Each design choice (diffusion / block transformer / dual-system) becomes a whole chapter of follow-up work (Ch5).

Ch 4 — VLAs 36 / 77

Open vs closed VLA landscape

2026 H1 · 22 models

Open weights

Closed / API-only

Academic

★ Open + Academic — "your semester project" 4

OpenVLA '24 Stanford
RDT-1B '24 Tsinghua
Octo '24 Berkeley
LAPA '24 KAIST

Weights + code + dataset recipes · LoRA-finetune on one GPU.

Closed + Academic — rare ~0

Empty cell — if it comes from a university, the incentive is to release.

Industry / product

Open + Industry — "open core" is winning 10

π0 '24 PI · openpi
π0-FAST '25 PI
π0.5 '25 PI · Sep PyTorch
GR00T N1 '25 NVIDIA
GR00T N1.5 '25 NVIDIA
GR00T N1.6 '26 NVIDIA
GR00T N1.7 '26 NVIDIA
DreamZero '26 NVIDIA
InternVLA-A1 '26 Shanghai AI Lab
LingBot-VLA '26 Ant Group

2025-26 shift — openpi (Feb '25) + π0.5 PyTorch (Sep '25) + GR00T Isaac repo (N1→N1.7) + DreamZero, InternVLA-A1, LingBot all open in 2026 H1. Industry releasing weights for ecosystem leverage.

Closed + Industry — product moat / tech report only 8

RT-1 '22 Google
RT-2 '23 Google
Helix '25 Figure
Gemini Rob 1.5 '25 DeepMind
π0.6 '25-11 PI · report only
π0.7 '26-04 PI · report only
GEN-1 '26 Generalist AI
GENE-26.5 '26 Genesis AI

Paper or blog + demo videos: weights not released even when the tech report is public (π0.6/0.7). Vertically integrated bet: Helix, GENE-26.5.

2026 shift

Open column went from 4 → 14 models in 18 months. Closed bet now splits into vertically integrated (Helix, GENE-26.5) vs tech-report-only (π0.6, π0.7).

Ch 4 — VLAs 37 / 77

What VLAs are still bad at

VLM limits + robotics-native limits = the four open problems of 2026.

FAIL 01 Precise contact ✕

Threading a needle, plugging a USB, inserting a key —
sub-millimeter force-modulated contact is where smooth VLA rollouts fall off a cliff.

VLM tokens have no haptic channel ·
tactile sensing not yet in the input pipe.

FAIL 02 Long horizon ✕

"Make breakfast" — 30 sub-tasks, recovery from a dropped egg, no global plan to fall back on.
VLAs drift after ~30s of autonomous rollout.

No explicit planner · error compounds
across token-by-token rollouts.

FAIL 03 Novel embodiment ✕

Trained on Franka + ALOHA → deployed on a new arm with different DoF, gripper, joint limits.
Zero-shot collapse is near-universal.

Action vocabulary is embodiment-specific ·
DreamZero needs 30 min of YAM plays to adapt.

FAIL 04 Speed ✕

RT-2 runs at ~1 Hz; OpenVLA ~5 Hz.
Reactive contact and dynamic motion need 30–200 Hz. Big VLM = smart but slow.

1 Hz vs 200 Hz · the smarter the model,
the slower it serves — Ch5's whole motivation.

Ch 4 — VLAs 38 / 77

Two responses — faster S1, smarter S2

The field splits along the Kahneman line. Both halves get their own chapter.

VLA today smart but slow · contact-fragile · embodiment-locked

↙ split along Kahneman line ↘

S1

Faster motor expert

Take the action head off the LM's critical path. A small, fast expert runs at 30–200 Hz; the VLM only steers it.

› Diffusion Policy · denoise an action chunk
› π0 flow-matching expert · 50 Hz
› Helix S1 · 80M params @ 200 Hz
› FAST tokens · shorter sequences, same head

→ CHAPTER 5

S2

Smarter high-level brain

Externalize reasoning. Give the VLM a world to roll forward, a plan to follow, a video to imagine.

› Embodied chain-of-thought · Gemini Robotics ER 1.5
› World models as simulators · rollout-as-reasoning
› World Action Models · joint video + action generation
› DreamZero · the canonical WAM

→ CHAPTERS 6 & 7

Ch 4 — VLAs 39 / 77

VLA timeline — 14 milestone models

Survey Fig 2 · bookmark for Ch5–7

Timeline of major VLA models — Kawaharazuka et al. Survey Fig 2

Y-axis: chronology · X-axis: architectural lineage (CNN → Transformer → VLM → Diffusion / DiT → Flow → Latent action → Hierarchical)

Read the chart left–right.

Each x-column is a family of decoder choices. Same vision input, completely different action head — that's why "VLA" isn't a single recipe.

Open-weight track — OpenVLA · Octo · RDT-1B · GR00T N1/N1.5 are your candidate list for hands-on work.
Backbone shrinks over time — PaLI-X 55B → Llama-2 7B → PaliGemma 3B → Eagle 2B. Smaller, faster, robot-tuned.
Right edge = today — Hierarchical (π0.5) + Latent action (LAPA) + DiT (GR00T) are the live frontier.

The two models in this talk that aren't yet on the survey chart: Helix (Figure AI, 2025-02 — closed) and DreamZero (NVIDIA GR00T N2, 2026 — covered in Ch7 as a WAM, not a VLA).

Chapter 5 — Diffusion + Dual System 40 / 77

05

Smart & Fast

DDPM you just learned, re-cast as a policy — and the dual-system pattern that became consensus in 2026.

Ch 5 — Diffusion + Dual 41 / 77

Diffusion as policy — original figures, RSS 2023

Chi et al. RSS 2023 · Fig 1

Diffusion Policy Fig 1 — Explicit / Implicit / Diffusion policy comparison

(a) Explicit: regression / GMM · (b) Implicit: energy-based · (c) Diffusion: learn the gradient field

Chi et al. RSS 2023 · Fig 2

Diffusion Policy Fig 2 — observation horizon → action chunk; CNN + Transformer variants

Observation chunk → ε_θ(O, A, k) denoises a T_p-step action chunk · CNN (FiLM) or Transformer

In one line

The same ε_θ that denoises image pixels now denoises an action chunk conditioned on observation history — sample from a multi-modal action distribution instead of regressing to the mean.
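At inference time the same ε_θ is run in reverse. The sketch below uses standard DDPM ancestral sampling on an action chunk; the step count, schedule, and σ_k² = β_k variance choice are illustrative assumptions. `oracle_eps` is a toy closed-form noise predictor for a dataset containing one chunk, standing in for a trained network so the loop can be checked end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50                                   # denoising steps (illustrative)
betas = np.linspace(1e-4, 0.02, K)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample_chunk(eps_model, obs, shape):
    """Ancestral DDPM sampling: start from Gaussian noise and denoise K times."""
    a = rng.standard_normal(shape)
    for k in range(K - 1, -1, -1):
        eps_hat = eps_model(obs, a, k)
        mean = (a - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps_hat) / np.sqrt(alphas[k])
        noise = rng.standard_normal(shape) if k > 0 else 0.0   # no noise on the final step
        a = mean + np.sqrt(betas[k]) * noise
    return a

# Toy oracle: if the data distribution is a single chunk, the exact noise is known in closed form.
target_chunk = np.linspace(-1.0, 1.0, 16 * 7).reshape(16, 7)

def oracle_eps(obs, a_k, k):
    return (a_k - np.sqrt(alpha_bars[k]) * target_chunk) / np.sqrt(1.0 - alpha_bars[k])

sampled = sample_chunk(oracle_eps, obs=None, shape=target_chunk.shape)
```

With a real multi-modal dataset, different noise seeds would land on different valid chunks; that is the behaviour a mean-regressing policy cannot reproduce.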

Ch 5 — Diffusion + Dual 42 / 77

Why "fast" matters — control frequency

FiS-VLA architecture · Chen et al. 2025 · Fig 2

FiS-VLA architecture (paper Fig 2) — System 2 (low-frequency planner) drives System 1 (high-frequency motor expert)

System 2 (low-freq plan, 1/n × step) drives System 1 (high-freq motor, every step)

Reported inference rate · log Hz

Model	Rate
RT-2 55B end-to-end	~1 Hz
OpenVLA 7B AR	~5 Hz
Diffusion Policy	~10 Hz
ACT chunked	50 Hz
π0 flow-matching	~50 Hz
GR00T N1 DiT expert	~50 Hz
FiS-VLA shared params	117.7 Hz
Helix S2 7B + S1 80M	200 Hz

Contact: 30-50 Hz · Force-control: 200 Hz+. Smart-single-loop models all fall below; red rows = dual-system.
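The two-rate pattern can be sketched as a single loop in which System 1 acts on every control step and System 2 refreshes its latent only every n-th step. This is a simplified synchronous illustration; deployed systems typically run the two models asynchronously. The planner and controller below are hypothetical placeholders, and the 8 Hz figure is just one value inside the 7-9 Hz range quoted for Helix's S2.

```python
def run_dual_system(n_steps, s2_every, slow_planner, fast_controller, observe):
    """Run S1 on every control step; refresh the S2 latent only every `s2_every` steps."""
    latent, actions, s2_calls = None, [], 0
    for step in range(n_steps):
        obs = observe(step)
        if step % s2_every == 0:                      # low-frequency System 2 update
            latent = slow_planner(obs)
            s2_calls += 1
        actions.append(fast_controller(obs, latent))  # high-frequency System 1 step
    return actions, s2_calls

# Hypothetical stand-ins so the loop runs; real systems would call a VLM and a motor policy.
observe = lambda step: float(step)
slow_planner = lambda obs: obs                        # "plan" = remember the observation last seen
fast_controller = lambda obs, latent: obs - latent    # act on the offset from the last plan

# 200 Hz S1 with S2 at 8 Hz: S2 refreshes every 25 control steps (200 / 8).
actions, s2_calls = run_dual_system(n_steps=200, s2_every=25,
                                    slow_planner=slow_planner,
                                    fast_controller=fast_controller,
                                    observe=observe)
```

One simulated second therefore costs 200 cheap S1 calls but only 8 expensive S2 calls, which is how a large VLM can sit behind a high-rate control loop.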

Ch 5 — Diffusion + Dual 43 / 77

Dual system — the Kahneman analogy

A 20-year-old cognitive-science partition, now reified in silicon — in robotics and in LLMs.

HUMAN · Kahneman 2002

S2

Prefrontal · slow

speed slow · deliberate

effort serial · effortful

nature reasoning · planning

e.g. solving 17 × 24

S1

Motor · fast

speed fast · automatic

effort parallel · effortless

nature reflex · skill

e.g. driving on an empty road

Two modes communicate through a shared intent signal — S2 sets goals, S1 executes.

ROBOT · Helix · π0 · GR00T

model

S2 · slow

S1 · fast

Helix

7B VLM 7–9 Hz

80M expert 200 Hz

π0 / π0.5

PaliGemma 3B

300M flow expert

GR00T N1

VLM planner

DiT diffusion

Origin · cognitive science

Thinking, Fast and Slow

Kahneman 2002 Nobel lecture & 2011 best-seller. The S1 / S2 partition that the whole field is now copying.

Same pattern · non-robot AI

LLM reasoning — "think then answer"

OpenAI o1 / DeepSeek-R1 / Claude Extended Thinking all separate a long S2 reasoning trace from a fast S1 answer.

arXiv:2412.16720 (o1) · arXiv:2501.12948 (R1)

Ch 5 — Diffusion + Dual 44 / 77

π0 & π0.5 — flow-matching action expert in unseen homes

π0 architecture · Black et al. 2024 · Fig 1

pi0 architecture: pi-dataset + OXE + internet pre-training → SigLIP 400M + Gemma 2.6B pre-trained VLM → 300M action expert → 14/18/7-DoF embodiments

VLM + action expert mixture-of-experts. Flow matching: a single learned velocity field v_θ(a_t, t, z) — one short ODE solve at inference vs T DDPM steps.
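The "one short ODE solve" can be sketched with forward Euler. The time direction (t = 0 is noise, t = 1 is the action chunk), the 10-step default, and the chunk shape are assumptions for illustration. `oracle_velocity` is a toy closed-form field for a single target chunk, standing in for a trained v_θ(a_t, t, z).

```python
import numpy as np

rng = np.random.default_rng(0)

def integrate_flow(velocity, a_init, context, n_steps=10):
    """Forward-Euler solve of da/dt = v(a, t, z) from t = 0 (noise) to t = 1 (action chunk)."""
    a, dt = a_init.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        a = a + dt * velocity(a, t, context)
    return a

# Toy oracle velocity: on a straight-line path from noise to one target chunk,
# the field pointing at the target is (target - a) / (1 - t).
target_chunk = np.linspace(-1.0, 1.0, 50 * 7).reshape(50, 7)

def oracle_velocity(a, t, context):
    return (target_chunk - a) / (1.0 - t)

noise = rng.standard_normal(target_chunk.shape)
chunk = integrate_flow(oracle_velocity, noise, context=None)
```

Each Euler step is one forward pass of the action expert, so the step count is the main knob trading inference speed against integration accuracy.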

π0.5 · never-seen homes

Fully autonomous Airbnb bedroom cleanup · 5× speed

94%

language-follow rate, unseen homes

100+

never-seen eval environments

10×

inference speedup over DP-style DDPM

Ch 5 — Diffusion + Dual 45 / 77

Helix: Figure AI's 7B over 80M, two robots, one set of weights

Two robots · full upper-body 35-DoF · shared weights

Helix grocery put-away — Figure 02 humanoids collaborate · Feb 2025

Frequency split · built from Helix blog text

single set of weights

Both robots run the same network — no role-specific finetuning, no per-robot models.

on-board inference

Both S1 and S2 run on an embedded GPU on the robot — no cloud.

closed-source

No paper, no weights — the architecture description is from the blog text. Figure 03 follow-up streamed live in May 2026.

Ch 5 — Diffusion + Dual 46 / 77

GR00T N1 → N1.5 → N2 — one company, one year, three generations

MAR 2025 · GEN 1

N1

first open humanoid stack

S2 VLM + S1 diffusion transformer — the GR00T template
Open weights — first downloadable generalist humanoid policy

arXiv:2503.14734

SEP 2025 · GEN 1.5

N1.5

synthetic data via GR00T-Dreams

Same dual-system arch, retrained on world-model-generated data
Novel-object generalization — pick objects never seen in teleop

research.nvidia.com/labs/gear/gr00t-n1_5

FEB 2026 · GEN 2 (DreamZero)

N2 a.k.a. DreamZero

video × action joint generation

Dual-system collapses into one — generation = policy
Cross-embodiment in 30 min of plays — we revisit in Ch 7

arXiv:2602.15922 · dreamzero0.github.io

One company, one year: open dual-system → world-model-generated data → joint video×action generation. Data-collection time fell from roughly 3 months of manual human demonstrations to 36 hours of synthetic generation.

Ch 5 — Diffusion + Dual 47 / 77

2026 frontier — Fast-in-Slow + Gemini Robotics 1.5

Fast-in-Slow

share parameters

FiS-VLA · Chen et al. NeurIPS 2025

No latent vector handoff — S1 reuses S2's intermediate features directly
Reaches 117.7 Hz on a single NVIDIA 4090 with chunk size 8

arXiv:2506.01953 · fast-in-slow.github.io
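A small arithmetic sketch of how chunking and control rate relate. It assumes the reported rate counts actions delivered per second, with each forward pass producing one chunk; that reading is an assumption here, not something these slides confirm. Function names are hypothetical.

```python
def effective_control_hz(chunk_size, forward_latency_s):
    """Actions delivered per second when each forward pass yields `chunk_size` actions."""
    return chunk_size / forward_latency_s

def implied_forward_latency_s(reported_hz, chunk_size):
    """Invert the formula: per-pass latency implied by a reported control rate."""
    return chunk_size / reported_hz

latency = implied_forward_latency_s(reported_hz=117.7, chunk_size=8)
passes_per_second = 1.0 / latency
print(f"{latency * 1000:.1f} ms per pass, {passes_per_second:.1f} passes per second")
```

Under that assumption, chunk size acts as a multiplier on control rate: the same network at chunk size 1 would deliver one eighth as many actions per second.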

Gemini Robotics 1.5

externalize reasoning

Google DeepMind · Sep 2025 + follow-ups

Communication channel is natural language — not a latent vector
Plans are inspectable, debuggable, and use external tools

deepmind.google/blog/gemini-robotics-15

2026 trend

Beyond separate S1 & S2 — share parameters (FiS-VLA) or externalize the reasoning channel (Gemini Robotics 1.5). Two opposite answers to "make the bridge thicker."

Chapter 6 — World Models 48 / 77

06

World Models.

Can a model learn the dynamics of the world — and use them?

Ch 6 — World Models 49 / 77

Origin — not a 2018 invention

The same partition appeared three times: model-based RL · recurrent controller · V·M·C decomposition.

1990 · ML Workshop

Dyna

Sutton
"Integrated Architectures for
Learning, Planning, and Reacting"

Model-based RL: train the model from real interaction, then plan / value-iterate inside the model. Same agent learns real + imagined experience.

Sutton, ML Workshop 1990

1990 · FKI-126 / 147 TR

RNN world model
+ controller

Schmidhuber
"Making the World Differentiable"
+ "An On-line Algorithm…"

Differentiable world model: gradients of future reward flow through M back into π. This is exactly the modern policy-via-rollout pattern.

Schmidhuber 1990, TR FKI-126/147

2018 · arXiv:1803.10122

World Models · V·M·C

Ha & Schmidhuber
deep V·M + tiny policy formula

Same partition, modernized: VAE for V, MDN-RNN for M, and a tiny linear controller C trained with CMA-ES. The controller can be trained entirely inside the learned model, without touching the real env.

arXiv:1803.10122 · worldmodels.github.io

unifying
recipe

learn M̂(s, a) → ŝ' · use it to improve π.
Every WM since — Dreamer, V-JEPA, DreamDojo, DreamZero — is a re-mix of these three boxes with bigger backbones and richer data.
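The unifying recipe fits in a toy example: collect real transitions, fit M̂ by regression, then choose actions by imagining their outcome inside M̂. The linear environment, the least-squares model, and the one-step random-shooting planner are all illustrative assumptions. None of the cited systems is this simple, but the three stages are the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" environment: linear dynamics s' = A s + B a (unknown to the agent).
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

def env_step(s, a):
    return A_true @ s + B_true @ a

# 1) Collect real interaction data with random states and actions.
S = rng.standard_normal((200, 2))
U = rng.standard_normal((200, 1))
S_next = np.array([env_step(s, a) for s, a in zip(S, U)])

# 2) Fit the world model M_hat(s, a) -> s' by least squares on [s, a].
X = np.hstack([S, U])
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)   # W has shape (3, 2)

def model_step(s, a):
    return np.concatenate([s, a]) @ W

# 3) Use the model to improve the policy: pick the candidate action whose
#    imagined next state lands closest to the goal (one-step random shooting).
def plan(s, goal, candidates):
    costs = [np.linalg.norm(model_step(s, a) - goal) for a in candidates]
    return candidates[int(np.argmin(costs))]

state, goal = np.array([0.0, 0.0]), np.array([0.0, 0.1])
candidates = [np.array([v]) for v in np.linspace(-1.0, 1.0, 21)]
best_action = plan(state, goal, candidates)
```

Step 3 never calls `env_step`: every candidate is evaluated in imagination, which is the sense in which the policy improves inside the model.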

Ch 6 — World Models 50 / 77

The Dreamer family — one idea, five years

Hafner et al. · latent-imagination policy learning

2020 · arXiv:1912.01603

Dreamer V1

RSSM latent dynamics + actor-critic in imagination. Continuous control SOTA (DMC).

2021 · arXiv:2010.02193

Dreamer V2

Discrete latents (32 categoricals × 32 classes). First model-based agent to beat humans on Atari 200M.

2023 · arXiv:2301.04104

Dreamer V3

One unified config matches or beats tuned experts across 8 benchmark families — that's the V3 thesis. Diamond from scratch is the bonus.

Historical anchor · CoRL 2022

DayDreamer — Dreamer on real robots

Wu et al. trained a real A1 quadruped to walk from scratch in ~1 hour — physical-world WM-based RL, no sim. The earliest concrete bridge from V·M·C to robotics.

World model = imagination simulator the policy trains inside.

One family, three generations — same V·M·C recipe, scaled
RL framing dominated through 2023
DayDreamer proves transfer to a physical robot — bridge to Slide 61
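"Actor-critic in imagination" needs a training target for rollouts that exist only inside the world model. A minimal sketch of the recursive λ-return that the Dreamer papers use for this purpose follows; the function name, argument layout, and default values are assumptions chosen for illustration.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-returns along one imagined trajectory.

    rewards[t] for t = 0..H-1 are rewards predicted by the world model;
    values[t] for t = 0..H are critic estimates at the imagined latent states.
    Recursion: V_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * V_{t+1}), with V_H = v_H.
    """
    horizon = len(rewards)
    assert len(values) == horizon + 1
    returns = [0.0] * horizon
    nxt = values[-1]  # bootstrap from the critic at the end of the rollout
    for t in reversed(range(horizon)):
        nxt = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * nxt)
        returns[t] = nxt
    return returns

# lam = 0 reduces to one-step TD targets; lam = 1 to discounted sums plus the final bootstrap.
td_targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 2.0, 4.0, 6.0], gamma=0.5, lam=0.0)
mc_targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

The critic regresses toward these targets and the actor is trained to increase them, all without touching the real environment.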

Ch 6 — World Models 51 / 77

Two axes — methodology × purpose

Same word, different cells

Entertainment / Games

Robotics / Physical AI

Latent dynamics
(RNN/RSSM)

Dreamer V3 (Minecraft)

2023 diamond from scratch — in-game RL

DayDreamer

CoRL'22 quadruped, 1 h real-world walk

JEPA
(latent predict)

—

not a games line

V-JEPA 2 · LeWorldModel · DINO-WM

2025-26 action-conditioned representation prediction (Slides 53-54)

A-cond video WM
(obs,a)→obs'

Genie 1 · 2 · 3 · Oasis · Hunyuan-GameCraft · Mirage · MS Muse / WHAMM

2024-26 playable: keyboard/mouse = action

DreamDojo · DreamZero · GAIA-1 · GR00T-Dreams

2026 robot action conditioning (Slides 55-56, 60) · Sora·Veo·Wan = data sources, not WMs

3D / spatial
(geometry)

Genie 3 (long-horizon consistency)

2025 blurs into the robotics column —

4D-GS · Cosmos Transfer · Lyra 2 · VGGT / VGGT-Ω · MapAnything

2025-26 3D scenes + geometric foundation models (Slides 57-59)

Today we focus on the right column. The left column is moving fast too — and its data · architecture often crosses over (Genie 3 → sim asset, GameCraft engines → robot synth).

Ch 6 — World Models 52 / 77

Three modern (robotics-relevant) families

What "WM" actually points to in 2025-26 robotics papers — three branches.

Family 1

JEPA · latent prediction

predict representations, not pixels

V-JEPA 2 (Meta, 2025)
LeWorldModel (2026)
DINO-WM (ICLR 2025)

Bet: abstraction & efficiency. Don't waste compute reconstructing pixels you'll never plan over.

→ Slides 53-54

Family 2

Action-conditioned video WMs

(obs, action) → next pixels

DreamDojo (NVIDIA, 2026)
DreamZero (NVIDIA, 2026)
GR00T-Dreams / DreamGen
GAIA-1 · (Sora·Veo·Wan as data sources only)

Bet: predict the next pixels given an action — that's what makes it a WM, not video gen.

→ Slides 55-56 · 60

Family 3

3D & spatial WMs

explicit geometry

4D Gaussian Splatting
Cosmos Transfer
Lyra 2 (NVIDIA, 2026)
VGGT / VGGT-Ω / MapAnything

Bet: robots live in 3D. Carry geometry, don't re-infer it every frame.

→ Slides 57-59

Ch 6 — World Models 53 / 77

Family 1 · V-JEPA 2 (Meta, 2025)

Predict masked representations, not pixels — 1.2 B params

"Don't predict pixels you'll never plan over."

1.2 B params · trained on internet video
SOTA on action understanding & anticipation
Action-conditioned variant for manipulation

LeCun's long-standing pitch finally shipped at scale — representation prediction beats pixel prediction on efficiency.

Ch 6 — World Models 54 / 77

Family 1 · 2026 JEPA-line — DINO-WM + LeWorldModel

Small, plannable WMs in latent space, with no pixels decoded back.

DINO-WM

NYU + Meta · Zhou et al. · ICLR 2025

arXiv:2411.04983

DINO-WM architecture: past frames o_{t-k}..o_t → DINOv2 encoder → latents z_{t-k}..z_t → dynamics head p_θ with action a_t → ẑ_{t+1}; test-time actions optimized via planning loss vs goal z_g

Frozen DINOv2 features + small dynamics head trained on (z, a) → ẑ'. At test-time, optimize action sequence by gradient descent on planning loss to goal latent.

dino-wm.github.io
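The test-time step described above (optimize an action sequence by gradient descent on a planning loss to the goal latent) can be sketched on a toy problem. This is only an illustration under loud assumptions: the latent is one-dimensional, the "dynamics head" is a hand-written linear map rather than a trained network, and the gradient is written analytically instead of coming from autodiff.

```python
def rollout(z0, actions, a_coef=0.9, b_coef=0.5):
    """Toy stand-in for the dynamics head: z' = a_coef * z + b_coef * a."""
    z = z0
    for a in actions:
        z = a_coef * z + b_coef * a
    return z

def plan(z0, z_goal, horizon=5, steps=200, lr=0.1, a_coef=0.9, b_coef=0.5):
    """Gradient descent on the planning loss (z_T - z_goal)^2 over the action sequence."""
    actions = [0.0] * horizon
    for _ in range(steps):
        err = rollout(z0, actions, a_coef, b_coef) - z_goal
        # For this linear model, d(z_T)/d(a_t) = a_coef**(horizon - 1 - t) * b_coef.
        grads = [2.0 * err * (a_coef ** (horizon - 1 - t)) * b_coef
                 for t in range(horizon)]
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions

planned = plan(z0=0.0, z_goal=1.0)
final_latent = rollout(0.0, planned)  # ends close to the goal latent
```

With a learned dynamics head the structure stays the same: only `rollout` and the gradient computation change, and the encoder stays frozen.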

LeWorldModel

Maes · LeCun · Balestriero · Mar 2026

arXiv:2603.19312

LeWorldModel vs PLDM / DINO-WM / Dreamer / TD-MPC comparison: addresses 6→1 hyperparameter, anti-collapse, end-to-end, task-agnostic, reconstruction-free

LeWorldModel architecture: two encoders process o_t and o_{t+1}; predictor takes (z_t, a_t) → ẑ_{t+1}; MSE loss between ẑ_{t+1} and z_{t+1}; SIGReg (Sketched Isotropic Gaussian Regularization) on the latents prevents collapse via normality tests on random projections

48× faster planning

15M params total

1 hyperparameter

github.com/lucas-maes/le-wm

why JEPA

Small + plannable in latent space — cheapest WM-for-planning option when you don't need pixels back. LeWorldModel proves end-to-end JEPA is now stable.

Ch 6 — World Models 55 / 77

Family 2 · Video gen as a data & sim source

not WMs themselves — but the substrate for them

Sora·Veo·Wan·Cosmos·Genie are video generators: prompt → video. They become WMs only with action conditioning (Slide 56, 60).

OpenAI · 2024-10 · v2 2025-10

Sora 2

Text-to-video w/ native audio. Consumer app shut Apr 2026 — API only.

DeepMind · v3.1 2025

Veo 3 / 3.1

1080p / 4K, 8 s clip, native audio & dialogue. Frames-to-video for control.

NVIDIA · 2025-01 · arXiv:2501.03575

Cosmos Predict

Physical-AI WFM — pretrained for sim-to-real & robot data gen.

DeepMind · 2025-08

Genie 3

24 fps / 720p / multi-minute consistency — long-horizon scene memory.

Historical anchor · ICLR'24 Outstanding · arXiv:2310.06114

UniSim (Sherry Yang, Yilun Du et al.): first paper to call video generation an interactive real-world simulator. The robotics WM ports (DreamDojo, GAIA-1, etc.) wrap these backbones with action conditioning (next slide).

Ch 6 — World Models 56 / 77

Family 2 · 2026 robotics WMs — DreamDojo + DreamZero

action-conditioned · not video gen

Sora·Veo·Wan are video generators. WMs predict next state given action. Two true 2026 WMs:

DreamDojo

NVIDIA · Feb 2026

arXiv:2602.06949 · dreamdojo-world.github.io

DreamDojo: Human-Video Pretraining → DreamDojo → Robot Post-Training (GR-1, G1, AgiBot, YAM) → Autoregressive Distillation → Applications (Unseen Env, Live Teleop, Policy Eval, Model-based Planning)

Pretrain on human videos (EgoDex, In-lab, DreamDojo-HV) → post-train on robot data — cross-embodiment WM
Action-conditioned: (obs, action) → next obs — supports policy eval, MPC, unseen-env deploy

DreamZero

NVIDIA · Feb 2026

arXiv:2602.15922 · dreamzero0.github.io · 14B WAM

Joint video × action generation — one 14B AR model emits next frames AND next actions in lockstep
Open weights · cross-embodiment transfer from 30 min of play data · revisited in Ch 7 as the canonical WAM

important distinction

Video gen models (Sora·Veo·Wan·Cosmos) make beautiful pixels but do not take an action input. They become WMs only when wrapped with explicit action conditioning — Slide 60 covers that bridge.

Ch 6 — World Models 57 / 77

Family 3 · 3D & spatial world models

Carry the geometry — don't re-infer it every frame.

4D Gaussian Splatting

HUST · Wu et al. · CVPR 2024

arXiv:2310.08528

4D Gaussian Splatting coarse-to-fine pipeline: Random Point Cloud → 3D Gaussian Initialization at Iter 3000 → 4D Gaussian Joint Optimization at Iter 20000

Explicit 3D Gaussians + per-Gaussian motion over time. Real-time render, novel-view eval — the geometric substrate Lyra 2 generates and Cosmos Transfer renders on.

hustvl/4DGaussians

Cosmos Transfer 1

NVIDIA · Mar 2025

arXiv:2503.14492

Cosmos Transfer architecture: simulated world + depth/segmentation/etc sensor modalities + text prompt → frozen Cosmos-1 foundation model + per-modality ControlNets with spatiotemporal control maps → output world

Structured inputs (depth / segmentation / edge) → photoreal video via frozen Cosmos-1 + per-modality ControlNets. The standard 2025-26 sim-to-real data pipe.

research.nvidia.com/labs/dir/cosmos1

data pipe

Sim engine generates structure. Cosmos Transfer paints pixels. Policy trains on the pixels. Lyra 2 (Slide 58) generates the 3D scene itself.

Ch 6 — World Models 58 / 77

Family 3 · NVIDIA Lyra 2

One image / prompt → navigable 3D Gaussian scene — a geometry-native WM.

Lyra 2.0 teaser — text/image → navigable 3D Gaussian-splat scene (user pans through GUI)

A scene-generation foundation model.

Built on WAN 2.1 video backbone — distilled into 3D-GS
Output is 3D Gaussian splats, not video — truly navigable
Plugs straight into Isaac Sim for robot rollouts

Why this matters for robotics: generates the environment the policy will train on — not just a 5-second clip. Synthetic data & evaluation in one shot.

research.nvidia.com/labs/sil/projects/lyra2 · github.com/nv-tlabs/lyra

Ch 6 — World Models 59 / 77

Family 3 · Geometric Foundation Models

"World model" = 3D reconstruction as a feed-forward foundation. N views → cameras, depth, geometry in one pass.

VGGT

FAIR + Oxford

CVPR'25 Best

arXiv:2503.11651

VGGT architecture: N input images → DINO + concat + camera token → Global Attention + Frame Attention (×L) → Camera Head + DPT → cameras, depth maps, point maps, tracks in one forward pass

N views → cameras + depth + pointmaps + tracks in one feed-forward pass. The foundation.

VGGT-Ω

Wang · Vedaldi et al.

CVPR'26 Oral

arXiv:2605.15195

VGGT-Omega architecture: introduces Register Attention as alternative to Global/Frame Attention, plus training-only Matching Loss and Point Loss; same I/O as VGGT but lighter compute and richer supervision

What's new vs VGGT: Register Attention (lighter), Matching + Point losses (training-only).

30% memory

15× data

+77% Sintel

MapAnything

Meta + CMU

3DV 2026

arXiv:2509.13414

MapAnything method: Visual Input N + optional Geometric Inputs (Ray Directions, Pose, Depth) → Multi-Modal Encoders with shared weights → Multi-View Transformer → MLP scaling factor + DPT Head + Pose Head → metric 3D scene

Universal feed-forward metric 3D — real-world units, not just relative scale.

JEPA carries latents. Video gen carries pixels. This line carries geometry itself — what WoRV uses today for data & eval.

Ch 6 — World Models 60 / 77

Action-conditioned world models

Without an action input, it's just video gen. With one — it's a robotics WM. This is the bridge to Ch 7.

GAIA-1 — driving WM

Wayve · Sep 2023 · 9B params

arXiv:2309.17080

GAIA-1 schematic: input video → image encoder, action input (speed, steering) → action encoder, text input → text encoder, all three streams → world model with autoregressive prediction → output tokens → video decoder → output video

Three input streams — video, action, text — into one AR transformer. Counterfactual rollouts: "what if I steer left?"

wayve.ai/thinking/introducing-gaia1

GR00T-Dreams / DreamGen

NVIDIA GEAR · May 2025

arXiv:2505.12705

GR00T-Dreams / DreamGen: initial frame → video world model → synthetic generated videos for robot learning with automatically extracted pseudo-actions â_{1:H}, used for contact-rich data augmentation, new behavior generalization, new environment generalization

Initial frame → WM → synthetic robot videos. Pseudo-actions â auto-extracted — used for new behaviors and unseen environments.

developer.nvidia.com/blog/r2d2

bridge to Ch 7

Once you can condition on action, you can also generate action. That's a WAM (World Action Model) — DreamZero (Slide 66).

Ch 6 — World Models 61 / 77

Why WMs matter for robotics

framework · NTU MARS survey · arXiv:2605.00080

Three core capabilities of an actionable world model — foresight, imagination-planning, data amplification.

1

foresight

Anticipate consequences before executing

Predict next state under candidate actions — the policy can reason about contact, dynamics, and physical regularities language-only pretraining never captures.

Examples · LingBot-VA · SayDream · MOTUS · TC-IDM

2

planning

Imagine rollouts & pick the best

MPC / search inside the WM. Use the imagined future to compare candidate behaviors before acting.

DayDreamer (real A1 in 1h) · Dreamer V3 (Minecraft Diamond) · DINO-WM (latent planning) · CosmosPolicy

3

data

Synthesize trajectories at scale

Trained WM = generator of new (obs, action) pairs — replace expensive teleop with imagined rollouts.

GR00T-Dreams: 3 months → 36 h · DreamDojo · DreamGen · CosmosPredict

+

Zero-shot policy evaluation

Roll a policy inside the WM — no real-robot time. DeepMind's Gemini-in-Veo: 1,600+ real evals replaced. WorldEval (Midea) is the academic counterpart.

→

Now core to the learning loop

2026 trend (NTU MARS): WMs no longer auxiliary — VLA-RFT, WMPO, RynnVLA-002 co-evolve policy + WM in one loop.

NTU MARS survey landscape: temporal evolution of representative WM works for robotic policy learning — 'World Model for Policy' branch (UniPi, Gen2Act, VidMan, VPP, GR-1, UVA, UWA, FLARE, Vidar, WorldVLA, RynnVLA-002, DreamVLA, TriVLA, UniVLA, VideoPolicy, VideoVLA, UD-VLA, GE-ACT, Motus, F1, InternVLA-A1, Video2ACT, LVP, MimicVideo, CosmosPolicy, DreamZero, GigaWorld-Policy, Fast-WAM, LingBot-VA, BagelVLA, LDA-1B, FRAPPE, WoG, VLA-JEPA, JEPA-VLA, HALO, CoWVLA, TC-IDM, Say-Dream-ACT) + 'World Model as Simulator' branch (IRASim, GPC, World-Env, Ctrl-World, World in World, WorldEval, World4RL, VLA-RFT, DiWA, DreamPlan, WMPO, RISE, Giga-Brain-0.5M, WorldVLA-Loop, PlayWorld, VLAW, WoVR), color-coded by style

The landscape — NTU MARS survey

~60 models in 18 months across 2 branches, 7 styles.

For Policy: IDM → Single-Backbone → MoE/MoT → Unified VLA → Latent WM
As Simulator: validation → RL env → policy co-evolution

Why this chapter is the longest: one trained WM addresses foresight · planning · data · eval · policy co-evolution, five of the most expensive things in robotics.
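The planning capability on this slide ("imagine rollouts & pick the best") can be sketched as a tiny shooting-style MPC loop. Everything below is an assumption for illustration: the world model is a hand-written 1-D integrator standing in for a trained WM, the candidate set is a fixed grid of constant actions, and the toy "real" system is taken to be identical to the model.

```python
def world_model(pos, action):
    """Hypothetical learned dynamics M_hat: a 1-D integrator standing in for a trained WM."""
    return pos + 0.1 * action

def imagined_cost(pos, action, goal, horizon):
    """Roll one candidate (a constant action held for `horizon` steps) inside the WM."""
    cost = 0.0
    for _ in range(horizon):
        pos = world_model(pos, action)
        cost += (pos - goal) ** 2
    return cost

CANDIDATES = [i / 10.0 for i in range(-10, 11)]  # candidate actions in [-1, 1]

def mpc_action(pos, goal, horizon=5):
    """Pick the candidate whose imagined rollout is cheapest; only its first step is executed."""
    return min(CANDIDATES, key=lambda a: imagined_cost(pos, a, goal, horizon))

pos, goal = 0.0, 1.0
for _ in range(60):
    # Replan at every step; the toy "real" system equals the model here.
    pos = world_model(pos, mpc_action(pos, goal))
```

Zero-shot policy evaluation reuses the same loop with the roles swapped: the policy supplies the actions, and the imagined cost (or success signal) is the quantity reported.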

Chapter 7 — World Action Models 62 / 77

07

World Action Models.

Does knowing the world’s dynamics turn into zero-shot policies?

Ch 7 — WAM 63 / 77

The IDM idea — label video backwards

Baker et al. · OpenAI VPT · 2022

(a) Supervised · teleop / contractor

Pay humans to label every frame

Linear in dollars. Every new frame = another teleop hour. OpenVLA, DROID, OpenX all live here.

(b) IDM · infer action from 2 frames

Recover the missing label from raw video

Sub-linear in dollars. 2k labeled hours teach the IDM → the IDM labels 70k unlabeled web hours → first foundation policy in Minecraft.

port to robotics

Same recipe runs in robot land: UniPi (NeurIPS'23) · GR-1 (ICLR'24, CALVIN 94.9) · VPP (ICML'25) · NovaFlow ('26).
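The IDM recipe (a small labeled set teaches the IDM, the IDM then labels raw video) can be sketched as follows. This is a toy under stated assumptions: frames are scalars, the IDM is a nearest-centroid classifier on the frame delta, and all names and data are made up for illustration; VPT itself trains a neural IDM on Minecraft frames.

```python
def fit_idm(labeled):
    """labeled: list of (frame_t, frame_t1, action). Returns the mean frame delta per action."""
    sums, counts = {}, {}
    for f0, f1, a in labeled:
        sums[a] = sums.get(a, 0.0) + (f1 - f0)
        counts[a] = counts.get(a, 0) + 1
    return {a: sums[a] / counts[a] for a in sums}

def idm_predict(centroids, f0, f1):
    """Infer the action whose typical delta is closest to the observed delta."""
    delta = f1 - f0
    return min(centroids, key=lambda a: abs(centroids[a] - delta))

def pseudo_label(centroids, video):
    """video: list of scalar 'frames'. Returns (frame, inferred_action) pairs."""
    return [(video[t], idm_predict(centroids, video[t], video[t + 1]))
            for t in range(len(video) - 1)]

# Small labeled set (the expensive part) ...
labeled = [(0.0, 1.0, "right"), (2.0, 3.1, "right"), (5.0, 4.0, "left"),
           (3.0, 2.1, "left"), (1.0, 1.0, "stay")]
idm = fit_idm(labeled)

# ... labels a larger pool of action-free video (the cheap part).
unlabeled_video = [0.0, 0.9, 2.0, 2.0, 1.1]
dataset = pseudo_label(idm, unlabeled_video)
inferred = [action for _, action in dataset]
```

The resulting `dataset` is what a policy is then behavior-cloned on; the IDM can look at the future frame, which the policy cannot, and that asymmetry is why it needs far fewer labels.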

Ch 7 — WAM 64 / 77

LAPA — latent actions from video

Ye et al. · ICLR 2025 · 🇰🇷 KAIST-led

Drop the hand-defined action label. Learn a latent action with VQ-VAE on inter-frame deltas.

Step 1

Latent action

VQ-VAE on (o_t, o_{t+1})

Quantize inter-frame deltas → discrete latent z, no human label.

Step 2

Latent VLA pretrain

predict next z · web video

Train VLA to predict next latent action from raw video. Zero robot data.

Step 3

Align to real action

tiny robot finetune → a

Map latent z to real joint actions with a small labeled set.

Why it matters — the IDM idea, without hand-defined labels

+6.22%

over OpenVLA SOTA

30×

more pretrain-efficient

Generalizes across embodiments: the latent action space is embodiment-agnostic, not joint-specific
Sets up the WAM intuition: video carries action information implicitly
Directly cited by DreamZero, InternVLA-A1, LingBot-VA as the latent-action ancestor

🇰🇷 First authors + both advisors (Kimin Lee · Minjoon Seo) at KAIST.
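Steps 1 and 3 of the LAPA recipe can be sketched on toy data. Loud assumptions: plain nearest-codebook quantization with Lloyd (k-means) updates is swapped in for the VQ-VAE, frames are scalars, and the alignment step is a per-code average rather than a finetuned network. Step 2 (pretraining a VLA to predict the latent action) is omitted.

```python
def nearest(codebook, x):
    """Index of the codebook entry closest to x (the discrete latent action)."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - x))

def fit_codebook(deltas, codebook, iters=10):
    """Step 1 stand-in: Lloyd iterations (1-D k-means) instead of VQ-VAE training."""
    codebook = list(codebook)
    for _ in range(iters):
        buckets = [[] for _ in codebook]
        for d in deltas:
            buckets[nearest(codebook, d)].append(d)
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook

def align(codebook, labeled):
    """Step 3 stand-in: map each latent code to the mean real action seen with it."""
    sums, counts = [0.0] * len(codebook), [0] * len(codebook)
    for d, a in labeled:
        k = nearest(codebook, d)
        sums[k] += a
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in range(len(codebook)) if counts[k]}

# Action-free "video": only inter-frame deltas are available.
frames = [0.0, 1.0, 2.1, 2.1, 1.0, 0.1, 0.1, 1.2]
deltas = [b - a for a, b in zip(frames, frames[1:])]
codebook = fit_codebook(deltas, codebook=[-0.5, 0.0, 0.5])
latent_actions = [nearest(codebook, d) for d in deltas]

# Tiny labeled robot set: (observed delta, real action command).
decoder = align(codebook, labeled=[(1.0, 0.2), (-1.0, -0.2), (0.0, 0.0)])
```

No action label is used until `align`, which mirrors the claim above: the discrete codes come from video alone, and only the final mapping to real actions needs robot data.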

Ch 7 — WAM 65 / 77

From IDM / LAPA to joint video × action modelling

Three architectural styles — the bifurcation point splits Slides 66 (canonical) and 68 (hybrid).

STEP 1 · 2022

IDM

video → action

Recover the missing label using a tiny labeled set.

VPT · UniPi · GR-1

→

STEP 2 · 2024

LAPA

video → latent z → action

Learn the action as a VQ-VAE latent. Drop hand labels.

LAPA · UniSim-latent

→

STEP 3 · 2026

WAM

video ↔ action (joint)

Stop separating them. Generate future video AND action together.

DreamZero (canonical) + 12 hybrids

Step 3 architecture variants · NTU MARS survey

Three WAM architecture styles from NTU survey: (a) IDM-Style = Video Generation Model → Inverse Dynamics Model → action; (b) Single-Backbone-Style = shared backbone emits both observation tokens and action tokens; (c) MoT-Style = separate Video Expert and Action Expert with Joint Attention

(3) Canonical WAM

→ slide 66

Single-Backbone (b). One generative model over both video and action. AR diffusion. DreamZero, Cosmos-Policy.

(3a) VLA + WM hybrid

→ slide 68

MoT-Style (c). Video expert + action expert, joint attention. Video-pred as auxiliary loss. InternVLA-A1, LingBot-VA.

Ch 7 — WAM 66 / 77

DreamZero — “World Action Models are Zero-shot Policies”

NVIDIA GEAR · 2026-02

DreamZero overview · 14B AR video diffusion + action

62.2%

AgiBot · seen

vs VLA 27.4%

39.5%

AgiBot · unseen

vs from-scratch ≈0%

49%

DROID · unseen verbs

vs π0.5 33% · GR00T N1.6 31%

Why-it-works: VLA = semantic prior · WAM = physical prior. Generating the future video conditions the action on a real rollout.

Single-backbone video × action lineage · ~2 years, 7 steps

2024 · ICLR

GR-1

ByteDance

→

2025 · ICML

VPP

Tsinghua

→

2025 · RSS

UVA

Stanford

→

2025 · RSS

UWM

UW WEIRD

→

2026 · ICLR

GE-Act

AgiBot

→

2026-01 · arXiv

Cosmos-Policy

NVIDIA · 2B

→

2026-02

DreamZero

NVIDIA · 14B

DreamZero didn't appear out of nowhere — a year+ of single-backbone work made it scalable. Cosmos-Policy (Jan 2026) was NVIDIA's own one-month-prior step.

Ch 7 — WAM 67 / 77

DreamZero deployment — 7 Hz at 14B · 30 min to a new embodiment

(a) System · running a 14B WAM at robot rate

14B normally implies seconds-per-step. Flash distillation + KV-cache reuse lands at robot-usable rate — not in the π0 zone, but enough for tabletop manip.

(b) Cross-embodiment · 30 min on YAM

YAM teddy transfer · 30 min of plays, new robot · one model

Model + deployment, advancing together.

The same generative prior that gives zero-shot tasks also gives fast embodiment adaptation.

Ch 7 — WAM 68 / 77

Concurrent VLA × WM hybrids — early 2026

same intuition · different recipes

InternVLA-A1

2026-01-05

Shanghai AI Lab + Humanoid Robot (Shanghai) Co. · 42 authors · lead Jia Zeng / Jiangmiao Pang

+26.7%

dynamic tasks

75.1%

avg / 12 real tasks

MoT (Mixture-of-Transformers) — one backbone, three experts: understanding, generation, action. 2-3B params, 692M pretraining frames.

arXiv:2601.02456

Causal World Modelling aka LingBot-VA

2026-01-29

Ant Group / Robbyant · Li / Zhang / Luo et al. · corresp. Yinghao Xu

LingBot-VA framework: Language Model + alternating Video Model (generates next frames Ô) + Action Model (predicts next actions Â), with shared latent space and async inference. Task prompt 'Unpack delivery', initial observation O_0, sequence O_1→O_2→O_3 interleaved with A_1, A_2, A_3.

92.9%

LIBERO · easy

91.6%

LIBERO · hard

AR diffusion over future frames + policy execution in shared latent space. SOTA on all 6 real-robot tasks vs π0.5; +8-9% at horizon 3.

arXiv:2601.21998

Same intuition, different design. Plus 12+ more in the NTU MARS survey · GE-Act · Motus (+45% over π0.5) · BagelVLA · FRAPPE · STARRY · WAV.

Ch 7 — WAM 69 / 77

WAM follow-ups — mid 2026 · four directions in ∼3 months

post-DreamZero

simplify architecture

Being-H0.7

2026-04-30

Peking U / BeingBeyond · Zongqing Lu

No future-video generation. Just a dual branch with learnable latent queries between perception and action. Pretrained on 200k h ego video + 15k h robot demos: the Ch3 ego pool, finally cashed in.

99.2%

LIBERO

62.1%

RoboCasa

49.2%

GR1

arXiv:2605.00078 · ← Ch3 forward-ref

transfer efficiency

CKT-WAM

2026-05-07

Tsinghua + LivsynRobotics + Shanghai AI Lab

Teacher hidden states → compressors → routed adapters → student text-embedding space. Knowledge transfer between WAMs without touching the backbone.

86.1%

LIBERO-Plus

1.17%

trainable params

arXiv:2605.06247

object addressing

OA-WAM

2026-05

Object-Addressable WAM

Treat objects as first-class addressable entities inside the world model. Lets the policy refer to “the red cup” the way a VLA refers to a token — not a pixel patch.

97.8%

LIBERO

preprint · 2026-05

counter-narrative

do we need imagination?

Fast-WAM

2026-03

“Test-time future imagination is unnecessary.” Skip the rollout, get SOTA at 4× speed — directly challenges DreamZero / Being-H0.7’s core premise.

190 ms

inference

4×

faster

FFDC-WAM threads the needle with conditional imagination.

Ch 7 — WAM 70 / 77

WMs vs WAMs vs VLAs — 5-architecture view

NTU MARS survey · arXiv:2605.00080

Pattern 1

IDM-style

Decouple: predict subgoal / latent, then act. VPT · UniPi · LAPA

Pattern 2

Single-backbone

One model, video × action jointly. UVA · UWM · DreamZero

Pattern 3

MoE / MoT

Expert fusion / joint attention. LingBot-VA · GE-Act · BagelVLA

Pattern 4

Unified-VLA

VLA + foresight / video pred. as aux. GR-1 · InternVLA-A1 · UniVLA

Pattern 5

Latent-space

JEPA-style, no pixel gen. V-JEPA 2 · VLA-JEPA · FLARE · Being-H0.7

Every model in today’s talk lives in one of these 5 boxes. Patterns 2 and 3 are where the action is in 2026 — canonical WAM (slide 66) and VLA × WM hybrid (slide 68).

Chapter 8 — WoRV @ maum.ai 71 / 77

08

WoRV @ maum.ai

How we're working on all of this from Korea

Ch 8 — WoRV 72 / 77

WoRV @ maum.ai

World model for Robotics and Vehicle control

maum.ai's physical-AI division.
We build foundation models that work in the physical world.

@ Pangyo IT Center #2

Member of the national WFM consortium · Korea Physical AI program (≈$25M scope, MSIT/IITP-supervised, 2026 onward)

Our bet — 2026 taxonomy

We bet on the WAM × dual-system line.

VLA

semantic-heavy

WM

world-heavy

WAM

★ WoRV position

(Taxonomy reused from Ch7 slide 70 — we build a Korean stack on the same line.)

Ch 8 — WoRV 73 / 77

What we do — three stacks

Data → Models → Deployment. End-to-end stack from teleop to B2B install.

PILLAR 01

Data Infrastructure

teleop rigs · video pipelines · synthetic factory

CANVAS data pipeline: 48 hours and 219 kilometers of human-annotated navigation data collected via own teleop rigs

219 km

CANVAS nav data

48 h

human-annotated

ALOHA / UMI / egocentric video pipelines + Cosmos-Transfer synth factory.

CANVAS · HuggingFace maum-ai/CANVAS-S · maum-ai/COMMAND

PILLAR 02

Robotics Foundation Models

in-house VLA · WAM · dual-system

VLA

backbone

WAM

action world model

S1+S2

dual system

Ch6 / Ch7 lines implemented in-house — our WAM × dual-system bet.

internal · targeted public release in 2026 H2

PILLAR 03

Tailored B2B

from eval to deploy

-$27

CANVAS / run

-$35

LiDAR+GPS / run

CostNav measures navigation in real economic cost. I'm last author.

arXiv:2511.20216 · worv-ai.github.io/CostNav

Public HuggingFace orgs: maum-ai/CANVAS · maum-ai/CostNav · open-world-agents/D2E (next slide).

Ch 8 — WoRV 74 / 77

Selected research — recent & ongoing collabs

Three projects we're (co-)leading right now — data, world models, eval infra.

D2E

ICLR 2026 · arXiv:2510.05684

MAUM.AI × Stanford × SNU

D2E (Desktop-to-Embodied): Generalist Inverse Dynamics Model trained on desktop game data, then transferred to robotics including manipulation and navigation

Generalist Inverse Dynamics Model — trained on desktop game data (Brotato, Minecraft, BF6, …) → transfers to real-world manipulation (Meta-World, LIBERO) and navigation (CANVAS).

github.com/worv-ai/D2E · HF: open-world-agents/Generalist-IDM-1B

WorldCam

2026 · cvlab-kaist.github.io/WorldCam

Adobe × KAIST × MAUM.AI

WorldCam architecture: progressive autoregressive video transformer conditioned on camera poses with memory mechanisms; ingests gameplay frames + camera trajectories

Camera-controllable progressive AR video transformer — 3,000 minutes of gameplay frames with camera trajectories. The "WM as controllable simulator" research line.

github.com/cvlab-kaist/WorldCam

VLA-Eval-Harness

2026 · allenai/vla-evaluation-harness

AI2 × SNU × MAUM.AI

VLA-Evaluation-Harness: 47× speedup on LIBERO tasks via batch parallel evaluation, decoupling models from environments through standardized abstraction

Unified VLA eval across 18 benchmarks × 13 model servers. 47× faster LIBERO via batched parallel eval. Decouples model from environment.

github.com/allenai/vla-evaluation-harness · v0.2.0

collab footprint

D2E (SNU + Stanford + MAUM) · WorldCam (Adobe + KAIST + MAUM) · VLA-Eval-Harness (AI2 + SNU + MAUM). Always hiring — see next 2 slides.

Ch 8 — WoRV 75 / 77

Deployment — 5 verticals

Specific customers are confidential — verticals only

01

Agriculture

02

Construction

03

Maritime

04

Defense

05

Logistics &
Manufacturing

The model runs in 5 real industries — that fact alone is the point. Specific partners and impact figures stay verbal — confidentiality policy.

Ch 8 — WoRV 76 / 77

We are hiring

— people to solve this with us

Research · 2 positions

01 Research Scientist / Fellow: VLA · WM · WAM

02 Research Intern (World Model): min 2-month commitment

Engineering · 4 positions

03 VLA Engineer: training · post-training

04 AI SW Engineer: production stack

05 Robotics SW Engineer: on-robot integration

06 Simulation Engineer: Isaac · Cosmos · WM rollouts

Business & PM · 2 positions + talent pool

07 Project Manager: B2B vertical leads

08 Technical Sales & Strategy: vertical GTM

09 Talent Pool (rolling recruitment): always-open registration

recruitment.worv.maum.ai

worv_hr@maum.ai

KR military alt-service: Research track · Industrial track

@ Pangyo IT Center #2

Q & A 77 / 77

Thank you.

Q & A

15 minutes

Contact

EMAIL sung@maum.ai

WEB alohays.github.io

TEAM WoRV @ maum.ai (Pangyo)

HIRING recruitment.worv.maum.ai
