NVIDIA Physical AI in May 2026 is no longer a research ambition — it’s a shipping stack with named products, named customers, and named ship dates. Isaac GR00T N1.7 is open-source on Hugging Face. Cosmos World Foundation Models generate synthetic robot training data at industrial scale. Jetson AGX Thor computers ship inside humanoid robots from Figure, Agility, AGIBOT, and a dozen industrial robotics leaders. The pieces compose into the first credible end-to-end developer platform for foundation-model-driven robots, and the gap between “interesting demo” and “production deployment” closed significantly in the last six months.
This guide is the developer playbook for that stack. We walk every layer — the GR00T foundation models, the Cosmos generative world models, the Isaac Sim and Isaac Lab simulation environments, the Jetson Thor edge compute — with hands-on tutorials, copy-paste configurations, performance trade-offs, and real customer deployments. By the end, you’ll know what to build, how to build it, where the gotchas live, and what’s coming over the next twelve months. Whether you’re a robotics engineer evaluating the platform, an ML engineer pivoting into embodied AI, or an executive trying to understand why Physical AI is the topic at every robotics conference in 2026, this is the deep dive that pays back the read.
Chapter 1: The Physical AI Thesis — Why Robots Need Foundation Models
Robotics for sixty years has been an engineering discipline of bespoke control. Every new robot — every new task — required custom motion planning, custom perception pipelines, custom hand-tuned reward functions, and months of integration work to make any of it survive contact with the real world. The result: robots got narrowly excellent at narrow tasks and remained brittle outside them. A pick-and-place robot in a factory could not, with any reasonable amount of work, learn to fold laundry. A surgical robot couldn’t unload a dishwasher. The capability didn’t generalize because the methods didn’t generalize.
Foundation models change the structural assumption. Instead of training a custom policy per task, you train a single large model on many tasks across many environments, and the resulting behavior generalizes — partially, imperfectly, but meaningfully — to tasks the model never saw during training. The thesis behind NVIDIA’s Physical AI investment is that the same scaling laws that produced GPT-class language models, when applied to robot perception and action data, will produce a generalist robot foundation model that can be specialized to particular robots and tasks with far less engineering than today’s approach.
The empirical evidence in 2026 supports the thesis cautiously. GR00T N1.7 zero-shots tasks across humanoid robot platforms it never saw during training, with roughly 50-65% success rates on reasonably-defined manipulation benchmarks. Five years ago that would have been unthinkable. Three years ago it would have required a PhD-team-quarter of effort per task. Today it’s a single model invocation. The capability isn’t robust enough for unsupervised production deployment in most settings yet, but the trajectory is unambiguous.
Three architectural shifts make this possible. First, vision-language-action (VLA) models unify perception, language understanding, and motor output into a single trained network — eliminating the lossy interface boundaries between formerly-separate modules. Second, generative world models like Cosmos produce vast quantities of synthetic training data that’s diverse, labeled, and physics-consistent — solving the data scarcity problem that has historically limited robot learning. Third, specialized edge compute like the Jetson AGX Thor delivers data-center-class inference at robot-scale power budgets — making sophisticated foundation models actually deployable on autonomous machines.
The deeper implication for builders: robotics development cycles are shifting from “collect data, hand-engineer policy, debug for months” to “fine-tune a foundation model with simulation data, validate sim-to-real transfer, deploy.” The same shift separated custom-tuned ML in 2014 from pre-trained-then-fine-tuned ML in 2024. Robotics is fifteen to twenty years behind the language modeling curve, but it’s unmistakably on the same trajectory.
For the rest of this guide, the working assumption is that you accept the thesis enough to want to know how to build on the stack. The detailed mechanics of training generalist robot policies are an active research area; we’ll touch them where relevant, but the focus is the developer-facing platform NVIDIA ships and how to use it productively. By Chapter 12 you’ll have the practical knowledge to evaluate whether NVIDIA’s Physical AI stack is the right foundation for your project, and what alternatives exist if it isn’t.
Chapter 2: The NVIDIA Physical AI Stack at a Glance
The Physical AI stack has five layers. Each layer is a discrete product with its own SDK, documentation, and release cadence. They compose, but you don’t need every layer to build something useful — and choosing only the layers you need keeps complexity down.
Layer 1: Robot Foundation Models (GR00T family). Open-source vision-language-action models published on Hugging Face. The current production model is GR00T N1.7, a 3B-parameter VLA. The roadmap calls for GR00T N2 by end of 2026 with materially better generalization. These are the brains — the layer that maps “user intent + camera input” to “joint commands or end-effector trajectories.”
Layer 2: World Foundation Models (Cosmos). Generative AI systems that produce synthetic robot training data. Two relevant variants: Cosmos Predict (for generating future-frame video given an initial state and an action sequence — used for synthetic data and forward modeling) and Cosmos Reason (a vision-language reasoning model that bridges between high-level instructions and the physical world). Both are open-source with permissive licenses.
Layer 3: Simulation (Isaac Sim and Isaac Lab). Physics-accurate robot simulation built on Omniverse. Isaac Sim is the GUI-driven scene editor and full simulator; Isaac Lab is the headless reinforcement-learning training environment optimized for high-throughput parallel simulation on GPUs. Both are required for the standard “train in sim, deploy on real robot” workflow.
Layer 4: Synthetic Data Generation (Replicator and Cosmos integrations). Tools to programmatically generate diverse, labeled simulation data — varying lighting, textures, object placements, robot configurations. The labeled data feeds back into model training. Replicator is the deterministic-randomization tool; Cosmos is the generative-model approach. Most production pipelines use both.
Layer 5: Edge Compute (Jetson AGX Thor). The on-robot computer that runs the foundation model inference at deployment time. Jetson Thor delivers ~2,000 TOPS of AI compute at ~130W power draw, enabling on-board GR00T-class inference for real-time robot control. Deployments without Thor route inference through tethered or networked compute, with latency penalties.
| Layer | Product | License | Released | Key role |
|---|---|---|---|---|
| Foundation Models | Isaac GR00T N1.7 | Apache 2.0 (open weights) | April 2026 | VLA reasoning + action policy |
| World Models | Cosmos Predict, Cosmos Reason 2 | Permissive open | March 2026 | Synthetic data + scene reasoning |
| Simulation | Isaac Sim 4.5, Isaac Lab | Free for development | Continuous | Sim-to-real training pipeline |
| Synthetic data | Replicator, Omniverse OpenUSD | Free | Continuous | Domain-randomized labeled data |
| Edge Compute | Jetson AGX Thor | Hardware (commercial) | Q1 2026 | On-robot inference |
The minimum viable use of the stack: Isaac Sim + GR00T N1.7. You can prototype a humanoid robot policy, validate it in simulation, and demonstrate behavior on a real robot using only free / open components, without buying any on-robot NVIDIA hardware. (Isaac Sim itself still needs an RTX-class GPU, owned or rented; see Chapter 5.) You only need Jetson Thor when you want fully on-board autonomous deployment. This makes the platform unusually approachable; getting started costs nothing but compute time.
The complete production setup (training cluster + simulation farm + per-robot Jetson Thor + supporting infrastructure) for a small humanoid robot fleet runs $200K-$500K in 2026, depending on scale. Compare against custom-stack robotics development, which historically required $2M-$10M of engineering before any robot moved. The platform’s economic story is as compelling as its technical one.
Chapter 3: Isaac GR00T N1.7 — Architecture and Capabilities
GR00T N1.7 is the current production-grade open foundation model for humanoid robots in the NVIDIA stack. Released in April 2026, it builds on the N1 baseline (released early 2025) with substantial improvements in reasoning, longer-horizon task execution, and cross-embodiment generalization. Understanding what’s inside the model and what it can / can’t do is the foundation for everything else in this guide.
Architecture. N1.7 is a 3B-parameter dual-system model loosely inspired by Daniel Kahneman’s “thinking fast and slow” framing. System 2 is a slower vision-language-reasoning module that processes the camera input and language instruction, producing a high-level action plan. System 1 is a faster motor-control module that converts the plan into continuous joint commands at robot-control rates (typically 30-100Hz). The two systems share representations through a learned interface; neither operates in isolation. The hybrid design lets the model think when reasoning is needed and react when latency is critical.
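The two-rate structure described above can be sketched in a few lines. This is an illustrative toy, not GR00T's implementation: `slow_plan` and `fast_act` are hypothetical stand-ins for System 2 and System 1, and the 5Hz planning rate and 30Hz control rate are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List

CONTROL_HZ = 30   # fast loop rate (System 1), assumed for this example
PLAN_HZ = 5       # slow loop rate (System 2), assumed for this example
TICKS_PER_PLAN = CONTROL_HZ // PLAN_HZ  # control ticks between replans


@dataclass
class Plan:
    """Hypothetical high-level plan: a target value the fast loop tracks."""
    target: float


def slow_plan(observation: float) -> Plan:
    # Stand-in for the slow reasoning module: derive a target from the observation.
    return Plan(target=observation + 1.0)


def fast_act(plan: Plan, state: float) -> float:
    # Stand-in for the fast motor module: a proportional step toward the target.
    return 0.5 * (plan.target - state)


def run(ticks: int, state: float = 0.0) -> List[int]:
    """Run the control loop and return the tick indices at which replanning happened."""
    replans = []
    for tick in range(ticks):
        if tick % TICKS_PER_PLAN == 0:
            plan = slow_plan(state)     # slow path runs every TICKS_PER_PLAN ticks
            replans.append(tick)
        state += fast_act(plan, state)  # fast path runs on every tick
    return replans


replan_ticks = run(ticks=30)
```

The design point is that the fast path never waits for the slow path: it keeps acting on the most recent plan until a new one arrives.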
Inputs. The model takes (1) RGB or RGB-D camera streams from one or more cameras mounted on the robot, (2) the robot’s proprioceptive state (joint positions, velocities, end-effector pose), and (3) a natural-language instruction from the user. Cross-embodiment training means the same model accepts inputs from different robot bodies — Figure 02 humanoids, Agility Digit, AGIBOT A2 — with embodiment-specific tokenization that the model learned to handle.
Outputs. Continuous action vectors at the robot’s control rate. For a humanoid with 30+ joints, that’s a 30+-dimensional action vector emitted at 30Hz. The action is interpreted by the robot’s downstream low-level controller, which handles joint-level current commands and physical actuation.
Training data. A mix of teleoperation demonstrations from real robots (~5% of training tokens), simulation rollouts in Isaac Sim with domain randomization (~60%), and Cosmos-generated synthetic data (~35%). The exact proportions are a research lever; the public release uses the configuration NVIDIA reports works best across their internal benchmarks.
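One way to picture the mix is as a weighted sampler over the three sources. The sketch below is a simplification under stated assumptions: the proportions quoted above are per training token, whereas this draws per example, and the source names are labels invented for illustration.

```python
import random
from collections import Counter

# Source weights taken from the proportions quoted above.
MIX = {"real_teleop": 0.05, "isaac_sim": 0.60, "cosmos_synthetic": 0.35}


def sample_sources(n: int, seed: int = 0) -> Counter:
    """Draw n source labels according to MIX and count how often each appears."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_sources(100_000)
fractions = {name: counts[name] / 100_000 for name in MIX}
```

Changing the ratio is then a one-line edit to `MIX`, which is what makes the proportions a convenient research lever.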
Capability profile. N1.7 zero-shots manipulation tasks (pick-and-place, simple tool use, basic dexterous manipulation) at ~50-65% success rates depending on environment complexity. With task-specific fine-tuning on 100-1,000 demonstrations, success rates rise to 85-95% on the target task while maintaining most of the zero-shot capability on other tasks. Cross-embodiment generalization works for similar humanoid morphologies; deploying to wildly different robots (industrial arms, mobile bases) requires more adaptation.
What it can’t do (yet). Long-horizon tasks (more than 30-60 seconds of coherent behavior) degrade. Tasks requiring fine dexterity (threading a needle, peeling a label) are mostly out of reach. Tasks involving deformable objects (folding clothes, manipulating soft materials) work intermittently. Tasks requiring physical reasoning beyond what the training data covers (using a tool the model has never seen) usually fail. Each of these is an active research direction; expect meaningful progress in N2 (end of 2026).
# Loading GR00T N1.7 from Hugging Face (Python)
import torch
from transformers import AutoProcessor, AutoModelForVisionLanguageAction
processor = AutoProcessor.from_pretrained("nvidia/GR00T-N1.7")
model = AutoModelForVisionLanguageAction.from_pretrained(
    "nvidia/GR00T-N1.7",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Inference loop (simplified)
def step(model, processor, image, instruction, robot_state):
    # Pack the camera frame, instruction, and proprioceptive state into tensors
    inputs = processor(
        images=image,
        text=instruction,
        robot_state=robot_state,
        return_tensors="pt",
    ).to(model.device)
    # Disable autograd bookkeeping for inference
    with torch.inference_mode():
        action = model.generate_action(**inputs)
    # Cast to float32 first: NumPy has no bfloat16 dtype
    return action.float().cpu().numpy()
# Typical use:
# action_vector = step(model, processor, camera_frame,
#                      "pick up the red block", current_joint_positions)
The model card on Hugging Face documents the supported embodiments, the recommended fine-tuning recipes, and the benchmark scores. Bookmark it; check it before every project — NVIDIA updates the recommendations as community feedback comes in.
Chapter 4: Cosmos World Foundation Models — Synthetic Data at Scale
The data problem in robotics is more severe than the data problem in language modeling ever was. There’s no internet of robot demonstrations to scrape. Real-robot demonstrations cost thousands of dollars per hour to collect. Sim demonstrations are free but suffer from the sim-to-real gap. The path forward, increasingly, is generative world models that produce synthetic data with the right statistics.
Cosmos Predict. A video generation model conditioned on initial frames + an action sequence. Given “here’s a robot arm in this scene; here’s the action to execute,” Cosmos Predict generates the resulting video — predicting what the world will look like after the action runs. For training data: generate millions of (initial state, action, resulting state) triples covering vastly more environments than you could collect in reality. For forward modeling at deployment: predict the consequence of a candidate action before executing it, enabling safer planning.
Cosmos Reason 2. A vision-language model trained specifically for physical-world reasoning. Questions like “is the cup full?”, “is the robot blocking the path?”, or “what would happen if the robot picked up the box from the side?” get sharply better answers from Cosmos Reason 2 than from general-purpose VLMs. Reason 2 powers the high-level reasoning component of GR00T’s System 2, but it’s also useful standalone for robot perception and verification tasks.
Cosmos as a data engine. The standard pipeline: define a task and a robot embodiment, use Cosmos Predict to roll out thousands of diverse trajectories under domain-randomized conditions, label each trajectory with success / failure based on automated checks, then use the labeled data to fine-tune GR00T or train a specialized policy. The generated data is physics-consistent (Cosmos was trained with constraint penalties for impossible trajectories), diverse (controllable randomization across textures, lighting, object placements), and labeled (the generation process knows the ground truth).
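The generate, label, and filter steps of that pipeline can be sketched without any model in the loop. Everything below is hypothetical scaffolding: `generate_rollout` stands in for a world-model call, the randomized fields and the 2cm success tolerance are invented for illustration, and a real pipeline would carry video frames and actions rather than three floats.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    lighting: float          # randomized scene parameter (hypothetical)
    object_offset_m: float   # randomized object placement (hypothetical)
    final_distance_m: float  # gripper-to-target distance at the end of the rollout
    success: bool = False


def generate_rollout(rng: random.Random) -> Trajectory:
    # Stand-in for a world-model rollout; a real pipeline would call the model here.
    return Trajectory(
        lighting=rng.uniform(0.2, 1.0),
        object_offset_m=rng.uniform(-0.3, 0.3),
        final_distance_m=rng.uniform(0.0, 0.10),
    )


def label(traj: Trajectory, tolerance_m: float = 0.02) -> Trajectory:
    # Automated check: the rollout counts as a success if it ends near the target.
    traj.success = traj.final_distance_m <= tolerance_m
    return traj


def build_dataset(n: int, seed: int = 0) -> List[Trajectory]:
    """Generate n rollouts under randomized conditions and label each one."""
    rng = random.Random(seed)
    return [label(generate_rollout(rng)) for _ in range(n)]


dataset = build_dataset(1000)
successes = [t for t in dataset if t.success]
```

Keeping the failures in `dataset` rather than discarding them is a deliberate choice here: labeled failures can be useful as negative examples, though whether to train on them is project-specific.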
Practical Cosmos invocation. The Cosmos models are accessed either through the open-source repository (heavyweight, requires GPU compute) or through NVIDIA’s NIM-hosted inference endpoints (no infrastructure, pay per use). For quick experimentation, the NIM endpoint is the fastest start:
# Cosmos Predict via NVIDIA NIM API
import os
import requests
API_KEY = os.environ["NVIDIA_API_KEY"]
ENDPOINT = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/cosmos-predict-7b"
payload = {
    "initial_image": "",
    "action_sequence": [[0.1, 0.0, -0.05, 0.0, 0.0, 0.0]] * 30,  # 30 steps
    "num_predicted_frames": 16,
    "frame_height": 256,
    "frame_width": 256,
}
response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
predicted_video_b64 = response.json()["predicted_frames"]
# Save or feed back into your training loop
Pricing in May 2026 for NIM-hosted Cosmos: roughly $0.03 per 16-frame prediction. At scale, self-hosting on your own GPU cluster is cheaper, but the operational overhead isn’t worth it for teams generating less than a few hundred thousand predictions per month.
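The hosted-versus-self-hosted decision comes down to simple arithmetic. The sketch below uses the per-prediction price quoted above; the self-hosting budget is a hypothetical placeholder, not a figure from NVIDIA, and should be replaced with your own estimate of GPU and operations cost.

```python
PRICE_PER_PREDICTION_USD = 0.03  # hosted price quoted above


def hosted_monthly_cost(predictions_per_month: int) -> float:
    """Monthly spend on hosted predictions at the quoted per-prediction price."""
    return predictions_per_month * PRICE_PER_PREDICTION_USD


def break_even_predictions(self_hosted_monthly_usd: float) -> float:
    """Monthly volume above which a fixed self-hosting budget becomes cheaper."""
    return self_hosted_monthly_usd / PRICE_PER_PREDICTION_USD


# Hypothetical figure for GPUs plus operations; substitute your own estimate.
ASSUMED_SELF_HOSTED_MONTHLY_USD = 9_000.0

cost_100k = hosted_monthly_cost(100_000)
break_even = break_even_predictions(ASSUMED_SELF_HOSTED_MONTHLY_USD)
```

With the assumed $9,000 monthly budget, the break-even lands at about 300,000 predictions per month, which is in line with the "few hundred thousand" threshold mentioned above.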
The sim-to-real reality. Synthetic data is not a magic bullet. Models trained purely on Cosmos-generated data underperform models trained on a mix of real, simulated, and Cosmos data. The right ratio is empirical and project-specific; for most production projects, real-world demonstrations remain the most-valuable-per-token, with simulation and Cosmos data providing scale that real demos can’t match. Plan for all three sources, not a single source.
Chapter 5: Isaac Sim and Isaac Lab — The Simulation Layer
Simulation is the keystone of the Physical AI workflow. You train policies in sim because real-robot training is too slow, too expensive, and too dangerous. You validate sim-to-real transfer because policies that work in sim don’t always work on real robots. You iterate fast because changing sim parameters takes minutes, whereas changing a real-world setup takes weeks.
Isaac Sim 4.5. The flagship simulator: GPU-accelerated physics, photoreal rendering via Omniverse, OpenUSD scene description, and Python-scriptable everything. The interactive UI is a full robotics IDE — load scenes, drop in robots, set up camera/sensor configurations, run physics, and tweak parameters in real time. The headless mode is what you use for actual training: spawn many parallel environments, run them at simulation rates of 10-100x real-time, and aggregate experience into your training pipeline.
Hardware requirements: an RTX-class GPU at minimum (RTX 4060 works for development; RTX 4090 is the sweet spot; H100/H200/B200 if you need parallel-environment scale). Linux is officially supported; Windows works for development but production training pipelines typically run on Linux clusters.
Isaac Lab. A purpose-built reinforcement learning training environment that runs on top of Isaac Sim with the GUI stripped out. Provides a clean API for defining environments, reward functions, observation/action spaces, and standard RL algorithms (PPO, SAC, DreamerV3 in the latest release). The killer feature: thousands of parallel environments running on a single GPU, dramatically increasing the sample throughput for RL training compared to traditional CPU-based simulators.
# Minimum Isaac Lab environment definition
from isaaclab.envs import RLTaskEnvCfg
from isaaclab.utils import configclass
from isaaclab.assets import RigidObjectCfg, ArticulationCfg
from isaaclab.scene import InteractiveSceneCfg
@configclass
class HumanoidPickPlaceEnvCfg(RLTaskEnvCfg):
    decimation = 2  # Run physics at 60Hz, control at 30Hz
    episode_length_s = 20.0
    num_observations = 64
    num_actions = 30  # 30-DoF humanoid
    scene = InteractiveSceneCfg(num_envs=4096, env_spacing=3.0)
    robot = ArticulationCfg(prim_path="{ENV_REGEX_NS}/Robot", ...)
    target_object = RigidObjectCfg(prim_path="{ENV_REGEX_NS}/Cube", ...)
# Train with PPO
# python train.py --task HumanoidPickPlace --headless --num_envs 4096
Running 4,096 parallel environments on a single H100 trains in hours the policies that took weeks on classical CPU simulators. The throughput is the unlock. Most modern robot policy training pipelines use Isaac Lab specifically because of this scaling.
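The relationship between physics rate, decimation, and sample throughput is plain arithmetic. A minimal sketch using the numbers from the configuration above (60Hz physics, decimation 2, 4,096 environments); the function names are made up for illustration.

```python
def control_rate_hz(physics_hz: float, decimation: int) -> float:
    """Policy (control) rate when the policy acts once every `decimation` physics steps."""
    return physics_hz / decimation


def policy_steps_per_sim_second(num_envs: int, physics_hz: float, decimation: int) -> float:
    """Policy transitions collected across all environments per second of simulated time."""
    return num_envs * control_rate_hz(physics_hz, decimation)


rate = control_rate_hz(physics_hz=60.0, decimation=2)
steps = policy_steps_per_sim_second(num_envs=4096, physics_hz=60.0, decimation=2)
```

That works out to 122,880 policy transitions per simulated second. Wall-clock throughput additionally depends on how much faster or slower than real time the simulator runs on your GPU, which this sketch does not model.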
Sim-to-real techniques baked in. Isaac Sim ships with first-class support for the techniques that close the sim-to-real gap: domain randomization (vary lighting, textures, masses, friction at every reset), system identification (calibrate sim parameters to match observed real-robot behavior), and curriculum learning (start with easy environments, progressively harden). The combination delivers sim-trained policies that transfer to real robots at acceptable success rates — not 100%, but high enough that fine-tuning on a small number of real demonstrations closes the remaining gap.
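Domain randomization and curriculum learning combine naturally: the curriculum controls how wide the randomization range is. The sketch below is framework-agnostic and does not use the Isaac Sim API; the parameter names, nominal values, 30% maximum spread, and linear ramp are all assumptions for illustration. System identification is omitted because it depends on real-robot measurements.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    mass_kg: float
    friction: float
    light_intensity: float


# Nominal values and maximum relative spread; all numbers are illustrative.
NOMINAL = PhysicsParams(mass_kg=0.20, friction=0.80, light_intensity=1.00)
MAX_SPREAD = 0.30


def randomize(rng: random.Random, difficulty: float) -> PhysicsParams:
    """Sample parameters around NOMINAL; `difficulty` in [0, 1] widens the range."""
    spread = MAX_SPREAD * max(0.0, min(1.0, difficulty))

    def jitter(value: float) -> float:
        return value * (1.0 + rng.uniform(-spread, spread))

    return PhysicsParams(
        mass_kg=jitter(NOMINAL.mass_kg),
        friction=jitter(NOMINAL.friction),
        light_intensity=jitter(NOMINAL.light_intensity),
    )


def curriculum_difficulty(episode: int, ramp_episodes: int = 1000) -> float:
    """Linear curriculum: start easy, reach full randomization after `ramp_episodes`."""
    return min(1.0, episode / ramp_episodes)


rng = random.Random(0)
easy = randomize(rng, curriculum_difficulty(0))     # no randomization yet
hard = randomize(rng, curriculum_difficulty(5000))  # full randomization range
```

In a training loop, `randomize` would be called at every environment reset with the current episode count, so early episodes see near-nominal physics and later ones see the full spread.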
OpenUSD as the scene format. All scene definitions in Isaac use OpenUSD (Universal Scene Description). This is the same format used in film/VFX and increasingly the default for digital twin platforms. The advantage: scenes are interoperable across tools (Blender, Houdini, Unity, Unreal Engine all read USD), and the format supports collaborative workflows. The catch: USD has a learning curve. Plan for a week of getting comfortable with the format before you’re productive in scene authoring.
Chapter 6: Cosmos Reason — Vision-Language Reasoning for Robots
Cosmos Reason 2 is the vision-language model that gives GR00T (and standalone applications) physical-world understanding. It’s the difference between “I see an object” and “I understand that this is a glass of water on the edge of a table that’s about to tip over.” For most robotics applications, the second kind of reasoning is what makes the system useful.
Architecture. Reason 2 is a 7B-parameter VLM trained on a massive corpus of physical-world video paired with text annotations describing physics, causality, object properties, spatial relationships, and possible actions. Compared to general-purpose VLMs (GPT-4o, Claude Opus, Gemini Ultra), Reason 2 has a noticeably better grasp of physical concepts: gravity, support relationships, material properties, kinematic constraints, and what would happen if a particular action were taken.
Where it fits. In the GR00T architecture, Reason 2 powers System 2 — the high-level reasoning module. Standalone, Reason 2 is useful for: scene understanding for robot perception (“describe the workspace”), action verification (“did the robot complete the task correctly?”), planning (“what’s a reasonable action sequence to achieve this goal?”), and human-robot interaction (“the human pointed at this object — what do they probably want?”).
# Calling Cosmos Reason 2 via NIM
import requests, base64, os
with open("workspace.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()
response = requests.post(
    "https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/cosmos-reason-2",
    headers={"Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}"},
    json={
        "image": img_b64,
        "prompt": "Describe the objects on the table and which ones are within reach of a humanoid robot standing 1m from the table.",
    },
    timeout=30,
)
print(response.json()["completion"])
Practical use cases. Three patterns dominate production deployments:
Pre-action verification. Before the robot executes a planned action, ask Reason 2 “is this action safe to perform in this scene?” The model catches obvious failure modes — items that would tip, paths that would collide, instructions that would damage objects — that the lower-level policy might miss. The latency cost is acceptable (50-200ms per check) for most tasks.
Post-action verification. After the robot acts, ask “did the action achieve the intended goal?” The visual-and-language grounding lets Reason 2 evaluate task success at a higher level than rule-based checks. Used as a reward signal for RL training, this dramatically reduces the engineering burden of writing reward functions by hand.
Conversational interaction. Reason 2 makes “what should I do next?” a sensible question for a human to ask the robot. The model integrates the current scene context with natural-language reasoning to suggest actions, ask for clarification, or report obstacles. Most demo videos showing humanoids “having conversations” with humans are using Reason-class models for the language understanding.
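The pre-action and post-action patterns above amount to wrapping each action in two checks. A minimal sketch, assuming the verifier is exposed as plain callables: `is_safe` and `goal_reached` are hypothetical stand-ins for calls to a reasoning model, and the lambdas at the bottom are toys so the example runs without a model or a robot.

```python
from typing import Callable, Dict


def guarded_execute(
    action: str,
    is_safe: Callable[[str], bool],
    execute: Callable[[str], None],
    goal_reached: Callable[[], bool],
) -> Dict[str, object]:
    """Run pre-action and post-action checks around a single action."""
    if not is_safe(action):
        # Pre-action verification failed: skip execution entirely.
        return {"executed": False, "success": False, "reward": 0.0}
    execute(action)
    success = goal_reached()  # post-action verification
    # The boolean verdict doubles as a sparse reward signal for RL.
    return {"executed": True, "success": success, "reward": 1.0 if success else 0.0}


# Toy stand-ins so the sketch runs without a model or a robot.
log = []
result_ok = guarded_execute(
    "place cup at table center",
    is_safe=lambda a: "edge" not in a,
    execute=log.append,
    goal_reached=lambda: True,
)
result_blocked = guarded_execute(
    "place cup at table edge",
    is_safe=lambda a: "edge" not in a,
    execute=log.append,
    goal_reached=lambda: True,
)
```

Passing the checks in as callables keeps the control code independent of how verification is implemented, so a rule-based check can be swapped for a model call (or the reverse) without touching the loop.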
Limits. Reason 2 is good at general physical-world reasoning, but it is not a domain expert. For specialized domains (surgical robotics, deep-sea operations, microscale manipulation), a fine-tuned variant or domain-specific model performs better. Reason 2 is also still subject to the standard VLM failure modes: confident hallucination on edge cases, susceptibility to misleading visual context, and occasional misunderstanding of spatial relationships when the camera angle is unusual. Use it as a tool, not an oracle.
Chapter 7: Jetson AGX Thor — Edge Compute for Production Robots
A robot that requires a tethered network connection to a data center is not really an autonomous robot. The ambition of Physical AI is fully on-board inference — the robot has all the compute it needs to run its foundation model on its body, in real time, without external dependencies. Jetson AGX Thor is NVIDIA’s hardware answer to that ambition.
Specifications. Thor delivers approximately 2,000 TOPS of AI compute at FP4 / sparse INT4. Memory is 128GB of unified LPDDR5X shared between CPU and GPU. Power draw is configurable from 60W (low-power mode) to 130W (full performance). The CPU is a 14-core Arm Neoverse cluster. The form factor is roughly the size of a paperback book — small enough to integrate into the torso of a humanoid robot or the chassis of a mobile platform.
What can run on Thor. A GR00T N1.7 model (3B parameters) at ~30Hz control rate with substantial headroom for additional perception models, sensor fusion, and local logic. Cosmos Reason 2 (7B parameters) at ~5Hz for verification and high-level reasoning. Multiple camera streams plus LiDAR processing concurrently. The 128GB of unified memory means you can keep multiple foundation models resident simultaneously and switch between them with negligible overhead.
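A back-of-envelope check shows why both models fit in memory at once. The sketch below counts weight storage only, assuming bf16 weights (2 bytes per parameter); activations, KV caches, and other workloads are not included, and the precision actually used on a given deployment may differ.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weights_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes); excludes activations and KV cache."""
    return params_billions * BYTES_PER_PARAM[dtype]


UNIFIED_MEMORY_GB = 128.0  # Thor unified memory, from the specifications above

groot_gb = weights_gb(3.0, "bf16")   # 3B-parameter GR00T N1.7
reason_gb = weights_gb(7.0, "bf16")  # 7B-parameter Cosmos Reason 2
remaining_gb = UNIFIED_MEMORY_GB - (groot_gb + reason_gb)
```

Under these assumptions the two models take about 20GB of weights together, leaving roughly 108GB for everything else.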
Deployment patterns. Three configurations are common in 2026:
Single-Thor humanoid. One Thor in the torso runs all the AI workloads. Used by Figure, Agility, AGIBOT for their commercial humanoids. Cost: roughly $4,000 per unit at NVIDIA’s commercial pricing.
Dual-Thor for redundancy. Two Thors in the body, one as primary and one as failover. Used in safety-critical deployments (medical, industrial) where a Thor failure must not stall the robot. Cost: roughly $8,500 per unit when including the redundancy hardware.
Edge cluster. Multiple Thors per robot for high-throughput perception (multiple cameras + LiDAR + radar fusion) plus inference. Used by autonomous-vehicle developers and large industrial robot platforms. Cost varies; typically $15K-$30K per platform.
# Deploying GR00T to Jetson Thor with TensorRT-LLM
# (run on the Thor itself; assumes JetPack 7+ installed)
# 1. Convert GR00T to TensorRT-LLM optimized format
trtllm-build \
    --model_dir nvidia/GR00T-N1.7 \
    --output_dir /opt/groot-trt \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 4096 \
    --paged_kv_cache enable
# 2. Run optimized inference
python -m trtllm.serve \
    --model /opt/groot-trt \
    --port 9001 \
    --max_concurrency 1
# 3. From the robot's main control loop (Python, not shell), hit the local endpoint
import httpx
r = httpx.post("http://localhost:9001/infer", json={...}, timeout=0.5)
Power and thermal envelope. 130W in a humanoid torso is non-trivial heat to dissipate. Production deployments use active cooling: small fans, heat pipes, or, in tight enclosures, liquid loops. The thermal design needs to handle sustained 130W for the duration of the robot’s operating time, which for a humanoid running an 8-hour shift means cooling has to actually work, not just survive a 5-minute demo.
The latency story. Local Thor inference for a GR00T forward pass is 25-35ms. Comparable inference routed through a 5G connection to a data-center GPU adds 30-100ms of round-trip latency depending on the link quality. For tasks requiring 30Hz control, on-board inference is the difference between “the robot moves smoothly” and “the robot stutters.” For applications where 5-10Hz is acceptable (slower manipulation, navigation), networked inference is viable.
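The latency argument can be checked with a simple budget calculation. The sketch below assumes the simplest possible control model, where every control step waits for one complete inference (plus a network round trip, if any); it ignores pipelining and any multi-step action output the real stack may use. The function names are made up for illustration.

```python
def max_control_rate_hz(inference_ms: float, network_rtt_ms: float = 0.0) -> float:
    """Highest control rate sustainable if each step waits for one full inference."""
    return 1000.0 / (inference_ms + network_rtt_ms)


def meets_rate(target_hz: float, inference_ms: float, network_rtt_ms: float = 0.0) -> bool:
    """True if one inference (plus any network round trip) fits in the control period."""
    return (inference_ms + network_rtt_ms) <= 1000.0 / target_hz


local_best = meets_rate(30.0, inference_ms=25.0)                        # on-board, best case
local_worst = meets_rate(30.0, inference_ms=35.0)                       # on-board, worst case
remote_30hz = meets_rate(30.0, inference_ms=25.0, network_rtt_ms=30.0)  # networked, best case
remote_5hz = meets_rate(5.0, inference_ms=35.0, network_rtt_ms=100.0)   # networked, worst case
```

Under this model, networked inference misses the 33.3ms period at 30Hz even in its best case, while it fits comfortably at 5Hz. Note that the upper end of the quoted on-board range (35ms) also slightly exceeds the 30Hz period, so sustaining 30Hz there would depend on effects this sketch leaves out.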
Chapter 8: Building Your First GR00T Pipeline (Hands-On Tutorial)
Theory enough. This chapter walks the complete workflow for getting GR00T N1.7 running in simulation against a humanoid task — from environment setup through evaluation. By the end you’ll have a working pipeline you can adapt to your own tasks.
Step 1: Environment setup. A Linux machine with an RTX 4090 or better is the minimum; an H100 makes everything faster. Install the Isaac Sim 4.5 omnipack:
# Install Omniverse Launcher and Isaac Sim
# (download from developer.nvidia.com/isaac-sim)
# Install Isaac Lab
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab
./isaaclab.sh --install
# Verify installation
./isaaclab.sh -p source/standalone/tutorials/00_sim/create_empty.py
Step 2: Pull GR00T N1.7.
# Authenticate with Hugging Face (one-time)
huggingface-cli login
# Pull the model and processor
python -c "
from transformers import AutoProcessor, AutoModel
processor = AutoProcessor.from_pretrained('nvidia/GR00T-N1.7')
model = AutoModel.from_pretrained('nvidia/GR00T-N1.7', torch_dtype='bfloat16')
print('Model loaded:', sum(p.numel() for p in model.parameters()) / 1e9, 'B params')
"
Step 3: Load a humanoid in Isaac Sim. The official Isaac Lab environments include a parametric humanoid (G1, the Unitree humanoid configuration). Spawn one in a simple workspace:
# Inside Isaac Lab, run a basic humanoid environment
./isaaclab.sh -p source/standalone/environments/zero_agent.py \
--task Isaac-Humanoid-G1-PickPlace-v0 \
--num_envs 4
# This opens the Isaac Sim viewer with 4 humanoid+table environments side by side
Step 4: Wire GR00T to drive the humanoid. Replace the zero-action controller with GR00T inference:
# groot_runner.py: replaces the default controller
import torch
from transformers import AutoProcessor, AutoModel
class GR00TController:
    def __init__(self, device="cuda"):
        self.processor = AutoProcessor.from_pretrained("nvidia/GR00T-N1.7")
        self.model = AutoModel.from_pretrained(
            "nvidia/GR00T-N1.7", torch_dtype=torch.bfloat16
        ).to(device).eval()
        self.device = device
    @torch.inference_mode()
    def step(self, image, instruction, joint_state):
        inputs = self.processor(
            images=image, text=instruction, robot_state=joint_state,
            return_tensors="pt"
        ).to(self.device)
        action = self.model.generate_action(**inputs)
        # Cast to float32 first: NumPy has no bfloat16 dtype
        return action.float().cpu().numpy().flatten()
# In your env loop:
controller = GR00TController()
obs = env.reset()  # reset once per episode, not on every step
for _ in range(env.episode_length):
    image = obs["camera_rgb"][0]
    state = obs["joint_pos"][0]
    action = controller.step(image, "pick up the red cube", state)
    # Assumes env.step returns the next observation dict, matching env.reset above;
    # Gymnasium-style environments return a tuple, so unpack accordingly.
    obs = env.step(action)
Step 5: Evaluate and iterate. Run the controller against the task for 100 episodes. Log success rate, mean episode length, mean reward. Expect ~50-65% success on a clean pick-and-place task with N1.7 zero-shot. For higher success, fine-tune on task-specific demonstrations (Chapter 9).
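The logging step above can be sketched as a small aggregator. This is a minimal example, not part of Isaac Lab: the `Episode` record is a hypothetical structure, the toy log uses made-up numbers, and the Wilson score interval is added here as one reasonable way to show how much uncertainty 100 episodes leave in a success rate.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    success: bool
    length_steps: int
    total_reward: float


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def summarize(episodes: List[Episode]) -> dict:
    """Aggregate success rate (with interval), mean episode length, and mean reward."""
    n = len(episodes)
    successes = sum(e.success for e in episodes)
    low, high = wilson_interval(successes, n)
    return {
        "success_rate": successes / n,
        "success_ci95": (low, high),
        "mean_length": sum(e.length_steps for e in episodes) / n,
        "mean_reward": sum(e.total_reward for e in episodes) / n,
    }


# Toy log: 58 successes out of 100 episodes (made-up numbers for illustration).
episodes = [Episode(i < 58, 400 + i, 1.0 if i < 58 else 0.0) for i in range(100)]
report = summarize(episodes)
```

With 100 episodes the interval spans roughly ten points on either side of the measured rate, so small differences between runs should be read with caution; more episodes narrow it.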
This is a working baseline. From here, you scale by varying instructions, adding more diverse objects, swapping the humanoid for a different embodiment, or fine-tuning the model. The point of the walkthrough is that you can get to “robot doing the right thing” in an afternoon — a milestone that took six engineer-months a few years ago.
Chapter 9: Fine-Tuning GR00T for Custom Tasks
Zero-shot capability is the demo; fine-tuning is the production reality. Most actual deployments fine-tune GR00T N1.7 on task-specific data to push success rates from 50-65% (zero-shot) to 85-95% (specialized). The fine-tuning recipe is straightforward and well-documented; this chapter is the operational playbook.
Data collection. The minimum viable dataset is 100 demonstrations of the target task. Each demonstration is a (camera frames, robot state, action) trace from start to successful completion. You collect demonstrations in one of three ways:
- Teleoperation on the real robot. Highest-quality data, slowest to collect, expensive at scale. Use for the gold standard and validation set.
- Teleoperation in simulation. Medium quality (sim-to-real gap), faster than real, free. Good for the bulk of training data.
- Cosmos-generated synthetic. Lower quality but unlimited scale. Use for diversity (object variations, lighting, environments).
Recommended mix. 100 real-robot teleoperation demos for the validation set, 1,000-5,000 simulation teleoperation demos for primary training, 10,000+ Cosmos-synthesized demos for variety. The mix balances quality (real) against scale (sim + Cosmos).
The fine-tuning script. Hugging Face PEFT (LoRA), used alongside Transformers, makes this straightforward:
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, TrainingArguments, Trainer
base_model = AutoModel.from_pretrained("nvidia/GR00T-N1.7", torch_dtype=torch.bfloat16)
# LoRA config: keeps trainable parameters small
lora_config = LoraConfig(
    r=16,  # Rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    # task_type is left unset: PEFT's TaskType enum has no VLA entry,
    # so get_peft_model falls back to the generic PeftModel wrapper
)
model = get_peft_model(base_model, lora_config)
training_args = TrainingArguments(
    output_dir="./groot-finetune-pickplace",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    bf16=True,
    save_strategy="epoch",
    logging_steps=50,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # Your custom DataLoader-compatible dataset
    eval_dataset=val_dataset,
)
trainer.train()
Training time: 4-12 hours on a single H100, depending on dataset size. The LoRA adapter is small (~50MB), so deployment is cheap — load the base model once, swap in task-specific adapters as needed.
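The adapter size follows from simple arithmetic: LoRA adds `r * (d_in + d_out)` trainable parameters per adapted weight matrix. The sketch below applies that formula. The hidden size and layer count are hypothetical placeholders chosen for illustration, not GR00T N1.7's actual architecture, so substitute the real dimensions before relying on the result.

```python
def lora_param_count(rank: int, shapes: list[tuple[int, int]], num_layers: int) -> int:
    """Trainable parameters added by LoRA: rank * (d_in + d_out) per adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


def adapter_megabytes(param_count: int, bytes_per_param: int = 2) -> float:
    """Approximate on-disk size; 2 bytes per parameter corresponds to BF16."""
    return param_count * bytes_per_param / 1e6


# HYPOTHETICAL dimensions for illustration only.
HIDDEN = 2048
LAYERS = 32
attention_shapes = [(HIDDEN, HIDDEN)] * 4  # q_proj, k_proj, v_proj, o_proj

params = lora_param_count(rank=16, shapes=attention_shapes, num_layers=LAYERS)
size_mb = adapter_megabytes(params)
```

Because the count scales linearly with rank and with the number of adapted matrices, doubling `r` or adding MLP projections to `target_modules` grows the adapter proportionally.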
Evaluation discipline. Don’t trust training-loss numbers. Evaluate on a held-out set of real-robot tasks, ideally on the actual hardware. The metric that matters is success rate on the deployment robot, not loss on a simulation validation set. Many fine-tuning runs look great in sim and fall apart on real hardware; the cure is real-hardware evaluation as a regular cadence, not just a final check.
Multi-task fine-tuning. If you need the robot to handle multiple tasks (pick-and-place + tool use + manipulation of different objects), fine-tune on a mixed dataset rather than training separate adapters per task. The shared model preserves general capability and avoids the inference overhead of switching adapters at runtime. The trade-off: per-task success rate is slightly lower than a single-task adapter would deliver. For most production deployments, the operational simplicity of one model wins.
Chapter 10: Performance, Cost, and Hardware Trade-Offs
Building Physical AI systems involves trade-offs at every layer: model size, simulation throughput, edge compute power budget, network latency. This chapter compiles the trade-offs that come up most often in real projects.
Model size vs latency. GR00T N1.7 (3B parameters) runs at 30Hz on Jetson Thor with comfortable headroom. A hypothetical N2 at 7B parameters would run at ~15Hz on the same hardware — borderline for some applications. Larger models give more capability; smaller models run faster. For control-rate-sensitive applications, the model size has to be picked with the deployment hardware in mind, not just capability.
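The control-rate constraint reduces to a time budget per step: a 30Hz loop leaves about 33ms for inference plus everything else. A minimal sketch of that check follows; the function names are illustrative, and any latency you pass in should come from measurements on your own hardware.

```python
def control_period_ms(rate_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


def fits_budget(inference_ms: float, rate_hz: float, overhead_ms: float = 0.0) -> bool:
    """True if model inference plus other per-step work fits in one control period."""
    return inference_ms + overhead_ms <= control_period_ms(rate_hz)


def max_rate_hz(inference_ms: float, overhead_ms: float = 0.0) -> float:
    """Highest control rate sustainable for a given per-step cost."""
    return 1000.0 / (inference_ms + overhead_ms)
```

For example, `fits_budget(25.0, 30.0)` holds, but adding 10ms of perception and communication overhead pushes the same model past the 30Hz budget.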
Simulation parallelism vs fidelity. Isaac Lab can run 4,096 parallel humanoid environments on an H100 at moderate physics fidelity. Drop to 1,024 environments and you can crank fidelity (full collision detection, rigid contact dynamics, accurate sensor simulation). The trade-off shapes your training pipeline: lower-fidelity high-parallelism for early exploration, higher-fidelity lower-parallelism for sim-to-real transfer training. Most projects use both at different phases.
Synthetic data quantity vs quality. Cosmos Predict can produce thousands of trajectories per hour on a single GPU. Quality varies — the generated frames are physics-consistent but not perfect; some trajectories have artifacts. The right ratio of “raw generation rate” to “post-filter clean trajectories” depends on how strict your filtering is. Plan for ~30-50% of generated trajectories to be usable after quality filtering.
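The usable fraction translates directly into how much raw generation to schedule. The sketch below, with an illustrative `raw_needed` helper, works backward from the number of clean trajectories you want.

```python
from math import ceil


def raw_needed(clean_target: int, usable_fraction: float) -> int:
    """Raw trajectories to generate so that clean_target survive quality filtering."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return ceil(clean_target / usable_fraction)


# Bracket the plan with the optimistic and pessimistic ends of the 30-50% range.
optimistic = raw_needed(10_000, 0.5)
pessimistic = raw_needed(10_000, 0.3)
```

Budgeting GPU hours against the pessimistic figure avoids coming up short when filtering turns out stricter than expected.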
| Decision | Option A | Option B | When A wins | When B wins |
|---|---|---|---|---|
| Model size | 3B (N1.7) | 7B (N2 / Reason) | Real-time control needed | Reasoning quality > speed |
| Edge compute | Single Jetson Thor | Dual Thor or Cloud | Cost-sensitive deployment | Safety-critical or large model |
| Training data | Real demos only | Sim + Cosmos heavy | Small task, narrow domain | Wide-coverage generalist policy |
| Sim parallelism | Many envs, low fidelity | Fewer envs, high fidelity | Early-stage exploration | Pre-deployment validation |
| Inference precision | BF16 | FP8/FP4 quantized | Development / max accuracy | Production / max throughput |
| Fine-tuning | Full model | LoRA adapter | Specialized embodiment | Standard humanoid + custom task |
Cost summary. A representative Physical AI project budget for a small humanoid pilot (single robot, single task, full pipeline):
- Robot platform (Figure 02 or equivalent): $150,000 – $250,000
- Jetson Thor (already in robot, if newer model): included
- Training infrastructure (4× H100 cluster for 6 months): $40,000 (cloud) or $200,000 (owned)
- Software / platform fees: $0 (open-source) to $50,000 (NVIDIA AI Enterprise + NIM credits)
- Engineering time (3 engineers × 6 months): $400,000 – $600,000
- Real-robot data collection (teleoperation): $20,000 – $80,000
- Total: $610,000 – $1,180,000 for a working pilot
That’s not cheap, but compare against pre-foundation-model robotics projects, which routinely ran $5M-$20M for similar capability. The economics are dramatically better — and getting better quarterly.
Chapter 11: Real Customer Deployments — Three Case Studies
Three teams using the NVIDIA Physical AI stack in 2026. Each picked a different shape of project, and each surfaced lessons that transfer.
Case Study 1: Figure 02 in BMW manufacturing.
Figure deployed humanoid robots at a BMW assembly plant in early 2026 for repetitive lifting and component-placement tasks. The deployment uses fine-tuned GR00T N1.7 running on Jetson AGX Thor for real-time control, with task-specific LoRA adapters for the three operations the robots perform on the line.
Architecture: each robot runs entirely on-board (single Thor, no network dependence for control). A facility-wide management system coordinates which task each robot performs at any time and pushes adapter updates over Wi-Fi when policies are revised. The robots interact with humans in an unstructured environment, requiring continuous safety reasoning via Cosmos Reason 2 in the verification loop.
What surprised the team: the sim-to-real gap was smaller than expected, but the human-interaction modeling was harder than expected. Workers approached the robots in ways the training data hadn’t covered — leaning over them, gesturing at them, occasionally bumping them — and the safety policies needed multiple iteration rounds to handle these gracefully.
Lessons: the standard sim-to-real pipeline works for manipulation; human-interaction edge cases need separate attention with real-environment data collection. Plan a multi-month “hardening phase” of real-world refinement on top of the simulation-trained baseline.
Case Study 2: Agility Digit warehouse fleet.
Agility’s Digit humanoids are deployed at a major logistics customer’s distribution center for tote-handling and palletizing. Fleet size: 20+ robots running coordinated workflows. The Physical AI stack underpins the autonomy: GR00T for manipulation, Isaac Sim for ongoing policy refinement, Jetson Thor on each robot for inference.
The architecture twist: a centralized scheduler running outside the robots assigns tasks (this tote needs to move from there to here), and each robot’s local Thor handles the autonomous execution. The split between centralized planning and on-robot autonomy is the design point that lets the fleet scale — adding robots is mostly mechanical (procurement + integration + commissioning) rather than requiring per-robot ML engineering.
What surprised the team: data collection across the fleet became a major operational pillar. Every robot’s session logs feed back into the training pipeline, producing a continuously-improving fleet policy. The cost: significant infrastructure for log ingestion, indexing, and training pipeline orchestration. The benefit: policy improvements deploy weekly across the fleet with documented success-rate gains.
Lessons: fleet deployments unlock a data flywheel that single-robot deployments can’t. Plan for the data infrastructure from the start; bolting it on later is painful.
Case Study 3: AGIBOT A2 home-care pilot.
AGIBOT (a Chinese humanoid robot maker) deployed A2 robots in a small home-care pilot — assisting elderly residents with basic household tasks (fetching items, picking up dropped objects, simple conversational support). The pilot is small (12 robots in 12 homes) but the workflow is the most challenging of the three case studies because the environment is fully unstructured.
The architecture: GR00T N1.7 fine-tuned heavily on home-environment demonstrations, Cosmos Reason 2 for natural conversation and high-level planning, Jetson Thor on-robot, and a tablet interface for residents to give instructions. Network connectivity is used for telemetry and software updates, but never for real-time control.
What surprised the team: language understanding was the bottleneck, not motor control. Elderly residents speak in patterns the model’s training data didn’t fully cover (regional dialects, halting speech, contextual ambiguity). Reason 2 needed substantial fine-tuning on senior-care-specific dialogues before performance was acceptable.
Lessons: foundation models work best in domains where their training data resembles the deployment environment. Home environments and elderly users represent a domain gap that pure simulation can’t close. Real-environment data collection — even small amounts — was disproportionately valuable here.
Chapter 12: Pitfalls, Roadmap, and What’s Next
Eight months of community experience deploying the Physical AI stack has surfaced consistent failure modes and consistent next-frontier topics. This closing chapter distills both — the pitfalls to avoid in your project, and the trajectories to watch over the next twelve months.
Common pitfalls.
Pitfall 1: Underestimating sim-to-real transfer. Policies that work flawlessly in simulation often degrade significantly on real hardware. The gap is most severe for visual perception (rendering doesn’t perfectly match real cameras) and contact dynamics (sim physics misses subtle real-world friction and compliance effects). Plan for a real-world fine-tuning phase as a normal part of the pipeline, not an afterthought.
Pitfall 2: Not collecting enough real demonstrations. Synthetic data is amazing for scale but cannot replace real-world examples for the long tail of edge cases. Budget for at least 100 real-robot demonstrations per target task, ideally more. Teams that try to skip this step consistently underperform at deployment.
Pitfall 3: Ignoring safety until late in the project. Foundation-model-driven robots fail in subtle ways — they execute plausible-looking actions that are actually wrong. Safety verification (Cosmos Reason in the loop, hard limits on action magnitudes, e-stop systems) needs to be designed in from day one, not bolted on before deployment.
Pitfall 4: Over-relying on zero-shot capability. GR00T’s zero-shot performance is impressive demo material but rarely production-grade. Plan for fine-tuning on your specific tasks; budget the data collection and training time accordingly.
Pitfall 5: Ignoring data infrastructure. The data flywheel is the long-term advantage. If your project scales beyond a single robot or a single task, you need robust data ingestion, labeling, training, and deployment pipelines. This infrastructure is harder to build than the model training itself; budget for it.
Pitfall 6: Treating the simulator as ground truth. Isaac Sim is excellent but not perfect. Validation that runs only in sim misses real-world failure modes. Real-hardware testing has to be a regular cadence, not a one-time gate.
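Pitfall 3 mentions hard limits on action magnitudes. A minimal sketch of such a limiter is shown here, assuming the policy emits a list of per-joint commands and that per-joint limits come from your robot's safety specification; `clamp_action` is an illustrative name, and a real deployment would enforce this in the low-level controller as well.

```python
def clamp_action(action: list[float], limits: list[float]) -> list[float]:
    """Clip each commanded value to the symmetric range [-limit, +limit]."""
    if len(action) != len(limits):
        raise ValueError("action and limits must have the same length")
    if any(lim < 0 for lim in limits):
        raise ValueError("limits must be non-negative")
    return [max(-lim, min(lim, a)) for a, lim in zip(action, limits)]


# The second and third commands exceed their limits and are clipped.
safe = clamp_action([0.5, -2.0, 0.1], [1.0, 1.0, 0.05])
```

Logging every clipped command is worthwhile: a policy that hits the limits often is telling you something about its training distribution.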
Roadmap watch.
GR00T N2 (end of 2026). Expected to more than double zero-shot success rates on novel tasks compared to N1.7. Larger model (likely 7B parameters), better generalist capability, longer-horizon planning. The biggest near-term capability jump in the stack.
Cosmos generative improvements. Cosmos Predict is improving its physics consistency quarterly. Expect synthetic data quality to approach real-data quality for many tasks within 18 months — at which point the data scarcity problem is structurally solved for most applications.
Edge compute progression. Jetson Thor is one generation. The Thor successor (rumored for 2027 announcement) is expected to deliver 2x compute at similar power, enabling larger on-robot models.
Cross-embodiment generalization. Current GR00T models work well on humanoid platforms; cross-embodiment to industrial arms or mobile robots is partial. Research is converging on truly embodiment-agnostic policies that transfer across radically different morphologies. Production availability: likely 2027-2028.
Safety standards. ISO and IEC working groups are developing standards for foundation-model-driven robots. The standards will matter for regulated deployments (medical, certain industrial). First drafts likely in 2026 H2; standards-track adoption probably 2027-2028.
What’s next for builders.
The Physical AI stack is now production-ready for many applications, with continuing improvements in data, models, hardware, and software shipping every quarter. For teams considering investment: start now, plan for iteration. The capabilities will get materially better over the next 12-18 months, but waiting for the perfect platform leaves you behind on the operational learnings that compound. Build a small pilot in 2026 with the current stack, develop the team and infrastructure capability, and you’ll be positioned to deploy at scale as the platform matures into 2027 and 2028.
The deeper observation: robotics is going through the same transition that NLP went through 2018-2022. Foundation models are replacing custom-engineered pipelines. Open-weight base models are becoming the substrate for fine-tuned specialists. The economics of building robotic systems are shifting from “expensive custom engineering” to “cheap iteration on shared infrastructure.” That shift compounds. The teams that adapt fastest will define the next decade of robotics.
Chapter 13: Multi-Robot Coordination Patterns
A single robot is a project; a fleet is a system. Once you cross from “one robot doing one task” to “many robots collaborating across an environment,” the engineering problem changes shape. The NVIDIA stack supports fleet operation, but it requires deliberate architecture choices. This chapter walks the three coordination patterns that actually work in production and the design trade-offs each one imposes.
Pattern 1: Centralized scheduler, autonomous execution. Most production fleets use this pattern. A central scheduler — running on a server, not on the robots — assigns tasks to individual robots based on availability, location, and capability. Each robot runs its assigned task autonomously using its on-board GR00T model and Jetson Thor compute. Coordination between robots happens at the scheduler level, not the policy level: the scheduler ensures two robots aren’t sent to the same workspace at the same time, but doesn’t manage joint trajectories or instantaneous motion.
Used by: Agility’s warehouse deployments, most industrial humanoid pilots. The clean split between centralized planning and on-robot autonomy makes the system scale linearly — adding robots is a procurement and commissioning task, not an ML engineering task.
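# Sketch of a scheduler API; Task, Robot, and RobotStatus are application-defined types
from __future__ import annotations
from typing import Optional

class FleetScheduler:
    def __init__(self, robots: dict[str, Robot]):
        self.robots = robots  # Keyed by robot_id

    def assign_task(self, task: Task) -> Optional[Robot]:
        candidates = [r for r in self.robots.values() if r.is_idle and r.can_perform(task)]
        if not candidates:
            return None  # No idle robot can take this task right now
        # Pick the closest available robot
        return min(candidates, key=lambda r: r.distance_to(task.location))

    def heartbeat(self, robot_id: str, status: RobotStatus) -> None:
        self.robots[robot_id].update(status)
        if status.state == "completed":  # Assumes RobotStatus exposes a `state` field
            self.assign_next_task_for(robot_id)  # Queue lookup left to the application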
Pattern 2: Peer-to-peer coordination via shared world model. A more decentralized pattern: robots communicate state to each other (via a shared message bus or a fleet-level event store) and each robot’s policy considers the state of nearby robots when making decisions. The advantage: more graceful handling of dynamic environments where centralized scheduling can’t keep up. The disadvantage: harder to reason about, harder to debug, more prone to emergent behaviors.
Used by: research deployments, swarm robotics applications. Production deployments tend to start centralized and add peer-to-peer coordination only when the centralized pattern hits hard limits — which is rare for current robot densities (under 100 robots in a single space).
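The shared-state idea behind Pattern 2 can be sketched with an in-memory store: each robot publishes its latest pose, and each policy queries for fresh neighbors within a radius. The `SharedWorldModel` class below is a hypothetical stand-in for a real message bus or event store, and planar `(x, y)` positions are a simplifying assumption.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class RobotState:
    robot_id: str
    x: float          # Planar position in metres
    y: float
    timestamp: float  # Seconds, from a clock shared by the fleet


class SharedWorldModel:
    """In-memory stand-in for a fleet-level event store."""

    def __init__(self, stale_after_s: float = 1.0):
        self._latest: dict[str, RobotState] = {}
        self.stale_after_s = stale_after_s

    def publish(self, state: RobotState) -> None:
        # Keep only the most recent state per robot
        self._latest[state.robot_id] = state

    def neighbors(self, robot_id: str, radius_m: float, now: float) -> list[RobotState]:
        """Other robots with a fresh state within radius_m of the given robot."""
        me = self._latest[robot_id]
        return [
            s for s in self._latest.values()
            if s.robot_id != robot_id
            and now - s.timestamp <= self.stale_after_s
            and hypot(s.x - me.x, s.y - me.y) <= radius_m
        ]
```

Dropping stale entries matters as much as the distance check: a policy that plans around a position reported several seconds ago is one source of the emergent behaviors this pattern is prone to.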
Pattern 3: Leader-follower coordination. One robot acts as the leader, planning the multi-robot operation, and delegates sub-tasks to follower robots. The leader has higher-level reasoning capability (can run Cosmos Reason continuously); followers run smaller policies focused on execution. This pattern is common in construction-robotics and complex manipulation scenarios where one robot needs to oversee others.
Used by: certain construction robotics deployments, surgical assistance settings. Less common than the centralized pattern in commercial settings as of 2026.
The communication substrate. All three patterns need a low-latency, reliable communication layer between the robots, the scheduler, or both. Common choices: ROS 2 with DDS for tightly-coupled real-time communication, MQTT for less-latency-sensitive coordination, gRPC for structured task assignment. NVIDIA’s Isaac ROS provides a curated set of ROS 2 nodes that integrate cleanly with the rest of the stack; it’s the path of least resistance for most fleet projects.
Failure modes specific to fleets. When one robot fails, the fleet has to absorb the failure gracefully. The scheduler reassigns the failing robot’s tasks; nearby robots may need to adjust their plans if they were depending on the failing robot’s contribution. Plan failure handling at design time. Regular chaos-engineering exercises (deliberately fail a robot, watch the fleet recover) catch issues before they manifest in production.
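One common way to detect a failed robot is a heartbeat timeout, with the interrupted task pushed back onto the queue. The sketch below assumes that approach; `FailureMonitor` and its method names are illustrative, and timestamps are passed in explicitly so the logic is easy to exercise in a chaos test.

```python
from collections import deque


class FailureMonitor:
    """Marks robots as failed when heartbeats stop, and requeues their tasks."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}
        self.active_task: dict[str, str] = {}
        self.pending: deque[str] = deque()

    def heartbeat(self, robot_id: str, now: float) -> None:
        self.last_seen[robot_id] = now

    def start_task(self, robot_id: str, task_id: str) -> None:
        self.active_task[robot_id] = task_id

    def sweep(self, now: float) -> list[str]:
        """Return robots whose heartbeat is overdue; requeue any task they held."""
        failed = [rid for rid, seen in self.last_seen.items()
                  if now - seen > self.timeout_s]
        for rid in failed:
            task = self.active_task.pop(rid, None)
            if task is not None:
                self.pending.appendleft(task)  # Retry interrupted work first
            del self.last_seen[rid]
        return failed
```

A chaos exercise then amounts to withholding one robot's heartbeats and checking that its task reappears at the front of `pending`.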
The data flywheel. A fleet of N robots generates N times the operational data of a single robot. That data, fed back into the training pipeline, produces continuously-improving policies. The asymmetry is structural: solo-robot deployments don’t get this benefit. Plan from the start to ingest fleet telemetry into a training data lake; the compounding gains over months are substantial.
Chapter 14: Sim-to-Real Transfer Techniques in Depth
Closing the sim-to-real gap is the single most important engineering challenge in foundation-model robotics. A policy that performs flawlessly in Isaac Sim and fails on the real robot is the canonical failure mode. The good news: a well-developed toolkit of techniques largely solves the problem. This chapter covers the techniques worth knowing.
Technique 1: Domain Randomization. The foundational technique. During simulation training, randomize every parameter that varies in the real world: lighting (intensity, color, direction), textures (surfaces, objects), masses, friction coefficients, sensor noise, camera intrinsics, robot link properties. The policy learns to be robust to all these variations, which translates into robustness to the real-world variations it encounters at deployment.
Implementation is well supported: Isaac Sim exposes the Replicator API, and Isaac Lab exposes event terms like the ones in the configuration that follows. Key parameters to randomize for manipulation tasks: object friction (0.1-1.0), object mass (50%-150% of nominal), camera position (small perturbations), lighting (full diversity), workspace floor texture (full diversity).
# Domain randomization in Isaac Lab
import isaaclab.envs.mdp as mdp
from isaaclab.managers import EventTermCfg as EventTerm
from isaaclab.managers import SceneEntityCfg

# randomize_dome_light is assumed to be a project-specific event function;
# it is not provided by isaaclab.envs.mdp, so define it in your own task package.
events = {
    "randomize_object_mass": EventTerm(
        func=mdp.randomize_rigid_body_mass,
        params={"asset_cfg": SceneEntityCfg("object"),
                "mass_distribution_params": (0.5, 1.5),  # 50%-150% of nominal
                "operation": "scale"},
        mode="reset",
    ),
    "randomize_friction": EventTerm(
        func=mdp.randomize_rigid_body_material,
        params={"asset_cfg": SceneEntityCfg("object"),
                "static_friction_range": (0.1, 1.0),
                "dynamic_friction_range": (0.1, 1.0),
                "restitution_range": (0.0, 0.3),
                "num_buckets": 64},
        mode="reset",
    ),
    "randomize_lighting": EventTerm(
        func=randomize_dome_light,
        params={"intensity_range": (300.0, 3000.0),
                "color_range": ((0.7, 1.0), (0.7, 1.0), (0.7, 1.0))},  # Per-channel RGB
        mode="reset",
    ),
}
Technique 2: System Identification. The complement to domain randomization: measure the real-robot dynamics carefully and tune simulation parameters to match. The technique works well when the real-world variation is small (a single robot, well-characterized) and produces sim-trained policies that transfer to the calibrated robot with minimal domain randomization. Combine with domain randomization for the best of both: calibrate the mean, randomize around it.
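"Calibrate the mean, randomize around it" can be sketched in a few lines: take repeated real-robot measurements of a parameter, center the randomization range on their mean, and sample from that range at each reset. The function names and the 20% default spread are illustrative assumptions, not recommended values.

```python
import random
from statistics import mean


def calibrated_range(measurements: list[float], spread: float = 0.2) -> tuple[float, float]:
    """Randomization bounds centered on the measured mean, widened by +/- spread."""
    if not measurements:
        raise ValueError("need at least one measurement")
    center = mean(measurements)
    return (center * (1 - spread), center * (1 + spread))


def sample_parameter(bounds: tuple[float, float], rng: random.Random) -> float:
    """Draw one value for the simulator, e.g. at environment reset."""
    low, high = bounds
    return rng.uniform(low, high)


# Example: friction coefficients measured on the real gripper pads (made-up values).
friction_bounds = calibrated_range([0.4, 0.5, 0.6], spread=0.2)
friction = sample_parameter(friction_bounds, random.Random(0))
```

The resulting bounds can be fed into whatever range parameters your simulator's randomization terms accept, replacing a wide generic range with one anchored to the actual hardware.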
Technique 3: Real-World Fine-Tuning. After sim training, collect a small number of real-robot rollouts (50-200 demonstrations) and fine-tune the policy on the real data. The fine-tune is typically lightweight (LoRA adapter, a few epochs) and dramatically improves real-world performance. The trade-off: fine-tuning specializes the policy to the specific deployment environment, sacrificing some of the broad generalization capability the foundation model provided.
Technique 4: Photorealistic Rendering. Isaac Sim’s Omniverse-powered rendering produces images that are dramatically closer to real-camera output than older simulators. This matters for vision-based policies: training on rendered images that look like real images means the model doesn’t need to bridge a large visual domain gap at deployment. Always train with photorealistic rendering enabled, even though it’s slower than fast-rendering modes.
Technique 5: Action-Space Engineering. The choice of action space affects sim-to-real transfer. Joint-position commands are sensitive to inertia and contact dynamics that simulators get imperfectly. End-effector position commands (Cartesian-space targets that downstream controllers convert to joint commands) are more robust. For most manipulation tasks, end-effector action spaces transfer better, even though they introduce more dependence on the downstream controller.
Technique 6: Sensor Noise Injection. Real cameras are noisier than sim cameras. Real depth sensors have specific failure modes (specular surfaces, dark scenes, range limits) that sim doesn’t model perfectly. Inject realistic noise into sim sensor outputs during training. The policy learns to handle the noise, and real-world performance improves.
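A minimal sketch of depth-sensor noise injection follows, covering three of the effects named above: Gaussian measurement noise, random pixel dropout, and a maximum range. It uses plain Python lists to stay dependency-free; a production pipeline would typically vectorize this on arrays. The default noise values are placeholders to be replaced with figures from your sensor's datasheet or your own measurements.

```python
import random


def corrupt_depth(depth_m: list[list[float]], rng: random.Random,
                  noise_std_m: float = 0.01, dropout_prob: float = 0.02,
                  max_range_m: float = 5.0) -> list[list[float]]:
    """Return a noisy copy of a depth image in metres; 0.0 marks an invalid reading."""
    noisy = []
    for row in depth_m:
        out_row = []
        for d in row:
            if d > max_range_m or rng.random() < dropout_prob:
                out_row.append(0.0)  # Beyond sensor range, or a dropped pixel
            else:
                # Additive Gaussian noise, floored at zero
                out_row.append(max(0.0, d + rng.gauss(0.0, noise_std_m)))
        noisy.append(out_row)
    return noisy
```

Passing an explicit `random.Random` instance keeps the corruption reproducible per episode, which helps when debugging a policy failure back to a specific noisy frame.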
Technique 7: Iterative Sim-Real Loop. The most powerful technique: train in sim → deploy on real → identify failure modes → update sim to capture those failure modes → retrain → redeploy. Each iteration closes specific gaps. Over 3-5 iterations, the gap typically shrinks from “30% real-world success rate after sim training” to “85%+ real-world success rate after iterative training.”
What doesn’t work as well as you might think. Pure synthetic-data-only training tends to underperform mixed training, even with the best generative models. Deploying a sim-trained policy without any real-world adaptation rarely works. Nor does high-fidelity simulation close the gap on its own: it helps, but it’s not sufficient. The techniques compose; no single technique is enough.
Chapter 15: The Data Pipeline — Collection, Labeling, Curation
Talk to any team that’s deployed Physical AI to production for more than six months and they’ll tell you the same thing: the model is the easy part; the data pipeline is the hard part. This chapter walks the data infrastructure that mature deployments build, broken into the three operational phases — collection, labeling, and curation.
Collection: where the data comes from. Production deployments draw from four data sources:
- Real-robot teleoperation. A human operator drives the robot through demonstrations of the task. Highest-quality data, slowest to collect (one demo = task duration), expensive at scale ($50-$200 per demo when including operator time and equipment). Used for the gold standard.
- Real-robot autonomous rollouts. Once a basic policy works, deploy it and capture every rollout — successes and failures. The successes provide reinforcement; the failures provide exactly the edge cases the policy is weak on. Captures more data per dollar than teleoperation, but requires a working baseline policy.
- Simulation rollouts. Run the policy in Isaac Sim with domain randomization, capture every trajectory. Cheap (essentially free at the marginal level), unlimited scale, but suffers from sim-to-real gap.
- Cosmos-generated synthetic. Use Cosmos to generate diverse trajectories beyond what was actually executed. Cheap at moderate scale, exceptional for diversity, but quality is lower than real or sim sources.
The right blend: 10-20% real, 50-70% sim, 20-30% Cosmos. Adjust based on task complexity — narrower tasks benefit from a higher real-data fraction; broader generalization benefits from more synthetic.
Labeling: ground truth assignment. Each data point needs labels for training to work: did the rollout succeed? At what step? Was the action correct? The labeling process is where data pipelines often bottleneck.
- For sim and Cosmos data, labels are automatic — the simulator knows whether the goal state was reached. Free, instant, accurate.
- For real-robot data, labels typically come from sensor verification (checking final state with a depth camera or weight scale), human review (a screen showing the rollout, a “success / fail / unclear” button), or model-based verification (Cosmos Reason 2 evaluating “did the robot complete the task?”). Each has trade-offs.
- For partial-success rollouts, labeling is harder. A pick-and-place where the robot picked up the object but dropped it — was that a success or failure? Define your label taxonomy explicitly; ambiguous taxonomies produce inconsistent training data.
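An explicit label taxonomy can be as small as an enum plus one function that encodes the rules. The sketch below shows one possible taxonomy for pick-and-place, in which a grasp followed by a drop is labeled as partial rather than forced into success or failure; the label names and rules are illustrative choices, not a standard.

```python
from enum import Enum


class RolloutLabel(Enum):
    SUCCESS = "success"
    PARTIAL = "partial"   # e.g. object grasped but dropped before placement
    FAILURE = "failure"
    UNCLEAR = "unclear"   # Reviewer could not decide; route to a second review


def label_pick_and_place(grasped: bool, placed: bool,
                         reviewer_unsure: bool = False) -> RolloutLabel:
    """Apply the taxonomy rules in a fixed order so every rollout gets one label."""
    if reviewer_unsure:
        return RolloutLabel.UNCLEAR
    if grasped and placed:
        return RolloutLabel.SUCCESS
    if grasped:
        return RolloutLabel.PARTIAL
    return RolloutLabel.FAILURE
```

Keeping the rules in code rather than in a reviewer guideline document means every labeling path, whether human, sensor, or model, resolves ambiguous cases the same way.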
Curation: filtering and weighting. Not every collected data point should go into training. Bad data hurts. The curation pipeline filters and weights the dataset to maximize signal.
- Filter degenerate trajectories. Sim rollouts that exploit physics bugs, real demonstrations where the robot hardware glitched, Cosmos generations with obvious physics violations.
- Filter trivial trajectories. Rollouts where the task is too easy contribute less than rollouts that exercise edge cases. Bias the sampler toward harder examples.
- Weight by source quality. Real demonstrations might be weighted 2-3x higher than synthetic during training. The exact weights are hyperparameters worth tuning.
- Maintain a holdout set. Never train on the validation set. Use holdouts to measure real performance, not just training-loss progress.
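Two of the curation steps above lend themselves to a short sketch: weighting by source during sampling, and a holdout split that stays stable as the dataset grows. The code assumes each trajectory is a dict with `"id"` and `"source"` keys, which is a hypothetical schema; the 3x weight for real data is one point in the 2-3x range mentioned above.

```python
import hashlib
import random

# Real demonstrations weighted 3x relative to synthetic sources (a tunable choice).
SOURCE_WEIGHTS = {"real": 3.0, "sim": 1.0, "cosmos": 1.0}


def is_holdout(trajectory_id: str, fraction: float = 0.1) -> bool:
    """Deterministic split: the same ID always lands on the same side."""
    digest = hashlib.sha256(trajectory_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < fraction


def sample_batch(trajectories: list[dict], batch_size: int, rng: random.Random,
                 weights: dict[str, float] = SOURCE_WEIGHTS) -> list[dict]:
    """Sample training trajectories with replacement, biased by source quality."""
    train = [t for t in trajectories if not is_holdout(t["id"])]
    w = [weights[t["source"]] for t in train]
    return rng.choices(train, weights=w, k=batch_size)
```

Hashing the ID, rather than shuffling and slicing, means a trajectory assigned to the holdout set stays there across retraining runs, so it never leaks into training as new data arrives.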
Storage and retrieval. A mature fleet’s data lake easily reaches petabyte scale within a year. The infrastructure pattern: cloud object storage (S3, GCS, Azure Blob) for raw trajectories, columnar data warehouse (BigQuery, Snowflake, Databricks) for metadata and labels, dedicated GPU storage tier (NVMe, parallel filesystem) for active training datasets. Pull frequently-accessed data into the GPU tier; archive cold data to cheaper storage.
Schemas: standardize on a trajectory format from the start. Common choices: HDF5 for individual trajectories, Parquet for metadata, the Hugging Face Datasets format for distribution. Don’t reinvent the format; the community has converged.
Operational rhythm. Mature data pipelines run continuously: data flows in from the fleet, gets labeled within minutes (for automated labels) or hours (for human review), enters the curation pipeline overnight, and contributes to weekly model retraining. The “model improvement cadence” of a healthy operation is once per week — measurable improvement in policy success rate every Monday from the prior week’s collected data.
Chapter 16: Comparing NVIDIA’s Stack with Alternatives
The NVIDIA Physical AI stack is the most complete platform in 2026, but it’s not the only one. Several credible alternatives address pieces of the stack or provide end-to-end alternatives. Knowing what’s available — and what trade-offs each option implies — informs better build-versus-buy decisions.
Physical Intelligence (PI). A robotics startup with its own foundation models (PI-0, PI-1) and a SaaS-style API for deploying robot policies. Closed-source, proprietary architecture, but strong customer references in the early months of 2026. The pitch: turnkey deployment with PI handling the model infrastructure and customers focusing on robot integration. Competing directly with the openness story of NVIDIA’s stack; wins for teams that prioritize “no model training needed” over “full control.”
Skild AI. A foundation-model-for-robots startup with a different architectural bet — a single very large generalist policy intended to handle radically different embodiments. Closed-source, early commercial deployments. Best fit for teams that need cross-embodiment generalization beyond what GR00T currently delivers.
Boston Dynamics + Spot AI. Boston Dynamics ships its Spot robot with proprietary autonomy stack, with growing third-party AI extension support. Vertically integrated — robot hardware and AI come together — with the trade-off being significantly less flexibility than the NVIDIA-on-arbitrary-robot pattern. Wins for teams that want a complete robot product, not just a platform.
Open-source alternatives. Several open initiatives compete with parts of NVIDIA’s stack. RoboFlamingo and OpenVLA are open VLA models. MuJoCo and PyBullet are open simulators. The Open X-Embodiment dataset offers cross-embodiment training data. The composability story is rougher than NVIDIA’s integrated stack, but the “no vendor lock-in” story is compelling for some.
| Option | Foundation model | Sim | Edge compute | Openness | Best for |
|---|---|---|---|---|---|
| NVIDIA Physical AI stack | GR00T (open) | Isaac Sim (free) | Jetson Thor (NVIDIA HW) | Open weights + free sim, NVIDIA HW | Most teams; full-stack flexibility |
| Physical Intelligence (PI) | PI-0/1 (closed) | Hosted | Customer-supplied | Closed | Teams wanting turnkey API |
| Skild AI | Skild generalist (closed) | Hosted | Customer-supplied | Closed | Cross-embodiment-heavy use cases |
| Boston Dynamics + Spot AI | Proprietary | Internal | Built into robot | Closed (robot + AI bundled) | Teams buying a complete robot product |
| Open-source patchwork (OpenVLA + MuJoCo + custom) | OpenVLA / RoboFlamingo | MuJoCo / PyBullet | Customer-chosen | Fully open | Research; teams avoiding vendor lock-in |
The pragmatic recommendation. For most commercial deployments in 2026, the NVIDIA stack is the right starting point. It’s the most mature, most documented, with the largest community and the deepest set of partners. The openness of the foundation models means you’re not vendor-locked even though you’re using NVIDIA tooling. The integration story is meaningfully cleaner than open-source patchworks. Pick alternatives only when you have specific needs that NVIDIA’s stack doesn’t address — and validate that the alternative actually addresses your need before committing.
The exception: if you’re building a research project or a thesis that needs full reproducibility and full ownership, the open-source patchwork is reasonable despite the higher integration burden. For commercial deployments where time-to-deployment matters, NVIDIA’s integrated stack wins on velocity by enough margin to be the obvious answer.
Chapter 17: Cost-of-Ownership Model for a Production Fleet
What does it actually cost to operate a Physical AI fleet for a year? This chapter builds a representative TCO model — a 50-robot warehouse fleet doing tote-handling — to show the cost structure. Numbers are anchored to real 2026 pricing; adjust to your scenario.
Capital expenditures (year 0):
- 50 humanoid robots at $200K each: $10M
- Training infrastructure (4× H100 cluster owned): $200K
- Networking and facility integration: $300K
- Initial software licenses (NIM credits, AI Enterprise): $150K
- Engineering NRE for fleet customization: $800K
- Total capex: $11.45M
Annual operating expenditures:
- Robot maintenance (8% of robot capex): $800K
- Power and connectivity (50 robots × ~500W avg × 8,760 hrs × $0.10/kWh): $22K
- Cloud infrastructure (training, monitoring, data pipeline): $400K
- Software licenses (annual renewal): $300K
- Engineering team (3 ML engineers + 2 robotics engineers + 1 ops): $1.4M
- Data labeling and curation services: $200K
- Insurance (AI coverage, business operations): $250K
- Total annual opex: $3.37M
Three-year TCO and per-task cost. Capex amortized over 3 years: $3.82M/year. Total annual cost: $7.19M. Three-year total: $21.6M. The fleet handles approximately 5M tasks per year: 50 robots × 274 working days × 16 hours × ~25 tasks/hour comes to about 5.5M at full utilization, which lands near 5M once the 90% uptime assumption discussed below is applied. Effective cost per task: $7.19M / 5M ≈ $1.44.
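Compare against the human cost for equivalent throughput: 100 warehouse workers at $50K/year fully-loaded would cost $5M/year, or about $1.00 per task at roughly the same task volume. While the capex is being amortized, robot economics don’t beat human economics: $1.44 versus $1.00 per task. Once the capex is paid down, and with the policy continuously improving, the math shifts: $3.37M opex/year for 5M tasks = $0.67/task, better than human cost on a per-task basis. On an annual-cost basis the crossover therefore arrives when the three-year amortization ends. On a cumulative-cash basis it takes longer: recovering $11.45M of capex at about $1.63M/year in savings ($5M minus $3.37M) takes roughly seven years under these assumptions, unless robot prices fall or throughput rises.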
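The arithmetic above is easy to reproduce and adapt. The following is a minimal sketch that uses only the line items listed in this chapter; the variable names are illustrative, and the payback calculation (capex divided by annual savings versus the human baseline) is one simple definition of break-even that this sketch assumes, not the only one.

```python
# Line items copied from the capex and opex lists in this chapter (USD).
CAPEX = {
    "robots": 50 * 200_000,
    "training_cluster": 200_000,
    "networking_facility": 300_000,
    "initial_licenses": 150_000,
    "engineering_nre": 800_000,
}
OPEX = {
    "maintenance": 800_000,
    "power_connectivity": 22_000,
    "cloud": 400_000,
    "licenses": 300_000,
    "engineering_team": 1_400_000,
    "data_labeling": 200_000,
    "insurance": 250_000,
}
TASKS_PER_YEAR = 5_000_000           # rounded figure used in the text
HUMAN_COST_PER_YEAR = 100 * 50_000   # 100 workers, fully loaded
AMORTIZATION_YEARS = 3

capex_total = sum(CAPEX.values())
opex_total = sum(OPEX.values())
annual_cost = capex_total / AMORTIZATION_YEARS + opex_total

cost_per_task_amortized = annual_cost / TASKS_PER_YEAR
cost_per_task_opex_only = opex_total / TASKS_PER_YEAR
human_cost_per_task = HUMAN_COST_PER_YEAR / TASKS_PER_YEAR

# Simple cash payback: years of opex savings needed to recover the capex.
annual_savings = HUMAN_COST_PER_YEAR - opex_total
payback_years = capex_total / annual_savings

print(f"capex ${capex_total / 1e6:.2f}M, opex ${opex_total / 1e6:.2f}M/yr")
print(f"per task: ${cost_per_task_amortized:.2f} amortized, "
      f"${cost_per_task_opex_only:.2f} opex-only, "
      f"${human_cost_per_task:.2f} human")
print(f"cash payback: {payback_years:.1f} years")
```

Swap in your own line items and task volume; the structure matters more than the specific numbers.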
Sensitivity analysis. Key levers and their impact on three-year TCO:
- Robot price drop to $150K each. Saves $2.5M of capex. Most likely single source of cost improvement over time.
- Cloud-based training instead of owned cluster. Saves $200K capex, adds $150K/year opex: about $250K more over three years, roughly 1% of TCO. Effectively a wash; choose based on capacity stability needs.
- Skip NIM licensing. Saves $300K/year opex. Trade-off is more in-house engineering time on inference optimization.
- Fleet size grows to 100 robots. Capex roughly doubles, opex grows ~70% (some scale economies). Per-task cost drops ~12% from amortizing fixed costs over more tasks.
- Task complexity increases (longer, harder tasks). Per-task time grows, fewer tasks per robot per day, per-task cost rises proportionally. This is the single most sensitive variable in the model.
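The levers above can be checked with a few lines of arithmetic. This is a rough sketch under the chapter's own figures; the function names and the scenario dictionary are illustrative, and the throughput scenario simply assumes task volume scales linearly with tasks per hour.

```python
# Baseline figures from this chapter (USD).
BASE_OTHER_CAPEX = 200_000 + 300_000 + 150_000 + 800_000  # non-robot capex
BASE_OPEX = 3_372_000                                     # sum of opex items


def three_year_tco(robot_price=200_000, robots=50,
                   other_capex=BASE_OTHER_CAPEX, annual_opex=BASE_OPEX,
                   years=3):
    """Capex plus `years` of opex, with no discounting."""
    return robots * robot_price + other_capex + years * annual_opex


def cost_per_task(tasks_per_hour, base_rate=25, base_tasks=5_000_000):
    """Amortized per-task cost if throughput scales with tasks per hour."""
    tasks = base_tasks * tasks_per_hour / base_rate
    return three_year_tco() / 3 / tasks


baseline = three_year_tco()
scenarios = {
    "robot price $150K": three_year_tco(robot_price=150_000),
    "cloud training": three_year_tco(
        other_capex=BASE_OTHER_CAPEX - 200_000,
        annual_opex=BASE_OPEX + 150_000),
    "skip NIM licensing": three_year_tco(annual_opex=BASE_OPEX - 300_000),
}
deltas = {name: tco - baseline for name, tco in scenarios.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta / 1e6:+.2f}M over three years")
print(f"per task at 25/hr: ${cost_per_task(25):.2f}, "
      f"at 12.5/hr: ${cost_per_task(12.5):.2f}")
```

Halving throughput doubles the per-task cost, which is why task complexity dominates the other levers.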
What the model gets wrong on purpose. The model assumes the robots achieve 90% uptime in year one — aggressive for new deployments. It assumes policy quality is sufficient to do useful work from month two — also aggressive. It doesn’t include the “long tail” of integration costs (custom end-effectors, environment modifications, safety review) that often add 10-30% to project costs. Use the model as a directionally-correct first-pass; do detailed bottom-up estimates for actual procurement decisions.
The strategic question. Robot fleet economics in 2026 are competitive with human labor for many tasks but not all. The trajectory is clearly toward favorable economics across more task categories — robots get cheaper, policies get better, integration cost falls. The strategic question for buyers: at what point does deployment become advantaged enough to lock in long-term? For most warehouse and structured-industrial use cases, that point is now or within 12 months. For unstructured environments (homes, retail, hospitality), the answer is likely 2027-2028. Plan accordingly.
Chapter 18: The Sim-to-Real Iteration Loop in Practice
Beyond individual transfer techniques, mature Physical AI deployments establish an iteration cadence that compounds improvements over months. This chapter walks the operational playbook: how to structure a sim-real iteration loop, what cadence works in practice, and how to know when you’re converging.
The weekly cadence. Most production teams converge on a one-week iteration cycle. Monday-Tuesday: deploy the latest policy to a small subset of real robots (10-20% of the fleet, or the single robot in a solo deployment) and collect rollouts. Wednesday: triage failures, identify the top 3-5 failure modes that account for the bulk of failed rollouts. Thursday: update simulation to capture those failure modes (new domain-randomization parameters, new scenario variations, new edge cases). Friday: kick off retraining over the weekend. Following Monday: deploy and repeat.
The cadence works because the cycle time matches human attention spans (engineers can hold the context for a week’s worth of issues), GPU training time (a weekend of training fits a meaningful retraining job), and operational reality (humans need weekends; robots tolerate the same).
Failure-mode taxonomy. Categorizing failures is the most important diagnostic skill. Rollouts that fail can be tagged into:
- Perception failures. The model misidentified what it was looking at. Camera produced an image the policy mishandled.
- Planning failures. The model identified the scene correctly but chose the wrong action. Reasoning broke.
- Execution failures. Plan was right but motor execution missed. Often a control or hardware issue, not a model issue.
- Environmental failures. Something unexpected happened in the world (object moved, human entered, hardware glitched). Rare and hard to fix.
Tag every failure. Track the distribution by category. The biggest category is where to focus the next iteration’s improvements. Teams that don’t categorize tend to spread effort across all categories equally — which is rarely the right allocation.
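Tracking the distribution takes very little code. The sketch below assumes failures are tagged by hand during triage; the enum, the function name, and the sample counts are all made up for illustration.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    PERCEPTION = "perception"
    PLANNING = "planning"
    EXECUTION = "execution"
    ENVIRONMENTAL = "environmental"


def failure_distribution(tags):
    """Return (mode, count, share) tuples, largest category first."""
    counts = Counter(tags)
    total = sum(counts.values())
    return [(mode, n, n / total) for mode, n in counts.most_common()]


# Made-up tags for one week of failed rollouts.
week_tags = (
    [FailureMode.PERCEPTION] * 14
    + [FailureMode.PLANNING] * 6
    + [FailureMode.EXECUTION] * 3
    + [FailureMode.ENVIRONMENTAL] * 1
)

distribution = failure_distribution(week_tags)
top_mode, top_count, top_share = distribution[0]
for mode, count, share in distribution:
    print(f"{mode.value:14s} {count:3d}  {share:.0%}")
print(f"focus next iteration on: {top_mode.value}")
```

In this sample week, perception accounts for more than half of the failures, so that is where the next iteration's effort goes.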
The iteration metrics. Five metrics worth tracking week-over-week:
- Real-robot success rate on the target task (the headline number).
- Sim success rate on the same task (should track real success rate; divergence signals sim-real gap regression).
- Per-failure-category counts (which categories are improving, which are stuck).
- Mean episode length on success (shorter episodes = more efficient policy).
- Real-robot data volume (how many new rollouts collected this week, feeding the next training).
Plot these over time. The curves tell the story. Plateauing real-robot success rate while sim success rate continues to climb is a clear signal that you’ve maxed out what sim can teach you and need more real-world data. Steady real-rate climb with occasional plateaus is normal; plot trends, not raw weekly numbers.
When to declare convergence. A policy is converged when: real-robot success rate has plateaued for 4+ consecutive weeks, the failure mode distribution is dominated by genuinely hard cases (not bugs), and additional iterations produce noise rather than signal. At that point, the policy is as good as the current data and methods can make it; further improvements require more data, a better base model, or a different architecture.
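The plateau criterion can be automated as a first-pass check. This is a minimal sketch, assuming weekly success rates are stored as fractions; the function name and the 1-percentage-point tolerance are assumptions, and the other two criteria (failure mix and signal versus noise) still need human judgment.

```python
def has_plateaued(weekly_success, weeks=4, tolerance=0.01):
    """True if the best of the last `weeks` values beats the best earlier
    value by no more than `tolerance`."""
    if len(weekly_success) <= weeks:
        return False  # not enough history to compare against
    prior_best = max(weekly_success[:-weeks])
    recent_best = max(weekly_success[-weeks:])
    return recent_best - prior_best <= tolerance


# Made-up weekly real-robot success rates.
still_climbing = [0.52, 0.60, 0.67, 0.73, 0.78, 0.82]
flattened = [0.52, 0.70, 0.84, 0.88, 0.88, 0.885, 0.88, 0.887]

print(has_plateaued(still_climbing))
print(has_plateaued(flattened))
```

Comparing against the best earlier value, rather than week-to-week deltas, keeps a single noisy week from triggering or hiding a plateau.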
Don’t over-iterate. Once converged, the engineering effort shifts to other tasks or to architectural improvements. Teams that keep grinding on a converged policy waste effort.
The version-control discipline. Every policy that goes to a real robot needs a version, stored with: the training data set version, the model checkpoint, the simulation environment version, the evaluation results. When something regresses in production, you need to know what changed. Git for code, plus dedicated experiment tracking (MLflow, Weights & Biases, NVIDIA Experiment Manager) for the data and model lineage. Lacking this infrastructure is the single most common reason teams can’t reproduce or roll back from regressions.
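One lightweight way to capture that lineage is a release record stored next to every deployed policy. The sketch below is illustrative only: the class, field names, and version strings are hypothetical, and a real setup would write the record into whichever experiment tracker the team already uses.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class PolicyRelease:
    policy_version: str
    dataset_version: str
    checkpoint: str
    sim_env_version: str
    eval_results: dict


def changed_fields(old, new):
    """Names of the fields that differ between two releases."""
    a, b = asdict(old), asdict(new)
    return sorted(name for name in a if a[name] != b[name])


# Hypothetical releases for illustration.
v12 = PolicyRelease(
    policy_version="policy-v12",
    dataset_version="demos-2026-05-04",
    checkpoint="ckpt/step-18000",
    sim_env_version="sim-env-3.2",
    eval_results={"real_success": 0.88},
)
v13 = replace(
    v12,
    policy_version="policy-v13",
    dataset_version="demos-2026-05-11",
    checkpoint="ckpt/step-21000",
    eval_results={"real_success": 0.84},
)

manifest = json.dumps(asdict(v13), sort_keys=True, indent=2)
diff = changed_fields(v12, v13)
print(manifest)
print("changed since v12:", diff)
```

When v13 regresses, the diff shows that the dataset and checkpoint changed while the simulation environment did not, which narrows the search immediately.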
Bug bash culture. Quarterly, run a focused bug bash on the deployed policy. The full team takes 2-3 days to deliberately break the policy — present it with edge cases, weird scenes, adversarial instructions. The failures discovered fuel a focused improvement sprint. This finds long-tail issues that the regular iteration cycle misses because they’re not in the natural rollout distribution.
Chapter 19: Safety, Certification, and Regulatory Standards
Physical AI systems interact with the physical world. They can hurt people. Regulatory frameworks for AI-driven robotics are still developing, but the broad strokes are clear and the operational implications are increasingly serious. This chapter covers what every team building Physical AI needs to know about safety engineering and the emerging certification landscape.
The functional safety baseline. Most jurisdictions require robotic systems that operate around humans to comply with established functional-safety standards: ISO 13849 (machinery safety), ISO 10218 (industrial robots), ISO/TS 15066 (collaborative robots), or domain-specific standards (IEC 62304 for medical device software, ISO 26262 for automotive). These standards predate AI-driven robotics and were written for deterministic systems with predictable behavior. They still apply to AI-driven robots, and demonstrating compliance is harder for a learned policy than for a traditional deterministic controller.
The core requirement: a safety system that operates independently of the AI policy and prevents the policy from taking actions that violate safety constraints. In practice this means: hard speed and force limits enforced at the controller level, geometric safety zones the robot cannot enter regardless of what the AI says, hardware emergency stops that cut motor power on demand, watchdog timers that halt the robot if the AI hasn’t issued a command in time. These are not optional. They are the deterministic guard rails inside which the AI is allowed to operate.
The architecture pattern. Safety-engineered Physical AI systems separate the AI policy from the safety system at the architectural level:
- The AI policy proposes actions (computed by GR00T or equivalent).
- The safety supervisor checks proposed actions against safety constraints (limits on velocities, forces, joint angles, geometric zones).
- Safe actions pass through to the motor controllers; unsafe actions are filtered, modified, or halted.
- If the AI policy goes silent, the safety supervisor brings the robot to a safe stop.
The safety supervisor itself runs on a separate compute path — often a dedicated safety-certified microcontroller, not the same Jetson Thor that runs the AI. This separation is critical for certification: if the AI is non-deterministic, the safety system that contains it has to be deterministic and verifiable independent of the AI.
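The "policy goes silent" case is usually handled by a watchdog timer. The following is a logic-only sketch: a certified implementation would live in firmware on the safety microcontroller, not in Python, and the class name, timeout value, and injectable clock are all assumptions made so the example can run deterministically.

```python
import time


class CommandWatchdog:
    """Flags a timeout when no command has arrived within `timeout_s`."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_command = clock()

    def command_received(self):
        """Call whenever the AI policy delivers a fresh command."""
        self._last_command = self._clock()

    def expired(self):
        """True if the policy has been silent for longer than the timeout."""
        return self._clock() - self._last_command > self.timeout_s


# Drive the watchdog with a fake clock so the demo is repeatable.
fake_now = [0.0]
watchdog = CommandWatchdog(timeout_s=0.1, clock=lambda: fake_now[0])

fake_now[0] = 0.05
watchdog.command_received()       # policy issued a command at t = 0.05 s
fake_now[0] = 0.12
ok_while_fresh = watchdog.expired()   # 0.07 s of silence: still fine
fake_now[0] = 0.30
stop_required = watchdog.expired()    # 0.25 s of silence: trigger safe stop

print(ok_while_fresh, stop_required)
```

The supervisor polls `expired()` on its own fixed-rate loop and commands a safe stop as soon as it returns true, independent of anything the AI compute is doing.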
Verification challenges. Traditional safety certification relies on formal verification or exhaustive testing of system behavior. AI policies are not formally verifiable in the strict sense, and exhaustive testing of foundation-model-driven robots is intractable. Certification of AI-driven robots is therefore moving toward a hybrid approach: certify the safety supervisor (deterministic, verifiable), characterize the AI policy with statistical methods (success rates, failure modes, expected vs unexpected behaviors), and require operational monitoring that catches drift in production.
The standards bodies — ISO/IEC working groups, IEEE — are converging on this approach. Drafts of “AI-augmented robotics safety” standards are expected to publish in late 2026 and 2027. Production deployments today are operating under interim guidance that anticipates the formal standards; teams that align with the interim guidance will have an easier path to formal certification when the standards finalize.
Domain-specific considerations.
- Industrial settings. Most regulated; longest-established standards. Compliance is a known problem with known solutions, even for AI-driven systems.
- Medical settings. Regulatory clearance is required (FDA in the US, CE marking under the Medical Device Regulation in the EU) and the bar is high. Plan for 18-36 months from prototype to deployment, with substantial verification work along the way.
- Home / consumer settings. Less mature regulatory framework, but liability exposure is significant. Consumer-product safety standards (UL, IEC) cover the basics; specific AI-related standards are emerging.
- Public spaces / commercial settings. Mixed jurisdiction-specific regulations, often requiring local permits and compliance assessments.
Insurance and liability. The new “AI Coverage” insurance products entering the market in 2026 specifically cover liability gaps for AI-driven systems — including hallucinations, unexpected actions, and edge-case failures. For commercial Physical AI deployments, AI-specific coverage is becoming a normal part of the operational stack alongside traditional product-liability and general-liability insurance.
The operational discipline. Beyond design-time certification, ongoing operations matter. Maintain detailed logs of every robot action and every safety-system intervention. Review unusual events promptly. When a safety-system intervention fires, treat it as a potential leading indicator — the AI was trying to do something the safety system disallowed, and that’s worth understanding even if the safety system did its job. Operational discipline is what catches drift before it becomes incidents.
Chapter 20: Strategic Planning for Builders
You’ve made it through the technical depth. The closing chapter is about strategy — how to think about Physical AI as a long-term capability rather than a one-time project, what the next 24 months look like, and the decisions builders should be making now to position for them.
The capability curve. GR00T N1.7 represents real but limited capability. The deployment economics work for narrow tasks in structured environments. Over the next 18 months, three things will improve:
- Foundation models will get materially better. GR00T N2 (end of 2026) is expected to roughly double zero-shot success rates and significantly extend long-horizon planning. Beyond N2, a generation every 12-18 months is the likely cadence.
- Synthetic data quality will rival real-data quality for many tasks, making large-scale data collection cheap.
- Edge compute will keep improving on the same Moore’s Law plus AI accelerator curve we’ve seen for the past five years.
Each axis compounds with the others. The trajectory is for Physical AI deployment economics to become favorable for progressively less structured environments (homes, retail, public spaces) over the late 2020s. No specific date is predictable, but the trend is unmistakable.
The build-now-or-wait calculus. A common question: should we build a Physical AI capability today, or wait for the technology to mature further? The answer depends on what you’re trying to accomplish.
For commercial deployments at scale, waiting is rarely correct. The operational learnings — data infrastructure, fleet operations, safety engineering, customer integration — take 12-18 months to build regardless of model capability. Teams starting now will have those capabilities mature when the next model generations land. Teams waiting will be 12-18 months behind on the operational side regardless of how good the future models are.
For research projects, the calculus depends on the research question. Pure capability research benefits from waiting for better models. Methodology research (how to deploy, how to validate, how to scale) benefits from starting now with current models. Read the research question literally to make the call.
For early-stage startups, the answer is almost always “start now.” Markets reward the team that proves a use case first; the technology will catch up to the ambition.
For enterprises evaluating Physical AI, the answer is “pilot now, scale later.” A small pilot with current technology builds organizational capability and surfaces the integration challenges your specific environment will face. The cost of a pilot is small relative to the cost of being unprepared when the technology matures further.
Talent strategy. Physical AI engineering is a hybrid discipline — robotics + ML + systems engineering. Pure roboticists need to learn ML; pure ML engineers need to learn physical-system constraints. The talent market in 2026 is tight; expect to invest in training existing engineers rather than hiring fully-formed Physical AI experts. The investment pays back: trained-up internal engineers stay longer and integrate better than external hires.
Partnership strategy. NVIDIA’s stack benefits from a broad partner ecosystem. Identify the partners relevant to your use case: robot platform vendors (Figure, Agility, AGIBOT, Universal Robots, etc.), data labeling vendors, integration consultancies, simulation specialists. Engaging the right partners cuts months off your timeline and de-risks specific subsystems. Don’t build everything yourself; the integration story is good enough that partner-on-non-core-pieces, build-on-core-differentiators is the right strategy for most teams.
Long-term investment in data. The single most durable competitive advantage in Physical AI is your dataset. Foundation models are commodity-trending; data collected from your specific environment, your specific tasks, your specific customers is not. Invest in the data infrastructure as if it’s the most important asset you’re building, because over a 5-year horizon, it is.
Watching the regulatory horizon. Engage with standards bodies and regulatory frameworks early. Companies that participate in standards development have meaningful influence over the standards that emerge. Companies that ignore standards until they’re finalized have to retrofit their systems to comply. The time investment in standards engagement is small relative to the operational cost of belated compliance.
Closing thought. Physical AI in 2026 is at the same kind of inflection point that NLP was at in 2021. The technology works well enough for real applications. The infrastructure is open and accessible. The economics are improving quarterly. The teams that get serious now will define the field over the next decade. The teams that wait for “the technology to be ready” will discover that “ready” never has a clean threshold; it’s a gradient, and you’re always either on the leading edge or chasing it.
The NVIDIA Physical AI stack — Isaac GR00T, Cosmos, Isaac Sim, Jetson Thor — is the most accessible path to that leading edge today. Use it. Build something. Iterate. The next twelve months will reward the builders who stopped planning and started shipping.
Chapter 21: Five Recipes That Work — Battle-Tested Workflow Patterns
Twenty chapters of context eventually have to compress into actionable recipes. This chapter compiles five concrete workflow patterns that production teams use repeatedly. Each is a tested recipe — copy, adapt, ship.
Recipe 1: Single-task fine-tune with sim-collected demonstrations. The standard recipe for taking GR00T from “zero-shot 50%” to “production 90%” on a specific task. Steps:
- Define the task precisely. Pick-and-place a specific object class to a specific destination zone, with measurable success criteria.
- Set up an Isaac Sim environment that matches your real workspace as closely as practical. Include domain randomization on lighting, textures, and object placement.
- Collect 1,000 simulation demonstrations using a teleoperation interface or a scripted policy. Verify each demonstration is a clean success (no failed picks, no near-misses).
- Generate 5,000 additional demonstrations using Cosmos Predict for diversity (lighting variations, viewing angles, object variations).
- Collect 100 real-robot demonstrations as the gold-standard validation set.
- Fine-tune GR00T N1.7 with LoRA for 5 epochs at a learning rate of 1e-4.
- Evaluate on the real-robot validation set. Iterate on data collection if success rate is below target.
Time budget: 3-4 weeks for a fluent team, 6-8 weeks if learning the stack from scratch. End result: 85-95% success on the target task.
Recipe 2: Multi-task generalist policy. When the robot needs to handle a family of related tasks rather than a single one. Use the same base recipe but combine training data across tasks:
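```python
# Assumes PyTorch datasets and the Hugging Face Trainer.
from torch.utils.data import ConcatDataset
from transformers import Trainer

# PickPlaceDataset, AssemblyDataset, and HandoffDataset stand for your own
# task-specific Dataset classes; peft_model and training_args come from the
# LoRA fine-tuning setup in Recipe 1.

# Combine task-specific datasets with task identifiers in the prompt
combined_dataset = ConcatDataset([
    PickPlaceDataset(prompt_prefix="pick the {object} and place it on the {destination}"),
    AssemblyDataset(prompt_prefix="assemble the {part_a} onto the {part_b}"),
    HandoffDataset(prompt_prefix="hand the {object} to the human"),
])

# Train one model on the combined data
trainer = Trainer(
    model=peft_model,
    train_dataset=combined_dataset,
    args=training_args,
)
trainer.train()
```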
The natural-language instruction at runtime tells the policy which task it’s executing. Per-task success drops 5-10 percentage points compared to single-task fine-tunes, but the operational simplicity (one model, one deployment) often wins.
Recipe 3: Sim-to-real bootstrapped real-data collection. When you have a real robot but no demonstrations, and you want to start collecting real demos efficiently. Bootstrap with a policy trained in sim, deploy on the real robot to collect autonomous rollouts (some succeed, some fail), use Cosmos Reason 2 to label success/failure, feed back into training.
- Train a baseline policy purely in sim with aggressive domain randomization.
- Deploy on the real robot. Run it 100-500 times on the target task. Capture all rollouts.
- Label each rollout: Cosmos Reason 2 evaluates “did the robot accomplish the goal?” with the goal stated as natural language.
- Filter to successful rollouts. Add to training data.
- Fine-tune the policy. Re-deploy. Iterate.
The flywheel: each iteration adds more real-world data, the policy gets better, more rollouts succeed, more data flows in. Three to five iterations typically converge to production-grade success rates.
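The label-and-filter step in the middle of that loop can be sketched as follows. The `Rollout` record and the `judge` callable are hypothetical: in practice the judge would wrap whatever interface you use to ask Cosmos Reason 2 whether the goal was accomplished, and here a stub stands in for it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    rollout_id: str
    goal: str        # natural-language goal given to the policy
    video_path: str  # recorded camera footage of the attempt


def keep_successes(
    rollouts: List[Rollout], judge: Callable[[Rollout], bool]
) -> Tuple[List[Rollout], float]:
    """Return the rollouts the judge marks successful, plus the success rate."""
    kept = [r for r in rollouts if judge(r)]
    rate = len(kept) / len(rollouts) if rollouts else 0.0
    return kept, rate


def stub_judge(rollout: Rollout) -> bool:
    """Placeholder for a real success classifier; keys off the made-up ID."""
    return rollout.rollout_id.endswith("-ok")


goal = "place the tote on the shelf"
rollouts = [
    Rollout("r1-ok", goal, "r1.mp4"),
    Rollout("r2-fail", goal, "r2.mp4"),
    Rollout("r3-ok", goal, "r3.mp4"),
    Rollout("r4-fail", goal, "r4.mp4"),
]

training_additions, success_rate = keep_successes(rollouts, stub_judge)
print([r.rollout_id for r in training_additions], success_rate)
```

Logging the success rate from each pass gives a direct read on whether the flywheel is actually turning from one iteration to the next.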
Recipe 4: Safety-supervised deployment. The recommended pattern for any production deployment around humans. The AI policy proposes actions; a separate safety supervisor approves or filters them.
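```python
class SafetySupervisor:
    """Deterministic filter between the AI policy and the motor controllers."""

    def __init__(self, max_velocity, max_force, geometric_zones):
        self.max_v = max_velocity
        self.max_f = max_force
        self.zones = geometric_zones
        self.last_rejection = None

    def filter(self, proposed_action, robot_state):
        """Return an action that is safe to execute, or None to reject."""
        action = proposed_action
        # 1. Check velocity limits. Scale down rather than reject, then keep
        #    going so the scaled action still faces the checks below.
        if any(abs(v) > self.max_v for v in action.joint_velocities):
            action = self.scale_to_velocity_limit(action)
        # 2. Check force limits
        if estimate_force(action) > self.max_f:
            return self.reject("force too high")
        # 3. Check geometric safety zones
        if any(self.would_violate_zone(z, action, robot_state) for z in self.zones):
            return self.reject("geometric zone violation")
        # 4. Default: allow
        return action

    def reject(self, reason):
        """Remember why the action was refused; None tells the loop to stop."""
        self.last_rejection = reason
        return None

    # scale_to_velocity_limit, would_violate_zone, and the module-level
    # estimate_force are robot-specific and not shown here.


# Main control loop (supervisor, robot, groot_policy, and the sensor
# inputs are assumed to be initialized elsewhere)
while running:
    action = groot_policy.step(camera, instruction, joint_state)
    safe_action = supervisor.filter(action, robot_state)
    if safe_action is not None:
        robot.execute(safe_action)
    else:
        robot.stop()
        log_safety_event(action)
```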
The supervisor runs on a separate compute path from the AI — typically on the safety-certified microcontroller in modern collaborative robots. This separation is what lets safety-certified deployments include AI policies; the certification applies to the supervisor, not the AI.
Recipe 5: Continuous policy improvement from fleet operations. The pattern for operating a fleet at scale and improving the policy over time. Every robot’s rollout data flows back to a central training infrastructure. Weekly retraining incorporates the latest data. Improved policies deploy fleet-wide.
- Each robot logs every rollout (camera frames, joint states, actions, success/failure).
- Logs upload nightly to a central data lake (S3, GCS, or equivalent).
- Daily ETL pipeline curates new logs: filter degenerate rollouts, label outcomes, format for training.
- Weekly retraining job picks up the latest curated data, fine-tunes the deployed policy.
- New policy is validated on a holdout test set. If success rate matches or exceeds the deployed policy, promote.
- Promoted policy rolls out to a small canary fleet (5% of robots) for 24 hours of validation.
- If canary performance is healthy, full fleet rollout over the next 48 hours.
The infrastructure to run this at scale is non-trivial — data pipelines, training orchestration, model registry, deployment automation, monitoring. But it’s the operational pattern that produces continuously-improving fleets. Teams that invest in this infrastructure see 1-3 percentage points of success-rate improvement weekly for the first six months of operation, before policies plateau.
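The promotion gates in the list above reduce to a small decision function. This is a sketch of the gating logic only, under stated assumptions: the function names and return strings are invented for illustration, and rounding the 5% canary group up to at least one robot is a choice the text does not specify.

```python
import math


def canary_size(fleet_size, fraction=0.05):
    """Number of robots in the canary group (rounded up, at least one)."""
    return max(1, math.ceil(fleet_size * fraction))


def promotion_decision(candidate_rate, deployed_rate, canary_healthy=None):
    """Walk a candidate policy through the holdout and canary gates.

    canary_healthy is None until the 24-hour canary window has finished.
    """
    if candidate_rate < deployed_rate:
        return "reject"          # failed the holdout comparison
    if canary_healthy is None:
        return "start_canary"    # passed holdout, canary not yet run
    return "full_rollout" if canary_healthy else "rollback"


print(canary_size(50))
print(promotion_decision(0.91, 0.93))
print(promotion_decision(0.94, 0.93))
print(promotion_decision(0.94, 0.93, canary_healthy=True))
print(promotion_decision(0.94, 0.93, canary_healthy=False))
```

Keeping the gate logic this explicit makes it easy to audit why a given policy version did or did not reach the fleet.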
What ties the recipes together. Each recipe is incremental. Master Recipe 1 first; it’s the foundation. Layer Recipe 4 (safety supervision) once you’re deploying around humans. Add Recipe 5 (fleet learning) when you have multiple robots and want compounding improvements. Recipes 2 and 3 are project-specific — use them when the project demands.
The teams that ship Physical AI well aren’t the ones with the most clever architectures. They’re the ones who execute the standard recipes cleanly, instrument operations well, and iterate with discipline. Pick a recipe, ship it, learn, ship the next one. The compounding improvements over months are what win.
Frequently Asked Questions
Is Isaac GR00T really open-source?
Yes. The GR00T N1.7 weights are released under Apache 2.0 on Hugging Face, the model code is open-source, and there are no commercial-use restrictions. NVIDIA’s strategy is to grow the ecosystem — the open release of foundation models pulls developers into the broader Physical AI stack (Cosmos, Isaac Sim, Jetson Thor) where NVIDIA monetizes through hardware and enterprise software.
Can I use the stack without buying NVIDIA hardware?
For development and simulation, yes — Isaac Sim runs on any RTX-class GPU, GR00T runs on any modern GPU with sufficient VRAM, and Cosmos is accessible via NIM-hosted API. For on-robot deployment, you need a Jetson AGX Thor or comparable edge accelerator. There’s no hard lock-in to NVIDIA — competing edge accelerators exist and improve quarterly — but the integration story is meaningfully cleaner with NVIDIA hardware.
How does this compare to PI’s research robotics stack?
PI (Physical Intelligence) ships a competing foundation-model platform with its own VLA models and customer base. The PI approach is a closed-source SaaS-style API; NVIDIA’s is open-source plus hardware. For most teams, NVIDIA’s open-source-first model wins on flexibility and avoids vendor lock-in; PI’s API may win for teams that want a turnkey integration without hosting model infrastructure themselves.
What’s the realistic timeline from “decided to build” to “robot doing useful work”?
A small pilot (single robot, single task) is achievable in 4-6 months with a small team (2-4 engineers). Production deployment with multiple tasks and reliability sufficient for real customers typically takes 12-18 months. Both timelines are dramatically faster than pre-foundation-model robotics projects.
Is the data collection process really the bottleneck?
For ambitious deployments, yes. Synthetic data scales but doesn’t fully substitute for real demonstrations. Plan to collect demonstrations from day one, build the infrastructure to manage them, and treat the data pipeline as a first-class engineering deliverable. Teams that defer this consistently regret it.
Should we wait for GR00T N2?
No. Start with N1.7 now, plan the upgrade path to N2 when it ships. The platform infrastructure and team capability you build with N1.7 transfer cleanly to N2. Waiting forfeits the operational learnings and gives competitors a head start.
How do safety regulators look at foundation-model-driven robots?
Cautiously, with active engagement. ISO/IEC working groups are developing standards specifically for AI-driven robotics. For deployments in regulated industries (medical, automotive, certain industrial categories), engage with the relevant standards body early — the standards are evolving and the relationship is more cooperative than adversarial in 2026.
What’s the realistic team size for a Physical AI deployment?
For a single-task pilot, 2-4 engineers with mixed ML and robotics backgrounds for 4-6 months. For a small production deployment of one task on multiple robots, 4-6 engineers for 12 months including ramp-up. For a multi-task production fleet, 8-15 engineers spanning ML, robotics, software, infrastructure, and ops for ongoing operation. These are smaller teams than pre-foundation-model robotics projects required by a factor of 2-3x. The leverage from foundation models compounds across the team.
Can I run GR00T on a non-NVIDIA edge accelerator?
Technically yes — the GR00T weights are open and the model is standard PyTorch. Practically, expect significant integration work to optimize for non-NVIDIA hardware. The Jetson Thor pathway has the cleanest software story; alternatives require more engineering and typically deliver lower performance per watt. For most projects, the time saved by using NVIDIA hardware is worth the price premium.