One policy — backflips, expressive motion, and locomotion — grounded to whatever terrain the robot encounters.
A single raw-reference policy deployed on a Unitree G1. Each clip pairs a flat-ground human-motion command with a different, randomly placed terrain.
The same policy deployed outdoors on real-world urban terrain.
The operator performs a motion-capture command on flat ground while the robot executes it over randomly placed terrain — isolating the operator–environment mismatch.
Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain.
We introduce Perceptive Behavior Foundation Model (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing.
To provide scalable terrain supervision, we develop terrain-conformal reference synthesis (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is implemented as an identity-gated Transformer tracker, where terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.
Across controlled simulation and qualitative real-robot rollouts, a single policy tracks a wide range of behaviors — locomotion, stylistic motions, acrobatic maneuvers, and motion-capture teleoperation — and is exercised on stairs, slopes, sparse supports, recessed obstacles, grass, and irregular indoor/outdoor terrain. The results indicate that robot-centric perception can transform human motion priors into terrain-compatible whole-body behavior without changing the raw motion command interface.
A teleoperator in a control room — or a motion-capture clip recorded on flat ground — conveys intent and style, but not a terrain-valid trajectory in the robot's world. The robot must resolve contacts, body height, balance, and timing from its own perception.
Perceptive BFM is trained with a staged Perceptive Motion Tracking (PMT) algorithm. The key contract: the raw motion reference remains the deployment command — terrain-conformal references are used only to supervise learning, never supplied to the final policy at test time.
Raw human-motion clips + sampled height fields are converted into terrain-consistent supervision through contact-aware foothold construction, foot-geometry-aware swing optimization (MPPI in a mid-foot frame), support-aware root reconstruction, collision repair, and multi-point inverse kinematics.
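As a rough illustration of the swing-optimization step named above, the following is a minimal MPPI-style sketch: sampled perturbations of a swing height profile are averaged with exponential weights on their cost. Everything specific here is an assumption made for illustration and not the authors' formulation: the point-foot 1D height profile, the cost terms, the `CLEARANCE` value, and the function names `swing_cost` and `mppi_swing`. The mid-foot frame and foot geometry described in the text are not modeled.

```python
import math
import random

CLEARANCE = 0.03  # assumed minimum foot-to-terrain gap in metres (hypothetical)


def swing_cost(knots, terrain, start, end):
    """Clearance penalty plus a smoothness term over the swing height profile."""
    penalty = sum(1000.0 * max(0.0, g + CLEARANCE - h) ** 2
                  for h, g in zip(knots, terrain))
    path = [start] + list(knots) + [end]
    smooth = sum((b - a) ** 2 for a, b in zip(path, path[1:]))
    return penalty + smooth


def mppi_swing(terrain, start, end, iters=40, samples=64,
               sigma=0.02, lam=0.1, seed=0):
    """Refine swing-knot heights with exponentially weighted sampled noise."""
    rng = random.Random(seed)
    nominal = [start] * len(terrain)  # begin from a flat swing at the start height
    for _ in range(iters):
        # Sample Gaussian perturbations of the whole knot sequence.
        noises = [[rng.gauss(0.0, sigma) for _ in nominal]
                  for _ in range(samples)]
        costs = [swing_cost([u + e for u, e in zip(nominal, eps)],
                            terrain, start, end)
                 for eps in noises]
        # Subtract the best cost before exponentiating for numerical stability.
        best = min(costs)
        weights = [math.exp(-(c - best) / lam) for c in costs]
        total = sum(weights)
        # MPPI update: move the nominal by the weighted mean perturbation.
        nominal = [u + sum(w * eps[k] for w, eps in zip(weights, noises)) / total
                   for k, u in enumerate(nominal)]
    return nominal


# Hypothetical terrain heights under each swing knot: a 10 cm obstacle mid-swing.
terrain = [0.0, 0.0, 0.10, 0.10, 0.0, 0.0]
initial_cost = swing_cost([0.0] * len(terrain), terrain, 0.0, 0.0)
knots = mppi_swing(terrain, start=0.0, end=0.0)
final_cost = swing_cost(knots, terrain, 0.0, 0.0)
```

The temperature `lam` controls how sharply the update favors low-cost samples: small values approach picking the single best sample, while larger values average more broadly.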
A blind Transformer actor–critic is trained with PPO to track terrain-conformal references; it observes privileged state but receives no terrain perception. This teacher captures terrain-aware behavior at the action level.
The deployed student receives raw references plus a robot-centric height map. Vision is fused via residual pathways whose identity gates tanh(α) start at zero (α is initialized to 0), so the network reproduces the motion-tracking prior at initialization and learns only the corrections the terrain demands.
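The identity-gate behavior described above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a scalar gate parameter and element-wise fusion; the helper name `gated_residual` and all numeric values are hypothetical, and the actual network layers are not reproduced.

```python
import math


def gated_residual(base, residual, alpha):
    """Return base + tanh(alpha) * residual, element-wise."""
    gate = math.tanh(alpha)
    return [b + gate * r for b, r in zip(base, residual)]


# Hypothetical base action mean and terrain-driven residual (illustrative values).
mu_base = [0.10, -0.25, 0.40]
terrain_residual = [0.05, 0.02, -0.10]

# With alpha = 0 the gate is tanh(0) = 0, so the output equals the base exactly.
at_init = gated_residual(mu_base, terrain_residual, alpha=0.0)

# Once alpha moves away from zero, the terrain residual starts to contribute.
after_training = gated_residual(mu_base, terrain_residual, alpha=0.5)
```

Because tanh is bounded in (−1, 1), the gate also limits how strongly the residual can be scaled in this sketch, whatever value α takes.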
Teacher actions live in the adapted-reference frame; the student acts around the raw reference. We distill the teacher's effective PD target re-expressed relative to the raw reference, a* = (q_ref^tcrs + μ_tea) − q_ref^raw, so distillation is meaningful across reference frames.
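A small numeric sketch of this re-expression follows. It assumes that actions are per-joint offsets added to the reference pose to form PD targets, which is how the text above reads but is not stated in full. The function name `align_teacher_target` and the joint values are hypothetical.

```python
def align_teacher_target(q_ref_tcrs, mu_tea, q_ref_raw):
    """Compute a* = (q_ref_tcrs + mu_tea) - q_ref_raw for each joint."""
    return [(qt + m) - qr for qt, m, qr in zip(q_ref_tcrs, mu_tea, q_ref_raw)]


# Hypothetical joint angles in radians for a three-joint example.
q_ref_raw = [0.00, 0.50, -1.00]    # raw flat-ground reference
q_ref_tcrs = [0.10, 0.70, -1.20]   # terrain-conformal reference
mu_tea = [0.02, -0.05, 0.10]       # teacher action mean around q_ref_tcrs

a_star = align_teacher_target(q_ref_tcrs, mu_tea, q_ref_raw)

# Adding a* to the raw reference recovers the teacher's PD target, so the
# student can imitate the teacher while only ever seeing the raw reference.
student_target = [qr + a for qr, a in zip(q_ref_raw, a_star)]
teacher_target = [qt + m for qt, m in zip(q_ref_tcrs, mu_tea)]
```

Distilling the teacher's raw action μ_tea directly would instead apply an offset meant for the adapted reference on top of the raw one, which is the mismatch the alignment avoids.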
u_t; a terrain branch (Map CNN → query-conditioned MapTransformer) produces a terrain latent z_vis; identity-gated fusion updates the intent (u′ = u + tanh(α_u)·Δu) and adds an action-mean residual (μ = μ_base + tanh(α_a)·r), with α initialized at zero so the policy starts as a pure raw-reference tracker. (C) Objective: PPO + value + entropy with Huber auxiliary losses on velocity, anchor, and foot trajectory.
All variants share the same task, reward, observation contract, and 48×A800-GPU budget. Removing terrain perception is by far the most damaging change: the blind variant collapses an order of magnitude below every other configuration, confirming that the gains are perceptual and architectural, not from capacity.
TCRS is a kinematic synthesizer: it builds contact-consistent, style-preserving references without solving contact-rich dynamics, and assumes a static, rigid, observable height field — so it does not model deformable, granular, or slippery media.
Because adaptation is foot-centric while the upper-body command is preserved as-is, the arms and torso can strike nearby obstacles. Future work: collision-aware upper-body adaptation and quantitative deployment rollouts.
If you find this work useful, please consider citing:
@inproceedings{anonymous2026perceptivebfm,
title = {Perceptive Behavior Foundation Model:
Adapting Human Motion Priors to Robot-Centric Terrain},
author = {Anonymous Authors},
booktitle = {Conference on Robot Learning (CoRL)},
year = {2026},
note = {Under review}
}