Perceptive Behavior Foundation Model
Adapting Human Motion Priors to Robot-Centric Terrain

Anonymous Authors
Anonymous Affiliation

One policy — backflips, expressive motion, and locomotion — grounded to whatever terrain the robot encounters.

A single Perceptive BFM policy across eight maneuver–terrain pairs.
One policy, eight maneuver–terrain pairs. A single Perceptive BFM accepts diverse flat-ground human-motion commands — a one-leg backflip, a stair dance, an arm-waving run, a free-arm walk, a sideways stair walk, a backward step over obstacles, and a turning gait — and adapts each to a different, randomly placed terrain. The raw motion specifies intent; robot-centric perception resolves the footholds, swing clearance, posture, and contact timing the reference never specifies.
🧩

The mismatch

Human motion priors encode what to do, but a flat-ground reference never specifies the footholds, clearance, or contact timing required on the robot's actual terrain.

🗺️

TCRS supervision

An offline synthesizer converts raw clips + sampled height fields into terrain-conformal references — used only to teach, never queried at deployment.

👁️

Perceptive tracking

An identity-gated vision student keeps the raw command and adds terrain corrections through zero-initialized residual pathways — only where the terrain demands it.

Try It Live — In-Browser Demo

The deployed policy runs entirely in your browser via MuJoCo WASM + ONNX Runtime — no install. Switch between motion commands and watch the height-scan-conditioned policy adapt to terrain in real time.


Abstract

Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain.

We introduce Perceptive Behavior Foundation Model (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing.

To provide scalable terrain supervision, we develop terrain-conformal reference synthesis (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is implemented as an identity-gated Transformer tracker, where terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.

Across controlled simulation and qualitative real-robot rollouts, a single policy tracks a wide range of behaviors — locomotion, stylistic motions, acrobatic maneuvers, and motion-capture teleoperation — and is exercised on stairs, slopes, sparse supports, recessed obstacles, grass, and irregular indoor/outdoor terrain. The results indicate that robot-centric perception can transform human motion priors into terrain-compatible whole-body behavior without changing the raw motion command interface.

The Operator–Environment Mismatch

A teleoperator in a control room — or a motion-capture clip recorded on flat ground — conveys intent and style, but not a terrain-valid trajectory in the robot's world. The robot must resolve contacts, body height, balance, and timing from its own perception.

Real-robot motion-capture mismatch
Real-robot mocap mismatch. (a) A human mocap motion captured on flat ground; (b,c) the robot tracks that same command over robot-side terrain; (d) a separate walk-and-dance motion deployed in the wild. The command is identical to the flat-ground capture — only perception bridges it to the terrain.

Method

Perceptive BFM is trained with a staged Perceptive Motion Tracking (PMT) algorithm. The key contract: the raw motion reference remains the deployment command — terrain-conformal references are used only to supervise learning, never supplied to the final policy at test time.

PMT system overview — four stages
System overview. Stage 1 — TCRS synthesizes terrain-conformal references offline. Stage 2 — a blind Transformer teacher learns adapted-reference tracking with PPO. Stage 3 — a vision student receives the raw reference + a robot-centric terrain scan and is distilled via target-frame action alignment. Stage 4 — PPO fine-tuning with identity-gated terrain residuals. TCRS is never queried at deployment.
1. Terrain-Conformal Reference Synthesis (TCRS)

Raw human-motion clips + sampled height fields are converted into terrain-consistent supervision through contact-aware foothold construction, foot-geometry-aware swing optimization (MPPI in a mid-foot frame), support-aware root reconstruction, collision repair, and multi-point inverse kinematics.

2. Blind Adapted-Reference Teacher

A blind Transformer actor–critic is trained with PPO to track terrain-conformal references, using privileged state but no terrain perception. This teacher captures terrain-aware behavior at the action level.

3. Identity-Gated Vision Student

The deployed student receives raw references plus a robot-centric height map. Vision is fused through residual pathways scaled by identity gates $\tanh(\alpha)$, with $\alpha$ initialized to 0, so the network reproduces the motion-tracking prior at initialization and learns only the corrections the terrain demands.

4. Target-Frame Action Alignment

Teacher actions live in the adapted-reference frame; the student acts around the raw reference. We distill the teacher's effective PD target re-expressed relative to the raw reference, $a^{*} = (q_{\mathrm{ref}}^{\mathrm{tcrs}} + \mu_{\mathrm{tea}}) - q_{\mathrm{ref}}^{\mathrm{raw}}$, so distillation is meaningful across reference frames.
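The alignment formula can be illustrated with a small numerical sketch. It assumes, as the phrase "effective PD target" suggests, that each policy's action is an additive joint-position offset around its own reference; the function name, the three-joint vectors, and the absence of any action scaling are illustrative assumptions, not details from the paper.

```python
def align_teacher_action(q_ref_tcrs, mu_tea, q_ref_raw):
    """Re-express the teacher's PD target relative to the raw reference.

    Implements a* = (q_ref_tcrs + mu_tea) - q_ref_raw per joint.
    """
    return [(qt + m) - qr for qt, m, qr in zip(q_ref_tcrs, mu_tea, q_ref_raw)]


# Hypothetical joint values (rad) for three joints.
q_ref_raw = [0.10, -0.40, 0.80]    # raw flat-ground reference
q_ref_tcrs = [0.25, -0.55, 1.00]   # terrain-conformal reference
mu_tea = [0.02, -0.01, 0.03]       # teacher action mean around the TCRS reference

a_star = align_teacher_action(q_ref_tcrs, mu_tea, q_ref_raw)

# Both policies now command the same joint targets, each around its own reference.
teacher_target = [qt + m for qt, m in zip(q_ref_tcrs, mu_tea)]
student_target = [qr + a for qr, a in zip(q_ref_raw, a_star)]
```

Under these assumptions, a student that outputs `a_star` around the raw reference reaches the same PD target as the teacher, which is why the distillation target stays meaningful even though the two policies track different references.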

TCRS — Terrain-Conformal Reference Synthesis

TCRS foot-trajectory synthesis
TCRS trajectory synthesis. The blue ghost is the raw reference placed on terrain; the opaque robot is the TCRS output. Foot traces compare the sampling-based MPPI foot-end optimization used in TCRS (yellow), cubic interpolation (blue), and direct terrain-height z-lifting (black). TCRS replans a mid-foot swing trajectory and reconstructs the body so feet clear vertical stair faces rather than penetrating them.
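To make the sampling-based swing optimization concrete, here is a heavily simplified MPPI-style sketch. It optimizes only the vertical heights of a few swing waypoints over a one-dimensional stair profile; the terrain samples, cost weights, clearance margin, and best-candidate tracking are all assumptions chosen for illustration, and the actual TCRS optimizer (which works in a mid-foot frame and accounts for foot geometry) is not reproduced here.

```python
import math
import random

# Hypothetical terrain heights (m) sampled every 5 cm along the swing direction:
# flat ground followed by a 15 cm stair riser.
TERRAIN = [0.0, 0.0, 0.0, 0.0, 0.15, 0.15, 0.15, 0.15, 0.15]
CLEARANCE = 0.03  # assumed minimum foot-terrain gap (m)


def swing_cost(zs):
    """Cost of interior swing heights zs between fixed lift-off and touchdown."""
    path = [TERRAIN[0]] + list(zs) + [TERRAIN[-1]]
    cost = 0.0
    for z, h in zip(path[1:-1], TERRAIN[1:-1]):
        cost += 100.0 * max(0.0, h + CLEARANCE - z)  # penetration / low clearance
    for a, b, c in zip(path, path[1:], path[2:]):
        cost += 10.0 * (a - 2.0 * b + c) ** 2        # smoothness (second difference)
    return cost


def mppi_swing(z_init, iters=30, samples=64, sigma=0.03, lam=0.5, seed=0):
    """MPPI-style update: sample, weight by exp(-cost / lam), average."""
    rng = random.Random(seed)
    mean = list(z_init)
    best, best_cost = list(z_init), swing_cost(z_init)
    for _ in range(iters):
        cands = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(samples)]
        costs = [swing_cost(c) for c in cands]
        c_min = min(costs)  # subtract the minimum for numerical stability
        weights = [math.exp(-(c - c_min) / lam) for c in costs]
        total = sum(weights)
        mean = [sum(w * c[i] for w, c in zip(weights, cands)) / total
                for i in range(len(mean))]
        # Keep the lowest-cost trajectory seen so far (samples or updated mean).
        cands.append(mean)
        costs.append(swing_cost(mean))
        k = min(range(len(costs)), key=costs.__getitem__)
        if costs[k] < best_cost:
            best, best_cost = list(cands[k]), costs[k]
    return best, best_cost


# Straight-line initial guess from lift-off to touchdown; it cuts through the riser.
n = len(TERRAIN) - 2
z_init = [TERRAIN[0] + (TERRAIN[-1] - TERRAIN[0]) * (i + 1) / (n + 1) for i in range(n)]
best, best_cost = mppi_swing(z_init)
```

The straight-line guess plays the role of a naive interpolation that penetrates the stair face; the sampled update can only keep or lower its cost because the best candidate is tracked explicitly. How far the waypoints rise above the riser depends on the assumed weights and sample count.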

Policy Architecture

Detailed PMT network architecture
Vision-augmented PMT actor–critic. (A) Inputs: proprioception, history, command window, terrain map, and supervision targets. (B) A Transformer motion-tracking backbone produces a motion intent $u_t$; a terrain branch (Map CNN → query-conditioned MapTransformer) produces a terrain latent $z_{\mathrm{vis}}$; identity-gated fusion updates the intent ($u' = u + \tanh(\alpha_u)\,\Delta u$) and adds an action-mean residual ($\mu = \mu_{\mathrm{base}} + \tanh(\alpha_a)\,r$), with each $\alpha$ initialized at zero so the policy starts as a pure raw-reference tracker. (C) Objective: PPO + value + entropy with Huber auxiliary losses on velocity, anchor, and foot trajectory.
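The two gated updates in the caption share one form, which the following minimal sketch spells out. The function name and the two-element vectors are hypothetical, and in the real network the residuals would come from learned terrain features rather than fixed numbers.

```python
import math


def gated_residual(base, delta, alpha):
    """Return base + tanh(alpha) * delta, element-wise.

    With alpha = 0 the gate is exactly zero, so the output equals base.
    """
    gate = math.tanh(alpha)
    return [b + gate * d for b, d in zip(base, delta)]


# Hypothetical values: u is the motion intent, delta_u the terrain correction.
u = [1.0, -2.0]
delta_u = [0.3, 0.4]

u_init = gated_residual(u, delta_u, alpha=0.0)   # identity at initialization
u_open = gated_residual(u, delta_u, alpha=0.5)   # partially opened gate

# The same form covers the action-mean residual: mu = mu_base + tanh(alpha_a) * r.
mu_base = [0.05, -0.10]
r = [0.02, 0.01]
mu_init = gated_residual(mu_base, r, alpha=0.0)
```

Because the derivative of tanh at zero is one, the gate parameter still receives gradient at initialization whenever the residual is nonzero, so the terrain pathway can open during training even though it contributes nothing at the start.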

Quantitative Results

54.6 vs 3.6
Mean reward of full PMT vs. the blind (no-vision) variant: perception, not capacity, drives terrain grounding.
+5–8
Reward points the Transformer backbone gains over MLP / GRU / CNN variants under matched compute.
+4.5
Reward points from target-frame distillation vs. removing it (54.6 → 50.1).
−56.6%
TCRS foot-terrain penetration depth vs. a Z-offset baseline (5.48 → 2.38 cm); clearance violation −48.3%.
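The headline figures above can be cross-checked from the raw numbers quoted alongside them. The short script below does only that arithmetic; the helper name is illustrative.

```python
def relative_change(before, after):
    """Percentage change from before to after."""
    return (after - before) / before * 100.0


# Penetration depth: Z-offset baseline 5.48 cm -> TCRS 2.38 cm.
penetration_change = relative_change(5.48, 2.38)   # about -56.6 (%)

# Reward with vs. without target-frame distillation.
distillation_gain = 54.6 - 50.1                    # about 4.5 reward points

# Full PMT vs. the blind variant.
vision_ratio = 54.6 / 3.6                          # roughly 15x

print(round(penetration_change, 1), round(distillation_gain, 1), round(vision_ratio, 1))
```

The ratio of roughly 15 between full PMT and the blind variant is consistent with the "order of magnitude" gap described in the ablations.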

Training-Time Ablations

All variants share the same task, reward, observation contract, and 48×A800-GPU budget. Removing terrain perception is by far the most damaging change — the blind variant collapses an order of magnitude below every other configuration — confirming that the gains are perceptual and architectural, not from capacity.

PMT ablation reward curves
Training reward diagnostics. Full PMT (purple) vs. PMT w/o distillation, Flat MLP, MLP-GRU, Split MLP, Split CNN, and PMT w/o vision (collapsed, bottom). Mean reward over the last 1k of 10k iterations.

Limitations

TCRS is a kinematic synthesizer: it builds contact-consistent, style-preserving references without solving contact-rich dynamics, and assumes a static, rigid, observable height field — so it does not model deformable, granular, or slippery media.

Because adaptation is foot-centric while the upper-body command is preserved as-is, the arms and torso can strike nearby obstacles. Future work: collision-aware upper-body adaptation and quantitative deployment rollouts.

Representative failure — arm strikes an obstacle
Representative failure. The upper-body command is collision-unaware, so arms or torso can strike obstacles.


BibTeX

If you find this work useful, please consider citing:

@inproceedings{anonymous2026perceptivebfm,
  title  = {Perceptive Behavior Foundation Model:
            Adapting Human Motion Priors to Robot-Centric Terrain},
  author = {Anonymous Authors},
  booktitle = {Conference on Robot Learning (CoRL)},
  year   = {2026},
  note   = {Under review}
}