Building Behavioral Foundation Models (BFMs) for humanoid robots has the potential to unify diverse control tasks under a single, promptable generalist policy. However, existing approaches are either deployed exclusively on simulated humanoid characters or specialized to specific tasks such as tracking. We propose BFM-Zero, a framework that learns an effective shared latent representation embedding motions, goals, and rewards into a common space, enabling a single policy to be prompted for multiple downstream tasks without retraining. This well-structured latent space enables versatile and robust whole-body skills on a Unitree G1 humanoid in the real world via diverse inference methods, including zero-shot motion tracking, goal reaching, and reward optimization, as well as few-shot optimization-based adaptation. Unlike prior on-policy reinforcement learning (RL) frameworks, BFM-Zero builds upon recent advancements in unsupervised RL and Forward-Backward (FB) models, which offer an objective-centric, explainable, and smooth latent representation of whole-body motions. We further extend BFM-Zero with critical reward shaping, domain randomization, and history-dependent asymmetric learning to bridge the sim-to-real gap. These key design choices are quantitatively ablated in simulation. A first-of-its-kind model, BFM-Zero establishes a step toward scalable, promptable behavioral foundation models for whole-body humanoid control.
Approach
An overview of BFM-Zero: after the pre-training stage, BFM-Zero forms a latent space that can be used for zero-shot inference and few-shot adaptation.
Objective: Learn a unified latent representation that embeds tasks (e.g., target motions, rewards, goals) into a shared space and a promptable policy that conditions on this representation to perform diverse tasks without retraining.
BFM-Zero can zero-shot perform tasks including motion tracking, goal reaching, and reward optimization.
BFM-Zero supports few-shot adaptation to quickly adapt to specific requirements with minimal additional training.
How it works: Unsupervised RL and Zero-shot inference
In pre-training, BFM-Zero aims to learn a latent space $z\in Z$, a forward representation $\boldsymbol{F}(s, a, z)$, a backward representation $\boldsymbol{B}(s)$, and a $z$-conditioned policy $\pi_z$ such that:
Starting from $(s,a)$ and following $\pi_z$, the visitation probability of $s'$ is approximately $\boldsymbol{F}(s, a, z)^\top \boldsymbol{B}(s')$.
The Q-function of $\pi_z$ is given by $\boldsymbol{F}(s, a, z)^\top z$, yet it is not learned with supervision from any task-related reward.
At inference time, for various objectives (goal reaching, tracking, reward optimization), we can use the following formulas to compute $z$ in a zero-shot way:
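(The rules below follow the standard forward-backward formulation that BFM-Zero builds on; the exact estimators and state features used on the robot may differ, so treat them as a sketch rather than the precise recipe.)
$$
z_{\text{goal}} = \boldsymbol{B}(g), \qquad z^{\text{track}}_t = \boldsymbol{B}\!\left(s^{\text{ref}}_{t+1}\right), \qquad z_{\text{reward}} = \mathbb{E}_{s \sim \mathcal{D}}\!\left[\, r(s)\,\boldsymbol{B}(s) \,\right],
$$
where $g$ is a target goal state, $s^{\text{ref}}$ is the reference motion (the tracking latent is re-computed along the trajectory), $r$ is the prompted reward, and $\mathcal{D}$ is the state distribution of the pre-training data.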
🔑 Remark: Compared to other frameworks, we do not give the model any specific (task-related) reward during training, i.e., it is an unsupervised RL problem. Moreover, the learned representations $\boldsymbol{F}$, $\boldsymbol{B}$ and the latent space $Z$ are aware of humanoid dynamics.
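A minimal pseudocode sketch of how such a promptable policy can be queried at test time; the wrappers `B`, `pi`, and `env`, the latent normalization, and the control-loop details are illustrative assumptions, not the exact BFM-Zero interface:

```python
import numpy as np

def prompt_goal(B, goal_state):
    """Zero-shot goal reaching: embed the target pose with the backward map, z = B(g)."""
    z = B(goal_state)
    return z / (np.linalg.norm(z) + 1e-8)   # keep z on the latent sphere (assumption)

def prompt_tracking(B, reference_states, t):
    """Zero-shot tracking: re-embed the next reference frame at every control step."""
    return prompt_goal(B, reference_states[t + 1])

def rollout(pi, env, z, horizon=500):
    """Condition the same pre-trained policy on z and act; no retraining involved."""
    s = env.reset()
    for _ in range(horizon):
        a = pi(s, z)             # z-conditioned policy pi_z
        s, done = env.step(a)    # hypothetical simulator/robot interface
        if done:
            break
```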
ALL demos are real-time (except for the tracking highlight) and come from the same policy.
Goal Reaching
Click an image for the real-world goal pose, click ▶ for the natural transition, and drag the progress bar to view frame by frame.
We also test transitions from diverse initial poses on the ground to a standing pose. Follow the gallery to see the full description of the performance.
Target: T-Pose
BFM-Zero enables a natural and smooth transition from the ground to T-Pose.
Target: T-Pose
A brief running-like adjustment occurs before achieving full stability.
Target: Hands-on-hips posture
Smooth and natural transitions to Hands-on-hips posture.
Target: Hands-on-hips posture
Rapid stabilization occurs when the initial stand-up is unsteady.
Target: Hands-on-hips posture
Successfully recovers even after the first failed attempt to stand.
Target: Hands-on-hips posture
If the initial pose is not natural, it prioritizes standing up quickly before fine adjustments.
Target: Hands-on-hips posture
In addition, BFM-Zero maintains a high rate of efficient goal reaching...
Target: Hands-on-hips posture
...a high rate of efficient goal reaching...
Target: Hands-on-hips posture
...a high rate of efficient goal reaching.
Target: Hands-on-hips posture
Robust even with a severely broken wrist.
Motion Tracking
Check out our highlight tracking video: dancing with safe, natural, and gentle recovery even when confronted with unexpected falls.
More motion tracking demos, including styled locomotion, boxing, ball games, and dancing. All demos come from a single, continuous video shoot with the same policy.
Walk and Turn
Ball Games
Dancing (Including Fall and Recovery)
Styled Walking (Salute!)
Boxing
Walking (Small Steps and Large Steps)
Dancing
Reward Optimization
We do not use any labeled (task-related) rewards during training; reward inference (optimization) happens at test time. Users can prompt with any type of reward function w.r.t. the robot states, and the policy outputs the optimized skills zero-shot, without retraining. ($R$ below denotes the reward.)
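As a sketch of this test-time reward inference, one way to map a prompted reward function to a latent is to average $r(s)\,\boldsymbol{B}(s)$ over states drawn from the pre-training data, following the FB reward-embedding rule; the state buffer, the `B` handle, and the normalization below are assumptions:

```python
import numpy as np

def infer_z_from_reward(B, reward_fn, state_buffer):
    """Zero-shot reward optimization: z_r ~= E_{s~D}[ r(s) * B(s) ]."""
    rewards = np.array([reward_fn(s) for s in state_buffer])   # user-prompted reward R
    features = np.stack([B(s) for s in state_buffer])          # backward features B(s)
    z = (rewards[:, None] * features).mean(axis=0)             # Monte Carlo estimate of the embedding
    return z / (np.linalg.norm(z) + 1e-8)                      # project to the latent sphere (assumption)
```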
Basic Locomotion
We enable the robot to perform basic locomotion tasks, including standing still, walking forward/backward/sideways, and turning left/right.
Maintains stable standing posture without movement.
Here, $w$ is the weight of the corresponding reward term. We abbreviate low as "l" and high as "h"; e.g., "arms-l-h" means the right wrist is low and the left wrist is high.
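For instance, a composed prompt such as "backward & arms-h-h" could be written as a weighted sum of simple state-based terms; the state fields, target values, and weights here are purely illustrative, not the exact rewards used in the demos:

```python
def reward_backward_arms_high(s, w_vel=1.0, w_arm=0.5):
    """Illustrative composed reward: walk backward while keeping both wrists high."""
    r_vel = -abs(s["base_vel_x"] + 0.5)               # track a backward base velocity of 0.5 m/s
    r_arm = -abs(s["left_wrist_z"] - 1.3) \
            - abs(s["right_wrist_z"] - 1.3)            # keep both wrists near 1.3 m height
    return w_vel * r_vel + w_arm * r_arm
```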
Note: All demos come from a single, continuous video shoot with the same policy.
backward & arms-h-h
strafe-right & arms-h-h
strafe-left & arms-h-l
forward & arms-h-l
backward & arms-h-l
strafe-right & arms-h-l
forward & arms-l-h
backward & arms-l-h
strafe-right & arms-l-h
forward & arms-l-l
strafe-left & arms-l-l
backward & arms-l-l
strafe-right & arms-l-l
turn-anticlockwise & arms-l-l
turn-clockwise & arms-l-l
turn-anticlockwise & arms-h-l
turn-anticlockwise & arms-l-h
strafe-left & arms-h-h
Note: Right/left is relative to the robot. The reward functions are for illustration purposes; at inference time, we also include some soft constraints and regularization terms.
Natural Recovery from Large Disturbance
We demonstrate the robustness and flexibility of BFM-Zero: The policy enables the humanoid robot to recover gracefully from various disturbances, including heavy pushes, torso kicks, ground pulls, and leg kicks.
Highlight: Natural recovery from pulling to the ground
Highlight: Emergent behavior (running) from heavy pushes
Few-shot Adaptation
We demonstrate BFM-Zero's few-shot adaptation capability: the smooth structure of our latent space enables efficient search-based optimization in simulation to discover, within a short time, a better latent than the one from direct zero-shot inference (a minimal sketch follows the demo below).
When the robot carries a 4 kg payload on its torso, we can run adaptation so the robot maintains single-leg standing for a longer time.
Before Adaptation
Adaptation in Sim
for less than 2 minutes
→
After Adaptation
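A minimal sketch of what this search-based adaptation could look like, assuming a simulator-based score (e.g., single-leg-stance duration under the payload); the perturbation scale, population size, and scoring function `evaluate_in_sim` are assumptions rather than the exact BFM-Zero procedure:

```python
import numpy as np

def adapt_latent(z_init, evaluate_in_sim, n_iters=20, pop_size=16, sigma=0.1):
    """Few-shot adaptation: local random search around the zero-shot latent, in simulation."""
    z_best, score_best = z_init, evaluate_in_sim(z_init)
    for _ in range(n_iters):
        candidates = z_best + sigma * np.random.randn(pop_size, z_best.shape[0])
        candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)  # stay on the latent sphere (assumption)
        scores = [evaluate_in_sim(z) for z in candidates]
        best = int(np.argmax(scores))
        if scores[best] > score_best:
            z_best, score_best = candidates[best], scores[best]
    return z_best
```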
Space Interpolation
The structured nature of the learned space enables smooth interpolation between latent representations: we can leverage Spherical Linear Interpolation (slerp) to generate intermediate latent vectors along the geodesic arc between two endpoint latents.
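A short sketch of spherical linear interpolation between two latents, tracing the geodesic arc mentioned above; the unit-norm assumption on $z$ is ours:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between two unit-norm latent vectors, t in [0, 1]."""
    z0, z1 = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0, z1), -1.0, 1.0))    # angle between the endpoints
    if omega < 1e-6:                                           # nearly identical latents: fall back to lerp
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# e.g., intermediate prompts along the arc between two behaviors:
# zs = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 10)]
```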