Humanoid Policy ~ Human Policy

Ri-Zhao Qiu*             Shiqi Yang*          Xuxin Cheng*          Chaitanya Chawla          Jialong Li          Tairan He         
Ge Yan          David J. Yoon          Ryan Hoque          Lars Paulsen          Ge Yang         
Jian Zhang          Sha Yi          Guanya Shi          Xiaolong Wang



Humanoid (adj.): having human form or characteristics. — Merriam-Webster, 2025.

Abstract

Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection that is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric, task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both the generalization and robustness of HAT while offering significantly better data collection efficiency.
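As a rough illustration of the human-humanoid co-training described above, the sketch below mixes human and robot samples within each training batch, upweighting the smaller robot split. The dataset construction, tensor shapes, and policy stand-in are illustrative assumptions, not the released HAT training code.

# Minimal human-humanoid co-training sketch with synthetic data.
# All names, shapes, and the mixing strategy are illustrative assumptions,
# not the released HAT training code.
import torch
from torch import nn
from torch.utils.data import TensorDataset, ConcatDataset, WeightedRandomSampler, DataLoader

def synthetic_split(n, obs_dim=64, act_dim=26):
    """Stand-in for demos already mapped to the shared human-centric state-action space."""
    return TensorDataset(torch.randn(n, obs_dim), torch.randn(n, act_dim))

human_data = synthetic_split(20_000)   # large-scale egocentric human demos
robot_data = synthetic_split(2_000)    # smaller-scale humanoid teleop demos

mixed = ConcatDataset([human_data, robot_data])
# Upweight the smaller robot split so every batch mixes both embodiments.
weights = torch.cat([
    torch.full((len(human_data),), 1.0 / len(human_data)),
    torch.full((len(robot_data),), 1.0 / len(robot_data)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=256, sampler=sampler)

policy = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 26))  # stand-in for HAT
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

for obs, action in loader:
    loss = nn.functional.mse_loss(policy(obs), action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()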

Collecting Human and Humanoid Demonstrations

We developed a Vision OS app and OpenTV to collect demonstrations with 3D hand poses and diverse cameras.

Scalable: we collected ~27k demonstrations with language annotations in dozens of hours.
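For concreteness, the sketch below defines one plausible per-frame record produced by such a collection app. The field names, shapes, and keypoint count are assumptions for illustration, not the released PH2D schema.

# One plausible per-frame record for an egocentric demonstration.
# Field names and shapes are assumptions, not the released PH2D schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class DemoFrame:
    rgb: np.ndarray             # egocentric image, e.g. (H, W, 3) uint8 from the head camera
    head_pose: np.ndarray       # 4x4 head/camera pose in a world frame
    wrist_poses: np.ndarray     # (2, 4, 4) left/right wrist poses
    hand_keypoints: np.ndarray  # (2, 21, 3) 3D hand keypoints from hand tracking (count assumed)
    timestamp: float            # capture time in seconds
    language: str               # task annotation, e.g. "pour water into the cup"

frame = DemoFrame(
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    head_pose=np.eye(4),
    wrist_poses=np.stack([np.eye(4), np.eye(4)]),
    hand_keypoints=np.zeros((2, 21, 3)),
    timestamp=0.0,
    language="pick up the straw",
)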


With diverse demonstrations, HAT performs human-humanoid co-training without additional visual post-processing.
(Visualizations may appear slightly out of sync due to the Plotly frontend.)

Pick up a straw in a milk tea shop
Pour water in front of a drink vending machine

Pack a bottle of beer in a hallway
Pour green tea in a sushi restaurant

Pick up an orange in front of an outdoor grill
Grasp and place a metal bottle in a university office

Unified Human-Centric State-Action Space

Our policy is robot-agnostic: it uses a human-centric state-action representation that can be retargeted to specific robots.

Robot Observations
Human Observations
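A minimal sketch of how this human-centric representation might be retargeted to a robot is shown below. The fixed wrist-to-end-effector transform, the keypoint indices, and the pinch-distance heuristic for hand closure are illustrative assumptions, not the project's actual retargeting pipeline; the point is that the mapping stays differentiable.

# Minimal differentiable retargeting sketch (PyTorch).
# The fixed wrist offset and the pinch-distance-to-closure mapping are
# illustrative assumptions, not the project's actual retargeting code.
import torch

# Hypothetical constant transform from the human wrist frame to the robot
# end-effector frame (calibration-dependent in practice).
T_WRIST_TO_EE = torch.eye(4)

def retarget(wrist_pose: torch.Tensor, hand_keypoints: torch.Tensor) -> dict:
    """Map a human-centric action (wrist pose + 3D hand keypoints) to robot commands.

    wrist_pose:     (4, 4) wrist pose in the head/world frame.
    hand_keypoints: (21, 3) keypoints; index 4 = thumb tip, 8 = index tip (assumed).
    Both outputs stay differentiable w.r.t. the inputs.
    """
    ee_pose = wrist_pose @ T_WRIST_TO_EE           # end-effector target pose

    # Simple pinch heuristic: thumb-index distance drives hand closure in [0, 1].
    pinch = torch.norm(hand_keypoints[4] - hand_keypoints[8])
    closure = torch.clamp(1.0 - pinch / 0.10, 0.0, 1.0)   # 10 cm = fully open (assumed)

    return {"ee_pose": ee_pose, "hand_closure": closure}

# Example: gradients flow from the retargeted command back to the raw keypoints.
kp = (0.01 * torch.randn(21, 3)).requires_grad_()
out = retarget(torch.eye(4), kp)
out["hand_closure"].backward()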

Human-Humanoid Co-training Improves Generalizability

Few-shot Cross-Humanoid Generalization (CMU H1)


20 CMU demos
Co-trained with human data

Object Generalization (UCSD H1)


Seen in robot data: blue Pepsi bottle
Seen in human data: red Coke can

Seen in human data: ZED camera box
Seen in human data: cardboard box

Object Placement Generalization (UCSD H1)


Baseline: trained on 60 robot demos (50 picking right, 10 picking left)
Ours: co-trained with human data covering diverse placements

Background Generalization (UCSD H1)


Seen in robot data: colored paper background
Seen in human data: red cardboard

Seen in human data: wooden table
Seen in human data: green cardboard

We open-source code to replay human/humanoid recordings in MuJoCo to demonstrate the human-centric representation.
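The snippet below sketches how such a replay might look with the MuJoCo Python bindings. The model XML, the recording file layout, and the assumption that the recording stores retargeted joint positions in model order are all hypothetical, not the released replay script.

# Rough MuJoCo replay sketch: step a humanoid model through recorded joint targets.
# The XML path, the .npz layout, and the joint ordering are illustrative assumptions.
import mujoco
import numpy as np

model = mujoco.MjModel.from_xml_path("h1_with_hands.xml")   # hypothetical humanoid model
data = mujoco.MjData(model)
renderer = mujoco.Renderer(model, height=480, width=640)

recording = np.load("demo_episode.npz")        # hypothetical recording file
qpos_traj = recording["qpos"]                  # (T, model.nq) retargeted joint positions

frames = []
for qpos in qpos_traj:
    data.qpos[:] = qpos
    mujoco.mj_forward(model, data)             # kinematic replay, no dynamics integration
    renderer.update_scene(data)
    frames.append(renderer.render())

# `frames` now holds RGB images of the replayed demonstration.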

BibTeX


@article{qiu2025humanoid,
  title={Humanoid Policy ~ Human Policy},
  author={Qiu, Ri-Zhao and Yang, Shiqi and Cheng, Xuxin and Chawla, Chaitanya and Li, Jialong and He, Tairan and Yan, Ge and Yoon, David J. and Hoque, Ryan and Paulsen, Lars and Yang, Ge and Zhang, Jian and Yi, Sha and Shi, Guanya and Wang, Xiaolong},
  journal={},
  year={2025}
}