DEMO³

Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning

Adrià López Escoriza1,2, Nicklas Hansen1, Stone Tao1,3, Tongzhou Mu1, Hao Su1,3


UC San Diego, ETH Zürich, Hillbot

Abstract

Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable sub-goals. In this work, we propose DEMO³, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40%, and by 70% on particularly difficult tasks, compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks, using as few as five demonstrations.

Learning to predict dense rewards from demonstrations

Our method uses a small number of demonstrations to (i) pretrain a policy, (ii) train a world model, and (iii) learn a multi-stage reward function that guides exploration.
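Below is a minimal, runnable outline of how these three ingredients could be wired into a two-phase loop (offline pretraining on demonstrations, then online interaction). Every name here (Component, train, the toy rollout data) is an illustrative placeholder, not the released DEMO³ implementation:

import random

class Component:
    # Stand-in for a learned module (policy, world model, or per-stage reward model).
    def update(self, batch):
        pass                      # a gradient step would go here
    def __call__(self, *args):
        return random.random()    # placeholder prediction / score

def train(demos, num_stages=3, pretrain_steps=100, online_steps=1000):
    policy, world_model = Component(), Component()
    stage_rewards = [Component() for _ in range(num_stages)]   # one reward model per stage

    # Phase 1 (offline): pretrain the policy and world model on the demonstrations.
    for _ in range(pretrain_steps):
        batch = random.sample(demos, k=min(4, len(demos)))
        policy.update(batch)
        world_model.update(batch)

    # Phase 2 (online): collect experience; the learned per-stage rewards densify the
    # semi-sparse stage-completion signal and guide exploration.
    replay = list(demos)
    for step in range(online_steps):
        transition = {"obs": step, "stage": step % num_stages}          # toy rollout data
        dense_bonus = stage_rewards[transition["stage"]](transition["obs"])
        transition["reward"] = transition["stage"] + dense_bonus        # semi-sparse + learned
        replay.append(transition)
        batch = random.sample(replay, k=min(4, len(replay)))
        for module in (policy, world_model, *stage_rewards):
            module.update(batch)
    return policy

train(demos=[{"obs": 0, "stage": 0, "reward": 0.0}])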

Multi-Stage Sparse Rewards

DEMO³ is designed for long-horizon tasks with a multi-stage structure. The agent receives semi-sparse rewards upon completing each stage. For example, in the manipulation tasks below, we break each task into three distinct sub-tasks, each with a clearly identifiable success criterion.

Stage 1: Grasp → Stage 2: Align with hole → Stage 3: Insert

Stage 1: Grasp → Stage 2: Hover over cube → Stage 3: Stack

Stage 1: Take box → Stage 2: Move box → Stage 3: Place box

Stage 1: Grab apple → Stage 2: Move to bowl → Stage 3: Place in bowl
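As a concrete illustration of such stage-wise success criteria and the resulting semi-sparse reward, here is a small, self-contained Python sketch based on the peg-insertion stages above (grasp, align with hole, insert). The state layout and thresholds are illustrative assumptions, not the benchmark's actual success conditions:

import numpy as np

STAGE_NAMES = ["grasp", "align_with_hole", "insert"]

def stage_completed(state, stage):
    # Per-stage success criteria on a toy state dictionary.
    if stage == 0:   # grasp: peg held by the gripper
        return state["grasped"]
    if stage == 1:   # align: peg within 2 cm of the hole axis
        return np.linalg.norm(state["peg_xy"] - state["hole_xy"]) < 0.02
    if stage == 2:   # insert: peg tip pushed below the hole opening
        return state["peg_z"] < state["hole_z"] + 0.01
    return False

def semi_sparse_reward(state):
    # Reward equals the number of consecutively completed stages (0 to 3),
    # so the agent is rewarded once per stage rather than only at full task success.
    reward = 0
    for k in range(len(STAGE_NAMES)):
        if not stage_completed(state, k):
            break
        reward = k + 1
    return reward

# Example: peg grasped and aligned, but not yet inserted -> reward of 2.
state = {"grasped": True,
         "peg_xy": np.array([0.40, 0.10]), "hole_xy": np.array([0.41, 0.10]),
         "peg_z": 0.15, "hole_z": 0.05}
print(semi_sparse_reward(state))  # 2

The sketch only reproduces the semi-sparse stage rewards; the learned reward model described above further densifies this signal within each stage.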

Benchmarking

Given a handful of demonstrations, our method achieves high success rates in challenging visual manipulation tasks with sparse rewards, far exceeding previous state-of-the-art methods.

Qualitative comparisons

We compare against state-of-the-art methods on three challenging manipulation benchmarks: ManiSkill3, Meta-World, and Robosuite. We additionally test humanoid manipulation tasks in the ManiSkill3 benchmark. Our method is the only one to successfully solve all tasks within the given interaction budget.

Methods compared: DEMO³ (ours), MoDem, LaNE, TD-MPC2.

Sample Efficiency

DEMO³ converges to robust solutions in fewer than 100K steps on difficult manipulation tasks and within 0.5M steps in heavily randomized environments, in some cases using fewer than five expert demonstrations.

Demonstration Efficiency

Our approach is significantly more demonstration-efficient than previous methods. We evaluate the convergence of competing techniques as the number of demonstrations decreases on highly challenging tasks. Our results show that DEMO³ is the only method capable of consistently solving tasks with extreme randomization using fewer than 10 demonstrations in under 0.5 million environment steps. Furthermore, our method scales effectively with the number of demonstrations.

Citation

If you find our work useful, please consider citing the paper as follows:

@misc{escoriza2025multistagemanipulationdemonstrationaugmentedreward,
      title={Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning},
      author={Adrià López Escoriza and Nicklas Hansen and Stone Tao and Tongzhou Mu and Hao Su},
      year={2025},
      eprint={2503.01837},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.01837},
}