In this work, we present Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a large-scale pretrained, frozen, text-conditioned diffusion model to compute a dense reward signal for policy learning. We demonstrate that TADPoLe enables zero-shot learning of policies that are flexibly and accurately conditioned on natural language inputs, across different robot configurations and environments, for both goal-achievement and continuous locomotion tasks. Furthermore, we observe that behaviors learned through TADPoLe are qualitatively more natural, as they align with priors captured by large-scale pretraining.
An illustration of the TADPoLe pipeline, which computes text-conditioned rewards for policy learning through a pretrained, frozen diffusion model. At each timestep, the subsequent frame is rendered through the environment and corrupted with a sampled Gaussian source noise vector. The diffusion model is then used to predict the source noise that was added, conditioned on a desired text prompt. The reward is designed to be large when the selected action produces frames well-aligned with the text prompt.
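To make the pipeline concrete, below is a minimal sketch of how such a per-frame reward could be computed. It assumes a hypothetical diffusion-model interface with add_noise and predict_noise methods, and an illustrative reward that favors the text-conditioned noise prediction over an unconditional one; it is a sketch under these assumptions, not the authors' released implementation or exact reward formula.

import torch

@torch.no_grad()
def tadpole_reward(frame, prompt_embedding, diffusion_model, noise_timestep=400):
    """Illustrative per-frame text-conditioned diffusion reward (not the released code)."""
    # Corrupt the rendered frame with a sampled Gaussian source noise vector.
    eps = torch.randn_like(frame)
    noisy_frame = diffusion_model.add_noise(frame, eps, noise_timestep)

    # Predict the source noise that was added, with and without text conditioning.
    eps_text = diffusion_model.predict_noise(noisy_frame, noise_timestep, cond=prompt_embedding)
    eps_uncond = diffusion_model.predict_noise(noisy_frame, noise_timestep, cond=None)

    # Assumed reward form: high when the text-conditioned prediction recovers the
    # sampled noise better than the unconditional prediction, i.e. when the frame
    # is well explained by the text prompt.
    align = torch.sum((eps - eps_uncond) ** 2) - torch.sum((eps - eps_text) ** 2)
    return align.item()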
We perform comprehensive experiments to demonstrate that TADPoLe can learn a variety of novel behaviors directly from text conditioning, across a range of robot configurations and visual environments. We apply TADPoLe to goal-achieving tasks, such as striking a particular pose, as well as to continuous locomotion tasks; we showcase these results on the Dog and Humanoid environments from the DeepMind Control Suite, both of which are known to be challenging due to their large action spaces and complex transition dynamics. We also demonstrate the performance of TADPoLe on robotic manipulation tasks from the Meta-World suite.
We showcase text-conditioned goal-reaching behaviors learned via TADPoLe in both the Dog and Humanoid environments, and compare against other text-to-reward approaches. TADPoLe successfully learns a variety of behaviors, from standing upright to doing splits to kneeling.
We further explore the ability of TADPoLe to learn continuous locomotion behaviors conditioned on natural language specifications. Such tasks are often difficult to learn purely from a static description, as there is no canonical pose or goal frame that, once reached, denotes successful completion of the task. We propose Video-TADPoLe, which leverages large-scale pretrained text-to-video generative models, as a promising direction forward, and demonstrate that it outperforms an alternative built on a language-video alignment model (ViCLIP).
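As an illustration of this direction, the following sketch shows how a text-to-video model could score a sliding window of rendered frames, reusing the hypothetical add_noise / predict_noise interface from the image-level sketch above; the window length, noise level, and reward form are assumptions for illustration, not the paper's exact design.

from collections import deque

import torch

class VideoTadpoleReward:
    """Illustrative sliding-window reward using a text-to-video diffusion model."""

    def __init__(self, video_model, prompt_embedding, window=16, noise_timestep=400):
        self.video_model = video_model
        self.prompt_embedding = prompt_embedding
        self.noise_timestep = noise_timestep
        self.frames = deque(maxlen=window)

    @torch.no_grad()
    def __call__(self, frame):
        self.frames.append(frame)
        if len(self.frames) < self.frames.maxlen:
            return 0.0  # wait until a full window of frames is available

        clip = torch.stack(list(self.frames))  # (T, C, H, W)
        eps = torch.randn_like(clip)
        noisy_clip = self.video_model.add_noise(clip, eps, self.noise_timestep)

        eps_text = self.video_model.predict_noise(noisy_clip, self.noise_timestep,
                                                  cond=self.prompt_embedding)
        eps_uncond = self.video_model.predict_noise(noisy_clip, self.noise_timestep, cond=None)

        # The whole window of motion, not a single pose, must agree with the prompt.
        align = torch.sum((eps - eps_uncond) ** 2) - torch.sum((eps - eps_text) ** 2)
        return align.item()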
We also investigate how well TADPoLe can learn robotic manipulation tasks through dense text-conditioned feedback, by replacing the manually designed ground-truth dense reward of each Meta-World task with TADPoLe's text-conditioned reward. We perform thorough comparisons between TADPoLe and VLM-RM on a diverse set of Meta-World tasks, and observe that TADPoLe provides meaningful zero-shot dense supervision that enables success across a variety of robotic manipulation tasks through text conditioning.
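For intuition on how this substitution could be made, here is a small illustrative wrapper, assuming a Gymnasium-style environment that renders RGB frames and a reward_fn such as the sketches above; Meta-World's actual interfaces may differ.

import gymnasium as gym
import torch

class TextRewardWrapper(gym.Wrapper):
    """Illustrative wrapper that swaps the task's dense reward for a text-conditioned one."""

    def __init__(self, env, reward_fn):
        super().__init__(env)
        self.reward_fn = reward_fn  # e.g. one of the reward sketches above

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # Render the resulting state and convert to a (C, H, W) float tensor.
        frame = torch.as_tensor(self.env.render()).permute(2, 0, 1).float() / 255.0
        reward = self.reward_fn(frame)  # text-conditioned reward replaces the ground truth
        return obs, reward, terminated, truncated, info

The underlying policy-learning algorithm is left unchanged in this sketch; only the source of the reward differs.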
We also explore whether TADPoLe is sensitive to subtle variations of the input prompt. Changing the conditioning phrase from "a person standing" to "a person standing with hands above head", we observe that the learned humanoid indeed endeavours to hold its hands above its head. Similarly, with the prompt "a person standing with hands on hips", the resulting policy learns a humanoid that seeks to place its hands on its hips. We take this as evidence that TADPoLe respects fine-grained details and subtleties of the input prompt when learning text-conditioned policies.
@article{luo2024text,
title={Text-Aware Diffusion for Policy Learning},
author={Luo, Calvin and He, Mandy and Zeng, Zilai and Sun, Chen},
journal={arXiv preprint arXiv:2407.01903},
year={2024}
}