Video generative models demonstrate great promise in robotics by serving as visual planners or as policy supervisors. When pretrained on internet-scale data, such video models acquire a strong alignment with natural language, and can thus facilitate generalization to novel downstream behavior through text-conditioning. However, they may not be sensitive to the specificities of the particular environment the agent inhabits. On the other hand, training video models on in-domain examples of robotic behavior naturally encodes environment-specific intricacies, but the scale of available demonstrations may not be sufficient to support generalization to unseen tasks via natural language specification. In this work, we investigate different adaptation techniques that integrate in-domain information with large-scale pretrained video models, and explore the extent to which they enable novel text-conditioned generalization for robotic tasks, while also weighing their respective data and resource requirements. We demonstrate across robotic environments that adapting powerful video models with small amounts of example data can successfully facilitate generalization to novel behaviors. In particular, we present a novel adaptation strategy, termed Inverse Probabilistic Adaptation, that not only consistently achieves strong generalization performance across robotic tasks and settings, but is also robust to the quality of the adaptation data, successfully solving novel tasks even when only suboptimal in-domain demonstrations are available.
We explore three different adaptation techniques: Subject Customization, Probabilistic Adaptation, and Direct Finetuning. Subject Customization modifies only the image and text encoders, rather than the motion module, and is lightweight in terms of data requirements: it utilizes only pairs of static images and text annotated with a special identifier. Probabilistic Adaptation learns a small in-domain model from paired video data, which is then used through score composition with a large-scale video model that is kept frozen; the small in-domain model can be flexibly parameterized to match available training resources. Direct Finetuning updates the motion module of the large-scale video model with in-domain paired video data.
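To make the score composition concrete, below is a minimal Python sketch of Probabilistic Adaptation at sampling time, assuming both models expose a standard epsilon-prediction interface. The function name composed_eps, the interpolation weighting, and the weight w are illustrative assumptions on our part, not the paper's exact arithmetic.

import torch

def composed_eps(base_model, prior_model, x_t, t, text_emb, w=0.5):
    # Compose noise predictions from a base model and a frozen prior model.
    # In Probabilistic Adaptation, the small in-domain model is the base and
    # the large-scale pretrained video model acts as the frozen prior.
    with torch.no_grad():                       # the prior contributes its score only
        eps_prior = prior_model(x_t, t, text_emb)
    eps_base = base_model(x_t, t, text_emb)     # trained on in-domain videos
    # Illustrative weighting: interpolate the two noise predictions, trading
    # off internet-scale knowledge (prior) against in-domain fidelity (base).
    return (1.0 - w) * eps_base + w * eps_prior

The composed prediction can then be plugged into any standard diffusion sampler in place of a single model's output.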
We evaluate how adapted video models can enable text-conditioned generalization via two approaches: visual planning and policy supervision. For visual planning, the adapted video model synthesizes a text-conditioned video plan of the future, which is then converted into actions for the agent to execute. For policy supervision, the adapted video model is used in a discriminative manner to evaluate frames achieved by the policy; these evaluations are converted into text-conditioned rewards, which the policy is optimized to maximize. Below, we visualize rollouts during environment interaction.
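As one concrete illustration of the policy-supervision pathway, the sketch below scores policy-achieved frames by how well the text-conditioned adapted model denoises them, using denoising error as a proxy for text alignment. The reward definition, the add_noise helper, and the epsilon-prediction interface are our own assumptions for illustration, not the paper's exact reward.

import torch

def text_conditioned_reward(video_model, frames, text_emb, t_steps):
    # Noise the frames, denoise them with the text-conditioned adapted model,
    # and reward frames that the model finds easy to denoise under this text.
    rewards = []
    for t in t_steps:
        noise = torch.randn_like(frames)
        x_t = video_model.add_noise(frames, noise, t)      # forward diffusion (assumed helper)
        eps_pred = video_model(x_t, t, text_emb)           # text-conditioned denoising
        rewards.append(-((eps_pred - noise) ** 2).mean())  # lower error -> higher reward
    return torch.stack(rewards).mean()                     # scalar reward for the policy

A policy optimizer (e.g., an off-the-shelf RL algorithm) can then maximize this scalar signal.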
We find that our proposed Inverse Probabilistic Adaptation serves as a strong adaptation technique across different task and evaluation settings, and remains robust even when only suboptimal demonstrations are available.
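Read against the composition sketch above, one plausible rendering of the inversion (an assumption on our part, not a verbatim description of the method) is to swap which model anchors generation: the frozen large-scale model becomes the base, and the small in-domain model contributes the prior term.

# Inverse Probabilistic Adaptation (sketch): invert the roles from before,
# anchoring generation on the large pretrained model while the small
# in-domain model acts as the prior.
eps = composed_eps(base_model=large_model, prior_model=small_model,
                   x_t=x_t, t=t, text_emb=text_emb, w=0.5)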
We visualize free-form video generated by adapted video models, conditioned on a novel text prompt ("a dog jumping") that was unseen during adaptation. When the adapted video model is used for policy supervision (simply as a critic that provides text-conditioned rewards), we show that it can successfully supervise a downstream Dog agent to behave according to this novel text specification in a zero-shot manner.
@inproceedings{luo2025solving,
  title={Solving New Tasks by Adapting Internet Video Knowledge},
  author={Calvin Luo and Zilai Zeng and Yilun Du and Chen Sun},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}