Self-Adapting Improvement Loops for Robotic Learning

1Brown University, 2Harvard University

Abstract

Video generative models trained on expert demonstrations have been utilized as performant text-conditioned visual planners for solving robotic tasks. However, generalization to unseen tasks remains a challenge. While generalization can be improved by leveraging learned prior knowledge from additional pre-collected offline data sources, such as web-scale video datasets, in the era of experience we aim to design agents that can continuously improve online from self-collected behaviors. In this work we thus propose the Self-Adapting Improvement Loop (SAIL), where an in-domain video model iteratively updates itself on self-produced trajectories, collected through adaptation with an internet-scale pretrained video model, and steadily improves its performance on a specified task of interest. We apply SAIL to a diverse suite of MetaWorld tasks, as well as two manipulation tasks on a real robot arm, and find that performance improvements continuously emerge over multiple iterations for novel tasks initially unseen during original in-domain video model training. Furthermore, we find that SAIL is surprisingly robust to whether and how the self-collected experience is filtered, and to the quality of the initial in-domain demonstrations. Through adaptation with summarized internet-scale data and learning from online experience, we thus demonstrate a way to iteratively bootstrap a high-performance video model for solving novel robotic tasks through self-improvement.

SAIL Framework

SAIL utilizes a visual planner composed of two video generative models: one pretrained generally on internet-scale data and another pretrained on a general set of in-domain demonstrations. SAIL iteratively improves the performance of this adapted visual planner by finetuning the in-domain model on its own self-collected experience. In this way, SAIL effectively combines offline data with online experience into a self-adapting improvement cycle that iteratively bootstraps an in-domain video model into a strong visual planner for a particular task of interest.
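The cycle described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' actual implementation: the class and function names (`InDomainVideoModel`, `compose_and_sample`, `execute_plan`) are hypothetical placeholders, and the adapted-sampling and execution steps are stubbed out.

```python
class InDomainVideoModel:
    """Stand-in for the in-domain video generative model."""
    def __init__(self):
        self.finetune_calls = 0

    def finetune(self, experience):
        # Placeholder for gradient updates on self-collected trajectories.
        self.finetune_calls += 1


def compose_and_sample(in_domain_model, pretrained_model, prompt):
    # Placeholder for adapted sampling: the in-domain model is combined
    # with the internet-pretrained model to produce a visual plan.
    return f"plan for: {prompt}"


def execute_plan(plan):
    # Placeholder for rolling out the plan on the robot or simulator.
    return {"plan": plan, "success": True}


def sail_loop(in_domain_model, pretrained_model, task_prompt,
              n_iterations=3, rollouts_per_iter=4, filter_fn=None):
    """One SAIL run: plan, execute, (optionally) filter, finetune, repeat."""
    for _ in range(n_iterations):
        experience = [
            execute_plan(compose_and_sample(in_domain_model,
                                            pretrained_model,
                                            task_prompt))
            for _ in range(rollouts_per_iter)
        ]
        # Optional experience filtering; SAIL is reported to be
        # surprisingly robust even without it.
        if filter_fn is not None:
            experience = [t for t in experience if filter_fn(t)]
        # Self-adapting step: only the in-domain model is updated.
        in_domain_model.finetune(experience)
    return in_domain_model
```

Note that only the in-domain model is finetuned across iterations; the internet-pretrained model is kept frozen and reused purely as a source of general prior knowledge.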

Experiments

We evaluate SAIL in two main robot settings: a real-world Franka Emika Panda robot arm and the MetaWorld-v2 simulated environment. We utilize the Panda arm for two distinct tasks, pushing a colored cup and opening a colored drawer, where generalization is evaluated over initially unseen colors; for MetaWorld, generalization is across novel tasks with their own visual settings. We note that for all the tasks visualized below, the video model used for visual planning had never seen any successful demonstrations during initial pretraining: all iterative performance gains arise from utilizing self-collected experience through SAIL.

Visual Planning with SAIL

SAIL is able to iteratively improve success rate on novel robotic tasks, specified in natural language, for which it has never seen initial demonstrations. This is facilitated by effective utilization of not only offline data (internet video datasets and in-domain demonstrations on other tasks) but also online self-collected experience through the SAIL framework.
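A single planning-and-execution step can be sketched as follows. This is an assumption-laden illustration: the names `generate_plan` and `inverse_dynamics` are hypothetical, and recovering actions from consecutive predicted frames via an inverse-dynamics model is a common design for video planners, not a confirmed detail of this system.

```python
def generate_plan(observation, prompt, horizon=4):
    # Stand-in for sampling a text-conditioned video plan from the
    # adapted planner; here "frames" are just integers for illustration.
    return [observation + i for i in range(horizon + 1)]


def inverse_dynamics(frame, next_frame):
    # Stand-in for a model that recovers the action taking the robot
    # from one predicted frame to the next.
    return next_frame - frame


def visual_plan_and_act(observation, prompt):
    """Generate a video plan, then convert it into executable actions."""
    frames = generate_plan(observation, prompt)
    return [inverse_dynamics(f0, f1) for f0, f1 in zip(frames, frames[1:])]
```

Executed on the robot, these actions produce the trajectory that SAIL later feeds back into finetuning.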
(Videos: visual plan and environment execution.)

SAIL without Experience Filtering

We explore the performance of SAIL without experience filtering, and find that SAIL is still able to iteratively improve on certain tasks. This highlights that SAIL can leverage self-collected experience robustly with respect to behavior quality, enabling scalable self-improvement, since filtering often requires some level of human intervention or carefully designed heuristics.
(Videos: visual plan and environment execution.)

SAIL with Suboptimal Data Initialization

We investigate the performance of SAIL when only suboptimal demonstration data is available, and discover that for certain tasks performance indeed increases over iterations without utilizing any initial expert demonstrations or data filtering. This highlights how SAIL can potentially enable cheaper task-specific learning, as the visual planner learns primarily from its own experience rather than relying on expert data collection, which can be an expensive procedure.
(Videos: visual plan and environment execution.)

Quantitative Results Across SAIL Iterations

We highlight the importance of utilizing large-scale internet video knowledge in the SAIL framework across multiple robot settings and tasks. We compare SAIL against iteratively fine-tuning only an in-domain video model on self-collected experience, and find that under the same computational and resource budget the baseline's success rate struggles to increase; in many cases, performance instead iteratively decreases in the absence of adaptation with large-scale video knowledge. SAIL therefore effectively combines offline in-domain and internet-domain video knowledge with online experience to facilitate iterative improvement on initially unseen robotic tasks.

(Figure: quantitative success-rate comparison across SAIL iterations.)

BibTeX

@article{luo2025self,
  title={Self-Adapting Improvement Loops for Robotic Learning},
  author={Luo, Calvin and Zeng, Zilai and Jia, Mingxi and Du, Yilun and Sun, Chen},
  journal={arXiv preprint arXiv:2506.06658},
  year={2025}
}