Self-Adapting Improvement Loops for Robotic Learning

1Brown University, 2Harvard University

Abstract

Video generative models trained on expert demonstrations have been utilized as performant text-conditioned visual planners for solving robotic tasks. However, generalization to unseen tasks remains a challenge. While generalization can be improved by leveraging learned prior knowledge from additional pre-collected offline data sources, such as web-scale video datasets, in the era of experience we aim to design agents that can continuously improve online from self-collected behaviors. In this work we thus propose the Self-Adapting Improvement Loop (SAIL), where an in-domain video model iteratively updates itself on self-produced trajectories, collected through adaptation with an internet-scale pretrained video model, and steadily improves its performance on a specified task of interest. We apply SAIL to a diverse suite of MetaWorld tasks, as well as two manipulation tasks on a real robot arm, and find that performance improvements continuously emerge over multiple iterations for novel tasks initially unseen during original in-domain video model training. Furthermore, we find that SAIL is surprisingly robust to whether and how the self-collected experience is filtered, and to the quality of the initial in-domain demonstrations. Through adaptation with summarized internet-scale data and learning from online experience, we thus demonstrate a way to iteratively bootstrap a high-performance video model for solving novel robotic tasks through self-improvement.

SAIL Framework

SAIL utilizes a visual planner composed of two video generative models: one pretrained generally on internet-scale data and another pretrained on a general set of in-domain demonstrations. SAIL iteratively improves the performance of this adapted visual planner by finetuning the in-domain model on its own self-collected experience. In this way, SAIL effectively combines offline data with online experience into a self-adapting improvement cycle that iteratively bootstraps an in-domain video model into a strong visual planner for a particular task of interest.
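The cycle described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' actual implementation: the class and function names (`InDomainVideoModel`, `compose_and_sample`, `execute_plan`) are hypothetical placeholders, and the adapted-sampling and execution steps are stubbed out.

```python
class InDomainVideoModel:
    """Stand-in for the in-domain video generative model."""
    def __init__(self):
        self.finetune_calls = 0

    def finetune(self, experience):
        # Placeholder for gradient updates on self-collected trajectories.
        self.finetune_calls += 1


def compose_and_sample(in_domain_model, pretrained_model, prompt):
    # Placeholder for adapted sampling: the in-domain model is combined
    # with the internet-pretrained model to produce a visual plan.
    return f"plan for: {prompt}"


def execute_plan(plan):
    # Placeholder for rolling out the plan on the robot or simulator.
    return {"plan": plan, "success": True}


def sail_loop(in_domain_model, pretrained_model, task_prompt,
              n_iterations=3, rollouts_per_iter=4, filter_fn=None):
    """One SAIL run: plan, execute, (optionally) filter, finetune, repeat."""
    for _ in range(n_iterations):
        experience = [
            execute_plan(compose_and_sample(in_domain_model,
                                            pretrained_model,
                                            task_prompt))
            for _ in range(rollouts_per_iter)
        ]
        # Optional experience filtering; SAIL is reported to be
        # surprisingly robust even without it.
        if filter_fn is not None:
            experience = [t for t in experience if filter_fn(t)]
        # Self-adapting step: only the in-domain model is updated.
        in_domain_model.finetune(experience)
    return in_domain_model
```

Note that only the in-domain model is finetuned across iterations; the internet-pretrained model is kept frozen and reused purely as a source of general prior knowledge.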

Experiments

We evaluate SAIL in two main robot settings: a real-world Franka Emika Panda robot arm and the MetaWorld-v2 simulated environment. We utilize the Panda arm for two distinct tasks, pushing a colored cup and opening a colored drawer, where generalization is evaluated over initially unseen colors; for MetaWorld, generalization is across novel tasks with their own visual settings. We note that for all the tasks visualized below, the video model used for visual planning had never seen any successful demonstrations during initial pretraining: all iterative performance gains arise from utilizing self-collected experience through SAIL.

Visual Planning with SAIL

SAIL is able to iteratively improve success rate on novel robotic tasks, specified in natural language, for which it has never seen initial demonstrations. This is facilitated by effective utilization of not only offline data (internet video datasets and in-domain demonstrations on other tasks) but also online self-collected experience through the SAIL framework.
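A single planning-and-execution step can be sketched as follows. This is an assumption-laden illustration: the names `generate_plan` and `inverse_dynamics` are hypothetical, and recovering actions from consecutive predicted frames via an inverse-dynamics model is a common design for video planners, not a confirmed detail of this system.

```python
def generate_plan(observation, prompt, horizon=4):
    # Stand-in for sampling a text-conditioned video plan from the
    # adapted planner; here "frames" are just integers for illustration.
    return [observation + i for i in range(horizon + 1)]


def inverse_dynamics(frame, next_frame):
    # Stand-in for a model that recovers the action taking the robot
    # from one predicted frame to the next.
    return next_frame - frame


def visual_plan_and_act(observation, prompt):
    """Generate a video plan, then convert it into executable actions."""
    frames = generate_plan(observation, prompt)
    return [inverse_dynamics(f0, f1) for f0, f1 in zip(frames, frames[1:])]
```

Executed on the robot, these actions produce the trajectory that SAIL later feeds back into finetuning.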
(Videos: visual plan and environment execution.)

SAIL without Experience Filtering

We explore the performance of SAIL without experience filtering, and find that SAIL is still able to iteratively improve on certain tasks. This highlights that SAIL can leverage self-collected experience robustly with respect to behavior quality, enabling scalable self-improvement, since filtering often requires some level of human intervention or carefully designed heuristics.
(Videos: visual plan and environment execution.)

SAIL with Suboptimal Data Initialization

We investigate the performance of SAIL when only suboptimal demonstration data is available, and discover that for certain tasks performance indeed increases over iterations without utilizing any initial expert demonstrations or data filtering. This highlights how SAIL can potentially enable cheaper task-specific learning, as the visual planner learns primarily from its own experience rather than relying on expert data collection, which can be an expensive procedure.
(Videos: visual plan and environment execution.)

Quantitative Results Across SAIL Iterations

We highlight the importance of utilizing large-scale internet video knowledge in the SAIL framework across multiple robot settings and tasks. We compare SAIL against iteratively fine-tuning only an in-domain video model on self-collected experience, and find that under the same computational and resource budget the baseline's success rate struggles to increase; in many cases, performance instead iteratively decreases in the absence of adaptation with large-scale video knowledge. SAIL therefore effectively combines offline in-domain and internet-domain video knowledge with online experience to facilitate iterative improvement on initially unseen robotic tasks.

(Figure: quantitative success-rate comparison across SAIL iterations.)

BibTeX

@article{luo2025self,
  title={Self-Adapting Improvement Loops for Robotic Learning},
  author={Luo, Calvin and Zeng, Zilai and Jia, Mingxi and Du, Yilun and Sun, Chen},
  journal={arXiv preprint arXiv:2506.06658},
  year={2025}
}