When Learning Is Out of Reach, Reset:
Generalization in Autonomous Visuomotor Reinforcement Learning

PRIOR @ Allen Institute for AI
Episodic, Reset-Free, and Reset-Minimizing RL. In standard (i.e. episodic) reinforcement learning (RL), agents have their environments reset after every success or failure, an expensive operation in the real world. In Reset-Free RL (RF-RL), researchers have designed "reset games" which allow for learning so long as special care is taken to avoid irreversible transitions (e.g. an apple falling out of reach). We consider Reset-Minimizing RL (RM-RL) where, in realistic and dynamic environments, agents may request human interventions but should minimize these requests.


      Episodic training, where an agent's environment is reset to some initial condition after every success or failure, is the de facto standard when training embodied reinforcement learning (RL) agents. The underlying assumption that the environment can be easily reset is limiting both practically, as resets generally require human effort in the real world and can be computationally expensive in simulation, and philosophically, as we'd expect intelligent agents to be able to continuously learn without external intervention. Work in learning without any resets, i.e. Reset-Free RL (RF-RL), is very promising but is plagued by the problem of irreversible transitions (e.g. an object breaking or falling out of reach) which halt learning. Moreover, the limited state diversity and instrument setup encountered during RF-RL means that works studying RF-RL largely do not require their models to generalize to new environments.
     In this work, we instead look to minimize, rather than completely eliminate, resets while building visual agents that can meaningfully generalize. As studying generalization has previously not been a focus of benchmarks designed for RF-RL, we propose a new Stretch Pick-and-Place (Stretch-P&P) benchmark designed for evaluating generalization across goals, cosmetic variations, and structural changes. Moreover, towards building performant reset-minimizing RL agents, we propose unsupervised metrics to detect irreversible transitions and a single-policy training mechanism to enable generalization. Our proposed approach significantly outperforms prior episodic, reset-free, and reset-minimizing approaches, achieving higher success rates with fewer resets in Stretch-P&P and another popular RF-RL benchmark. Finally, we find that our proposed approach can dramatically reduce the number of resets required for training other embodied tasks. In particular, for RoboTHOR ObjectNav we obtain higher success rates than episodic approaches using 99.97% fewer resets.


Overview of the proposed Stretch-P&P and other experiment environments. Here we show the training and evaluation configurations for our Stretch-P&P, Sawyer Peg, and RoboTHOR ObjectNav tasks (from left to right). During training (blue panels), the agent observes: (Stretch-P&P) as few as one household object and one container, depending on the allowed object budget, (Sawyer Peg) exactly one type of stationary box with a goal hole at its upper center, and (ObjectNav) a limited set of house structures. During evaluation (yellow panels), we require the agent to generalize to: (Stretch-P&P) novel cosmetic changes, novel object instances, and a combination of the above along with other cosmetic and structural background changes, (Sawyer Peg) novel box and hole positions (the hole position is highlighted in green here only for visualization purposes), and (ObjectNav) fully unseen house structures.


Two fundamental problems when attempting to build generalizable agents in the reset-free setting are irreversible transitions and limited state diversity. As it is often impractical or impossible to guarantee that an agent does not undergo any irreversible transitions during training, we present measures that we use to quantify when an agent has undergone such a transition. Next, we take a first step towards building generalizable RM-RL agents; in particular, we propose to do away with the learned forward-backward policies popular in prior RF-RL work and, instead, learn a single policy which is presented with randomly generated goals during training.

Existence of Near Irreversible States

Some irreversible transitions are explicit, e.g. a glass is dropped and shatters. However, in a more complex real-world environment, they may be more subtle. For instance, when a robot is tasked with cleaning a room, it may encounter situations where some trash is accidentally pushed or blown into hard-to-reach locations, such as under a sofa or in the corners of the room. In such cases, the robot may find it challenging, but not strictly impossible, to pick up or sweep debris back using its regular cleaning tools. We refer to these states that are difficult, but not impossible, to recover from as near-irreversible (NI) states.

Reversible and (Near) Irreversible States in Stretch-P&P. Left: Reversible (top right): the target apple is within easy reaching distance. Irreversible (bottom right): the apple has fallen off the table; as the agent cannot rotate in Stretch-P&P, the apple can no longer be reached. Near-irreversible (left): the apple is in a tricky-to-reach location, either behind other objects or at the extreme limits of the arm's reaching capabilities. Right: Visualizing successful and failed object trajectories in Stretch-P&P during training. Notice that the object occupies many diverse states and can fall off of the table or roll away from the agent.

Reversible and (Near) Irreversible States in Sawyer Peg. Point cloud visualizations for evaluations on the narrower and the normal-sized table, where red points indicate the object (peg) head positions for failed trajectories and green points indicate those for successful ones. For the final checkpoint (last two figures), evaluations in the two settings yield exactly the same performance and rollout trajectories. At 300k steps (first two figures), however, the same policy leads to different outcomes: the peg always drops off the narrower table but mostly remains on the edge of the normal-sized table.

Measures of Irreversibility

Suppose that an agent has taken $t$ steps, producing the trajectory $\mathcal{T}_{t}=\{\tau_\pi(0), \ldots, \tau_\pi(t)\}$. Intuitively, undergoing an NI transition should correspond to a decrease in the degrees of freedom available to the agent to manipulate its environment: that is, if an agent underwent an NI transition at timestep $i$, then the diversity of states $\tau_{\pi}(i+1),\ldots, \tau_{\pi}(t)$ should be small compared to the diversity before the transition. Further, the set of near-irreversible states also depends on the agent's policy $\pi$: which states should be considered near-irreversible can, and should, change during training.

To formalize this, we count the segments of the trajectory that span at least $W$ steps and whose state diversity falls below a threshold $\alpha$. We call this count $\phi_{W,\alpha,d}(\mathcal{T}_t)$ and compute it as \[ \max_{(i_0, \ldots, i_m)\in P(t)}\sum_{j=0}^{m-1}1_{[i_{j+1}-i_{j} \geq W]} \cdot 1_{\{ d\left(\tau_\pi(i_j), \ldots, \tau_\pi(i_{j+1}-1)\right) < \alpha\}} \] where the maximum ranges over the index sequences $(i_0, \ldots, i_m)$ in $P(t)$ that split the trajectory into consecutive segments, and $d:\mathcal{S}^H\to \mathbb{R}_{\geq0}$ is some non-negative measure of diversity among states. As $\phi_{W,\alpha,d}$ is a counting function, we can turn it into a decision function simply by picking some count $N>0$ and deciding to reset when $\phi_{W,\alpha,d}\geq N$. In particular, we will let $\Phi_{W,N,\alpha,d}$ be the function that equals 1 if and only if $\phi_{W,\alpha,d}\geq N$.

In our experiments we evaluate several diversity measures $d(s_1,\ldots, s_H)$ including: (1) dispersion-based methods using an empirical measure of entropy or the mean standard deviation of the $s_i$, and (2) distance-based methods using Euclidean distance or dynamic time warping (DTW). While we find surprisingly robust performance when varying $d$, we expect that there is no single best choice of diversity measure for all tasks. See our paper for more details.
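As a rough illustration of the counting function, the sketch below computes $\phi_{W,\alpha,d}$ with a simple dynamic program and turns it into the reset decision $\Phi_{W,N,\alpha,d}$. It is a minimal sketch under stated assumptions, not the implementation used in the paper: it reads $W$ as the minimum segment length, lets the partition split the trajectory at arbitrary indices, and uses the mean per-dimension standard deviation as the diversity measure $d$. The function names (`mean_std_diversity`, `count_low_diversity_segments`, `should_reset`) are hypothetical.

```python
import numpy as np


def mean_std_diversity(segment):
    """Diversity d: mean per-dimension standard deviation of a (length, dim) array."""
    return float(np.std(segment, axis=0).mean())


def count_low_diversity_segments(states, W, alpha, d=mean_std_diversity):
    """Maximum number of disjoint consecutive segments that have length >= W
    and diversity < alpha (an assumed reading of phi_{W,alpha,d})."""
    states = np.asarray(states, dtype=float)
    if states.ndim == 1:
        states = states[:, None]  # treat scalars as 1-D states
    n = len(states)
    # best[k] = best count achievable using only the first k states
    best = [0] * (n + 1)
    for k in range(1, n + 1):
        best[k] = best[k - 1]  # state k-1 lies in a segment that is not counted
        for j in range(0, k - W + 1):  # candidate segment states[j:k], length >= W
            if best[j] + 1 > best[k] and d(states[j:k]) < alpha:
                best[k] = best[j] + 1
    return best[n]


def should_reset(states, W, N, alpha, d=mean_std_diversity):
    """Decision function Phi: request a reset once the count reaches N."""
    return count_low_diversity_segments(states, W, alpha, d) >= N


# Toy trajectory: the object moves for five steps, then is stuck for six steps.
trajectory = [0.0, 1.0, 2.0, 3.0, 4.0] + [10.0] * 6
count = count_low_diversity_segments(trajectory, W=3, alpha=0.1)
print(count, should_reset(trajectory, W=3, N=2, alpha=0.1))
```

The quadratic loop is chosen for readability; a practical version would likely evaluate the measure incrementally as new states arrive.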

RM-RL with a Single Policy

We aim to use a single policy to achieve RM-RL that can adapt to general embodied tasks. Recall the objective for a goal-conditioned POMDP in traditional episodic RL: \[\pi^\star = \arg\max_\pi J(\pi\mid g) = \arg\max_\pi \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t\mid g)\right]\] In forward-backward RL (FB-RL), the "forward" goal space is normally defined as a singleton $\mathcal{G}_{f} = \{g^\star\}$ for the target task goal $g^\star$ (e.g. the apple is on the plate, the peg is inserted into the hole, etc.). The goal space for the "backward" phase is then the (generally limited) initial state space $\mathcal{G}_{b} = \mathcal{I} \subset \mathcal{S}$ such that $\mathcal{G}_f\cap \mathcal{G}_b=\emptyset$. As the goal spaces in FB-RL are disjoint and asymmetric, it is standard for separate forward/backward policies (with separate parameters) and even different learning objectives to be used when training FB-RL agents. In our setting, however, there is only a single goal space which, in principle, equals the entire state space excluding the states we detect as being NI (i.e., $\mathcal{G} = \mathcal{S} \setminus \{s_t\mid s_t\in\tau_{\pi} (t) \in \mathcal{T}_\pi, \Phi_{W,N,\alpha,d} (\mathcal{T}_\pi) = 1\}$). In our training setting, we call each period between goal switches a phase and, when formulating our learning objectives, treat these phases as separate "episodes", as in episodic approaches.
In our RM-RL setting, we evaluate both sample efficiency and intervention efficiency: an agent or algorithm is considered better when it requires fewer training steps and fewer human interventions, i.e. it should minimize both \[ \sum_{t=0}^\infty J(\pi^\star) - J(\pi_t), \quad \sum_{k=0}^\infty J(\pi^\star) - J(\pi_k) \] where $\pi_t$ and $\pi_k$ are the policies learned after $t$ steps and $k$ interventions respectively, while also generalizing meaningfully.
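To make the two efficiency criteria concrete, the sketch below computes a truncated version of each sum from logged performance values. It is only an illustration under stated assumptions: the infinite sums are cut off at the last logged checkpoint, the value of $J(\pi^\star)$ is taken as known, and all numbers and names (`cumulative_regret`, `returns_by_step`, `returns_by_intervention`) are hypothetical rather than taken from the experiments.

```python
def cumulative_regret(optimal_return, returns):
    """Truncated sum of J(pi*) - J(pi_i) over the logged checkpoints."""
    return sum(optimal_return - r for r in returns)


# Hypothetical logs: J(pi_t) indexed by training step and
# J(pi_k) indexed by the number of interventions requested so far.
returns_by_step = [0.0, 0.25, 0.5, 0.75, 1.0]
returns_by_intervention = [0.0, 0.5, 1.0]

sample_regret = cumulative_regret(1.0, returns_by_step)
intervention_regret = cumulative_regret(1.0, returns_by_intervention)
print(sample_regret, intervention_regret)
```

A lower first value indicates better sample efficiency and a lower second value indicates better intervention efficiency; a method can do well on one and poorly on the other, which is why both are tracked.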


We consider the following three classes of baseline training strategies.

  • (1) Ours (Random $+$ NI Measure). Our method using our single-policy random-target training strategy with resets being requested based on our unsupervised irreversibility measure.
  • (2) Periodical resets ($+$ random goals). Perhaps the simplest strategy for deciding when to request resets is to request one after a fixed number of steps. Our periodical reset baselines do precisely this and are labeled simply as "$N$ steps/reset" where $N$ is some positive integer; here $N$ is generally set to be somewhat (or much) larger than in the standard episodic setting. Note that, in principle, there are no irreversible states in ObjectNav, as the task merely involves navigating around an otherwise static environment. For this reason, we also include a baseline trained without any resets beyond those used to initialize the environment, i.e. $N=\infty$, and a baseline trained in the episodic setting just as in prior work.
  • (3) FBRL $+$ GT. Here we implement the popular two-policy forward-backward training strategy in existing work. Inspired by PAINT, which learns a classifier trained on ground-truth irreversibility labels to request resets, we will use an oracle version of this method and reset the environment whenever the agent enters one of a fixed collection of hand-labeled irreversible states (e.g. target object has fallen off the table).
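The three strategies differ mainly in the rule that triggers a reset request. The sketch below writes each rule as a small predicate; it is a simplified illustration, and the function names, arguments, and the example state label are hypothetical rather than part of any released code.

```python
import math


def measure_reset(low_diversity_count, N):
    """(1) Ours: reset once the unsupervised irreversibility count reaches N."""
    return low_diversity_count >= N


def periodic_reset(steps_since_reset, period):
    """(2) Periodical resets: reset every `period` steps.
    Passing math.inf models the baseline that never requests a reset."""
    return steps_since_reset >= period


def oracle_reset(state_label, labeled_irreversible):
    """(3) FBRL + GT: reset when the state is in a hand-labeled irreversible set."""
    return state_label in labeled_irreversible


# Hypothetical usage with made-up values.
print(measure_reset(low_diversity_count=2, N=2))
print(periodic_reset(steps_since_reset=10_000, period=math.inf))
print(oracle_reset("object_off_table", {"object_off_table"}))
```

Only the third rule needs ground-truth labels; the first relies solely on the agent's own trajectory, and the second ignores the state entirely.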

Evaluation Results

In our experiments, we look to answer a number of questions related to:

  • (1) the importance of resets for learning,
  • (2) the efficacy of our proposed methodological contributions (unsupervised irreversibility detection and single-policy RM-RL) in reducing the number of resets required for learning and enabling out-of-distribution generalization,
  • (3) how our methods may be applied more generally to existing embodied tasks, and
  • (4) how performance varies with different irreversibility measures and object budgets.

We show the training performance of our method versus other competing baselines in Stretch-P&P with an object budget of $1$. We see that our method is far more efficient in its use of resets: it achieves high success rates more consistently and with far fewer resets than the periodical reset models. Our method is also more efficient in terms of training steps. This suggests that our irreversibility measure can consistently and accurately identify time points where a reset will be highly valuable for learning.

Method Comparisons for Stretch-P&P. Evaluation results testing the different facets of agent generalization proposed in the Benchmark section: Position out-of-distribution (Pos-OoD), Visual out-of-distribution (Vis-OoD), Object out-of-distribution (Obj-OoD), and All out-of-distribution (All-OoD).

Sawyer Peg
Method Comparisons for Sawyer Peg. In Sawyer Peg, we also observe a performance drop for FB-RL, as shown in the bottom row of the figure above. The drop is somewhat smaller than in Stretch-P&P; we attribute this to the smaller state space of Sawyer Peg, which makes generalization somewhat easier.

Method      Success (50M)   SPL (50M)   Resets (50M)   Success (100M)   SPL (100M)   Resets (100M)
Ours        0.216           0.131       592            0.551            0.275        635
H=300       0.334           0.166       24k            0.355            0.167        1M
H=10k       0.246           0.134       5k             0.418            0.218        10k
H=$\infty$  0.206           0.141       60             0.339            0.178        60
EmbCLIP     0.431           0.204       1M             0.504            0.234        2M

Method Comparisons for RoboTHOR ObjectNav. We show our initial results for the ObjectNav task in autonomous RL as a training curve (left) and an evaluation table (right). After 100M steps with only 635 resets, we achieve success rates higher than all competing baselines, even though the next best performing baseline uses 2M resets.
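The reduction in resets can be checked directly from the reset counts at 100M steps. The short sketch below does that arithmetic, assuming the table's rounded "2M" for the episodic EmbCLIP baseline means roughly 2,000,000 resets; the function name `reset_reduction` is hypothetical.

```python
def reset_reduction(ours, baseline):
    """Fraction of the baseline's resets that our method avoids."""
    return 1.0 - ours / baseline


# Reset counts at 100M steps: 635 for our method, about 2M for EmbCLIP (rounded).
reduction = reset_reduction(635, 2_000_000)
print(f"{reduction:.2%}")  # 99.97%
```

Because the baseline count is rounded, the result should be read as approximate.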

Ablation on Measurements and Budgets

Stretch-P&P irreversibility measurement and object budget ablations. Our measurement-determined irreversibility intervention method is relatively robust to the selection of diversity measure (dispersion-based: Std, Ent, and distance-based: L2, Dtw). Interestingly, using an object budget of 2 or 4 results in lower performance than using a budget of 1, while providing slightly better results on unseen objects and scenes.

Sawyer Peg irreversibility measurement ablations. As in Stretch-P&P, our measurement-determined irreversibility intervention method is relatively robust to the selection of diversity measure. All of our proposed baselines achieve consistently high performance in training, in-domain evaluation, and novel box evaluation. Moreover, they achieve this high performance within $\approx$100 resets in total and converge in around 1M steps.


In this work we study the problem of training Reset-Minimizing Reinforcement Learning (RM-RL) agents within visually complex environments which can generalize to novel cosmetic and structural changes during evaluation. We design the Stretch-P&P benchmark to study this problem and find that two methodological contributions, unsupervised irreversible transition detection and a single-policy random-goal training strategy, allow agents to learn with fewer resets and generalize better than competing baselines. In future work we look to further explore the implications of our irreversible transition detection methods for improving RM-RL methods and for building models that can ask for help during evaluation. We also leave as future work the question of how to design and balance penalties for visiting unexpected NI states (with labels provided by our method), as such penalties may conflict with encouraging exploration.


  title   = {When Learning Is Out of Reach, Reset: Generalization in Autonomous Visuomotor Reinforcement Learning},
  author  = {Zichen Zhang and Luca Weihs},
  year    = {2023},
  journal = {arXiv preprint arXiv:2303.17600},