Episodic training, where an agent's environment is reset to some initial condition after every success or failure,
is the de facto standard when training embodied reinforcement learning (RL) agents. The underlying assumption that the
environment can be easily reset is limiting both practically, as resets generally require human effort in the real world
and can be computationally expensive in simulation, and philosophically, as we'd expect intelligent agents to be able to
continuously learn without external intervention. Work in learning without any resets, i.e. Reset-Free RL (RF-RL),
is very promising but is plagued by the problem of irreversible transitions (e.g. an object breaking or falling out of reach)
which halt learning. Moreover, the limited state diversity and instrumented setups encountered during RF-RL mean that works
studying RF-RL largely do not require their models to generalize to new environments.
In this work, we instead look to minimize, rather than completely eliminate, resets while building visual agents that can meaningfully
generalize. As studying generalization has previously not been a focus of benchmarks designed for RF-RL, we propose a new
Stretch Pick-and-Place (Stretch-P&P) benchmark
designed for evaluating generalization across goals, cosmetic variations, and structural changes.
Moreover, towards building performant reset-minimizing RL agents, we propose unsupervised metrics to detect irreversible
transitions and a single-policy training mechanism to enable generalization. Our proposed approach significantly
outperforms prior episodic, reset-free, and reset-minimizing approaches, achieving higher success rates with fewer
resets in Stretch-P&P and another popular RF-RL benchmark. Finally, we find that
our proposed approach can dramatically reduce the number of resets required when training agents for other embodied tasks;
in particular, for RoboTHOR ObjectNav we obtain higher success rates than episodic approaches while using 99.97% fewer resets.
Two fundamental problems when attempting to build generalizable agents in the reset-free setting are irreversible transitions and limited state diversity. As it is often impractical or impossible to guarantee that an agent does not undergo any irreversible transitions during training, we present measures that we use to quantify when an agent has undergone such a transition. Next, we take a first step towards building generalizable RM-RL agents; in particular, we propose to do away with the learned forward-backward policies popular in prior RF-RL work and, instead, learn a single policy which is presented with randomly generated goals during training.
Some irreversible transitions are explicit, e.g. a glass is dropped and shatters.
However, in a more complex real-world environment, they may be more subtle. For instance,
when a robot is tasked with cleaning a room, it may encounter situations where some
trash is accidentally pushed or blown into hard-to-reach locations, such as under a sofa or
in the corners of the room. In such cases, the robot may find it challenging, but not strictly
impossible, to pick up or sweep debris back using its regular cleaning tools. We refer to these
states that are difficult, but not impossible, to recover from as near-irreversible (NI) states.
Suppose that the agent has taken $t$ steps producing the trajectory $\mathcal{T}_{t}=\{\tau_\pi(0), \ldots, \tau_\pi(t)\}$. Intuitively, undergoing an NI transition should correspond to a decrease in the degrees of freedom available to the agent to manipulate its environment: that is, if an agent underwent an NI transition at timestep $i$, then the diversity of states $\tau_{\pi}(i+1),\ldots, \tau_{\pi}(t)$ should be small compared to the diversity before undergoing the transition. Further, the set of near-irreversible states also depends on the agent's policy $\pi$: which states should be considered near-irreversible can, and should, change during training. To formalize this, we count low-diversity stretches of the trajectory with the function $\phi_{W,\alpha,d}(\mathcal{T}_t)$, defined as \[ \max_{(i_0, \ldots, i_m)\in P(t)}\sum_{j=0}^{m-1}1_{[i_{j+1}-i_{j} \geq W]} \cdot 1_{[ d\left(\tau_\pi(i_j), \ldots, \tau_\pi(i_{j+1}-1)\right) < \alpha]} \] where the maximum is over increasing index sequences $(i_0, \ldots, i_m)$ in $P(t)$ that partition the trajectory into contiguous windows $\tau_\pi(i_j), \ldots, \tau_\pi(i_{j+1}-1)$, and $d:\mathcal{S}^H\to \mathbb{R}_{\geq0}$ is some non-negative measure of diversity among states. As $\phi_{W,\alpha,d}$ is a counting function, we can turn it into a decision function simply by picking some count $N>0$ and deciding to reset when $\phi_{W,\alpha,d}\geq N$. In particular, we let $\Phi_{W,N,\alpha,d}$ be the function that equals 1 if and only if $\phi_{W,\alpha,d}\geq N$. In our experiments we evaluate several diversity measures $d(s_1,\ldots, s_H)$ including: (1) dispersion-based measures using an empirical estimate of entropy or the mean standard deviation of the $s_i$, and (2) distance-based measures using Euclidean distance or dynamic time warping (DTW). While we find surprisingly robust performance when varying $d$, we expect that there is no single best choice of diversity measure for all tasks. See our paper for more details.
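To make the detection procedure concrete, below is a minimal sketch (not our actual implementation) of a simplified variant of $\Phi_{W,N,\alpha,d}$ that scans the trajectory in fixed, non-overlapping windows of length $W$ rather than maximizing over all partitions, counts windows whose diversity falls below $\alpha$, and requests a reset once that count reaches $N$. The diversity functions mirror the dispersion- and distance-based measures described above; all names are illustrative.

```python
import numpy as np

def std_diversity(states: np.ndarray) -> float:
    """Dispersion-based diversity: mean per-dimension standard deviation
    of the states in a window (states has shape [W, state_dim])."""
    return float(np.mean(np.std(states, axis=0)))

def euclidean_diversity(states: np.ndarray) -> float:
    """Distance-based diversity: mean Euclidean distance of each state
    in the window from the window's first state."""
    return float(np.mean(np.linalg.norm(states - states[0], axis=1)))

def phi(trajectory: np.ndarray, W: int, alpha: float, diversity) -> int:
    """Simplified counting function: number of non-overlapping windows of
    length W whose diversity is below alpha. (The paper's definition
    maximizes over all contiguous partitions; fixed windows are a cheaper
    approximation used here only for illustration.)"""
    count = 0
    for start in range(0, len(trajectory) - W + 1, W):
        if diversity(trajectory[start:start + W]) < alpha:
            count += 1
    return count

def should_reset(trajectory: np.ndarray, W: int, N: int, alpha: float,
                 diversity=std_diversity) -> bool:
    """Decision function: reset once the low-diversity count reaches N."""
    return phi(trajectory, W, alpha, diversity) >= N

# Example: an agent stuck in (nearly) the same state for a long stretch
# triggers a reset request, while a diverse trajectory does not.
rng = np.random.default_rng(0)
diverse = rng.normal(size=(300, 8))
stuck = np.concatenate([rng.normal(size=(100, 8)),
                        np.tile(rng.normal(size=(1, 8)), (200, 1))])
print(should_reset(diverse, W=50, N=2, alpha=0.1))  # False
print(should_reset(stuck, W=50, N=2, alpha=0.1))    # True
```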
We aim to use a single policy to achieve RM-RL that can adapt to general
embodied tasks. Recall the objective for a goal-conditioned POMDP in traditional episodic RL:
\[\pi^\star = \arg\max_\pi J(\pi\mid g) = \arg\max_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t\mid g)\right]\]
In FB-RL, the ``forward'' goal space is normally defined as a singleton
$\mathcal{G}_{f} = \{g^\star\}$ for the target task goal $g^\star$ (e.g. the apple is on the plate,
the peg is inserted into the hole, etc.). The goal space for the ``backward'' phase is then the (generally limited)
initial state space $\mathcal{G}_{b} = \mathcal{I} \subset \mathcal{S}$ such that $\mathcal{G}_f\cap \mathcal{G}_b=\emptyset$.
As the goal spaces in FB-RL are disjoint and asymmetric, it is standard for separate forward/backward
policies (with separate parameters) and even different learning objectives to be used when training
FB-RL agents. In our setting, however, there is only a single goal space which, in principle, equals
the entire state space excluding the states we detect as being NI
(i.e., $\mathcal{G} = \mathcal{S} \setminus \{s_t \mid s_t \in \mathcal{T}_\pi \text{ with } \Phi_{W,N,\alpha,d}(\mathcal{T}_\pi) = 1\}$).
In our training setting, we call each period between goal switches a \textit{phase} and, when
formulating our learning objectives, treat these phases just as ``episodes'' are treated in episodic approaches.
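As a rough illustration of how phases replace episodes, the following sketch (with a hypothetical `env`, `policy`, and `sample_goal` interface, not our actual training code) switches to a freshly sampled goal whenever the current goal is achieved or a phase step budget is exhausted, and only requests a manual reset when the NI decision function $\Phi_{W,N,\alpha,d}$ (e.g. the `should_reset` sketch above) fires.

```python
import numpy as np

def train_reset_minimizing(env, policy, sample_goal, should_reset,
                           total_steps=1_000_000, max_phase_steps=300,
                           W=50, N=2, alpha=0.1):
    """Single-policy, random-goal training loop sketch.

    Assumed (illustrative) interface: env.reset() / env.step(action, goal)
    returning (obs, reward, success, info), sample_goal() drawing a random
    goal from the (non-NI) goal space, and should_reset being the NI
    decision function Phi_{W,N,alpha,d}.
    """
    obs = env.reset()                       # one initial (human) reset
    resets = 1
    trajectory = [obs]
    goal, phase_steps = sample_goal(), 0

    for _ in range(total_steps):
        action = policy.act(obs, goal)
        obs, reward, success, info = env.step(action, goal)
        trajectory.append(obs)
        trajectory = trajectory[-10 * W:]   # only a recent window is needed
        policy.observe(obs, action, reward, success, goal)
        phase_steps += 1

        if should_reset(np.asarray(trajectory), W, N, alpha):
            # Near-irreversible transition detected: ask for an intervention.
            obs = env.reset()
            resets += 1
            trajectory = [obs]
            goal, phase_steps = sample_goal(), 0
        elif success or phase_steps >= max_phase_steps:
            # End of phase: keep the environment as-is, just switch goals.
            goal, phase_steps = sample_goal(), 0

    return policy, resets
```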
In our RM-RL setting, we aim to evaluate both sample efficiency and intervention
efficiency: an agent or algorithm is preferred if it reaches good performance with fewer training steps and fewer human interventions,
i.e. it should minimize both
\[
\sum_{t=0}^\infty \left(J(\pi^\star) - J(\pi_t)\right) \quad\text{and}\quad \sum_{k=0}^\infty \left(J(\pi^\star) - J(\pi_k)\right),
\]
where $\pi_t$ and $\pi_k$ are the policies learned after $t$ training steps and $k$ interventions respectively, while also generalizing meaningfully.
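In practice these sums can only be approximated from periodic evaluations of $J$; the following tiny sketch (names and values are illustrative, assuming one logged evaluation per checkpoint) shows the cumulative-regret computation for both views.

```python
def cumulative_regret(J_star, J_checkpoints):
    """Approximate sum over checkpoints of [J(pi*) - J(pi_k)] from
    periodic evaluations of the learned policy."""
    return sum(J_star - J for J in J_checkpoints)

# Sample efficiency: checkpoints indexed by training steps taken so far.
sample_regret = cumulative_regret(1.0, [0.1, 0.4, 0.7, 0.9])
# Intervention efficiency: checkpoints indexed by interventions so far.
intervention_regret = cumulative_regret(1.0, [0.2, 0.8])
print(sample_regret, intervention_regret)  # 1.9 1.0
```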
We consider the following three classes of baseline training strategies.
In our experiments, we look to answer a number of questions related to sample efficiency, intervention (reset) efficiency, and generalization. As one example, the table below reports RoboTHOR ObjectNav results after 50M and 100M training steps:
Model | Success ($50$M) | SPL ($50$M) | Resets ($50$M) | Success ($100$M) | SPL ($100$M) | Resets ($100$M) |
---|---|---|---|---|---|---|
Ours | 0.216 | 0.131 | 592 | 0.551 | 0.275 | 635 |
H=300 | 0.334 | 0.166 | 24k | 0.355 | 0.167 | 1M |
H=10k | 0.246 | 0.134 | 5k | 0.418 | 0.218 | 10k |
H=$\infty$ | 0.206 | 0.141 | 60 | 0.339 | 0.178 | 60 |
EmbCLIP | 0.431 | 0.204 | 1M | 0.504 | 0.234 | 2M |
In this work we study the problem of training, within visually complex environments, Reset-Minimizing Reinforcement Learning (RM-RL) agents that can generalize to novel cosmetic and structural changes during evaluation. We design the Stretch-P&P benchmark to study this problem and find that two methodological contributions, unsupervised irreversible-transition detection and a single-policy random-goal training strategy, allow agents to learn with fewer resets and generalize better than competing baselines. In future work we look to further explore the implications of our irreversible-transition detection methods for improving RM-RL methods and for building models that can ask for help during evaluation. We also leave as future work the design and balancing of penalties for visiting unexpected NI states (with labels provided by our method), which may conflict with encouraging exploration.
@article{zhang2023when,
title = {When Learning Is Out of Reach, Reset: Generalization in Autonomous Visuomotor Reinforcement Learning},
author = {Zichen Zhang and Luca Weihs},
year = {2023},
journal = {arXiv preprint arXiv:2303.17600},
}