Universal Visual Decomposer:
Long-Horizon Manipulation Made Easy

Finalist for the Best Paper Award in Robot Vision, ICRA 2024

1PRIOR @ Allen Institute for AI; 2University of Pennsylvania; 3University of Washington
Equal Contribution; Equal Advising; Correspondence to: $\texttt{charlesz@allenai.org}$

Abstract

Real-world robotic tasks stretch over extended horizons and encompass multiple stages. Learning long-horizon manipulation tasks, however, is a long-standing challenge, and demands decomposing the overarching task into several manageable subtasks to facilitate policy learning and generalization to unseen tasks. Prior task decomposition methods require task-specific knowledge, are computationally intensive, and cannot readily be applied to new tasks. To address these shortcomings, we propose Universal Visual Decomposer (UVD), an off-the-shelf task decomposition method for visual long-horizon manipulation using pre-trained visual representations designed for robotic control. At a high level, UVD discovers subgoals by detecting phase shifts in the embedding space of the pre-trained representation. Operating purely on visual demonstrations without auxiliary information, UVD can effectively extract visual subgoals embedded in the videos, while incurring zero additional training cost on top of standard visuomotor policy training. Goal-conditioned policies learned with UVD-discovered subgoals exhibit significantly improved compositional generalization at test time to unseen tasks. Furthermore, UVD-discovered subgoals can be used to construct goal-based reward shaping that jump-starts temporally extended exploration for reinforcement learning. We extensively evaluate UVD on both simulation and real-world tasks, and in all cases, UVD substantially outperforms baselines across imitation and reinforcement learning settings on in-domain and out-of-domain task sequences alike, validating the clear advantage of automated visual task decomposition within the simple, compact UVD framework.
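As noted in the abstract, UVD subgoals can also be used for goal-based reward shaping in reinforcement learning. The snippet below is only an illustrative sketch of that idea, not the paper's exact formulation: the shaped reward is the per-step decrease in embedding distance to the current subgoal, and the agent advances to the next subgoal once that distance falls below a threshold; the function name, the threshold eps, and the switching rule are all assumptions.

import numpy as np

def shaped_reward(
    phi_obs: np.ndarray,       # (d,) embedding of the current observation
    phi_prev: np.ndarray,      # (d,) embedding of the previous observation
    subgoal_embs: np.ndarray,  # (K, d) UVD subgoal embeddings, chronological
    goal_idx: int,             # index of the subgoal currently being pursued
    eps: float = 0.25,         # hypothetical switching threshold
) -> tuple[float, int]:
    # dense reward = progress (embedding-distance decrease) toward the current subgoal
    g = subgoal_embs[goal_idx]
    d_prev = float(np.linalg.norm(phi_prev - g))
    d_curr = float(np.linalg.norm(phi_obs - g))
    reward = d_prev - d_curr  # positive when moving toward the subgoal
    # advance to the next subgoal once close enough (illustrative rule)
    if d_curr < eps and goal_idx < len(subgoal_embs) - 1:
        goal_idx += 1
    return reward, goal_idx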


Try our UVD decomposition demo, hosted with Gradio, below! Note: due to limited memory, only the VIP preprocessor is supported for now. If the demo is down, please contact the author.

Methods

Our goal is to derive a general-purpose subgoal decomposition method that operates purely from raw visual inputs on a per-trajectory basis. The key intuition behind UVD is that, conditioned on a goal frame, the frames immediately preceding it must visually approach that goal; once we identify the first frame of this goal-reaching sequence, the frame preceding it becomes the next subgoal. By conditioning on this new subgoal, we can apply the procedure recursively until the full sequence is exhausted. Below we show the low-level and high-level pseudocode, together with a visualization of the recursive decomposition process.


UVD low-level pseudocode in Python

from typing import Callable

import numpy as np
from scipy.signal import argrelextrema

def UVD(
    embeddings: np.ndarray,  # (L, d) per-frame features from a frozen encoder
    smooth_fn: Callable[[np.ndarray], np.ndarray],
    min_interval: int = 15,
) -> np.ndarray:
    # the last frame is the final subgoal
    cur_goal_idx = len(embeddings) - 1
    # saving (reversed) subgoal indices (timesteps)
    goal_indices = [cur_goal_idx]
    cur_emb = embeddings.copy()  # (L, d)
    while cur_goal_idx > min_interval:
        # smoothed embedding-distance curve to the current subgoal, shape (L,)
        d = np.linalg.norm(cur_emb - cur_emb[-1], axis=-1)
        d = smooth_fn(d)
        # monotonicity breaks (local maxima of the distance curve)
        extremas = argrelextrema(d, np.greater)[0]
        extremas = [
            e for e in extremas
            if cur_goal_idx - e > min_interval
        ]
        if extremas:
            # frame preceding the goal-reaching segment is the next subgoal, Eq. (3)
            cur_goal_idx = extremas[-1] - 1
            goal_indices.append(cur_goal_idx)
            cur_emb = embeddings[:cur_goal_idx + 1]
        else:
            break
    return embeddings[
        goal_indices[::-1]  # chronological order
    ]
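A minimal usage sketch for the function above; the Gaussian smoother and the toy random features are assumptions for illustration (in practice the embeddings come from a frozen pre-trained encoder such as VIP or R3M):

from functools import partial

import numpy as np
from scipy.ndimage import gaussian_filter1d

# toy stand-in for per-frame features phi(o_0), ..., phi(o_T)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1024)).astype(np.float32)

subgoal_embeddings = UVD(
    embeddings,
    smooth_fn=partial(gaussian_filter1d, sigma=3),  # assumed smoother
    min_interval=15,
)
print(subgoal_embeddings.shape)  # (num_subgoals, 1024)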
Universal Visual Decomposer (UVD)
Init: frozen visual encoder $\phi$, demonstration $\tau = \{o_0, \cdots, o_T\}$
Init: set of subgoals $\tau_{goal} = \{\}$, $t = T$
While $t$ is not small enough:
    $\tau_{goal} = \tau_{goal} \cup \{o_{t}\}$
    $o_{t-n-1} := \arg\max_{o_h,\, h < t}\ d_\phi(o_h; o_t) < d_\phi(o_{h+1}; o_t)$ (Eq. 3)
    $t = t - n - 1$
End
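To connect the algorithm box to the code above: the embeddings fed to UVD are simply the frozen encoder $\phi$ applied frame by frame. Below is a minimal sketch that uses an ImageNet-pretrained torchvision ResNet-50 as a stand-in for $\phi$ (assuming a recent torchvision; the paper's experiments instead rely on robotics-oriented representations such as VIP and R3M):

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

# frozen stand-in encoder; swap in VIP/R3M/etc. to match the paper's setting
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # expose the 2048-d pooled feature
backbone.eval()

preprocess = T.Compose([
    T.ToTensor(),          # HWC uint8 -> CHW float in [0, 1]
    T.Resize(256),
    T.CenterCrop(224),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_video(frames: list[np.ndarray]) -> np.ndarray:
    # map RGB frames (H, W, 3) to per-frame features phi(o_t), shape (T, 2048)
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch).cpu().numpy()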

Visualization of UVD recursive decomposition

UVD in the wild

UVD is not limited to robotic settings; it is also highly effective on human videos of household activities. Here are some examples of UVD decomposing subgoals in the wild:


Open a cabinet and rearrange

Open a drawer and charge

Unlock a computer

Wash hands in bathroom

Activities in kitchen

Experiments

Simulation Results

In-domain and out-of-domain IL results on FrankaKitchen. We report the mean and standard deviation of the success rate (full-stage completion) and the percentage of stages completed (out of 4) for GCBC policies trained on top of a diverse set of existing pre-trained visual representations, over three seeds each. Highlighted scores mark improvements in out-of-domain evaluations and in-domain gains exceeding 0.01.





Next, we visualize the qualitative results for one of the tasks in FrankaKitchen: open the microwave, turn on the bottom burner, toggle the light switch, and slide the cabinet. We compare the decomposition results across different frozen visual backbones, along with 3D t-SNE visualizations (colors indicate the subgoal each frame belongs to). Representations pre-trained with temporal objectives, such as VIP and R3M, yield smoother, more continuous, and more monotone clusters in feature space than the others, whereas the ResNet trained for supervised classification on ImageNet-1k yields the sparsest embeddings.
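The 3D t-SNE view can be reproduced along the following lines with scikit-learn and matplotlib; this is only a sketch, assuming the per-frame embeddings and the chronological UVD subgoal indices are already computed (function and variable names are illustrative):

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection)
from sklearn.manifold import TSNE

def plot_tsne_3d(embeddings: np.ndarray, subgoal_indices: list[int]) -> None:
    # 3D t-SNE of per-frame features, colored by the UVD segment each frame falls in
    xyz = TSNE(n_components=3, perplexity=30, random_state=0).fit_transform(embeddings)
    labels = np.searchsorted(np.asarray(subgoal_indices), np.arange(len(embeddings)))
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:, 2], c=labels, cmap="tab10", s=8)
    plt.show()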


UVD Decomposition Results in Simulation

GCBC

GCBC + UVD

GCRL

GCRL + UVD

Real Robot Results

In-domain evaluation. For real-world applications, we test UVD on three multi-stage tasks: placing an apple in an oven and closing the oven ($\texttt{Apple-in-Oven}$), pouring fries then placing on a rack ($\texttt{Fries-and-Rack}$), and folding a cloth ($\texttt{Fold-Cloth}$). The corresponding videos show how these tasks are broken down into semantically meaningful subgoals. We show two successful and one failed rollout for each of the three tasks. All real-robot videos are played at 2x speed.


$\texttt{Apple-in-Oven}$

UVD Decomposition Results


$\texttt{Fries-and-Rack}$

UVD Decomposition Results


$\texttt{Fold-Cloth}$

UVD Decomposition Results

Compositional Generalization. We evaluate UVD's ability to generalize compositionally by introducing unseen initial states for these tasks. While methods like GCBC fail (first row) under these circumstances, GCBC + UVD (second row) successfully adapts.

GCBC

GCBC + UVD

Robustness with Human Involvement. We further demonstrate that UVD can recover from, or continue the task despite, human interference. In the $\texttt{Apple-in-Oven}$ and $\texttt{Fries-and-Rack}$ tasks, we either reset the scene by returning the apple to its initial position or have a human complete an intermediate step. Our method shows strong robustness in these cases.

Reset the apple to its initial position

Accomplish intermediate step "pushing"

Accomplish intermediate step "pouring"

Implementation Details

Policies


Training


Inference

BibTeX

@inproceedings{zhang2024universal,
  title={Universal visual decomposer: Long-horizon manipulation made easy},
  author={Zhang, Zichen and Li, Yunshuang and Bastani, Osbert and Gupta, Abhishek and Jayaraman, Dinesh and Ma, Yecheng Jason and Weihs, Luca},
  booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={6973--6980},
  year={2024},
  organization={IEEE}
}