BBSEA: Brain-Body Synchronization for Embodied Agents

Anonymous Authors

Abstract

Embodied agents capable of complex physical skills can improve productivity, elevate life quality, and reshape human-machine collaboration. We aim to train embodied agents for a variety of tasks autonomously, relying mainly on large foundation models. It is believed that these models could act as a brain for embodied agents; however, existing methods rely heavily on humans for task proposal and scene customization, limiting the learning autonomy, training efficiency, and generalization of the learned policies. In contrast, we introduce a brain-body synchronization scheme that promotes embodied learning in unknown environments without human involvement. The proposed scheme combines the wisdom of foundation models ("brain") with the physical capabilities of embodied agents ("body"). Specifically, it leverages the "brain" to propose learnable physical tasks and success metrics, enabling the "body" to automatically acquire diverse skills by continuously interacting with the scene. We demonstrate that the proposed synchronization can generate diverse tasks and develop multi-task policies with strong adaptability to new tasks and environments. We will release our data, code, and trained models to facilitate future study in building autonomously learning agents.

Method

We propose a framework in which the entire process of diverse skill acquisition proceeds automatically, without human intervention.
The framework consists of two essential components: the brain (large foundation models together with a perception module) and the body (robot arms that take textual instructions and interact with the environment). An LLM uses a scene graph to ground its understanding of the environment; it proposes tasks for the robot to learn, guides exploration, and determines whether each task has been accomplished. Our pipeline comprises three parts: 1) scene-compatible task proposal; 2) task completion inference; and 3) task-conditioned policy learning. An overview of the proposed framework is shown in Figure 2.
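To make the loop concrete, the following is a minimal Python sketch of how the three parts could fit together. It is illustrative only: `SceneGraph`, `Task`, and the `env`/`policy` interfaces (`perceive`, `observe`, `step`, `act`, `update`) are hypothetical names rather than the paper's actual API, and the LLM call in `propose_tasks` is stubbed with a fixed example instead of a real foundation-model query.

```python
from dataclasses import dataclass

@dataclass
class SceneGraph:
    """Textual scene representation produced by the perception module."""
    objects: list[str]                      # e.g. ["red block", "blue bowl"]
    relations: list[tuple[str, str, str]]   # e.g. [("red block", "on", "table")]

    def to_prompt(self) -> str:
        """Serialize the graph so it can be inserted into an LLM prompt."""
        rels = "; ".join(f"{a} {r} {b}" for a, r, b in self.relations)
        return f"Objects: {', '.join(self.objects)}. Relations: {rels}."

@dataclass
class Task:
    instruction: str      # natural-language task description from the LLM
    success_metric: str   # predicate over scene-graph relations, also LLM-written

def propose_tasks(graph: SceneGraph) -> list[Task]:
    """Step 1: scene-compatible task proposal.
    A real system would prompt the LLM with graph.to_prompt() and parse its
    reply; a fixed example stands in for that call here."""
    return [Task(
        instruction="put the red block in the blue bowl",
        success_metric="('red block', 'in', 'blue bowl') in relations",
    )]

def task_completed(task: Task, graph: SceneGraph) -> bool:
    """Step 2: task completion inference.
    Evaluates the LLM-proposed success metric against the current scene graph
    (eval on model output is for illustration only)."""
    return eval(task.success_metric, {"relations": graph.relations})

def learning_loop(env, policy, steps_per_task: int = 100) -> None:
    """Step 3: task-conditioned policy learning, driven by the brain-body loop:
    propose tasks, let the body explore, label success, update the policy."""
    graph = env.perceive()                    # perception module -> scene graph
    for task in propose_tasks(graph):
        for _ in range(steps_per_task):
            action = policy.act(task.instruction, env.observe())
            env.step(action)                  # the "body" interacts with the scene
            graph = env.perceive()
            if task_completed(task, graph):   # automatic label, no human needed
                policy.update(task.instruction, success=True)
                break
```

The key design point the sketch captures is that the success metric is itself generated by the "brain" alongside each task, so the "body" can collect self-labeled trajectories without any human in the loop.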


[Figures: trajectories collected on 60 tasks; zero-shot generalization; few-shot generalization.]