RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction

Meituan
ICCV 2025

RoboTron-Nav's motivation


Motivation. (1) Navigation tasks produce only actions, lacking the perception and planning exercised in (2) EQA tasks. (3) Multitask collaboration unifies perception, planning, and prediction for a more comprehensive model.

Abstract

In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. To navigate reliably in unfamiliar scenes, agents need strong perception, planning, and prediction capabilities. Moreover, when agents revisit previously explored areas during long-horizon navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we propose RoboTron-Nav, a unified framework that integrates perception, planning, and prediction through multitask collaboration on navigation and embodied question answering (EQA) tasks, thereby enhancing navigation performance. Furthermore, RoboTron-Nav employs an adaptive 3D-aware history sampling strategy to use historical observations effectively and efficiently. By leveraging a large language model, RoboTron-Nav comprehends diverse commands and complex visual scenes, producing appropriate navigation actions. RoboTron-Nav achieves an 81.1% success rate in object goal navigation on the CHORES-S benchmark, setting a new state of the art.
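To picture the multitask-collaboration idea (joint training on navigation and EQA), here is a minimal, illustrative Python sketch. The batch-mixing scheme, the 50/50 ratio, the loss weights, the `task` field, and the `model.token_loss` helper are assumptions made for illustration, not the paper's released training recipe.

import random


def multitask_batch(nav_samples, eqa_samples, batch_size=8, nav_ratio=0.5):
    """Mix navigation and EQA samples so a single model is trained on both tasks."""
    n_nav = int(batch_size * nav_ratio)
    batch = random.sample(nav_samples, n_nav) + random.sample(eqa_samples, batch_size - n_nav)
    random.shuffle(batch)
    return batch


def multitask_loss(model, batch, w_nav=1.0, w_eqa=1.0):
    """Average per-sample token losses; both tasks share the same LLM parameters."""
    total = 0.0
    for sample in batch:
        loss = model.token_loss(sample)  # hypothetical helper: cross-entropy over action or answer tokens
        weight = w_nav if sample["task"] == "nav" else w_eqa
        total += weight * loss
    return total / len(batch)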

Pipeline

RoboTron-Nav's pipeline

Overview of the RoboTron-Nav architecture. The current frame I_t is first processed by the visual encoder for 2D and 3D feature extraction. Historical features are then filtered by the Adaptive 3D-aware History Sampling strategy. The resulting visual features, together with the accompanying linguistic instructions, are fed into the large language model (LLM). Leveraging the Multitask Collaboration strategy, which significantly enhances navigation capabilities through joint training on both navigation and EQA tasks, the LLM produces two outputs via multimodal fusion: executable navigation actions and natural language answers.
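To make the data flow concrete, the following is a minimal Python sketch of one inference step under stated assumptions: a visual-encoder callable returning fused 2D/3D features, an LLM callable returning both an action and an answer, and a voxel-deduplication interpretation of the Adaptive 3D-aware History Sampling strategy. Names such as `sample_history_3d` and `navigation_step` are illustrative, not the released API.

import numpy as np


def sample_history_3d(history, voxel_size=0.5):
    """Assumed voxel-based variant of Adaptive 3D-aware History Sampling.

    `history` is a list of (feature, camera_position) pairs. Each visited 3D
    voxel keeps only its most recent feature, so redundant observations from
    re-explored areas are dropped.
    """
    kept, seen = [], set()
    for feat, pos in reversed(history):  # walk backwards so the newest frame wins per voxel
        voxel = tuple(np.floor(np.asarray(pos, dtype=float) / voxel_size).astype(int))
        if voxel not in seen:
            seen.add(voxel)
            kept.append(feat)
    return list(reversed(kept))  # restore chronological order


def navigation_step(frame, pose, instruction, history, visual_encoder, llm):
    """One control step: encode the current frame, filter history, query the LLM."""
    feats = visual_encoder(frame, pose)             # 2D + 3D features of the current frame I_t
    context = sample_history_3d(history) + [feats]  # filtered history + current features
    action, answer = llm(context, instruction)      # multimodal fusion inside the LLM
    history.append((feats, pose))                   # remember this observation for later steps
    return action, answer                           # navigation action and EQA answer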

Results

RoboTron-Nav's performance results

We compare RoboTron-Nav with state-of-the-art navigation methods on the CHORES-S ObjectNav benchmark using three evaluation metrics: success rate (SR), success weighted by episode length (SEL), and percentage of rooms visited (%Rooms).
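For reference, a minimal sketch of how these metrics could be computed, assuming the usual definitions: SR is the fraction of successful episodes, SEL weights success by the ratio of expert to agent episode length, and %Rooms averages the fraction of rooms visited per episode. The `Episode` fields are illustrative, not the benchmark's data format; reported numbers are these fractions expressed as percentages.

from dataclasses import dataclass


@dataclass
class Episode:
    success: bool        # did the agent reach the goal object?
    agent_steps: int     # number of actions the agent took
    expert_steps: int    # length of the expert (shortest) trajectory
    rooms_visited: int   # distinct rooms entered during the episode
    rooms_total: int     # total rooms in the house


def evaluate(episodes):
    """Aggregate SR, SEL, and %Rooms over a list of episodes (fractions in [0, 1])."""
    n = len(episodes)
    sr = sum(e.success for e in episodes) / n
    sel = sum(
        e.success * e.expert_steps / max(e.agent_steps, e.expert_steps)
        for e in episodes
    ) / n
    rooms = sum(e.rooms_visited / e.rooms_total for e in episodes) / n
    return {"SR": sr, "SEL": sel, "%Rooms": rooms}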

RoboTron-Nav's qualitative results

Left: When the agent is near the target (e.g., in the same room), both RoboTron-Nav and SPOC generate the shortest path. Right: When the agent is far from the target (e.g., in a different room), SPOC fails by repeatedly searching along the same paths, whereas RoboTron-Nav succeeds by avoiding already-visited areas and exploring new ones.

Visualizations

Multi-view visualization of the RoboTron-Nav navigation process, showing bird's-eye, head-mounted, and wrist-mounted camera perspectives.

Task: Find a houseplant

🗺️ Bird's-Eye View

👁️ Head View

🤚 Wrist View

Task: Find a bed

🗺️ Bird's-Eye View

👁️ Head View

🤚 Wrist View

Task: Search for a sofa

🗺️ Bird's-Eye View

👁️ Head View

🤚 Wrist View

BibTeX


@article{zhong2025p3nav,
  title={P3nav: A unified framework for embodied navigation integrating perception, planning, and prediction},
  author={Zhong, Yufeng and Feng, Chengjian and Yan, Feng and Liu, Fanfan and Zheng, Liming and Ma, Lin},
  journal={arXiv preprint arXiv:2503.18525},
  year={2025}
}