Pipeline
Overview of the RoboTron-Nav architecture. The current frame is first processed by the visual encoder, which extracts 2D and 3D features. Historical features are then filtered by the Adaptive 3D-aware History Sampling strategy. The resulting visual features, together with the accompanying language instruction, are fed into the large language model (LLM). Trained with the Multitask Collaboration strategy, which significantly enhances navigation capability through joint training on both navigation and embodied question answering (EQA) tasks, the LLM produces two outputs via multimodal fusion: executable navigation actions and natural language answers.
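The data flow described in this overview can be sketched schematically as follows. This is a minimal illustration of the order of the stages only, not the RoboTron-Nav implementation: every name here (`FrameFeatures`, `encode_frame`, `sample_history`, `llm_forward`, `step`, and the `keep` parameter) is hypothetical, the encoder and LLM are dummy stand-ins, and the "keep the most recent entries" rule is a placeholder that does not reproduce the Adaptive 3D-aware History Sampling strategy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameFeatures:
    feat_2d: List[float]  # stand-in for 2D features of one frame
    feat_3d: List[float]  # stand-in for 3D features of one frame


def encode_frame(frame: List[float]) -> FrameFeatures:
    """Placeholder visual encoder: returns dummy 2D and 3D features."""
    return FrameFeatures(feat_2d=list(frame), feat_3d=[x * 0.5 for x in frame])


def sample_history(history: List[FrameFeatures], keep: int) -> List[FrameFeatures]:
    """Placeholder history filter (assumption): keep the most recent `keep` entries."""
    return history[-keep:] if keep > 0 else []


def llm_forward(
    current: FrameFeatures, history: List[FrameFeatures], instruction: str
) -> Tuple[str, str]:
    """Placeholder LLM: returns (navigation action, natural language answer)."""
    n_inputs = 1 + len(history)  # current frame plus the retained history
    action = "move_forward" if n_inputs % 2 else "turn_left"  # dummy rule
    answer = f"(answer to: {instruction})"
    return action, answer


def step(
    frame: List[float],
    history: List[FrameFeatures],
    instruction: str,
    keep: int = 2,
) -> Tuple[str, str]:
    """One pass through the stages in the order given in the overview."""
    current = encode_frame(frame)            # 2D and 3D feature extraction
    kept = sample_history(history, keep)     # filter historical features
    action, answer = llm_forward(current, kept, instruction)  # two outputs
    history.append(current)                  # current frame becomes history
    return action, answer
```

Calling `step` repeatedly with a shared `history` list mimics an episode: each call encodes the new frame, filters the accumulated history, and returns both an action and an answer, mirroring the two outputs named in the overview.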