Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments

Abstract

We present CA-Nav, a zero-shot approach for Vision-Language Navigation in Continuous Environments (VLN-CE). CA-Nav reframes zero-shot VLN-CE as a sequential sub-instruction completion problem, continuously translating sub-instructions into navigation plans via a cross-modal value map. Central to our approach are two modules: the Constraint-aware Sub-instruction Manager (CSM) and the Constraint-aware Value Map (CVM). CSM decomposes instructions into sub-instructions, defines their completion criteria as constraints, and tracks navigation progress by switching sub-instructions in a constraint-aware manner. Based on the constraints identified by CSM, CVM generates a value map from a vision-language model and refines it using superpixel clustering to enhance navigation stability. CA-Nav achieves state-of-the-art performance on two VLN-CE benchmarks, surpassing the second-best method by 12% on R2R-CE and 13% on RxR-CE in terms of Success Rate on the validation unseen split. Furthermore, CA-Nav proves effective in real-world robot deployments across diverse indoor scenes and instructions, demonstrating its practical potential.

Method Overview

Illustration of the proposed CA-Nav. (a) The Constraint-aware Sub-instruction Manager decomposes the instruction into a sequence of sub-instructions and determines object constraints, location constraints, and direction constraints for each of them. (b) During navigation, CA-Nav builds a Constraint-aware Value Map based on a landmark prompt provided by CSM and uses superpixel clustering to segment it into regions. It switches sub-instructions in a constraint-aware manner and chooses the geometric center of the most promising region as the next waypoint.
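The region-then-center selection in (b) can be sketched as follows. This is a minimal illustration under stated assumptions, not the CA-Nav implementation: a fixed grid partition stands in for superpixel clustering, regions are scored by their mean value, and `grid_regions`, `select_waypoint`, and the `cell` size are hypothetical names and parameters.

```python
import numpy as np


def grid_regions(shape, cell=4):
    """Stand-in for superpixel clustering: label the map with square cells."""
    rows, cols = np.indices(shape)
    n_cols = -(-shape[1] // cell)  # ceiling division: number of cells per row
    return (rows // cell) * n_cols + (cols // cell)


def select_waypoint(value_map, labels):
    """Return the (row, col) geometric center of the region with the highest mean value."""
    flat_labels = labels.ravel()
    sums = np.bincount(flat_labels, weights=value_map.ravel())
    counts = np.bincount(flat_labels)
    # Unused label ids (count 0) can never be selected.
    means = np.where(counts > 0, sums / np.maximum(counts, 1), -np.inf)
    best = int(np.argmax(means))
    rows, cols = np.nonzero(labels == best)
    return float(rows.mean()), float(cols.mean())


# Toy value map (assumed data): one broad high-value area and one isolated peak.
value_map = np.zeros((8, 8))
value_map[4:8, 0:4] = 0.9   # consistently high-value region
value_map[0, 7] = 1.0       # single high-value cell
labels = grid_regions(value_map.shape, cell=4)
waypoint = select_waypoint(value_map, labels)  # center of the broad region
```

In this toy map the isolated peak does not win, because whole regions are scored rather than individual cells. A real superpixel method would produce irregular regions, but the selection step would have the same shape.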

Framework of CA-Nav

We present CA-Nav, which reframes VLN-CE as a sequential sub-instruction completion problem. Within each episode, a Constraint-aware Sub-instruction Manager (CSM) decomposes the instruction and switches to the next sub-instruction once all constraints of the current one are satisfied. A Constraint-aware Value Map (CVM), which captures both visual details and environmental structure, is then built from the current constraints and observations. Based on the CVM, CA-Nav generates navigation plans that are executed by classical control algorithms, guiding the agent to complete each sub-instruction until the episode terminates.
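The switching rule described above (advance only when every constraint of the current sub-instruction holds) can be sketched as a small state machine. This is an illustrative sketch only: the class names, the dictionary-style observation, and the helper checks `sees`, `in_room`, and `turned` are all hypothetical and not taken from the CA-Nav code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A constraint is a check over the current observation (a plain dict here).
Check = Callable[[Dict], bool]


@dataclass
class SubInstruction:
    text: str
    landmark: str                              # landmark prompt for the value map
    constraints: List[Check] = field(default_factory=list)


@dataclass
class SubInstructionManager:
    queue: List[SubInstruction]
    index: int = 0

    @property
    def done(self) -> bool:
        return self.index >= len(self.queue)

    def current(self) -> SubInstruction:
        return self.queue[self.index]

    def step(self, observation: Dict) -> bool:
        """Switch to the next sub-instruction once every constraint holds."""
        if self.done:
            return False
        if all(check(observation) for check in self.current().constraints):
            self.index += 1
            return True
        return False


# Hypothetical object, location, and direction constraints.
def sees(obj: str) -> Check:
    return lambda obs: obj in obs.get("objects", ())


def in_room(name: str) -> Check:
    return lambda obs: obs.get("room") == name


def turned(direction: str) -> Check:
    return lambda obs: obs.get("turned") == direction


manager = SubInstructionManager([
    SubInstruction("Pass the stairs on the right", "stairs", [sees("stairs")]),
    SubInstruction("Go into the room with the couches and then turn right", "room",
                   [in_room("living room"), sees("couch"), turned("right")]),
])

switched_1 = manager.step({"objects": ["stairs"]})
# Location and object constraints hold, but the direction constraint does not yet.
switched_2 = manager.step({"objects": ["couch"], "room": "living room"})
switched_3 = manager.step({"objects": ["couch"], "room": "living room", "turned": "right"})
```

The second call does not switch because one of the three constraints is still unmet, which is the behavior the constraint-aware switching is meant to guarantee.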

Visualization in Simulation

Navigation on episode 1313 with superpixel-based waypoint selection.

Instruction: Go straight. Pass the stairs on the right and continue straight. When you get to the stairs going up pass those as well. Go into the room with the couches and then turn right. wait near the glass table with white chairs.

SVM avoids focusing solely on local high values. Instead, it considers the global value map, allowing it to navigate correctly when switching sub-instructions.

Navigation on episode 1313 with frontier-based waypoint selection.

Instruction: Go straight. Pass the stairs on the right and continue straight. When you get to the stairs going up pass those as well. Go into the room with the couches and then turn right. wait near the glass table with white chairs.

The waypoint selection method based on frontier-based exploration (FBE) navigates correctly during the first sub-instruction. However, upon reaching the second sub-instruction, the landmark prompt changes to “room”, causing the value map, initially built from the previous landmark prompt “stairs”, to decay as new values are written in. The agent then chooses a frontier on the left, which has the highest value among the available options, although the correct path is to walk straight toward the open area with the stairs.
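The failure mode above can be reproduced with a toy example. The sketch below is a hedged illustration, not the authors' method: the weighted-blend update in `update_value_map`, the `decay` weight, and all numeric values are assumptions chosen only to show how a prompt change can flip the highest-valued frontier.

```python
import numpy as np


def update_value_map(value_map, new_values, observed, decay=0.3):
    """Blend fresh scores into observed cells (assumed update rule)."""
    updated = value_map.copy()
    updated[observed] = (decay * value_map[observed]
                         + (1.0 - decay) * new_values[observed])
    return updated


def pick_frontier(value_map, frontiers):
    """Return the frontier cell (row, col) with the highest value."""
    return max(frontiers, key=lambda rc: value_map[rc])


straight, left = (0, 2), (2, 0)          # two candidate frontier cells
frontiers = [straight, left]

# Values accumulated under the first landmark prompt ("stairs").
old_map = np.zeros((5, 5))
old_map[straight] = 0.9
old_map[left] = 0.2
before = pick_frontier(old_map, frontiers)

# After the prompt changes to "room", fresh scores favor the left frontier.
room_scores = np.zeros((5, 5))
room_scores[straight] = 0.1
room_scores[left] = 0.6
observed = np.zeros((5, 5), dtype=bool)
observed[straight] = True
observed[left] = True

new_map = update_value_map(old_map, room_scores, observed)
after = pick_frontier(new_map, frontiers)
```

Under these assumed numbers the straight frontier is chosen before the prompt change and the left frontier afterwards, because the selection looks only at individual frontier cells.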

BibTeX

@article{chen2024CANav,
  title={Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments},
  author={Kehan Chen and Dong An and Yan Huang and Rongtao Xu and Yifei Su and Yonggen Ling and Ian Reid and Liang Wang},
  year={2024},
  journal={arXiv preprint arXiv:2412.10137}
}
