Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments

Kehan Chen*, Dong An*, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, Liang Wang†

Abstract

We present CA-Nav, a zero-shot approach for Vision-Language Navigation in Continuous Environments (VLN-CE). CA-Nav reframes zero-shot VLN-CE as a sequential sub-instruction completion problem, continuously translating sub-instructions into navigation plans via a cross-modal value map. Central to our approach are two modules: the Constraint-aware Sub-instruction Manager (CSM) and the Constraint-aware Value Map (CVM). CSM decomposes instructions into sub-instructions, defines their completion criteria as constraints, and tracks navigation progress by switching sub-instructions in a constraint-aware manner. Based on the constraints identified by CSM, CVM generates a value map from a vision-language model and refines it using superpixel clustering to enhance navigation stability. CA-Nav achieves state-of-the-art performance on two VLN-CE benchmarks, surpassing the second-best method by 12% on R2R-CE and 13% on RxR-CE in terms of Success Rate on the validation unseen split. Furthermore, CA-Nav proves effective in real-world robot deployments across diverse indoor scenes and instructions, demonstrating its practical potential.

Method Overview

Illustration of the proposed CA-Nav. (a) The Constraint-aware Sub-instruction Manager (CSM) decomposes the instruction into a sequence of sub-instructions and determines the object, location, and direction constraints for each of them. (b) During navigation, CA-Nav builds a Constraint-aware Value Map (CVM) from a landmark prompt provided by CSM and segments it into regions using superpixel clustering. It switches sub-instructions in a constraint-aware manner and selects the geometric center of the most promising region as the next waypoint.
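The waypoint selection step in (b) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_waypoint`, the toy inputs, and the choice to rank regions by their mean value are all assumptions made for illustration, and the region labels are taken as given (for example, produced elsewhere by a superpixel method).

```python
from collections import defaultdict


def select_waypoint(value_map, region_labels):
    """Pick the region with the highest mean value and return its centroid.

    value_map:     2D list of floats, one score per map cell (hypothetical).
    region_labels: 2D list of ints with the same shape, one region id per
                   cell (assumed to come from a superpixel segmentation).
    Ranking regions by mean value is an assumption of this sketch.
    """
    value_sums = defaultdict(float)
    cell_counts = defaultdict(int)
    row_sums = defaultdict(float)
    col_sums = defaultdict(float)

    # Accumulate per-region statistics in a single pass over the grid.
    for r, (value_row, label_row) in enumerate(zip(value_map, region_labels)):
        for c, (value, label) in enumerate(zip(value_row, label_row)):
            value_sums[label] += value
            cell_counts[label] += 1
            row_sums[label] += r
            col_sums[label] += c

    # "Most promising" region: highest mean value (assumed criterion).
    best = max(cell_counts, key=lambda lab: value_sums[lab] / cell_counts[lab])

    # Geometric center: mean row and column index of the region's cells.
    center = (row_sums[best] / cell_counts[best],
              col_sums[best] / cell_counts[best])
    return best, center


# Toy 2x4 map with two regions; the right-hand region scores higher.
value_map = [
    [0.1, 0.1, 0.8, 0.9],
    [0.1, 0.2, 0.9, 0.8],
]
region_labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
best_label, waypoint = select_waypoint(value_map, region_labels)
print(best_label, waypoint)
```

Choosing a region centroid rather than the single highest-valued cell means that small fluctuations in individual cell values do not move the waypoint much, which is in the spirit of the stability motivation stated in the abstract.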

Framework of CA-Nav

We present CA-Nav, which reframes VLN-CE as a sequential sub-instruction completion problem. Within each episode, a Constraint-aware Sub-instruction Manager (CSM) decomposes the instruction and performs constraint-aware sub-instruction switching: it advances to the next sub-instruction once all constraints of the current one are satisfied. A Constraint-aware Value Map (CVM), which captures both visual details and environmental structure, is then built from the current constraints and observations. Based on the CVM, CA-Nav generates navigation plans that are executed by classical control algorithms, guiding the agent to complete each sub-instruction until the episode terminates.
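The switching rule described above (advance only when every constraint of the current sub-instruction holds) can be sketched as follows. This is a simplified illustration under stated assumptions, not the released code: the class names, the observation dictionary with `objects` and `room` keys, and the helper constructors `object_seen` and `in_location` are hypothetical, and direction constraints are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Observation = Dict[str, object]          # hypothetical observation format
Constraint = Callable[[Observation], bool]


def object_seen(name: str) -> Constraint:
    """Object constraint: satisfied when `name` is among detected objects."""
    return lambda obs: name in obs.get("objects", set())


def in_location(room: str) -> Constraint:
    """Location constraint: satisfied when the agent is in `room`."""
    return lambda obs: obs.get("room") == room


@dataclass
class SubInstruction:
    text: str
    constraints: List[Constraint]


class SubInstructionManager:
    """Tracks progress through an ordered list of sub-instructions."""

    def __init__(self, sub_instructions: List[SubInstruction]):
        self.sub_instructions = list(sub_instructions)
        self.index = 0

    @property
    def done(self) -> bool:
        return self.index >= len(self.sub_instructions)

    def update(self, obs: Observation) -> bool:
        """Switch to the next sub-instruction if all constraints are met.

        Returns True when a switch happened on this step.
        """
        if self.done:
            return False
        current = self.sub_instructions[self.index]
        if all(check(obs) for check in current.constraints):
            self.index += 1
            return True
        return False


manager = SubInstructionManager([
    SubInstruction("walk past the sofa", [object_seen("sofa")]),
    SubInstruction("enter the kitchen", [in_location("kitchen")]),
])

# Step-by-step observations (made-up values for illustration).
switched_1 = manager.update({"objects": {"chair"}, "room": "living room"})
switched_2 = manager.update({"objects": {"sofa"}, "room": "living room"})
switched_3 = manager.update({"objects": set(), "room": "kitchen"})
print(switched_1, switched_2, switched_3, manager.done)
```

Representing each constraint as a predicate over the current observation keeps the manager independent of how objects, locations, or directions are actually detected.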

Real-world Experiments

Visualization in Simulation
