We present CA-Nav, a zero-shot approach for Vision-Language Navigation in Continuous Environments (VLN-CE). CA-Nav reframes zero-shot VLN-CE as a sequential sub-instruction completion problem, continuously translating sub-instructions into navigation plans via a cross-modal value map. Central to our approach are two modules: the Constraint-aware Sub-instruction Manager (CSM) and the Constraint-aware Value Map (CVM). CSM decomposes instructions into sub-instructions, defines their completion criteria as constraints, and tracks navigation progress by switching sub-instructions in a constraint-aware manner. Based on the constraints identified by CSM, CVM generates a value map from a vision-language model and refines it using superpixel clustering to enhance navigation stability. CA-Nav achieves state-of-the-art performance on two VLN-CE benchmarks, surpassing the second-best method by 12% on R2R-CE and 13% on RxR-CE in Success Rate on the validation unseen split. Furthermore, CA-Nav proves effective in real-world robot deployments across diverse indoor scenes and instructions, verifying its practical potential.
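To make the constraint-aware switching idea concrete, the following is a minimal illustrative sketch in Python, not the CA-Nav implementation. It assumes that each sub-instruction carries a list of boolean completion constraints and that the agent advances to the next sub-instruction once all of them hold for the current observation. Every name here (`SubInstruction`, `SubInstructionManager`, `in_room`, `object_visible`, and the observation keys `room` and `objects`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical observation format: a plain dict with "room" and "objects" keys.
Observation = Dict[str, Any]
Constraint = Callable[[Observation], bool]


def in_room(name: str) -> Constraint:
    """Constraint that holds when the agent is in the named room."""
    return lambda obs: obs.get("room") == name


def object_visible(name: str) -> Constraint:
    """Constraint that holds when the named object is in view."""
    return lambda obs: name in obs.get("objects", ())


@dataclass
class SubInstruction:
    text: str
    constraints: List[Constraint]


@dataclass
class SubInstructionManager:
    """Tracks progress through an ordered list of sub-instructions."""
    subs: List[SubInstruction]
    index: int = 0

    @property
    def done(self) -> bool:
        return self.index >= len(self.subs)

    def update(self, obs: Observation) -> bool:
        """Switch to the next sub-instruction if every constraint of the
        current one is satisfied. Returns True when a switch happens."""
        if self.done:
            return False
        current = self.subs[self.index]
        if all(check(obs) for check in current.constraints):
            self.index += 1
            return True
        return False


# Toy episode for "exit the bedroom, then stop next to the sofa".
manager = SubInstructionManager([
    SubInstruction("exit the bedroom", [in_room("hallway")]),
    SubInstruction("stop next to the sofa", [object_visible("sofa")]),
])
trace = [
    {"room": "bedroom", "objects": ["bed"]},
    {"room": "hallway", "objects": []},
    {"room": "living room", "objects": ["sofa"]},
]
switches = [manager.update(obs) for obs in trace]
```

In this toy trace the manager stays on the first sub-instruction while the agent is still in the bedroom, then switches once per later observation as each constraint set is met. How constraints are actually detected, and how the active sub-instruction drives the value map, is outside the scope of this sketch.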