Abstract: Vision-and-Language Navigation (VLN) is a challenging problem that requires agents to follow natural-language instructions in a photo-realistic environment. The alignment between visual ...