- To save on processing power, researchers often pre-compute visual features (using models such as CLIP or ResNet) and store them in compressed formats for the agent to use during training.
- The agent must understand spatial relationships and object semantics, such as distinguishing a "wooden table" from a "marble counter".
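The feature pre-computation workflow above can be sketched as follows. This is a minimal illustration, not a specific project's pipeline: `encode_image` is a hypothetical placeholder for a real encoder such as CLIP or ResNet, and the compressed-archive format (NumPy's `.npz`) is one common choice among several.

```python
import numpy as np

def encode_image(image: np.ndarray) -> np.ndarray:
    """Placeholder encoder: a real pipeline would run CLIP/ResNet here."""
    rng = np.random.default_rng(int(image.sum()) % 2**32)
    # Half-precision features roughly halve storage versus float32.
    return rng.standard_normal(512).astype(np.float16)

def precompute_features(images: dict[str, np.ndarray], path: str) -> None:
    """Encode every view once and write all features to one compressed file."""
    feats = {view_id: encode_image(img) for view_id, img in images.items()}
    np.savez_compressed(path, **feats)

def load_features(path: str) -> dict[str, np.ndarray]:
    """Load the cached features during training instead of re-running the encoder."""
    with np.load(path) as archive:
        return {view_id: archive[view_id] for view_id in archive.files}
```

During training, the agent looks up `load_features(path)[view_id]` instead of running the vision model on every frame, which is where the savings in processing power come from.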
Based on common naming conventions for technical datasets and model checkpoints, "VLN-155zip" likely refers to a specific compressed archive containing pre-trained weights or code repositories for a VLN research project.

Understanding Vision-Language Navigation (VLN)