VLNActionDataset: Integer division (//) for num_rounds discards final action segments and STOP signals, so the agent fails to learn to stop #88

@followingcode

Description

In streamvln/dataset/vln_action_dataset.py, line 657 calculates num_rounds using floor division (//):

num_rounds = (actions_len - valid_idx) // self.num_frames
It then iterates over range(num_rounds + 1) and skips the last, empty window (when n * self.num_frames == actions_len - valid_idx). As far as I can tell, this adds only num_rounds samples per episode.
This means:
For any episode where (actions_len - valid_idx) % self.num_frames != 0, the final action segment (including the STOP step) is discarded entirely.
The training set loses many STOP signals, so the agent sees too few examples of when to stop during navigation.
At inference, the agent may then fail to stop at the target, which would hurt navigation success rate and SPL.
Example
actions_len - valid_idx = 87
self.num_frames = 32
num_rounds = 87 // 32 = 2
Only windows 0~31 and 32~63 are kept; the final segment 64~86 (23 steps, including STOP) is lost.
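
The arithmetic in this example can be checked with a small standalone sketch. `window_bounds` is a hypothetical helper written for this issue, not a function from the repository; it only contrasts the window count from floor division with the count from ceiling division.

```python
def window_bounds(total_steps, num_frames, keep_partial):
    """Return inclusive (start, end) index pairs for each window.

    Hypothetical helper for illustration only; not repository code.
    """
    if keep_partial:
        # Ceiling division using integers only.
        count = (total_steps + num_frames - 1) // num_frames
    else:
        # Floor division: a trailing partial window is not counted.
        count = total_steps // num_frames
    return [
        (n * num_frames, min((n + 1) * num_frames, total_steps) - 1)
        for n in range(count)
    ]


floor_windows = window_bounds(87, 32, keep_partial=False)
ceil_windows = window_bounds(87, 32, keep_partial=True)
print(floor_windows)  # [(0, 31), (32, 63)]
print(ceil_windows)   # [(0, 31), (32, 63), (64, 86)]
```

When the step count is an exact multiple of the window size, both variants return the same windows, so ceiling division never produces an empty trailing window.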
Suggested Fix
Instead of discarding the final segment, we should use ceiling division to keep all action segments (including the last partial one) and pad shorter segments to self.num_frames length during collation. This way:
No action/STOP steps are discarded
The agent can learn to recognize and execute the STOP signal
Batch shapes stay uniform because shorter segments are padded, so training stability is maintained
Could we modify the sampling logic to preserve all action segments (including partial ones) instead of using floor division?
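
A minimal sketch of the padding step, assuming action segments are plain lists of integer action ids. The names `PAD_ACTION`, `pad_segment`, and `collate_segments` are hypothetical, and the padding value of -100 is an assumption (it matches the default `ignore_index` of PyTorch's `CrossEntropyLoss`, but the value the repository would actually need may differ). The returned mask marks real steps so padded positions can be excluded from the loss.

```python
PAD_ACTION = -100  # assumed padding label; not taken from the repository


def pad_segment(actions, num_frames, pad_value=PAD_ACTION):
    """Right-pad one action segment to num_frames; return (padded, mask)."""
    if len(actions) > num_frames:
        raise ValueError("segment is longer than num_frames")
    pad_len = num_frames - len(actions)
    padded = list(actions) + [pad_value] * pad_len
    mask = [1] * len(actions) + [0] * pad_len  # 1 = real step, 0 = padding
    return padded, mask


def collate_segments(segments, num_frames):
    """Pad every segment in a batch to the same length."""
    padded_batch, mask_batch = [], []
    for segment in segments:
        padded, mask = pad_segment(segment, num_frames)
        padded_batch.append(padded)
        mask_batch.append(mask)
    return padded_batch, mask_batch


# Toy batch with num_frames=4; action id 0 stands for STOP here (assumed encoding).
segments = [[1, 2, 3, 1], [2, 0]]
batch, mask = collate_segments(segments, num_frames=4)
print(batch)  # [[1, 2, 3, 1], [2, 0, -100, -100]]
print(mask)   # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The second segment is the partial tail that ends in STOP: it survives collation at full length, and its two padded positions are flagged in the mask.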
