In streamvln/dataset/vln_action_dataset.py, line 657 calculates num_rounds using floor division (//):
num_rounds = (actions_len - valid_idx) // self.num_frames
The loop then iterates over range(num_rounds + 1) and skips the trailing empty window (the case n * self.num_frames == actions_len - valid_idx). As far as I can tell, this results in only num_rounds samples being added per episode.
This means:
For any episode where (actions_len - valid_idx) % self.num_frames != 0, the final action segment (including the STOP step) is discarded entirely.
The training set loses the STOP step of every such episode, so the agent sees too few examples of when to stop during navigation.
This could lead to the agent failing to stop at the target during inference, severely hurting navigation success rates and SPL metrics.
Example
actions_len - valid_idx = 87
self.num_frames = 32
num_rounds = 87 // 32 = 2
Only windows 0~31 and 32~63 are kept; the final segment 64~86 (23 steps, including STOP) is lost.
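To make the arithmetic concrete, here is a minimal sketch that enumerates the windows for the numbers in this example. The helpers full_windows and all_windows are hypothetical names written for this issue, not functions from the StreamVLN codebase; they only model the behaviour described above (complete chunks only) against ceiling division. Indices are zero-based and inclusive, so 87 steps span 0 to 86.

```python
def full_windows(total_steps: int, num_frames: int) -> list[tuple[int, int]]:
    """Windows kept when only complete chunks survive (floor division)."""
    num_rounds = total_steps // num_frames
    return [(n * num_frames, (n + 1) * num_frames - 1) for n in range(num_rounds)]


def all_windows(total_steps: int, num_frames: int) -> list[tuple[int, int]]:
    """Windows kept with ceiling division; the last one may be partial."""
    num_rounds = -(-total_steps // num_frames)  # integer ceiling division
    return [
        (n * num_frames, min((n + 1) * num_frames, total_steps) - 1)
        for n in range(num_rounds)
    ]


if __name__ == "__main__":
    print(full_windows(87, 32))  # [(0, 31), (32, 63)]
    print(all_windows(87, 32))   # [(0, 31), (32, 63), (64, 86)]
```

When the step count is an exact multiple of num_frames, both helpers return the same windows, so the change would only affect episodes with a remainder.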
Suggested Fix
Instead of discarding the final segment, we should use ceiling division to keep all action segments (including the last partial one) and pad shorter segments to self.num_frames length during collation. This way:
No action/STOP steps are discarded
The agent can learn to recognize and execute the STOP signal
Training stability is maintained via padding
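A rough sketch of the padding step, under stated assumptions: pad_segment is a hypothetical helper, the action IDs (0 for STOP, 1 for a move) are made up for illustration, and -100 is assumed as the padding label because it is the default ignore_index of PyTorch's cross-entropy loss. The actual collate function in the repository may need a different padding value or mask layout.

```python
IGNORE_INDEX = -100  # assumed padding label; PyTorch's default ignore_index


def pad_segment(actions, num_frames, pad_value=IGNORE_INDEX):
    """Right-pad one action segment to num_frames and return (padded, mask).

    mask is 1 for real steps and 0 for padding, so padded positions can be
    excluded from the loss.
    """
    if len(actions) > num_frames:
        raise ValueError("segment is longer than num_frames")
    pad_len = num_frames - len(actions)
    padded = list(actions) + [pad_value] * pad_len
    mask = [1] * len(actions) + [0] * pad_len
    return padded, mask


if __name__ == "__main__":
    # Final partial segment from the example: 22 moves followed by STOP (23 steps).
    last_segment = [1] * 22 + [0]
    padded, mask = pad_segment(last_segment, num_frames=32)
    print(len(padded), sum(mask), padded[22])
```

With this approach the STOP step stays in the batch as a real target, while the padded positions carry no gradient as long as the loss ignores pad_value.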
Could we modify the sampling logic to preserve all action segments (including partial ones) instead of using floor division?