Thank you very much for your excellent work!
We would like to clarify the total dataset size and the data amount for each split shown in Figure 2 (Co-Training Data Recipe of StreamVLN).
Assuming the DAgger data size is exactly 240K and accounts for exactly 16% of the total data, we calculate the total dataset size (X) as follows:
16% × X = 240K
X = 240K / 0.16 = 1500K
Based on this total size of 1500K, we further compute the data amounts for the other splits and for the two top-level groups (VLA and General Multi-modal):
- MP3D (31%): 1500K × 31% = 465K
- HM3D (20%): 1500K × 20% = 300K
- VQA (17%): 1500K × 17% = 255K
- MMC4 (16%): 1500K × 16% = 240K
- VLA (67%): 1500K × 67% = 1005K
- General Multi-modal (33%): 1500K × 33% = 495K
Could you please confirm whether these calculations align with the actual dataset configuration?
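The arithmetic above can be reproduced with a short script. This is only a sketch under the same assumption as the calculation (DAgger is exactly 240K and exactly 16% of the total); the names `PERCENTAGES` and `DAGGER_SIZE_K` are ours for illustration and do not come from the StreamVLN codebase.

```python
# Rounded percentages as read from the Figure 2 pie chart.
PERCENTAGES = {"MP3D": 31, "HM3D": 20, "DAgger": 16, "VQA": 17, "MMC4": 16}

# Assumption: the DAgger split is exactly 240K samples (value in thousands).
DAGGER_SIZE_K = 240

# Total implied by using DAgger as the anchor: 240K / 16% = 1500K.
total_k = DAGGER_SIZE_K * 100 / PERCENTAGES["DAgger"]

# Per-split sizes implied by the rounded percentages.
estimated_k = {name: total_k * pct / 100 for name, pct in PERCENTAGES.items()}

# Top-level groups are sums of their member splits.
vla_k = estimated_k["MP3D"] + estimated_k["HM3D"] + estimated_k["DAgger"]
general_k = estimated_k["VQA"] + estimated_k["MMC4"]

print(f"Total: {total_k:.0f}K")
for name, size_k in estimated_k.items():
    print(f"{name} ({PERCENTAGES[name]}%): {size_k:.0f}K")
print(f"VLA: {vla_k:.0f}K, General Multi-modal: {general_k:.0f}K")
```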
Comparing these calculated values with the numbers stated in the text:
- Vision-Language Action (VLA) Data
  - MP3D: Text states 450K, but calculation (1500K × 31%) gives 465K (discrepancy)
  - HM3D: Text states 300K, calculation (1500K × 20%) gives 300K (consistent ✅)
  - DAgger: Text states 240K, calculation (1500K × 16%) gives 240K (consistent ✅)
  - VLA Total: Text sum is 450K + 300K + 240K = 990K, but calculation (1500K × 67%) gives 1005K (discrepancy)
- General Multi-modal Data
  - VQA: Text states 248K, but calculation (1500K × 17%) gives 255K (discrepancy)
  - MMC4: Text states 230K, but calculation (1500K × 16%) gives 240K (discrepancy)
  - General Total: Text sum is 248K + 230K = 478K, but calculation (1500K × 33%) gives 495K (discrepancy)
We suspect these discrepancies arise because the percentages shown in the pie chart (e.g., 31%, 17%, 16%) are rounded values rather than the exact ratios. Could you please confirm the precise total dataset size and the exact ratio for each split? Thank you so much for your time and clarification!
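One way to check the rounding hypothesis is to go in the opposite direction: take the sizes stated in the text as exact, sum them, and see whether the resulting ratios round to the pie-chart percentages. The sketch below does this; it assumes the text values are exact and that the five splits cover the whole dataset, neither of which we can confirm. The names `STATED_K` and `CHART_PCT` are illustrative only.

```python
# Split sizes as stated in the text (in thousands); assumed exact here.
STATED_K = {"MP3D": 450, "HM3D": 300, "DAgger": 240, "VQA": 248, "MMC4": 230}

# Rounded percentages as read from the Figure 2 pie chart.
CHART_PCT = {"MP3D": 31, "HM3D": 20, "DAgger": 16, "VQA": 17, "MMC4": 16}

# Total under the assumption that these five splits are the full dataset.
total_k = sum(STATED_K.values())

# Unrounded ratio of each split relative to that total.
exact_pct = {name: 100 * size_k / total_k for name, size_k in STATED_K.items()}

print(f"Total from text: {total_k}K")
for name in STATED_K:
    print(f"{name}: {exact_pct[name]:.2f}% (chart: {CHART_PCT[name]}%)")

# True if every unrounded ratio rounds to the percentage shown in the chart.
all_match = all(round(exact_pct[name]) == CHART_PCT[name] for name in STATED_K)
print("All splits round to the chart values:", all_match)
```

Under these assumptions the total comes out to 1468K rather than 1500K, and each ratio rounds to the value shown in the chart, which would be consistent with the rounding explanation. It does not replace confirmation of the actual configuration.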