While training OpenLRM, I consistently hit a problem where, after a certain number of training steps (the interval depends on the batch size I set), GPU utilization drops abruptly to zero. At the same time, CPU usage spikes noticeably, and there is a long stall before the next training step begins.
I have tried increasing the number of workers in the DataLoader, but this did not resolve the problem.
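Since the stall recurs at a fixed step interval tied to the batch size, it may coincide with epoch (or shard) boundaries, where the DataLoader tears down and respawns its worker processes. A common mitigation is to keep workers alive across epochs and overlap host-to-device transfer with compute. The sketch below is a minimal, hypothetical configuration (the dataset and batch size are placeholders, not OpenLRM's actual values), showing the relevant `DataLoader` flags:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Placeholder standing in for the real training dataset."""
    def __init__(self, n=64):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Simulate a sample (e.g. an image tensor).
        return torch.randn(3, 8, 8)

loader = DataLoader(
    ToyDataset(),
    batch_size=4,
    num_workers=2,           # parallel decoding/loading in worker processes
    persistent_workers=True, # keep workers alive across epochs (avoids
                             # respawn cost at each epoch boundary)
    pin_memory=True,         # page-locked host memory for faster H2D copies
    prefetch_factor=2,       # batches each worker prepares ahead of time
)

# Iterate two "epochs"; with persistent_workers the second epoch
# starts without re-forking worker processes.
for epoch in range(2):
    for batch in loader:
        pass  # training step would go here
```

If the stall persists with persistent workers, it is worth profiling whether per-sample preprocessing (e.g. decoding or augmentation on the CPU) is simply slower than the GPU step, in which case caching preprocessed samples or moving work onto the GPU may be needed.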
Environment:
- GPU Model: 8x NVIDIA A100-SXM4-40GB
- CUDA Version: 11.8 (cu118 wheels)
- PyTorch Version: 2.4.0