While training OpenLRM, I consistently hit a problem where, after a certain number of training steps (the interval depends on the batch size I set), GPU utilization drops abruptly to zero. At the same time, CPU usage spikes noticeably, and there is a long stall before the next training step begins.
I have tried increasing the number of workers in the DataLoader, but this did not resolve the problem.
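Since the stall recurs at a fixed step interval tied to the batch size, it may coincide with epoch (or shard) boundaries, where the DataLoader tears down and respawns its worker processes. A common mitigation is to keep workers alive across epochs and overlap host-to-device transfer with compute. The sketch below is a minimal, hypothetical configuration (the dataset and batch size are placeholders, not OpenLRM's actual values), showing the relevant `DataLoader` flags:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Placeholder standing in for the real training dataset."""
    def __init__(self, n=64):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Simulate a sample (e.g. an image tensor).
        return torch.randn(3, 8, 8)

loader = DataLoader(
    ToyDataset(),
    batch_size=4,
    num_workers=2,           # parallel decoding/loading in worker processes
    persistent_workers=True, # keep workers alive across epochs (avoids
                             # respawn cost at each epoch boundary)
    pin_memory=True,         # page-locked host memory for faster H2D copies
    prefetch_factor=2,       # batches each worker prepares ahead of time
)

# Iterate two "epochs"; with persistent_workers the second epoch
# starts without re-forking worker processes.
for epoch in range(2):
    for batch in loader:
        pass  # training step would go here
```

If the stall persists with persistent workers, it is worth profiling whether per-sample preprocessing (e.g. decoding or augmentation on the CPU) is simply slower than the GPU step, in which case caching preprocessed samples or moving work onto the GPU may be needed.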
Environment:
- GPU Model: 8x NVIDIA A100-SXM4-40GB
- CUDA Version: 11.8 (cu118 wheels)
- PyTorch Version: 2.4.0