My server has plenty of memory but not many CPU cores, so IO may be the bottleneck for the training trials.
I noticed that you set pin_memory=False in the PyTorch DataLoader, and toggling it made no difference to the run time in my tests.
Since this experiment is quite IO heavy, have you tried any methods to speed it up?
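
For concreteness, here is a minimal sketch of the kind of comparison I mean. It is not the code from this repository: the in-memory random dataset, the helper name time_one_pass, and the batch size are all placeholders I made up for illustration, and the real experiment reads its data from disk.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_pass(pin_memory, num_workers=0, batch_size=64):
    """Iterate once over a placeholder dataset and return (samples seen, seconds)."""
    # Hypothetical stand-in data; the real training data is loaded from disk.
    dataset = TensorDataset(
        torch.randn(1024, 3, 32, 32),
        torch.randint(0, 10, (1024,)),
    )

    # Pinned memory only matters when batches are copied to a GPU.
    use_cuda = torch.cuda.is_available()
    pin = pin_memory and use_cuda
    device = torch.device("cuda" if use_cuda else "cpu")

    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,  # limited by the number of CPU cores
        pin_memory=pin,
    )

    start = time.perf_counter()
    n_seen = 0
    for images, labels in loader:
        # non_blocking copies can only overlap with compute from pinned memory.
        images = images.to(device, non_blocking=pin)
        labels = labels.to(device, non_blocking=pin)
        n_seen += images.shape[0]
    if use_cuda:
        # Wait for pending asynchronous copies before stopping the clock.
        torch.cuda.synchronize()
    return n_seen, time.perf_counter() - start


if __name__ == "__main__":
    for flag in (False, True):
        seen, seconds = time_one_pass(pin_memory=flag)
        print(f"pin_memory={flag}: {seen} samples in {seconds:.3f}s")
```

As far as I understand, pin_memory only affects the host-to-GPU copy, so if the time is dominated by disk reads or CPU-side preprocessing, seeing no difference might be expected.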