
Still 'CUDA out of memory' with 4 TITAN X (pascal) when training model in PASCAL VOC dataset #61

@rrryan2016

Description


Hey, thanks for releasing your code!

OS: Ubuntu 16.04
CUDA: 8.0.44
GPU: TITAN X Pascal (11.2 GB memory) x 4

I intend to train a model on PASCAL VOC 2012, and I run:

CUDA_VISIBLE_DEVICES=0,1,2,3 python train_autodeeplab.py --backbone resnet --lr 0.007 --workers 4 --epochs 40 --batch_size 1 --eval_interval 1 --dataset pascal

The error message is shown below:

Namespace(arch_lr=0.003, arch_weight_decay=0.001, backbone='resnet', base_size=320, batch_size=1, checkname='deeplab-resnet', crop_size=320, cuda=True, dataset='pascal', epochs=40, eval_interval=1, freeze_bn=False, ft=False, gpu_ids=0, loss_type='ce', lr=0.007, lr_scheduler='cos', momentum=0.9, nesterov=False, no_cuda=False, no_val=False, out_stride=16, resize=512, resume=None, seed=1, start_epoch=0, sync_bn=True, test_batch_size=1, use_balanced_weights=False, use_sbd=False, weight_decay=0.0003, workers=4)
Number of images in train: 1464
Number of images in val: 1449
cuda finished
Using cos LR Scheduler!
Starting Epoch: 0 Total Epoches: 40
  0%|          | 0/1464 [00:00<?, ?it/s]
=>Epoches 0, learning rate = 0.0070, previous best = 0.0000
/home/ljy/anaconda3/envs/p36c8t041ljy/lib/python3.6/site-packages/torch/nn/modules/upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead. warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
/home/ljy/anaconda3/envs/p36c8t041ljy/lib/python3.6/site-packages/torch/nn/functional.py:1961: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details. "See the documentation of nn.Upsample for details.".format(mode))
Traceback (most recent call last):
  File "train_autodeeplab.py", line 324, in <module>
    main()
  File "train_autodeeplab.py", line 317, in main
    trainer.training(epoch)
  File "train_autodeeplab.py", line 116, in training
    output = self.model(image)
  File "/home/ljy/anaconda3/envs/p36c8t041ljy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ljy/li-codes/lwz/codes/AutoDeeplab/auto_deeplab.py", line 214, in forward
    level4_new_2 = self.cells[count] (self.level_4[-2], self.level_8[-1], weight_cells)
  File "/home/ljy/anaconda3/envs/p36c8t041ljy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ljy/li-codes/lwz/codes/AutoDeeplab/model_search.py", line 68, in forward
    s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states) if h is not None)
  File "/home/ljy/li-codes/lwz/codes/AutoDeeplab/model_search.py", line 68, in <genexpr>
    s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states) if h is not None)
  File "/home/ljy/anaconda3/envs/p36c8t041ljy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ljy/li-codes/lwz/codes/AutoDeeplab/model_search.py", line 22, in forward
    return sum(w * op(x) for w, op in zip(weights, self._ops))
  File "/home/ljy/li-codes/lwz/codes/AutoDeeplab/model_search.py", line 22, in <genexpr>
    return sum(w * op(x) for w, op in zip(weights, self._ops))
RuntimeError: CUDA error: out of memory

I guessed I might be failing to use multiple GPUs, so I even changed a line of the code to self.model = torch.nn.DataParallel(self.model, device_ids=[0, 1, 2, 3]), but the same error message shows again.
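If I understand DataParallel correctly, it splits the input batch across device_ids and replicates the whole model on every GPU, so with batch_size 1 there is nothing to split and per-GPU memory is not reduced. Here is the minimal sketch I used to check the wrapping (a tiny stand-in module, not the actual AutoDeeplab network):

```python
import torch
import torch.nn as nn

# Tiny stand-in for the real network, just to exercise the wrapper.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())

if torch.cuda.is_available():
    model = model.cuda()
    # DataParallel scatters the *batch* dimension across the listed GPUs
    # and replicates the full model on each one; a batch of 1 therefore
    # runs entirely on the first device.
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))

x = torch.randn(2, 3, 32, 32)
if torch.cuda.is_available():
    x = x.cuda()
out = model(x)
print(tuple(out.shape))
```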

What can I do to resolve it, please?
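In case it helps with diagnosis, here is a small helper (assuming only the standard torch.cuda API) that I could run to see how much memory is actually allocated on each of the four cards:

```python
import torch

def gpu_memory_report():
    """Return one summary line per visible GPU (empty list without CUDA)."""
    lines = []
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        used = torch.cuda.memory_allocated(i)
        lines.append(f"GPU {i}: {used / 2**20:.0f} / {total / 2**20:.0f} MiB allocated")
    return lines

print(gpu_memory_report())
```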

Thanks in advance!
