I am trying to run training for the end-to-end masked transformer on the ActivityNet dataset. I am currently running this on an AWS EC2 p2.xlarge instance, which has a single GPU. I call the training script as follows:

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --dist_url ./ss_model --cfgs_file cfgs/anet.yml --checkpoint_path ./checkpoint/ss_model --batch_size 14 --world_size 1 --cuda --sent_weight 0.25 --mask_weight 1.0 --gated_mask | tee log/ss_model-0

Unfortunately, I run into the multiprocessing error below, and so far I have been unable to debug it successfully. When I add the spawn start method as the error message suggests, further errors occur. I would appreciate any help in figuring out what I am doing wrong.
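For reference, the spawn change I tried looks roughly like this (a minimal sketch only; I am not certain where in scripts/train.py the call is supposed to go, and main() here just stands for the existing training entry point):

import torch.multiprocessing as mp

if __name__ == '__main__':
    # Set the start method before any worker pools or DataLoader workers
    # are created; force=True in case a method was already set elsewhere.
    mp.set_start_method('spawn', force=True)
    main()  # placeholder for the existing training entry point

The full output without that change is below.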
train.py:122: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
options_yaml = yaml.load(handle)
Namespace(alpha=0.95, attn_dropout=0.2, batch_size=14, beta=0.999, cap_dropout=0.2, cfgs_file='cfgs/anet.yml', checkpoint_path='./checkpoint/weird', cls_weight=1.0, cuda=True, d_hidden=2048, d_model=1024, dataset='anet', dataset_file='./data/anet/anet_annotations_trainval.json', densecap_references=['./data/anet/val_1.json', './data/anet/val_2.json'], dist_backend='gloo', dist_url='./weird', dur_file='./data/anet/anet_duration_frame.csv', enable_visdom=False, epsilon=1e-08, feature_root='./dataset', gated_mask=True, grad_norm=1, image_feat_size=3072, in_emb_dropout=0.1, kernel_list=[1, 2, 3, 4, 5, 7, 9, 11, 15, 21, 29, 41, 57, 71, 111, 161, 211, 251], learning_rate=0.1, load_train_samplelist=False, load_valid_samplelist=False, loss_alpha_r=2, losses_log_every=1, mask_weight=1.0, max_epochs=20, max_sentence_len=20, n_heads=8, n_layers=2, neg_thresh=0.3, num_workers=1, optim='sgd', patience_epoch=1, pos_thresh=0.7, reduce_factor=0.5, reg_weight=10, sample_prob=0, sampling_sec=0.5, save_checkpoint_every=1, save_train_samplelist=False, save_valid_samplelist=False, scst_weight=0.0, seed=213, sent_weight=0.25, slide_window_size=480, slide_window_stride=20, start_from='', stride_factor=50, train_data_folder=['training'], train_sample=20, train_samplelist_path='/z/home/luozhou/subsystem/densecap_vid/train_samplelist.pkl', val_data_folder=['validation'], valid_batch_size=64, valid_samplelist_path='/z/home/luozhou/subsystem/densecap_vid/valid_samplelist.pkl', vis_emb_dropout=0.1, world_size=1)
loading dataset
# of words in the vocab: 4563
# of sentences in training: 37421, # of sentences in validation: 17505
# of training videos: 10009
size of the sentence block variable (['training']): torch.Size([37415, 20])
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
torch.cuda._lazy_init()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Process ForkPoolWorker-2:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
torch.cuda._lazy_init()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Process ForkPoolWorker-3:
Process ForkPoolWorker-4:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
torch.cuda._lazy_init()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
torch.cuda._lazy_init()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.c line=150 error=3 : initialization error
Process ForkPoolWorker-5:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 94, in rebuild_storage_cuda
return storage._new_view(offset, view_size)
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.c:150
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.c line=150 error=3 : initialization error
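Reading the tracebacks, the failure seems to happen while the DataLoader workers unpickle batches that already contain CUDA storages (rebuild_storage_cuda), which forked workers cannot do. One workaround I have been wondering about (I am not sure whether it matches how this repo is meant to work) is to keep the dataset tensors on the CPU and only move each batch to the GPU in the main process, roughly:

import torch
from torch.utils.data import Dataset, DataLoader

class CpuFeatureDataset(Dataset):
    # Hypothetical stand-in for the ActivityNet feature dataset; the point
    # is that __getitem__ returns CPU tensors, so forked workers never
    # touch CUDA.
    def __init__(self, num_items=100, feat_size=3072):
        self.features = torch.randn(num_items, feat_size)  # stays on CPU

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx]

loader = DataLoader(CpuFeatureDataset(), batch_size=14, shuffle=True,
                    num_workers=1, pin_memory=True)

for batch in loader:
    # Move the batch to the GPU here, in the main process only.
    batch = batch.cuda(non_blocking=True)
    # ... forward / backward pass ...

Is something like that the intended usage here, or am I missing a different fix for the fork/CUDA issue?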