
Multi-GPU training for LiDAR and fusion exp #9

Open
Song-Jingyu opened this issue Aug 18, 2023 · 4 comments

Comments


Song-Jingyu commented Aug 18, 2023

Hi,

Thanks for open-sourcing this work. When I was trying to train the teacher network for the LiDAR and fusion experiments, I wasn't able to start training with multiple GPUs. Single-GPU training works, and multi-GPU training of the camera experiment works. Here is the error log, which I did not find very informative.

I already changed num_workers to 0, but it did not help. Is there anything significantly different among the modalities? Would you mind providing any insight into why this happens? Thanks!
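For reference, the `num_workers` change mentioned above amounts to something like the following (a minimal sketch with a toy dataset; `DemoDataset` and the batch size are illustrative and not from the repo). Setting `num_workers=0` keeps all data loading in the main process, which rules out worker-process issues (shared-memory limits, fork-after-CUDA-init) as the cause:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DemoDataset(Dataset):
    """Stand-in dataset; the real experiment builds its own nuScenes dataset."""

    def __init__(self, n=8):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


# num_workers=0 disables worker subprocesses entirely: every batch is
# assembled in the main process, so DataLoader-worker crashes are excluded.
loader = DataLoader(DemoDataset(), batch_size=4, num_workers=0)
batches = list(loader)
print(len(batches))  # 8 samples / batch_size 4 -> 2 batches
```

If the error persists with `num_workers=0` (as reported here), the data-loading workers are unlikely to be the culprit.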

RuntimeError: zero_ /tmp/pip-build-env-__cfq4tn/overlay/lib/python3.6/site-packages/cumm/include/tensorview/tensor.h 221
cuda failed with error 1 invalid argument. use CUDA_LAUNCH_BLOCKING=1 to get correct traceback.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_centerhead_lidar_exp.py", line 35, in <module>
    run_cli(Exp, "BEVFusion_nuscenes_centerhead_lidar_exp")
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/base_cli.py", line 58, in run_cli
    trainer.fit(model, model.train_dataloader, model.val_dataloader)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 698, in _call_and_handle_interrupt
    self.training_type_plugin.reconciliate_processes(traceback.format_exc())
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 533, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 3
Traceback (most recent call last):
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 219, in advance
    self.optimizer_idx,
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 386, in _optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 164, in step
    trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 339, in optimizer_step
    self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=closure, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/optim/adamw.py", line 65, in step
    loss = closure()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 148, in _wrap_closure
    closure_result = closure()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 160, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 142, in closure
    step_output = self._step_fn()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 435, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 219, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 439, in training_step
    return self.model(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 81, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_base_exp.py", line 374, in training_step
    ret_dict, tf_dict, _, _, _, _ = self(points, imgs, metas, gt_boxes)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_base_exp.py", line 358, in forward
    return self.model(points, imgs, metas, gt_boxes)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_centerhead_fusion_exp.py", line 144, in forward
    lidar_output = self.lidar_encoder(lidar_points)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_base_exp.py", line 76, in forward
    voxels, voxel_coords, voxel_num_points = self.voxelizer(lidar_points)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/data/det3d/preprocess/voxelization.py", line 54, in forward
    voxel_output = self.voxel_generator(p)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/spconv/pytorch/utils.py", line 88, in __call__
    res = self.generate_voxel_with_id(pc, clear_voxels, empty_mean)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/spconv/pytorch/utils.py", line 139, in generate_voxel_with_id
    empty_mean, clear_voxels, stream)
RuntimeError: zero_ /tmp/pip-build-env-__cfq4tn/overlay/lib/python3.6/site-packages/cumm/include/tensorview/tensor.h 221
cuda failed with error 1 invalid argument. use CUDA_LAUNCH_BLOCKING=1 to get correct traceback.


Killed
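The error message itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1, which makes CUDA kernel launches synchronous so the exception is raised at the kernel that actually failed rather than at a later, unrelated op. The usual form is to prefix the training command, e.g. `CUDA_LAUNCH_BLOCKING=1 python BEVFusion_nuscenes_centerhead_lidar_exp.py`. As a self-contained sketch, the variable can also be injected from a launcher script (the child command below is a stand-in for the training script, just to show the variable reaches the child process):

```python
import os
import subprocess
import sys

# CUDA_LAUNCH_BLOCKING must be in the environment before the training
# process initializes CUDA, so set it when the process is spawned.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")

# Stand-in child process: echoes the variable back to prove it is set.
# In practice the command would be the experiment script, e.g.
#   [sys.executable, "BEVFusion_nuscenes_centerhead_lidar_exp.py"]
out = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"],
    env=env, capture_output=True, text=True,
)
print(out.stdout.strip())  # → 1
```

Note that setting the variable inside an already-running process after CUDA has initialized has no effect, which is why it belongs on the command line or in the launcher.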

@LutaoChu

I ran into the same problem. The one difference is that with the fusion modality and batch size 1, multi-GPU training works normally. Do you know how to solve this problem? @Song-Jingyu

@Song-Jingyu
Author

> I ran into the same problem. The one difference is that with the fusion modality and batch size 1, multi-GPU training works normally. Do you know how to solve this problem? @Song-Jingyu

I think it turned out that my server has limited CPU/RAM. I only did a preliminary exploration of this repo :(


LutaoChu commented Mar 1, 2024

Thanks for the response. From the error logs it looks like a GPU-memory-related issue; why do you say it's because of limited CPU/RAM?
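One quick way to tell the two apart is to compare host RAM against GPU memory while the job runs. Below is a stdlib-only sketch of the host-side check (POSIX `sysconf` names; the GPU-side counterpart would be `torch.cuda.memory_allocated` or `nvidia-smi`, which need a CUDA device and are therefore only shown as comments):

```python
import os

# Host-side check: total physical RAM via POSIX sysconf.
page_size = os.sysconf("SC_PAGE_SIZE")
num_pages = os.sysconf("SC_PHYS_PAGES")
total_ram_gib = page_size * num_pages / 1024 ** 3
print(f"host RAM: {total_ram_gib:.1f} GiB")

# GPU-side counterpart (needs torch + a CUDA device, so left as comments):
#   import torch
#   print(torch.cuda.memory_allocated() / 1024 ** 3, "GiB allocated")
#   print(torch.cuda.memory_reserved() / 1024 ** 3, "GiB reserved")
```

If host RAM fills up (e.g. many DDP ranks each loading the full point-cloud dataset) the kernel OOM killer can terminate a rank, which would also explain the trailing "Killed" and the deadlock detected on the surviving ranks.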

@SivenCapo

I ran into the same problem. I tried every way I know of, and it always failed.
