Training error in dpt_vit-b16 #77

Open
a227799770055 opened this issue Feb 24, 2023 · 3 comments

a227799770055 commented Feb 24, 2023

Hi @zhyever,
I want to train the model on my custom dataset with dpt_vit-b16_kitti, and I ran into the error below.
It seems that the pretrained file nfs/checkpoints/jx_vit_base_p16_224-80ecf9dd.pth cannot be found. Where can I download this file, and where should I put it?

Traceback (most recent call last):
  File "./tools/train.py", line 168, in <module>
    main()
  File "./tools/train.py", line 135, in main
    model.init_weights()
  File "/home/insign2/.local/lib/python3.8/site-packages/mmcv/runner/base_module.py", line 117, in init_weights
    m.init_weights()
  File "/home/insign2/work/Monocular-Depth-Estimation-Toolbox/depth/models/backbones/vit.py", line 282, in init_weights
    checkpoint = CheckpointLoader.load_checkpoint(
  File "/home/insign2/.local/lib/python3.8/site-packages/mmcv/runner/checkpoint.py", line 314, in load_checkpoint
    return checkpoint_loader(filename, map_location)  # type: ignore
  File "/home/insign2/.local/lib/python3.8/site-packages/mmcv/runner/checkpoint.py", line 333, in load_from_local
    raise FileNotFoundError(f'{filename} can not be found.')
FileNotFoundError: nfs/checkpoints/jx_vit_base_p16_224-80ecf9dd.pth can not be found.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 151529) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/insign2/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/insign2/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/insign2/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/insign2/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/insign2/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/insign2/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

a227799770055 changed the title from "dptvit" to "dpt_vit-b16 training error" on Feb 24, 2023
a227799770055 changed the title from "dpt_vit-b16 training error" to "Training error in dpt_vit-b16" on Feb 24, 2023

@Z-chocking

You need to download the pre-trained models first. Please refer to DPT's markdown file.
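The path in the error message is relative, so it is resolved against the directory the training script is launched from. As a minimal sketch (the helper name `check_checkpoint` and its `base_dir` parameter are hypothetical, not part of the toolbox or mmcv), a quick check like this can confirm that the downloaded file is where the loader will look before starting a run:

```python
from pathlib import Path


def check_checkpoint(path_str, base_dir="."):
    """Return the resolved checkpoint path, or raise with a hint if it is missing.

    Hypothetical helper: relative paths are resolved against base_dir, which
    mirrors how a relative checkpoint path depends on the launch directory.
    """
    path = Path(path_str)
    if not path.is_absolute():
        path = Path(base_dir) / path
    path = path.resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Download the pre-trained weights and place "
            f"them here, or update the checkpoint path in the config."
        )
    return path


if __name__ == "__main__":
    # Path copied from the error message above.
    try:
        print(check_checkpoint("nfs/checkpoints/jx_vit_base_p16_224-80ecf9dd.pth"))
    except FileNotFoundError as err:
        print(err)
```

Running it from the repository root shows the absolute location that has to exist. Either place the file there or point the config at wherever the weights were saved.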

@a227799770055

@Z-chocking thank you!

a227799770055 commented Mar 15, 2023

@zhyever
Hi, I got another error during training.
How can I solve this problem?

Traceback (most recent call last):
  File "./tools/train.py", line 168, in <module>
    main()
  File "./tools/train.py", line 157, in main
    train_depther(
  File "/home/insign/work/Monocular-Depth-Estimation-Toolbox/depth/apis/train.py", line 121, in train_depther
    runner.run(data_loaders, cfg.workflow)
  File "/home/insign/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/insign/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 49, in train
    for i, data_batch in enumerate(self.data_loader):
  File "/home/insign/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/insign/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/insign/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/insign/.local/lib/python3.8/site-packages/torch/_utils.py", line 542, in reraise
    raise RuntimeError(msg) from None
RuntimeError: Caught UnicodeDecodeError in DataLoader worker process 0.
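The traceback stops before the worker's own stack, so it does not show which file failed to decode. One possible cause with a custom dataset (an assumption, not confirmed in this thread) is a split or annotation text file that is not valid UTF-8, or a binary file being read as text. A minimal sketch for locating such files, using a hypothetical helper `find_non_utf8`:

```python
from pathlib import Path


def find_non_utf8(paths):
    """Return (path, byte_offset) for each file that is not valid UTF-8.

    Hypothetical diagnostic helper; byte_offset is the position of the
    first byte that could not be decoded.
    """
    bad = []
    for p in map(Path, paths):
        try:
            p.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            bad.append((str(p), err.start))
    return bad


if __name__ == "__main__":
    # "data" is a placeholder for the directory holding the split files.
    for name, offset in find_non_utf8(Path("data").rglob("*.txt")):
        print(f"{name}: invalid UTF-8 at byte {offset}")
```

If nothing is reported, the full worker traceback is the next thing to look at. Running the data loader in the main process (zero worker processes) usually makes the underlying exception appear with its complete stack.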
