torch problem #32

Open
tianqibucuo0 opened this issue May 18, 2023 · 13 comments

Comments

@tianqibucuo0

My CUDA version is 11.7, but DeepRule.txt specifies CUDA 8.0. Can I install 11.7 instead?

@soap117
Owner

soap117 commented May 18, 2023

I have added a new environment file (see the updates). With it I am able to compile the cpools layers.

@tianqibucuo0
Author

Thank you very much!

@tianqibucuo0
Author

Hello, requirement-2023.txt lists 33 packages, but DeepRule.txt lists 96. Do the other packages not need to be installed?

@soap117
Owner

soap117 commented Jun 16, 2023

Generally not; I have tested it. If you find that something is missing, just install it.

@tianqibucuo0
Author

Hello, I am training a model using "linedata(1028)" and encountered two errors. Could you please help me?
1. DeepRule-master/models/py_utils/kp_utils.py:592: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/cuda/Indexing.cu:1239.)
tag_full[1-mask_full] = 0
2. python3.9/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Segmentation fault (core dumped)

@soap117
Owner

soap117 commented Jun 21, 2023

For the first one, I think you can use type_as to convert to torch.float32 before the masked_fill_ command.
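
For readers hitting the same two warnings, here is a minimal sketch of one possible way to address them. It follows the warning text itself, which asks for a torch.bool mask and for reduction='sum'. The names tag_full and mask_full come from the warning above, but their shapes and values here are made up, and l1_loss is only a stand-in because the thread does not show which loss DeepRule calls.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for the tensors in kp_utils.py (values are hypothetical).
mask_full = torch.tensor([[1, 0, 1], [0, 0, 1]], dtype=torch.uint8)
tag_full = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# Old pattern that triggers the uint8 warning: tag_full[1 - mask_full] = 0
# Convert the mask to bool once and invert it with ~ instead of arithmetic.
keep = mask_full.bool()
tag_full[~keep] = 0

# For the second warning, pass reduction= instead of size_average/reduce.
# l1_loss is a placeholder loss chosen only for this sketch.
pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 1.0, 1.0])
loss = F.l1_loss(pred, target, reduction="sum")
```

Both messages are warnings rather than errors, so fixing them is not expected to change whether the segmentation fault occurs.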

@tianqibucuo0
Author

Thank you. After fixing all the UserWarning messages, I still get "Segmentation fault (core dumped)" during execution. Here is my run log. Can you please explain why this is happening?

(DeepRule) sun@sun:~/DeepRule-master$ python train_chart.py --cfg_file CornerNetLine --data_dir "/home/sun/data/linedata(1028)" --cache_path "/home/sun/data/linedata(1028)/cache/"
:228: RuntimeWarning: compiletime version 3.6 of module 'pycocotools._mask' does not match runtime version 3.9
:228: RuntimeWarning: builtins.type size changed, may indicate binary incompatibility. Expected 864 from C header, got 880 from PyObject
./config/CornerNetLine.json
['cache', 'line']
loading all datasets...
using 1 threads
loading from cache file: /home/sun/data/linedata(1028)/cache/line_train2019.pkl
loading annotations into memory...
/home/sun/data/linedata(1028)/line/annotations/instancesLine(1023)_train2019.json
Done (t=2.72s)
creating index...
index created!
loading from cache file: /home/sun/data/linedata(1028)/cache/line_val2019.pkl
loading annotations into memory...
/home/sun/data/linedata(1028)/line/annotations/instancesLine(1023)_val2019.json
Done (t=0.05s)
creating index...
index created!
system config...
{'batch_size': 5,
'cache_dir': '/home/sun/yangshaohan/618/data/linedata(1028)/cache/',
'chunk_sizes': [5, 7, 7, 7],
'config_dir': './config',
'data_dir': '/home/sun/yangshaohan/618/data/linedata(1028)',
'data_rng': RandomState(MT19937) at 0x7FE69C7CB340,
'dataset': 'Line',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.00025,
'max_iter': 50000,
'nnet_rng': RandomState(MT19937) at 0x7FE69C7CB440,
'opt_algo': 'adam',
'prefetch_size': 5,
'pretrain': None,
'result_dir': './results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CornerNetLine',
'stepsize': 45000,
'tar_data_dir': 'cls',
'test_split': 'testchart',
'train_split': 'trainchart',
'val_iter': 100,
'val_split': 'valchart',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 1,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.3,
'gaussian_radius': -1,
'input_size': [511, 511],
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 200,
'weight_exp': 8}
len of db: 116745
building model...
module_file: models.CornerNetLine
use kp
total parameters: 198592138
setting learning rate to: 0.00025
training start...
start prefetching data...
shuffling indices...
['read.txt']
0%| | 0/50000 [00:00<?, ?it/s]
Segmentation fault (core dumped)

@soap117
Owner

soap117 commented Jun 27, 2023

This sounds like a problem with the CornerNet package. Follow the instructions to compile it.
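
One way to narrow down a segmentation fault like this is Python's standard faulthandler module, which writes the Python-level stack of each thread when the process receives a fatal signal. The sketch below is an assumption about how it could be wired in near the top of a training script, not something the repository does. The helper name enable_crash_trace and the file name fault_trace.log are hypothetical.

```python
import faulthandler


def enable_crash_trace(path="fault_trace.log"):
    """Write Python tracebacks to `path` if the process crashes (e.g. SIGSEGV)."""
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Call this before importing or running the code that crashes.
trace_file = enable_crash_trace()
```

The same handler can be turned on without editing the script by running `python -X faulthandler train_chart.py ...`. The pycocotools warning in the log above reports a module compiled for Python 3.6 running under 3.9, so the resulting trace may point at a compiled extension that was built for a different interpreter.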

@tianqibucuo0
Author

Hello, after recompiling, the same problem persists. Could you please tell me the Python, CUDA, and GCC versions that go with the requirements-2023.txt file? Additionally, how much GPU memory is required to train the "line" model?

@soap117
Owner

soap117 commented Jul 1, 2023

Package Version


adal 1.2.7
argcomplete 2.1.2
azure-common 1.1.28
azure-core 1.27.1
azure-graphrbac 0.61.1
azure-mgmt-authorization 3.0.0
azure-mgmt-containerregistry 10.1.0
azure-mgmt-core 1.4.0
azure-mgmt-keyvault 10.2.2
azure-mgmt-resource 22.0.0
azure-mgmt-storage 21.0.0
azureml 0.2.7
azureml-core 1.52.0
backports.tempfile 1.0
backports.weakref 1.0.post1
bcrypt 4.0.1
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 3.1.0
contextlib2 21.6.0
contourpy 1.0.5
cryptography 41.0.1
cycler 0.11.0
docker 6.1.3
fonttools 4.25.0
h5py 3.8.0
humanfriendly 10.0
idna 3.4
importlib-resources 5.2.0
isodate 0.6.1
jeepney 0.8.0
jmespath 1.0.1
jsonpickle 3.0.1
kiwisolver 1.4.4
knack 0.10.1
matplotlib 3.7.1
mkl-fft 1.3.6
mkl-random 1.2.2
mkl-service 2.4.0
msal 1.22.0
msal-extensions 1.0.0
msrest 0.7.1
msrestazure 0.6.4
munkres 1.1.4
ndg-httpsclient 0.5.1
numpy 1.24.3
oauthlib 3.2.2
opencv-python 4.7.0.72
packaging 23.0
pandas 2.0.3
paramiko 3.2.0
pathspec 0.11.1
Pillow 9.4.0
pip 23.0.1
pkginfo 1.9.6
ply 3.11
portalocker 2.7.0
pyasn1 0.5.0
pycparser 2.21
Pygments 2.15.1
PyJWT 2.7.0
PyNaCl 1.5.0
pyOpenSSL 23.2.0
pyparsing 3.0.9
PyQt5-sip 12.11.0
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2023.3
PyYAML 6.0
requests 2.30.0
requests-oauthlib 1.3.1
SecretStorage 3.3.3
setuptools 66.0.0
sip 6.6.2
six 1.16.0
tabulate 0.9.0
toml 0.10.2
torch 1.7.1+cu110
torchaudio 0.7.2
torchvision 0.8.2+cu110
tornado 6.2
typing_extensions 4.5.0
tzdata 2023.3
urllib3 1.26.16
websocket-client 1.6.1
wheel 0.38.4

I am able to run the training code.
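
The list above covers Python packages only (the torch entry's +cu110 tag indicates a CUDA 11.0 build), so the interpreter and GCC versions asked about are not shown. A small sketch like the following could be run in each environment to collect them for comparison. The helper name collect_versions is hypothetical, and the torch import is guarded because it may not be installed.

```python
import platform
import shutil
import subprocess


def collect_versions():
    """Gather interpreter, torch/CUDA and GCC versions into a dict."""
    info = {"python": platform.python_version()}

    try:
        import torch  # optional: only reported when installed
        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        info["torch"] = None
        info["torch_cuda"] = None

    gcc_path = shutil.which("gcc")
    if gcc_path is None:
        info["gcc"] = None
    else:
        result = subprocess.run(
            [gcc_path, "--version"], capture_output=True, text=True
        )
        lines = result.stdout.splitlines()
        info["gcc"] = lines[0] if lines else None

    return info


versions = collect_versions()
for name, value in versions.items():
    print(f"{name}: {value}")
```

Posting this output from both the working and the failing machine makes version mismatches easier to spot than comparing full package lists.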

@tianqibucuo0
Author

tianqibucuo0 commented Jul 1, 2023 via email

@tianqibucuo0
Author

Hello, my GPU is relatively small, so I modified train.json and val.json files to keep only 10 data entries for testing purposes. However, when it reaches the line "training = pinned_training_queue.get(block=True)", the execution gets stuck and does not proceed. Below is my execution process. Can you please tell me the reason for this?

/home/ubuntu/anaconda3/envs/myenv/bin/python /home/ubuntu/download/pycharm-community-2023.1.4/plugins/python-ce/helpers/pydev/pydevd.py --multiprocess --qt-support=auto --client 127.0.0.1 --port 44227 --file /media/ubuntu/A4823F1E823EF480/2023/env/python/DeepRule-master-weixiugai/DeepRule-master/train_chart.py
Connected to pydev debugger (build 231.9225.15)
/home/ubuntu/anaconda3/envs/myenv/lib/python3.6/site-packages/OpenSSL/_util.py:6: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography. The next release of cryptography will remove support for Python 3.6.
from cryptography.hazmat.bindings.openssl.binding import Binding
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.scriptrun = azureml.core.script_run:ScriptRun._from_run_dto with exception (pyOpenSSL 23.2.0 (/home/ubuntu/anaconda3/envs/myenv/lib/python3.6/site-packages), Requirement.parse('pyopenssl<23.0.0')).
/media/ubuntu/A4823F1E823EF480/2023/env/python/DeepRule-master-weixiugai/DeepRule-master/train_chart.py:22: FutureWarning: azureml.core: AzureML support for Python 3.6 is deprecated and will be dropped in an upcoming release. At that point, existing Python 3.6 workflows that use AzureML will continue to work without modification, but Python 3.6 users will no longer get access to the latest AzureML features and bugfixes. We recommend that you upgrade to Python 3.7 or newer. To disable SDK V1 deprecation warning set the environment variable AZUREML_DEPRECATE_WARNING to 'False'
from azureml.core.run import Run
['line']
loading all datasets...
using 1 threads
loading from cache file: /media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line/line_train2019.pkl
loading annotations into memory...
/media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line/annotations/instancesLine(1023)_train2019.json
Done (t=0.00s)
creating index...
index created!
loading from cache file: /media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line/line_val2019.pkl
loading annotations into memory...
/media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line/annotations/instancesLine(1023)_val2019.json
Done (t=0.00s)
creating index...
index created!
system config...
{'batch_size': 5,
'cache_dir': '/media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)/line',
'chunk_sizes': [5, 7, 7, 7],
'config_dir': './config',
'data_dir': '/media/ubuntu/A4823F1E823EF480/2023/env/python/linedata(1028)',
'data_rng': RandomState(MT19937) at 0x7FCC248FF258,
'dataset': 'Line',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.01,
'max_iter': 50000,
'nnet_rng': RandomState(MT19937) at 0x7FCC248FF570,
'opt_algo': 'adam',
'prefetch_size': 5,
'pretrain': None,
'result_dir': './results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CornerNetLine',
'stepsize': 45000,
'tar_data_dir': 'cls',
'test_split': 'testchart',
'train_split': 'trainchart',
'val_iter': 100,
'val_split': 'valchart',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 1,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.3,
'gaussian_radius': -1,
'input_size': [511, 511],
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 200,
'weight_exp': 8}
len of db: 11
building model...
module_file: models.CornerNetLine
use kp
total parameters: 198592138
setting learning rate to: 0.01
training start...
start prefetching data...
['read.txt']
0%| | 0/50000 [00:00<?, ?it/s]
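
A blocking get with no timeout waits forever if nothing is ever put on the queue, which would look exactly like the stall described above. As a hedged diagnostic sketch, the call can be wrapped with a timeout so the script fails with a message instead of hanging. This assumes pinned_training_queue behaves like a standard queue.Queue; the helper name get_or_report and the demo queue are hypothetical.

```python
import queue


def get_or_report(q, timeout_s=30.0):
    """Wait for one item, but raise instead of blocking forever."""
    try:
        return q.get(block=True, timeout=timeout_s)
    except queue.Empty:
        raise RuntimeError(
            f"no batch arrived within {timeout_s}s; "
            "the prefetch worker may have crashed or stalled"
        ) from None


# Demo with a local queue standing in for pinned_training_queue.
demo_queue = queue.Queue(maxsize=5)
demo_queue.put({"batch": 0})
first = get_or_report(demo_queue, timeout_s=1.0)
```

If the timeout fires, the next thing to check is whether the prefetching process or thread that feeds the queue is still alive and whether it printed an exception of its own.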

@LouisPouliot

I am currently facing a similar issue.
Did you manage to find a solution to this?
