
about the hanging training process #20

Open
tairen99 opened this issue Jun 28, 2022 · 3 comments

tairen99 commented Jun 28, 2022

Hi Junyu,

Thank you for your good work on the chart extraction.

I followed the Readme and properly installed DeepRule, and I want to test the training code. Everything looks good at the beginning:

['cache', 'pie']
loading all datasets... 
using 1 threads
loading from cache file: /home/DeepRule/data/piedata_1008/cache/pie_train2019.pkl
loading annotations into memory...
/home/DeepRule/data/piedata_1008/pie/annotations/instancesPie(1008)_train2019.json
Done (t=1.61s)
creating index...
index created!
loading from cache file: /home/DeepRule/data/piedata_1008/cache/pie_val2019.pkl
loading annotations into memory...
/home/DeepRule/data/piedata_1008/pie/annotations/instancesPie(1008)_val2019.json
Done (t=0.03s)
creating index...
index created!
system config...
{'batch_size': 26,
 'cache_dir': '/home/DeepRule/data/piedata_1008/cache',
 'chunk_sizes': [5, 7, 7, 7],
 'config_dir': './config',
 'data_dir': '/home/DeepRule/data/piedata_1008/',
 'data_rng': <mtrand.RandomState object at 0x7f1b20d20d38>,
 'dataset': 'Pie',
 'decay_rate': 10,
 'display': 5,
 'learning_rate': 0.00025,
 'max_iter': 50000,
 'nnet_rng': <mtrand.RandomState object at 0x7f1b20d20d80>,
 'opt_algo': 'adam',
 'prefetch_size': 5,
 'pretrain': None,
 'result_dir': './results',
 'sampling_function': 'kp_detection',
 'snapshot': 5000,
 'snapshot_name': 'CornerNetPurePie',
 'stepsize': 45000,
 'tar_data_dir': 'cls',
 'test_split': 'testchart',
 'train_split': 'trainchart',
 'val_iter': 100,
 'val_split': 'valchart',
 'weight_decay': False,
 'weight_decay_rate': 1e-05,
 'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
 'border': 128,
 'categories': 1,
 'data_aug': True,
 'gaussian_bump': True,
 'gaussian_iou': 0.3,
 'gaussian_radius': -1,
 'input_size': [511, 511],
 'lighting': True,
 'max_per_image': 100,
 'merge_bbox': False,
 'nms_algorithm': 'exp_soft_nms',
 'nms_kernel': 3,
 'nms_threshold': 0.5,
 'output_sizes': [[128, 128]],
 'rand_color': True,
 'rand_crop': True,
 'rand_pushes': False,
 'rand_samples': False,
 'rand_scale_max': 1.4,
 'rand_scale_min': 0.6,
 'rand_scale_step': 0.1,
 'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
 'special_crop': False,
 'test_scales': [1],
 'top_k': 100,
 'weight_exp': 8}
len of db: 73075
building model...
module_file: models.CornerNetPurePie
use kp pure pie
total parameters: 198592652
setting learning rate to: 0.00025
training start...
start prefetching data...
shuffling indices...
['read.txt']
0%|                                                                             | 0/50000 [00:00<?, ?it/s]

But for some reason, the training hangs without making any progress. I checked the CPU usage and found a zombie process, as shown below:
[screenshot: CPU usage showing a zombie process]

The GPU usage is shown below:
[screenshot: GPU usage]

Since I do not have an Azure account, I commented out line 32 in "/DeepRule/models/CornerNetPurePie.py":
# from azureml.core.compute import ComputeTarget

I do not know the main reason for this. Please help us.
Thank you in advance!

soap117 (Owner) commented Aug 17, 2022

The OCR is replaceable. You can replace it with a local OCR package such as pytesseract: https://pypi.org/project/pytesseract/
However, you need to rewrite the ocr_result function.
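As a rough illustration (not the repository's actual code), a local replacement built on pytesseract could look like the sketch below. It assumes ocr_result receives a BGR image array and returns the recognized text; the real signature and return format expected by DeepRule may differ.

# Minimal pytesseract-based ocr_result sketch; input/output format is assumed.
import cv2
import pytesseract

def ocr_result(image):
    # pytesseract expects RGB input, while OpenCV arrays are BGR.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Use pytesseract.image_to_data instead if DeepRule needs
    # word-level bounding boxes rather than plain text.
    return pytesseract.image_to_string(rgb)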

sdh5349 commented Sep 29, 2022

Can you explain in more detail?

kpostnov commented Dec 5, 2023

Hey @tairen99, have you found the reason for this issue? We encountered the same problem.

Edit: By stepping through the execution, we were able to pinpoint the code responsible for the process getting stuck.
The issue appears to be image = cv2.resize(image, (new_width, new_height)) on line 30 in sample/bar.py (and correspondingly in the other files in the same directory). We ended up following the suggestions from this thread and inserted multiprocessing.set_start_method('spawn', force=True) at the beginning of train_chart.py. Afterwards, everything worked as expected.
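For anyone hitting the same hang, here is a sketch of that workaround (the entry-point guard is illustrative; the actual layout of train_chart.py may differ):

# Force the 'spawn' start method before any data-prefetching workers are
# created, so forked worker processes do not deadlock inside cv2.resize.
import multiprocessing

if __name__ == "__main__":
    multiprocessing.set_start_method('spawn', force=True)
    # ... rest of the training entry point runs unchanged ...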
