Gluon SoftmaxCELoss does not converge, for large number of classes. #11019

nttstar · 2018-05-22T05:30:06Z

Description

Training with 85K classes fails when I use the Gluon trainer with SoftmaxCELoss. It works if I define the same network in Gluon but train it through the symbolic Module interface (sym.SoftmaxOutput).

Error Message:

Training accuracy rises from 0.0 to 0.001, then drops back to 0.0 after about 1K iterations.

Steps to reproduce

Check out the latest insightface repo (https://github.com/deepinsight/insightface).
Download the ms1m dataset from the repo and unzip it to ./faces_ms1m.
Run the training script insightface/gluon/train.py; the training accuracy changes at about 1.5K iterations. Validation runs every 2K iterations, as set by the --verbose parameter.

The following command works fine:

CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train.py --data-dir ./faces_ms1m --network r18 --prefix ./model-r18-test --per-batch-size 128 --lr-steps '10000,20000,3000' --lr 0.1 --ckpt 0 --verbose 2000 --wd 0.0005 --margin-a 0.0 --eval lfw --mode symbol

The following command does not converge:

CUDA_VISIBLE_DEVICES='0,1,2,3' python -u train.py --data-dir ./faces_ms1m --network r18 --prefix ./model-r18-test --per-batch-size 128 --lr-steps '10000,20000,3000' --lr 0.1 --ckpt 0 --verbose 2000 --wd 0.0005 --margin-a 0.0 --eval lfw --mode gluon

Environment info (Required)

----------Python Info----------
('Version :', '2.7.5')
('Compiler :', 'GCC 4.8.5 20150623 (Red Hat 4.8.5-16)')
('Build :', ('default', 'Aug 4 2017 00:39:18'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '9.0.2')
('Directory :', '/usr/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version :', '1.2.0')
('Directory :', '/usr/lib/python2.7/site-packages/mxnet')
('Commit Hash :', 'f0be910ae5e3fa01e0a9aaf98dbd4616c35be76b')
----------System Info----------
('Platform :', 'Linux-3.10.0-327.el7.x86_64-x86_64-with-centos-7.4.1708-Core')
('system :', 'Linux')
('node :', 'cdsl-gpu-a04')
('release :', '3.10.0-327.el7.x86_64')
('version :', '#1 SMP Thu Nov 19 22:10:57 UTC 2015')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2199.914
CPU max MHz: 2900.0000
CPU min MHz: 1200.0000
BogoMIPS: 4400.12
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0067 sec, LOAD: 2.4557 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0062 sec, LOAD: 1.8693 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.2143 sec, LOAD: 2.3253 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0067 sec, LOAD: 1.1611 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.5429 sec, LOAD: 2.8609 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0070 sec, LOAD: 1.6273 sec.

nttstar (Author) · 2018-05-22T06:42:27Z

I think sym.SoftmaxOutput handles the numerical stability problem while Gluon SoftmaxCELoss does not. (Reference, in Chinese.)
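
For readers unfamiliar with the term, here is a minimal sketch of what "numerical stability" usually means for softmax cross-entropy. It is a generic pure-Python illustration of the log-sum-exp trick, not the implementation of either MXNet operator; the function names and the sample logits are made up for the example.

```python
import math

def naive_softmax_ce(logits, label):
    """Cross-entropy computed directly from exp(); overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[label] / sum(exps))

def stable_softmax_ce(logits, label):
    """Cross-entropy via the log-sum-exp trick: subtract the max logit first."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

small = [2.0, 1.0, 0.1]
large = [1000.0, 999.0, 998.1]  # the same logits shifted by a constant

# Softmax is shift-invariant, so both losses should agree.
loss_small = stable_softmax_ce(small, 0)
loss_large = stable_softmax_ce(large, 0)

try:
    naive_large = naive_softmax_ce(large, 0)
except OverflowError:
    naive_large = None  # math.exp(1000.0) does not fit in a float64
```

The stable version returns the same loss for both inputs, while the naive version cannot evaluate the shifted logits at all.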

EDIT:
Strangely, it also works if I apply Gluon SoftmaxCELoss in the symbolic module:

Change

sym = mx.symbol.SoftmaxOutput(data=fc7, label = label, name='softmax', normalization='valid')

to

ceop = gluon.loss.SoftmaxCrossEntropyLoss()
loss = ceop(fc7, label)
loss = loss/args.per_batch_size
sym = mx.sym.Group( [mx.symbol.BlockGrad(fc7), mx.symbol.MakeLoss(loss, name='softmax')] )

nttstar (Author) · 2018-05-22T07:33:42Z

Note that in the symbolic interface I normalize the loss (gradient) in each softmax container and rescale the total gradient in the optimizer by 1.0/gpu_num.

In the Gluon interface I do not apply any gradient normalization other than calling trainer.step(batch_size).
I think the two are equivalent.
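
The arithmetic behind that claim can be checked with a small sketch. It assumes every label is valid, that all devices hold the same number of samples, and that the batch_size passed to trainer.step is the total across devices; the gradient values are hypothetical.

```python
# Hypothetical per-sample gradients of one scalar weight, split across 2 devices.
per_device_grads = [
    [0.5, -1.0, 2.0, 0.25],   # device 0
    [1.5, 0.75, -0.5, 1.0],   # device 1
]
num_devices = len(per_device_grads)
per_batch_size = len(per_device_grads[0])
batch_size = num_devices * per_batch_size

# Symbolic-style: normalize inside each device's loss, sum across devices,
# then rescale in the optimizer by 1.0 / num_devices.
symbolic_grad = sum(sum(g) / per_batch_size for g in per_device_grads) * (1.0 / num_devices)

# Gluon-style: sum the unnormalized gradients, then divide by the total batch size.
gluon_grad = sum(sum(g) for g in per_device_grads) / batch_size
```

Under those assumptions both schemes yield the same effective gradient, so a scaling mismatch alone would not explain the difference in convergence.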

hetong007 (Collaborator) · 2018-05-31T19:03:53Z

When trying to reproduce, I got the following error message:

INFO:root:Epoch[0] Batch [1940]	Speed: 1124.429415 samples/sec	acc=0.000416
INFO:root:Epoch[0] Batch [1960]	Speed: 1130.981356 samples/sec	acc=0.000412
INFO:root:Epoch[0] Batch [1980]	Speed: 1127.127662 samples/sec	acc=0.000408
lr-batch-epoch: 0.1 2000
testing verification..
Traceback (most recent call last):
  File "train.py", line 746, in <module>
    main()
  File "train.py", line 743, in main
    train_net(args)
  File "train.py", line 660, in train_net
    _batch_callback()
  File "train.py", line 594, in _batch_callback
    acc_list = ver_test(mbatch)
  File "train.py", line 453, in ver_test
    acc1, std1, acc2, std2, xnorm, embeddings_list = verification.test(ver_list[i], net, ctx, batch_size = args.batch_size)
  File "/home/ubuntu/insightface/gluon/verification.py", line 269, in test
    embeddings = sklearn.preprocessing.normalize(embeddings)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 1412, in normalize
    estimator='the normalize function', dtype=FLOAT_DTYPES)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Is this expected?

zhanghang1989 · 2018-05-31T19:43:51Z

@nttstar mentioned that training converges with the option --version-output A. The default network option is E, which fails to converge with Gluon.

hetong007 (Collaborator) · 2018-05-31T20:19:47Z

I'm trying to observe the non-convergence, but the error message I hit is not mentioned in the report. I'd like to know whether these are two separate issues or directly related.

nttstar (Author) · 2018-06-01T02:17:17Z

@hetong007 They're the same issue. The error message means the network has already failed to converge at that point, so the output embedding layer contains illegal values. Training accuracy is also decreasing.
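
One way to surface this earlier than the sklearn traceback is to check the embeddings for non-finite entries before normalizing them. The helper below is a hypothetical pure-Python sketch (Python 3), not part of the insightface scripts, and the sample data is made up.

```python
import math

def first_non_finite(embeddings):
    """Return (row, col) of the first NaN/inf entry, or None if all are finite."""
    for r, row in enumerate(embeddings):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                return (r, c)
    return None

good = [[0.1, -0.2], [0.3, 0.4]]
bad = [[0.1, float("nan")], [float("inf"), 0.4]]

good_result = first_non_finite(good)
bad_result = first_non_finite(bad)
```

A non-None result at verification time would indicate that training had already diverged before the verification step ran.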

hetong007 (Collaborator) · 2018-06-09T02:02:42Z

I have spent some time on this issue; it is not related to the number of classes. Even if I fake the labels into 100 classes, it still doesn't converge.

The model is defined at https://github.com/deepinsight/insightface/blob/master/gluon/blocks/UDD.py#L27.

From the definition, it seems that self.body.add(nn.BatchNorm(scale=False, epsilon=2e-5, prefix='fc1')) is causing the problem. If I set scale=True and train, the model converges.
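
To make the difference between the two settings concrete, here is a minimal pure-Python sketch of batch normalization over one feature. It assumes that scale=False keeps gamma fixed at 1 while scale=True lets gamma be learned; the gamma value 0.1 and the input data are made up, and the sketch does not explain why the fixed-gamma variant fails to converge in this thread.

```python
import math

def batch_norm(column, gamma=1.0, beta=0.0, eps=2e-5):
    """Normalize one feature over the batch, then apply gamma and beta."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in column]

def std(values):
    """Population standard deviation of a list of floats."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

feature = [10.0, 20.0, 30.0, 40.0]

# scale=False analogue: gamma stays at 1, so the output always has unit scale.
fixed = batch_norm(feature)

# scale=True analogue: gamma is a parameter; 0.1 is an arbitrary example value.
learned = batch_norm(feature, gamma=0.1)

fixed_std = std(fixed)
learned_std = std(learned)
```

With gamma fixed, the layer cannot change the magnitude of the embedding it feeds into the final fully connected layer; with a learnable gamma it can.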

Also, @nttstar reported that if this layer is attached to the end of the --version-output A model (after https://github.com/deepinsight/insightface/blob/master/gluon/blocks/UDD.py#L42) and trained, the model also doesn't converge.

@piiswrong @eric-haibin-lin Do you recall any other issues related to this parameter?
