This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Replies: 3 comments
-
@FCInter Could you try with the latest version of MXNet?
-
Any progress? I have the same problem on the master branch.
-
The problem is in PinnedMemoryStorage: https://github.com/apache/incubator-mxnet/blob/master/src/storage/pinned_memory_storage.h#L59 I use the CUDA_VISIBLE_DEVICES environment variable to work around this bug.
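The comment above does not show how the variable was set. A minimal sketch of one way to do it from Python, assuming the goal is to hide a nearly full GPU from the process; the helper name `restrict_visible_gpus` and the choice of GPU 1 are hypothetical and not taken from the thread:

```python
import os


def restrict_visible_gpus(device_ids, env=None):
    """Set CUDA_VISIBLE_DEVICES so only the listed physical GPUs are visible.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    Returns the string that was written.
    """
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env["CUDA_VISIBLE_DEVICES"]


# Assumption: physical GPU 1 has free memory and GPU 0 does not.
# This must run before the CUDA runtime is initialized, so call it
# before `import mxnet`.
restrict_visible_gpus([1])
```

After this call the visible devices are renumbered from 0, so the remaining GPU is addressed as `mx.gpu(0)` once `mxnet` is imported. The same effect can be had from the shell by exporting the variable before starting Python.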
-
Description
(Brief description of the problem in no more than 2 sentences.)
mx.io.ImageRecordIter reports CUDA out of memory when loading a 50MB .rec file.
Environment info (Required)
----------Python Info----------
('Version :', '2.7.14')
('Compiler :', 'GCC 7.2.0')
('Build :', ('default', 'Oct 16 2017 17:29:19'))
('Arch :', ('64bit', ''))
------------Pip Info-----------
('Version :', '9.0.1')
('Directory :', '/home/users/mypath/anaconda2/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
/home/users/mypath/anaconda2/lib/python2.7/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
import OpenSSL.SSL
('Version :', '1.0.0')
('Directory :', '/home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet')
('Commit Hash :', '25720d0e3c29232a37e2650f3ba3a2454f9367bb')
----------System Info----------
('Platform :', 'Linux-3.10.0-693.2.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core')
('system :', 'Linux')
('node :', 'somename')
('release :', '3.10.0-693.2.2.el7.x86_64')
('version :', '#1 SMP Tue Sep 12 22:26:13 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 1499.121
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4199.54
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0014 sec, LOAD: 1.4199 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0005 sec, LOAD: 2.7118 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0005 sec, LOAD: 1.3227 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0005 sec, LOAD: 0.7410 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0007 sec, LOAD: 0.5419 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0926 sec, LOAD: 0.8068 sec.
Package used (Python/R/Scala/Julia):
Python 2.7.14 |Anaconda, Inc.| (default, Oct 16 2017, 17:29:19)
Compiler (gcc/clang/mingw/visual studio):
gcc 7.2.0
Build config:
(Paste the content of config.mk, or the build command.)
Error Message:
The error message:
[16:00:08] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: /home/users/mypath/work/image-classification/data/test_raw_3c.rec, use 4 threads for decoding..
[16:00:08] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [16:00:08] src/storage/./pinned_memory_storage.h:59: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: out of memory
Stack trace returned 10 entries:
[bt] (0) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x28965c) [0x7f0cc9dab65c]
[bt] (1) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x29cc4e9) [0x7f0ccc4ee4e9]
[bt] (2) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x29cc52d) [0x7f0ccc4ee52d]
[bt] (3) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x29ca628) [0x7f0ccc4ec628]
[bt] (4) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x23ea7a1) [0x7f0ccbf0c7a1]
[bt] (5) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x253ee36) [0x7f0ccc060e36]
[bt] (6) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x250ae44) [0x7f0ccc02ce44]
[bt] (7) /home/users/mypath/anaconda2/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f0cb4196c5c]
[bt] (8) /lib64/libpthread.so.0(+0x7e25) [0x7f0cec691e25]
[bt] (9) /lib64/libc.so.6(clone+0x6d) [0x7f0cebcb634d]
terminate called after throwing an instance of 'dmlc::Error'
what(): [16:00:08] src/storage/./pinned_memory_storage.h:59: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: out of memory
Stack trace returned 10 entries:
[bt] (0) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x28965c) [0x7f0cc9dab65c]
[bt] (1) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x29cc4e9) [0x7f0ccc4ee4e9]
[bt] (2) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x29cc52d) [0x7f0ccc4ee52d]
[bt] (3) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x29ca628) [0x7f0ccc4ec628]
[bt] (4) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x23ea7a1) [0x7f0ccbf0c7a1]
[bt] (5) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x253ee36) [0x7f0ccc060e36]
[bt] (6) /home/users/mypath/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x250ae44) [0x7f0ccc02ce44]
[bt] (7) /home/users/mypath/anaconda2/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f0cb4196c5c]
[bt] (8) /lib64/libpthread.so.0(+0x7e25) [0x7f0cec691e25]
[bt] (9) /lib64/libc.so.6(clone+0x6d) [0x7f0cebcb634d]
Aborted (core dumped)
The GPU information:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:02:00.0 Off | N/A |
| 58% 85C P2 238W / 250W | 12097MiB / 12189MiB | 76% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A |
| 37% 62C P2 59W / 250W | 4220MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 00000000:82:00.0 Off | N/A |
| 35% 59C P2 59W / 250W | 4220MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 00000000:83:00.0 Off | N/A |
| 33% 57C P2 57W / 250W | 4220MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
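In the table above, GPU 0 is almost full (12097MiB of 12189MiB) while GPUs 1 to 3 each have about 8GB free. A small sketch of how one might pick a less loaded device programmatically, assuming the input is the CSV text produced by `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`; the function name `pick_least_used_gpu` is hypothetical, and the sample values are copied from the table:

```python
def pick_least_used_gpu(csv_text):
    """Return the index of the GPU with the most free memory.

    Each line of `csv_text` is expected to be "index, used_mib, total_mib".
    On ties, the first GPU listed wins.
    """
    best_index, best_free = None, -1
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        free = total - used
        if free > best_free:
            best_index, best_free = index, free
    return best_index


# Memory usage as reported in the nvidia-smi table above.
sample = """0, 12097, 12189
1, 4220, 12189
2, 4220, 12189
3, 4220, 12189"""

print(pick_least_used_gpu(sample))  # prints 1
```

The returned index could then be written to CUDA_VISIBLE_DEVICES before MXNet is imported.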
Minimum reproducible example
(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
import mxnet as mx

data_train = 'some_path/somefile.rec'
train_iter = mx.io.ImageRecordIter(
    path_imgrec=data_train,
    data_shape=(3, 128, 128),
    batch_size=1)
Steps to reproduce
Just execute the above-mentioned code.
What have you tried to solve it?