A problem in running run.sh #29

Closed
zkailong opened this issue Mar 6, 2018 · 9 comments

zkailong commented Mar 6, 2018

Environment: Ubuntu 16.04; CUDA 9.0.176; cuDNN 7.0.5; TensorFlow 1.6.0 (GPU).
Referring to #10 and #3, I've installed Torch, luarocks, hdf5, etc., but there are still problems when running:

name@name-All-Series:~/AlphaPose$ ./run.sh --indir examples/demo/ --outdir examples/results/ --vis
0
generating bbox from Faster RCNN...
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
2018-03-06 17:32:48.947257: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-06 17:32:49.022229: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-06 17:32:49.022485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 750 major: 5 minor: 2 memoryClockRate(GHz): 1.188
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.64GiB
2018-03-06 17:32:49.022502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-06 17:32:49.244052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750, pci bus id: 0000:01:00.0, compute capability: 5.2)
Loaded network ../output/res152/coco_2014_train+coco_2014_valminusminival/default/res152.ckpt
/home/name/AlphaPose/examples/demo/
0%| | 0/3 [00:00<?, ?it/s]2018-03-06 17:32:55.388581: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:32:55.500633: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:32:56.432382: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 922.50MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
67%|██████████████████████████████ | 2/3 [00:07<00:03, 3.52s/it]2018-03-06 17:33:01.331154: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:01.411505: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:01.501123: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:02.435389: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:02.543607: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
100%|█████████████████████████████████████████████| 3/3 [00:10<00:00, 3.45s/it]
pose estimation with RMPE...
/home/name/torch/install/bin/lua: /home/name/torch/install/share/lua/5.2/trepl/init.lua:389: /home/name/torch/install/share/lua/5.2/hdf5/ffi.lua:56: expected align(#) on line 579
stack traceback:
[C]: in function 'error'
/home/name/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
/home/name/AlphaPose/predict/util.lua:7: in main chunk
[C]: in function 'dofile'
/home/name/torch/install/share/lua/5.2/paths/init.lua:84: in function 'dofile'
main-alpha-pose.lua:7: in main chunk
[C]: in function 'dofile'
...oyer/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "parametric-pose-nms-MPII.py", line 256, in
get_result_json(args)
File "parametric-pose-nms-MPII.py", line 243, in get_result_json
test_parametric_pose_NMS_json(delta1, delta2, mu, gamma,args.outputpath)
File "parametric-pose-nms-MPII.py", line 99, in test_parametric_pose_NMS_json
h5file = h5py.File(os.path.join(outputpath,"POSE/test-pose.h5"), 'r')
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 269, in init
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 99, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to open file: name = '/home/name/AlphaPose/examples/results/POSE/test-pose.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
visualization...
Traceback (most recent call last):
File "json-video.py", line 63, in
with open(jsonpath) as f:
IOError: [Errno 2] No such file or directory: '/home/name/AlphaPose/examples/results/POSE/alpha-pose-results-forvis.json'

So, how can I solve it?

@sberryman

So I spent the better part of the day yesterday trying to get AlphaPose to compile and run inference. I finally figured out a combination that works.

Dockerfile

https://gist.github.com/sberryman/82a6d13a44f9c4a3bfaf9263b36c92ed

Important versions:

  • cuDNN version 5
  • TensorFlow >= 1.2 AND < 1.3 (if you build TensorFlow from source, the cuDNN version isn't as important; installing it from pip, it becomes VERY important)
  • Input and output directories for ./run.sh must be relative to the CWD. Absolute paths do not work! (See the sketch below.)

Even if you don't use Docker, you can get a very good idea of the steps I had to take to get AlphaPose running. Also, a lot of the Ubuntu dependencies installed on line 8 can be removed; those are left over from another project and I haven't had time to clean them up.
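For example, here is a minimal sketch of the TensorFlow pin and the relative-path invocation outside Docker; the exact pip spec and the ~/AlphaPose checkout location are assumptions based on the list above, not something taken from the Dockerfile itself:

```bash
# Pin TensorFlow to the 1.2.x series when installing from pip
pip install 'tensorflow-gpu>=1.2,<1.3'

# Run from inside the AlphaPose checkout and pass --indir/--outdir relative to it;
# absolute paths reportedly do not work
cd ~/AlphaPose
./run.sh --indir examples/demo/ --outdir examples/results/ --vis
```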


sberryman commented Mar 6, 2018

Your error looks more like it has to do with running out of GPU memory, though. Your card (GPU) only has totalMemory: 1.95GiB, freeMemory: 1.64GiB.

I see the RCNN using ~4.8GB of memory and Torch using about 1.8GB with a batch size of 1. That is my experience running on a GTX 1080; I haven't tried my 1080 Tis yet.

Update: human detection (TensorFlow) is set to gpu_options.allow_growth=True, so I'm not sure what the actual minimum memory requirement is.
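For reference, a quick way to watch actual GPU memory use while run.sh is going is plain nvidia-smi (nothing AlphaPose-specific):

```bash
# Report used/total GPU memory once per second; Ctrl-C to stop
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```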


zkailong commented Mar 7, 2018

@sberryman Thanks for your reply. But it says:

The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

So I don't think my computer's GPU memory is too small to run AlphaPose.
And thanks for your Dockerfile. Maybe I should rebuild with it.

@sberryman

Good luck; I know it took me a LONG time to figure out the right combination of dependencies. Hopefully the Dockerfile will point you in the right direction.

@Fang-Haoshu

Thanks @sberryman for the Dockerfile!
@zkailong From the log it seems you are hitting this problem: google-deepmind/torch-hdf5#79, and a possible solution is to install Torch with Lua 5.1.
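A rough sketch of that reinstall, assuming Torch was installed from the standard torch/distro checkout in ~/torch and torch-hdf5 from a local clone (adjust the paths to your setup):

```bash
# Rebuild Torch against Lua 5.1 instead of Lua 5.2/LuaJIT
cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA51 ./install.sh

# Then reinstall the torch-hdf5 bindings against the new Lua
cd ~/torch-hdf5        # assumed location of the deepmind/torch-hdf5 clone
luarocks make hdf5-0-0.rockspec
```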


zkailong commented Mar 7, 2018

@Fang-Haoshu Thanks for your reply. I reinstalled Torch with Lua 5.1, but it did not work...

@Fang-Haoshu

Sooooo weird... In that deepmind issue, it seems many people suffer from this problem too.


zkailong commented Mar 7, 2018

@Fang-Haoshu So frustrating... I have sent you an e-mail. Maybe we can talk more about it.

@wangweihb

zhanghua@zhanghua-System-Product-Name:~/AlphaPose$ ./run.sh --indir examples/demo/ --outdir examples/results/ --vis
0
generating bbox from Faster RCNN...
2018-04-16 15:48:19.729543: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-04-16 15:48:20.037014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:65:00.0
totalMemory: 10.90GiB freeMemory: 10.44GiB
2018-04-16 15:48:20.037044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-16 15:48:20.229660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 15:48:20.229700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-04-16 15:48:20.229705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-04-16 15:48:20.229898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10102 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Loaded network ../output/res152/coco_2014_train+coco_2014_valminusminival/default/res152.ckpt
/home/zhanghua/AlphaPose/examples/demo/

100%|█████████████████████████████████████████████| 3/3 [00:03<00:00, 1.12s/it]
pose estimation with RMPE...
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.9.0:/usr/local/cuda-9.0/bin:/home/zhanghua/torch/install/bin:/home/zhanghua/bin:/home/zhanghua/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
/home/zhanghua/torch/install/bin/luajit: /home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: /home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: /home/zhanghua/torch/install/share/lua/5.1/cudnn/ffi.lua:1618: /usr/local/cuda/lib64/libcudnn.so.9.0:/usr/local/cuda-9.0/bin:/home/zhanghua/torch/install/bin:/home/zhanghua/bin:/home/zhanghua/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin: cannot open shared object file: No such file or directory
stack traceback:
[C]: in function 'error'
/home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
/home/zhanghua/AlphaPose/predict/util.lua:12: in main chunk
[C]: in function 'dofile'
main-alpha-pose.lua:7: in main chunk
[C]: in function 'dofile'
...ghua/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
Traceback (most recent call last):
File "parametric-pose-nms-MPII.py", line 256, in
get_result_json(args)
File "parametric-pose-nms-MPII.py", line 243, in get_result_json
test_parametric_pose_NMS_json(delta1, delta2, mu, gamma,args.outputpath)
File "parametric-pose-nms-MPII.py", line 99, in test_parametric_pose_NMS_json
h5file = h5py.File(os.path.join(outputpath,"POSE/test-pose.h5"), 'r')
File "/usr/lib/python2.7/dist-packages/h5py/_hl/files.py", line 272, in init
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/lib/python2.7/dist-packages/h5py/_hl/files.py", line 92, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2577)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2536)
File "h5py/h5f.pyx", line 76, in h5py.h5f.open (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/h5f.c:1811)
IOError: Unable to open file (Unable to open file: name = '/home/zhanghua/AlphaPose/examples/results/POSE/test-pose.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
visualization...
Traceback (most recent call last):
File "json-video.py", line 63, in
with open(jsonpath) as f:
IOError: [Errno 2] No such file or directory: '/home/zhanghua/AlphaPose/examples/results/POSE/alpha-pose-results-forvis.json'

This is my problem. Who can help me? Thanks.
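One thing the log above does show: CUDNN_PATH is set to a whole colon-separated PATH-style string, and Torch's cudnn bindings then try to load that entire string as a single shared library, which is what produces the "cannot open shared object file" error. A possible fix, assuming the cuDNN library really lives under /usr/local/cuda/lib64 (the exact file name depends on the cuDNN version actually installed):

```bash
# Point CUDNN_PATH at the single libcudnn shared object, not at a colon-separated path list
export CUDNN_PATH=/usr/local/cuda/lib64/libcudnn.so.5   # adjust to the installed cuDNN version
./run.sh --indir examples/demo/ --outdir examples/results/ --vis
```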
