A problem in running run.sh #29

Closed
zkailong opened this issue Mar 6, 2018 · 9 comments

zkailong commented Mar 6, 2018

Environment: Ubuntu 16.04; CUDA 9.0.176; cuDNN 7.0.5; TensorFlow 1.6.0 (GPU).
Referring to #10 and #3, I've installed Torch, luarocks, hdf5, etc., but there are still problems when running:

name@name-All-Series:~/AlphaPose$ ./run.sh --indir examples/demo/ --outdir examples/results/ --vis
0
generating bbox from Faster RCNN...
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
2018-03-06 17:32:48.947257: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-06 17:32:49.022229: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-06 17:32:49.022485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 750 major: 5 minor: 2 memoryClockRate(GHz): 1.188
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.64GiB
2018-03-06 17:32:49.022502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-06 17:32:49.244052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750, pci bus id: 0000:01:00.0, compute capability: 5.2)
Loaded network ../output/res152/coco_2014_train+coco_2014_valminusminival/default/res152.ckpt
/home/name/AlphaPose/examples/demo/
0%| | 0/3 [00:00<?, ?it/s]2018-03-06 17:32:55.388581: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:32:55.500633: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:32:56.432382: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 922.50MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
67%|██████████████████████████████ | 2/3 [00:07<00:03, 3.52s/it]2018-03-06 17:33:01.331154: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:01.411505: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:01.501123: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:02.435389: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:02.543607: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
100%|█████████████████████████████████████████████| 3/3 [00:10<00:00, 3.45s/it]
pose estimation with RMPE...
/home/name/torch/install/bin/lua: /home/name/torch/install/share/lua/5.2/trepl/init.lua:389: /home/name/torch/install/share/lua/5.2/hdf5/ffi.lua:56: expected align(#) on line 579
stack traceback:
[C]: in function 'error'
/home/name/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
/home/name/AlphaPose/predict/util.lua:7: in main chunk
[C]: in function 'dofile'
/home/name/torch/install/share/lua/5.2/paths/init.lua:84: in function 'dofile'
main-alpha-pose.lua:7: in main chunk
[C]: in function 'dofile'
...oyer/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "parametric-pose-nms-MPII.py", line 256, in
get_result_json(args)
File "parametric-pose-nms-MPII.py", line 243, in get_result_json
test_parametric_pose_NMS_json(delta1, delta2, mu, gamma,args.outputpath)
File "parametric-pose-nms-MPII.py", line 99, in test_parametric_pose_NMS_json
h5file = h5py.File(os.path.join(outputpath,"POSE/test-pose.h5"), 'r')
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 269, in init
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 99, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to open file: name = '/home/name/AlphaPose/examples/results/POSE/test-pose.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
visualization...
Traceback (most recent call last):
File "json-video.py", line 63, in
with open(jsonpath) as f:
IOError: [Errno 2] No such file or directory: '/home/name/AlphaPose/examples/results/POSE/alpha-pose-results-forvis.json'

So, how can I solve it?

@sberryman

So I spent the better part of the day yesterday trying to get AlphaPose to compile and run inference. I finally figured out a combination that works.

Dockerfile

https://gist.github.com/sberryman/82a6d13a44f9c4a3bfaf9263b36c92ed

Important versions:

  • cuDNN version 5
  • TensorFlow >= 1.2 AND < 1.3 (if you build TensorFlow from source, the cuDNN version isn't as important; installing it from pip, it becomes VERY important)
  • Input and output directories for ./run.sh must be relative to the CWD. Absolute paths do not work! (See the sketch below.)

Even if you don't use Docker, you can get a very good idea of the steps I had to take to get AlphaPose running. Also, a lot of the Ubuntu dependencies installed on line 8 can be removed; those are left over from another project and I haven't had time to clean them up.
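For example, here is a minimal sketch of the TensorFlow pin and the relative-path invocation outside Docker; the exact pip spec and the ~/AlphaPose checkout location are assumptions based on the list above, not something taken from the Dockerfile itself:

```bash
# Pin TensorFlow to the 1.2.x series when installing from pip
pip install 'tensorflow-gpu>=1.2,<1.3'

# Run from inside the AlphaPose checkout and pass --indir/--outdir relative to it;
# absolute paths reportedly do not work
cd ~/AlphaPose
./run.sh --indir examples/demo/ --outdir examples/results/ --vis
```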


sberryman commented Mar 6, 2018

Your error looks more like it has to do with running out of GPU memory, though. Your card (GPU) only has totalMemory: 1.95GiB, freeMemory: 1.64GiB.

I see the RCNN using ~4.8GB of memory and Torch using about 1.8GB with a batch size of 1. That is my experience running on a GTX 1080; I haven't tried my 1080 Tis yet.

Update: human detection (TensorFlow) is set to gpu_options.allow_growth=True, so I'm not sure what the actual minimum memory requirement is.
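For reference, a quick way to watch actual GPU memory use while run.sh is going is plain nvidia-smi (nothing AlphaPose-specific):

```bash
# Report used/total GPU memory once per second; Ctrl-C to stop
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```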


zkailong commented Mar 7, 2018

@sberryman Thanks for your reply. But it says:

The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

So I don't think my computer's GPU memory is too small to run AlphaPose.
And thanks for your Dockerfile. Maybe I should rebuild with it.

@sberryman

Good luck; I know it took me a LONG time to figure out the right combination of dependencies. Hopefully the Dockerfile will point you in the right direction.

@Fang-Haoshu

Thanks @sberryman for the Dockerfile!
@zkailong From the log it seems you are hitting this problem: google-deepmind/torch-hdf5#79, and a possible solution is to install Torch with Lua 5.1.
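A rough sketch of that reinstall, assuming Torch was installed from the standard torch/distro checkout in ~/torch and torch-hdf5 from a local clone (adjust the paths to your setup):

```bash
# Rebuild Torch against Lua 5.1 instead of Lua 5.2/LuaJIT
cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA51 ./install.sh

# Then reinstall the torch-hdf5 bindings against the new Lua
cd ~/torch-hdf5        # assumed location of the deepmind/torch-hdf5 clone
luarocks make hdf5-0-0.rockspec
```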


zkailong commented Mar 7, 2018

@Fang-Haoshu Thanks for your reply. I reinstalled Torch with Lua 5.1, but it did not work...

@Fang-Haoshu

Sooooo weird... In that deepmind issue, it seems many people suffer from this problem too.


zkailong commented Mar 7, 2018

@Fang-Haoshu So frustrating... I have sent you an e-mail. Maybe we can talk more about it.

@wangweihb

zhanghua@zhanghua-System-Product-Name:~/AlphaPose$ ./run.sh --indir examples/demo/ --outdir examples/results/ --vis
0
generating bbox from Faster RCNN...
2018-04-16 15:48:19.729543: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-04-16 15:48:20.037014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:65:00.0
totalMemory: 10.90GiB freeMemory: 10.44GiB
2018-04-16 15:48:20.037044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-16 15:48:20.229660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 15:48:20.229700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-04-16 15:48:20.229705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-04-16 15:48:20.229898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10102 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Loaded network ../output/res152/coco_2014_train+coco_2014_valminusminival/default/res152.ckpt
/home/zhanghua/AlphaPose/examples/demo/

100%|█████████████████████████████████████████████| 3/3 [00:03<00:00, 1.12s/it]
pose estimation with RMPE...
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.9.0:/usr/local/cuda-9.0/bin:/home/zhanghua/torch/install/bin:/home/zhanghua/bin:/home/zhanghua/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
/home/zhanghua/torch/install/bin/luajit: /home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: /home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: /home/zhanghua/torch/install/share/lua/5.1/cudnn/ffi.lua:1618: /usr/local/cuda/lib64/libcudnn.so.9.0:/usr/local/cuda-9.0/bin:/home/zhanghua/torch/install/bin:/home/zhanghua/bin:/home/zhanghua/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin: cannot open shared object file: No such file or directory
stack traceback:
[C]: in function 'error'
/home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
/home/zhanghua/AlphaPose/predict/util.lua:12: in main chunk
[C]: in function 'dofile'
main-alpha-pose.lua:7: in main chunk
[C]: in function 'dofile'
...ghua/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
Traceback (most recent call last):
File "parametric-pose-nms-MPII.py", line 256, in
get_result_json(args)
File "parametric-pose-nms-MPII.py", line 243, in get_result_json
test_parametric_pose_NMS_json(delta1, delta2, mu, gamma,args.outputpath)
File "parametric-pose-nms-MPII.py", line 99, in test_parametric_pose_NMS_json
h5file = h5py.File(os.path.join(outputpath,"POSE/test-pose.h5"), 'r')
File "/usr/lib/python2.7/dist-packages/h5py/_hl/files.py", line 272, in init
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/usr/lib/python2.7/dist-packages/h5py/_hl/files.py", line 92, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2577)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2536)
File "h5py/h5f.pyx", line 76, in h5py.h5f.open (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/h5f.c:1811)
IOError: Unable to open file (Unable to open file: name = '/home/zhanghua/AlphaPose/examples/results/POSE/test-pose.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
visualization...
Traceback (most recent call last):
File "json-video.py", line 63, in
with open(jsonpath) as f:
IOError: [Errno 2] No such file or directory: '/home/zhanghua/AlphaPose/examples/results/POSE/alpha-pose-results-forvis.json'

This is my problem. Who can help me? Thanks.
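One thing the log above does show: CUDNN_PATH is set to a whole colon-separated PATH-style string, and Torch's cudnn bindings then try to load that entire string as a single shared library, which is what produces the "cannot open shared object file" error. A possible fix, assuming the cuDNN library really lives under /usr/local/cuda/lib64 (the exact file name depends on the cuDNN version actually installed):

```bash
# Point CUDNN_PATH at the single libcudnn shared object, not at a colon-separated path list
export CUDNN_PATH=/usr/local/cuda/lib64/libcudnn.so.5   # adjust to the installed cuDNN version
./run.sh --indir examples/demo/ --outdir examples/results/ --vis
```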
