World-Grounded Human Motion Recovery via Gravity-View Coordinates
Zehong Shen*, Huaijin Pi*, Yan Xia, Zhi Cen, Sida Peng†, Zechen Hu, Hujun Bao, Ruizhen Hu, Xiaowei Zhou
SIGGRAPH Asia 2024
Please see the installation instructions for details.

Demo entries are provided in `tools/demo`. Use `-s` to skip visual odometry if you know the camera is static; otherwise the camera pose will be estimated by DPVO. We also provide a script `demo_folder.py` to run inference on an entire folder.
```shell
python tools/demo/demo.py --video=docs/example_video/tennis.mp4 -s
python tools/demo/demo_folder.py -f inputs/demo/folder_in -d outputs/demo/folder_out -s
python -m tools.demo.demo_multiperson --video=docs/example_video/two_persons.mp4 --output_root outputs/demo_mp --recreate_video
```
- Make the rendered videos the same fps as the input video.
- Check `pp_static_joint_cam` in `./hmr4d/model/gvhmr/utils/postprocess.py`, which might be used for the `-s` option in the demo script.
- Test: To reproduce the 3DPW, RICH, and EMDB results in a single run, use the following command:

  ```shell
  python tools/train.py global/task=gvhmr/test_3dpw_emdb_rich exp=gvhmr/mixed/mixed ckpt_path=inputs/checkpoints/gvhmr/gvhmr_siga24_release.ckpt
  ```

  To test individual datasets, change `global/task` to `gvhmr/test_3dpw`, `gvhmr/test_rich`, or `gvhmr/test_emdb`.

- Train: To train the model, use the following command:

  ```shell
  # The gvhmr_siga24_release.ckpt is trained with 2x4090 for 420 epochs.
  # Note that different GPU settings may lead to different results.
  python tools/train.py exp=gvhmr/mixed/mixed
  ```
Note that evaluation during training does not employ the post-processing used in the test script, so the global metrics will differ (but should still be suitable for comparison with baseline methods).
Different from the original repo

This version of the repository includes modifications to support multi-person HMR:
- Multi-person tracking:
  - Updated the `Tracker` class to return bounding boxes for multiple people using `get_all_tracks` instead of `get_one_track`.
  - Modified preprocessing to handle multiple person detections and features.
- Multi-person pose estimation:
  - Adapted the `VitPoseExtractor` to process multiple people simultaneously.
  - Updated the feature extraction process to handle batches of multiple people.
- Multi-person SMPL reconstruction:
  - Modified the `DemoPL` class to predict SMPL parameters for multiple people.
  - Updated the rendering process to handle multiple SMPL models in both in-camera and global coordinate systems.
- Rendering improvements:
  - Implemented merged-face creation for rendering multiple SMPL models simultaneously.
  - Added support for retargeting global translations to better align with in-camera positions.
- New demo script:
  - Added `demo_multiperson.py` to showcase the multi-person reconstruction pipeline.
  - Includes options for batch processing and verbose output for debugging.
- Performance optimizations:
  - Introduced batch processing for ViTPose and feature extraction to improve efficiency.
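The tracking change above boils down to adding a leading person dimension P to the per-frame data, e.g. stacking variable-length per-person tracks into a single (P, L, 4) bounding-box tensor. A minimal sketch, with a hypothetical `stack_tracks` helper and toy track data (not the repo's actual `Tracker` API):

```python
import torch

def stack_tracks(tracks: list, num_frames: int) -> torch.Tensor:
    """Stack per-person tracks into a (P, L, 4) xyxy bounding-box tensor.

    Each track is a dict {frame_index: [x1, y1, x2, y2]}; frames where a
    person is not detected remain zero-filled.
    """
    bbx_xyxy = torch.zeros(len(tracks), num_frames, 4)
    for p, track in enumerate(tracks):
        for f, box in track.items():
            bbx_xyxy[p, f] = torch.tensor(box, dtype=torch.float32)
    return bbx_xyxy

tracks = [
    {0: [10, 10, 50, 90], 1: [12, 11, 52, 91]},  # person 0, seen in frames 0 and 1
    {1: [60, 20, 100, 95]},                      # person 1, seen only in frame 1
]
bbx = stack_tracks(tracks, num_frames=3)  # shape (2, 3, 4)
```

Downstream stages (ViTPose, feature extraction, SMPL prediction) can then process the person dimension as a batch, which is what enables the batching optimizations listed above.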
Results format

The demo saves the following preprocessing files:
- `/preprocess/bbx.pt`: Bounding box information for multiple people
  - `bbx_xyxy`: Tensor of shape (P, L, 4), where P is the number of people and L is the number of frames
  - `bbx_xys`: Tensor of shape (P, L, 3), containing center coordinates and scale for each bounding box
- `/preprocess/slam_results.pt`: Camera pose estimation results (if not using a static camera)
  - NumPy array of shape (L, 7), where each row contains [x, y, z, qx, qy, qz, qw]
- `/preprocess/vitpose.pt`: 2D pose estimation results
  - Tensor of shape (P, L, 17, 3), where 17 is the number of keypoints and 3 represents [x, y, confidence]
- `/preprocess/vit_features.pt`: Image features extracted from the video frames
  - Tensor of shape (P, L, 1024), where 1024 is the feature dimension
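The (L, 7) SLAM rows are translation plus a scalar-last quaternion, so converting them to (L, 4, 4) camera-pose matrices is straightforward. A small sketch, assuming the `[x, y, z, qx, qy, qz, qw]` layout documented above (the `poses_to_matrices` helper is illustrative, not part of the repo):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def poses_to_matrices(slam: np.ndarray) -> np.ndarray:
    """Convert (L, 7) rows of [x, y, z, qx, qy, qz, qw] into (L, 4, 4)
    homogeneous transformation matrices."""
    num_frames = slam.shape[0]
    T = np.tile(np.eye(4), (num_frames, 1, 1))
    # SciPy's from_quat expects scalar-last [qx, qy, qz, qw], matching this layout.
    T[:, :3, :3] = Rotation.from_quat(slam[:, 3:]).as_matrix()
    T[:, :3, 3] = slam[:, :3]
    return T
```

For example, a row with the identity quaternion `[0, 0, 0, 1]` yields a pure translation matrix.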
The main reconstruction results are stored in `hmr4d_results.pt`, which contains the following keys:
- `smpl_params_global` and `smpl_params_incam`: SMPL parameters in the global and in-camera coordinate systems
  - Each contains:
    - `body_pose`: Tensor of shape (P, L, 63)
    - `betas`: Tensor of shape (P, L, 10)
    - `global_orient`: Tensor of shape (P, L, 3)
    - `transl`: Tensor of shape (P, L, 3)
- `K_fullimg`: Camera intrinsic matrix
  - Tensor of shape (L, 3, 3), identical across all frames
- `net_outputs`: Additional network outputs (currently unused)
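To make the layout above concrete, here is a toy stand-in mirroring the documented `hmr4d_results.pt` structure, plus a hypothetical `motion_of` helper (illustrative only) that gathers one person's per-frame parameters:

```python
import torch

# Toy stand-in for torch.load("<output_root>/hmr4d_results.pt"):
# P people over L frames, with the shapes documented above.
P, L = 2, 30
results = {
    "smpl_params_global": {
        "body_pose": torch.zeros(P, L, 63),
        "betas": torch.zeros(P, L, 10),
        "global_orient": torch.zeros(P, L, 3),
        "transl": torch.zeros(P, L, 3),
    },
    "K_fullimg": torch.eye(3).expand(L, 3, 3),
}

def motion_of(results: dict, person: int) -> torch.Tensor:
    """Concatenate one person's global orientation, body pose, and translation
    into a single (L, 3 + 63 + 3) = (L, 69) motion tensor."""
    p = results["smpl_params_global"]
    return torch.cat(
        [p["global_orient"][person], p["body_pose"][person], p["transl"][person]],
        dim=-1,
    )
```

Indexing the leading P dimension like this is how the multi-person outputs reduce back to the single-person case.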
If you find this code useful for your research, please use the following BibTeX entry.
```bibtex
@inproceedings{shen2024gvhmr,
    title={World-Grounded Human Motion Recovery via Gravity-View Coordinates},
    author={Shen, Zehong and Pi, Huaijin and Xia, Yan and Cen, Zhi and Peng, Sida and Hu, Zechen and Bao, Hujun and Hu, Ruizhen and Zhou, Xiaowei},
    booktitle={SIGGRAPH Asia Conference Proceedings},
    year={2024}
}
```
We thank the authors of WHAM, 4D-Humans, and ViTPose-Pytorch for their great work, without which our project and code would not be possible.