RTMPose vs RTMO vs DWPose vs RTMW #3135

davidpagnon · 2024-10-10T14:49:20Z

Hi everyone,

I would appreciate it if you could help me understand a few details about RTMPose, RTMO, DWPose, and RTMW.

Does the summary below look correct and accurate?
I am a bit confused about how RTMPose, RTMW, and DWPose are mixed together in the file names:
Some RTMW models have "dw" in their name: rtmw-dw-x-l_simcc-cocktail14_270e-384x288
Some RTMPose models also have "dw" in their name, and seem to be trained on a less extensive dataset: rtmpose-l_simcc-ucoco_dw-ucoco_270e-384x288
Can you help me understand the logic behind it?
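To make the comparison concrete, here is a minimal sketch that splits the two file names above into tokens and reports where "dw" appears. The function name `describe_model_name` and the field names are hypothetical, and the splitting rules are only my guess from these two examples, not the official naming scheme.

```python
import re


def describe_model_name(name: str) -> dict:
    """Split a checkpoint/config name into rough fields (assumed layout)."""
    # Assumption: fields are separated by "-" or "_".
    tokens = re.split(r"[-_]", name)
    # Assumption: the name ends with the input size, e.g. "384x288".
    size = re.search(r"(\d+)x(\d+)$", name)
    # Assumption: the epoch count looks like "_270e-".
    epochs = re.search(r"_(\d+)e-", name)
    return {
        "family": tokens[0],
        "has_dw_token": "dw" in tokens,
        "dw_position": tokens.index("dw") if "dw" in tokens else None,
        "epochs": int(epochs.group(1)) if epochs else None,
        "input_size": (int(size.group(1)), int(size.group(2))) if size else None,
    }


if __name__ == "__main__":
    for name in (
        "rtmw-dw-x-l_simcc-cocktail14_270e-384x288",
        "rtmpose-l_simcc-ucoco_dw-ucoco_270e-384x288",
    ):
        print(name, "->", describe_model_name(name))
```

In the first name "dw" comes right after the family, while in the second it sits between the two dataset tokens, which is exactly the inconsistency I am asking about.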
I see that the models trained on body8 (a very large dataset) have a lower AP than those trained on AIC+COCO. On real images, would they still be less accurate? Or, on the contrary, would they be better because they have seen a wider range of images? (Or is that a tricky question?)
Example for COCO 17 keypoints: RTMPose-s: AP 72.2, while RTMPose-s*: AP 69.7

Here is what I found for now:

RTMPose: Top-down (person detection, then pose estimation).
Supports Halpe26 (with feet)
- Without feet: AP on body8 dataset: s: 69.7, m: 74.9, l: 76.7 (largest x model at largest input size: 78.8)
- Without feet: GFLOPs: t: 0.36, s: 0.68, m: 1.93, l: 4.16 (largest x model at largest input size: 17.22)
- With feet: AUC on body8 dataset: s: 68.6, m: 71.9, l: 73.2 (largest x model at largest input size: 74.8)
- With feet: GFLOPs: t: 0.37, s: 0.70, m: 1.95, l: 4.19 (largest x model at largest input size: 17.29)
RTMW: RTMPose for whole-body (face, feet, hands), further extended to 3D
- With feet, hands and face: AP on cocktail14 dataset: m: 58.2, l: 66.0 (largest x model at largest input size: 70.2)
- With feet, hands and face: GFLOPs: m: 4.3, l: 7.9 (largest x model at largest input size: 29.2)
RTMO: One-stage (no detection needed).
Faster with more than 4 people, slightly worse accuracy, only 17 keypoints
- Without feet: AP on body8 dataset: s: 68.6, m: 72.8, l: 74.8
- Without feet: Speed not comparable (no FLOPs provided, and latency tested on a different system)
DWPose: Uses distillation (teacher and student models).
Whole-body, faster than RTMW but less accurate (although they are not trained on the same datasets, so the AP comparison should be taken with caution)
- With feet, hands and face: AP on COCOW+Ubody: s: 53.8, m: 60.6, l: 63.1
- With feet, hands and face: GFLOPs: s: 0.9, m: 2.2, l: 4.5
YOLOX: Similar speed to RTMO but less accurate, no feet
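
To make the numbers above easier to scan side by side, here is a small sketch that puts them in one list and sorts by compute cost. The names `Entry`, `ENTRIES` and `rank_by_cost` are hypothetical, the values are simply the ones I listed above, and I am assuming the cost figures are in GFLOPs. The scores come from different datasets, so they are not directly comparable across model families.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    size: str
    metric: str    # dataset the score was reported on
    score: float
    gflops: float  # assumed unit: GFLOPs


# Values copied from the summary above (17-keypoint RTMPose, RTMW, DWPose).
# RTMO is left out because no FLOPs figure was found for it.
ENTRIES = [
    Entry("RTMPose", "s", "AP body8", 69.7, 0.68),
    Entry("RTMPose", "m", "AP body8", 74.9, 1.93),
    Entry("RTMPose", "l", "AP body8", 76.7, 4.16),
    Entry("RTMW", "m", "AP cocktail14", 58.2, 4.3),
    Entry("RTMW", "l", "AP cocktail14", 66.0, 7.9),
    Entry("DWPose", "s", "AP COCOW+Ubody", 53.8, 0.9),
    Entry("DWPose", "m", "AP COCOW+Ubody", 60.6, 2.2),
    Entry("DWPose", "l", "AP COCOW+Ubody", 63.1, 4.5),
]


def rank_by_cost(entries):
    """Return the entries sorted from cheapest to most expensive."""
    return sorted(entries, key=lambda e: e.gflops)


if __name__ == "__main__":
    for e in rank_by_cost(ENTRIES):
        print(f"{e.model:8s} {e.size:2s} {e.gflops:5.2f} GFLOPs  "
              f"{e.score:5.1f}  ({e.metric})")
```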