Application of Indoor 3D Human Pose Estimation for Motion Capture Using YOLOv4-Tiny, EfficientNet Simple Baseline, and VideoPose3D

Gerry Steven, Liliana Liliana, Anita Nathania Purbowo

Abstract


Human pose estimation is a research topic that aims to estimate the coordinates of each human keypoint, which can then be connected to form a human skeleton. Work on this topic can be applied to human activity recognition, human tracking, and motion capture for film and animation. The topic poses several challenges: diverse human poses, varied body appearance caused by clothing and similar-looking parts, and complex environments that may cause foreground occlusion. This research uses three methods: YOLOv4-Tiny, EfficientNet Simple Baseline, and VideoPose3D. YOLOv4-Tiny processes the input image to obtain bounding box coordinates. These coordinates are passed to the modified EfficientNet Simple Baseline to obtain 2D coordinates for 16 keypoints. VideoPose3D then converts the 2D coordinates into 3D coordinates for 15 keypoints. The results show that the modified EfficientNet Simple Baseline is faster, with a time of 4.54 ms compared with 5.15 ms for the original. The speed gain comes at a cost, however. In terms of accuracy, the modification is still less accurate than the original: its highest average head-normalized Percentage of Correct Keypoints (PCKh@0.2) is 86.89%, against 89.62% for the original. This affects 3D human pose estimation with VideoPose3D, where the EfficientNet modification results in a Mean Per Joint Position Error (MPJPE) of 25.3 mm, compared with 28.1 mm for the original Simple Baseline.
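The two accuracy metrics named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' evaluation code: the function names, the array shapes, and the assumption that a per-sample head size is already available are all hypothetical choices made for illustration.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Per Joint Position Error.

    pred, gt: arrays of shape (frames, joints, 3), assumed to be in millimetres.
    Returns the mean Euclidean distance over all frames and joints.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def pckh(pred, gt, head_size, alpha=0.2):
    """Head-normalized Percentage of Correct Keypoints.

    pred, gt: arrays of shape (samples, joints, 2), in pixels.
    head_size: array of shape (samples,), the head size per sample
        (assumed to be supplied by the dataset annotations).
    alpha: fraction of the head size used as the distance threshold.
    Returns the percentage of keypoints whose error is within the threshold.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)      # (samples, joints)
    threshold = alpha * head_size[:, None]         # (samples, 1), broadcast over joints
    return float(np.mean(dist <= threshold) * 100.0)
```

A lower MPJPE means the predicted 3D joints lie closer to the ground truth, while a higher PCKh means more 2D keypoints fall within the head-relative threshold.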

Keywords


Human Pose Estimation; YOLO; EfficientNet; Simple Baseline; VideoPose3D


