A Review of Human Pose Estimation Methods Based on Monocular Vision

YAN Xin; GAO Hao; LI Haolun

doi:10.48014/ccsr.20230614001

Chinese Computer Sciences Review, 2023, Vol. 1, No. 2, pp. 13-30

Abstract

This article reviews the technical methods and industry applications of monocular human pose estimation, focusing on the field's development in recent years. With the rapid progress of computer vision and machine learning, monocular human pose estimation has become a research direction of considerable interest. We first introduce the relevant concepts and significance of monocular human pose estimation and explain why research in this field matters. We then describe its technical methods in detail, covering two distinct technical routes: monocular 2D human pose estimation and monocular 3D human pose estimation. For each, we discuss its principles, development, and strengths and weaknesses. Next, we explore applications in fields such as human-computer interaction, autonomous driving, and healthcare, with examples that highlight the importance of human pose estimation. Finally, we analyze current research hotspots and challenges and look ahead to future directions. This article aims to give researchers and practitioners a comprehensive overview and to promote the application and further study of monocular human pose estimation technology.

DOI	10.48014/ccsr.20230614001
Article type	Review
Received	2023-06-14
Accepted	2023-11-27
Published	2023-12-28
Keywords	Monocular human pose estimation, deep learning, technical approaches, industry applications
Authors	YAN Xin, GAO Hao^*, LI Haolun
Affiliation	Nanjing University of Posts and Telecommunications, Nanjing 210023, China
Citation	YAN Xin, GAO Hao, LI Haolun. A review of human pose estimation methods based on monocular vision[J]. Chinese Computer Sciences Review, 2023, 1(2): 13-30.