Multi-person 3D pose estimation from unlabelled data

Daniel Rodriguez-Criado; Pilar Bachiller-Burgos; Luis J. Manso; George Vogiatzis

doi:10.1007/s00138-024-01530-6

Corresponding author: Pilar Bachiller-Burgos

Research output: Contribution to journal › Article › peer-review

Abstract

Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, it presents several challenges, especially when approached using multiple views and regular RGB cameras as the only input. First, each person must be uniquely identified across the different views. Second, the estimation must be robust to noise, partial occlusions, and views in which a person may not be detected. Third, many pose estimation approaches rely on environment-specific annotated datasets that are frequently prohibitively expensive and/or require specialised hardware. In this work, we address these three challenges with the help of self-supervised learning. The result is the first multi-camera, multi-person data-driven approach that does not require an annotated dataset. In particular, we present a three-stage pipeline and a rigorous evaluation providing evidence that our approach runs faster than other state-of-the-art algorithms, with comparable accuracy, and, most importantly, does not require annotated datasets. The pipeline is composed of a 2D skeleton detection step, followed by a Graph Neural Network that estimates cross-view correspondences of the people in the scene, and a Multi-Layer Perceptron that transforms the 2D information into 3D pose estimations. Our proposal comprises the last two steps and is compatible with any 2D skeleton detector as input. These two models are trained in a self-supervised manner, thus avoiding the need for datasets annotated with 3D ground-truth poses.
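The data flow of the three stages can be illustrated with a minimal sketch. It is not the authors' implementation: the Graph Neural Network is replaced by a hand-written matrix of pairwise "same person" scores, the Multi-Layer Perceptron is a tiny untrained network with random weights, and the names group_detections, lift_to_3d, NUM_JOINTS and NUM_VIEWS, the joint count and the zero-filling of missing views are all assumptions made for illustration. Only the overall structure (2D skeletons in, cross-view grouping, 2D-to-3D lifting) comes from the abstract.

```python
import numpy as np

NUM_JOINTS = 17  # assumption: COCO-style skeleton; the abstract does not fix a joint set
NUM_VIEWS = 3    # assumption: number of cameras in the example


def group_detections(scores, threshold=0.5):
    """Stage 2 stand-in: turn pairwise 'same person' scores into person groups.

    `scores` is a symmetric (N, N) matrix over all detections from all views.
    In the paper a Graph Neural Network estimates the correspondences; here the
    scores are simply given, and groups are the connected components of the
    thresholded graph, found with union-find.
    """
    n = scores.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if scores[i, j] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def lift_to_3d(skeletons_2d, views, weights):
    """Stage 3 stand-in: an MLP mapping one person's multi-view 2D joints to 3D.

    Views in which the person was not detected are left at zero (an assumption).
    """
    x = np.zeros((NUM_VIEWS, NUM_JOINTS, 2))
    for skeleton, view in zip(skeletons_2d, views):
        x[view] = skeleton
    w1, b1, w2, b2 = weights
    hidden = np.maximum(x.reshape(-1) @ w1 + b1, 0.0)  # one hidden layer with ReLU
    return (hidden @ w2 + b2).reshape(NUM_JOINTS, 3)


rng = np.random.default_rng(0)

# Stage 1 stand-in: four 2D skeletons, as any 2D detector could provide them.
# Person A is seen in views 0 and 1, person B in views 0 and 2.
views = [0, 0, 1, 2]
detections = [rng.random((NUM_JOINTS, 2)) for _ in views]

# Hypothetical pairwise scores; a trained matching model would produce these.
scores = np.array([
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
])
groups = group_detections(scores)

# Untrained random weights: the output only demonstrates shapes, not real poses.
in_dim, hidden_dim = NUM_VIEWS * NUM_JOINTS * 2, 64
weights = (
    rng.normal(size=(in_dim, hidden_dim)) * 0.01,
    np.zeros(hidden_dim),
    rng.normal(size=(hidden_dim, NUM_JOINTS * 3)) * 0.01,
    np.zeros(NUM_JOINTS * 3),
)
poses_3d = [
    lift_to_3d([detections[i] for i in g], [views[i] for i in g], weights)
    for g in groups
]
```

In this sketch, groups holds one list of detection indices per person, and poses_3d holds one (NUM_JOINTS, 3) array per person. Keeping the 2D detector outside the two later stages mirrors the statement that the proposal is compatible with any 2D skeleton detector.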

Original language	English
Article number	46
Number of pages	18
Journal	Machine Vision and Applications
Volume	35
Early online date	6 Apr 2024
DOIs	https://doi.org/10.1007/s00138-024-01530-6
Publication status	Published - 6 Apr 2024

Bibliographical note

Copyright © The Author(s), 2024. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

Data Access Statement

The data and models that support the findings of this paper are publicly available at https://www.dropbox.com/sh/6cn6ajddrfkb332/AACg_UpK22BlytWrP19w_VaNa?dl=0. The link contains both the preprocessed datasets and the pretrained models. The code is available in a public GitHub repository at https://github.com/gnns4hri/3D_multi_pose_estimator. The experiments use the CMU Panoptic dataset [24] and a dataset compiled specifically for this work. All information that identifies individuals was removed, in compliance with the conditions set by the ethics committee of Aston University.

Keywords

3D multi-pose estimation
Skeleton matching
Deep learning
Graph neural networks
Self-supervised learning

Access to Document

10.1007/s00138-024-01530-6 (Licence: CC BY 4.0)

Rodriguez-Criado et al., Multi-person 3D pose estimation from unlabelled data
Final published version, 3.4 MB (Licence: CC BY 4.0)

Cite this

@article{dfe497eb3dcb42b58883bbb7d9e86394,
  title = "Multi-person 3D pose estimation from unlabelled data",
  keywords = "3D multi-pose estimation, Skeleton matching, Deep learning, Graph neural networks, Self-supervised learning",
  author = "Daniel Rodriguez-Criado and Pilar Bachiller-Burgos and Manso, {Luis J.} and George Vogiatzis",
  year = "2024",
  month = apr,
  day = "6",
  doi = "10.1007/s00138-024-01530-6",
  language = "English",
  volume = "35",
  journal = "Machine Vision and Applications",
  issn = "0932-8092",
  publisher = "Springer",
}

TY - JOUR
T1 - Multi-person 3D pose estimation from unlabelled data
AU - Rodriguez-Criado, Daniel
AU - Bachiller-Burgos, Pilar
AU - Manso, Luis J.
AU - Vogiatzis, George
PY - 2024/4/6
Y1 - 2024/4/6
KW - 3D multi-pose estimation
KW - Skeleton matching
KW - Deep learning
KW - Graph neural networks
KW - Self-supervised learning
UR - https://link.springer.com/article/10.1007/s00138-024-01530-6
UR - http://www.scopus.com/inward/record.url?scp=85189472584&partnerID=8YFLogxK
U2 - 10.1007/s00138-024-01530-6
DO - 10.1007/s00138-024-01530-6
M3 - Article
SN - 0932-8092
VL - 35
JO - Machine Vision and Applications
JF - Machine Vision and Applications
M1 - 46
ER -
