2D-to-3D lifting is a fundamental approach in 3D human pose estimation (3DHPE). This task is crucial for applications such as motion analysis and virtual reality. While Graph Convolutional Networks (GCNs) have demonstrated effectiveness in capturing spatial relationships in human skeletons, they suffer from over-smoothing and limited receptive fields. Transformer-based models provide global context but struggle with local feature extraction and computational efficiency. To address these challenges, we propose ADGT, a novel parallel GCN-transformer architecture combining the strengths of both approaches. Our method introduces three key innovations: Hop-Wise Scalable Adaptive GCN to refine local feature extraction, Attention-Based Local Feature Extractor to enhance the integration of local and global representations, and Register-Based Transformer Enhancement to improve feature separation. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that ADGT achieves state-of-the-art performance among frame-based methods while maintaining computational efficiency. These results highlight the potential of ADGT for real-time applications requiring accurate and efficient 3DHPE.
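The parallel GCN-transformer idea described above can be sketched generically: a skeleton-aware graph-convolution branch captures local joint relationships while a self-attention branch supplies global context, and the two are fused. This is an illustrative toy under assumed names and dimensions, not the authors' ADGT implementation (the paper's Hop-Wise Scalable Adaptive GCN, Attention-Based Local Feature Extractor, and register tokens are not reproduced here):

```python
import numpy as np

def graph_conv(x, adj, w):
    # One hop of graph convolution: average neighbour joint features, then project.
    deg = adj.sum(axis=1, keepdims=True)
    return (adj @ x / deg) @ w

def self_attention(x, wq, wk, wv):
    # Single-head self-attention across joints (global receptive field).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
J, C = 17, 8                       # 17 joints (Human3.6M skeleton), toy feature dim 8
x = rng.standard_normal((J, C))    # per-joint features lifted from a 2D pose
adj = np.eye(J)                    # self-loops; a real skeleton adds one entry per bone
adj[0, 1] = adj[1, 0] = 1          # a single illustrative bone edge
w = rng.standard_normal((C, C))
wq, wk, wv = (rng.standard_normal((C, C)) for _ in range(3))

local_feats = graph_conv(x, adj, w)              # GCN branch: local, skeleton-aware
global_feats = self_attention(x, wq, wk, wv)     # transformer branch: global context
fused = local_feats + global_feats               # parallel fusion of the two branches
print(fused.shape)
```

In a trained model the fusion would typically be learned (e.g. gated or attention-weighted) rather than a plain sum, and both branches would be stacked over several layers.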
Yang, S, Luu, AT, Nguyen, X, Histace, A, Jansen, B & Sahli, H 2026, 'ADGT: Enhancing 3D human pose estimation with attention-driven graph-transformers', Journal of Visual Communication and Image Representation, vol. 118, 104829. https://doi.org/10.1016/j.jvcir.2026.104829
Yang, S., Luu, A. T., Nguyen, X., Histace, A., Jansen, B., & Sahli, H. (2026). ADGT: Enhancing 3D human pose estimation with attention-driven graph-transformers. Journal of Visual Communication and Image Representation, 118, Article 104829. https://doi.org/10.1016/j.jvcir.2026.104829
@article{05e6f82d30514e0ab2dcebb12e8befa3,
title = "ADGT: Enhancing 3D human pose estimation with attention-driven graph-transformers",
abstract = "2D-to-3D lifting is a fundamental approach in 3D human pose estimation (3DHPE). This task is crucial for applications such as motion analysis and virtual reality. While Graph Convolutional Networks (GCNs) have demonstrated effectiveness in capturing spatial relationships in human skeletons, they suffer from over-smoothing and limited receptive fields. Transformer-based models provide global context but struggle with local feature extraction and computational efficiency. To address these challenges, we propose ADGT, a novel parallel GCN-transformer architecture combining the strengths of both approaches. Our method introduces three key innovations: Hop-Wise Scalable Adaptive GCN to refine local feature extraction, Attention-Based Local Feature Extractor to enhance the integration of local and global representations, and Register-Based Transformer Enhancement to improve feature separation. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that ADGT achieves state-of-the-art performance among frame-based methods while maintaining computational efficiency. These results highlight the potential of ADGT for real-time applications requiring accurate and efficient 3DHPE.",
keywords = "3D HPE, Graph convolution, Transformer",
author = "Shuo Yang and Luu, {Anh Tuan} and Xuan-Son Nguyen and Aymeric Histace and Bart Jansen and Hichem Sahli",
year = "2026",
month = jun,
doi = "10.1016/j.jvcir.2026.104829",
language = "English",
volume = "118",
journal = "Journal of Visual Communication and Image Representation",
issn = "1047-3203",
publisher = "Academic Press Inc.",
}