Since the introduction of the self-attention mechanism and the adoption of the Transformer architecture for Computer Vision tasks, the Vision Transformer-based architectures gained a lot of popularity in the field, being used for tasks such as image classification, object detection and image segmentation. However, efficiently leveraging the attention mechanism in vision transformers for the Monocular 3D Object Detection task remains an open question. In this paper, we present LAM3D, a framework that Leverages self-Attention mechanism for Monocular 3D object Detection. To do so, the proposed method is built upon a Pyramid Vision Transformer v2 (PVTv2) as feature extraction backbone and 2D/3D detection machinery. We evaluate the proposed method on the KITTI 3D Object Detection Benchmark, proving the applicability of the proposed solution in the autonomous driving domain and outperforming reference methods. Moreover, due to the usage of self-attention, LAM3D is able to systematically outperform the equivalent architecture that does not employ self-attention.
Sas, D-A, Di Bella, L, Yangxintong, L, Oniga, F & Munteanu, A 2024, LAM3D: Leveraging Attention for Monocular 3D Object Detection. in International Workshop on Multimedia Signal Processing (MMSP). IEEE, pp. 1-6. https://doi.org/10.1109/MMSP61759.2024.1074349, https://doi.org/10.48550/arXiv.2408.01739
Sas, D.-A., Di Bella, L., Yangxintong, L., Oniga, F., & Munteanu, A. (2024). LAM3D: Leveraging Attention for Monocular 3D Object Detection. In International Workshop on Multimedia Signal Processing (MMSP) (pp. 1-6). IEEE. https://doi.org/10.1109/MMSP61759.2024.1074349, https://doi.org/10.48550/arXiv.2408.01739
@inproceedings{d0a1511ede574867839b3d43ab383a71,
title = "LAM3D: Leveraging Attention for Monocular 3D Object Detection",
abstract = "Since the introduction of the self-attention mechanism and the adoption of the Transformer architecture for Computer Vision tasks, the Vision Transformer-based architectures gained a lot of popularity in the field, being used for tasks such as image classification, object detection and image segmentation. However, efficiently leveraging the attention mechanism in vision transformers for the Monocular 3D Object Detection task remains an open question. In this paper, we present LAM3D, a framework that Leverages self-Attention mechanism for Monocular 3D object Detection. To do so, the proposed method is built upon a Pyramid Vision Transformer v2 (PVTv2) as feature extraction backbone and 2D/3D detection machinery. We evaluate the proposed method on the KITTI 3D Object Detection Benchmark, proving the applicability of the proposed solution in the autonomous driving domain and outperforming reference methods. Moreover, due to the usage of self-attention, LAM3D is able to systematically outperform the equivalent architecture that does not employ self-attention.",
author = "Diana-Alexandra Sas and {Di Bella}, Leandro and Lyu Yangxintong and Florin Oniga and Adrian Munteanu",
year = "2024",
month = oct,
day = "2",
doi = "10.1109/MMSP61759.2024.1074349",
language = "English",
isbn = "979-8-3503-8726-1",
pages = "1--6",
booktitle = "International Workshop on Multimedia Signal Processing (MMSP)",
publisher = "IEEE",
}