Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video. We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales. Our proposed Temporal Progressive (TemPr) model is composed of multiple attention towers, one for each scale. The predicted action label is based on the collective agreement considering confidences of these towers. Extensive experiments over four video datasets showcase state-of-the-art performance on the task of Early Action Prediction across a range of encoder architectures. We demonstrate the effectiveness and consistency of TemPr through detailed ablations.
Stergiou, A & Damen, D 2023, The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction. in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, pp. 14709-14719, IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, British Columbia, Canada, 18/06/23. https://doi.org/10.1109/CVPR52729.2023.01413
Stergiou, A., & Damen, D. (2023). The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 14709-14719). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). IEEE. https://doi.org/10.1109/CVPR52729.2023.01413
@inproceedings{fee0a74bc79441a7bc55e9f193d462d0,
title = "The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction",
abstract = "Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video. We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales. Our proposed Temporal Progressive (TemPr) model is composed of multiple attention towers, one for each scale. The predicted action label is based on the collective agreement considering confidences of these towers. Extensive experiments over four video datasets showcase state-of-the-art performance on the task of Early Action Prediction across a range of encoder architectures. We demonstrate the effectiveness and consistency of TemPr through detailed ablations.",
author = "Alexandros Stergiou and Dima Damen",
note = "Funding Information: We use publicly available datasets. Research is funded by the United Nations' End Violence Fund (iCOP 2.0) and EPSRC UMPIRE (EP/T004991/1). We utilized Bristol's HPC Blue Crystal 4 facility. Publisher Copyright: {\textcopyright} 2023 IEEE.; IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR; Conference date: 18-06-2023 Through 22-06-2023",
year = "2023",
doi = "10.1109/CVPR52729.2023.01413",
language = "English",
series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
publisher = "IEEE",
pages = "14709--14719",
booktitle = "2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)",
}