Paired supervised learning and unsupervised pretraining of CNN-architecture for violence detection in videos

Abel Díaz Berenguer, Mitchel Perez Gonzalez, Hichem Sahli

Abstract

Recognizing violence in crowded scenes is a major challenge for automatic video surveillance, and there is a growing need for intelligent surveillance systems to strengthen public safety. In this paper, we propose an effective approach to recognizing violence in videos of crowded scenes based on a shallow Convolutional Neural Network (CNN) that is pretrained using an unsupervised layer-wise learning strategy. The pretrained parameters are then fine-tuned to extract intermediate frame representations, which are subsequently aggregated via NetVLAD into video-level representations used to recognize violence in footage. Experimental evaluation shows that our proposal yields results competitive with those reported in the state of the art.
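
As an illustration of the aggregation step, the following minimal NumPy sketch shows how per-frame descriptors could be pooled into a single video descriptor in the NetVLAD manner: each frame is softly assigned to a set of cluster centers, the weighted residuals are summed per cluster, and the result is normalized. The function name `netvlad_aggregate`, the array shapes, and the randomly drawn parameters are assumptions made for illustration only, not the configuration used in this work; in a trained model the centers, weights, and biases would be learned jointly with the CNN.

```python
import numpy as np

def netvlad_aggregate(frames, centers, weights, biases, eps=1e-12):
    """Pool per-frame descriptors (N, D) into one video descriptor (K*D,).

    Hypothetical helper: centers (K, D), weights (K, D), biases (K,).
    """
    # Soft assignment of every frame to every cluster (softmax over K).
    logits = frames @ weights.T + biases                 # (N, K)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)          # each row sums to 1

    # Weighted sum of residuals: V[k] = sum_i a_ik * (x_i - c_k).
    residuals = frames[:, None, :] - centers[None, :, :]  # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)   # (K, D)

    # Intra-normalization per cluster, then global L2 normalization.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + eps
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + eps)

# Toy usage with assumed sizes: 16 frames, 8-dim features, 4 clusters.
rng = np.random.default_rng(0)
N, D, K = 16, 8, 4
frames = rng.normal(size=(N, D))
centers = rng.normal(size=(K, D))
weights = rng.normal(size=(K, D))
biases = np.zeros(K)

video_descriptor = netvlad_aggregate(frames, centers, weights, biases)
print(video_descriptor.shape)  # (32,)
```

Because the residuals are summed over frames, the descriptor has a fixed length of K*D regardless of how many frames the video contains, and it does not depend on frame order.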