Transformer Encoder With Multi-Modal Multi-Head Attention for Continuous Affect Recognition
 
Haifeng Chen, Dongmei Jiang, Hichem Sahli
 
Abstract 

Continuous affect recognition is becoming an increasingly attractive research topic in affective computing. Previous works mainly focused on modelling the temporal dependency within a single sensor modality, or adopted early or late fusion for multi-modal affective state recognition. However, early fusion suffers from the curse of dimensionality, and late fusion ignores the complementarity and redundancy between the modal streams. In this paper, we first introduce the transformer encoder with a self-attention mechanism and propose a Convolutional Neural Network-Transformer Encoder (CNNTE) framework to model the temporal dependency for single-modal affect recognition. Further, to effectively account for the complementarity and redundancy between multiple streams, we propose a Transformer Encoder with Multi-modal Multi-head Attention (TEMMA) for multi-modal affect recognition. TEMMA progressively and simultaneously refines the inter-modality interactions and the intra-modality temporal dependency. The learned multi-modal representations are fed to an Inference Sub-network with fully connected layers to estimate the affective state. The proposed framework is trained end-to-end and demonstrates its effectiveness on the AVEC2016 and AVEC2019 datasets. Compared to state-of-the-art models, our approach obtains remarkable improvements on both arousal and valence in terms of the concordance correlation coefficient (CCC), reaching 0.583 for arousal and 0.564 for valence on the AVEC2019 test set.
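
Since results are reported as CCC, a minimal sketch of how this metric is commonly computed may help. It follows the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (biased) variance and covariance. The function name `concordance_cc` and the toy sequences are illustrative assumptions and are not taken from the paper or from the AVEC evaluation scripts.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; uses population (1/n) statistics.
    """
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("sequences must be non-empty and of equal length")

    mean_p = sum(pred) / n
    mean_g = sum(gold) / n

    # Population variances and covariance (divide by n, not n - 1).
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal; the ratio is undefined.
        return float("nan")
    return 2.0 * cov / denom


# Toy example: predictions track the labels but carry a constant offset.
gold = [1.0, 2.0, 3.0]
pred = [2.0, 3.0, 4.0]
score = concordance_cc(pred, gold)
```

Unlike the Pearson correlation, which would be 1 for this toy pair, CCC penalises the constant offset through the squared mean difference in the denominator, so `score` here is 4/7. This sensitivity to bias and scale, in addition to correlation, is what distinguishes CCC when comparing continuous arousal and valence predictions against annotations.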