In this paper, we target the CCPR 2016 Multimodal Emotion Recognition Challenge (MEC 2016), which is based on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) of movies and TV programs showing (nearly) spontaneous human emotions. Low-level descriptors (LLDs) are proposed as audio features. As visual features, we propose using histograms of oriented gradients (HOG), local phase quantisation (LPQ), shape features, and behavior-related features such as head pose and eye gaze. The visual features are post-processed to delete or smooth all-zero feature vector segments. Single-modal emotion recognition is performed using fully connected hidden Markov models (HMMs). For multimodal emotion recognition, two fusion schemes are proposed: 1) fusing the normalized probability vectors from the HMMs with a support vector machine (SVM), and 2) falling back to the audio LLD features when the visual feature sequences are all-zero vector sequences. Moreover, to make full use of the labeled data and to mitigate class imbalance, we train the HMMs and SVMs on the training and validation sets together, with parameters optimized via cross-validation experiments. Experimental results on the test set show that the macro average precision (MAP) of audio emotion recognition is 42.85%, that the HOG features obtain the best visual emotion recognition performance with a MAP of 54.24%, and that multimodal fusion scheme 2 obtains a MAP of 53.90%, all much higher than the baseline results (24.02%, 34.28%, and 30.63% for audio, visual, and multimodal recognition, respectively). The obtained classification accuracies are also much higher than the baselines.
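As an illustration of fusion scheme 1 described in the abstract, the following Python sketch normalizes per-class HMM scores from the audio and visual modalities into probability vectors, concatenates them, and trains an SVM on the fused vectors. The class count (8, the MEC 2016 emotion categories), the random stand-in scores, and the RBF kernel are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.svm import SVC

def normalize_scores(log_likelihoods):
    # Softmax over per-class HMM log-likelihoods -> probability vectors.
    shifted = log_likelihoods - log_likelihoods.max(axis=1, keepdims=True)
    p = np.exp(shifted)
    return p / p.sum(axis=1, keepdims=True)

# Stand-in scores: rows are samples, columns are the 8 emotion classes.
rng = np.random.default_rng(0)
audio_ll = rng.standard_normal((100, 8))   # placeholder audio HMM scores
visual_ll = rng.standard_normal((100, 8))  # placeholder visual HMM scores
labels = rng.integers(0, 8, 100)

# Fusion scheme 1: concatenate the two normalized probability vectors and
# classify the fused vector with an SVM.
fused = np.hstack([normalize_scores(audio_ll), normalize_scores(visual_ll)])
svm = SVC(kernel="rbf")  # kernel choice is an assumption
svm.fit(fused, labels)
print(svm.predict(fused[:5]))

The challenge metric, macro average precision, is the unweighted mean of the per-class precisions, which scikit-learn computes via precision_score(y_true, y_pred, average="macro").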
Xia, X, Guo, L, Jiang, D, Pei, E, Yang, L & Sahli, H 2016, Audio Visual Recognition of Spontaneous Emotions In-the-Wild. in T Tan, X Li, X Chen, J Zhou & H Cheng (eds), Pattern Recognition: CCPR 2016, Proceedings, Part II. Communications in Computer and Information Science, vol. 663, Springer, Singapore, pp. 692-706, 7th Chinese Conference on Pattern Recognition, Chengdu, China, 5/11/16.
Xia, X., Guo, L., Jiang, D., Pei, E., Yang, L., & Sahli, H. (2016). Audio Visual Recognition of Spontaneous Emotions In-the-Wild. In T. Tan, X. Li, X. Chen, J. Zhou, & H. Cheng (Eds.), Pattern Recognition: CCPR 2016, Proceedings, Part II (Communications in Computer and Information Science, Vol. 663, pp. 692-706). Springer.
@inproceedings{880f0e32f1144765b79862e87687bdeb,
title = "Audio Visual Recognition of Spontaneous Emotions In-the-Wild",
abstract = "In this paper, we target the CCPR 2016 Multimodal Emotion Recognition Challenge (MEC 2016) which is based on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) of movies and TV programs showing (nearly) spontaneous human emotions. Low level descriptors (LLDs) are proposed as audio features. As visual features, we propose using histogram of oriented gradients (HOG), local phase quantisation (LPQ), shape features and behavior-related features such as head pose and eye gaze. The visual features are post processed to delete or smooth the all-zero feature vector segments. Single modal emotion recognition is performed using fully connected hidden Markov models (HMMs). For multimodal emotion recognition, two fusion schemes are proposed: 1) fusing the normalized probability vectors from the HMMs by a support vector machine (SVM), and 2) using the LLD features when the visual feature sequences are all-zero vector sequences. Moreover, to make full use of the labeled data and to overcome the problem of unbalanced data, we use the training set and validation set together to train the HMMs and SVMs with parameters optimized via cross-validation experiments.Experimental results on the test set show that the macro average precision (MAP) of audio emotion recognition is 42:85%, the HOG features obtain the best visual emotion recognition performance with MAP reaching 54:24%, and the multimodal fusion scheme 2 obtains 53:90% of MAP, which are all much higher than the baseline results (24:02%, 34:28%, and 30:63% for audio, visual and multimodal recognition, respectively). The obtained classification accuracies are also much higher than the baseline.",
keywords = "emotion, data mining",
author = "Xiaohan Xia and Liyong Guo and Dongmei Jiang and Ercheng Pei and Le Yang and Hichem Sahli",
year = "2016",
language = "English",
volume = "663 ",
pages = "692--706",
editor = "Tan Tieniu and Li Xuelong and Chen, { Xilin} and Zhou, { Jie} and Hong Cheng",
booktitle = "Pattern Recognition",
publisher = "Springer",
note = "7th Chinese Conf. on Pattern Recognition: Communications in Computer and Information Science- , CCPR 2016 ; Conference date: 05-11-2016 Through 07-11-2016",
url = "http://www.uestcrobot.net/ccpr2016/english_index.html",
}