We present a triple-stream DBN model (T_AsyDBN) for audio-visual emotion recognition, in which the two audio feature streams are synchronous, while they are asynchronous with the visual feature stream within controllable constraints. MFCC features and the principal component analysis (PCA) coefficients of local prosodic features are used for the audio streams. For the visual stream, 2D facial features as well as 3D facial animation unit features are defined and concatenated, and the feature dimensions are reduced by PCA. Emotion recognition experiments on the eNTERFACE'05 database show that by adjusting the asynchrony constraint, the proposed T_AsyDBN model obtains a recognition rate 18.73% higher than the traditional multi-stream state synchronous HMM (MSHMM), and 10.21% higher than the two-stream asynchronous DBN model (Asy_DBN).
Jiang, D, Cui, Y, Zhang, X, Fan, P, Gonzalez, I & Sahli, H 2011, Audio Visual Emotion Recognition Based on Triple-Stream Dynamic Bayesian Network Models. in S D'Mello, A Graesser, B Schuller & J Martin (eds), Affective Computing and Intelligent Interaction. vol. 6974, Lecture Notes in Computer Science, Springer, pp. 609-618, Unknown, 1/01/11. <http://www.springerlink.com/content/57051t634481g267/>
Jiang, D., Cui, Y., Zhang, X., Fan, P., Gonzalez, I., & Sahli, H. (2011). Audio Visual Emotion Recognition Based on Triple-Stream Dynamic Bayesian Network Models. In S. D'Mello, A. Graesser, B. Schuller, & J. Martin (Eds.), Affective Computing and Intelligent Interaction (Vol. 6974, pp. 609-618). (Lecture Notes in Computer Science). Springer. http://www.springerlink.com/content/57051t634481g267/
@inproceedings{e089c06309644340bd5abc5e5da5fe91,
title = "Audio Visual Emotion Recognition Based on Triple-Stream Dynamic Bayesian Network Models",
abstract = "We present a triple-stream DBN model (T_AsyDBN) for audio-visual emotion recognition, in which the two audio feature streams are synchronous, while they are asynchronous with the visual feature stream within controllable constraints. MFCC features and the principal component analysis (PCA) coefficients of local prosodic features are used for the audio streams. For the visual stream, 2D facial features as well as 3D facial animation unit features are defined and concatenated, and the feature dimensions are reduced by PCA. Emotion recognition experiments on the eNTERFACE'05 database show that by adjusting the asynchrony constraint, the proposed T_AsyDBN model obtains a recognition rate 18.73% higher than the traditional multi-stream state synchronous HMM (MSHMM), and 10.21% higher than the two-stream asynchronous DBN model (Asy_DBN).",
keywords = "Computer Science",
author = "Dongmei Jiang and Yulu Cui and Xiaojing Zhang and Ping Fan and Isabel Gonzalez and Hichem Sahli",
note = "Sidney D'Mello, Arthur Graesser, Bj{\"o}rn Schuller, Jean-Claude Martin; Unknown ; Conference date: 01-01-2011",
year = "2011",
language = "English",
isbn = "978-3-642-24599-2",
volume = "6974",
series = "Lecture Notes in Computer Science",
publisher = "Springer",
pages = "609--618",
editor = "Sidney D'Mello and Arthur Graesser and Bj{\"o}rn Schuller and Jean-Claude Martin",
booktitle = "Affective Computing and Intelligent Interaction",
}