Hybrid Depression Classification and Estimation from Audio Video and Text Information
Host Publication: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge (AVEC 2017)
Authors: L. Yang, H. Sahli, X. Xia, M. Oveneke and D. Jiang
Publication Place: New York, NY, USA
Publication Year: 2017
Number of Pages: 8
In this paper, we design a hybrid depression classification and estimation framework based on audio, video and text descriptors. It contains three main components: 1) Deep Convolutional Neural Network (DCNN) and Deep Neural Network (DNN) based audio-visual multi-modal depression recognition frameworks, trained on depressed and not-depressed participants, respectively; 2) a Paragraph Vector (PV), Support Vector Machine (SVM) and Random Forest based depression classification framework operating on the interview transcripts; 3) a multivariate regression model fusing the audio-visual PHQ-8 estimations from the depressed and not-depressed DCNN-DNN models with the depression classification result from the text information. In the DCNN-DNN based depression estimation framework, audio/video feature descriptors are first input into a DCNN to learn high-level features, which are then fed to a DNN to predict the PHQ-8 score. Initial predictions from the two modalities are fused via a DNN model. In the PV-SVM and Random Forest based depression classification framework, we explore semantic text features using PV, as well as global text features. Experiments have been carried out on the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) dataset for the Depression Sub-challenge of the 2017 Audio-Visual Emotion Challenge (AVEC). The results show that the proposed framework obtains very promising results, with a root mean square error (RMSE) of 3.088 and a mean absolute error (MAE) of 2.477 on the development set, and an RMSE of 5.400 and MAE of 4.359 on the test set, all lower than the baseline results.
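The final fusion step described above can be sketched as a least-squares multivariate regression over three inputs: the PHQ-8 estimates from the two DCNN-DNN models and the binary text-based classification, evaluated with the same RMSE and MAE metrics used in the challenge. This is only an illustrative sketch under assumed data: the participant values below are invented, and the paper's actual regression setup and coefficients are not reproduced here.

```python
import numpy as np

# Hypothetical toy data: per-participant PHQ-8 estimates from the
# "depressed" and "not-depressed" DCNN-DNN models, plus a binary
# depression decision from the text classifier (all values invented).
est_depressed     = np.array([14.2, 11.8, 3.1, 4.0, 15.5, 2.7])
est_not_depressed = np.array([12.9, 10.4, 2.5, 3.6, 14.1, 3.0])
text_class        = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])  # 1 = depressed
phq8_true         = np.array([14.0, 11.0, 3.0, 4.0, 15.0, 3.0])

# Design matrix with an intercept column; solve the least-squares
# multivariate regression that fuses the three information sources.
X = np.column_stack([np.ones_like(est_depressed),
                     est_depressed, est_not_depressed, text_class])
w, *_ = np.linalg.lstsq(X, phq8_true, rcond=None)
phq8_pred = X @ w

# Evaluate with the two metrics reported in the abstract.
rmse = np.sqrt(np.mean((phq8_pred - phq8_true) ** 2))
mae = np.mean(np.abs(phq8_pred - phq8_true))
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
```

Note that MAE never exceeds RMSE, which is consistent with the reported pairs (2.477 vs. 3.088 on development, 4.359 vs. 5.400 on test).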