Multimodal Measurement of Depression Using Deep Learning Models
Host Publication: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge (AVEC 2017)
Authors: L. Yang, D. Jiang, X. Xia, M. Oveneke and H. Sahli
Publication Place: New York, NY, USA
Publication Year: 2017
Number of Pages: 7
This paper addresses multi-modal depression analysis. We propose a multi-modal fusion framework composed of deep convolutional neural network (DCNN) and deep neural network (DNN) models. Our framework considers audio, video, and text streams. For each modality, handcrafted feature descriptors are input into a DCNN to learn high-level global features with compact dynamic information; the learned features are then fed to a DNN to predict the PHQ-8 scores. For multi-modal fusion, the estimated PHQ-8 scores from the three modalities are integrated in a DNN to obtain the final PHQ-8 score. Moreover, in this work, we propose new feature descriptors for text and video. For the text descriptors, we select the participant's answers to the questions associated with psychoanalytic aspects of depression, such as sleep disorder, and make use of the Paragraph Vector (PV) to learn distributed representations of these sentences. For the video descriptors, we propose a new global descriptor, the Histogram of Displacement Range (HDR), calculated directly from the facial landmarks to measure their displacements and speed. Experiments have been carried out on the AVEC 2017 depression sub-challenge dataset. The obtained results show that the proposed depression recognition framework achieves promising accuracy, with a root mean square error (RMSE) of 4.653 and mean absolute error (MAE) of 3.980 on the development set, and an RMSE of 5.974 and MAE of 5.163 on the test set.
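To illustrate the video descriptor, the sketch below shows one plausible reading of the HDR idea: for each pair of consecutive frames, compute the displacement magnitude of every facial landmark and accumulate the magnitudes into a histogram over displacement-range bins. The function name, bin scheme, and normalization are illustrative assumptions, not the paper's exact formulation.

```python
import math

def hdr_descriptor(landmark_frames, bin_edges):
    """Sketch of a Histogram of Displacement Range (HDR) descriptor.

    landmark_frames: sequence of frames, each a list of (x, y) landmark
        coordinates (same landmark order in every frame).
    bin_edges: ascending displacement thresholds defining the bins; a
        displacement d falls in bin i when bin_edges[i-1] <= d < bin_edges[i],
        with one extra bin for d >= bin_edges[-1].
    Returns a histogram normalized to sum to 1 (an assumed simplification
    of the paper's descriptor).
    """
    counts = [0] * (len(bin_edges) + 1)
    total = 0
    for prev, curr in zip(landmark_frames, landmark_frames[1:]):
        for (x0, y0), (x1, y1) in zip(prev, curr):
            d = math.hypot(x1 - x0, y1 - y0)
            # Count how many edges the displacement meets or exceeds
            # to locate its bin index.
            idx = sum(1 for edge in bin_edges if d >= edge)
            counts[idx] += 1
            total += 1
    return [c / total for c in counts] if total else counts
```

Because the descriptor aggregates per-frame landmark motion into a fixed-length vector, it can be computed for a whole interview segment regardless of its duration, which is what allows it to serve as a global video feature before the DCNN stage.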