University of Oulu

W. Carneiro de Melo, E. Granger and M. B. Lopez, "Encoding Temporal Information For Automatic Depression Recognition From Facial Analysis," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 1080-1084, doi: 10.1109/ICASSP40776.2020.9054375

Encoding temporal information for automatic depression recognition from facial analysis

Author: Carneiro de Melo, Wheidima1; Granger, Eric2; Bordallo Lopez, Miguel3,1
Organizations: 1Center for Machine Vision and Signal Analysis (CMVS), University of Oulu, Finland
2LIVIA, Dept. of Systems Engineering, École de technologie supérieure, Montreal, Canada
3VTT Technical Research Centre of Finland
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 2.2 MB)
Persistent link: http://urn.fi/urn:nbn:fi-fe2020090267167
Language: English
Published: Institute of Electrical and Electronics Engineers, 2020
Publish Date: 2020-09-02
Description:

Abstract

Depression is a mental illness that may be harmful to an individual’s health. Using deep learning models to recognize the facial expressions of individuals captured in videos has shown promising results for automatic depression detection. Typically, depression levels are recognized using 2D Convolutional Neural Networks (CNNs) trained to extract static features from video frames, which impairs the capture of dynamic spatio-temporal relations. As an alternative, 3D CNNs may be employed to extract spatio-temporal features from short video clips, although the risk of overfitting increases due to the limited availability of labeled depression video data. To address these issues, we propose a novel temporal pooling method that captures and encodes the spatio-temporal dynamics of video clips into an image map. This approach allows fine-tuning a pre-trained 2D CNN to model facial variations, thereby improving the training process and model accuracy. Our proposed method is based on a two-stream model that performs late fusion of appearance and dynamic information. Extensive experiments on two benchmark AVEC datasets indicate that the proposed method is efficient and outperforms state-of-the-art schemes.
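As an illustration of the general idea described in the abstract — collapsing a short clip's temporal dynamics into a single image map that a pre-trained 2D CNN can consume — the sketch below uses approximate rank pooling (the "dynamic image" weighting of Bilen et al.). This is one plausible instance of temporal pooling, not the authors' exact method; the function name and weighting scheme are assumptions for illustration only.

```python
import numpy as np

def temporal_pool(clip):
    """Collapse a clip of shape (T, H, W, C) into one (H, W, C) image map.

    Uses the approximate rank-pooling weights alpha_t = 2t - T - 1,
    which emphasize later frames relative to earlier ones, so the
    resulting map encodes the direction of temporal change.
    Illustrative sketch only, not the paper's exact encoding.
    """
    T = clip.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = 2.0 * t - T - 1.0  # negative for early frames, positive for late ones
    # Weighted sum over the time axis: (T,) x (T, H, W, C) -> (H, W, C)
    pooled = np.tensordot(alpha, clip.astype(np.float64), axes=(0, 0))
    # Rescale to [0, 255] so the map resembles an ordinary input image
    # for a pre-trained 2D CNN.
    pooled -= pooled.min()
    rng = pooled.max()
    if rng > 0:
        pooled /= rng
    return (pooled * 255.0).astype(np.uint8)

# Example: a synthetic 16-frame clip of 64x64 RGB frames.
clip = np.random.rand(16, 64, 64, 3)
image_map = temporal_pool(clip)
print(image_map.shape)  # (64, 64, 3)
```

In a two-stream setup of the kind the abstract mentions, one stream would see a raw appearance frame and the other this pooled dynamics map, with their depression-score predictions fused at the end.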


Series: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing
ISSN: 1520-6149
ISSN-E: 2379-190X
ISSN-L: 1520-6149
ISBN: 978-1-5090-6631-5
ISBN Print: 978-1-5090-6632-2
Pages: 1080 - 1084
DOI: 10.1109/ICASSP40776.2020.9054375
OADOI: https://oadoi.org/10.1109/ICASSP40776.2020.9054375
Host publication: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Conference: IEEE International Conference on Acoustics, Speech and Signal Processing
Type of Publication: A4 Article in conference proceedings
Field of Science: 213 Electronic, automation and communications engineering, electronics
Funding: The financial support of the Academy of Finland, Infotech Oulu and State University of Amazonas is acknowledged.
Copyright information: © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.