University of Oulu

W. C. de Melo, E. Granger and A. Hadid, "Combining Global and Local Convolutional 3D Networks for Detecting Depression from Facial Expressions," 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 2019, pp. 1-8. doi: 10.1109/FG.2019.8756568

Combining global and local convolutional 3D networks for detecting depression from facial expressions

Author: de Melo, Wheidima Carneiro (1); Granger, Eric (2); Hadid, Abdenour (1)
Organizations: (1) Center for Machine Vision and Signal Analysis (CMVS), University of Oulu, Finland
(2) Laboratoire d’imagerie, de vision et d’intelligence artificielle (LIVIA), ETS Montréal, Canada
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 2 MB)
Language: English
Published: Institute of Electrical and Electronics Engineers, 2019
Publish Date: 2020-03-24


Deep learning architectures have been successfully applied in video-based health monitoring to recognize distinctive variations in the facial appearance of subjects. To detect patterns of variation linked to depressive behavior, deep neural networks (NNs) typically exploit spatial and temporal information separately by, e.g., cascading a 2D convolutional NN (CNN) with a recurrent NN (RNN), although this separation can degrade the intrinsic spatio-temporal relationships. With the recent advent of 3D CNNs like the convolutional 3D (C3D) network, these spatio-temporal relationships can be modeled to improve performance. However, the accuracy of C3D networks remains an issue when applied to depression detection. In this paper, the fusion of diverse C3D predictions is proposed to improve accuracy, where spatio-temporal features are extracted from global (full-face) and local (eyes) regions of the subject. This allows the model to focus increasingly on a local facial region that is highly relevant for analyzing depression. Additionally, the proposed network integrates 3D Global Average Pooling in order to efficiently summarize spatio-temporal features without using fully connected layers, thereby reducing the number of model parameters and potential over-fitting. Experimental results on the Audio Visual Emotion Challenge (AVEC 2013 and AVEC 2014) depression datasets indicate that combining the responses of global and local C3D networks achieves a higher level of accuracy than state-of-the-art systems.
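
The two ideas named in the abstract, 3D Global Average Pooling and the fusion of global and local predictions, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy feature sizes, and in particular the weighted-average fusion rule are assumptions made here for illustration, since the abstract does not specify how the two C3D outputs are combined.

```python
from statistics import fmean


def global_average_pool_3d(features):
    """Average each channel over its whole (time, height, width) volume.

    `features` is a nested list indexed as [channel][time][row][col].
    Returns one scalar per channel, so no flattening is needed afterwards.
    """
    return [
        fmean(value for frame in channel for row in frame for value in row)
        for channel in features
    ]


def fc_weight_count(channels, frames, height, width, outputs):
    """Weights of a fully connected layer fed the flattened feature volume."""
    return channels * frames * height * width * outputs


def gap_head_weight_count(channels, outputs):
    """Weights of a linear layer fed the pooled per-channel averages."""
    return channels * outputs


def fuse_predictions(global_score, local_score, global_weight=0.5):
    """Weighted average of full-face and eye-region scores (assumed rule)."""
    return global_weight * global_score + (1.0 - global_weight) * local_score


# Toy feature volume: 2 channels, 2 frames, 2x2 spatial grid.
toy_features = [
    [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]],
    [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]],
]
pooled = global_average_pool_3d(toy_features)
print(pooled)  # [4.5, 1.0]

# Hypothetical sizes (512 channels, 2 frames, 7x7 grid, 1 output),
# chosen only to show the scale of the parameter saving.
print(fc_weight_count(512, 2, 7, 7, 1))  # 50176
print(gap_head_weight_count(512, 1))     # 512

# Hypothetical depression scores from the two networks.
print(fuse_predictions(10.0, 14.0))      # 12.0
```

With these illustrative sizes, pooling before the output layer cuts its weights by a factor equal to the size of the pooled volume (2 × 7 × 7 = 98), which is the mechanism behind the reduction in parameters and over-fitting risk that the abstract describes.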


ISBN: 978-1-7281-0089-0
ISBN Print: 978-1-7281-0090-6
Pages: 1 - 8
Article number: 8756568
DOI: 10.1109/FG.2019.8756568
Host publication: 14th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2019
Conference: IEEE International Conference on Automatic Face and Gesture Recognition
Type of Publication: A4 Article in conference proceedings
Field of Science: 113 Computer and information sciences
Funding: The financial support of the Academy of Finland, Infotech Oulu, and the Natural Sciences and Engineering Research Council of Canada is acknowledged. The first author wishes to thank the State University of Amazonas for the support.
Copyright information: © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.