MDN: A Deep Maximization-Differentiation Network for Spatio-Temporal Depression Detection

Deep learning (DL) models have been successfully applied in video-based affective computing, for instance to recognize emotions and mood, or to estimate the intensity of pain or stress of individuals from their facial expressions. Despite the recent advances with state-of-the-art DL models for spatio-temporal recognition of facial expressions associated with depressive behaviour, some key challenges remain in the cost-effective application of 3D-CNNs: (1) 3D convolutions usually employ structures with fixed temporal depth, which decreases the potential to extract discriminative representations because the difference in spatio-temporal variations across depression levels is usually small; and (2) the computational complexity of these models, with the consequent susceptibility to overfitting. To address these challenges, we propose a novel DL architecture called the Maximization and Differentiation Network (MDN) to effectively represent facial expression variations that are relevant for depression assessment. The MDN, operating without 3D convolutions, explores multiscale temporal information using a maximization block that captures smooth facial variations and a difference block that encodes sudden facial variations. Extensive experiments using our proposed MDN with models with 100 and 152 layers result in improved performance while reducing the number of parameters by more than $3\times$ when compared with 3D ResNet models. Our model also outperforms other 3D models and achieves state-of-the-art results for depression detection. Code available at: https://github.com/wheidima/MDN.


INTRODUCTION
Health care has attracted an increasing amount of interest from the computer vision and machine learning communities due to its large number of applications. It is anticipated that automatic diagnosis systems may provide effective decision support for clinicians in an explainable, unobtrusive, and objective manner regardless of the identity, gender, age, and ethnicity of the subject. Recently, much progress has been made towards this goal, especially for systems based on facial analysis [1]. Such systems leverage the fact that the face acts as a mirror of a person's health condition and may expose symptomatic signs of particular diseases, including mental health conditions. For instance, Giannakakis et al. [2] explored facial cues obtained from eye activity, mouth activity and head movements for the recognition and analysis of stress and anxiety states.
An emerging field for automatic health care diagnosis methods is depression detection. Major Depressive Disorder, also known as depression, is a common mental disorder with an immense economic burden. This disorder is associated with a negative state of mind that persists for a long time. It may cause alterations in appetite [3], sleep disturbances [4], limited ability to concentrate [3], headache [5], backache [5], stomach ache [6], anxiety [7], and loss of pleasure and/or interest in persons or things [3]. In severe cases, depression leads to suicidal behavior and substance abuse [8]. Furthermore, depression may increase the chances of developing serious clinical conditions, such as diabetes, cardiovascular disease, and cancer, and sometimes contributes to their progression [9].
Despite the gravity of depression, there are effective treatments for this disorder. Typical treatments include antidepressants, mood stabilizers, and psychotherapeutic approaches. Consequently, an accurate diagnosis of depression and its severity is crucial for immediate treatment and reduction of negative consequences. Normally, clinical practice is based on the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) specifications [10], which are assessed through structured interviews. The severity of depression is determined by employing self-report inventories, e.g., the Beck Depression Inventory (BDI), or an inventory such as the Hamilton Depression Rating (HAM-D), usually administered by a clinician with experience in treating psychiatric patients. However, some studies have shown that clinicians have difficulty recognizing depression [11], [12].
Inefficient depression detection has resulted in an alarming number of false-positives that posed serious consequences including the death of patients [12]. Moreover, the clinical interventions are normally labor-intensive, expensive and require considerable expertise in managing depressive states.
From this perspective, decision support diagnosis systems based on machine learning can provide an accurate and objective prediction of depression levels, contributing to the evaluation and monitoring of patients. Such methods may focus on visual-based nonverbal cues for depression detection. Indeed, studies have shown a set of visual cues which are correlated with depression [3], [13], [14]. These cues basically comprise specific facial expressions and dynamics, and limited levels of positive social behaviors (e.g., absence of smiles [3]). In this context, a commonly accepted approach is to automatically predict depression levels by exploring the facial information of subjects in videos.
Although several deep learning (DL) models have been proposed for depression detection from video-based facial analysis [15], [16], [17], [18], [19], [20], [21], these architectures may not consistently achieve a high level of performance. There are at least two reasons for this issue. The first one is that DL models are failing to encode the spatio-temporal information from facial expressions along depression levels. Two video sequences with distinct labels can exhibit small differences in the variations of facial expression. In this case, the use of models that explore fixed-range temporal information decreases the ability to produce discriminative representations. The second reason is the limited amount of annotated training data that is available to design the predictive architectures. When applied to depression detection, effective DL models for video representation based on 3D Convolutional Neural Networks (3D-CNNs) require optimizing a large number of parameters. Therefore, the risk of overfitting is high because of the relatively small size of training datasets.
In other application domains, e.g., action recognition [22], [23], diverse 3D architectures have been proposed to capture spatio-temporal features [23], [24], [25], [26]. However, their high level of performance is typically achieved at the expense of high computational complexity. Hence, training these architectures requires large amounts of training data and computational resources. Some authors have proposed to reduce the computational cost of 3D models by using different forms of spatio-temporal convolutions [22], [27], [28], [29], but such approaches explore fixed temporal information. We argue that this decreases the potential for generating discriminative feature representations for depression detection.
In this paper, an effective architecture, named Maximization and Differentiation Network (MDN), is proposed to explore facial expression variations at different temporal scales. This DL model is composed of a maximization block and a difference block. Given an input, the maximization block is employed to capture smooth transitions of facial structures, while the difference block encodes sudden spatio-temporal variations. These blocks do not rely on 3D filters, and their generated features are combined in a way that leads to a robust feature representation for depression detection. We design our MDN module using a residual-like structure, since such skip connections have proven effective for training CNNs. For experimental validation, our MDN module is integrated into a 3D ResNet-like architecture, although it can be incorporated into other CNN architectures.
The main contributions of this paper are: (1) the definition of maximization and difference blocks that encode the complementary smooth and sudden facial expression variations, respectively; (2) the combination of the maximization and difference blocks into an efficient MDN module such that a wide range of spatio-temporal facial variations can be explored without employing complex 3D filters; and (3) an extensive experimental study indicating that our proposed MDN provides a cost-effective solution, outperforming different 3D-CNNs and state-of-the-art models on two publicly available benchmarking datasets: AVEC2013 and AVEC2014. Experiments also show that, for deeper networks, our MDN reduces the number of parameters by around $3.3\times$ and improves the performance over 3D ResNet.
The rest of this paper is organized as follows. Section 2 provides some background on models for depression detection, as well as for spatio-temporal recognition. Our proposed MDN is presented in Section 3. Sections 4 and 5 describe the experimental methodology (datasets, protocols and performance metrics) and the validation results, while Section 6 draws the conclusions of the present work.

Automatic Depression Estimation
People affected by depression have been shown to be more likely to exhibit facial expression disturbances due to mood variations [3]. For example, the authors in [13] report the restriction of facial expressiveness (or emotional variability) associated with depressive states. In this context, sad facial expressions are shown to be more predominant [30], while depressed patients tend to make limited eye contact [3] and show less intense smiles [31]. The number of head movements and their intensity are also statistically lower when compared with healthy subjects [14]. All these visual cues have the potential to be automatically explored to support the detection, diagnosis, and assessment of depression by using videos containing human faces and machine learning approaches. In the literature, conventional machine learning approaches have been proposed, often by applying hand-engineered feature representations, e.g., Local Binary Patterns (LBP), followed by regression analysis, such as Support Vector Regression (SVR). In contrast, DL approaches perform end-to-end learning, typically using a 2D-CNN followed by a recurrent network, or using a 3D-CNN, where a regression layer generates the output.

Hand-Engineered Methods
Recently, events like the Audio-Visual Emotion Challenge and Workshop, AVEC 2013 [32] and 2014 [33], have increased the interest and number of contributions in automated depression analysis. The baseline facial descriptor in AVEC2013 [32] was the Local Phase Quantization (LPQ), and SVR was employed for prediction of depression levels. Subsequent work on the AVEC2013 dataset relied on LPQ [34], LBP [35], Pyramid of Histogram of Gradients (PHOG) [36], Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) [37], and Canonical Correlation Analysis (CCA) [38]. The baseline facial descriptor in AVEC 2014 [33] was the Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP). Subsequently, Jan et al. [39] extracted three different texture representations employing LBP, LPQ, and Edge Orientation Histograms (EOH), while temporally mapping their variations using Motion History Histogram (MHH). Kaya et al. [40] computed LGBP-TOP and LPQ features and further analyzed them using CCA, whereas Dhall et al. [41] and Jain et al. [42] employed Fisher Vectors (FV) to derive the depression levels.

Deep Learning Methods
More recently, some deep neural networks have been proposed to address depression detection and other applications in affective computing. In particular, Zhu et al. [15] proposed a DL architecture comprising two streams that use facial images and optical flow as inputs. In [43], the authors presented a Deep Transformation Learning (DTL) scheme to project facial features into a new feature subspace with the purpose of capturing the non-linearity of the data. Jazaery et al. [18] used two Convolutional 3D (C3D) networks [25] to capture spatio-temporal features at two different scales. Extending this idea, Melo et al. [19] employed two C3D networks to extract spatial and temporal features from two different facial areas. Jan et al. [21] employed a 2D-CNN to explore appearance information, while the variations of the features are encoded using Feature Dynamic History Histograms (FDHH).
Following recent trends, Residual Networks [44] have also been explored in depression detection. For example, the depression level was predicted by using a 50-layer residual network (ResNet-50) and deep distribution learning [20], whereas a ResNet-50 was used in [17] with an attention mechanism to combine facial features. In [16], four ResNet-50 networks were employed to estimate depression levels while identifying the facial regions that carry the most information about depression. Finally, Song et al. [45] presented an approach to explore behavior primitives (facial action units, head pose, and gaze directions) by transforming one-dimensional signals into their spectral representations. These representations are then fed to a DL network that performs the final regression of the depression levels. The majority of these methods exploit spatial and temporal information separately, using 2D-CNNs together with a separate technique to model the temporal variation of the facial features. However, such a separation may degrade the intrinsic spatio-temporal relationships.

Modelling Spatio-Temporal Information
To directly encode the facial appearance and dynamics for depression detection in videos, it is essential to produce efficient representations. Several DL models have been proposed to model spatio-temporal information. Tran et al. [25] proposed the C3D architecture, which was one of the first methods to capture spatial and temporal information using a 3D-CNN. Carreira et al. [24] proposed the Inflated 3D-ConvNet (I3D), which transforms the 2D Inception model into a 3D-CNN by inflating all the filters and pooling kernels. In [26], the authors explored the effectiveness of diverse 3D-CNN architectures based on residual networks (3D ResNet). Feichtenhofer et al. [23] presented the SlowFast network, which is composed of a slow path to explore spatial semantics and a fast path to explore motion at fine temporal resolution.
In general, all these architectures have structures with fixed temporal depth. In this case, it is difficult to generate effective feature representations for depression detection since the difference in spatio-temporal variations between the depression levels is often small. Moreover, the number of model parameters to optimize is typically very large, which increases the chances of overfitting due to the limited amount of annotated training data that is available for depression detection. Some authors have proposed different techniques of spatio-temporal convolutions. In particular, Tran et al. [27] factorized the 3D convolution into two cascaded operations, a 2D convolution (spatial) and a 1D convolution (temporal). Xie et al. [28] investigated various forms of 3D-CNNs, among which Top-Heavy I3D, which employs 2D structures in the lower layers and 3D structures in the upper layers, presents better performance. In [29], the authors proposed a Pseudo-3D Residual Network (P3D ResNet) by using a spatio-temporal decomposition on a residual learning module. Finally, Jiang et al. [22] proposed to encode spatio-temporal and motion features jointly using 2D and 1D CNNs. The proposed MDN module also decomposes the 3D convolution operation, but our approach does not employ 1D convolutions. Instead, we use 2D convolutions and two functions without trainable parameters to capture features at multiple ranges.

THE PROPOSED MAXIMIZATION AND DIFFERENTIATION NETWORK
The face of a person suffering from depression exhibits specific spatio-temporal patterns of variation. The goal of automatic depression detection from videos is to encode the facial expression variations that carry the most relevant discriminative information. In this context, our proposed approach captures the spatial and temporal information without using 3D filters, which limits model complexity. We propose a maximization block to summarize spatio-temporal information, and a difference block to encode the details of the spatio-temporal variations. These blocks are combined into the MDN module.
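To make the complexity argument concrete, the following sketch counts the weights of a single dense convolution layer with and without a temporal kernel dimension. The kernel sizes (3 x 3 x 3 versus 3 x 3) and the channel counts are hypothetical values chosen for illustration; they are not taken from the MDN configuration, and the per-layer ratio shown here is not the whole-network reduction reported in the experiments.

```python
def conv_weight_count(kernel_shape, in_channels, out_channels):
    """Number of weights (bias excluded) in a dense convolution layer."""
    count = in_channels * out_channels
    for k in kernel_shape:
        count *= k
    return count


# Hypothetical layer sizes, for illustration only.
in_ch, out_ch = 64, 64
params_3d = conv_weight_count((3, 3, 3), in_ch, out_ch)  # time x height x width
params_2d = conv_weight_count((3, 3), in_ch, out_ch)     # height x width only
ratio = params_3d / params_2d

print(params_3d, params_2d, ratio)
```

Under these assumptions, dropping the temporal kernel dimension divides the weight count of the layer by the temporal kernel size, which is the saving that parameter-free temporal operations are meant to preserve.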

Maximization Block
The idea of the maximization block is to model global spatial and temporal variations. Using a function that summarizes such variations in a cascade with 2D convolutional layers allows the module to extract relevant spatio-temporal features, which can improve the performance of a depression detection model. As the block is based on the max function, it has the potential to capture smooth facial variations. Given that, for an input feature map, the semantic information is redundant along the temporal depth, we claim that such information can be summarized by a simple operation without using trainable parameters. Let $X \in \mathbb{R}^{N \times T \times H \times W \times C}$ represent an input feature map, where $N$, $T$, $H$, $W$ and $C$ are the batch size, temporal depth, height, width, and the number of channels, respectively. We formally define the operation as

$$V_{t,h,w} = \max_{k \in \{0, \dots, l-1\}} X_{t+k,h,w},$$

where $V$ is the spatio-temporal representation, $l$ is the length of the sliding window used to perform max pooling along the depth axis, and $t$, $h$, $w$ denote the depth and the spatial dimensions, respectively. Note that this representation has the same dimensions as the input feature map. Instead of exploring spatio-temporal variations with structures that employ fixed temporal depths, our proposed block uses different ranges of dynamics, which helps to capture supplementary information for depression representation. As shown in Fig. 1, the maximization block is composed of $N$ branches, each of which can operate in a distinct range, i.e., $l_1, l_2, \dots, l_N$. It is important to note that a higher number of branches increases the number of parameters, which in turn increases the training time of the model. On the other hand, a small number of branches may decrease the capabilities of the model.
Let $x_i$ denote the output of branch $i$. Then the block's output can be expressed by

$$z = H\{S(x_1, x_2, \dots, x_N)\},$$

where $z$ represents the final feature map, $H\{\cdot\}$ is a fusion function carried out by a $1 \times 1 \times 1$ convolutional layer, $S$ is the operation that concatenates the output of each branch, and $N$ refers to the number of branches. This procedure encodes a set of spatio-temporal information in a single map, which can convey in its texture information about movement, favoring the exploitation of the dynamics by a set of 2D filters. Moreover, as our approach is based on structures with variable temporal depth, the use of 2D filters rather than 3D filters avoids an exponential increase in the number of parameters and decreases the risk of overfitting.
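The parameter-free part of the maximization block can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: each frame is represented as a flat list of feature values, the window is assumed to start at the current frame, and windows are truncated at the end of the clip so that the temporal depth is preserved. The helper names `temporal_max` and `maximization_branches` are hypothetical, and the 2D convolutions and the $1 \times 1 \times 1$ fusion layer are omitted.

```python
def temporal_max(frames, window):
    """Sliding-window maximum along the temporal axis.

    frames: list of T frames, each a flat list of feature values.
    window: number of consecutive frames covered by each maximum.
    Returns T frames; windows are truncated at the end of the clip
    (an assumed boundary rule that keeps the temporal depth unchanged).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    depth = len(frames)
    out = []
    for t in range(depth):
        group = frames[t:min(t + window, depth)]
        # Element-wise maximum over the frames inside the window.
        out.append([max(values) for values in zip(*group)])
    return out


def maximization_branches(frames, windows):
    """One temporal-max map per branch, each with its own window length."""
    return [temporal_max(frames, l) for l in windows]


clip = [[1, 5], [3, 2], [0, 4], [2, 2]]  # T = 4 frames, 2 features each
branches = maximization_branches(clip, windows=[1, 2])
print(branches[1])
```

With a window of length 1 the operation returns the input unchanged, which matches the remark that a depth of 1 is used when the input has no remaining temporal extent.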

Difference Block
To generate a robust representation of facial variations, it is important to encode sudden transitions of facial structures. These transitions can, for example, assist the model to analyze segments of a video with similar facial expression variations. Motivated by this, we propose a structure called difference block that explores the velocity of facial expression variations.
Let $X \in \mathbb{R}^{N \times T \times H \times W \times C}$ define the input feature maps. The first step of the difference block is to compute the absolute value of the difference between the feature maps. This operation is defined by

$$H_t = \lvert X_{t+i} - X_t \rvert,$$

where $H_t$ is the output of the operation, $t$ is the temporal depth, and $i$ represents the $i$th-order difference. Similar to the maximization block, the difference block is formed by $N$ branches, which obtain the velocity of the spatio-temporal variations by performing differences of order $i_1, i_2, \dots, i_N$. In our implementation, we keep the depth of the output equal to that of the input feature map by adding zeros to the input features when carrying out the operation.
As the difference block is designed to explore short variations, lower-order differences should be employed, such as 1, 2 and 3. A difference block with a high order is useful to explore long-term variations. As we can see in Fig. 1, 2D filters explore the spatial dependencies in the feature maps generated in this process. Finally, the block's output is given by $H\{S(h_1, h_2, \dots, h_N)\}$, where $h_n$ is the output of the $n$th branch, $N$ is the number of branches, $S$ is the concatenation operator, and $H$ is the fusion function. We perform the concatenation along the channel axis in both the difference and maximization blocks.
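The difference operation can be illustrated in the same style. The sketch below is an assumption-laden reading of the description, not the authors' code: the order is interpreted as the temporal lag between the two frames being compared, zero frames are appended at the end of the input so that the output keeps T frames, and the helper name `temporal_abs_difference` is hypothetical. The 2D filters and the fusion function are again left out.

```python
def temporal_abs_difference(frames, order):
    """Absolute difference between frames that are `order` steps apart.

    frames: list of T frames, each a flat list of feature values.
    order:  temporal lag between the two frames being compared.
    Zero frames are appended to the input so that the output keeps T
    frames (where the padding goes is an assumption of this sketch).
    """
    if order < 1:
        raise ValueError("order must be >= 1")
    depth = len(frames)
    width = len(frames[0]) if frames else 0
    padded = list(frames) + [[0.0] * width for _ in range(order)]
    return [
        [abs(later - now) for now, later in zip(padded[t], padded[t + order])]
        for t in range(depth)
    ]


clip = [[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]]  # T = 3 frames, 2 features each
first_order = temporal_abs_difference(clip, 1)
second_order = temporal_abs_difference(clip, 2)
print(first_order, second_order)
```

One consequence of zero padding worth noting is that the last `order` output frames reduce to the absolute value of the input frame itself, so they carry appearance rather than motion information.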

MDN Module
The combination of the maximization with difference blocks generates the MDN module. Observe that the maximization block and difference block can explore distinct spatio-temporal information and can also operate in different temporal ranges. With that, our MDN module has the potential to encode spatial and temporal information from smooth and sudden facial variations. Such ability can significantly boost the performance of a model for automatic depression detection.
As shown in Fig. 1, the outputs of the maximization and difference blocks are merged using a linear combination, which fuses the features of the two blocks by addition. For that, it is necessary to make sure that the dimensions of the output feature maps of the blocks are the same. We use the fusion function $H$ in the blocks to adjust the feature maps. The advantage of this approach is that it reinforces the complementary behavior of the blocks. Then, an additional $1 \times 1 \times 1$ convolutional layer is employed to adjust the number of channels to match the input feature maps, since we employ our MDN module inside structures with residual-like connections which additionally fuse the features, as illustrated in Fig. 2. Given this latter convolutional layer, we consider our MDN module to have two layers.
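The data flow described in this subsection can be summarized compactly. The notation below is introduced here for illustration only and is not taken from the paper: $z_{\max}$ and $z_{\mathrm{diff}}$ stand for the fused outputs of the maximization and difference blocks, $N_m$ and $N_d$ for their numbers of branches, $W_{1\times1\times1}$ for the final channel-adjusting convolution, and $M(X)$ for the module output before any shortcut connection is applied.

```latex
\begin{aligned}
z_{\max} &= H_{\max}\{S(x_1, \dots, x_{N_m})\}, \\
z_{\mathrm{diff}} &= H_{\mathrm{diff}}\{S(h_1, \dots, h_{N_d})\}, \\
M(X) &= W_{1\times1\times1} * \left(z_{\max} + z_{\mathrm{diff}}\right).
\end{aligned}
```

The addition in the last line is only well defined when $z_{\max}$ and $z_{\mathrm{diff}}$ have identical shapes, which is why each block ends with its own fusion function.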

Datasets
To evaluate the performance of our proposed MDN, we conduct extensive experiments on two publicly available benchmark datasets, namely the Audio-Visual Emotion Challenge 2013 and 2014 (AVEC2013 [32] and AVEC2014 [33]) depression sub-challenge datasets. These datasets were employed in the AVEC sub-challenge, where the goal was to estimate the score of individuals on the Beck Depression Inventory (BDI-II). According to the BDI-II score, the severity of depression can be classified in four levels: minimal (0–13), mild (14–19), moderate (20–28), and severe (29–63).
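The four severity bands translate directly into a lookup, which is convenient when a regression output needs to be read as a clinical category. The function name `bdi_severity` is hypothetical, and rounding a continuous prediction to an integer score before the lookup is left to the caller.

```python
def bdi_severity(score):
    """Map an integer BDI-II score (0 to 63) to its severity level."""
    if not 0 <= score <= 63:
        raise ValueError("BDI-II scores range from 0 to 63")
    if score <= 13:
        return "minimal"
    if score <= 19:
        return "mild"
    if score <= 28:
        return "moderate"
    return "severe"


levels = [bdi_severity(s) for s in (0, 13, 14, 19, 20, 28, 29, 63)]
print(levels)
```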
Although it is possible to find other publicly available datasets for depression assessment, such as AVEC 2016 [46], these datasets only provide feature sets of the individuals. Our proposed architecture is designed to explore spatio-temporal dependencies directly from facial videos. To the best of our knowledge, the AVEC2013 and AVEC2014 datasets are the only ones that currently provide raw facial video data. For this reason, and following the state of the art, we benchmark our experiments on these two datasets.
The AVEC2013 dataset is derived from a subset of the audio-visual depressive language corpus (AViD-Corpus). The subjects were recorded during an interaction with a computer while performing diverse tasks, such as counting from 1 to 10. In total, the dataset contains 150 video clips allocated into three different partitions: training, development and test sets. Each set consists of 50 videos, each labelled with the depression score of the subject. The videos have a duration ranging between 20 and 50 minutes, with an average video length of 25 minutes.
The AVEC2014 dataset is also a subset of the AViD-Corpus. For this dataset, the subjects are recorded while performing two tasks, named Freeform and Northwind. In the Freeform task, the subjects respond to prompts such as discussing a sad childhood memory. In the Northwind task, subjects read aloud an excerpt from a fable. In both tasks, the videos are allocated into three partitions: training, development, and test sets. Each set contains 50 videos with a ground truth numerical label for every video. The dataset consists of 300 videos that range between 6 and 248 seconds. For both datasets, the frame rate of the videos is 30 frames per second (fps).

Experimental Setup
Since our MDN module is designed to be embedded in structures with identity shortcut connections, the proposed architecture is based on 3D residual networks [26], although the MDN module could also be employed in other 3D networks (e.g., I3D or C3D [24], [25]). In this subsection, we describe the resulting deep network architecture.
The MDN architecture is a convolutional network which explores spatio-temporal variations using maximization and difference structures. Employing our MDN module, five networks are built with sizes of 18, 34, 50, 100 and 152 layers. The details of the networks are presented in Table 1. All the networks have a first layer (conv1) with one block, while the other layers have different numbers of blocks. Only conv1 uses a typical 3D convolution because it employs a different temporal kernel depth. Moreover, we employ a different temporal depth for each channel of the maximization block, considering the input size, to benefit the exploitation of the features. However, when the temporal depth of the input is equal to 1, we set the depth to 1. Observe that the networks employ an MDN module with a maximization block composed of 3 branches, whereas the difference block has 2.
In the next section, we analyze the effect of changing the number of branches and the temporal range that is explored by the blocks.
After the sequence of convolutions, the average pooling layer with kernel size $4 \times 4 \times 1$ produces a 256-dimensional feature vector which is fed to the last layer. As depression detection from facial videos can be considered a regression problem, our last layer is composed of one fully connected layer and a linear regression function that we implemented using an additional fully connected layer with one neuron.
Training. Because the AVEC2013 and AVEC2014 datasets contain too little training data to train a deep architecture from scratch, the proposed MDN networks are initially trained on face recognition. The networks are pre-trained on the VGGFace2 dataset, which includes 3.31 million images of 9,131 identities [47]. In this process, an image is selected from the dataset and replicated 16 times in order to make a clip that is fed into the model. We employ Stochastic Gradient Descent (SGD) with a momentum of 0.9, a weight decay of 0.0001, and an initial learning rate of 0.01. The learning rate is multiplied by 0.1 after every 10 epochs. At this stage, the per-channel average value of VGGFace2 is subtracted from the input values. In this face recognition pre-training, the last layer of the models is a classification layer that is removed in the next stage.
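The pre-training learning-rate schedule described above is a plain step decay, which can be written out explicitly. The function name `step_decay_lr` is hypothetical, and counting epochs from zero (so that the first drop happens at epoch index 10) is an assumption of this sketch.

```python
def step_decay_lr(epoch, base_lr=0.01, factor=0.1, step=10):
    """Learning rate at a given zero-indexed epoch under step decay.

    The rate starts at `base_lr` and is multiplied by `factor` once
    every `step` epochs.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * factor ** (epoch // step)


schedule = [step_decay_lr(e) for e in (0, 9, 10, 19, 20)]
print(schedule)
```

In a deep learning framework the same behaviour is usually obtained from a built-in step scheduler rather than a hand-written function; the explicit form is shown only to make the schedule unambiguous.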
For the fine-tuning stage, the ADAM optimization algorithm is adopted with an initial learning rate of 0.001 and a weight decay of 0.00001. The learning rate is multiplied by 0.1 after each epoch until it reaches a lower limit of 0.00001. To build one training sample, a frame inside the video is randomly chosen and the subsequent frames are collected, where the frame sampling is empirically set to 7.5 fps. We pre-process the input by using the Multi-task Cascaded Convolutional Network (MTCNN) [48] for face detection and alignment in each frame of the video; the aligned faces are subsequently resized. This results in samples of 112 pixels $\times$ 112 pixels $\times$ 16 frames. For data augmentation, each sample is horizontally flipped with 25 percent probability, randomly rotated to 10 degrees with 25 percent probability, and turned upside down with 25 percent probability. The training samples created in this process are labelled using the same depression score as their original videos.
Testing. At test time, we analyze the facial video by dividing it with a sliding window into non-overlapping clips of 16 frames each. The final estimated depression score for an individual in a sample (input video) is obtained by simply averaging the predicted depression scores of all clips that compose the test video.
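The test-time protocol amounts to cutting the video into consecutive 16-frame clips and averaging the per-clip predictions. The sketch below shows that bookkeeping with hypothetical helper names. Two points are assumptions rather than statements from the text: an incomplete trailing clip is dropped, and the `stride` parameter (4 would correspond to taking 7.5 fps from a 30 fps video) is included only as an option, since the description does not say whether test clips are subsampled in the same way as training clips.

```python
def split_into_clips(num_frames, clip_len=16, stride=1):
    """Frame indices of consecutive, non-overlapping clips.

    Each clip holds `clip_len` indices spaced `stride` frames apart.
    A trailing group that cannot fill a whole clip is dropped
    (an assumption of this sketch).
    """
    span = clip_len * stride
    clips = []
    start = 0
    while start + span <= num_frames:
        clips.append(list(range(start, start + span, stride)))
        start += span
    return clips


def video_score(clip_scores):
    """Video-level prediction: the mean of the per-clip predictions."""
    if not clip_scores:
        raise ValueError("at least one clip score is required")
    return sum(clip_scores) / len(clip_scores)


clips = split_into_clips(40)                    # 2 full clips, 8 frames left over
strided = split_into_clips(130, stride=4)       # each clip spans 64 source frames
final_score = video_score([10.0, 12.0, 14.0])   # placeholder clip predictions
print(len(clips), len(strided), final_score)
```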
Evaluation Measures. For performance evaluation of the proposed architecture and a fair comparison with the state-of-the-art methods, two metrics are employed: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
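Both metrics follow their standard definitions and are computed over video-level predictions. A minimal reference implementation is given below; the function names and the example values are illustrative and do not correspond to any result in the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error between ground-truth and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error between ground-truth and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


truth = [10, 20, 30]        # made-up BDI-II labels
predicted = [12, 18, 33]    # made-up model outputs
print(mae(truth, predicted), rmse(truth, predicted))
```

Because RMSE squares the errors before averaging, it penalizes a few large mistakes more heavily than MAE does, which is why the two metrics can rank models differently.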

RESULTS AND DISCUSSION
In this section, we show the efficiency of the proposed approach in exploring spatio-temporal dependencies from facial dynamics. First, we present the pre-training strategy and an analysis of different configurations of the MDN module. Then, we build different networks using our proposed module and compare them with the standard 3D ResNet, other 3D schemes, and the state-of-the-art methods. Next, we perform cross-database and error analyses and visualize the features generated by our architecture and the activation maps. Finally, we evaluate our method for pain estimation.

Pre-Training of MDNs
Properly initialized weights for fine-tuning towards depression detection can significantly improve the performance of deep networks. Table 2 reports results of MDN-50 pretrained on ImageNet [52], and VGGFace2, as well as without pre-training. The results clearly indicate that MDN-50 achieves significantly better performance when pre-trained on large datasets. The model achieves its best performance when pretrained on VGGFace2, although the results are very competitive on AVEC2014. This is expected since VGGFace2, AVEC2013, and AVEC2014 are face datasets. Therefore, once the MDN is pre-trained on VGGFace2, the MDN develops the ability to explore facial structures which can be considered as basis to encode the spatio-temporal variations in faces. Table 1 shows the definitions of the networks that consider an MDN module with 5 branches, using 3 branches to explore smooth information and 2 branches to capture the sudden temporal variations. However, the MDN module can be configured with a different number of branches and orders in both maximization and difference blocks. In this section, we study the effect of changing the number of branches in the maximization and difference blocks as well as the value of the temporal range that is analyzed. We conduct the study considering several configurations of the MDN module for MDN-50, i.e., MDN model with 50 layers. Since both datasets, AVEC2014 and AVEC2013, contain similar face videos and the analysis requires the training of several models and a long training process, we performed this analysis solely on the AVEC2014 dataset. Table 3 reports the performance of the MDN-50 employing various configurations for MDN module. Specifically, we analyze the models for depth in range of 1 l 4 where the temporal depth of input features is considered to define the values of depth in each layer of the model. Regarding order values, we define i n ¼ n where n is the nth branch. 
The first model employs MDN module without difference block whereas the second one uses the module without maximization block. Results in Table 3 indicate that the models achieve similar results. Observe that the third model, which employs both blocks with the same configuration, achieves better results than both models, indicating the importance of exploring smooth and sudden information. Moreover, the networks with MDN module using an order equal to one, and one branch for exploring smooth information, achieve a better performance by using the sequence of 4, 3, 2 and 1 as depth values. Applying this sequence in the maximization block normally contributes to improve the performance of the model. In general, increasing the number of branches also improves the results of the model. However, the value of depth and order should be carefully chosen. For instance, the model using MDN module which captures temporal variations with values of depth equal to 1 and 2, and the sequence of 4, 3, 2 and 1 as depth values, outperformed all the models with just two branches, one for maximization block and other for difference block. However, this is not true for the other models using 2 branches for difference block and 1 for maximization block. Comparing the results when the MDN module is formed by using two maximization blocks and one difference block with this one composed by one maximization block and two difference blocks, we can see that the performance is competitive. Similar findings can be observed when we employ three branches for the maximization (or difference) block, and two for the difference (or maximization) block.

MDN Module Branch Number Analysis
As can be seen from Table 3, the performance of the models using the MDN module with two branches for the maximization/difference block and three branches for the difference/maximization block is very similar to that of the models using four branches, two for each block. Moreover, the model with an MDN module employing orders equal to 1 and 2, combined with a maximization block that uses three branches, achieves the best result in terms of RMSE. Based on these results, in the subsequent experiments we specify our MDN module using two branches in the difference block (Order = [1,2]) and three branches in the maximization block (see the last entry of Table 3).
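To make the roles of the two blocks more concrete, the following is a minimal sketch of one plausible reading of their core operations: a temporal maximum over a window of a given depth (smooth variations) and an n-th order finite difference along time (sudden variations). It is not the authors' implementation. The flattened per-frame feature layout, the sliding-window form of the maximum, and the function names `temporal_max` and `temporal_difference` are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]  # hypothetical layout: one flattened feature map per time step


def temporal_max(frames: List[Frame], depth: int) -> List[Frame]:
    """Element-wise maximum over a sliding temporal window of `depth` frames."""
    out = []
    for t in range(len(frames) - depth + 1):
        window = frames[t:t + depth]
        out.append([max(values) for values in zip(*window)])
    return out


def temporal_difference(frames: List[Frame], order: int) -> List[Frame]:
    """n-th order finite difference along time (order 1: x[t+1] - x[t])."""
    out = frames
    for _ in range(order):
        out = [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(out, out[1:])]
    return out


# Toy clip with 4 time steps and 2 features per step (made-up values).
clip = [[0.0, 1.0], [1.0, 1.0], [4.0, 0.0], [9.0, 2.0]]
smooth = temporal_max(clip, depth=2)            # one maximization branch
sudden_1 = temporal_difference(clip, order=1)   # difference branch, order 1
sudden_2 = temporal_difference(clip, order=2)   # difference branch, order 2
```

In this sketch, a configuration such as Order = [1,2] with three maximization branches would correspond to calling `temporal_difference` with orders 1 and 2 and `temporal_max` with three different depths; how the branch outputs are aligned and merged is left out because the excerpt does not specify it.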

Comparison With 3D Models
In order to show the efficiency of our approach, we present results of the proposed architecture and other 3D models. We begin by comparing our method with 3D ResNet in terms of RMSE, MAE and computational complexity. In addition, we also compare our architecture with the Inflated 3D ConvNet (I3D) and Temporal 3D ConvNet (T3D) models. All 3D ResNet, I3D, and T3D models are trained following the same procedure as our proposed method: we first pre-train the model on the VGGFace2 dataset, and then fine-tune it on either the AVEC2014 or the AVEC2013 dataset. Table 4 reports the results for several MDN configurations and 3D ResNets on AVEC2013. As can be seen, the performance of the MDN improves with the increase of the network depth, except for MDN-152 in terms of MAE, where MDN-100 achieves better results. As the difference in performance between MDN-152 and MDN-100 is small, we suspect that MDN-152 could already be starting to overfit. It is also possible to observe that MDN-100 and MDN-152 achieve a considerable improvement in performance when compared with the smaller MDN-18. Table 4 also shows that the MDN outperforms 3D ResNet for depression detection in terms of RMSE. Considering MAE, the results achieved by 3D ResNet-18 are slightly better than those of MDN-18, while 3D ResNet-34 and MDN-34 obtain the same results. However, for the deeper models, the MDN consistently outperforms the 3D ResNet approaches by a large margin in terms of both RMSE and MAE. For instance, MDN-100 reduces the MAE by 0.63 compared to 3D ResNet-101. From these results, we argue that MDN models are a better option for depression detection than their 3D ResNet counterparts.
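For reference, the two error measures used throughout these comparisons can be computed as in the short sketch below. The function names and the toy scores are illustrative only and are not taken from the paper.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between ground-truth and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error; penalizes large deviations more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up depression scores for three test videos.
truth = [10.0, 20.0, 30.0]
preds = [12.0, 17.0, 30.0]
example_mae = mae(truth, preds)    # (2 + 3 + 0) / 3
example_rmse = rmse(truth, preds)  # sqrt((4 + 9 + 0) / 3)
```

Because RMSE squares each error before averaging, a model can rank better on RMSE and worse on MAE (or the reverse), which is why both are reported.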

Analyses on AVEC2014
In Table 4, we show the results for the MDN networks and 3D ResNet on AVEC2014. As can be seen, the results of the MDN models again improve with the increase of the network depth, except for MDN-18 and MDN-34, which achieve the same results in terms of MAE. For this dataset, MDN-152 achieves the best results, reducing the RMSE by 1.43 compared to MDN-18. Therefore, MDN-152 does not seem to overfit on this dataset.
We also show in Table 4 that the MDN outperforms 3D ResNet in terms of RMSE. In terms of MAE, the results achieved by 3D ResNet-34 are better than those of MDN-34, but for the other models the MDN outperforms the 3D ResNet approaches in terms of both RMSE and MAE. For example, MDN-152 reduces the RMSE by 0.30 compared to 3D ResNet-152. The results in Table 4 indicate that the MDN architecture could be handling problems such as ambiguity and overfitting better than 3D ResNet, especially for models with larger network size. Table 4 also presents the computational complexity comparison between the proposed MDN and 3D ResNet architectures. The number of parameters of the MDN models is considerably smaller than that of 3D ResNet. MDN-18 and MDN-34 have almost 5 times fewer parameters than 3D ResNet-18 and 3D ResNet-34, whereas the deeper MDN models (with 100 and 152 layers) have almost 3.3 times fewer parameters than the deeper 3D ResNet models. We also show the number of Floating Point Operations (FLOPs) of the architectures as a measure of computational cost. MDN models require a considerably smaller number of FLOPs than 3D ResNet models. For example, in the case of models with 34 layers, the number of FLOPs of the MDN is approximately 2 times smaller than that of 3D ResNet.

Comparison With Other 3D Methods
We compare our method with other well-known 3D models, I3D and T3D. The I3D model [24] is composed of a basic structure called the inception module, which is obtained by inflating the 2D filters and pooling kernels of a 2D version of the module. The T3D model [49] contains structures called temporal transition layers, which are responsible for capturing temporal information in different ranges. These two architectures have been successfully employed in action recognition, and the comparison with such models is important to measure the capabilities of the proposed MDN architecture. Table 5 shows a direct comparison between the performance of I3D [24], T3D [49], and our three best models (MDN-50, MDN-100, and MDN-152); in the tables, P. and F. represent parameters ($\times 10^6$) and FLOPs ($\times 10^9$), respectively. When compared with T3D, the I3D has competitive results with a smaller number of parameters. On the other hand, our proposed networks outperform the I3D model on both datasets (except for MDN-50 on AVEC2014 in terms of MAE). We can observe that the difference in performance is higher in terms of RMSE. Regarding the T3D model, our models achieve better results, where the difference in terms of RMSE on AVEC2013 is 1.13 for the MDN-100 model. In Table 5, we also present the computational complexity of the I3D, T3D and MDN architectures. Compared to T3D, the MDN-50, MDN-100 and MDN-152 models use fewer parameters and require a smaller number of FLOPs. The I3D model employs even fewer parameters than our MDN models, and requires around $6.99 \times 10^9$ FLOPs, at the cost of worse performance. These results are expected since our model is designed to explore both sudden and smooth temporal information.
In summary, the structures of the MDN module based on complementary functions and diverse depths demonstrated good potential to capture spatio-temporal variations in facial expressions. The proposed architecture can learn how to obtain a rich representation of the facial expression variations even with limited training data. The results of the proposed models indicate good performance to explore appearance and dynamics of facial videos for depression detection.

Comparison With State-of-the-Art
We compare the performance of our three networks, MDN-50, MDN-100, and MDN-152, with the state-of-the-art methods for depression detection on AVEC2013 and AVEC2014 datasets. Table 6 shows the performance of our proposed method compared with baselines and state-of-the-art methods on AVEC2013 dataset. The methods based on hand-engineered representations are [32], [35], [36], [37], [38], [50]. All these methods are outperformed by our MDN networks. Zhu et al. [15] proposed a method based on two-stream networks which uses RGB frames and optical flow as input. The proposed models achieve better results than this method, indicating that having structures with capability of capturing multiple ranges of information is effective for depression detection. In [18] and [19], the authors explore different facial regions using two C3D models. The MDN models outperform both methods, demonstrating the power of the model in exploring diverse facial regions. When compared with the models that employ one or more ResNet-50, MDN-50 achieves very competitive results, although MDN-50 employs fewer parameters.

Comparisons on AVEC2013
MDN-100 and MDN-152 outperform the method in [20], which is based on distribution learning. In [16], the authors employ four ResNet-50 networks to explore facial areas; MDN-100 outperforms this method, whereas MDN-152 obtains better results in terms of RMSE and competitive performance in terms of MAE. We believe that such results confirm the importance of directly capturing spatio-temporal information with the MDN module rather than only appearance information. Song et al. [45] explore multiple behavior signals using Fourier transforms and a CNN. It can be observed that the performance of MDN-100 and MDN-152 surpasses this model, although, in terms of MAE, the results of MDN-152 are competitive. Finally, the authors in [51] employ a two-stream network where a temporal pooling method captures dynamic information into an image map. As we can see, MDN-100 and MDN-152 achieve better results in terms of RMSE, and competitive results in terms of MAE, when compared with the method in [51].
Table 7 reports the comparative results of our proposed models and the state-of-the-art on the AVEC2014 dataset. Our methods outperform the schemes based on hand-crafted features, namely [33], [39], [40]. The authors in [43] apply Deep Transformation Learning (DTL) to encode deep features. The MDN models yield lower values of RMSE and MAE than this method. We can observe that MDN-50 achieves good results where, in terms of RMSE, it is only outperformed by the methods in [21], [51], which employ many more parameters. MDN-100 and MDN-152 achieve better results than the method in [17], which explores facial features with attention mechanisms. MDN-152 outperforms the methods in [15], [16], [18], [19], [20], [51]. The authors in [21] employ a VGG network to explore facial images and use the Feature Dynamic History Histogram (FDHH) to map the changes in the features. MDN-100 and MDN-152 achieve better results compared to this approach.
Observe that our models outperform methods with different pooling schemes for facial static features, providing a good alternative approach to these methods. From the results in Tables 6 and 7, we can claim that our architecture is an efficient option to capture spatio-temporal information related to depressive behaviors from facial videos.

Task-Based Comparisons on AVEC2014
In Table 8, we present the performance of the proposed model for each task and for their combination, as well as the results presented in [45]. As mentioned previously, AVEC2014 has two tasks, Freeform and Northwind, both of which are considered in the analysis. The results of our architecture for each task are very similar, indicating that our method maintains good performance in exploring spatio-temporal variations regardless of the task. The combination of the tasks is carried out by a simple score fusion scheme considering all the values generated in both videos. It is important to note that each participant has one sample in each task (Freeform/Northwind) with the same depression score. Therefore, our method generates predictions by analyzing both samples. As we can see, the performance of the method improves with the fusion of the tasks, since the score fusion acts as a regularizer that minimizes the effect of outliers. When compared against the method in [45], we observe that our architecture outperforms it in both the task-based and the combined-task settings. The results suggest that our architecture can produce more discriminative features using facial videos than models based on the analysis of high-level behavior features.
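The fusion step can be illustrated with a small sketch. The text only says that a "simple score fusion scheme" uses all values generated in both videos; the version below assumes this means averaging every clip-level prediction from the Freeform and Northwind videos of a participant. The function name `fuse_scores` and the numbers are hypothetical.

```python
from statistics import mean
from typing import Sequence


def fuse_scores(freeform_preds: Sequence[float], northwind_preds: Sequence[float]) -> float:
    """Average all clip-level predictions from both task videos (assumed fusion rule)."""
    return mean(list(freeform_preds) + list(northwind_preds))


# Made-up clip-level predictions for one participant; 30.0 plays the role of an outlier.
freeform = [10.0, 12.0, 14.0]
northwind = [11.0, 30.0]
fused = fuse_scores(freeform, northwind)  # (10 + 12 + 14 + 11 + 30) / 5
```

Under this assumption, a single outlying clip prediction is diluted by the remaining clips from both videos, which is consistent with the regularizing effect described above.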

Cross-Database Analysis
To assess the generalization capability of MDN-152, we perform cross-database validation on the AVEC2013 and AVEC2014 datasets. In this procedure, the model is trained on the source database and tested on the target database. Table 9 presents the results of this experiment. As can be seen, when the source is the AVEC2013 dataset, the performance of the model degrades slightly when compared with AVEC2014 as the source database. However, in both cases, the results are competitive with the ones shown in Table 4. The representations learned by our proposed model thus provide good generalization ability.
Fig. 3 shows an analysis of our MDN based on depression feature maps generated by the maximization and difference blocks. To facilitate the analysis and visualization, we consider the MDN module employed in the res2_1 layer. Fig. 3a presents a frame from the RGB input clip being analyzed, while the output of the max pooling layer that follows the conv1 layer, which is fed into the res2_1 layer, is shown in Fig. 3b. This provides insight into the type of input the MDN module receives. From the output of the maximization block, it can be noticed that the scheme spreads more energy along the face of the subject compared to the original input features. This suggests that the block is paying attention to global spatio-temporal information, which increases the potential to explore smooth facial variations. With respect to the difference block, it can be seen that it captures the motion of feature maps and, since the block is based on first and second-order differences, this allows the module to explore sudden spatio-temporal variations. The complementary characteristics of the blocks build a module with the potential to explore rich spatio-temporal variations.

Visualization of Activation Maps
In order to interpret how the MDN architecture predicts depression scores from faces, we visualize class activation maps produced using the Grad-CAM method [53]. Fig. 4 shows an example of class activation maps produced for 4 distinct depression levels, which were interpolated and overlaid onto the corresponding facial images. The facial areas that contribute most to the prediction are represented by lighter colors. As shown in the figure, our model pays high attention to an area from the eyes to the chin. Interestingly, the most active area in all cases is the region that covers the mouth. It is important to observe that manifestations of depression include slow speech, fewer smiles, mouth shape, etc., which are characteristics that the model may explore. These visualizations show that the MDN behaves differently from models such as the one in [16], which changes the most important facial area in accordance with the depression level. This indicates that MDN may rely on more optimal facial regions to explore spatio-temporal variations.
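As a reminder of what Grad-CAM computes, the sketch below applies its standard recipe to toy data: each feature map is weighted by the spatial average of the gradient of the output with respect to that map, the weighted maps are summed, and a ReLU keeps only positively contributing locations. It is a simplified 2D illustration, not the authors' pipeline. It assumes that the activations and gradients have already been extracted from a network, and that the differentiated output is the predicted depression score.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Grad-CAM map from per-channel activations and gradients of equal shape."""
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for act, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient over spatial positions.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]
    # ReLU: keep only locations with a positive influence on the output.
    return [[max(0.0, value) for value in row] for row in cam]


# Two toy 2x2 feature maps with made-up gradients (channel weights 1.0 and -2.0).
acts = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
heatmap = grad_cam(acts, grads)
```

In practice the resulting low-resolution map is interpolated to the image size and overlaid on the face, as described for the figure above.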

Error Analysis
In order to provide more information about the capabilities of the proposed architecture in a way that can be translated into clinical practice, we present the error for each sample (videos in the testing set) of the AVEC2013 and AVEC2014 datasets. We depict the errors in Fig. 5, ordered from the video presenting the smallest error to the one presenting the largest error. By observing the figure, we can conclude that the probability of error is approximately equally distributed from 0 to 12.5, with only a few outliers over that value. More concretely, the model achieves error less than 6.0 for more than 60 percent of the samples for both AVEC2013 and AVEC2014. It is worth noting that the error around 6.0 indicates a misclassification of the depression severity only between adjacent categories and for scores at the border of the class (e.g., the predicted level is 18, mild severity level, and the actual score is 12 which is the minimal level). The worst case is when a subject with minimal level of depression is classified with a severe level or the other way around. This is the case when the error is greater than 16. Our proposed model produced error greater than 16 only on 1 and 5 videos on AVEC2013 and AVEC2014, respectively. These results show that our architecture generalizes well, and the probability of grave misclassification is small.
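The per-sample analysis above can be reproduced with a few lines of code. The sketch below sorts absolute errors, reports the fraction below 6.0 and the number above 16, and maps scores to severity levels. The cut-offs used (0 to 13 minimal, 14 to 19 mild, 20 to 28 moderate, 29 to 63 severe) are the standard BDI-II ranges and are an assumption here, since the excerpt only gives the example of 12 (minimal) versus 18 (mild). All names and scores are illustrative.

```python
from typing import List, Sequence, Tuple


def severity(score: float) -> str:
    """Map a depression score to a severity level (assumed BDI-II cut-offs)."""
    if score <= 13:
        return "minimal"
    if score <= 19:
        return "mild"
    if score <= 28:
        return "moderate"
    return "severe"


def error_summary(y_true: Sequence[float], y_pred: Sequence[float]) -> Tuple[List[float], float, int]:
    """Sorted absolute errors, fraction of errors below 6.0, count of errors above 16."""
    errors = sorted(abs(t - p) for t, p in zip(y_true, y_pred))
    frac_below_6 = sum(e < 6.0 for e in errors) / len(errors)
    n_above_16 = sum(e > 16.0 for e in errors)
    return errors, frac_below_6, n_above_16


# Made-up ground-truth and predicted scores for four test videos.
truth = [12.0, 5.0, 30.0, 40.0]
preds = [18.0, 7.0, 28.0, 20.0]
errors, frac_below_6, n_above_16 = error_summary(truth, preds)
```

With these toy values, the first video has an error of exactly 6 and moves only between adjacent levels (`severity(12.0)` versus `severity(18.0)`), while the last one, with an error of 20, crosses from severe to moderate.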

Pain Estimation
To further validate the ability of our proposed MDN to capture and leverage spatio-temporal information, additional experiments are conducted for pain intensity estimation. The dataset employed is the well-known UNBC-McMaster Shoulder Pain Expression Archive Database [54]. It includes 200 face videos of 25 subjects, each one annotated at frame level with a PSPI score in the range 0–15. For a fair comparison with the state-of-the-art schemes, we report the performance of our approach in terms of Mean Squared Error (MSE) and MAE, adopting a leave-one-subject-out cross-validation strategy. As the input of the MDN is a clip (16 frames), we define the ground truth as the mean of the pain intensity of each frame inside the clip. In Table 10, we show the results of MDN-152 compared with six methods presented in the literature. As we can see, MDN outperforms five methods. For instance, our method obtains better results than the method in [59], which uses around 138 million parameters, whereas MDN employs only 52 million parameters. Our approach achieves competitive results when compared with the method in [60], but that method uses 586.8 million parameters, more than 11 times the number of parameters of MDN, demonstrating the efficiency of MDN in exploring spatio-temporal information in other related problems.
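The clip-level ground truth and the evaluation protocol described above can be sketched as follows. The clip length of 16 frames and the leave-one-subject-out strategy come from the text. The use of non-overlapping clips, the dropping of a trailing incomplete clip, and the names `clip_labels` and `leave_one_subject_out` are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def clip_labels(frame_pspi: Sequence[float], clip_len: int = 16) -> List[float]:
    """Mean frame-level PSPI score for each non-overlapping clip (assumed stride)."""
    starts = range(0, len(frame_pspi) - clip_len + 1, clip_len)
    return [sum(frame_pspi[s:s + clip_len]) / clip_len for s in starts]


def leave_one_subject_out(
    subject_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held-out subject, train indices, test indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


# Toy video of 32 frames: no pain in the first clip, mixed intensities in the second.
frames = [0.0] * 16 + [2.0] * 8 + [4.0] * 8
labels = clip_labels(frames)

# Four videos from three hypothetical subjects.
folds = list(leave_one_subject_out(["a", "b", "a", "c"]))
```

Holding out all videos of one subject per fold ensures that the model is never tested on a face it has seen during training.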
In order to show the capabilities of our architecture in estimating different intensities of pain, Fig. 6 shows the ground truth and the predictions of MDN on a video of a subject. As shown, MDN detects the intensities of pain in a satisfactory way, and follows the different transitions of levels of pain. These results indicate that MDN has good potential to explore face expression variations related to pain.

CONCLUSION
We presented the Maximization and Differentiation Network (MDN) for encoding spatio-temporal variations of face videos for automatic depression detection. The proposed method is composed of a maximization block to model smooth facial expression variations and a difference block to encode sudden facial variations. The combination of these blocks forms the MDN module, which explores multiscale temporal information without 3D convolutions. We incorporated our MDN module in 3D ResNet-type architectures to generate our novel MDN architecture. We evaluated the performance of the proposed method on the two benchmark datasets for depression detection from facial videos, namely, AVEC2013 and AVEC2014. The experiments demonstrated an improvement in performance over 3D ResNet as well as the T3D and I3D models. Our architecture also outperformed the state-of-the-art approaches for depression detection. As future work, we intend to investigate other complementary modalities (e.g., audio and video-based biosignals), integrating these signals into our proposed architecture in order to further improve the performance of the model.