University of Oulu

G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos and S. Zafeiriou, "Mnemonic Descent Method: A Recurrent Process Applied for End-to-End Face Alignment," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 4177-4187. doi: 10.1109/CVPR.2016.453

Mnemonic Descent Method : a recurrent process applied for end-to-end face alignment

Author: Trigeorgis, George (1); Snape, Patrick (1); Nicolaou, Mihalis A. (2);
Organizations: (1) Department of Computing, Imperial College London, UK
(2) Department of Computing, Goldsmiths, University of London, UK
(3) Center for Machine Vision and Signal Analysis, University of Oulu, Finland
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 5.4 MB)
Language: English
Published: Institute of Electrical and Electronics Engineers, 2016
Publish Date: 2019-02-28


Cascaded regression has recently become the method of choice for solving non-linear least squares problems such as deformable image alignment. Given a sizeable training set, cascaded regression learns a set of generic rules that are sequentially applied to minimise the least squares problem. Despite the success of cascaded regression for problems such as face alignment and head pose estimation, there are several shortcomings arising in the strategies proposed thus far. Specifically, (a) the regressors are learnt independently, (b) the descent directions may cancel one another out and (c) handcrafted features (e.g., HoG, SIFT) are mainly used to drive the cascade, which may be sub-optimal for the task at hand. In this paper, we propose a combined and jointly trained convolutional recurrent neural network architecture that allows the training of an end-to-end system that attempts to alleviate the aforementioned drawbacks. The recurrent module facilitates the joint optimisation of the regressors by assuming the cascades form a nonlinear dynamical system, in effect fully utilising the information between all cascade levels by introducing a memory unit that shares information across all levels. The convolutional module allows the network to extract features that are specialised for the task at hand and are experimentally shown to outperform hand-crafted features. We show that the application of the proposed architecture to the problem of face alignment results in a strong improvement over the current state-of-the-art.


ISBN Print: 978-1-4673-8851-1
Pages: 4177 - 4187
DOI: 10.1109/CVPR.2016.453
Host publication: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Conference: IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Type of Publication: A4 Article in conference proceedings
Field of Science: 113 Computer and information sciences
Funding: G. Trigeorgis was supported by an EPSRC DTA award at Imperial College London. The work of P. Snape was partially funded by an EPSRC DTA and by the European Community Horizon 2020 [H2020/2014-2020] under grant agreement no. 688520 (TeSLA). The work of E. Antonakos was partially funded by the EPSRC project EP/J017787/1 (4DFAB). The work of S. Zafeiriou was funded by the FiDiPro program of Tekes (project number: 1849/31/2015). We thank the NVIDIA Corporation for donating a Tesla K40 GPU used in this work.
Copyright information: © 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.