Self-supervised 2D face presentation attack detection via temporal sequence sampling

Muhammad, Usman; Yu, Zitong; Komulainen, Jukka (2022-03-04)

URL:

https://doi.org/10.1016/j.patrec.2022.03.001

Authors: Muhammad, Usman; Yu, Zitong; Komulainen, Jukka

Publisher: Elsevier

Date: 04.03.2022

Usman Muhammad, Zitong Yu, Jukka Komulainen, Self-supervised 2D face presentation attack detection via temporal sequence sampling, Pattern Recognition Letters, Volume 156, 2022, Pages 15-22, ISSN 0167-8655, https://doi.org/10.1016/j.patrec.2022.03.001

https://creativecommons.org/licenses/by/4.0/
© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

doi:https://doi.org/10.1016/j.patrec.2022.03.001

The permanent address of this publication is
https://urn.fi/URN:NBN:fi-fe2022041929477

Abstract

Conventional 2D face biometric systems are vulnerable to presentation attacks performed with different face artefacts, e.g., printouts, video-replays and wearable 3D masks. The research focus in face presentation attack detection (PAD) has recently been shifting towards end-to-end learning of deep representations directly from annotated data rather than designing hand-crafted (low-level) features. However, even the state-of-the-art deep-learning-based face PAD models have shown unsatisfactory generalization performance when facing unknown attacks or acquisition conditions, due to the lack of representative training and tuning data in the existing public benchmarks. To alleviate this issue, we propose a video pre-processing technique for 2D face PAD called Temporal Sequence Sampling (TSS), which removes the estimated inter-frame 2D affine motion in the view and encodes the appearance and dynamics of the resulting smoothed video sequence into a single RGB image. Furthermore, we leverage the features of a Convolutional Neural Network (CNN) by introducing a self-supervised representation learning scheme in which the labels are generated automatically by the TSS method: the stabilized frames accumulated over video clips of different temporal lengths provide the supervision. The learnt feature representations are then fine-tuned for the downstream task using labelled face PAD data. Our extensive experiments on four public benchmarks, namely Replay-Attack, MSU-MFSD, CASIA-FASD and OULU-NPU, demonstrate that the proposed framework provides promising generalization capability and encourage further study in this domain.
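
The self-supervised labelling idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not say how frames are accumulated or which clip lengths are used, so per-pixel averaging, the lengths 2, 4 and 8, and all function names here are assumptions made for illustration. The affine motion-compensation step is replaced by a no-op placeholder, and frames are flat lists of intensities rather than RGB arrays.

```python
from typing import Callable, List, Sequence, Tuple

# Flattened pixel intensities; a real frame would be an H x W x 3 array.
Frame = List[float]


def identity_stabilize(frames: Sequence[Frame]) -> List[Frame]:
    """Placeholder for motion compensation (assumption: a no-op here).

    The abstract describes removing estimated inter-frame 2D affine motion;
    that estimation is not reproduced in this sketch.
    """
    return [list(f) for f in frames]


def accumulate(frames: Sequence[Frame]) -> Frame:
    """Collapse a clip into one image by per-pixel averaging (assumed rule)."""
    n = len(frames)
    return [sum(pixels) / n for pixels in zip(*frames)]


def make_pretext_samples(
    video: Sequence[Frame],
    clip_lengths: Sequence[int],
    stabilize: Callable[[Sequence[Frame]], List[Frame]] = identity_stabilize,
) -> List[Tuple[Frame, int]]:
    """Return (accumulated image, pseudo-label) pairs.

    The pseudo-label is the index of the clip length that produced the
    image, so no manual annotation is involved.
    """
    samples = []
    for label, length in enumerate(clip_lengths):
        if length > len(video):
            continue  # skip lengths the video cannot supply
        # Cut the video into non-overlapping clips of this length.
        for start in range(0, len(video) - length + 1, length):
            clip = stabilize(video[start:start + length])
            samples.append((accumulate(clip), label))
    return samples


# Toy video: 8 frames of 4 pixels each, frame t filled with the value t.
video = [[float(t)] * 4 for t in range(8)]
samples = make_pretext_samples(video, clip_lengths=[2, 4, 8])
```

On the toy video this yields seven samples: four from length-2 clips (label 0), two from length-4 clips (label 1) and one from the full 8-frame clip (label 2). A classifier trained to predict these labels would serve as the pretext task before fine-tuning on labelled face PAD data.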

Unless otherwise stated, the material is licensed under https://creativecommons.org/licenses/by/4.0/