A dual-branch neural network for DeepFake video detection by detecting spatial and temporal inconsistencies
Kuang, Liang; Wang, Yiting; Hang, Tian; Chen, Beijing; Zhao, Guoying (2022-07-12)
Kuang, L., Wang, Y., Hang, T. et al. A dual-branch neural network for DeepFake video detection by detecting spatial and temporal inconsistencies. Multimed Tools Appl 81, 42591–42606 (2022). https://doi.org/10.1007/s11042-021-11539-y
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021. This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1007/s11042-021-11539-y
https://rightsstatements.org/vocab/InC/1.0/
https://urn.fi/URN:NBN:fi-fe202301306556
Abstract
Detecting whether a video is natural or a DeepFake has become a research hotspot. However, almost all existing works focus on detecting inconsistency in either the spatial or the temporal domain, not both. In this paper, a dual-branch neural network (a spatial branch and a temporal branch) is proposed to detect inconsistencies in both domains for DeepFake video detection. The spatial branch detects spatial inconsistency using the EfficientNet model. The temporal branch detects temporal inconsistency using a new network model that takes optical flow as input, uses EfficientNet to extract optical flow features, and uses a Bidirectional Long Short-Term Memory (Bi-LSTM) network to capture the temporal inconsistency of the optical flow. Moreover, the optical flow frames are stacked before being fed into EfficientNet. Finally, the softmax scores of the two branches are combined by a binary linear SVM classifier. Experimental results on the compressed FaceForensics++ dataset and the Celeb-DF dataset show that: (a) the proposed dual-branch network performs better than some recent spatial and temporal models on the Celeb-DF dataset and on all four manipulation methods in the FaceForensics++ dataset, since the two branches complement each other; (b) according to the ablation experiments, the use of optical flow inputs, Bi-LSTM, and dual branches greatly improves detection performance.
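The final fusion step described in the abstract (combining the two branches' softmax scores with a binary linear SVM) can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: the toy scores are made up, each branch is represented only by its "fake" softmax probability, and the SVM is fitted with a simple subgradient-descent loop on the hinge loss rather than whatever solver the paper used. The toy data is built so that neither branch's score separates the classes on its own, while a linear combination of both does.

```python
import random

# Hypothetical toy data: each row is [spatial_fake_prob, temporal_fake_prob],
# i.e. the "fake" softmax score from each branch. Labels: +1 fake, -1 real.
TRAIN_SCORES = [
    [0.90, 0.80], [0.85, 0.40], [0.35, 0.90], [0.70, 0.75],  # fake videos
    [0.10, 0.20], [0.30, 0.15], [0.20, 0.45], [0.40, 0.10],  # real videos
]
TRAIN_LABELS = [1, 1, 1, 1, -1, -1, -1, -1]


def decision_value(w, b, x):
    """Signed score w.x + b of the linear classifier."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(features, labels, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the
    L2-regularised hinge loss (a stand-in for a library solver)."""
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            margin = y * decision_value(w, b, x)
            # Regulariser: shrink the weights slightly on every step.
            w = [wj * (1.0 - lr * reg) for wj in w]
            # Hinge loss: update only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (fake) or -1 (real) for a pair of branch scores."""
    return 1 if decision_value(w, b, x) >= 0 else -1


if __name__ == "__main__":
    w, b = train_linear_svm(TRAIN_SCORES, TRAIN_LABELS)
    print("weights:", w, "bias:", b)
    print("both branches confident fake:", predict(w, b, [0.95, 0.95]))
    print("both branches confident real:", predict(w, b, [0.05, 0.05]))
```

In practice a library classifier would normally replace the hand-written training loop; the point here is only that the fusion stage sees two numbers per video and learns how much weight to give each branch.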