University of Oulu

Kuang, L., Wang, Y., Hang, T. et al. A dual-branch neural network for DeepFake video detection by detecting spatial and temporal inconsistencies. Multimed Tools Appl 81, 42591–42606 (2022).

A dual-branch neural network for DeepFake video detection by detecting spatial and temporal inconsistencies

Author: Kuang, Liang1,2; Wang, Yiting3; Hang, Tian1;
Organizations: 1School of Computer, Nanjing University of Information Science and Technology, Nanjing, 210044, China
2School of IoT Engineering, Jiangsu Vocational College of Information Technology, Wuxi, 214153, China
3Warwick Manufacturing Group, University of Warwick, Coventry, CV4 7AL, UK
4Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing, 210044, China
5Center for Machine Vision and Signal Analysis, University of Oulu, 90014, Oulu, Finland
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 0.9 MB)
Persistent link:
Language: English
Published: Springer Nature, 2022
Publish Date: 2023-01-30


Detecting whether a video is natural or a DeepFake has become a research hotspot. However, almost all existing works focus on detecting inconsistency in either the spatial or the temporal domain, not both. In this paper, a dual-branch neural network (a spatial branch and a temporal branch) is proposed that detects inconsistencies in both domains for DeepFake video detection. The spatial branch detects spatial inconsistency using the EfficientNet model. The temporal branch detects temporal inconsistency using a new network model, which takes optical flow as input, uses EfficientNet to extract optical flow features, and uses a Bidirectional Long Short-Term Memory (Bi-LSTM) network to capture the temporal inconsistency of the optical flow. Moreover, the optical flow frames are stacked before being fed into EfficientNet. Finally, the softmax scores of the two branches are combined by a binary linear SVM classifier. Experimental results on the compressed FaceForensics++ dataset and the Celeb-DF dataset show that: (a) the proposed dual-branch network outperforms some recent spatial and temporal models on the Celeb-DF dataset and on all four manipulation methods in the FaceForensics++ dataset, since the two branches complement each other; (b) according to the ablation experiments, the use of optical flow inputs, Bi-LSTM, and dual branches greatly improves detection performance.
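
The final fusion step described in the abstract (softmax scores of the two branches combined by a binary linear SVM) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy logits, the hyperparameters, and the plain subgradient-descent trainer are all assumptions made for illustration, and the two branch networks (EfficientNet, and EfficientNet plus Bi-LSTM on optical flow) are replaced by hand-written logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(spatial_logits, temporal_logits):
    """Concatenate both branches' (real, fake) softmax scores into one vector."""
    return softmax(spatial_logits) + softmax(temporal_logits)

def train_linear_svm(features, labels, lam=0.01, lr=0.1, epochs=500):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    Labels are +1 (fake) and -1 (real). Hyperparameters are illustrative.
    """
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(features, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score < 1:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (fake) or -1 (real) for a fused score vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical (real, fake) logits per branch; not taken from the paper.
toy = [
    ((-2.0, 2.0), (-2.0, 2.0), 1),    # both branches say fake
    ((0.5, -0.5), (-3.0, 3.0), 1),    # spatial branch alone leans real
    ((-3.0, 3.0), (0.5, -0.5), 1),    # temporal branch alone leans real
    ((2.0, -2.0), (2.0, -2.0), -1),   # both branches say real
    ((-0.5, 0.5), (3.0, -3.0), -1),   # spatial branch alone leans fake
    ((3.0, -3.0), (-0.5, 0.5), -1),   # temporal branch alone leans fake
]
features = [fuse(s, t) for s, t, _ in toy]
labels = [y for _, _, y in toy]

w, b = train_linear_svm(features, labels)
predictions = [predict(w, b, x) for x in features]
print(predictions)
```

In the toy data, each branch is mildly wrong on two samples while the other branch is confident, so a classifier over the fused scores can recover the correct label where a single branch would not. In practice an off-the-shelf linear SVM from a machine learning library would normally replace the hand-written trainer.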


Series: Multimedia tools and applications
ISSN: 1380-7501
ISSN-E: 1573-7721
ISSN-L: 1380-7501
Volume: 81
Issue: 29
Pages: 42591 - 42606
DOI: 10.1007/s11042-021-11539-y
Type of Publication: A1 Journal article – refereed
Field of Science: 113 Computer and information sciences
Funding: This work was supported by the National Natural Science Foundation of China under Grant 62072251, the Natural Science Research Project of Jiangsu Universities under Grant 20KJB520021, the Higher Vocational Education Teaching Fusion Production Integration Platform Construction Projects of Jiangsu Province under Grant No. 2019(26), and the PAPD fund.
Copyright information: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021. This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: