University of Oulu

K. Ma, Y. Feng, B. Chen and G. Zhao, "End-to-End Dual-Branch Network Towards Synthetic Speech Detection," in IEEE Signal Processing Letters, vol. 30, pp. 359-363, 2023, doi: 10.1109/LSP.2023.3262419

End-to-end dual-branch network towards synthetic speech detection

Author: Ma, Kaijie1,2,3; Feng, Yifan1,2,3; Chen, Beijing1,2,3; Zhao, G.4
Organizations: 1Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science & Technology, Nanjing, China
2School of Computer Science, Nanjing University of Information Science & Technology, Nanjing, China
3Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science & Technology, Nanjing, China
4Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 0.6 MB)
Persistent link: http://urn.fi/urn:nbn:fi-fe2023061555562
Language: English
Published: Institute of Electrical and Electronics Engineers, 2023
Publish Date: 2023-06-15
Description:

Abstract

Synthetic speech attacks pose a growing threat to automatic speaker verification (ASV) systems, so many synthetic speech detection (SSD) systems have been proposed to help ASV systems resist them. However, existing SSD systems still generalize poorly to attacks generated by unknown synthesis algorithms. This letter proposes an end-to-end ensemble system, the Dual-Branch Network, in which linear frequency cepstral coefficients (LFCC) and the constant Q transform (CQT) serve as the inputs of the two branches, respectively. Four fusion strategies are compared to select the best way of fusing the two branches. In addition, multi-task learning and a convolutional block attention module (CBAM) are introduced into the Dual-Branch Network to help it learn the forgery features common to different forgery types of speech and to enhance the representation power of the learned features. Experimental results on the ASVspoof 2019 logical access (LA) dataset demonstrate that the proposed system outperforms existing state-of-the-art systems on both t-DCF and EER scores and generalizes well to unknown forgery types of synthetic speech.
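The abstract reports results as equal error rate (EER), the operating point at which the false acceptance rate on spoofed speech equals the false rejection rate on bona fide speech. The following is a minimal illustrative sketch of that metric only, not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more bona fide", and the toy scores are all assumptions made for illustration. It uses a simple threshold sweep and reports the midpoint of the two error rates at the threshold where they are closest.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) via a plain threshold sweep.

    Assumption: a higher score means the detector considers the
    utterance more likely to be bona fide (genuine) speech.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one value above
    # the maximum so that "reject everything" is also considered.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best = None  # (gap, eer, threshold)
    for t in thresholds:
        # False rejection: bona fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical toy scores, not taken from the paper.
    bonafide = [0.9, 0.8, 0.4, 0.7]
    spoof = [0.1, 0.5, 0.2, 0.3]
    eer, threshold = equal_error_rate(bonafide, spoof)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

On real evaluation sets, the two error rates are usually interpolated between neighbouring thresholds rather than taken at the closest observed point, so the value from this sketch can differ slightly from that of an official scoring script.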


Series: IEEE signal processing letters
ISSN: 1070-9908
ISSN-E: 1558-2361
ISSN-L: 1070-9908
Volume: 30
Pages: 359 - 363
DOI: 10.1109/LSP.2023.3262419
OADOI: https://oadoi.org/10.1109/LSP.2023.3262419
Type of Publication: A1 Journal article – refereed
Field of Science: 213 Electronic, automation and communications engineering, electronics
Funding: This work was supported in part by the National Natural Science Foundation of China under Grant 62072251, in part by the Academy of Finland for ICT 2023 project TrustFace under Grant 345948, and in part by Infotech Oulu.
Academy of Finland Grant Number: 345948
Detailed Information: 345948 (Academy of Finland Funding decision)
Copyright information: © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.