End-to-end dual-branch network towards synthetic speech detection |
|
Author: | Ma, Kaijie 1,2,3; Feng, Yifan 1,2,3; Chen, Beijing 1,2,3; |
Organizations: |
1 Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science & Technology, Nanjing, China
2 School of Computer Science, Nanjing University of Information Science & Technology, Nanjing, China
3 Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science & Technology, Nanjing, China
4 Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
|
Format: | article |
Version: | accepted version |
Access: | open |
Online Access: | PDF Full Text (PDF, 0.6 MB) |
Persistent link: | http://urn.fi/urn:nbn:fi-fe2023061555562 |
Language: | English |
Published: | Institute of Electrical and Electronics Engineers, 2023 |
Publish Date: | 2023-06-15 |
Description: |
Abstract: Synthetic speech attacks pose a growing threat to automatic speaker verification (ASV) systems, so many synthetic speech detection (SSD) systems have been proposed to help ASV systems resist them. However, existing SSD systems still generalize poorly to attacks generated by unknown synthesis algorithms. This letter proposes an end-to-end ensemble system, the Dual-Branch Network, in which linear frequency cepstral coefficients (LFCC) and the constant Q transform (CQT) serve as the inputs of the two branches, respectively. Four fusion strategies are compared to find the best way of fusing the two branches. In addition, multi-task learning and a convolutional block attention module (CBAM) are introduced into the Dual-Branch Network to help it learn forgery features common to different forgery types of speech and to enhance the representation power of the learned features. Experimental results on the ASVspoof 2019 logical access (LA) dataset demonstrate that the proposed system outperforms existing state-of-the-art systems on both t-DCF and EER and generalizes well to unknown forgery types of synthetic speech.
|
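The abstract reports results as an equal error rate (EER). As a rough illustration of what that metric measures, the sketch below computes an EER from two lists of detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for this example.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumption: a higher score means the utterance looks more like
    bona fide (genuine) speech. Returns (eer, threshold).
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    best = None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection rate: bona fide utterances scored below t.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed utterances scored at or above t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical toy scores, not taken from the paper.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(bonafide, spoof)
```

With these toy scores, one bona fide utterance falls below the chosen threshold and one spoofed utterance reaches it, so both error rates meet at 25%. A lower EER indicates a better detector.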
Series: |
IEEE signal processing letters |
ISSN: | 1070-9908 |
ISSN-E: | 1558-2361 |
ISSN-L: | 1070-9908 |
Volume: | 30 |
Pages: | 359 - 363 |
DOI: | 10.1109/LSP.2023.3262419 |
OADOI: | https://oadoi.org/10.1109/LSP.2023.3262419 |
Type of Publication: |
A1 Journal article – refereed |
Field of Science: |
213 Electronic, automation and communications engineering, electronics |
Funding: |
This work was supported in part by the National Natural Science Foundation of China under Grant 62072251, in part by the Academy of Finland for ICT 2023 project TrustFace under Grant 345948, and in part by Infotech Oulu. |
Academy of Finland Grant Number: |
345948 |
Detailed Information: |
345948 (Academy of Finland Funding decision) |
Copyright information: |
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. |