University of Oulu

X. Wu, X. Zhang, X. Feng, M. B. López and L. Liu, "Audio-Visual Kinship Verification: A New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach," in IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2022.3220040

Audio-visual kinship verification : a new dataset and a unified adaptive adversarial multimodal learning approach

Author: Wu, Xiaoting1,2; Zhang, Xueyi3; Feng, Xiaoyi2,4;
Organizations: 1Center for Machine Vision and Signal Analysis, University of Oulu, 90570 Oulu, Finland
2School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710060, China
3College of System Engineering, National University of Defense Technology, Changsha 410073, Hunan, China
4Research and Development Institute, Northwestern Polytechnical University (Shenzhen), Shenzhen 518063, China
5Cognitive Technologies for Intelligence, VTT Technical Research Centre of Finland, 90570 Oulu, Finland
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 3.4 MB)
Language: English
Published: Institute of Electrical and Electronics Engineers, 2022
Publish Date: 2023-09-13


Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to potential practical applications. Over the past decade, many efforts have been devoted to improving verification performance from human faces alone, without using other biometric information such as the speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer to as audio-visual kinship verification. We first establish a comprehensive audio-visual kinship dataset, called TALKIN-Family, that consists of talking-face videos of family members under various scenarios. Based on the dataset, we present an extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML), which consists of an adversarial network and an attention module built on unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem. Furthermore, the proposed fusion method outperforms baseline methods. We also evaluate human verification ability on a subset of TALKIN-Family; the results indicate that humans achieve higher accuracy when they have access to both faces and voices. The machine-learning methods can outperform human ability effectively and efficiently. Finally, we discuss future work and research opportunities with the TALKIN-Family dataset.


Series: IEEE transactions on cybernetics
ISSN: 2168-2267
ISSN-E: 2168-2275
ISSN-L: 2168-2267
Issue: Online first
DOI: 10.1109/tcyb.2022.3220040
Type of Publication: A1 Journal article – refereed
Field of Science: 113 Computer and information sciences
Funding: This work was supported in part by the National Key Research and Development Program of China under Grant 2021YFB3100800; in part by the Academy of Finland under Grant 331883 and Grant 346208 (6G Flagship); in part by the National Natural Science Foundation of China under Grant 61872379; in part by the University of Oulu and the Academy of Finland under Grant Profi6 336449; in part by the Key Research and Development Program of Shaanxi under Grant 2022ZDLGY06-07; and in part by the Shenzhen Science and Technology Program under Grant GJHZ20200731095204013.
Academy of Finland Grant Number: 331883
Detailed Information: 331883 (Academy of Finland Funding decision)
346208 (Academy of Finland Funding decision)
Copyright information: © The Author(s) 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see