University of Oulu

Changchong Sheng, Matti Pietikäinen, Qi Tian, and Li Liu. 2021. Cross-modal Self-Supervised Learning for Lip Reading: When Contrastive Learning meets Adversarial Training. Proceedings of the 29th ACM International Conference on Multimedia. Association for Computing Machinery, New York, NY, USA, 2456–2464. DOI: 10.1145/3474085.3475415

Cross-modal self-supervised learning for lip reading : when contrastive learning meets adversarial training

Author: Sheng, Changchong (1); Pietikäinen, Matti (1); Tian, Qi (2);
Organizations: (1) University of Oulu, Oulu, Finland
(2) Xidian University, Xi’an, China
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 1 MB)
Language: English
Published: Association for Computing Machinery, 2021
Publish Date: 2022-03-01


The goal of this work is to learn discriminative visual representations for lip reading without access to manual text annotation. Recent advances in cross-modal self-supervised learning have shown that the corresponding audio can serve as a supervisory signal for learning effective visual representations for lip reading. However, existing methods exploit only the natural synchronization of the video and the corresponding audio. We find that both video and audio are actually composed of speech-related information, identity-related information, and modality information. To make the visual representations (i) more discriminative for lip reading and (ii) indiscriminate with respect to identities and modalities, we propose a novel self-supervised learning framework called Adversarial Dual-Contrast Self-Supervised Learning (ADC-SSL), which goes beyond previous methods by explicitly forcing the visual representations to be disentangled from speech-unrelated information. Experimental results clearly show that the proposed method outperforms state-of-the-art cross-modal self-supervised baselines by a large margin. Moreover, ADC-SSL can outperform its supervised counterpart without any fine-tuning.
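The audio-as-supervision idea mentioned in the abstract can be illustrated with a generic cross-modal contrastive (InfoNCE-style) loss. This is a minimal sketch under stated assumptions, not the ADC-SSL objective itself: the record does not give the paper's loss functions, so the function names, the temperature value, and the toy embeddings below are all hypothetical, and the adversarial (identity and modality) part of the framework is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_modal_info_nce(video_embs, audio_embs, temperature=0.1):
    """Mean InfoNCE loss where video clip i should match audio clip i.

    Hypothetical helper for illustration only; the temperature is an
    assumed value, not one taken from the paper.
    """
    if len(video_embs) != len(audio_embs) or not video_embs:
        raise ValueError("need equally sized, non-empty batches")
    video = [l2_normalize(v) for v in video_embs]
    audio = [l2_normalize(a) for a in audio_embs]
    total = 0.0
    for i, v in enumerate(video):
        # Cosine similarity of video clip i with every audio clip in the batch.
        logits = [sum(x * y for x, y in zip(v, a)) / temperature for a in audio]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of selecting the synchronized audio clip.
        total += log_denominator - logits[i]
    return total / len(video)


# Toy embeddings (made up): synchronized pairs versus swapped pairs.
aligned = cross_modal_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = cross_modal_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each video embedding points the same way as its own audio embedding, and it grows when the pairs are swapped, which is the signal a synchronization-based method relies on.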


ISBN: 978-1-4503-8651-7
Pages: 2456 - 2464
DOI: 10.1145/3474085.3475415
Host publication: 29th ACM International Conference on Multimedia, MM 2021
Conference: ACM Multimedia
Type of Publication: A4 Article in conference proceedings
Field of Science: 113 Computer and information sciences
Funding: This work was partially supported by the Academy of Finland under grant 331883, Outstanding Talents of “Ten Thousand Talents Plan” in Zhejiang Province (project no. 2018R51001), and the Natural Science Foundation of China (project no. 61976196).
Academy of Finland Grant Number: 331883
Detailed Information: 331883 (Academy of Finland Funding decision)
Copyright information: © 2021 Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 29th ACM International Conference on Multimedia,