University of Oulu

Yu Liu, Li Liu, Yanming Guo, Michael S. Lew. Learning visual and textual representations for multimodal matching and classification. Pattern Recognition, Volume 84, 2018, Pages 51-67, ISSN 0031-3203.

Learning visual and textual representations for multimodal matching and classification

Author: Liu, Yu (1); Liu, Li (2, 3); Guo, Yanming (2);
Organizations: (1) Department of Computer Science, Leiden University, Leiden 2333 CA, The Netherlands
(2) College of System Engineering, National University of Defense Technology, Changsha, Hunan 410073, China
(3) Center for Machine Vision and Signal Analysis, University of Oulu, Oulu 8000, Finland
Format: article
Version: accepted version
Access: open
Online Access: PDF Full Text (PDF, 3.1 MB)
Language: English
Published: Elsevier, 2018
Publish Date: 2020-07-02


Multimodal learning, which aims to bridge the modality gap between heterogeneous representations such as vision and language, has been an important and challenging problem for decades. Unlike many current approaches, which focus on either multimodal matching or classification alone, we propose a unified network (MMC-Net) that jointly learns multimodal matching and classification between images and texts. MMC-Net seamlessly integrates the matching and classification components: it first learns visual and textual embedding features in the matching component, and then generates discriminative multimodal representations in the classification component. Combining the two components in a unified model can improve the performance of both. Moreover, we present a multi-stage training algorithm that minimizes both the matching and the classification loss functions. Experimental results on four well-known multimodal benchmarks demonstrate the effectiveness and efficiency of the proposed approach, which achieves competitive performance on multimodal matching and classification compared to state-of-the-art approaches.
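
The abstract describes training against a matching loss and a classification loss together. As a rough illustration only, the sketch below combines a bidirectional hinge ranking loss with softmax cross-entropy. This record does not state the exact form of either loss, so the loss definitions, the `margin` and `weight` parameters, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matching_loss(img_embs, txt_embs, margin=0.2):
    """Bidirectional hinge ranking loss (assumed form, not from the paper).

    img_embs[i] and txt_embs[i] are treated as a matching pair; every
    other pairing in the batch is treated as a negative.
    """
    n = len(img_embs)
    total = 0.0
    for i in range(n):
        pos = cosine(img_embs[i], txt_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Image i should be closer to its own text than to text j.
            total += max(0.0, margin - pos + cosine(img_embs[i], txt_embs[j]))
            # Text i should be closer to its own image than to image j.
            total += max(0.0, margin - pos + cosine(img_embs[j], txt_embs[i]))
    return total / n


def classification_loss(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def joint_loss(img_embs, txt_embs, logits_batch, labels, weight=1.0):
    """Matching loss plus a weighted mean classification loss.

    `weight` is a hypothetical balancing factor between the two terms.
    """
    cls = sum(
        classification_loss(logits, y) for logits, y in zip(logits_batch, labels)
    ) / len(labels)
    return matching_loss(img_embs, txt_embs, margin=0.2) + weight * cls
```

With perfectly aligned embeddings the matching term vanishes and only the classification term remains; with swapped pairs the matching term dominates. How the multi-stage schedule alternates or combines these terms is not described in this record and is not modeled here.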


Series: Pattern recognition
ISSN: 0031-3203
ISSN-E: 1873-5142
ISSN-L: 0031-3203
Volume: 84
Pages: 51 - 67
DOI: 10.1016/j.patcog.2018.07.001
Type of Publication: A1 Journal article – refereed
Field of Science: 113 Computer and information sciences
Funding: This work was supported mainly by the LIACS Media Lab at Leiden University (grant 2006002026) and in part by the China Scholarship Council (grant 201406060010). We are also grateful for the support of NVIDIA through the donation of GPU cards.
Copyright information: © 2018 Published by Elsevier Ltd. This manuscript version is made available under the CC-BY-NC-ND 4.0 license