Yu Liu, Yanming Guo, Li Liu, Erwin M. Bakker, Michael S. Lew, CycleMatch: A cycle-consistent embedding network for image-text matching, Pattern Recognition, Volume 93, 2019, Pages 365-379, ISSN 0031-3203, https://doi.org/10.1016/j.patcog.2019.05.008
CycleMatch: A cycle-consistent embedding network for image-text matching
|Author:||Liu, Yu1; Guo, Yanming2; Liu, Li2,3; Bakker, Erwin M.1; Lew, Michael S.1|
1Department of Computer Science, Leiden University, Leiden, 2333 CA, The Netherlands
2College of System Engineering, National University of Defense Technology, Changsha, Hunan 410073, China
3Center for Machine Vision and Signal Analysis, University of Oulu, Oulu 8000, Finland
|Online Access:||PDF Full Text (PDF, 3.8 MB)|
|Persistent Link:||http://urn.fi/urn:nbn:fi-fe2020120399215|
|Publish Date:||2021-05-04|
In numerous multimedia and multi-modal tasks, from image and video retrieval to zero-shot recognition and multimedia question answering, bridging image and text representations plays an important, and in some cases indispensable, role. To narrow the modality gap between vision and language, prior approaches attempt to discover their correlated semantics in a common feature space. However, these approaches neglect intra-modal semantic consistency while learning inter-modal correlations. To address this problem, we propose cycle-consistent embeddings in a deep neural network for matching visual and textual representations. Our approach, named CycleMatch, maintains both inter-modal correlations and intra-modal consistency by cascading dual mappings and reconstructed mappings in a cyclic fashion. Moreover, to achieve robust inference, we propose two late-fusion approaches: average fusion and adaptive fusion. Both can effectively integrate the matching scores of different embedding features without increasing network complexity or training time. In experiments on cross-modal retrieval, we present comprehensive results verifying the effectiveness of the proposed approach. Our approach achieves state-of-the-art performance on two well-known multi-modal datasets, Flickr30K and MSCOCO.
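The core idea described in the abstract can be illustrated with a minimal sketch. The linear maps `F` and `G`, the feature dimensions, and the squared-error losses below are illustrative assumptions standing in for the paper's learned mapping networks, not the authors' actual architecture: one loss ties the mapped image feature to the text feature (inter-modal correlation), while a cycle term asks that mapping into the text space and back reconstructs the original image feature (intra-modal consistency).

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_txt = 6, 4  # toy feature dimensions (hypothetical)

# Hypothetical linear mappings standing in for the paper's learned networks:
# F maps image features into the text space, G maps them back.
F = rng.normal(size=(d_txt, d_img))
G = rng.normal(size=(d_img, d_txt))

def inter_modal_loss(v, t):
    # Correlation term: the mapped image feature should match its text feature.
    return float(np.sum((F @ v - t) ** 2))

def cycle_loss(v):
    # Consistency term: mapping to the text space and back should reconstruct v.
    return float(np.sum((G @ (F @ v) - v) ** 2))

def average_fusion(scores):
    # Late fusion in its simplest (average) form: combine matching scores
    # from different embedding features without extra parameters.
    return float(np.mean(scores))

v = rng.normal(size=d_img)  # toy image embedding
t = rng.normal(size=d_txt)  # toy text embedding
total = inter_modal_loss(v, t) + cycle_loss(v)
fused = average_fusion([0.8, 0.6])
```

Training would minimize `total` over many image-text pairs; at inference, scores from the different embeddings are fused before ranking. Adaptive fusion, per the abstract, would replace the fixed average with learned weights.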
|Pages:||365 - 379|
|Type of Publication:||A1 Journal article – refereed|
|Field of Science:||113 Computer and information sciences|
This work was supported by the LIACS Media Lab at Leiden University under Grant 2006002026 and the National Natural Science Foundation of China under Grant 61872379. We are also grateful for the support of NVIDIA with the donation of GPU cards.
© 2019 Elsevier Ltd. All rights reserved. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/.