University of Oulu

Mohamed, M., Oussalah, M. A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics. Lang Resources & Evaluation 54, 457–485 (2020). https://doi.org/10.1007/s10579-019-09466-4

A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics

Author: Mohamed, Muhidin (1); Oussalah, Mourad (2)
Organizations: (1) Department of Computer Science, EAS, Aston University, Birmingham B4 7ET, UK
(2) Centre for Ubiquitous Computing, Faculty of Information Technology Computer Science, University of Oulu, P.O. Box 4500, 90014 Oulu, Finland
Format: article
Version: published version
Access: open
Online Access: PDF Full Text (PDF, 0.7 MB)
Persistent link: http://urn.fi/urn:nbn:fi-fe202001202588
Language: English
Published: Springer Nature, 2020
Publish Date: 2020-01-20
Description:

Abstract

In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain named entities. Its essence is to compute the semantic similarity of named-entity tokens separately from that of the rest of the sentence text. More specifically, the approach integrates word semantic similarity derived from WordNet taxonomic relations with named-entity semantic relatedness inferred from Wikipedia entity co-occurrences and underpinned by Normalized Google Distance. In addition, the WordNet similarity measure is enriched with word part-of-speech (PoS) conversion aided by a Categorial Variation database (CatVar), which enhances the lexico-semantics of words. We validated our hybrid approach on two datasets: the Microsoft Research Paraphrase Corpus (MSRPC) and TREC-9 Question Variants. Our empirical evaluation showed that the system outperforms the baselines and most related state-of-the-art systems for paraphrase detection. We also conducted a misidentification analysis to identify the primary sources of system errors.
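
As a rough illustration of the named-entity side of the abstract, the sketch below computes the standard Normalized Google Distance from occurrence counts and then blends a word-level score with an entity-level score. It is not taken from the article: the function names, the count arguments, and in particular the token-proportion weighting in `hybrid_similarity` are assumptions made for this example, and the abstract does not state how the authors combine the two components.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from occurrence counts.

    fx, fy -- number of documents (e.g. Wikipedia articles) mentioning x, y
    fxy    -- number of documents mentioning both x and y
    n      -- total number of documents in the collection
    All argument names are illustrative assumptions.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # No (co-)occurrence evidence: treat the entities as unrelated.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator <= 0:
        # Degenerate case: a term occurs in every document.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denominator


def hybrid_similarity(word_sim, entity_sim, n_words, n_entities):
    """Hypothetical blend of word-level and entity-level scores.

    Weights each score by its share of tokens. This weighting is an
    assumption for illustration, not the scheme described in the article.
    """
    total = n_words + n_entities
    if total == 0:
        return 0.0
    return (n_words * word_sim + n_entities * entity_sim) / total
```

A smaller NGD means the two entities co-occur more often than their individual frequencies would suggest. Turning that distance into a relatedness score in [0, 1] needs a further mapping, which is left out here because the abstract does not specify one.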


Series: Language resources and evaluation
ISSN: 1574-020X
ISSN-E: 1574-0218
ISSN-L: 1574-020X
Volume: 54
Pages: 457 - 485
DOI: 10.1007/s10579-019-09466-4
OADOI: https://oadoi.org/10.1007/s10579-019-09466-4
Type of Publication: A1 Journal article – refereed
Field of Science: 113 Computer and information sciences
Copyright information: © The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
  https://creativecommons.org/licenses/by/4.0/