A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics

Muhidin Mohamed*, Mourad Oussalah

*Corresponding author for this work

Research output: Contribution to journal, Article

Abstract

In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain a set of named entities. The essence of the proposal is to separate the computation of the semantic similarity of named-entity tokens from that of the rest of the sentence text. More specifically, the approach integrates word semantic similarity derived from WordNet taxonomic relations with named-entity semantic relatedness inferred from Wikipedia entity co-occurrences and underpinned by Normalized Google Distance. In addition, the WordNet similarity measure is enriched with word part-of-speech (PoS) conversion aided by a Categorial Variation database (CatVar), which enhances the lexico-semantics of words. We validated our hybrid approach using two different datasets: the Microsoft Research Paraphrase Corpus (MSRPC) and TREC-9 Question Variants. In our empirical evaluation, we showed that our system outperforms baselines and most of the related state-of-the-art systems for paraphrase detection. We also conducted a misidentification analysis to disclose the primary sources of our system errors.
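
The abstract names Normalized Google Distance (NGD) as the basis of the named-entity relatedness. As a rough illustration only, the standard NGD formula can be sketched from occurrence counts as below. The function names, the example counts, and the mapping from distance to a relatedness score in [0, 1] are assumptions made for this sketch; they are not taken from the paper and do not represent the authors' implementation.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from occurrence counts.

    fx, fy : number of documents mentioning entity x and entity y
    fxy    : number of documents mentioning both
    n      : total number of documents in the collection
    All counts here are hypothetical inputs, e.g. Wikipedia article counts.
    """
    if n <= max(fx, fy):
        raise ValueError("n must exceed both individual counts")
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Entities that never co-occur are treated as infinitely distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


def relatedness(fx, fy, fxy, n):
    """Map NGD to a score in [0, 1].

    The clipping 1 - NGD is an assumption of this sketch, not the paper's formula.
    """
    return max(0.0, 1.0 - normalized_google_distance(fx, fy, fxy, n))


if __name__ == "__main__":
    # Hypothetical counts: x in 1000 articles, y in 100, both in 10, out of 1,000,000.
    print(normalized_google_distance(1000, 100, 10, 10**6))
    print(relatedness(1000, 100, 10, 10**6))
```

A distance of 0 means the two entities always occur together; larger values indicate weaker association. How such a score is combined with the WordNet-based word similarity is specific to the paper and is not reproduced here.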

Original language: English
Journal: Language Resources and Evaluation
Early online date: 16 Apr 2019
DOI: 10.1007/s10579-019-09466-4
Publication status: E-pub ahead of print - 16 Apr 2019

Fingerprint

Heuristics
Semantics
Wikipedia
Search engine
Paraphrase
Entity
Semantic similarity
Evaluation
WordNet

Bibliographical note

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Keywords

  • Named-entity semantic relatedness
  • Paraphrase identification
  • Wikipedia
  • Word category subsumption
  • WordNet

Cite this

@article{c8162650acb94c4abbd9f1edeaf58543,
title = "A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics",
abstract = "In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain a set of named-entities. The essence of the proposal is to distinguish the computation of the semantic similarity of named-entity tokens from the rest of the sentence text. More specifically, this is based on the integration of word semantic similarity derived from WordNet taxonomic relations, and named-entity semantic relatedness inferred from Wikipedia entity co-occurrences and underpinned by Normalized Google Distance. In addition, the WordNet similarity measure is enriched with word part-of-speech (PoS) conversion aided with a Categorial Variation database (CatVar), which enhances the lexico-semantics of words. We validated our hybrid approach using two different datasets; Microsoft Research Paraphrase Corpus (MSRPC) and TREC-9 Question Variants. In our empirical evaluation, we showed that our system outperforms baselines and most of the related state-of-the-art systems for paraphrase detection. We also conducted a misidentification analysis to disclose the primary sources of our system errors.",
keywords = "Named-entity semantic relatedness, Paraphrase identification, Wikipedia, Word category subsumption, WordNet",
author = "Muhidin Mohamed and Mourad Oussalah",
note = "{\textcopyright} The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.",
year = "2019",
month = "4",
day = "16",
doi = "10.1007/s10579-019-09466-4",
language = "English",
journal = "Language Resources and Evaluation",
}

A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics. / Mohamed, Muhidin; Oussalah, Mourad.

In: Language Resources and Evaluation, 16.04.2019.

TY - JOUR

T1 - A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics

AU - Mohamed, Muhidin

AU - Oussalah, Mourad

N1 - © The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

PY - 2019/4/16

Y1 - 2019/4/16

AB - In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain a set of named-entities. The essence of the proposal is to distinguish the computation of the semantic similarity of named-entity tokens from the rest of the sentence text. More specifically, this is based on the integration of word semantic similarity derived from WordNet taxonomic relations, and named-entity semantic relatedness inferred from Wikipedia entity co-occurrences and underpinned by Normalized Google Distance. In addition, the WordNet similarity measure is enriched with word part-of-speech (PoS) conversion aided with a Categorial Variation database (CatVar), which enhances the lexico-semantics of words. We validated our hybrid approach using two different datasets; Microsoft Research Paraphrase Corpus (MSRPC) and TREC-9 Question Variants. In our empirical evaluation, we showed that our system outperforms baselines and most of the related state-of-the-art systems for paraphrase detection. We also conducted a misidentification analysis to disclose the primary sources of our system errors.

KW - Named-entity semantic relatedness

KW - Paraphrase identification

KW - Wikipedia

KW - Word category subsumption

KW - WordNet

UR - http://www.scopus.com/inward/record.url?scp=85064629533&partnerID=8YFLogxK

U2 - 10.1007/s10579-019-09466-4

DO - 10.1007/s10579-019-09466-4

M3 - Article

AN - SCOPUS:85064629533

ER -