High quality graph-based similarity search

Weiren Yu, Julie Ann McCann

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

SimRank is an influential link-based similarity measure that has been used in many fields of Web search and sociometry. The best-of-breed method by Kusumoto et. al., however, does not always deliver high-quality results, since it fails to accurately obtain its diagonal correction matrix D. Besides, SimRank is also limited by an unwanted "connectivity trait": increasing the number of paths between nodes a and b often incurs a decrease in score s(a,b). The best-known solution, SimRank++, cannot resolve this problem, since a revised score will be zero if a and b have no common in-neighbors. In this paper, we consider high-quality similarity search. Our scheme, SR#, is efficient and semantically meaningful: (1) We first formulate the exact D, and devise a "varied-D" method to accurately compute SimRank in linear memory. Moreover, by grouping computation, we also reduce the time of from quadratic to linear in the number of iterations. (2) We design a "kernel-based" model to improve the quality of SimRank, and circumvent the "connectivity trait" issue. (3) We give mathematical insights to the semantic difference between SimRank and its variant, and correct an argument: "if D is replaced by a scaled identity matrix, top-K rankings will not be affected much". The experiments confirm that SR# can accurately extract high-quality scores, and is much faster than the state-of-the-art competitors.
Original languageEnglish
Title of host publicationSIGIR '15 : Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Place of PublicationNew York, NY (US)
PublisherACM
Pages83-93
Number of pages11
ISBN (Print)978-1-4503-3621-5
DOIs
Publication statusPublished - 9 Aug 2015
Event38th International ACM SIGIR Conference on Research and Development in Information Retrieval - Santiago, Chile
Duration: 6 Aug 201513 Aug 2015

Conference

Conference38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Abbreviated titleSIGIR 2015
CountryChile
CitySantiago
Period6/08/1513/08/15

Fingerprint

Semantics
Data storage equipment
Experiments

Keywords

  • link analysis
  • graph-based similarity
  • high quality search

Cite this

Yu, W., & McCann, J. A. (2015). High quality graph-based similarity search. In SIGIR '15 : Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 83-93). New York, NY (US): ACM. https://doi.org/10.1145/2766462.2767720
Yu, Weiren ; McCann, Julie Ann. / High quality graph-based similarity search. SIGIR '15 : Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY (US) : ACM, 2015. pp. 83-93
@inproceedings{7a3f688e455a4edc85681bf3047e2a59,
title = "High quality graph-based similarity search",
abstract = "SimRank is an influential link-based similarity measure that has been used in many fields of Web search and sociometry. The best-of-breed method by Kusumoto et. al., however, does not always deliver high-quality results, since it fails to accurately obtain its diagonal correction matrix D. Besides, SimRank is also limited by an unwanted {"}connectivity trait{"}: increasing the number of paths between nodes a and b often incurs a decrease in score s(a,b). The best-known solution, SimRank++, cannot resolve this problem, since a revised score will be zero if a and b have no common in-neighbors. In this paper, we consider high-quality similarity search. Our scheme, SR#, is efficient and semantically meaningful: (1) We first formulate the exact D, and devise a {"}varied-D{"} method to accurately compute SimRank in linear memory. Moreover, by grouping computation, we also reduce the time of from quadratic to linear in the number of iterations. (2) We design a {"}kernel-based{"} model to improve the quality of SimRank, and circumvent the {"}connectivity trait{"} issue. (3) We give mathematical insights to the semantic difference between SimRank and its variant, and correct an argument: {"}if D is replaced by a scaled identity matrix, top-K rankings will not be affected much{"}. The experiments confirm that SR# can accurately extract high-quality scores, and is much faster than the state-of-the-art competitors.",
keywords = "link analysis, graph-based similarity, high quality search",
author = "Weiren Yu and McCann, {Julie Ann}",
year = "2015",
month = "8",
day = "9",
doi = "10.1145/2766462.2767720",
language = "English",
isbn = "978-1-4503-3621-5",
pages = "83--93",
booktitle = "SIGIR '15 : Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval",
publisher = "ACM",
address = "United States",

}

Yu, W & McCann, JA 2015, High quality graph-based similarity search. in SIGIR '15 : Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY (US), pp. 83-93, 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 6/08/15. https://doi.org/10.1145/2766462.2767720

High quality graph-based similarity search. / Yu, Weiren; McCann, Julie Ann.

SIGIR '15 : Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY (US) : ACM, 2015. p. 83-93.

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - High quality graph-based similarity search

AU - Yu, Weiren

AU - McCann, Julie Ann

PY - 2015/8/9

Y1 - 2015/8/9

N2 - SimRank is an influential link-based similarity measure that has been used in many fields of Web search and sociometry. The best-of-breed method by Kusumoto et. al., however, does not always deliver high-quality results, since it fails to accurately obtain its diagonal correction matrix D. Besides, SimRank is also limited by an unwanted "connectivity trait": increasing the number of paths between nodes a and b often incurs a decrease in score s(a,b). The best-known solution, SimRank++, cannot resolve this problem, since a revised score will be zero if a and b have no common in-neighbors. In this paper, we consider high-quality similarity search. Our scheme, SR#, is efficient and semantically meaningful: (1) We first formulate the exact D, and devise a "varied-D" method to accurately compute SimRank in linear memory. Moreover, by grouping computation, we also reduce the time of from quadratic to linear in the number of iterations. (2) We design a "kernel-based" model to improve the quality of SimRank, and circumvent the "connectivity trait" issue. (3) We give mathematical insights to the semantic difference between SimRank and its variant, and correct an argument: "if D is replaced by a scaled identity matrix, top-K rankings will not be affected much". The experiments confirm that SR# can accurately extract high-quality scores, and is much faster than the state-of-the-art competitors.

AB - SimRank is an influential link-based similarity measure that has been used in many fields of Web search and sociometry. The best-of-breed method by Kusumoto et. al., however, does not always deliver high-quality results, since it fails to accurately obtain its diagonal correction matrix D. Besides, SimRank is also limited by an unwanted "connectivity trait": increasing the number of paths between nodes a and b often incurs a decrease in score s(a,b). The best-known solution, SimRank++, cannot resolve this problem, since a revised score will be zero if a and b have no common in-neighbors. In this paper, we consider high-quality similarity search. Our scheme, SR#, is efficient and semantically meaningful: (1) We first formulate the exact D, and devise a "varied-D" method to accurately compute SimRank in linear memory. Moreover, by grouping computation, we also reduce the time of from quadratic to linear in the number of iterations. (2) We design a "kernel-based" model to improve the quality of SimRank, and circumvent the "connectivity trait" issue. (3) We give mathematical insights to the semantic difference between SimRank and its variant, and correct an argument: "if D is replaced by a scaled identity matrix, top-K rankings will not be affected much". The experiments confirm that SR# can accurately extract high-quality scores, and is much faster than the state-of-the-art competitors.

KW - link analysis

KW - graph-based similarity

KW - high quality search

UR - http://dl.acm.org/citation.cfm?id=2767720

U2 - 10.1145/2766462.2767720

DO - 10.1145/2766462.2767720

M3 - Conference contribution

SN - 978-1-4503-3621-5

SP - 83

EP - 93

BT - SIGIR '15 : Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

PB - ACM

CY - New York, NY (US)

ER -

Yu W, McCann JA. High quality graph-based similarity search. In SIGIR '15 : Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY (US): ACM. 2015. p. 83-93 https://doi.org/10.1145/2766462.2767720