A classification approach for less popular webpages based on latent semantic analysis and rough set model

Jun Wang*, Jiaxu Peng, Ou Liu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

With the explosive growth of web information, webpage classification faces great challenges. Computers have difficulty understanding the semantic meaning of textual or non-textual webpages. Fortunately, Web 2.0 based collaborative tagging systems bring new opportunities to solve this problem: they abstract structured tags from the unstructured content of webpages. However, large numbers of webpages on the Internet are less popular. Their tagging information is sparse, which makes their topics unclear and leads to ambiguous classification. Motivated by this "ambiguous classification", we name a less popular webpage a "hesitant webpage". In this paper, we propose an advanced approach for classifying hesitant webpages. First, hesitant webpages are divided into bridges, hubs and attached webpages according to their roles on the Internet. Second, attached webpages are classified by mining and extending their information from two perspectives. One is latent semantic analysis (LSA), which is applied to fully explore the semantic meaning of sparse tags. It promotes accurate recognition of webpages that are semantically close to attached webpages. The other is the proposed density-relation-based rough set model, which measures the affiliation degree of attached webpages in different categories. Experiments on real data show that our approach effectively classifies hesitant webpages based on their semantic meaning.
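
To make the idea of an "affiliation degree" concrete, the following is a minimal sketch of the classical (Pawlak) rough set constructs: lower and upper approximations and the rough membership function. It is not the density-relation-based model proposed in the paper, whose definition is not given in this abstract. The assumption that pages with identical tag sets are indiscernible, the function names, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def equivalence_classes(tags_by_page):
    """Group pages that carry exactly the same tag set.

    Assumption for illustration: identical tag sets make pages indiscernible.
    """
    groups = defaultdict(set)
    for page, tags in tags_by_page.items():
        groups[frozenset(tags)].add(page)
    return list(groups.values())


def approximations(classes, category):
    """Return the classical lower and upper approximations of `category`."""
    lower, upper = set(), set()
    for block in classes:
        if block <= category:   # block lies entirely inside the category
            lower |= block
        if block & category:    # block overlaps the category
            upper |= block
    return lower, upper


def rough_membership(page, classes, category):
    """Fraction of the page's equivalence class that belongs to `category`."""
    for block in classes:
        if page in block:
            return len(block & category) / len(block)
    raise KeyError(page)


# Hypothetical toy data: page identifiers mapped to their (sparse) tags.
tags_by_page = {
    "p1": {"python", "tutorial"},
    "p2": {"python", "tutorial"},
    "p3": {"python"},
    "p4": {"recipes"},
    "p5": {"python"},
}
# Hypothetical set of pages already assigned to a "programming" category.
programming = {"p1", "p2", "p3"}

classes = equivalence_classes(tags_by_page)
lower, upper = approximations(classes, programming)
degree_p5 = rough_membership("p5", classes, programming)
```

In this toy setting, pages in the lower approximation certainly belong to the category, pages in the upper approximation possibly belong to it, and the rough membership value gives a graded degree for a sparsely tagged page such as `p5`, which shares its tags with both a labeled and an unlabeled page.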

Original language: English
Pages (from-to): 642-648
Number of pages: 7
Journal: Expert Systems with Applications
Volume: 42
Issue number: 1
Early online date: 19 Aug 2014
Publication status: Published - 1 Jan 2015

Keywords

  • Complex network analysis
  • Latent semantic analysis
  • Rough set
  • Webpage classification
