TY - JOUR
T1 - A classification approach for less popular webpages based on latent semantic analysis and rough set model
AU - Wang, Jun
AU - Peng, Jiaxu
AU - Liu, Ou
PY - 2015/1/1
Y1 - 2015/1/1
N2 - With the explosive growth of web information, webpage classification faces great challenges. Computers have difficulty understanding the semantic meaning of textual or non-textual webpages. Fortunately, Web 2.0 based collaborative tagging systems bring new opportunities to solve this problem: they abstract structured tags from the unstructured content of webpages. However, large numbers of webpages on the Internet are less popular. Their tagging information is sparse, which makes their topics unclear and leads to ambiguous classification. Inspired by this "ambiguous classification", we name the less popular webpage a "hesitant webpage". In this paper, we propose an advanced approach for classifying hesitant webpages. First, hesitant webpages are divided into bridges, hubs and attached webpages according to their roles on the Internet. Second, attached webpages are classified by mining and extending their information from two perspectives. One is latent semantic analysis (LSA), which is applied to fully explore the semantic meaning of sparse tags. This helps to accurately identify webpages that are semantically close to attached webpages. The other is the proposed density-relation-based rough set model, which measures the affiliation degree of attached webpages in different categories. Experimental results on real data show that our approach effectively classifies hesitant webpages based on their semantic meaning.
AB - With the explosive growth of web information, webpage classification faces great challenges. Computers have difficulty understanding the semantic meaning of textual or non-textual webpages. Fortunately, Web 2.0 based collaborative tagging systems bring new opportunities to solve this problem: they abstract structured tags from the unstructured content of webpages. However, large numbers of webpages on the Internet are less popular. Their tagging information is sparse, which makes their topics unclear and leads to ambiguous classification. Inspired by this "ambiguous classification", we name the less popular webpage a "hesitant webpage". In this paper, we propose an advanced approach for classifying hesitant webpages. First, hesitant webpages are divided into bridges, hubs and attached webpages according to their roles on the Internet. Second, attached webpages are classified by mining and extending their information from two perspectives. One is latent semantic analysis (LSA), which is applied to fully explore the semantic meaning of sparse tags. This helps to accurately identify webpages that are semantically close to attached webpages. The other is the proposed density-relation-based rough set model, which measures the affiliation degree of attached webpages in different categories. Experimental results on real data show that our approach effectively classifies hesitant webpages based on their semantic meaning.
KW - Complex network analysis
KW - Latent semantic analysis
KW - Rough set
KW - Webpage classification
UR - http://www.scopus.com/inward/record.url?scp=84908547484&partnerID=8YFLogxK
UR - https://www.sciencedirect.com/science/article/pii/S0957417414004898?via%3Dihub
U2 - 10.1016/j.eswa.2014.08.013
DO - 10.1016/j.eswa.2014.08.013
M3 - Article
AN - SCOPUS:84908547484
SN - 0957-4174
VL - 42
SP - 642
EP - 648
JO - Expert Systems with Applications
JF - Expert Systems with Applications
IS - 1
ER -