Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering

Yiyong Xiao, Changhao Huang, Jiaoying Huang*, Ikou Kaku, Yuchun Xu

*Corresponding author for this work

Research output: Contribution to journalArticle

Abstract

The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can be trapped into local optima and sensitive to initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for the k-modes clustering, which is independent to the initial solution and can obtain directly the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results of medium and large sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The experimental results outperformed the conventional and other existing enhanced k-modes algorithms in literature, updated 9 of the UCI benchmark datasets with new and improved results.

Original languageEnglish
Pages (from-to)183-195
Number of pages13
JournalPattern Recognition
Volume90
Early online date29 Jan 2019
DOIs
Publication statusE-pub ahead of print - 29 Jan 2019

Fingerprint

Mathematical programming
Linear programming
Heuristic algorithms
Experiments

Bibliographical note

© 2019, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/

Funding: This work is partially supported by the National Natural Science Foundation of China under grant nos. 71871003 and 61376042.

Keywords

  • Categorical clustering
  • Data mining
  • Integer linear programming
  • Variable neighborhood search

Cite this

Xiao, Yiyong ; Huang, Changhao ; Huang, Jiaoying ; Kaku, Ikou ; Xu, Yuchun. / Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering. In: Pattern Recognition. 2019 ; Vol. 90. pp. 183-195.
@article{ceed3408e9b24c9faf9bdf16d57c4898,
title = "Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering",
abstract = "The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can be trapped into local optima and sensitive to initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for the k-modes clustering, which is independent to the initial solution and can obtain directly the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results of medium and large sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The experimental results outperformed the conventional and other existing enhanced k-modes algorithms in literature, updated 9 of the UCI benchmark datasets with new and improved results.",
keywords = "Categorical clustering, Data mining, Integer linear programming, Variable neighborhood search",
author = "Yiyong Xiao and Changhao Huang and Jiaoying Huang and Ikou Kaku and Yuchun Xu",
note = "{\circledC} 2019, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ Funding: This work is partially supported by the National Natural Science Foundation of China under grant nos. 71871003 and 61376042.",
year = "2019",
month = "1",
day = "29",
doi = "10.1016/j.patcog.2019.01.042",
language = "English",
volume = "90",
pages = "183--195",
journal = "Pattern Recognition",
issn = "0031-3203",
publisher = "Elsevier",

}

Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering. / Xiao, Yiyong; Huang, Changhao; Huang, Jiaoying; Kaku, Ikou; Xu, Yuchun.

In: Pattern Recognition, Vol. 90, 29.01.2019, p. 183-195.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Optimal mathematical programming and variable neighborhood search for k-modes categorical data clustering

AU - Xiao, Yiyong

AU - Huang, Changhao

AU - Huang, Jiaoying

AU - Kaku, Ikou

AU - Xu, Yuchun

N1 - © 2019, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ Funding: This work is partially supported by the National Natural Science Foundation of China under grant nos. 71871003 and 61376042.

PY - 2019/1/29

Y1 - 2019/1/29

N2 - The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can be trapped into local optima and sensitive to initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for the k-modes clustering, which is independent to the initial solution and can obtain directly the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results of medium and large sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The experimental results outperformed the conventional and other existing enhanced k-modes algorithms in literature, updated 9 of the UCI benchmark datasets with new and improved results.

AB - The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can be trapped into local optima and sensitive to initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for the k-modes clustering, which is independent to the initial solution and can obtain directly the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results of medium and large sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The experimental results outperformed the conventional and other existing enhanced k-modes algorithms in literature, updated 9 of the UCI benchmark datasets with new and improved results.

KW - Categorical clustering

KW - Data mining

KW - Integer linear programming

KW - Variable neighborhood search

UR - http://www.scopus.com/inward/record.url?scp=85060700978&partnerID=8YFLogxK

UR - https://www.sciencedirect.com/science/article/pii/S0031320319300482?via%3Dihub

U2 - 10.1016/j.patcog.2019.01.042

DO - 10.1016/j.patcog.2019.01.042

M3 - Article

AN - SCOPUS:85060700978

VL - 90

SP - 183

EP - 195

JO - Pattern Recognition

JF - Pattern Recognition

SN - 0031-3203

ER -