The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. However, these algorithms have some drawbacks, e.g., they can be trapped into local optima and sensitive to initial clusters/modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some special datasets regardless the selection of the initial centers. In this paper, we developed an integer linear programming (ILP) approach for the k-modes clustering, which is independent to the initial solution and can obtain directly the optimal results for small-sized datasets. We also developed a heuristic algorithm that implements iterative partial optimization in the ILP approach based on a framework of variable neighborhood search, known as IPO-ILP-VNS, to search for near-optimal results of medium and large sized datasets with controlled computing time. Experiments on 38 datasets, including 27 synthesized small datasets and 11 known benchmark datasets from the UCI site were carried out to test the proposed ILP approach and the IPO-ILP-VNS algorithm. The experimental results outperformed the conventional and other existing enhanced k-modes algorithms in literature, updated 9 of the UCI benchmark datasets with new and improved results.
Bibliographical note© 2019, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/
Funding: This work is partially supported by the National Natural Science Foundation of China under grant nos. 71871003 and 61376042.
- Categorical clustering
- Data mining
- Integer linear programming
- Variable neighborhood search