Random centroid initialization for improving centroid-based clustering

Authors

Vadim Romanuke

DOI:

https://doi.org/10.31181/dmame622023742

Keywords:

Centroid-based clustering, k-means++, centroid initialization, random initialization, algorithm multiple runs

Abstract

A method for improving centroid-based clustering is suggested. The improvement is built on diversifying the k-means++ initialization. The k-means++ algorithm, claimed to be a better version of k-means, is tested in a computational set-up in which the dataset size, the number of features, and the number of clusters are varied. The statistics gathered from this testing show that, in roughly 50 % of clustering instances, k-means++ outputs worse results than k-means with random centroid initialization. The advantage of random centroid initialization grows as both the dataset size and the number of features increase. To reduce this possible underperformance of k-means++, the k-means algorithm is run on a separate processor core in parallel with the k-means++ algorithm, whereupon the better of the two results is selected. The number of k-means++ runs is set to be no less than the number of k-means runs. By incorporating random centroid initialization as an additional seeding method, the k-means++ algorithm gains about 0.05 % in accuracy in every second clustering instance.
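
The parallel run-and-select scheme described above can be illustrated with a minimal sketch. The Python fragment below is not the paper's implementation: scikit-learn's KMeans, the synthetic make_blobs dataset, and the chosen run counts are all illustrative assumptions. It runs k-means++ and randomly initialized k-means in two worker processes (standing in for separate processor cores) and keeps whichever result attains the lower within-cluster sum of squares (inertia):

# Minimal sketch (assumed setup): k-means++ and randomly initialized
# k-means run in parallel processes; the clustering with the lower
# inertia (within-cluster sum of squares) is selected.
from concurrent.futures import ProcessPoolExecutor

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def best_kmeans(X, k, init, n_runs, seed):
    # n_init repeats the seeding n_runs times and keeps the best run,
    # so each branch already returns its best-of-n_runs clustering.
    model = KMeans(n_clusters=k, init=init, n_init=n_runs, random_state=seed)
    model.fit(X)
    return model.inertia_, model.labels_


if __name__ == "__main__":
    # Illustrative dataset; the size, feature count, and cluster count
    # are arbitrary stand-ins for the varied computational set-up.
    X, _ = make_blobs(n_samples=2000, n_features=8, centers=5, random_state=0)
    k = 5
    with ProcessPoolExecutor(max_workers=2) as pool:
        # Per the abstract, k-means++ gets no fewer runs than randomly
        # initialized k-means (here both branches get 10).
        fut_pp = pool.submit(best_kmeans, X, k, "k-means++", 10, 0)
        fut_rand = pool.submit(best_kmeans, X, k, "random", 10, 1)
        inertia_pp, labels_pp = fut_pp.result()
        inertia_rand, labels_rand = fut_rand.result()
    # Select the better of the two results.
    labels = labels_pp if inertia_pp <= inertia_rand else labels_rand
    print(f"k-means++ inertia: {inertia_pp:.2f}, random inertia: {inertia_rand:.2f}")

Comparing inertia values is one reasonable stand-in for "the better result"; the same skeleton accommodates any other clustering quality criterion, as well as a larger run budget for the k-means++ branch.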

Published

2023-08-04

How to Cite

Romanuke, V. (2023). Random centroid initialization for improving centroid-based clustering. Decision Making: Applications in Management and Engineering, 6(2), 734–746. https://doi.org/10.31181/dmame622023742