Scholarly Research Excellence

Digital Open Science Index

Commenced in January 2007 Frequency: Monthly Edition: International Publications Count: 29151


10007221
Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency
Abstract:
Clustering is a well-known data mining technique used in pattern recognition and information retrieval. The dataset to be clustered can contain either categorical or numeric data, and each type of data has its own clustering algorithm: k-means for numeric datasets and k-modes for categorical datasets. A main problem encountered in data mining applications is clustering categorical data, which is prevalent in such datasets. One way to cluster categorical values is to transform the categorical attributes into numeric measures and apply the k-means algorithm directly instead of k-modes. This paper experiments with such an approach, transforming the categorical values into numeric ones using the relative frequency of each modality in its attribute. The proposed approach is compared with a previous method that transforms the categorical datasets into binary values. The scalability and accuracy of the two methods are evaluated experimentally. The results show that the proposed method outperforms the binary method in all cases.
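The two encodings compared in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes that the relative frequency of a modality is its count within the attribute divided by the number of records. The function names (relative_frequency_encode, binary_encode, kmeans) and the toy records are hypothetical, and the k-means shown is a plain Lloyd's iteration, not necessarily the variant used in the paper.

```python
from collections import Counter
import random


def relative_frequency_encode(rows):
    """Replace each categorical value by the relative frequency of that
    value within its own attribute (assumed definition: count / n)."""
    n = len(rows)
    counts = [Counter(col) for col in zip(*rows)]
    return [[counts[j][value] / n for j, value in enumerate(row)]
            for row in rows]


def binary_encode(rows):
    """Baseline: one 0/1 indicator per (attribute, modality) pair."""
    levels = [sorted(set(col)) for col in zip(*rows)]
    return [[1.0 if value == level else 0.0
             for j, value in enumerate(row)
             for level in levels[j]]
            for row in rows]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: squared_distance(p, centroids[c]))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return labels, centroids


# Hypothetical toy records with two categorical attributes.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
]

encoded = relative_frequency_encode(rows)   # one numeric column per attribute
binary = binary_encode(rows)                # one column per modality
labels, centroids = kmeans(encoded, k=2)
```

Under this assumed definition, the frequency encoding keeps one numeric column per attribute, while the binary encoding adds one column per modality. Two different modalities that occur equally often in an attribute receive the same code in this sketch ("small" and "large" above both map to 0.5).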
Digital Object Identifier (DOI):

