본문 바로가기
HOME> 논문 > 논문 검색상세

논문 상세정보

K-means 알고리즘 기반 클러스터링 인덱스 비교 연구
A Performance Comparison of Cluster Validity Indices based on K-means Algorithm

심요성   (고려대학교 산업시스템정보공학과UU0000159  ); 정지원   (고려대학교 산업시스템정보공학과UU0000159  ); 최인찬   (고려대학교 산업시스템정보공학과UU0000159  );
  • 초록

    The K-means algorithm is widely used at the initial stage of data analysis in data mining process, partly because of its low time complexity and the simplicity of practical implementation. Cluster validity indices are used along with the algorithm in order to determine the number of clusters as well as the clustering results of datasets. In this paper, we present a performance comparison of sixteen indices, which are selected from forty indices in literature, while considering their applicability to nonhierarchical clustering algorithms. Data sets used in the experiment are generated based on multivariate normal distribution. In particular, four error types including standardization, outlier generation, error perturbation, and noise dimension addition are considered in the comparison. Through the experiment the effects of varying number of points, attributes, and clusters on the performance are analyzed. The result of the simulation experiment shows that Calinski and Harabasz index performs the best through the all datasets and that Davis and Bouldin index becomes a strong competitor as the number of points increases in dataset.

  • 주제어

    Data Mining .   Cluster Analysis .   Nonhierarchical Clustering .   K-means .   Cluster Validity Index.  

  • 참고문헌 (40)

    1. Calinski T. and Harabasz, J., 'A Dendrite Method for Cluster Analysis,'Communications in Statistics, Vol. 3, No. 1, 1974, pp. 1-27 
    2. Davies D.L. and Bouldin, D.W., 'A Cluster Separation measure,'IEEE Transactions on Pattern analysis and Machine Intelligence, Vol. PAMI 1, No. 2, 1979, pp. 224-227 
    3. Dimitriadou, E., Dolnicar, S. and Weingessel, A., 'An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets,'Psychometrika, Vol. 67, No. 1, 2002 
    4. Edwards, A.W.F. and L. Cavalli Sforza, 'A Method for Cluster Analysis,'Biometrika, Vol. 56, 1965, pp. 362-375 
    5. Jain, A.K., Murty, M.N. and Flynn, P.J., 'Data Clustering: A Review,'ACM Computing Surveys, Vol. 31, No. 3, 1999 
    6. Johnson, S.C., 'Hierarchical Clustering Schemes,'Psychometrika, Vol. 32, 1967, pp. 241-254 
    7. Lingoes, J.C. and Cooper, T., 'PEP-I: A FORTRAN IV (G) program for Guttman-Lingoes Nonmetric Probability Clustering,'Behavioral Science, Vol. 16, 1971, pp. 259-261 
    8. Milligan, G.W. and Cooper, M.C., 'An Examination of Procedures for Determining the Number of Clusters in a Data Set,'Psychometrika, Vol. 50, No. 2, 1985, pp. 159-179 
    9. Ray, A.A., SAS user's guide: Statistics, Cary, North Carolina: SAS Institue, 1982 
    10. Scott, A.J. and Symons, M.J., 'Clustering Methods based on Likelihood Ratio Criteria,'Biometrics, Vol. 27, 1971, pp. 387- 397 
    11. Bock H.H., 'On Tests Concerning the Existence of a Classification,'In First International Symposium on Data Analysis and Informatics, Vol. 2, 1977, pp. 449-464, Rocquencourt, France: IRIA 
    12. Wolfe, J.H., 'Pattern Clustering by Multivariate Mixture Analysis,'Multivariate Behavioral Research, Vol. 5, 1970, pp. 329-350 
    13. Ball, G.H. and Hall, D.J., 'ISODATA, A Novel Method of Data Analysis and Pattern Classification,'Menlo Park: Stanford Research Institute. (NTIS No. AD 699616), 1965 
    14. Berry, M.J.A. and Linoff, G.S., Mastering Data Mining - The Art and Science of Customer Relationship Management, John Wiley and Sons, Inc. 2000 
    15. Hubert, L.J. and Levin, J.R., 'A General Statistical Framework for Assessing Categorical Clustering in Free Recall,'Psychological Bulletin, Vol. 83, 1976, pp. 1072- 1080 
    16. MacQueen, J.B., 'Some Methods for Classification and Analysis of Multivariate Observations,'Proceedings of 5 th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, Vol. 1, 1967, pp. 281-297 
    17. Duda, R.O. and Hart, P.E., Pattern Classification and Scene Analysis, New York: Wiley, 1973 
    18. Marriot, F.H.B., 'Practical Problems in a Method of Cluster Analysis,'Biometrics, Vol. 27, 1975, pp. 456-460 
    19. Halkidi, M. and Vazirgiannis, M., 'Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set,'Proceedings IEEE International Conference on Data Mining, 2001, pp. 187-194 
    20. Ratkowsky, D.A. and Lance, G.N., 'A Criterion for determining the number of groups in a classification,'Australian Computer Journal, Vol. 10, 1978, pp. 115-117 
    21. Beale, E.M.L., Cluster Analysis, London: Scientific Control Systems, 1969 
    22. Bezdek, J.C. and Pal, N.R., 'Some New Indexes of Cluster Validity,'IEEE Transactions on Systems, Man, and Cybernetics-PART B: CYBERNETICS, Vol. 28, No. 3, 1998 
    23. Sneath, P.H.A., 'A Method for Testing the Distinctness of Clusters: A Test of the Disjunction of two Clusters in Euclidean Space as Measured by their Overlap,'Mathematical Geology, Vol. 9, 1977, pp. 123-143 
    24. Frey, T. and Groenewoud, H.V., 'A cluster Analysis of the D-squared Matrix of White Spruce Stands in Saskatchewan based on the Maximum Minimum Principle,'Journal of Ecology, Vol. 60, 1972, pp. 873-886 
    25. Milligan G.W., 'A Monte Carlo Study of Thirty Internal Criterion Measures for Cluster Analysis,'Psychometrika, Vol. 46, 1981, pp. 187-199 
    26. Day, N.E., 'Estimating the Components of a Mixture of Normal Distributions,'Biometrika, Vol. 56, 1969, pp. 463-474 
    27. McClain, J.O. and Rao, V.R., 'CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects,'Journal of Marketing Research, Vol. 12, 1975, pp. 456-460 
    28. Milligan, G.W., 'An Examination of the Effect of six Types of Error Perturbation on Fifteen Clustering Algorithms,'Psychometrika, Vol. 45, No. 3, 1980, pp. 325-342 
    29. Mountford, M.D., 'A Test for the Difference between clusters, In G.P. Patil, E.C. Pielou, and W.E. Waters(EDs.),'Statistical Ecology, Vol. 3, 1970, pp. 237-257, University Park, Pa.: Pennsylvania State University Press 
    30. Ray, S. and Turi, R.H., 'Determination of Number of Clusters in k-means Clustering and Application in Colour Image Segmentation,'in Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, 1999, pp. 137-143 
    31. Baker, F.B. and Hubert, L.J., 'Measuring the Power of Hierarchical Cluster Analysis,'Journal of the American Statistical Association, Vol. 70, 1975, pp. 31-38 
    32. Hartigan, J.A., Clustering Algorithms, New Work, Wiley, 1975 
    33. Milligan, G.W., 'An Algorithm for Generating Artificial Test Clusters,'Psychometrika, Vol. 50, No. 1, 1985, pp. 123-127 
    34. Day, W.H.E., Complexity Theory: An Introduction for Practitioners of Classification, Clustering and Classification, P. Arabie and L. Hubert, Eds. World Scientific Publishing Co., Inc., River Edge, NJ., 1992 
    35. Friedman, H.P. and Rubin, J., 'On Some Invariant Criteria for Grouping Data,'Journal of the American Statistical Association, Vol. 62, 1967, pp. 1159-1178 
    36. Forgy, E., 'Cluster Analysis of Multivariate Data: Effciency vs. Interpretability of Classifications,'Biometrics, Vol. 21, 1965, 768 
    37. Gnanadesikan, R., Kettenring, J.R. and Landwehr, J.M., 'Interpreting and Assessing the Results of Cluster Analyses,'Bulletin of the International Statistical Institute, Vol. 47, 1977, pp. 451-463 
    38. Rohlf, F.J., 'Methods of Comparing Classifications,'Annual Review of Ecology and Systematics, Vol. 5, 1974, pp. 101-113 
    39. Kurita, T., 'An Efficient Agglomerative Clustering Algorithm using a Heap,'Pattern Recognition, Vol. 24, No. 3, 1991, pp. 205-209 
    40. Mojena, R., 'Hierarchical Grouping Methods and Stopping Rules: An Evaluation,'The Computer Journal, Vol. 20, 1977, pp. 359-363 

 저자의 다른 논문

  • 정지원 (1)

    1. 2004 "데이터 마이닝 기반의 군사특기 분류 방법론 연구" 한국국방경영분석학회지 = Journal of the Military Operations Research Society of Korea 30 (1): 1~14    
  • 최인찬 (7)

 활용도 분석

  • 상세보기

    amChart 영역
  • 원문보기

    amChart 영역


  • NDSL :

유료 다운로드의 경우 해당 사이트의 정책에 따라 신규 회원가입, 로그인, 유료 구매 등이 필요할 수 있습니다. 해당 사이트에서 발생하는 귀하의 모든 정보활동은 NDSL의 서비스 정책과 무관합니다.

원문복사신청을 하시면, 일부 해외 인쇄학술지의 경우 외국학술지지원센터(FRIC)에서
무료 원문복사 서비스를 제공합니다.

NDSL에서는 해당 원문을 복사서비스하고 있습니다. 위의 원문복사신청 또는 장바구니 담기를 통하여 원문복사서비스 이용이 가능합니다.

이 논문과 함께 이용한 콘텐츠
이 논문과 함께 출판된 논문 + 더보기