Scientia Agricultura Sinica ›› 2012, Vol. 45 ›› Issue (21): 4534-4542. doi: 10.3864/j.issn.0578-1752.2012.21.023

• RESEARCH NOTES •

A Dynamic Clustering Method with Missing Data

XIAO Jing, LUO Ru-Jiu, SONG Wen, TANG Zai-Xiang, XU Chen-Wu

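  1. Department of Epidemiology and Health Statistics, School of Public Health, Nantong University, Nantong 226019, Jiangsu
  2. Jiangsu Provincial Key Laboratory of Crop Genetics and Physiology, Yangzhou University, Yangzhou 225009, Jiangsu
  3. Department of Epidemiology and Health Statistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, Jiangsu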
  • Received: 2012-06-15 Online: 2012-11-01 Published: 2012-09-18

Abstract: 【Objective】 The aim of this study was to investigate a method for clustering data that contain missing values, as commonly encountered in practical research. 【Method】 The paper introduces a maximum likelihood-based dynamic clustering method, which constructs a complete data set by estimating the missing values by maximum likelihood from the statistics of the observed values. The parameters of the missing data and of the different clusters are estimated by maximum likelihood, implemented via the expectation-maximization (EM) algorithm, and the objects are classified according to their Bayesian posterior probabilities. 【Result】 Simulation studies show that the proposed method not only converges quickly but also clusters data with missing values accurately. 【Conclusion】 The proposed method was further validated on Fisher's Iris dataset. The results indicated that the proposed method had a significant advantage in clustering accuracy over the algorithm that deletes records with missing data, and that its accuracy was similar to that of clustering the complete data.
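The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the form of the cluster distributions, so the sketch assumes a Gaussian mixture with diagonal covariances, marks missing values with `None`, assumes every variable is observed at least once, and uses a deterministic farthest-point start. The function name `fit_missing_mixture` and all parameters are hypothetical.

```python
import math

def fit_missing_mixture(data, k, n_iter=200, tol=1e-8):
    """EM for a k-cluster diagonal Gaussian mixture; None marks a missing value.

    Returns (labels, posteriors, means). Illustrative assumptions only.
    """
    n, d = len(data), len(data[0])
    # Per-column mean and variance computed from the observed entries only.
    cols = [[r[j] for r in data if r[j] is not None] for j in range(d)]
    cmean = [sum(c) / len(c) for c in cols]
    cvar = [sum((v - m) ** 2 for v in c) / len(c) + 1e-6
            for c, m in zip(cols, cmean)]
    filled = [[cmean[j] if r[j] is None else r[j] for j in range(d)]
              for r in data]

    # Deterministic farthest-point initialisation of the cluster means.
    mu = [filled[0][:]]
    while len(mu) < k:
        far = max(filled, key=lambda p: min(math.dist(p, m) for m in mu))
        mu.append(far[:])
    var = [cvar[:] for _ in range(k)]
    pi = [1.0 / k] * k

    prev_ll = -math.inf
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities from observed coordinates.
        post, ll = [], 0.0
        for r in data:
            logp = []
            for c in range(k):
                lp = math.log(pi[c])
                for j, x in enumerate(r):
                    if x is not None:
                        lp -= 0.5 * (math.log(2 * math.pi * var[c][j])
                                     + (x - mu[c][j]) ** 2 / var[c][j])
                logp.append(lp)
            top = max(logp)
            w = [math.exp(lp - top) for lp in logp]
            total = sum(w)
            post.append([wi / total for wi in w])
            ll += top + math.log(total)

        # M-step: a missing entry contributes its conditional expectation
        # (the old cluster mean) and conditional variance (the old variance).
        for c in range(k):
            nk = max(sum(p[c] for p in post), 1e-12)
            pi[c] = max(nk / n, 1e-12)
            new_mu, new_var = [], []
            for j in range(d):
                old_m, old_v = mu[c][j], var[c][j]
                m = sum(p[c] * (old_m if r[j] is None else r[j])
                        for p, r in zip(post, data)) / nk
                s = sum(p[c] * ((old_m - m) ** 2 + old_v if r[j] is None
                                else (r[j] - m) ** 2)
                        for p, r in zip(post, data)) / nk
                new_mu.append(m)
                new_var.append(s + 1e-6)  # small floor keeps variances positive
            mu[c], var[c] = new_mu, new_var

        if ll - prev_ll < tol:  # stop once the log-likelihood stops improving
            break
        prev_ll = ll

    # Classify each object by its largest posterior probability.
    labels = [max(range(k), key=lambda c: p[c]) for p in post]
    return labels, post, mu
```

Calling `fit_missing_mixture(rows, 2)` on a list of equal-length rows returns one cluster label per row, the posterior probabilities behind each label, and the estimated cluster means. Records with missing entries are kept and classified on their observed coordinates, which is the contrast with deletion-based clustering that the abstract draws.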

Key words: cluster analysis, missing data, posterior probability, maximum likelihood estimation
