A Dynamic Clustering Method with Missing Data


Scientia Agricultura Sinica, 2012, Vol. 45, Issue (21): 4534-4542. doi: 10.3864/j.issn.0578-1752.2012.21.023

• Research Notes •


XIAO Jing, LUO Ru-Jiu, SONG Wen, TANG Zai-Xiang, XU Chen-Wu

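1. Department of Epidemiology and Health Statistics, School of Public Health, Nantong University, Nantong 226019, Jiangsu
2. Jiangsu Provincial Key Laboratory of Crop Genetics and Physiology, Yangzhou University, Yangzhou 225009, Jiangsu
3. Department of Epidemiology and Health Statistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, Jiangsu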

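Received: 2012-06-15; Publication date: 2012-11-01; Release date: 2012-09-18
Corresponding author: XU Chen-Wu, E-mail: qtls@yzu.edu.cn
About the author: XIAO Jing, Tel: 15996553775; E-mail: jxiaont@ntu.edu.cn
Funding: National Natural Science Foundation of China, Youth Fund (31000539, 31100882); Open Project of the Jiangsu Provincial Key Laboratory (K10003)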




Abstract

Abstract: 【Objective】 This study investigates the clustering of incomplete data, that is, data with missing values, as encountered in practical research. 【Method】 A maximum likelihood-based dynamic clustering method is introduced. Auxiliary information from correlated variables is used to estimate each missing value and assign it a reasonable substitute, which yields a "complete" data set. The expectation-maximization (EM) algorithm is then iterated on this data set, so that the parameter estimates and the substitute values for the missing data gradually converge. Each individual is assigned to a cluster according to its Bayesian posterior probability, which accomplishes the dynamic clustering. 【Result】 Simulation studies showed that the missing-value substitution method converged well and clustered nearly all of the data with missing values correctly. 【Conclusion】 Fisher's Iris data set confirmed the feasibility of the missing-value substitution method: its clustering accuracy was higher than that of the missing-value deletion method and close to that of clustering on the complete data.

Key words: cluster analysis, missing data, posterior probability, maximum likelihood estimation
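
The substitute-then-re-estimate loop described in the Method can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes Gaussian clusters with diagonal covariance, so information from the other variables reaches a missing entry only through the posterior cluster probabilities. The function name `em_cluster_missing`, the use of `None` to mark a missing value, and the parameters `n_iter`, `seed`, and `var_floor` are all hypothetical choices made for this example. It also assumes every variable has at least one observed value.

```python
import math
import random

def em_cluster_missing(rows, k, n_iter=50, seed=0, var_floor=1e-6):
    """Cluster rows (lists of floats, None = missing) with a diagonal Gaussian mixture.

    Illustrative sketch only: names, defaults, and the diagonal-covariance
    model are assumptions, not taken from the paper.
    """
    n, p = len(rows), len(rows[0])
    # Initial substitute values: the mean of each variable's observed entries.
    col_mean = []
    for v in range(p):
        seen = [r[v] for r in rows if r[v] is not None]
        col_mean.append(sum(seen) / len(seen))
    filled = [[col_mean[v] if r[v] is None else r[v] for v in range(p)] for r in rows]
    # Initial parameters: k random rows as means, pooled variances, equal weights.
    rng = random.Random(seed)
    means = [list(filled[i]) for i in rng.sample(range(n), k)]
    col_var = [sum((f[v] - col_mean[v]) ** 2 for f in filled) / n + var_floor
               for v in range(p)]
    variances = [list(col_var) for _ in range(k)]
    weights = [1.0 / k] * k
    post = [[1.0 / k] * k for _ in range(n)]
    for _ in range(n_iter):
        for i, r in enumerate(rows):
            # E-step: posterior of each cluster, using the observed variables only.
            logp = []
            for j in range(k):
                lp = math.log(weights[j])
                for v in range(p):
                    if r[v] is not None:
                        s2 = variances[j][v]
                        lp -= 0.5 * (math.log(2 * math.pi * s2)
                                     + (r[v] - means[j][v]) ** 2 / s2)
                logp.append(lp)
            top = max(logp)
            unnorm = [math.exp(lp - top) for lp in logp]
            total = sum(unnorm)
            post[i] = [u / total for u in unnorm]
            # Substitution: posterior-weighted cluster mean replaces each missing value.
            for v in range(p):
                if r[v] is None:
                    filled[i][v] = sum(post[i][j] * means[j][v] for j in range(k))
        # M-step: re-estimate the parameters from the completed data set.
        for j in range(k):
            w = [post[i][j] + 1e-12 for i in range(n)]  # tiny floor avoids division by zero
            sw = sum(w)
            weights[j] = sw / n
            for v in range(p):
                means[j][v] = sum(w[i] * filled[i][v] for i in range(n)) / sw
                variances[j][v] = sum(w[i] * (filled[i][v] - means[j][v]) ** 2
                                      for i in range(n)) / sw + var_floor
    # Assign each individual to the cluster with the largest posterior probability.
    labels = [max(range(k), key=lambda j: post[i][j]) for i in range(n)]
    return labels, filled, post
```

Calling `em_cluster_missing(rows, k=3)` on a list of rows returns the cluster labels, the completed data set, and the posterior probabilities. Observed values are never overwritten; only the `None` entries are replaced and then updated at each iteration. A fuller treatment would use full covariance matrices, so that a missing value is also predicted from the correlated variables within a cluster.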

XIAO Jing, LUO Ru-Jiu, SONG Wen, TANG Zai-Xiang, XU Chen-Wu. A Dynamic Clustering Method with Missing Data[J]. Scientia Agricultura Sinica, 2012, 45(21): 4534-4542.


Link to this article: https://www.chinaagrisci.com/CN/10.3864/j.issn.0578-1752.2012.21.023

https://www.chinaagrisci.com/CN/Y2012/V45/I21/4534

