Scientia Agricultura Sinica ›› 2007, Vol. 40 ›› Issue (10): 2119-2127.

• Crop Genetics and Breeding · Germplasm Resources •

Comparison among Gene Supervised Clustering Methods for DNA Microarray Expression Data

Xiao Jing (肖静), Yang Zefeng (杨泽峰), Xu Chenwu (徐辰武)

  1. Jiangsu Provincial Key Laboratory of Genetics and Physiology, Yangzhou University (扬州大学江苏省遗传生理重点实验室)
  • Received: 2006-09-04 Online: 2007-10-10 Published: 2007-10-10
  • Corresponding author: Xu Chenwu

Abstract: [Objective] To compare the strengths, weaknesses and appropriate use cases of several supervised clustering methods. [Method] Two Gaussian mixture model-based supervised clustering methods (GMM-I and GMM-II), K-nearest-neighbor (KNN), binary support vector machines (SVMs) and five multicategory support vector machines (MC-SVMs) were applied to computer-simulated data and to two real microarray data sets: yeast cell cycle data and data from 60 human cancer cell lines (NCI-60). The methods were compared by their false positive (FP), false negative (FN), true positive and true negative counts, clustering accuracy and the Matthews correlation coefficient (MCC). [Result] (1) When clustering the expression profiles of thousands of genes, the two GMM methods achieved the highest clustering accuracy and the fewest total FP+FN errors, provided that the data follow a finite mixture of multivariate Gaussian distributions. When the training sample was small, GMM-II was more accurate than GMM-I. (2) Overall, the MC-SVMs were the most robust and most widely applicable methods and were less sensitive to high dimensionality. In clustering thousands of gene expression profiles they were second only to the GMM methods in accuracy, and they were also suitable for clustering a few dozen samples using thousands of genes as features, where they were more robust than the other techniques. (3) Among the MC-SVMs, OVO and DAGSVM performed better with large sample sizes; OVR, WW and CS gave higher clustering accuracy and MCC values with small sample sizes; and the five methods performed similarly with moderate sample sizes. [Conclusion] The method should be chosen according to the characteristics of the data and the needs of the experiment, and at least two methods should be tried in order to obtain the best clustering result.

Key words: Microarray, Supervised Clustering, K-Nearest-Neighbor, Support Vector Machines
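
The evaluation measures named in the abstract (FP, FN, clustering accuracy and MCC) can be illustrated for a two-class case with the short sketch below. It is not the authors' code: the function names and the example labels are hypothetical, and it assumes the standard definitions of accuracy and of the Matthews correlation coefficient computed from a 2×2 confusion table.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for one class treated as 'positive'."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # predicted in the class, but does not belong to it
        elif truth == positive:
            fn += 1  # belongs to the class, but was missed
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of items assigned to the correct side of the split."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels: 1 = gene belongs to the cluster, 0 = it does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("accuracy:", accuracy(tp, fp, fn, tn))
print("MCC:", mcc(tp, fp, fn, tn))
```

Unlike accuracy, MCC uses all four cells of the confusion table, so it stays informative when one class is much larger than the other, which is common when only a few of thousands of genes belong to a given cluster.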