Scientia Agricultura Sinica ›› 2025, Vol. 58 ›› Issue (15): 2960-2979. doi: 10.3864/j.issn.0578-1752.2025.15.003

• Crop Genetics & Breeding · Germplasm Resources · Molecular Genetics •

Genomic Selection Method Based on G2PSE Stacking Ensemble

ZHUANG RunJie1,2, LIU HuiMing1,2, WANG ShiYu1,2, LÜ WanPing1,2, WEN YongXian1,2,*

  1 College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002
  2 Institute of Statistics and Applications, Fujian Agriculture and Forestry University, Fuzhou 350002
  • Received: 2025-02-07 Accepted: 2025-05-16 Published: 2025-08-01 Online: 2025-07-30
  • Corresponding author:
    WEN YongXian, E-mail:
  • Contact: ZHUANG RunJie, E-mail: 2215372440@qq.com
  • Funding:
    Natural Science Foundation of Fujian Province (2021J01126); National Natural Science Foundation of China (32071892); Science and Technology Innovation Special Fund of Fujian Agriculture and Forestry University (KFB22094XA)

Abstract:

【Objective】 Genomic selection (GS) is a core technology for predicting individual phenotypes or genetic values from genome-wide marker information, and it has both theoretical value and practical significance in agricultural breeding and genetic research. However, redundancy among high-dimensional features and the modeling of nonlinear relationships remain key challenges in genomic selection. This study proposes a genotype to phenotype stacking ensemble (G2PSE) model that aims to improve prediction accuracy and generalization ability and to provide an efficient solution for high-dimensional genomic data analysis.
【Method】 The G2PSE stacking ensemble framework was constructed by combining ten-fold cross-validation, ensemble learning, feature selection (the LAR algorithm), and feature enhancement strategies. The model used random forest (RF), support vector regression (SVR), and gradient boosting regression (GBR) as base learners, with ordinary least squares regression (OLSR) as the meta-learner. The effect of alternative meta-learners (random forest, support vector regression, and neural networks) on model performance was also evaluated. The G2PSE model comprised three core submodels: (1) the all-feature stacking ensemble (AFSE), which used all SNP features; (2) the LAR-feature stacking ensemble (LFSE), which reduced redundant information through feature selection to improve generalization; and (3) the LAR-feature enhanced stacking ensemble (LFESE), which combined feature selection with an enhancement strategy to optimize prediction in high-dimensional data. The performance of three feature enhancement variants (AFESE, HFESEⅠ, and HFESEⅡ) was also explored. Finally, the model was evaluated on multi-trait datasets from three species (wheat, soybean, and tilapia) and then on an independent test set drawn from the Pepper203 dataset to verify its robustness.
【Result】 The G2PSE model significantly outperformed traditional methods and single machine learning models on two metrics, the Pearson correlation coefficient (PCC) and the mean absolute error (MAE). Among the three core submodels, LFESE performed best by combining feature selection with enhancement; LFSE reduced redundant information and improved generalization through feature selection; and AFSE had a significant advantage in capturing global genotypic information. The three feature enhancement variants further confirmed that feature quality matters more than feature quantity for prediction performance. Among the meta-learners tested, the linear regression model performed best, and in terms of computational efficiency the LFESE and LFSE submodels showed relatively balanced performance. A reasonable feature selection threshold was also crucial for model performance: the optimal threshold was 10%-20% for low-dimensional datasets and 1% for high-dimensional datasets. Finally, evaluation on the independent test set showed that the LFESE submodel had the best generalization ability.
【Conclusion】 The G2PSE model significantly improves the prediction performance of genomic selection through ensemble learning, feature selection, and feature enhancement strategies.

Key words: genomic selection, stacking ensemble, feature selection, feature enhancement, agricultural breeding
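
The stacking design described in the abstract (LAR feature selection, then RF, SVR, and GBR base learners combined by an ordinary least squares meta-learner with ten-fold cross-validation, scored by PCC and MAE) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn and NumPy, uses simulated 0/1/2 genotype data, and every function name, hyperparameter, and the 10% selection threshold below is a hypothetical choice made for illustration. It loosely resembles the LFSE submodel only; the feature enhancement step is omitted because the abstract does not define it.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Lars, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def select_snps_with_lar(X, y, fraction):
    """Return indices of SNP columns picked by least-angle regression.

    `fraction` is the share of features to keep (an illustrative
    stand-in for the selection threshold mentioned in the abstract).
    """
    n_keep = max(1, int(round(fraction * X.shape[1])))
    lars = Lars(n_nonzero_coefs=n_keep).fit(X, y)
    return np.array(lars.active_)


def build_stacking_model(random_state=0):
    """RF, SVR, and GBR base learners with an OLS meta-learner.

    cv=10 makes the meta-learner train on ten-fold out-of-fold
    predictions of the base learners.
    """
    base_learners = [
        ("rf", RandomForestRegressor(n_estimators=50, random_state=random_state)),
        ("svr", SVR(kernel="rbf")),
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=random_state)),
    ]
    return StackingRegressor(
        estimators=base_learners,
        final_estimator=LinearRegression(),
        cv=10,
    )


def run_demo(seed=0):
    """Simulate genotypes and a trait, then fit and score the pipeline."""
    rng = np.random.default_rng(seed)
    n_samples, n_snps, n_causal = 150, 200, 10

    # Simulated additive genotypes coded 0/1/2 (assumption, not real data).
    X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
    effects = np.zeros(n_snps)
    causal = rng.choice(n_snps, size=n_causal, replace=False)
    effects[causal] = rng.normal(0.0, 1.0, size=n_causal)
    y = X @ effects + rng.normal(0.0, 1.0, size=n_samples)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Select features on the training split only, to avoid leakage.
    selected = select_snps_with_lar(X_train, y_train, fraction=0.10)

    model = build_stacking_model(random_state=seed)
    model.fit(X_train[:, selected], y_train)
    predictions = model.predict(X_test[:, selected])

    pcc = float(np.corrcoef(y_test, predictions)[0, 1])
    mae = float(mean_absolute_error(y_test, predictions))
    return pcc, mae, len(selected), predictions


pcc, mae, n_selected, predictions = run_demo()
print(f"selected SNPs: {n_selected}, PCC: {pcc:.3f}, MAE: {mae:.3f}")
```

Passing every SNP column to `build_stacking_model` instead of the LAR-selected subset would correspond, under the same assumptions, to an all-feature variant in the spirit of AFSE.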