Genomic Selection Method Based on G2PSE Stacking Ensemble

doi:10.3864/j.issn.0578-1752.2025.15.003

Abstract:

【Objective】 Genomic selection (GS) is a core technology for predicting individual phenotypes or genetic values from genome-wide marker information, and it has important theoretical value and practical significance in agricultural breeding and genetic research. However, redundancy among high-dimensional features and the modeling of nonlinear relationships remain key challenges in genomic selection. This study proposes a genotype to phenotype stacking ensemble (G2PSE) model that aims to improve prediction accuracy and generalization ability and to provide an efficient solution for high-dimensional genomic data analysis.

【Method】 The G2PSE stacking ensemble framework was constructed by combining ten-fold cross-validation, ensemble learning, feature selection (the least angle regression, LAR, algorithm), and feature enhancement strategies. The model used random forest (RF), support vector regression (SVR), and gradient boosting regression (GBR) as base learners and ordinary least squares regression (OLSR) as the meta-learner. The effect of alternative meta-learners (random forest, support vector regression, and neural networks) on model performance was also evaluated. The G2PSE model consisted of three core submodels: (1) the all-feature stacking ensemble (AFSE), which used all SNP features; (2) the LAR-feature stacking ensemble (LFSE), which reduced redundant information through feature selection to improve generalization; and (3) the LAR-feature enhanced stacking ensemble (LFESE), which combined feature selection with an enhancement strategy to optimize prediction in high-dimensional data. Three feature enhancement variants (AFESE, HFESEⅠ, and HFESEⅡ) were also examined. The model was evaluated on multi-trait datasets from three species (wheat, soybean, and tilapia) and then on an independent test set from the Pepper203 dataset to validate its robustness.

【Result】 The G2PSE model significantly outperformed traditional methods and single machine learning models on two metrics, the Pearson correlation coefficient (PCC) and the mean absolute error (MAE). Among the three core submodels, LFESE performed best by combining feature selection with the enhancement strategy, LFSE reduced redundant information and improved generalization through feature selection, and AFSE had a significant advantage in capturing global genotypic information. The three feature enhancement variants further confirmed that feature quality matters more than feature quantity for prediction performance. The experiments also showed that the linear regression model was the best choice of meta-learner and that the LFESE and LFSE submodels offered a relatively balanced performance in terms of computational efficiency. A suitable feature selection threshold was crucial for model performance: the optimal threshold was 10%-20% for low-dimensional datasets and 1% for high-dimensional datasets. Finally, the evaluation on the independent test set showed that the LFESE submodel had the best generalization ability.

【Conclusion】 The G2PSE model significantly improves the prediction performance of genomic selection through ensemble learning, feature selection, and feature enhancement strategies.

Key words: genomic selection, stacking ensemble, feature selection, feature enhancement, agricultural breeding
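
The stacking design described in the abstract can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the authors' implementation: the genotype matrix, the effect sizes, the threshold `n_keep`, and all hyperparameters are assumptions. In particular, `passthrough=True` (which appends the selected markers to the base-learner predictions fed to the meta-learner) is only one possible reading of the feature enhancement strategy; the abstract does not define it.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Lars, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for a genotype matrix: 200 individuals x 60 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 60)).astype(float)
true_effects = np.zeros(60)
true_effects[:5] = [1.0, -0.8, 0.6, 0.5, -0.4]  # assumed: only 5 causal markers
y = X @ true_effects + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# LAR feature selection: keep the markers that enter the LAR path first.
n_keep = 12  # hypothetical threshold (20% of the 60 markers)
lar = Lars(n_nonzero_coefs=n_keep).fit(X_train, y_train)
selected = np.flatnonzero(lar.coef_)

base_learners = [
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("svr", SVR()),
    ("gbr", GradientBoostingRegressor(random_state=0)),
]
# cv=10 produces out-of-fold base predictions for the OLS meta-learner;
# passthrough=True (an assumption) also passes the selected markers to it.
stack = StackingRegressor(estimators=base_learners,
                          final_estimator=LinearRegression(),
                          cv=10, passthrough=True)
stack.fit(X_train[:, selected], y_train)
pred = stack.predict(X_test[:, selected])

# The two metrics named in the abstract.
pcc = np.corrcoef(y_test, pred)[0, 1]
mae = np.mean(np.abs(y_test - pred))
print(f"selected markers: {selected.size}, PCC = {pcc:.3f}, MAE = {mae:.3f}")
```

Setting `passthrough=False` gives a stack in the spirit of LFSE, and skipping the LAR step gives one in the spirit of AFSE, under the same assumptions.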

ZHUANG RunJie, LIU HuiMing, WANG ShiYu, LÜ WanPing, WEN YongXian. Genomic Selection Method Based on G2PSE Stacking Ensemble[J]. Scientia Agricultura Sinica, 2025, 58(15): 2960-2979.

Link to this article: https://www.chinaagrisci.com/CN/10.3864/j.issn.0578-1752.2025.15.003

https://www.chinaagrisci.com/CN/Y2025/V58/I15/2960

Figures/Tables: 12

Table 4. Effect of different LAR screening thresholds on the prediction performance of the G2PSE model on Wheat2000

Traits	Number of selected SNPs	LFSE model	LFESE model	HFESEⅠ model	HFESEⅡ model
TKW	337	0.713 (0.562)	0.802 (0.467)	0.710 (0.572)	0.669 (0.589)
	1685	0.710 (0.560)	0.750 (0.643)	0.522 (0.974)	0.667 (0.590)
	3371	0.704 (0.567)	0.694 (0.648)	0.404 (1.099)	0.666 (0.591)
	6742	0.704 (0.568)	0.689 (0.662)	0.373 (1.163)	0.666 (0.591)
	13483	0.702 (0.569)	0.691 (0.660)	0.368 (1.171)	0.666 (0.591)
	20225	0.700 (0.571)	0.724 (0.612)	0.365 (1.176)	0.666 (0.591)
TW	337	0.674 (0.564)	0.778 (0.476)	0.709 (0.573)	0.602 (0.613)
	1685	0.665 (0.572)	0.655 (0.822)	0.528 (0.966)	0.600 (0.614)
	3371	0.661 (0.572)	0.635 (0.737)	0.374 (1.170)	0.599 (0.615)
	6742	0.657 (0.577)	0.627 (0.731)	0.373 (1.162)	0.599 (0.615)
	13483	0.658 (0.576)	0.627 (0.733)	0.368 (1.170)	0.599 (0.615)
	20225	0.659 (0.576)	0.628 (0.731)	0.366 (1.175)	0.600 (0.615)
GL	337	0.772 (0.478)	0.839 (0.410)	0.754 (0.510)	0.741 (0.502)
	1685	0.765 (0.486)	0.738 (0.648)	0.522 (0.947)	0.740 (0.504)
	3371	0.763 (0.489)	0.699 (0.654)	0.478 (1.007)	0.739 (0.504)
	6742	0.762 (0.490)	0.703 (0.643)	0.445 (1.031)	0.739 (0.504)
	13483	0.764 (0.488)	0.704 (0.642)	0.442 (1.036)	0.739 (0.504)
	20225	0.761 (0.490)	0.702 (0.644)	0.440 (1.039)	0.739 (0.504)
GW	337	0.750 (0.514)	0.812 (0.451)	0.712 (0.557)	0.732 (0.526)
	1685	0.745 (0.518)	0.743 (0.654)	0.497 (1.010)	0.731 (0.526)
	3371	0.740 (0.524)	0.746 (0.582)	0.420 (1.086)	0.731 (0.527)
	6742	0.741 (0.523)	0.716 (0.634)	0.402 (1.136)	0.731 (0.527)
	13483	0.741 (0.523)	0.714 (0.636)	0.401 (1.137)	0.731 (0.527)
	20225	0.742 (0.523)	0.716 (0.634)	0.400 (1.137)	0.731 (0.527)
GH	337	0.687 (0.567)	0.787 (0.483)	0.677 (0.586)	0.682 (0.567)
	1685	0.686 (0.566)	0.707 (0.705)	0.463 (1.043)	0.681 (0.568)
	3371	0.691 (0.561)	0.695 (0.654)	0.370 (1.170)	0.680 (0.569)
	6742	0.684 (0.566)	0.698 (0.645)	0.395 (1.125)	0.680 (0.569)
	13483	0.686 (0.565)	0.698 (0.645)	0.392 (1.129)	0.680 (0.569)
	20225	0.689 (0.563)	0.699 (0.644)	0.391 (1.132)	0.680 (0.569)
GP	337	0.626 (0.604)	0.746 (0.512)	0.627 (0.622)	0.515 (0.667)
	1685	0.609 (0.618)	0.662 (0.800)	0.414 (1.175)	0.513 (0.668)
	3371	0.604 (0.619)	0.616 (0.759)	0.358 (1.167)	0.512 (0.668)
	6742	0.603 (0.619)	0.617 (0.757)	0.365 (1.165)	0.512 (0.668)
	13483	0.603 (0.619)	0.618 (0.756)	0.357 (1.177)	0.512 (0.668)
	20225	0.602 (0.621)	0.615 (0.760)	0.352 (1.185)	0.512 (0.668)
SDS	337	0.663 (0.599)	0.796 (0.477)	0.694 (0.575)	0.525 (0.694)
	1685	0.656 (0.607)	0.763 (0.606)	0.580 (0.862)	0.523 (0.695)
	3371	0.644 (0.621)	0.714 (0.626)	0.468 (0.979)	0.520 (0.697)
	6742	0.644 (0.623)	0.713 (0.629)	0.488 (0.948)	0.518 (0.698)
	13483	0.643 (0.622)	0.712 (0.631)	0.487 (0.949)	0.521 (0.696)
	20225	0.644 (0.621)	0.707 (0.636)	0.440 (1.025)	0.520 (0.697)
PHT	337	0.501 (0.664)	0.697 (0.566)	0.553 (0.673)	0.279 (0.755)
	1685	0.477 (0.672)	0.626 (0.844)	0.422 (1.131)	0.278 (0.755)
	3371	0.449 (0.682)	0.520 (0.879)	0.334 (1.167)	0.275 (0.756)
	6742	0.452 (0.682)	0.512 (0.894)	0.294 (1.230)	0.276 (0.756)
	13483	0.449 (0.681)	0.515 (0.890)	0.293 (1.229)	0.275 (0.756)
	20225	0.451 (0.683)	0.517 (0.883)	0.311 (1.186)	0.275 (0.756)
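
To read the Wheat2000 table against the threshold statement in the abstract, the short sketch below parses the cells of the TKW rows and reports, for each submodel, the SNP count with the highest first value and its share of the 33709 markers. It assumes that each cell is "PCC (MAE)", which the table header does not state explicitly; the helper `parse_cell` is hypothetical.

```python
import re

TOTAL_SNPS = 33709  # markers in Wheat2000
counts = [337, 1685, 3371, 6742, 13483, 20225]

# TKW rows of the table, one list per submodel (values taken from the table).
tkw = {
    "LFSE":   ["0.713 (0.562)", "0.710 (0.560)", "0.704 (0.567)",
               "0.704 (0.568)", "0.702 (0.569)", "0.700 (0.571)"],
    "LFESE":  ["0.802 (0.467)", "0.750 (0.643)", "0.694 (0.648)",
               "0.689 (0.662)", "0.691 (0.660)", "0.724 (0.612)"],
    "HFESEⅠ": ["0.710 (0.572)", "0.522 (0.974)", "0.404 (1.099)",
               "0.373 (1.163)", "0.368 (1.171)", "0.365 (1.176)"],
    "HFESEⅡ": ["0.669 (0.589)", "0.667 (0.590)", "0.666 (0.591)",
               "0.666 (0.591)", "0.666 (0.591)", "0.666 (0.591)"],
}

def parse_cell(cell):
    """Split a 'first (second)' cell into two floats (assumed PCC and MAE)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*\(([\d.]+)\)\s*", cell)
    return float(match.group(1)), float(match.group(2))

best = {}
for model, cells in tkw.items():
    pcc_values = [parse_cell(c)[0] for c in cells]
    i = max(range(len(pcc_values)), key=pcc_values.__getitem__)
    # Store the best SNP count and its percentage of all markers.
    best[model] = (counts[i], round(100 * counts[i] / TOTAL_SNPS, 1))

for model, (n_snps, percent) in best.items():
    print(f"{model}: best at {n_snps} SNPs ({percent}% of markers)")
```

For TKW, all four submodels peak at the smallest setting, 337 SNPs, which is about 1% of the markers.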

Datasets	Traits	Individuals	SNPs	Heritability h²	Data source
Wheat599	E1-GY	599	1279	0.832	https://github.com/AIBreeding/DNNGP?tab=readme-ov-file
	E2-GY	599	1279	0.729
	E3-GY	599	1279	0.689
	E4-GY	599	1279	0.711
Wheat2000	TKW	2000	33709	0.833	https://github.com/cma2015/DeepGS
	TW	2000	33709	0.754
	GL	2000	33709	0.881
	GW	2000	33709	0.848
	GH	2000	33709	0.839
	GP	2000	33709	0.625
	SDS	2000	33709	0.681
	PHT	2000	33709	0.434
Soy5014	HT	5014	4234	0.449	https://doi.org/10.1534/g3.116.032268
	R8	5014	4234	0.558
	YLD	5014	4234	0.485
Tilapia1125	HW	1125	32306	0.304	https://figshare.com/s/9b265a22b7e138c5a839
Pepper203	PHT	203	14922	0.610	https://bmcgenomdata.biomedcentral.com/articles/10.1186/s12863-023-01179-6
Pepper203	FT	203	14922	0.730

Traits	Model	Condition number
HT	AFSE	14.823
	LFSE	16.598
	LFESE	490.901
	AFESE	5.436×10¹⁶
	HFESEⅠ	298.676
	HFESEⅡ	5.281×10¹⁶
R8	AFSE	7.709
	LFSE	8.837
	LFESE	495.610
	AFESE	5.506×10¹⁶
	HFESEⅠ	372.634
	HFESEⅡ	5.132×10¹⁶
YLD	AFSE	8.353
	LFSE	8.817
	LFESE	286.771
	AFESE	5.291×10¹⁶
	HFESEⅠ	286.676
	HFESEⅡ	5.489×10¹⁶
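
Condition numbers on the order of 10¹⁶ are typical of a numerically rank-deficient matrix. The NumPy sketch below illustrates the effect on synthetic data: appending a single perfectly redundant marker to an otherwise well-conditioned matrix inflates its condition number by many orders of magnitude. The source does not say which matrix the reported condition numbers refer to; treating it as the meta-learner's input (base-learner predictions plus markers) is an assumption, and all data here are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Stand-ins for three base-learner predictions and five SNP columns (0/1/2).
base_preds = rng.normal(size=(n, 3))
snps = rng.integers(0, 3, size=(n, 5)).astype(float)

# A marker in perfect linkage with the first SNP: an exact duplicate column.
duplicate = snps[:, [0]]

well_conditioned = np.hstack([base_preds, snps])
ill_conditioned = np.hstack([base_preds, snps, duplicate])

# Condition number = largest singular value / smallest singular value.
cond_well = np.linalg.cond(well_conditioned)
cond_ill = np.linalg.cond(ill_conditioned)

print(f"without redundant marker: {cond_well:.3g}")
print(f"with redundant marker:    {cond_ill:.3g}")
```

A large condition number means that the least squares coefficients of a linear meta-learner become highly sensitive to small changes in the data.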

Environment	Number of selected SNPs	LFSE model	LFESE model	HFESEⅠ model	HFESEⅡ model
E1-GY	13	0.511 (0.691)	0.489 (0.701)	0.581 (0.641)	0.297 (1.132)
	64	0.592 (0.634)	0.612 (0.616)	0.563 (0.655)	0.305 (1.112)
	128	0.604 (0.631)	0.636 (0.606)	0.543 (0.683)	0.321 (1.095)
	256	0.604 (0.625)	0.542 (0.670)	0.454 (1.251)	0.330 (1.074)
	512	0.597 (0.629)	0.407 (1.631)	0.333 (1.728)	0.354 (1.030)
	767	0.596 (0.630)	0.129 (3.452)	0.079 (3.518)	0.353 (1.033)
E2-GY	13	0.429 (0.701)	0.467 (0.691)	0.518 (0.658)	0.339 (1.095)
	64	0.592 (0.624)	0.621 (0.622)	0.614 (0.625)	0.360 (1.033)
	128	0.607 (0.612)	0.698 (0.588)	0.681 (0.592)	0.373 (1.003)
	256	0.598 (0.622)	0.662 (0.634)	0.630 (0.662)	0.388 (0.968)
	512	0.564 (0.638)	0.521 (1.014)	0.409 (1.243)	0.431 (0.909)
	767	0.536 (0.648)	0.287 (2.026)	0.204 (2.484)	0.489 (0.838)
E3-GY	13	0.512 (0.678)	0.493 (0.681)	0.455 (0.692)	0.276 (1.070)
	64	0.512 (0.677)	0.498 (0.679)	0.501 (0.689)	0.274 (1.072)
	128	0.555 (0.655)	0.594 (0.628)	0.551 (0.672)	0.287 (1.059)
	256	0.511 (0.678)	0.496 (0.682)	0.523 (0.743)	0.280 (1.061)
	512	0.510 (0.681)	0.499 (0.679)	0.285 (1.394)	0.282 (1.061)
	767	0.510 (0.680)	0.496 (0.680)	0.132 (3.048)	0.279 (1.065)
E4-GY	13	0.457 (0.711)	0.477 (0.686)	0.514 (0.668)	0.291 (1.079)
	64	0.604 (0.630)	0.602 (0.633)	0.592 (0.644)	0.311 (1.047)
	128	0.639 (0.607)	0.653 (0.598)	0.648 (0.614)	0.319 (1.037)
	256	0.628 (0.607)	0.708 (0.594)	0.668 (0.637)	0.313 (1.044)
	512	0.586 (0.628)	0.434 (1.123)	0.403 (1.157)	0.323 (1.029)
	767	0.560 (0.641)	0.243 (2.231)	0.230 (2.387)	0.340 (1.011)
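
The SNP counts in the table above are consistent with screening fractions of 1%, 5%, 10%, 20%, 40%, and 60% of the 1279 Wheat599 markers. These fractions are inferred from the counts and are not stated in the source. Under that assumption, the sketch below reproduces the counts and finds, for each environment, the fraction at which the first value in the LFESE column (assumed to be PCC) is highest.

```python
TOTAL_SNPS = 1279  # markers in Wheat599
percents = [1, 5, 10, 20, 40, 60]  # assumed screening thresholds (%)

# Number of SNPs kept at each threshold, rounded to the nearest integer.
counts = [round(TOTAL_SNPS * p / 100) for p in percents]
print(counts)

# First value of each LFESE cell in the table, ordered by threshold.
lfese_pcc = {
    "E1-GY": [0.489, 0.612, 0.636, 0.542, 0.407, 0.129],
    "E2-GY": [0.467, 0.621, 0.698, 0.662, 0.521, 0.287],
    "E3-GY": [0.493, 0.498, 0.594, 0.496, 0.499, 0.496],
    "E4-GY": [0.477, 0.602, 0.653, 0.708, 0.434, 0.243],
}

best_percent = {}
for env, values in lfese_pcc.items():
    i = max(range(len(values)), key=values.__getitem__)
    best_percent[env] = percents[i]

print(best_percent)
```

The best setting is 10% for E1-GY, E2-GY, and E3-GY and 20% for E4-GY, which agrees with the 10%-20% range that the abstract reports for low-dimensional datasets.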

Model	Validation set PCC	Validation set MAE	Test set PCC	Test set MAE	Absolute difference PCC	Absolute difference MAE
AFSE	0.639	0.566	0.520	0.580	0.119	0.014
LFSE	0.735	0.502	0.645	0.563	0.090	0.061
LFESE	0.693	0.544	0.734	0.517	0.041	0.027
AFESE	0.659	0.553	-0.023	0.690	0.682	0.137
HFESEⅠ	0.669	0.553	0.415	0.745	0.254	0.192
HFESEⅡ	0.659	0.554	0.527	0.625	0.132	0.071

Model	Validation set PCC	Validation set MAE	Test set PCC	Test set MAE	Absolute difference PCC	Absolute difference MAE
AFSE	0.785	0.468	0.705	0.508	0.080	0.040
LFSE	0.806	0.444	0.756	0.501	0.050	0.057
LFESE	0.753	0.502	0.718	0.561	0.035	0.059
AFESE	0.769	0.467	0.434	0.701	0.335	0.234
HFESEⅠ	0.736	0.564	0.657	0.624	0.079	0.060
HFESEⅡ	0.768	0.469	0.712	0.521	0.056	0.052
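
The absolute difference columns are the gaps between validation and test performance, and a smaller gap suggests more stable generalization. As a small worked check, the sketch below recomputes the gaps from the validation and test values of the table directly above and ranks the models by PCC gap. The variable names are illustrative only.

```python
# (validation PCC, validation MAE, test PCC, test MAE) per model, from the table.
results = {
    "AFSE":   (0.785, 0.468, 0.705, 0.508),
    "LFSE":   (0.806, 0.444, 0.756, 0.501),
    "LFESE":  (0.753, 0.502, 0.718, 0.561),
    "AFESE":  (0.769, 0.467, 0.434, 0.701),
    "HFESEⅠ": (0.736, 0.564, 0.657, 0.624),
    "HFESEⅡ": (0.768, 0.469, 0.712, 0.521),
}

# Absolute validation-test gap for each metric, rounded like the table.
gaps = {
    model: (round(abs(val_pcc - test_pcc), 3), round(abs(val_mae - test_mae), 3))
    for model, (val_pcc, val_mae, test_pcc, test_mae) in results.items()
}

# Rank models from the smallest to the largest PCC gap.
ranking = sorted(gaps, key=lambda model: gaps[model][0])

for model in ranking:
    pcc_gap, mae_gap = gaps[model]
    print(f"{model}: PCC gap = {pcc_gap:.3f}, MAE gap = {mae_gap:.3f}")
```

The recomputed gaps match the table, with LFESE showing the smallest PCC gap (0.035) and AFESE the largest (0.335).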