Scientia Agricultura Sinica ›› 2023, Vol. 56 ›› Issue (9): 1617-1632. doi: 10.3864/j.issn.0578-1752.2023.09.001

• Crop Genetics and Breeding · Germplasm Resources · Molecular Genetics •

Principles, Optimization, and Application of Mixed Models in Genome-Wide Association Studies

TAN LiZhi, ZHAO YiQiang

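  1. College of Biological Sciences, China Agricultural University, Beijing 100193
  • Received: 2022-12-04 Accepted: 2023-03-02 Published: 2023-05-01 Online: 2023-05-10
  • Corresponding author: ZHAO YiQiang, E-mail: yiqiangz@cau.edu.cn
  • Contact: TAN LiZhi, E-mail: tanlizhi@cau.edu.cn
  • Funding:
    National Key Research and Development Program of China (2022YFF1000204)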

Principle, Optimization and Application of Mixed Models in Genome-Wide Association Study

TAN LiZhi, ZHAO YiQiang

  1. College of Biological Sciences, China Agricultural University, Beijing 100193
  • Received:2022-12-04 Accepted:2023-03-02 Published:2023-05-01 Online:2023-05-10

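Abstract:

Genome-wide association study (GWAS) is an effective method for locating variant sites in the genome that are significantly associated with traits. With more complete phenotypic records, advances in high-throughput genotyping technology, and improved statistical methods, GWAS has been widely applied in human disease research and in animal and plant genetics. False positives are one of the major factors affecting the reliability of GWAS results. To control false positives, beyond correcting P-values, GWAS models have been continuously improved: from the simplest analysis of variance (or the chi-square test for qualitative traits), to the general linear model (GLM), which adds fixed-effect covariates, and then to the mixed linear model (MLM), which adds random effects, thereby controlling false positives arising from multiple confounding factors. Fitting individual genetic effects as random effects defined by the genomic relationship matrix (GRM) is currently the common approach. Because parameter estimation in MLM consumes substantial computational resources, researchers have repeatedly optimized both model solving and GRM construction (the latter also improves computational efficiency), gradually reducing the time complexity of MLM-based computation from O(MN^3) to O(MN) and achieving a leap in computational speed and statistical power. To address false positives caused by imbalanced case-control ratios in qualitative traits, researchers have further corrected the generalized linear mixed model (GLMM). This paper gives a fairly comprehensive introduction to the basic principles and development of GWAS, with emphasis on the improvement and optimization details of MLM in GWAS. It also lists applications of GWAS in agriculture, including research results in plants, animals, and microorganisms, as well as haplotype-based GWAS. Finally, it offers an outlook on the future development of GWAS from two perspectives: further improving statistical power and GWAS experimental design.

Keywords: genome-wide association study, complex traits, random effects, genomic relationship matrix, mixed linear model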

Abstract:

Genome-wide association study (GWAS) is an effective method for locating genomic loci that are significantly associated with traits. Accumulated phenotypic data, the continuous development of high-throughput genotyping technology, and improved statistical methods have promoted the wide application of GWAS in human disease research and in animal and plant genetics. False positives are a major concern that impairs the reliability of GWAS results. To control false positives, in addition to correcting P-values, GWAS models have been continuously improved, from naive methods such as ANOVA (for quantitative traits) or the chi-square test (for qualitative traits), to the general linear model (GLM), which incorporates fixed-effect covariates, to the mixed linear model (MLM), which incorporates random effects. Fitting individual genetic effects as random effects defined by the genomic relationship matrix (GRM) is now commonly adopted. Because parameter estimation in MLM consumes substantial computational resources, researchers have optimized both model solving and GRM construction (the latter also improves computing efficiency), gradually reducing the time complexity of MLM-based methods from O(MN^3) to O(MN) and achieving a great leap in computational speed and statistical power. To address the inflation caused by unbalanced case-control data, researchers have further corrected the generalized linear mixed model (GLMM). This paper comprehensively introduces the basic principles and development of GWAS, with specific emphasis on the improvement and optimization details of MLM. We also list applications of MLM-based GWAS in agriculture, including progress in animals, plants, and microbes, as well as haplotype-based GWAS. Finally, we discuss prospects for the future development of GWAS from the viewpoints of further model optimization and experimental design.

Key words: genome-wide association study, complex traits, random effects, genomic relationship matrix, mixed linear model
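
As an illustration of the MLM idea summarized in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that N denotes the number of individuals and M the number of markers, that the GRM is built with the widely used VanRaden-style formula, and that the heritability-like ratio `h2` is already known (in practice it is estimated, for example by REML). The function names `vanraden_grm` and `gls_snp_effect` are hypothetical. Solving the N x N system separately for each marker, as done here, costs on the order of N^3 per marker, which is the kind of naive cost that the optimizations discussed in the paper reduce.

```python
import numpy as np


def vanraden_grm(genotypes):
    """Build a genomic relationship matrix from an (N x M) 0/1/2 genotype array.

    Assumption: VanRaden-style scaling, K = Z Z^T / (2 * sum(p * (1 - p))).
    """
    p = genotypes.mean(axis=0) / 2.0      # allele frequency of each marker
    Z = genotypes - 2.0 * p               # centre each marker column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))


def gls_snp_effect(y, snp, K, h2):
    """Generalized least squares estimate of one SNP effect.

    The phenotypic covariance is modelled (up to a scale factor) as
    V = h2 * K + (1 - h2) * I, with h2 assumed known for this sketch.
    Returns the SNP effect estimate and its standard error.
    """
    n = len(y)
    V = h2 * K + (1.0 - h2) * np.eye(n)
    X = np.column_stack([np.ones(n), snp])     # intercept + tested SNP
    Vinv_X = np.linalg.solve(V, X)              # O(N^3) solve per marker
    Vinv_y = np.linalg.solve(V, y)
    XtVinvX = X.T @ Vinv_X
    beta = np.linalg.solve(XtVinvX, X.T @ Vinv_y)
    resid = y - X @ beta
    sigma2 = resid @ np.linalg.solve(V, resid) / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(XtVinvX)[1, 1])
    return beta[1], se


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.binomial(2, 0.3, size=(50, 300)).astype(float)  # simulated genotypes
    y = 0.5 * G[:, 0] + rng.normal(size=50)                 # simulated phenotype
    K = vanraden_grm(G)
    effect, se = gls_snp_effect(y, G[:, 0], K, h2=0.5)
    print(f"effect={effect:.3f}, se={se:.3f}")
```

Setting `h2=0` makes V the identity matrix, so the estimate reduces to ordinary least squares, that is, a GLM without the random genetic effect.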