Scientia Agricultura Sinica ›› 2020, Vol. 53 ›› Issue (1): 191-200. doi: 10.3864/j.issn.0578-1752.2020.01.018

• Animal Science · Veterinary Science · Resource Insects •

Estimating the Genomic Breed Composition of Composite-Breed Animals by Restricted Standardized Linear Regression

HE Jun1, LI Zhi1,2, WU XiaoLin1,2

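  1. 1 College of Animal Science and Technology, Hunan Agricultural University, Changsha 410128, China
    2 Department of Bioinformatics and Biostatistics, Neogen Corporation, Lincoln 68504, USA
  • Received: 2019-03-01 Accepted: 2019-05-30 Publication date: 2020-01-01 Release date: 2020-01-19
  • About the author: HE Jun, Tel: 0731-84618176; E-mail: hejun@hunau.edu.cn
  • Supported by:
    Key Project of the Hunan Provincial Science and Technology Program (2018NK2081); Key Project of the Changsha Science and Technology Program (kq1801014); Hunan Provincial Hundred Talents Program and the Hunan Collaborative Innovation Center for Livestock and Poultry Safety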

Using Restricted Standardized Linear Regression Model to Estimate Genomic Breed Composition in Composite Breed Animals

Jun HE1, Zhi LI1,2, XiaoLin WU1,2

  1. 1 College of Animal Science and Technology, Hunan Agricultural University, Changsha 410128, China
    2 Biostatistics and Bioinformatics, Neogen GeneSeek, Lincoln, NE 68504, USA
  • Received: 2019-03-01 Accepted: 2019-05-30 Online: 2020-01-01 Published: 2020-01-19

Abstract:

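【Background】A composite breed is a new breed developed from at least two purebreds (ancestral breeds). It is designed to combine the favorable genetic characteristics of the ancestral breeds and to retain heterosis in later generations without crossbreeding in every generation. Unlike crossbred populations, composite breeds are genetically stable and can therefore be bred like purebreds. In practice, estimating the proportion that each ancestral breed contributes to the genome of an individual animal, that is, the genomic breed composition (GBC), is important for breed registration of livestock and poultry, for analyzing breeding history and breed composition, for breed conservation, and for predicting heterosis. With genome-wide SNP genotype data and suitable mathematical models and statistical methods, animals of existing pure breeds can be identified and the genetic contribution of pure breeds to the genomes of crossbred individuals can be estimated, but methods for, and studies of, estimating GBC in composite breeds remain scarce. 【Objective】Linear regression is one of the methods commonly used to estimate GBC, but it has a number of problems. This study proposes and evaluates a restricted standardized linear regression (RSLR) method, as an improvement on conventional linear regression, for estimating the GBC of individual composite-breed animals. 【Method】Genotype data from the GGP 50K SNP chip were used for Beefmaster cattle and their three ancestral breeds (Brahman, Hereford, and Shorthorn). Allele frequencies and Euclidean distances were computed, and hierarchical clustering was used to resolve the genetic relationships among the four populations. The principle and procedure of RSLR for estimating the GBC of individual composite-breed animals were then presented. To assess the method, seven subsets of evenly distributed SNPs were selected from the genotype data, containing 1 000, 5 000, 10 000, 20 000, 30 000, and 40 000 SNPs, and all 47 900 SNPs common to the three ancestral breeds. The GBC of 4 323 Beefmaster cattle was estimated with both RSLR and conventional linear regression (LR), and the results of the two methods were compared. 【Result】The clustering results agreed with the genetic relationships among the four breeds, indicating that Beefmaster is most closely related to Brahman, with a smaller genetic distance to Brahman than to Hereford or Shorthorn. LR underestimated the genomic contributions of Brahman (0.459-0.462) and Shorthorn (0.208-0.212) to Beefmaster and overestimated that of Hereford (0.326-0.333). In contrast, the mean GBC of Beefmaster estimated by RSLR agreed well with the expected genomic contributions of the three ancestral breeds: 0.497-0.503 for Brahman, 0.262-0.274 for Hereford, and 0.229-0.231 for Shorthorn. In addition, the standard deviations and coefficients of variation of GBC estimated by LR were markedly larger than those estimated by RSLR. With SNP subsets of 20 000 or more, the standard deviations of the LR estimates for Brahman, Hereford, and Shorthorn were 0.048, 0.032, and 0.051-0.052, and the coefficients of variation were 10.46%-10.50%, 9.61%-9.76%, and 23.94%-25.00%, respectively. The corresponding standard deviations of the RSLR estimates were 0.021, 0.021-0.022, and 0.024-0.025, and the coefficients of variation were 4.18%-4.20%, 7.89%-8.33%, and 10.26%-10.68%, respectively. 【Conclusion】For individual animals of the composite breed Beefmaster, GBC estimated by RSLR was more accurate, more stable, and more consistent than GBC estimated by LR. RSLR can therefore serve as an improvement on linear regression for estimating the GBC of individual composite-breed animals.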

Keywords: SNP chip, linear regression, composite breed, genomic breed composition
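
The clustering step named in the Method section (allele frequencies, Euclidean distances between populations, hierarchical clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the genotypes are simulated rather than taken from the study, the population names `AncestorA`, `AncestorB`, `AncestorC`, and `Composite` are hypothetical, the 0.50/0.25/0.25 mixing weights are arbitrary, and average linkage is used only because the abstract does not say which linkage rule the authors applied.

```python
import numpy as np

def allele_frequencies(genotypes):
    """Frequency of the counted allele per SNP from 0/1/2 genotype codes.

    genotypes has shape (n_animals, n_snps); NaN marks a missing call.
    """
    return np.nanmean(genotypes, axis=0) / 2.0

def euclidean_distance_matrix(freqs):
    """Pairwise Euclidean distances between the rows of freqs (n_pops, n_snps)."""
    diff = freqs[:, None, :] - freqs[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def average_linkage(dist, labels):
    """Naive average-linkage agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        best = None
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Average distance between all members of the two clusters.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(labels[i] for i in clusters[a]),
                       sorted(labels[i] for i in clusters[b]),
                       float(d)))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Simulated data (assumption): three ancestral populations with independent
# allele frequencies, and a composite whose frequencies are a fixed mixture.
rng = np.random.default_rng(0)
n_animals, n_snps = 100, 2000
ancestor_p = rng.uniform(0.05, 0.95, size=(3, n_snps))
composite_p = np.array([0.50, 0.25, 0.25]) @ ancestor_p

labels = ["AncestorA", "AncestorB", "AncestorC", "Composite"]
pop_p = [ancestor_p[0], ancestor_p[1], ancestor_p[2], composite_p]
genos = [rng.binomial(2, p, size=(n_animals, n_snps)).astype(float) for p in pop_p]

freqs = np.vstack([allele_frequencies(g) for g in genos])
dist = euclidean_distance_matrix(freqs)
merges = average_linkage(dist, labels)
for left, right, d in merges:
    print(left, "+", right, f"at distance {d:.2f}")
```

In this simulation the composite is built to draw half of its allele frequencies from `AncestorA`, so the first merge pairs those two populations. That mirrors the kind of pattern the abstract reports for Beefmaster and Brahman, but it says nothing about the real data.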

Abstract:

【Background】A composite breed is developed from two or more purebreds (ancestral breeds) and is designed to combine advantageous genetic characteristics of the ancestral breeds and to retain heterosis in future generations without further crossbreeding. Unlike crossbred populations, a composite breed can be maintained as a purebred. In practice, knowing the proportion of the genome that each ancestral breed contributes to an individual composite animal, referred to as the genomic breed composition (GBC), is important for animal breed registration, tracing breeding history and population structure, breed conservation, and the prediction of heterosis. Given a set of genomic SNP genotypes and an appropriate statistical model, the GBC of a purebred or crossbred animal can be estimated. So far, studies on statistical methods devoted to estimating GBC in composite breeds are limited. Linear regression (LR) is commonly used to estimate the GBC of individual animals, but it has limitations; for example, the coefficients of the ancestral breeds do not sum to 1. 【Objective】The purpose of the present study was to propose and evaluate restricted standardized linear regression (RSLR), as an improvement on linear regression, for estimating GBC in composite animals. 【Method】The dataset consisted of 4 323 Beefmaster cattle and purebred animals of their ancestral breeds, namely Brahman, Hereford, and Shorthorn. All animals were genotyped with GeneSeek Genomic Profiling (GGP) bovine 50K SNP chips. Allele frequencies at each SNP were computed for the four populations, and their genetic relationships were examined by hierarchical clustering based on the Euclidean distances between the populations' SNP allele frequencies. The GBC of the 4 323 Beefmaster cattle was estimated using RSLR and LR, each with 7 SNP panels (1K, 5K, 10K, 20K, 30K, 40K, and all 47 900 common SNPs).
【Result】The results of the clustering analysis agreed well with the genetic relationships between Beefmaster and the three ancestral breeds, showing that Beefmaster was more closely related to Brahman than to Hereford and Shorthorn. LR underestimated the genomic contributions of Brahman (0.459-0.462) and Shorthorn (0.208-0.212) to Beefmaster cattle and overestimated that of Hereford (0.326-0.333). In contrast, the GBC of the 4 323 Beefmaster cattle estimated by RSLR agreed well with the expected genomic contributions of the three ancestral breeds: 0.497-0.503 for Brahman, 0.262-0.274 for Hereford, and 0.229-0.231 for Shorthorn. Furthermore, the standard deviations (SD) and coefficients of variation (CV) of GBC obtained by LR were larger than those obtained by RSLR. With reference panels of 20K or more SNPs, the SD of GBC estimated by LR were 0.048 (Brahman), 0.032 (Hereford), and 0.051-0.052 (Shorthorn), and the corresponding CV were 10.46%-10.50% (Brahman), 9.61%-9.76% (Hereford), and 23.94%-25.00% (Shorthorn). With RSLR, the SD of GBC for the three ancestral breeds were 0.021 (Brahman), 0.021-0.022 (Hereford), and 0.024-0.025 (Shorthorn), and the corresponding CV were 4.18%-4.20% (Brahman), 7.89%-8.33% (Hereford), and 10.26%-10.68% (Shorthorn). 【Conclusion】The RSLR method provided more accurate and more consistent estimates of GBC in the 4 323 Beefmaster cattle than the LR approach. It thus provides a new statistical method for estimating GBC in composite animals.

Key words: SNP chip, linear regression, composite breeds, genomic breed composition
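
The limitation of LR noted in the abstract, that the ancestral-breed coefficients need not sum to 1, can be illustrated by comparing ordinary least squares with a constrained fit. The sketch below is not the authors' RSLR formulation, which the abstract does not spell out: it assumes the restriction is "coefficients are non-negative and sum to 1", it omits any standardization step, and it solves the constrained problem by projected gradient descent as a stand-in technique. The data are simulated, and `true_b`, `fit_unconstrained`, and `fit_simplex_constrained` are hypothetical names.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_unconstrained(X, y):
    """Ordinary least squares without intercept; coefficients are unrestricted."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def fit_simplex_constrained(X, y, n_iter=2000):
    """Least squares with b >= 0 and sum(b) = 1, by projected gradient descent."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.full(X.shape[1], 1.0 / X.shape[1])
    for _ in range(n_iter):
        gradient = X.T @ (X @ b - y)
        b = project_to_simplex(b - step * gradient)
    return b

# Simulated example (assumption): columns of X are allele frequencies of three
# reference breeds; y is one animal's allele dosage (0, 1, 2) divided by 2.
rng = np.random.default_rng(1)
n_snps = 5000
true_b = np.array([0.50, 0.25, 0.25])
X = rng.uniform(0.05, 0.95, size=(n_snps, 3))
y = rng.binomial(2, X @ true_b) / 2.0

b_ols = fit_unconstrained(X, y)
b_con = fit_simplex_constrained(X, y)
print("unconstrained:", np.round(b_ols, 3), "sum =", round(float(b_ols.sum()), 3))
print("constrained:  ", np.round(b_con, 3), "sum =", round(float(b_con.sum()), 3))
```

The constrained coefficients can be read directly as proportions for one animal, whereas the unconstrained ones would need rescaling first. Repeating the fit over many animals and taking SD divided by mean per breed would give coefficients of variation of the kind the abstract reports.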