Journal of Integrative Agriculture, 2023, Vol. 22, Issue 6: 1909-1927. DOI: 10.1016/j.jia.2023.02.011


  


Ensemble learning prediction of soybean yields in China based on meteorological data

LI Qian-chuan1, XU Shi-wei1, 2, 5#, ZHUANG Jia-yu1, 5, LIU Jia-jia2, ZHOU Yi3, ZHANG Ze-xi4   

    1 Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China

    2 Beijing Engineering Research Center for Agricultural Monitoring and Early Warning, Beijing 100081, P.R.China

    3 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, P.R.China

    4 The Department of Mathematics, Columbia University, NY 10027, USA

    5 Key Laboratory of Agricultural Monitoring and Early Warning Technology, Ministry of Agriculture and Rural Affairs, Beijing 100081, P.R.China

  • Received: 2022-09-27; Revised: 2023-02-10; Accepted: 2022-11-16; Online: 2023-06-20; Published: 2022-11-16
  • About author: LI Qian-chuan, E-mail: 82101211326@caas.cn; #Correspondence XU Shi-wei, E-mail: xushiwei@caas.cn
  • Supported by:
    The research was supported by the Science and Technology Innovation Project of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII).


Abstract:

The accurate prediction of soybean yield is of great significance for agricultural production, monitoring and early warning. Although previous studies have used machine learning algorithms to predict soybean yield from meteorological data, it remains unclear how different models can be used to effectively separate soybean meteorological yield from soybean yield in various regions. In addition, integrating the strengths of various machine learning algorithms through ensemble learning to improve prediction accuracy has not been studied in depth. This study analyzed daily meteorological data and soybean yield data spanning 34 years from 173 county-level administrative regions and meteorological stations in the two principal soybean planting areas in China (Northeast China and the Huang–Huai region). Three effective machine learning algorithms (K-nearest neighbors, random forest, and support vector regression) were adopted as base models to establish a high-precision and highly reliable soybean meteorological yield prediction model based on the stacking ensemble learning framework. The model’s generalizability was further improved through 5-fold cross-validation, and the model was optimized through dimensionality reduction by principal component analysis and hyperparameter tuning. The accuracy of the model was evaluated using 5-year sliding predictions and four regression metrics for the 173 counties, which showed that the stacking model has higher accuracy and stronger robustness. The 5-year sliding estimations of soybean yield based on the stacking model in the 173 counties showed that the estimates reflect the spatiotemporal distribution of soybean yield in detail, with a mean absolute percentage error (MAPE) below 5%. The stacking prediction model of soybean meteorological yield provides a new approach for accurately predicting soybean yield.

Key words: meteorological factors, ensemble learning, crop yield prediction, machine learning, county-level
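
The stacking framework summarized in the abstract can be illustrated with a minimal sketch in scikit-learn. This is not the authors' implementation: the data are synthetic stand-ins, the meta-learner (`LinearRegression`), the 95% variance threshold for PCA, and all hyperparameter values are assumptions chosen for illustration, and hyperparameter tuning is omitted for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the study's data): 300 samples with
# 20 hypothetical meteorological features and a positive yield-like target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = 2.0 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.05, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The three base models named in the abstract; hyperparameters are assumed.
base_models = [
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("svr", SVR(kernel="rbf", C=1.0)),
]

# Stacking: out-of-fold predictions from 5-fold cross-validation on the
# base models become the inputs of the meta-learner (assumed to be linear).
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=LinearRegression(),
    cv=5,
)

# Standardize, reduce dimensionality with PCA (keep 95% of the variance,
# an assumed threshold), then fit the stacked ensemble.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), stack)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mape = mean_absolute_percentage_error(y_test, y_pred)  # fraction, not percent
print(f"MAPE on held-out synthetic data: {mape:.2%}")
```

In practice, the base-model hyperparameters could be tuned by wrapping the pipeline in a search such as scikit-learn's `GridSearchCV`; which search strategy and parameter ranges the study used is not stated in the abstract.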