Automatic extraction and structuration of soil–environment relationship information from soil survey reports

doi:10.1016/S2095-3119(18)62071-4

Journal of Integrative Agriculture

2019, Vol. 18, Issue 2: 328-339

Special focus: Digital mapping in agriculture and environment


WANG De-sheng^{1, 2, 3}, LIU Jun-zhi^{1, 2, 3}, ZHU A-xing^{1, 2, 3, 4, 5}, WANG Shu^{1, 2, 3}, ZENG Can-ying^{1, 2, 3}, MA Tian-wu^{1, 2, 3}

¹ Key Laboratory of Virtual Geographic Environment, Nanjing Normal University, Nanjing 210023, P.R.China
² State Key Laboratory Cultivation Base of Geographical Environment Evolution (Jiangsu Province), Nanjing 210023, P.R.China
³ Jiangsu Center for Collaborative Innovation in Geographic Information Resource Development and Application, Nanjing 210023, P.R.China
⁴ State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, P.R.China
⁵ Department of Geography, University of Wisconsin-Madison, Madison, WI 53706, USA


Abstract

In addition to soil samples, conventional soil maps, and experienced soil surveyors, text about soils (e.g., soil survey reports) is an important potential data source for extracting soil–environment relationships. Because the words describing soil–environment relationships are often mixed with unrelated words, the first step is to extract the relevant words and organize them in a structured way. This paper applies natural language processing (NLP) techniques to automatically extract and structure soil–environment relationship information from soil survey reports. The method includes two steps: (1) construction of a knowledge frame and (2) information extraction using either a rule-based method or a statistic-based method, depending on the type of information. For information written in a uniform style, the rule-based method was used; the variables of this type are slope, elevation, accumulated temperature, annual mean temperature, annual precipitation, and frost-free period. For information written in diverse styles, the statistic-based method was adopted; the variables of this type are landform and parent material. The Soil Species of China soil survey reports were selected as the experimental dataset. Precision (P), recall (R), and F₁-measure (F1) were used to evaluate the performance of the method. For the rule-based method, the P values were 100%, the R values were above 92%, and the F1 values were above 96% for all the variables involved. For the statistic-based method, which was based on conditional random fields (CRFs), the P, R, and F1 values for parent material were 84.15, 83.13, and 83.64%, respectively; the corresponding values for landform were 88.33, 76.81, and 82.17%. To explore the impact of text type on the performance of the CRFs-based method, CRFs models were trained and validated separately on the descriptive texts of soil types and of typical profiles. For parent material, the maximum F1 value for the descriptive text of soil types was 90.7%, while the maximum F1 value for the descriptive text of soil profiles was only 75%. For landform, the maximum F1 value for the descriptive text of soil types was 85.33%, which was similar to that for the descriptive text of soil profiles (85.71%). These results suggest that NLP techniques are effective for the extraction and structuration of soil–environment relationship information from a text data source.
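The relationship between the three evaluation measures can be checked with a short calculation. The sketch below assumes the standard definition of the F₁-measure as the harmonic mean of precision and recall, which the abstract does not spell out; the function name `f1_score` is illustrative and not taken from the paper. The input values are the percentages reported above for the CRFs-based method.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs).

    Assumes the standard F1 definition; the abstract does not state the formula.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, as given in the abstract for the CRFs-based method.
reported = {
    "parent material": (84.15, 83.13, 83.64),
    "landform": (88.33, 76.81, 82.17),
}

for variable, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{variable}: computed F1 = {f1:.2f}%, reported F1 = {f1_reported:.2f}%")
```

Under this assumption, the computed values agree with the reported F1 values to two decimal places, which suggests the reported figures follow the usual harmonic-mean definition.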

Keywords: soil–environment relationship, text, natural language processing, extraction, structuration

Received: 02 January 2018 Accepted:

Fund: This study is supported by the National Natural Science Foundation of China (41431177 and 41601413), the National Basic Research Program of China (2015CB954102), the Natural Science Research Program of Jiangsu Province, China (BK20150975 and 14KJA170001), and the Outstanding Innovation Team in Colleges and Universities in Jiangsu Province, China.

Corresponding Authors: LIU Jun-zhi, E-mail: liujunzhi@njnu.edu.cn; ZHU A-xing, E-mail: azhu@wisc.edu

About author: WANG De-sheng, E-mail: desheng.5@163.com


Cite this article:

WANG De-sheng, LIU Jun-zhi, ZHU A-xing, WANG Shu, ZENG Can-ying, MA Tian-wu. 2019. Automatic extraction and structuration of soil–environment relationship information from soil survey reports. Journal of Integrative Agriculture, 18(2): 328-339.

