Scientia Agricultura Sinica ›› 2015, Vol. 48 ›› Issue (3): 449-459. doi: 10.3864/j.issn.0578-1752.2015.03.05

• Agricultural Information Technology •

The Agricultural Price Information Acquisition Method Based on Speech Recognition

XU Jin-pu1,2, ZHU Ye-ping1

  1 Agricultural Information Institute, Chinese Academy of Agricultural Sciences/Key Laboratory of Agri-Information Service Technology, Ministry of Agriculture, Beijing 100081
  2 Animation and Media College, Qingdao Agricultural University, Qingdao 266109, Shandong
  • Received: 2014-03-03 Online: 2015-01-31 Published: 2015-01-31
  • Corresponding author: ZHU Ye-ping, Tel: 010-82103120; E-mail: zhuyeping@caas.cn
  • About the author: XU Jin-pu, Tel: 13361282758; E-mail: xjp@qau.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61271364)

Abstract: 【Objective】 This study applies speech recognition technology to the collection of agricultural product price information. It targets speaker-independent, limited-vocabulary continuous speech recognition for Mandarin Chinese and proposes a robust recognition method suited to the environments in which agricultural product prices are collected. Acoustic models for these environments are trained on the basis of the Hidden Markov Model (HMM), in order to mitigate the drop in recognition rate caused by the mismatch between test and training environments and to further improve the recognition rate.

【Method】 In the data acquisition and processing stage, a transformation grammar was first built from the limited vocabulary according to a set of grammar rules, and the scripts generated from this grammar guided the recording of both the training set and the test set. Speech was then collected from different speakers in different price collection environments and accurately segmented by hand to build a speech corpus. In the model training stage, a continuous mixture-density HMM with a left-to-right, no-skip topology was chosen, and 39-dimensional MFCC feature vectors were extracted from the training set for model training. First, monophones were used as the modeling unit to train male, female, and mixed-gender HMM acoustic models. Then, because monophones are unstable and easily affected by coarticulation, context-dependent triphones were used as the recognition unit and the above models were retrained. Since triphone modeling greatly increases the number of models and therefore leads to insufficient training samples, decision-tree state clustering was used to alleviate the problem. When building the decision trees, phonetic knowledge was used to divide the initials and finals into sets: initials were grouped by manner and place of articulation, and finals by their composition and head vowels, which yielded the binary question set. On this basis, the number of Gaussian mixture components was increased so that the models describe the data more precisely. Finally, to address convolutional channel noise, the CMN and CVN methods were applied together with the above strategies to alleviate the mismatch between test and training environments, and the corresponding male and female models were obtained. In the test stage, the models produced by each method were evaluated on the same test set to obtain the sentence recognition rate, word recognition rate, and accuracy for each method.

【Result】 The triphone acoustic models clearly outperformed the monophone acoustic models, and both the male and the female models outperformed the mixed-gender acoustic model. Decision tree clustering did not noticeably improve the recognition rate, but it markedly reduced the number of triphone models. Increasing the number of Gaussian mixture components improved the recognition rate to some extent at the cost of additional computation. The CMN and CVN methods significantly improved recognition performance. In tests with different locations and different speakers, the methods used showed varying degrees of improvement in recognition performance, and the final recognition rate was 95.04% for male speakers and 97.62% for female speakers.

【Conclusion】 It is feasible to apply speech recognition technology to the collection of agricultural product price information. This paper proposes a method for improving the speech recognition rate in agricultural product price collection environments. The experimental results show that the models trained with this method have good recognition performance, and the approach lays a foundation for the future development of an application system.
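The abstract reports three scores (sentence recognition rate, word recognition rate, and accuracy) without giving their formulas. The sketch below assumes the definitions common in HTK-style scoring: with N reference words, S substitutions, D deletions, and I insertions, word correct is (N - D - S)/N and accuracy is (N - D - S - I)/N. The function names, the unit-cost alignment, and the sample utterances are illustrative assumptions, not taken from the paper.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment.

    Uses unit costs for every edit type (an assumption of this sketch).
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, subs, dels, ins) for ref[:i] aligned with hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, k)        # match, no edit
            else:
                diag = (e + 1, s + 1, d, k)
            e, s, d, k = cost[i - 1][j]
            dele = (e + 1, s, d + 1, k)
            e, s, d, k = cost[i][j - 1]
            ins = (e + 1, s, d, k + 1)
            # Tuples compare on total edits first, so the minimum is preserved.
            cost[i][j] = min(diag, dele, ins)
    _, s, d, k = cost[n][m]
    return s, d, k


def score(pairs):
    """Score (reference, hypothesis) word-list pairs; all values are percentages."""
    n_words = subs = dels = ins = sent_ok = 0
    for ref, hyp in pairs:
        s, d, k = align_counts(ref, hyp)
        n_words += len(ref)
        subs, dels, ins = subs + s, dels + d, ins + k
        sent_ok += ref == hyp
    hits = n_words - subs - dels
    return {
        "sentence_rate": 100.0 * sent_ok / len(pairs),
        "word_correct": 100.0 * hits / n_words,
        "word_accuracy": 100.0 * (hits - ins) / n_words,
    }


# Made-up price utterances (pinyin tokens), for illustration only.
pairs = [
    (["baicai", "san", "kuai", "wu"], ["baicai", "san", "kuai", "wu"]),
    (["tudou", "er", "kuai"], ["tudou", "er", "er", "kuai"]),
]
result = score(pairs)
print(result)
```

Because insertions lower accuracy but not word correct, the two word-level scores can differ on the same output, which is why both are usually reported side by side.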

Key words: speech recognition, agricultural product price, information acquisition, cepstral mean and variance normalization (CMVN), decision tree clustering
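The CMVN keyword refers to the CMN and CVN steps that the abstract uses against channel mismatch. A stationary channel acts as a convolution in the time domain, which becomes an approximately constant additive offset on cepstral features, so subtracting the mean of each coefficient removes it, and dividing by the standard deviation equalizes the scale. The following is a minimal pure-Python sketch that assumes per-utterance statistics; the function name `cmvn`, the `eps` guard, and the toy frames are hypothetical, and the paper's actual configuration is not given in the abstract.

```python
import math


def cmvn(frames, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization.

    frames: list of equal-length feature vectors, one per frame.
    Returns new frames with zero mean and (near) unit variance per dimension.
    """
    n = len(frames)
    dim = len(frames[0])
    # CMN part: mean of every cepstral coefficient over the utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # CVN part: population standard deviation; eps guards constant dimensions.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) + eps
        for d in range(dim)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


# Toy 2-dimensional "cepstra"; a real system would use 39-dimensional MFCCs.
clean = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
# Simulate a different channel as a constant offset in the cepstral domain.
offset = [10.0, -3.0]
shifted = [[x + o for x, o in zip(f, offset)] for f in clean]

same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for fa, fb in zip(cmvn(clean), cmvn(shifted))
    for a, b in zip(fa, fb)
)
print("normalized features match across channels:", same)
```

In this toy case the normalized features of the clean and offset versions coincide, which is the effect the abstract relies on to reduce the mismatch between training and test environments.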