Using Teager Energy Cepstrum and HMM Distances in Automatic Speech Recognition and Analysis of Unvoiced Speech

In this study, the use of a silicon NAM (Non-Audible Murmur) microphone in automatic speech recognition is presented. NAM microphones are special acoustic sensors attached behind the talker's ear that can capture not only normal (audible) speech but also very quietly uttered speech (non-audible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones are robust against noise and might be used in special systems (speech recognition, speech conversion, etc.) for sound-impaired people. Using a small amount of training data and adaptation approaches, 93.9% word accuracy was achieved on a 20k-word Japanese vocabulary dictation task. Non-audible murmur recognition in noisy environments is also investigated. In this study, further analysis of NAM speech has been carried out using distance measures between hidden Markov model (HMM) pairs. Using a metric distance, it is shown that the spectral space of NAM speech is reduced; however, the locations of the NAM phonemes are similar to those of the corresponding normal-speech phonemes, and the NAM sounds are well discriminated. Promising results from nonlinear features are also presented, especially under noisy conditions.
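The nonlinear features referred to above are built on the Teager energy concept. As a minimal sketch (function name and usage are illustrative, not the paper's implementation), the discrete Teager-Kaiser energy operator estimates the instantaneous signal energy from three consecutive samples:

```python
import math

def teager_energy(x):
    """Discrete Teager-Kaiser energy operator:
    psi[x](n) = x(n)^2 - x(n-1) * x(n+1),
    defined for interior samples 1 .. len(x)-2."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# For a sampled cosine A*cos(W*n + phi), the operator yields the constant
# A^2 * sin(W)^2, i.e. it tracks amplitude and frequency jointly.
signal = [math.cos(0.2 * n) for n in range(100)]
energy = teager_energy(signal)
```

In feature-extraction pipelines of this kind, the operator output is typically computed per filter-bank channel and then compressed and decorrelated (e.g. by a cepstral transform) to obtain Teager-energy-based cepstral coefficients.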



