Automatic Distance Compensation for Robust Voice-based Human-Computer Interaction

Distant-talking voice-based HCI systems suffer from
performance degradation due to the mismatch between the speech
acoustics observed at runtime and the acoustic model built during
training. The mismatch is caused by changes in the power of the speech
signal as observed at the microphones. This change is strongly
influenced by the speaker-to-microphone distance, which affects the
dynamics of the speech as it propagates through the room before
reaching the microphones. Moreover, as the speech signal is reflected,
its acoustic characteristics are further altered by the room
properties. In general, power mismatch due to distance is a complex
problem.
This paper presents a novel approach to dealing with distance-induced
mismatch by intelligently sensing instantaneous voice power variation
and compensating the model parameters. First, the distant-talking
speech signal is processed through microphone array processing, and
the corresponding distance information is extracted. Distance-sensitive
Gaussian Mixture Models (GMMs), pre-trained to capture both speech
power and room properties, are used to predict the best-matching
distance of the speech source. Pre-computed statistical priors
corresponding to that distance are then selected to correct the
statistics of the generic model, which is kept frozen after training.
The combined model is thus post-conditioned to match the power of the
instantaneous speech acoustics at runtime, improving the likelihood of
recognizing the correct speech command at farther distances. We
evaluate the method using real data recorded inside two rooms. The
experiments show that voice recognition with our method is more robust
to changes in distance than the conventional approaches. Under the most
acoustically challenging condition (Room 2, at 2.5 meters), our method
achieved a 24.2% improvement in recognition performance over the
best-performing conventional method.
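
To make the two runtime steps concrete, the sketch below illustrates one plausible realization: per-distance GMMs score the incoming frames to select the best-matching distance, and the pre-computed prior for that distance then corrects the frozen generic model. The abstract does not specify an implementation, so everything here is an assumption for illustration: the distance grid DISTANCES, the diagonal-covariance GMM configuration, and the additive form of the prior correction in mean_priors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical distance grid (meters); the actual training distances
# are not given in the abstract beyond the 2.5 m test condition.
DISTANCES = [0.5, 1.0, 1.5, 2.0, 2.5]

def train_distance_gmms(frames_by_distance, n_components=8):
    """Train one GMM per distance on power-sensitive acoustic features
    (e.g., MFCC frames with log-energy), so each model captures both
    the speech power and the room properties at that distance."""
    gmms = {}
    for dist, frames in frames_by_distance.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        gmm.fit(frames)  # frames: (n_frames, n_dims) array
        gmms[dist] = gmm
    return gmms

def predict_distance(gmms, runtime_frames):
    """Pick the distance whose GMM yields the highest average
    log-likelihood on the runtime frames."""
    return max(gmms, key=lambda d: gmms[d].score(runtime_frames))

def compensate_model(generic_means, mean_priors, distance):
    """Post-condition the frozen generic-model means with the
    pre-computed prior for the predicted distance; an additive
    per-dimension offset is assumed here."""
    return np.asarray(generic_means) + mean_priors[distance]
```

At runtime, predict_distance would run on the array-processed frames before decoding, and the corrected means from compensate_model would replace the generic model's means. Whether the actual correction is an offset, a scaling, or a full mean-and-variance update is left open by the abstract.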




