Continuous Feature Adaptation for Non-Native Speech Recognition

Current speech interfaces in many military applications may be adequate for native speakers, but recognition accuracy drops substantially for non-native speakers (speakers with foreign accents). This is mainly because non-native speakers exhibit large temporal and intra-phoneme variations when pronouncing the same words. The problem is further complicated by strong environmental noise, such as tank and helicopter noise. In this paper, we propose a novel continuous acoustic feature adaptation algorithm for on-line accent and environment adaptation. Implemented via incremental singular value decomposition (SVD), the algorithm captures local acoustic variation and runs in real time. This feature-based adaptation method is then integrated with the conventional model-based maximum likelihood linear regression (MLLR) algorithm. Extensive experiments were performed on the NATO non-native speech corpus with a baseline acoustic model trained on native American English. The proposed feature-based adaptation algorithm improved average recognition accuracy by 15%, while MLLR model-based adaptation achieved an 11% improvement; the corresponding word error rate (WER) reductions were 25.8% and 2.73%, respectively, relative to the system without adaptation. The combined adaptation achieved an overall recognition accuracy improvement of 29.5% and a WER reduction of 31.8% relative to the system without adaptation.
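To illustrate the kind of incremental SVD update that makes this adaptation feasible in real time, the sketch below shows the standard rank-one column-append update in the style of Brand's incremental SVD: a new feature vector is projected onto the current subspace, a small (k+1)-by-(k+1) core matrix is decomposed, and the basis is rotated and truncated. This is a generic minimal sketch, not the paper's implementation; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def incremental_svd_update(U, s, c, rank):
    """Append one observation vector c to a matrix tracked by its
    rank-k SVD subspace (U, s), and return the updated (U, s),
    truncated to `rank`. Right singular vectors are not tracked,
    which suffices for feature-subspace adaptation. Illustrative sketch."""
    k = len(s)
    p = U.T @ c                 # coordinates of c in the current subspace
    r = c - U @ p               # residual orthogonal to the subspace
    r_norm = np.linalg.norm(r)
    j = r / r_norm if r_norm > 1e-10 else np.zeros_like(c)

    # Small (k+1)x(k+1) core matrix: cheap to decompose regardless of
    # the feature dimension.
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = np.diag(s)
    K[:k, -1] = p
    K[-1, -1] = r_norm

    Up, sp, _ = np.linalg.svd(K)
    # Rotate the enlarged basis by the small left factor, then truncate.
    U_new = np.hstack([U, j[:, None]]) @ Up
    return U_new[:, :rank], sp[:rank]
```

When the rank is not truncated, the update is exact: the singular values it produces match a batch SVD of the column-augmented matrix, while the per-update cost stays linear in the feature dimension rather than cubic, which is what permits continuous on-line adaptation.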



