Continuous Feature Adaptation for Non-Native Speech Recognition
Speech interfaces in many military applications may be adequate for native speakers, but recognition accuracy degrades substantially for non-native speakers (speakers with foreign accents). This is mainly because non-native speakers exhibit large temporal and intra-phoneme variations when pronouncing the same words. The problem is further complicated by the presence of strong environmental noise, such as tank and helicopter noise. In this paper, we propose a novel continuous acoustic feature adaptation algorithm for on-line accent and environmental adaptation. Implemented with incremental singular value decomposition (SVD), the algorithm captures local acoustic variation and runs in real time. This feature-based adaptation method is then integrated with the conventional model-based maximum likelihood linear regression (MLLR) algorithm. Extensive experiments were performed on the NATO non-native speech corpus, with the baseline acoustic model trained on native American English. The proposed feature-based adaptation algorithm improved average recognition accuracy by 15%, while MLLR model-based adaptation achieved an 11% improvement. The corresponding word error rate (WER) reductions were 25.8% and 2.73%, respectively, compared to recognition without adaptation. The combined adaptation achieved an overall recognition accuracy improvement of 29.5% and a WER reduction of 31.8% compared to recognition without adaptation.
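To make the feature-adaptation idea concrete, the following is a minimal sketch of projecting incoming acoustic feature frames onto the leading singular subspace of a sliding window of recent frames. All names, the window length, and the subspace rank are illustrative assumptions; for simplicity this sketch recomputes a truncated SVD per frame, whereas the paper's algorithm uses an incremental SVD update to reach real-time operation.

```python
import numpy as np

class SlidingSVDAdapter:
    """Illustrative (hypothetical) feature adapter: tracks the dominant
    subspace of recent feature frames and projects each new frame onto it."""

    def __init__(self, dim, window=200, rank=10):
        self.window = window              # number of recent frames retained
        self.rank = rank                  # size of the retained subspace
        self.buffer = np.zeros((0, dim))  # rolling matrix of recent frames

    def update(self, frame):
        """Add one feature frame; return its projection onto the leading
        subspace of the windowed, mean-removed frame matrix."""
        self.buffer = np.vstack([self.buffer, frame])[-self.window:]
        mean = self.buffer.mean(axis=0)
        centered = self.buffer - mean
        # Truncated SVD: rows of vt[:rank] span the local acoustic subspace.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[: self.rank]
        return mean + basis.T @ (basis @ (frame - mean))
```

In this sketch, each frame is reconstructed from the low-rank subspace that summarizes recent local acoustic variation (accent and environment), which is the role the incremental SVD plays in the proposed algorithm.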
@article{"International Journal of Information, Control and Computer Sciences:50332", author = "Y. Deng and X. Li and C. Kwan and B. Raj and R. Stern", title = "Continuous Feature Adaptation for Non-Native Speech Recognition", keywords = "speaker adaptation; environment adaptation; robust speech recognition; SVD; non-native speech recognition", volume = "1", number = "6", pages = "1537-8", }