Text-independent Speaker Identification Based on MAP Channel Compensation and Pitch-dependent Features

A major source of performance degradation in speaker recognition systems is channel mismatch between training and testing. This paper improves the channel robustness of a speaker recognition system in two respects: channel compensation and channel-robust features. The system is a text-independent speaker identification system based on two-stage recognition. For channel compensation, the paper applies the MAP (maximum a posteriori) channel compensation technique, previously used in speech recognition, to speaker recognition. For channel-robust features, the paper introduces pitch-dependent features and pitch-dependent speaker models for the second recognition stage. In the first stage, the test speech is recognized with GMMs (Gaussian mixture models), and the system uses the GMM scores to decide whether the utterance must be recognized again. If so, the system selects a few candidate speakers from all speakers that took part in the first stage. For each selected speaker, the system obtains three pitch-dependent results from that speaker's pitch-dependent model and then uses an ANN (artificial neural network) to combine these three results with the speaker's GMM score into a single fused result. The second-stage decision is based on these fused results. Experiments show that the correct recognition rate of the two-stage system with MAP channel compensation and pitch-dependent features is 41.7% better than that of the baseline system in closed-set tests.
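The two-stage decision flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The confidence test (a score margin between the two best GMM candidates), the parameter names `margin` and `n_candidates`, and the callables `pitch_results` and `fuse` are all assumptions introduced for illustration. In particular, `fuse` only stands in for the trained ANN, and `mean_fuse` is a placeholder for it.

```python
from typing import Callable, Sequence, Tuple


def two_stage_identify(
    gmm_scores: Sequence[float],
    pitch_results: Callable[[int], Sequence[float]],
    fuse: Callable[[Sequence[float]], float],
    margin: float = 0.5,
    n_candidates: int = 3,
) -> Tuple[int, int]:
    """Return (speaker index, stage that made the decision).

    gmm_scores    -- one first-stage GMM score per enrolled speaker
    pitch_results -- hypothetical callable giving the 3 pitch-dependent
                     results for a speaker index
    fuse          -- hypothetical stand-in for the ANN; maps the 4 inputs
                     (3 pitch-dependent results + 1 GMM score) to one score
    """
    # Stage 1: rank speakers by GMM score, best first.
    ranked = sorted(range(len(gmm_scores)),
                    key=lambda i: gmm_scores[i], reverse=True)
    best = ranked[0]

    # Assumed confidence test: accept the stage-1 answer when the best
    # score beats the runner-up by at least `margin`.
    if len(ranked) < 2 or gmm_scores[best] - gmm_scores[ranked[1]] >= margin:
        return best, 1

    # Stage 2: keep only the top few candidates and fuse, for each one,
    # the 3 pitch-dependent results with its GMM score.
    candidates = ranked[:n_candidates]
    fused = {}
    for speaker in candidates:
        features = list(pitch_results(speaker)) + [gmm_scores[speaker]]
        fused[speaker] = fuse(features)

    # The final decision is the candidate with the highest fused score.
    return max(candidates, key=lambda s: fused[s]), 2


def mean_fuse(features: Sequence[float]) -> float:
    """Placeholder fusion (plain average), used only to run the sketch."""
    return sum(features) / len(features)
```

With a clear first-stage winner the function returns after stage 1. When the two best GMM scores are close, the pitch-dependent results can change the final decision, which is the behavior the two-stage design aims for.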



