Text-independent Speaker Identification Based on MAP Channel Compensation and Pitch-dependent Features
One major source of performance degradation in speaker recognition
systems is channel mismatch between training and testing. This paper
aims to improve the channel robustness of speaker recognition in two
respects: channel compensation and channel-robust features. The
system is a text-independent speaker identification system based on
two-stage recognition. For channel compensation, the paper applies
the MAP (maximum a posteriori) channel compensation technique,
previously used in speech recognition, to speaker recognition. For
channel-robust features, the paper introduces pitch-dependent
features and pitch-dependent speaker models for the second
recognition stage. After the first stage scores the test speech with
GMMs (Gaussian mixture models), the system uses the GMM scores to
decide whether the speech needs to be recognized again. If so, it
selects a few of the speakers that took part in the first stage as
candidates for the second stage. For each selected speaker, the
system obtains three pitch-dependent results from that speaker's
pitch-dependent model and then uses an ANN (artificial neural
network) to combine them with the speaker's GMM score into a single
fused result. The second-stage decision is based on these fused
results. Experiments show that the correct identification rate of the
two-stage system with MAP channel compensation and pitch-dependent
features is 41.7% better than that of the baseline system in
closed-set tests.
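
The two-stage decision flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the GMM scores trigger the second stage, so a top-two score margin (`margin`) is assumed here; the number of candidates (`n_best`) is hypothetical; and a single logistic unit with fixed weights stands in for the trained ANN. All function and parameter names are invented for the example.

```python
import math
from typing import Dict, List, Sequence


def needs_second_stage(gmm_scores: Dict[str, float], margin: float) -> bool:
    """Assumed rule: re-recognize when the two best GMM scores are closer than `margin`."""
    ranked = sorted(gmm_scores.values(), reverse=True)
    return len(ranked) > 1 and (ranked[0] - ranked[1]) < margin


def select_candidates(gmm_scores: Dict[str, float], n_best: int) -> List[str]:
    """Keep the `n_best` speakers with the highest first-stage GMM scores."""
    return sorted(gmm_scores, key=gmm_scores.get, reverse=True)[:n_best]


def fuse(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Stand-in for the ANN: one logistic unit over 3 pitch results + 1 GMM score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def identify(
    gmm_scores: Dict[str, float],
    pitch_results: Dict[str, Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
    margin: float = 1.0,
    n_best: int = 3,
) -> str:
    """Return the identified speaker, using the second stage only when needed."""
    # Stage 1: accept the best GMM score if it is clearly ahead.
    if not needs_second_stage(gmm_scores, margin):
        return max(gmm_scores, key=gmm_scores.get)

    # Stage 2: fuse pitch-dependent results with the GMM score per candidate.
    fused = {}
    for speaker in select_candidates(gmm_scores, n_best):
        features = list(pitch_results[speaker]) + [gmm_scores[speaker]]
        fused[speaker] = fuse(features, weights, bias)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Toy numbers, made up for the example.
    gmm = {"a": -10.0, "b": -10.2, "c": -15.0}
    pitch = {"a": (0.1, 0.2, 0.1), "b": (0.9, 0.8, 0.9), "c": (0.5, 0.5, 0.5)}
    print(identify(gmm, pitch, weights=(1.0, 1.0, 1.0, 0.1), n_best=2))
```

In a real system the fusion weights would come from training the network on held-out data, and the three pitch-dependent results would be produced by each candidate's pitch-dependent speaker model rather than supplied by hand.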
@article{Han_Gao:51094,
  author = "Jiqing Han and Rongchun Gao",
  title = "Text-independent Speaker Identification Based on MAP Channel Compensation and Pitch-dependent Features",
  journal = "International Journal of Electrical, Electronic and Communication Sciences",
  abstract = "One major source of performance degradation in speaker recognition systems is channel mismatch between training and testing. This paper aims to improve the channel robustness of speaker recognition in two respects: channel compensation and channel-robust features. The system is a text-independent speaker identification system based on two-stage recognition. For channel compensation, the paper applies the MAP (maximum a posteriori) channel compensation technique, previously used in speech recognition, to speaker recognition. For channel-robust features, the paper introduces pitch-dependent features and pitch-dependent speaker models for the second recognition stage. After the first stage scores the test speech with GMMs (Gaussian mixture models), the system uses the GMM scores to decide whether the speech needs to be recognized again. If so, it selects a few of the speakers that took part in the first stage as candidates for the second stage. For each selected speaker, the system obtains three pitch-dependent results from that speaker's pitch-dependent model and then uses an ANN (artificial neural network) to combine them with the speaker's GMM score into a single fused result. The second-stage decision is based on these fused results. Experiments show that the correct identification rate of the two-stage system with MAP channel compensation and pitch-dependent features is 41.7% better than that of the baseline system in closed-set tests.",
  keywords = "Channel Compensation, Channel Robustness, MAP, Speaker Identification",
  volume = "4",
  number = "3",
  pages = "412-7",
}