Speaker Identification by Joint Statistical Characterization in the Log Gabor Wavelet Domain
Real-world Speaker Identification (SI) applications differ from ideal
or laboratory conditions, causing perturbations that lead to a mismatch
between the training and testing environments and degrade performance
drastically. Many strategies have been adopted to cope with acoustical
degradation; the wavelet-based Bayesian marginal model is one of them.
However, Bayesian marginal models cannot capture the statistical
dependencies between different wavelet scales. Simple nonlinear
estimators for wavelet-based denoising assume that the wavelet
coefficients in different scales are independent, whereas in practice
wavelet coefficients exhibit significant inter-scale dependency. This
paper models this inter-scale dependency with a Circularly Symmetric
Probability Density Function (CS-PDF) related to the family of
Spherically Invariant Random Processes (SIRPs) in the Log Gabor Wavelet
(LGW) domain, and derives the corresponding joint shrinkage estimator
as a Maximum a Posteriori (MAP) estimator. Based on these, a framework
is proposed to denoise speech signals for automatic speaker
identification. The robustness of the proposed framework is tested on a
Text-Independent Speaker Identification task with 100 speakers from the
POLYCOST database and 100 speakers from the YOHO speech database in
three different noise environments. Experimental results show that the
proposed estimator yields a higher improvement in identification
accuracy than other estimators when used with the popular Gaussian
Mixture Model (GMM) based speaker model and Mel-Frequency Cepstral
Coefficient (MFCC) features.
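
To make the idea of a joint (inter-scale) MAP shrinkage estimator concrete, the sketch below implements the well-known bivariate shrinkage rule of Sendur and Selesnick, which is the MAP estimate under one particular circularly symmetric prior on a coefficient and its parent at the next coarser scale. This is an illustrative assumption and not necessarily the estimator derived in this paper. The function name `bivariate_shrink` and the parameters `sigma_n` (noise standard deviation) and `sigma` (signal standard deviation) are hypothetical, and real-valued coefficients are assumed for simplicity.

```python
import numpy as np


def bivariate_shrink(child, parent, sigma_n, sigma):
    """Shrink `child` coefficients jointly with their `parent` coefficients.

    Illustrative MAP rule (assumed prior, not taken from the paper):
        w_hat = max(r - sqrt(3) * sigma_n**2 / sigma, 0) / r * child,
    where r = sqrt(child**2 + parent**2).
    """
    child = np.asarray(child, dtype=float)
    parent = np.asarray(parent, dtype=float)

    # Joint magnitude of each coefficient and its parent at the coarser scale.
    r = np.sqrt(child**2 + parent**2)

    # Threshold grows with the noise variance and shrinks with signal spread.
    thresh = np.sqrt(3.0) * sigma_n**2 / sigma

    # Soft-threshold the joint magnitude; guard against division by zero.
    gain = np.maximum(r - thresh, 0.0) / np.maximum(r, 1e-12)
    return gain * child
```

Unlike a marginal estimator, the gain applied to a coefficient here depends on its parent: a coefficient of the same size is shrunk less when its parent is large, which is how the rule uses inter-scale dependency.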
@article{"International Journal of Electrical, Electronic and Communication Sciences:60849",
  author = "Suman Senapati and Goutam Saha",
  title = "Speaker Identification by Joint Statistical Characterization in the Log Gabor Wavelet Domain",
  keywords = "Speaker Identification, Log Gabor Wavelet, Bayesian Bivariate Estimator, Circularly Symmetric Probability Density Function, SIRP",
  volume = "1",
  number = "9",
  pages = "1356-9",
}