Speaker Identification by Joint Statistical Characterization in the Log Gabor Wavelet Domain

Real-world Speaker Identification (SI) applications differ from ideal or laboratory conditions: perturbations cause a mismatch between the training and testing environments and degrade performance drastically. Many strategies have been adopted to cope with acoustic degradation; the wavelet-based Bayesian marginal model is one of them. However, Bayesian marginal models cannot capture the statistical dependencies between different wavelet scales. Simple nonlinear estimators for wavelet-based denoising assume that the wavelet coefficients at different scales are independent, yet wavelet coefficients exhibit significant inter-scale dependency. This paper exploits this inter-scale dependency by modeling it with a Circularly Symmetric Probability Density Function (CS-PDF) related to the family of Spherically Invariant Random Processes (SIRPs) in the Log Gabor Wavelet (LGW) domain, and derives the corresponding joint shrinkage estimator as a Maximum a Posteriori (MAP) estimator. Based on these, a framework is proposed to denoise speech signals for automatic speaker identification. The robustness of the proposed framework is tested on a Text-Independent Speaker Identification task with 100 speakers from the POLYCOST speech database and 100 speakers from the YOHO speech database in three different noise environments. Experimental results show that the proposed estimator yields a larger improvement in identification accuracy than other estimators when used with the popular Gaussian Mixture Model (GMM) based speaker model and Mel-Frequency Cepstral Coefficient (MFCC) features.
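
To illustrate what a joint shrinkage estimator derived by MAP under a circularly symmetric prior can look like, the following is a minimal sketch. It does not reproduce the SIRP-based CS-PDF of this paper, whose exact form is not given in the abstract. As a stand-in it assumes the well-known bivariate model of Sendur and Selesnick, p(w1, w2) proportional to exp(-(sqrt(3)/sigma) * sqrt(w1^2 + w2^2)), with additive Gaussian noise. The function names, the real-valued coefficients, and the parameters `noise_var` and `signal_std` are assumptions made for illustration only.

```python
import math


def joint_shrink(child, parent, noise_var, signal_std):
    """MAP estimate of a clean coefficient from a noisy (child, parent) pair.

    Assumed prior (stand-in, not the paper's CS-PDF): circularly symmetric,
    p(w1, w2) ~ exp(-(sqrt(3) / signal_std) * sqrt(w1**2 + w2**2)).
    Assumed noise: zero-mean Gaussian with variance noise_var.
    """
    # Joint magnitude of the coefficient and its coarser-scale parent.
    radius = math.hypot(child, parent)
    if radius == 0.0:
        return 0.0
    # The threshold grows with the noise level and shrinks with signal energy.
    threshold = math.sqrt(3.0) * noise_var / signal_std
    # Soft-threshold the joint magnitude, then rescale the child coefficient.
    gain = max(radius - threshold, 0.0) / radius
    return gain * child


def denoise_scale(children, parents, noise_var, signal_std):
    """Apply the joint shrinkage rule to every (child, parent) pair of a scale."""
    return [
        joint_shrink(c, p, noise_var, signal_std)
        for c, p in zip(children, parents)
    ]
```

In this sketch the shrinkage of a coefficient depends on its parent: a small coefficient is set to zero when its parent is also small, but is retained (attenuated) when the parent is large. A marginal estimator, which treats the scales as independent, cannot express this behavior.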



