Abstract: Speaker recognition is performed in high Additive White Gaussian Noise (AWGN) environments using principles of Computational Auditory Scene Analysis (CASA). CASA methods often classify sounds from images in the time-frequency (T-F) plane, using spectrograms or cochleagrams as the image. In this paper, atomic decomposition implemented by matching pursuit transforms time-series speech signals to the T-F plane. The atomic decomposition creates a sparsely populated T-F vector in “weight space” where each populated T-F position contains an amplitude weight. The weight space vector, along with the atomic dictionary, represents a denoised, compressed version of the original signal. The arrangement of the atomic indices in the T-F vector is used for classification. Unsupervised feature learning implemented by a sparse autoencoder learns a single dictionary of basis features from a collection of envelope samples from all speakers. The approach is demonstrated using pairs of speakers from the TIMIT data set. Pairs of speakers are selected randomly from a single district. Each speaker has 10 sentences: two are used for training and eight for testing. Atomic index probabilities are created for each training sentence and for each test sentence. Classification is performed by finding the lowest Euclidean distance between the probabilities from the training sentences and those from the test sentences. Training is done at a 30 dB Signal-to-Noise Ratio (SNR). Testing is performed at SNRs of 0 dB, 5 dB, 10 dB and 30 dB. The algorithm has a baseline classification accuracy of ~93% averaged over 10 pairs of speakers from the TIMIT data set. The baseline accuracy is attributable to the short sequences of training and test data as well as the overall simplicity of the classification algorithm. The accuracy is not affected by AWGN: the algorithm still produces ~93% accuracy at 0 dB SNR.
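The classification step described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. It assumes that an "atomic index probability" is the relative frequency with which each dictionary index is selected in a sentence; the function names, the dictionary size, and the toy index sequences are hypothetical.

```python
import math
from collections import Counter


def index_probabilities(atom_indices, dictionary_size):
    """Relative frequency of each dictionary atom index in one sentence.

    Assumption: this is how an 'atomic index probability' is formed.
    """
    if not atom_indices:
        raise ValueError("need at least one selected atom index")
    counts = Counter(atom_indices)
    total = len(atom_indices)
    return [counts.get(k, 0) / total for k in range(dictionary_size)]


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify(test_indices, training_profiles, dictionary_size):
    """Return the speaker label of the nearest training sentence profile.

    training_profiles is a list of (label, probability_vector) pairs, one
    pair per training sentence, so a speaker may appear more than once.
    """
    test_profile = index_probabilities(test_indices, dictionary_size)
    label, _ = min(
        training_profiles,
        key=lambda pair: euclidean_distance(test_profile, pair[1]),
    )
    return label


# Toy data: hypothetical atom indices, not TIMIT results.
DICTIONARY_SIZE = 6
training_profiles = [
    ("speaker_a", index_probabilities([0, 0, 1, 2, 0, 1], DICTIONARY_SIZE)),
    ("speaker_b", index_probabilities([3, 4, 5, 5, 4, 3], DICTIONARY_SIZE)),
]
predicted = classify([0, 1, 0, 2, 2, 5], training_profiles, DICTIONARY_SIZE)
print(predicted)
```

Keeping one profile per training sentence, rather than pooling a speaker's sentences, mirrors the abstract's statement that probabilities are created for each training sentence; pooling would be an equally plausible variant.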
Abstract: A simple adaptive voice activity detector (VAD) is
implemented using Gabor and gammatone atomic decomposition of
speech for high Gaussian noise environments. Matching pursuit is
used for atomic decomposition, and is shown to achieve optimal
speech detection capability at high data compression rates for low
signal-to-noise ratios. The most active dictionary elements found by
matching pursuit are used for the signal reconstruction so that the
algorithm adapts to the individual speaker's dominant time-frequency
characteristics. Speech has a high peak-to-average ratio, enabling the
matching pursuit greedy heuristic of selecting the highest inner products to isolate
high energy speech components in high noise environments. Gabor
and gammatone atoms are both investigated with identical
logarithmically spaced center frequencies, and similar bandwidths.
The algorithm performs equally well for both Gabor and gammatone
atoms, with no statistically significant difference. The algorithm
achieves 70% accuracy at a 0 dB SNR, 90% accuracy at a 5 dB SNR,
and 98% accuracy at a 20 dB SNR, using 30 dB SNR as a reference
for voice activity.
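The greedy selection by highest inner product can be illustrated with a small sketch. It is written in plain Python under stated assumptions and is not the authors' implementation: the Gabor atom form (Gaussian envelope times a cosine), the envelope width, the time shifts, the frequency range, and all function names are hypothetical choices. Only the Gabor variant is shown, and the voice activity decision itself is not modeled.

```python
import math
import random


def gabor_atom(length, center, freq_hz, sample_rate, width):
    """Unit-norm real Gabor atom: Gaussian envelope times a cosine."""
    atom = [
        math.exp(-0.5 * ((n - center) / width) ** 2)
        * math.cos(2.0 * math.pi * freq_hz * (n - center) / sample_rate)
        for n in range(length)
    ]
    norm = math.sqrt(sum(a * a for a in atom))
    return [a / norm for a in atom]


def log_spaced(f_low, f_high, count):
    """Logarithmically spaced center frequencies from f_low to f_high."""
    ratio = (f_high / f_low) ** (1.0 / (count - 1))
    return [f_low * ratio ** k for k in range(count)]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    power = sum(s * s for s in signal) / len(signal)
    sigma = math.sqrt(power / 10.0 ** (snr_db / 10.0))
    return [s + rng.gauss(0.0, sigma) for s in signal]


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy decomposition: repeatedly pick the atom with the largest
    absolute inner product with the residual and subtract its projection."""
    residual = list(signal)
    selected = []  # (atom index, weight) pairs
    for _ in range(n_atoms):
        products = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(products[k]))
        weight = products[best]
        residual = [r - weight * a for r, a in zip(residual, dictionary[best])]
        selected.append((best, weight))
    return selected, residual


# Toy setup (all values are illustrative assumptions).
SAMPLE_RATE, LENGTH, WIDTH = 8000, 256, 16.0
frequencies = log_spaced(200.0, 3200.0, 5)
dictionary = [
    gabor_atom(LENGTH, center, f, SAMPLE_RATE, WIDTH)
    for center in (64, 128, 192)
    for f in frequencies
]

true_index = 7  # one high-energy component standing in for speech
clean = [3.0 * a for a in dictionary[true_index]]
noisy = add_awgn(clean, 0.0, random.Random(0))  # 0 dB SNR

selected, residual = matching_pursuit(noisy, dictionary, n_atoms=3)
print(selected[0][0] == true_index)
```

Because the clean component concentrates its energy in a few samples while the noise energy is spread over the whole frame, the first selected atom in this toy case matches the embedded component even at 0 dB SNR, which is the peak-to-average argument made in the abstract.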