A Simple Adaptive Atomic Decomposition Voice Activity Detector Implemented by Matching Pursuit

A simple adaptive voice activity detector (VAD) is implemented using Gabor and gammatone atomic decomposition of speech for high Gaussian noise environments. Matching pursuit is used for atomic decomposition, and is shown to achieve optimal speech detection capability at high data compression rates for low signal to noise ratios. The most active dictionary elements found by matching pursuit are used for the signal reconstruction so that the algorithm adapts to the individual speakers dominant time-frequency characteristics. Speech has a high peak to average ratio enabling matching pursuit greedy heuristic of highest inner products to isolate high energy speech components in high noise environments. Gabor and gammatone atoms are both investigated with identical logarithmically spaced center frequencies, and similar bandwidths. The algorithm performs equally well for both Gabor and gammatone atoms with no significant statistical differences. The algorithm achieves 70% accuracy at a 0 dB SNR, 90% accuracy at a 5 dB SNR and 98% accuracy at a 20dB SNR using 30d B SNR as a reference for voice activity.

Blind Source Separation based on the Estimation for the Number of the Blind Sources under a Dynamic Acoustic Environment

Independent component analysis can estimate unknown source signals from their mixtures under the assumption that the source signals are statistically independent. However, in a real environment, the separation performance is often deteriorated because the number of the source signals is different from that of the sensors. In this paper, we propose an estimation method for the number of the sources based on the joint distribution of the observed signals under two-sensor configuration. From several simulation results, it is found that the number of the sources is coincident to that of peaks in the histogram of the distribution. The proposed method can estimate the number of the sources even if it is larger than that of the observed signals. The proposed methods have been verified by several experiments.

Automotive 3-Microphone Noise Canceller in a Frequently Moving Noise Source Environment

A combined three-microphone voice activity detector (VAD) and noise-canceling system is studied to enhance speech recognition in an automobile environment. A previous experiment clearly shows the ability of the composite system to cancel a single noise source outside of a defined zone. This paper investigates the performance of the composite system when there are frequently moving noise sources (noise sources are coming from different locations but are not always presented at the same time) e.g. there is other passenger speech or speech from a radio when a desired speech is presented. To work in a frequently moving noise sources environment, whilst a three-microphone voice activity detector (VAD) detects voice from a “VAD valid zone", the 3-microphone noise canceller uses a “noise canceller valid zone" defined in freespace around the users head. Therefore, a desired voice should be in the intersection of the noise canceller valid zone and VAD valid zone. Thus all noise is suppressed outside this intersection of area. Experiments are shown for a real environment e.g. all results were recorded in a car by omni-directional electret condenser microphones.

VoIP Source Model based on the Hyperexponential Distribution

In this paper we present a statistical analysis of Voice over IP (VoIP) packet streams produced by the G.711 voice coder with voice activity detection (VAD). During telephone conversation, depending whether the interlocutor speaks (ON) or remains silent (OFF), packets are produced or not by a voice coder. As index of dispersion for both ON and OFF times distribution was greater than one, we used hyperexponential distribution for approximation of streams duration. For each stage of the hyperexponential distribution, we tested goodness of our fits using graphical methods, we calculated estimation errors, and performed Kolmogorov-Smirnov test. Obtained results showed that the precise VoIP source model can be based on the five-state Markov process.

Efficient DTW-Based Speech Recognition System for Isolated Words of Arabic Language

Despite the fact that Arabic language is currently one of the most common languages worldwide, there has been only a little research on Arabic speech recognition relative to other languages such as English and Japanese. Generally, digital speech processing and voice recognition algorithms are of special importance for designing efficient, accurate, as well as fast automatic speech recognition systems. However, the speech recognition process carried out in this paper is divided into three stages as follows: firstly, the signal is preprocessed to reduce noise effects. After that, the signal is digitized and hearingized. Consequently, the voice activity regions are segmented using voice activity detection (VAD) algorithm. Secondly, features are extracted from the speech signal using Mel-frequency cepstral coefficients (MFCC) algorithm. Moreover, delta and acceleration (delta-delta) coefficients have been added for the reason of improving the recognition accuracy. Finally, each test word-s features are compared to the training database using dynamic time warping (DTW) algorithm. Utilizing the best set up made for all affected parameters to the aforementioned techniques, the proposed system achieved a recognition rate of about 98.5% which outperformed other HMM and ANN-based approaches available in the literature.