Abstract: In real-field applications, the correct determination of voice segments highly improves the overall system accuracy and minimises the total computation time. This paper presents reliable measures of speech compression by detcting the end points of the speech signals prior to compressing them. The two different compession schemes used are the Global threshold and the Level- Dependent threshold techniques. The performance of the proposed method is tested wirh the Signal to Noise Ratios, Peak Signal to Noise Ratios and Normalized Root Mean Square Error parameter measures.
Abstract: Distant-talking voice-based HCI system suffers from
performance degradation due to mismatch between the acoustic
speech (runtime) and the acoustic model (training). Mismatch is
caused by the change in the power of the speech signal as observed at
the microphones. This change is greatly influenced by the change in
distance, affecting speech dynamics inside the room before reaching
the microphones. Moreover, as the speech signal is reflected, its
acoustical characteristic is also altered by the room properties. In
general, power mismatch due to distance is a complex problem. This
paper presents a novel approach in dealing with distance-induced
mismatch by intelligently sensing instantaneous voice power variation
and compensating model parameters. First, the distant-talking speech
signal is processed through microphone array processing, and the
corresponding distance information is extracted. Distance-sensitive
Gaussian Mixture Models (GMMs), pre-trained to capture both
speech power and room property are used to predict the optimal
distance of the speech source. Consequently, pre-computed statistic
priors corresponding to the optimal distance is selected to correct
the statistics of the generic model which was frozen during training.
Thus, model combinatorics are post-conditioned to match the power
of instantaneous speech acoustics at runtime. This results to an
improved likelihood in predicting the correct speech command at
farther distances. We experiment using real data recorded inside two
rooms. Experimental evaluation shows voice recognition performance
using our method is more robust to the change in distance compared
to the conventional approach. In our experiment, under the most
acoustically challenging environment (i.e., Room 2: 2.5 meters), our
method achieved 24.2% improvement in recognition performance
against the best-performing conventional method.
Abstract: This paper presents the cepstral and trispectral
analysis of a speech signal produced by normal men, men with
defective audition (deaf, deep deaf) and others affected by
tracheotomy, the trispectral analysis based on parametric methods
(Autoregressive AR) using the fourth order cumulant. These
analyses are used to detect and compare the pitches and the formants
of corresponding voiced sounds (vowel \a\, \i\ and \u\). The first
results appear promising, since- it seems after several experimentsthere
is no deformation of the spectrum as one could have supposed
it at the beginning, however these pathologies influenced the two
characteristics:
The defective audition influences to the formants contrary to the
tracheotomy, which influences the fundamental frequency (pitch).
Abstract: This paper presents a formant-tracking linear prediction
(FTLP) model for speech processing in noise. The main focus of this
work is the detection of formant trajectory based on Hidden Markov
Models (HMM), for improved formant estimation in noise. The
approach proposed in this paper provides a systematic framework for
modelling and utilization of a time- sequence of peaks which satisfies
continuity constraints on parameter; the within peaks are modelled
by the LP parameters. The formant tracking LP model estimation
is composed of three stages: (1) a pre-cleaning multi-band spectral
subtraction stage to reduce the effect of residue noise on formants
(2) estimation stage where an initial estimate of the LP model of
speech for each frame is obtained (3) a formant classification using
probability models of formants and Viterbi-decoders. The evaluation
results for the estimation of the formant tracking LP model tested
in Gaussian white noise background, demonstrate that the proposed
combination of the initial noise reduction stage with formant tracking
and LPC variable order analysis, results in a significant reduction in
errors and distortions. The performance was evaluated with noisy
natual vowels extracted from international french and English vocabulary
speech signals at SNR value of 10dB. In each case, the
estimated formants are compared to reference formants.
Abstract: The speech signal conveys information about the
identity of the speaker. The area of speaker identification is
concerned with extracting the identity of the person speaking the
utterance. As speech interaction with computers becomes more
pervasive in activities such as the telephone, financial transactions
and information retrieval from speech databases, the utility of
automatically identifying a speaker is based solely on vocal
characteristic. This paper emphasizes on text dependent speaker
identification, which deals with detecting a particular speaker from a
known population. The system prompts the user to provide speech
utterance. System identifies the user by comparing the codebook of
speech utterance with those of the stored in the database and lists,
which contain the most likely speakers, could have given that speech
utterance. The speech signal is recorded for N speakers further the
features are extracted. Feature extraction is done by means of LPC
coefficients, calculating AMDF, and DFT. The neural network is
trained by applying these features as input parameters. The features
are stored in templates for further comparison. The features for the
speaker who has to be identified are extracted and compared with the
stored templates using Back Propogation Algorithm. Here, the
trained network corresponds to the output; the input is the extracted
features of the speaker to be identified. The network does the weight
adjustment and the best match is found to identify the speaker. The
number of epochs required to get the target decides the network
performance.
Abstract: In this paper in consideration of each available
techniques deficiencies for speech recognition, an advanced method
is presented that-s able to classify speech signals with the high
accuracy (98%) at the minimum time. In the presented method, first,
the recorded signal is preprocessed that this section includes
denoising with Mels Frequency Cepstral Analysis and feature
extraction using discrete wavelet transform (DWT) coefficients; Then
these features are fed to Multilayer Perceptron (MLP) network for
classification. Finally, after training of neural network effective
features are selected with UTA algorithm.
Abstract: Preprocessing of speech signals is considered a crucial step in the development of a robust and efficient speech or speaker recognition system. In this paper, we present some popular statistical outlier-detection based strategies to segregate the silence/unvoiced part of the speech signal from the voiced portion. The proposed methods are based on the utilization of the 3 σ edit rule, and the Hampel Identifier which are compared with the conventional techniques: (i) short-time energy (STE) based methods, and (ii) distribution based methods. The results obtained after applying the proposed strategies on some test voice signals are encouraging.
Abstract: This paper investigates the performance of a speech
recognizer in an interactive voice response system for various coded
speech signals, coded by using a vector quantization technique namely
Multi Switched Split Vector Quantization Technique. The process of
recognizing the coded output can be used in Voice banking application.
The recognition technique used for the recognition of the coded speech
signals is the Hidden Markov Model technique. The spectral distortion
performance, computational complexity, and memory requirements of
Multi Switched Split Vector Quantization Technique and the
performance of the speech recognizer at various bit rates have been
computed. From results it is found that the speech recognizer is
showing better performance at 24 bits/frame and it is found that the
percentage of recognition is being varied from 100% to 93.33% for
various bit rates.
Abstract: This work presents a novel means of extracting fixedlength parameters from voice signals, such that words can be recognized
in linear time. The power and the zero crossing rate are first
calculated segment by segment from a voice signal; by doing so, two
feature sequences are generated. We then construct an FIR system
across these two sequences. The parameters of this FIR system, used
as the input of a multilayer proceptron recognizer, can be derived by
recursive LSE (least-square estimation), implying that the complexity of overall process is linear to the signal size. In the second part of
this work, we introduce a weighting factor λ to emphasize recent
input; therefore, we can further recognize continuous speech signals.
Experiments employ the voice signals of numbers, from zero to nine, spoken in Mandarin Chinese. The proposed method is verified to
recognize voice signals efficiently and accurately.
Abstract: Independent component analysis (ICA) in the
frequency domain is used for solving the problem of blind source
separation (BSS). However, this method has some problems. For
example, a general ICA algorithm cannot determine the permutation
of signals which is important in the frequency domain ICA. In this
paper, we propose an approach to the solution for a permutation
problem. The idea is to effectively combine two conventional
approaches. This approach improves the signal separation
performance by exploiting features of the conventional approaches.
We show the simulation results using artificial data.
Abstract: The development of the signal compression
algorithms is having compressive progress. These algorithms are
continuously improved by new tools and aim to reduce, an average,
the number of bits necessary to the signal representation by means of
minimizing the reconstruction error. The following article proposes
the compression of Arabic speech signal by a hybrid method
combining the wavelet transform and the linear prediction. The
adopted approach rests, on one hand, on the original signal
decomposition by ways of analysis filters, which is followed by the
compression stage, and on the other hand, on the application of the
order 5, as well as, the compression signal coefficients. The aim of
this approach is the estimation of the predicted error, which will be
coded and transmitted. The decoding operation is then used to
reconstitute the original signal. Thus, the adequate choice of the
bench of filters is useful to the transform in necessary to increase the
compression rate and induce an impercevable distortion from an
auditive point of view.
Abstract: In this work, we are interested in developing a speech denoising tool by using a discrete wavelet packet transform (DWPT). This speech denoising tool will be employed for applications of recognition, coding and synthesis. For noise reduction, instead of applying the classical thresholding technique, some wavelet packet nodes are set to zero and the others are thresholded. To estimate the non stationary noise level, we employ the spectral entropy. A comparison of our proposed technique to classical denoising methods based on thresholding and spectral subtraction is made in order to evaluate our approach. The experimental implementation uses speech signals corrupted by two sorts of noise, white and Volvo noises. The obtained results from listening tests show that our proposed technique is better than spectral subtraction. The obtained results from SNR computation show the superiority of our technique when compared to the classical thresholding method using the modified hard thresholding function based on u-law algorithm.
Abstract: This paper presents a recognition system for isolated
words like robot commands. It’s carried out by Time Delay Neural
Networks; TDNN. To teleoperate a robot for specific tasks as turn,
close, etc… In industrial environment and taking into account the
noise coming from the machine. The choice of TDNN is based on its
generalization in terms of accuracy, in more it acts as a filter that
allows the passage of certain desirable frequency characteristics of
speech; the goal is to determine the parameters of this filter for
making an adaptable system to the variability of speech signal and to
noise especially, for this the back propagation technique was used in
learning phase. The approach was applied on commands pronounced
in two languages separately: The French and Arabic. The results for
two test bases of 300 spoken words for each one are 87%, 97.6% in
neutral environment and 77.67%, 92.67% when the white Gaussian
noisy was added with a SNR of 35 dB.
Abstract: In this paper, we propose a novel time-frequency distribution (TFD) for the analysis of multi-component signals. In particular, we use synthetic as well as real-life speech signals to prove the superiority of the proposed TFD in comparison to some existing ones. In the comparison, we consider the cross-terms suppression and the high energy concentration of the signal around its instantaneous frequency (IF).