Abstract: The aim of this study is to determine the acoustic features of the voice and speech of children with autism spectrum disorders (ASD) as a possible additional diagnostic criterion. The participants were 95 children with ASD aged 5-16 years, 150 typically developing (TD) children, and 103 adults who listened to the children’s speech samples. Three types of speech analysis were performed: spectrographic analysis, perceptual evaluation by listeners, and automatic recognition. In the speech of children with ASD, the pitch values, pitch range, and the frequency and intensity of the third (“emotional”) formant, which produce the “atypical” spectrograms of vowels, are higher than the corresponding parameters in the speech of TD children. High values of the vowel articulation index (VAI) are also characteristic of the speech signals of children with ASD. These acoustic features can be considered a diagnostic marker of autism. The ability of both human listeners and automatic classifiers to recognize the psychoneurological state of children from their speech is also determined.
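As a concrete illustration of the VAI mentioned above, the short sketch below computes it from corner-vowel formant frequencies using a commonly cited formulation from the literature (the abstract does not specify which variant the study used); the example formant values are illustrative only, not data from the study.

```python
def vowel_articulation_index(f1_a, f1_i, f1_u, f2_a, f2_i, f2_u):
    """Vowel articulation index from corner-vowel formants (Hz).

    A commonly used formulation:
        VAI = (F2/i/ + F1/a/) / (F1/i/ + F1/u/ + F2/u/ + F2/a/)
    Larger values indicate a more expanded (clearer) vowel space.
    """
    return (f2_i + f1_a) / (f1_i + f1_u + f2_u + f2_a)

# Illustrative formant values (Hz) for /a/, /i/, /u/ -- not from the study.
print(vowel_articulation_index(f1_a=850, f1_i=300, f1_u=350,
                               f2_a=1400, f2_i=2300, f2_u=800))
```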
Abstract: This paper proposes strategies for level crossing (LC) sampling and reconstruction that provide alias-free, high-fidelity signal reconstruction for speech signals without the number of samples growing exponentially with bit-depth. We introduce LC sampling methods that reduce the sampling rate to close to the Nyquist rate even for large bit-depths. The results indicate that larger variation in the sampling intervals leads to an alias-free sampling scheme; this is achieved either by reducing the bit-depth or by adding jitter to the system at high bit-depths. In conjunction with windowing, the signal is reconstructed from the LC samples using an efficient Toeplitz reconstruction algorithm.
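For context, the following is a minimal sketch of level-crossing sampling with optional level jitter, in the spirit of the aliasing remedy described above; it is not the paper's algorithm, and the bit-depth, jitter scale, and test signal are arbitrary choices for illustration.

```python
import numpy as np

def lc_sample(x, t, bits=4, jitter=0.0, rng=None):
    """Level-crossing sampling: record (time, value) whenever the signal
    crosses one of 2**bits uniformly spaced amplitude levels. Optional
    jitter perturbs each level to break up regular crossing patterns."""
    rng = np.random.default_rng() if rng is None else rng
    levels = np.linspace(x.min(), x.max(), 2 ** bits)
    times, values = [], []
    for n in range(1, len(x)):
        lo, hi = sorted((x[n - 1], x[n]))
        for L in levels[(levels >= lo) & (levels <= hi)]:
            Lj = L + jitter * rng.standard_normal()
            if lo <= Lj <= hi:
                # linear interpolation of the crossing instant
                a = (Lj - x[n - 1]) / (x[n] - x[n - 1] + 1e-12)
                times.append(t[n - 1] + a * (t[n] - t[n - 1]))
                values.append(Lj)
    return np.array(times), np.array(values)

fs = 8000.0
t = np.arange(0, 0.05, 1 / fs)
x = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 800 * t)
ts, vs = lc_sample(x, t, bits=4, jitter=0.02)
print(len(ts), "LC samples from", len(x), "uniform samples")
```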
Abstract: This paper proposes strategies for level crossing (LC) sampling and reconstruction that provide high-fidelity signal reconstruction for speech signals; these strategies circumvent the problem of the exponentially increasing number of samples as the bit-depth is increased and hence are highly efficient. Specifically, the results indicate that the distribution of the intervals between samples is one of the key factors in the quality of signal reconstruction: including samples with short intervals does not improve the accuracy of the reconstruction, whilst samples with large intervals lead to numerical instability. The proposed sampling method, termed reduced conventional level crossing (RCLC) sampling, exploits redundancy between samples to improve the efficiency of the sampling without compromising performance. A reconstruction technique is also proposed that enhances numerical stability through linear interpolation of samples separated by large intervals; interpolation is demonstrated to improve the accuracy of the signal reconstruction in addition to the numerical stability. We further demonstrate that the RCLC and interpolation methods can give useful levels of signal recovery even when the average sampling rate is below the Nyquist rate.
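The stabilizing interpolation step can be sketched as follows: wherever two consecutive LC samples are separated by more than some maximum gap, linearly interpolated samples are inserted between them. This is only a schematic reading of the abstract (the gap threshold and spacing rule are assumptions), not the paper's exact reconstruction technique.

```python
import numpy as np

def fill_long_gaps(ts, vs, max_gap):
    """Insert linearly interpolated samples wherever consecutive LC
    samples (times ts, values vs) are more than max_gap seconds apart."""
    out_t, out_v = [ts[0]], [vs[0]]
    for k in range(1, len(ts)):
        gap = ts[k] - ts[k - 1]
        if gap > max_gap:
            n_new = int(np.ceil(gap / max_gap)) - 1
            t_new = np.linspace(ts[k - 1], ts[k], n_new + 2)[1:-1]
            out_t.extend(t_new)
            out_v.extend(np.interp(t_new, [ts[k - 1], ts[k]],
                                   [vs[k - 1], vs[k]]))
        out_t.append(ts[k])
        out_v.append(vs[k])
    return np.array(out_t), np.array(out_v)
```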
Abstract: Voice biometric data associated with physiological, psychological, and other factors are widely used in forensic phonoscopy, and various methods exist for identifying and verifying a person by voice. This article investigates a minimal set of speech-signal measurements as individual parameters of a speech signal. Monozygotic twins are believed to be genetically identical; using the minimal speech-signal data, we conclude that the voiceprint of each monozygotic twin is nevertheless individual. From the experiment we conclude that these minimal speech-signal indicators are more stable and reliable for phonoscopic examinations.
Abstract: Intelligibility is an essential characteristic of a speech signal; it determines how well the information carried by the signal can be understood. Background noise in the environment can deteriorate the intelligibility of recorded speech. In this paper, we present a simple variance-subtracted, variable-level discrete wavelet transform method that improves the intelligibility of speech. The proposed algorithm does not require an explicit noise estimate, i.e., prior knowledge of the noise; hence, it is easy to implement and reduces the computational burden. The algorithm selects a separate decomposition level for each frame based on signal-dominant and noise-dominant criteria. Its performance is evaluated with the short-time objective intelligibility (STOI) measure, and the results are compared with universal Discrete Wavelet Transform (DWT) thresholding and Minimum Mean Square Error (MMSE) methods. The experimental results reveal that the proposed scheme outperforms the competing methods.
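The following sketch shows per-frame DWT denoising of the kind described above. The paper's signal-dominant/noise-dominant rule for choosing the decomposition level is not fully specified in the abstract, so a simple variance-based stand-in is used here; the wavelet, frame length, and universal soft threshold are conventional choices, not the authors'.

```python
import numpy as np
import pywt

def denoise_frame(frame, max_level=5):
    """Per-frame DWT denoising: choose a level from a variance criterion
    (a stand-in for the paper's rule), soft-threshold the detail
    coefficients with the universal threshold, and reconstruct."""
    level = min(max_level, max(1, int(np.log2(1 + np.var(frame) * 100))))
    coeffs = pywt.wavedec(frame, "db4", level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise estimate
    thr = sigma * np.sqrt(2 * np.log(len(frame)))          # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft")
                            for c in coeffs[1:]]
    return pywt.waverec(coeffs, "db4")[: len(frame)]

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 512)) + 0.3 * rng.standard_normal(512)
print(denoise_frame(x).shape)
```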
Abstract: The separation of speech signals has become a research hotspot in signal processing in recent years, with many applications in teleconferencing, hearing aids, machine speech recognition, and so on. The received sounds are usually noisy, and identifying the sounds of interest and obtaining clean sounds in such an environment is the problem of blind source separation. This paper focuses on under-determined blind source separation (UBSS), for which sparse component analysis is generally used. The method divides into two parts: first, a clustering algorithm estimates the mixing matrix from the observed signals; then the signals are separated using the known mixing matrix. This paper studies the mixing matrix estimation problem and proposes an improved algorithm to estimate the mixing matrix for speech signals in the UBSS model. The traditional potential-function algorithm estimates the mixing matrix inaccurately, especially at low signal-to-noise ratio (SNR). In response, this paper develops an improved potential function method for mixing matrix estimation. The algorithm not only avoids the influence of insufficient prior information that hampers traditional clustering algorithms, but also improves the estimation accuracy of the mixing matrix. Taking the mixing of four speech signals into two channels as an example, simulation results show that the proposed approach not only improves the estimation accuracy but also applies to any mixing matrix.
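To make the two-part pipeline concrete, the sketch below estimates a 2×N mixing matrix from two mixtures by clustering the directions of sparse time-frequency points; the weighted angle histogram plays the role of a potential function. This is a generic baseline, not the paper's improved potential function (whose details the abstract does not give), and the STFT size, energy mask, and peak-picking parameters are assumptions.

```python
import numpy as np
from scipy.signal import stft, find_peaks

def estimate_mixing_matrix(x1, x2, fs, n_sources):
    """Cluster high-energy TF points of two mixtures by direction;
    histogram peaks give the mixing-matrix column directions."""
    _, _, X1 = stft(x1, fs, nperseg=512)
    _, _, X2 = stft(x2, fs, nperseg=512)
    mag = np.abs(X1) + np.abs(X2)
    mask = mag > 0.1 * mag.max()                 # keep high-energy TF points
    theta = np.arctan2(np.abs(X2[mask]), np.abs(X1[mask]))  # in [0, pi/2]
    hist, edges = np.histogram(theta, bins=180, range=(0, np.pi / 2),
                               weights=mag[mask])
    peaks, _ = find_peaks(hist, distance=10)
    peaks = peaks[np.argsort(hist[peaks])[-n_sources:]]
    angles = (edges[peaks] + edges[peaks + 1]) / 2
    return np.vstack([np.cos(angles), np.sin(angles)])  # unit-norm columns
```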
Abstract: In this paper, we present a comparative subjective analysis of the Improved Minima Controlled Recursive Averaging (IMCRA) algorithm, the Kalman filter, and the cascade of the IMCRA and Kalman filter algorithms. The performance of speech enhancement algorithms can be predicted in two different ways. One is objective evaluation, in which speech quality parameters are predicted computationally. The second is a subjective listening test, in which the processed speech signal is presented to listeners, who judge the quality of the speech on certain parameters. The comparative objective evaluation of these algorithms, in terms of Global SNR, Segmental SNR, and Perceptual Evaluation of Speech Quality (PESQ), was reported by the authors previously, showing a substantial increase in the objective parameters with the cascaded algorithms. Since subjective evaluation is the real test of the quality of speech enhancement algorithms, the claimed superiority of the cascaded algorithms over the individual IMCRA and Kalman algorithms is verified through subjective analysis in this paper. The results of the subjective listening tests confirm that the cascaded algorithms perform better under all types of noise conditions.
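Of the objective measures mentioned above, segmental SNR is easy to state precisely: the per-frame SNR in dB, clamped to a fixed range and averaged over frames. A minimal implementation is shown below; the frame length and the conventional [-10, 35] dB clamp are standard choices, not taken from the paper.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame=256, lo=-10.0, hi=35.0):
    """Frame-wise segmental SNR in dB, averaged over frames, with the
    usual clamping of per-frame values to [lo, hi] dB."""
    n = (len(clean) // frame) * frame
    c = clean[:n].reshape(-1, frame)
    e = enhanced[:n].reshape(-1, frame)
    num = np.sum(c ** 2, axis=1)
    den = np.sum((c - e) ** 2, axis=1) + 1e-12
    snr = 10 * np.log10(num / den + 1e-12)
    return float(np.mean(np.clip(snr, lo, hi)))
```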
Abstract: This paper investigates MIMO (Multiple-Input Multiple-Output) adaptive filtering techniques for supervised source separation of convolutive mixtures. Starting from the observation that the signals of the different mixtures are correlated, an improvement to the NSAF (Normalized Subband Adaptive Filter) algorithm is proposed in order to accelerate its convergence rate. Simulation results with mixtures of speech signals in reverberant environments show the superior performance of the proposed algorithm relative to the NLMS (Normalized Least-Mean-Square) algorithm and the conventional NSAF, considering both convergence speed and SIR (Signal-to-Interference Ratio) after convergence.
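As a reference point for the baselines compared above, here is a standard NLMS adaptive filter; NSAF applies the same normalized update independently in each subband of an analysis filter bank, which speeds convergence for colored inputs such as speech. This is the textbook algorithm, not the paper's improved variant.

```python
import numpy as np

def nlms(x, d, taps=64, mu=0.5, eps=1e-6):
    """NLMS adaptive filter: identify the path from reference x to
    desired signal d; returns weights, output, and error."""
    w = np.zeros(taps)
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(taps, len(x)):
        u = x[n - taps:n][::-1]             # most recent samples first
        y[n] = w @ u
        e[n] = d[n] - y[n]
        w += mu * e[n] * u / (u @ u + eps)  # normalized LMS update
    return w, y, e
```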
Abstract: Speech recognition makes an important contribution to new technologies in human-computer interaction. Today, there is a growing need to employ speech technology in daily life and business activities. However, speech recognition is a challenging task that requires several stages before the desired output is obtained. Among the components of automatic speech recognition (ASR) is the feature extraction process, which parameterizes the speech signal to produce the corresponding feature vectors, aiming to approximate the linguistic content conveyed by the input speech signal. In the speech processing field there are several methods to extract speech features; however, Mel Frequency Cepstral Coefficients (MFCC) is the most popular technique. It has long been observed that MFCC dominates in well-known recognizers such as the Carnegie Mellon University (CMU) Sphinx and the Hidden Markov Model Toolkit (HTK). Hence, this paper focuses on the MFCC method as the standard choice to identify the different speech segments and obtain the language phonemes for further training and decoding steps. Owing to its good performance, previous studies show that MFCC dominates Arabic ASR research. In this paper, we demonstrate MFCC as well as the intermediate steps performed to obtain these coefficients using the HTK toolkit.
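The intermediate steps referred to above can be laid out in a short from-scratch sketch: framing, windowing, power spectrum, mel filter bank, log, and DCT. Note that HTK's own implementation differs in details (pre-emphasis, liftering, normalization), so this is an illustrative pipeline, not a reproduction of HTK.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=16000, n_mfcc=13, n_filt=26, n_fft=512,
         frame_len=0.025, frame_step=0.010):
    """From-scratch MFCC: frames -> Hamming window -> power spectrum
    -> mel filter bank -> log -> DCT, keeping the first n_mfcc terms."""
    flen, fstep = int(fs * frame_len), int(fs * frame_step)
    n_frames = 1 + max(0, (len(signal) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced uniformly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    logmel = np.log(power @ fb.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_mfcc]

x = np.random.default_rng(0).standard_normal(16000)  # 1 s dummy "speech"
print(mfcc(x).shape)  # (frames, 13)
```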
Abstract: We propose a system to address real environmental noise and channel mismatch in forensic speaker verification. The method suppresses various types of real environmental noise using an independent component analysis (ICA) algorithm. From the enhanced speech signal, mel-frequency cepstral coefficients (MFCC), with or without feature warping, are extracted to capture the essential characteristics of the speech. Channel effects are reduced using an intermediate vector (i-vector) with a probabilistic linear discriminant analysis (PLDA) approach for classification. The proposed algorithm is evaluated on an Australian forensic voice comparison database combined with car, street, and home noises from QUT-NOISE at signal-to-noise ratios (SNR) ranging from -10 dB to 10 dB. Experimental results indicate that MFCC feature warping with ICA reduces the equal error rate by about 48.22%, 44.66%, and 50.07% relative to MFCC feature warping alone when the test speech signals are corrupted with random sessions of street, car, and home noise at -10 dB SNR.
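Feature warping, used above on top of the MFCCs, maps each cepstral dimension to a standard normal distribution over a sliding window: the rank of the center frame's value is pushed through the inverse Gaussian CDF. The sketch below shows the standard scheme; the window length is the conventional ~3 s value, not necessarily the paper's setting.

```python
import numpy as np
from scipy.stats import norm

def feature_warp(feats, win=301):
    """Warp each column of feats (frames x dims) so that, within a
    sliding window, its values follow a standard normal distribution."""
    half = win // 2
    T, D = feats.shape
    out = np.zeros_like(feats, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        w = feats[lo:hi]
        for d in range(D):
            rank = np.sum(w[:, d] < feats[t, d]) + 1
            out[t, d] = norm.ppf((rank - 0.5) / len(w))
    return out
```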
Abstract: This paper investigates the problem of blind speech separation from a mixture of two speakers. A voice activity detector employing the Steered Response Power with Phase Transform (SRP-PHAT) is presented for detecting the activity of the speech sources, and the desired speech signals are then extracted from the mixture by an optimal beamformer. To evaluate the algorithm’s effectiveness, a simulation using real speech recordings was performed in a double-talk situation where two speakers are active all the time. The evaluations show that the proposed blind speech separation algorithm offers a good interference suppression level whilst maintaining a low distortion level of the desired signal.
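The building block of SRP-PHAT is the GCC-PHAT correlation between a microphone pair: the steered response power of a candidate location is the sum of such correlations evaluated at the corresponding pairwise delays. The sketch below computes GCC-PHAT and the resulting time-difference-of-arrival estimate; it is a generic implementation, not the paper's full VAD.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """GCC-PHAT between two microphone signals; returns the estimated
    TDOA (seconds) and the phase-transformed cross-correlation."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    R = X1 * np.conj(X2)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)   # phase transform
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return tau, cc
```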
Abstract: Speaker recognition is performed in high Additive White Gaussian Noise (AWGN) environments using principles of Computational Auditory Scene Analysis (CASA). CASA methods often classify sounds from images in the time-frequency (T-F) plane, using spectrograms or cochleagrams as the image. In this paper, atomic decomposition implemented by matching pursuit transforms time-series speech signals into the T-F plane. The atomic decomposition creates a sparsely populated T-F vector in “weight space”, where each populated T-F position contains an amplitude weight. The weight-space vector, together with the atomic dictionary, represents a denoised, compressed version of the original signal. The arrangement of the atomic indices in the T-F vector is used for classification. Unsupervised feature learning implemented by a sparse autoencoder learns a single dictionary of basis features from a collection of envelope samples from all speakers. The approach is demonstrated using pairs of speakers from the TIMIT data set, selected randomly from a single district. Each speaker has 10 sentences: two are used for training and eight for testing. Atomic index probabilities are created for each training sentence and for each test sentence, and classification is performed by finding the lowest Euclidean distance between the probabilities of the training and test sentences. Training is done at a 30 dB Signal-to-Noise Ratio (SNR); testing is performed at SNRs of 0 dB, 5 dB, 10 dB, and 30 dB. The algorithm has a baseline classification accuracy of ~93%, averaged over 10 pairs of speakers from the TIMIT data set. The baseline accuracy is attributable to the short training and test sequences and to the overall simplicity of the classification algorithm. The accuracy is not affected by the AWGN: the algorithm still achieves ~93% at 0 dB SNR.
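The classification back end described above is simple enough to sketch directly: each sentence is summarized by a normalized histogram of the dictionary indices its matching-pursuit decomposition used, and test sentences are assigned to the nearest training histogram in Euclidean distance. The matching-pursuit front end and sparse-autoencoder dictionary are not reproduced here.

```python
import numpy as np

def index_histogram(atom_indices, dict_size):
    """Normalized histogram of the dictionary indices selected by a
    sentence's matching-pursuit decomposition."""
    h = np.bincount(atom_indices, minlength=dict_size).astype(float)
    return h / max(h.sum(), 1.0)

def classify(test_hist, train_hists):
    """Nearest-neighbour classification by Euclidean distance between
    atomic index probability vectors; returns the training index."""
    d = np.linalg.norm(train_hists - test_hist, axis=1)
    return int(np.argmin(d))
```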
Abstract: Automatic Speech Recognition (ASR) systems have recently been used to assist children in language acquisition because of their ability to recognize the human speech signal. Despite the benefits offered by ASR, there is a lack of ASR systems for Malay-speaking children; one contributing factor is the lack of a continuous speech database for the target users. Although cross-lingual adaptation is a common solution for developing ASR systems for under-resourced languages, it is not viable for children because very limited children’s speech databases are available as source models. In this research, we propose a two-stage adaptation for developing an ASR system for Malay-speaking children using a very limited database. The two-stage adaptation comprises cross-lingual adaptation (first stage) and cross-age adaptation (second stage). In the first stage, a well-known speech database that is phonetically rich and balanced is adapted to a medium-sized database of Malay adults using supervised MLLR. The second-stage adaptation uses the acoustic model generated by the first adaptation, with a small database of the target users as the target data. We measured the performance of the proposed technique using word error rate and compared it with the conventional benchmark adaptation. The two-stage adaptation proposed in this research achieves better recognition accuracy than the benchmark adaptation in recognizing children’s speech.
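The word error rate used as the performance measure above is the edit distance between reference and hypothesis transcripts, normalized by the reference length: WER = (S + D + I) / N. A minimal implementation follows; the example sentences are made up for illustration.

```python
import numpy as np

def wer(ref, hyp):
    """Word error rate via edit distance: (S + D + I) / N."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[len(r), len(h)] / max(len(r), 1)

print(wer("saya suka makan nasi", "saya suka nasi goreng"))  # 0.5
```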
Abstract: This paper describes a method for AWGN (Additive White Gaussian Noise) variance estimation in noisy stochastic signals, referred to as Multiplicative-Noising Variance Estimation (MNVE). The aim was to develop an estimation algorithm with a minimal number of assumptions about the original signal structure. A MATLAB simulation and analysis of the method applied to speech signals showed higher accuracy than the standard autoregressive (AR) modeling noise estimation technique. In addition, strong performance was observed at very low signal-to-noise ratios, which generally represent the worst-case scenario for signal denoising methods. High execution time appears to be the only disadvantage of MNVE. After close examination of all the observed features of the proposed algorithm, we conclude that it is worth exploring further and that, with some adjustments and improvements, it can become considerably more powerful.
Abstract: Speech segmentation is the detection of change points that partition an input speech signal into regions, each of which corresponds to only one speaker. In this paper, we apply two features based on the multi-scale product (MP) of the clean speech, namely the spectral centroid of the MP and the zero-crossing rate of the MP. We focus on multi-scale product analysis as an important tool for segmentation. The MP is formed by multiplying the wavelet transform coefficients (WTC) of the speech signal across scales. We have evaluated our method on the Keele database. The results show the effectiveness of our method: the two features can find word boundaries and extract the segments of the clean speech.
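The multi-scale product itself can be sketched compactly: the signal is differentiated at several dyadic scales (here a first-derivative wavelet is approximated by a smoothed difference), and the per-scale outputs are multiplied point-wise, so that singularities such as word boundaries reinforce across scales while noise does not. The specific wavelet and scales below are assumptions, not the paper's.

```python
import numpy as np

def multiscale_product(x, scales=(1, 2, 4)):
    """Point-wise product of smoothed-derivative outputs at several
    dyadic scales; peaks mark signal singularities."""
    p = np.ones_like(x, dtype=float)
    for s in scales:
        kernel = np.concatenate((np.ones(s), -np.ones(s))) / (2 * s)
        w = np.convolve(x, kernel, mode="same")   # derivative at scale s
        p *= w
    return p
```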
Abstract: Speech enhancement is a long-standing problem with numerous applications such as teleconferencing, VoIP, hearing aids, and speech recognition. The motivation behind this work is to obtain a clean speech signal of higher quality by applying an optimal noise cancellation technique. Real-time adaptive filtering algorithms seem to be the best candidates among all categories of speech enhancement methods. In this paper, we propose a speech enhancement method based on Recursive Least Squares (RLS) adaptive filtering of speech signals. Experiments were performed on noisy data prepared by adding AWGN, babble, and pink noise to clean speech samples at -5 dB, 0 dB, 5 dB, and 10 dB SNR levels. We then compare the noise cancellation performance of the proposed RLS algorithm with the existing NLMS algorithm in terms of Mean Squared Error (MSE), Signal to Noise Ratio (SNR), and SNR loss. Based on the performance evaluation, the proposed RLS algorithm was found to be the better optimal noise cancellation technique for speech signals.
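For reference, here is the standard RLS recursion the method builds on; the filter order, forgetting factor, and initialization are conventional defaults, not the paper's settings.

```python
import numpy as np

def rls(x, d, taps=32, lam=0.999, delta=100.0):
    """Standard RLS adaptive filter: reference x, desired (noisy) d.
    P estimates the inverse input correlation matrix; lam is the
    forgetting factor."""
    w = np.zeros(taps)
    P = np.eye(taps) * delta
    e = np.zeros(len(x))
    for n in range(taps, len(x)):
        u = x[n - taps:n][::-1]
        k = P @ u / (lam + u @ P @ u)        # gain vector
        e[n] = d[n] - w @ u                   # a priori error
        w += k * e[n]
        P = (P - np.outer(k, u @ P)) / lam    # Riccati update
    return w, e
```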
Abstract: In this paper, a hybrid modeling algorithm combining Fuzzy C-Means clustering with an Expectation Maximization-Gaussian Mixture Model is proposed for continuous Tamil speech recognition. Speech sentences from various speakers are used for the training and testing phases, and objective measures are compared between the proposed and existing continuous speech recognition algorithms. The simulation results show that the proposed algorithm improves the recognition accuracy and F-measure by up to 3% compared to the existing algorithms for speech signals from various speakers. In addition, it reduces the Word Error Rate, Error Rate, and Error by up to 4% compared to the existing algorithms. In all aspects, the proposed hybrid modeling for Tamil speech recognition provides significant improvements for speech-to-text conversion in various applications.
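One plausible reading of the hybrid is that Fuzzy C-Means supplies initial cluster centers for EM training of the GMM. The sketch below shows that combination on dummy feature vectors; it is an assumed interpretation of the abstract, not the authors' exact pipeline, and the fuzzifier, iteration count, and component count are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fcm_centers(X, k, m=2.0, iters=50, rng=None):
    """Fuzzy C-Means: soft memberships U with fuzzifier m; returns the
    cluster centres, here used to initialize a GMM."""
    rng = np.random.default_rng(0) if rng is None else rng
    U = rng.random((len(X), k))
    U /= U.sum(1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        C = (Um.T @ X) / Um.sum(0)[:, None]
        dist = np.linalg.norm(X[:, None] - C[None], axis=2) + 1e-9
        U = 1.0 / (dist ** (2 / (m - 1)))
        U /= U.sum(1, keepdims=True)
    return C

X = np.random.default_rng(1).standard_normal((500, 13))  # dummy MFCC-like data
gmm = GaussianMixture(n_components=4, means_init=fcm_centers(X, 4)).fit(X)
print(gmm.means_.shape)
```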
Abstract: In this paper, a Least Mean Square (LMS) adaptive noise reduction algorithm is proposed to enhance the speech signal from noisy speech. The speech signal is enhanced by varying the step size as a function of the input signal. Objective and subjective measures are made under various noises for the proposed and existing algorithms. The experimental results show that the proposed LMS adaptive noise reduction algorithm reduces the Mean Square Error (MSE) and Log Spectral Distance (LSD) compared to earlier methods under various noise conditions with different input SNR levels. In addition, the proposed algorithm increases the Peak Signal to Noise Ratio (PSNR) and Segmental SNR improvement (ΔSNRseg) values and improves the Mean Opinion Score (MOS) compared to the various existing LMS adaptive noise reduction algorithms. These experimental results show that the proposed LMS adaptive noise reduction algorithm reduces speech distortion and residual noise compared to the existing methods.
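The abstract does not give the exact step-size rule, so the sketch below uses a common input-dependent variant: the step shrinks when recent input power is large (limiting speech distortion) and grows when it is small. It illustrates the idea of a variable step size, not the paper's specific algorithm.

```python
import numpy as np

def vs_lms(x, d, taps=32, mu_min=0.01, mu_max=0.5, eps=1e-6):
    """LMS with a step size that varies with recent input power."""
    w = np.zeros(taps)
    e = np.zeros(len(x))
    for n in range(taps, len(x)):
        u = x[n - taps:n][::-1]
        mu = np.clip(1.0 / (u @ u + eps), mu_min, mu_max)  # input-dependent step
        e[n] = d[n] - w @ u
        w += mu * e[n] * u
    return w, e
```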
Abstract: In this paper, a novel method for the detection of clipping in speech signals is described. It is shown that the new method outperforms known clipping detection methods, is easy to implement, and is robust to changes in signal amplitude, data size, etc. Statistical simulation results are presented.
Abstract: Speech enhancement is the process of eliminating noise and increasing the quality of a speech signal that is contaminated with noise and other kinds of distortion. This paper develops an optimal cascaded system for speech enhancement. The aim is attained without discarding any relevant speech information and without excessive computational or time complexity. The LMS algorithm, spectral subtraction, and the Kalman filter are deployed as the main de-noising algorithms in this work. Since each of these algorithms suffers from its own shortcomings, this work designs cascaded systems in different combinations and evaluates the cascades with qualitative (listening) and quantitative (SNR) tests.
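Of the three de-noising stages named above, spectral subtraction is the simplest to sketch: the noise magnitude spectrum is estimated from the first few (assumed speech-free) frames, over-subtracted, and floored to limit musical noise. The over-subtraction factor, floor, and STFT size below are conventional assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, alpha=2.0, beta=0.02):
    """Magnitude spectral subtraction with over-subtraction (alpha)
    and spectral flooring (beta) to suppress musical noise."""
    f, t, X = stft(noisy, fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean = np.maximum(mag - alpha * noise, beta * noise)
    _, y = istft(clean * np.exp(1j * phase), fs, nperseg=512)
    return y
```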