Abstract: The aim of this study is to determine the acoustic features of the voice and speech of children with autism spectrum disorders (ASD) as a possible additional diagnostic criterion. The participants were 95 children with ASD aged 5-16 years, 150 typically developing (TD) children, and 103 adults who listened to the children's speech samples. Three types of speech analysis were performed: spectrographic analysis, perceptual evaluation by listeners, and automatic recognition. In the speech of children with ASD, the pitch values, pitch range, and the frequency and intensity of the third ("emotional") formant, which produce "atypical" vowel spectrograms, are higher than the corresponding parameters in the speech of TD children. High values of the vowel articulation index (VAI) are also characteristic of the speech signals of children with ASD. These acoustic features can be considered a diagnostic marker of autism. The ability of human listeners and of automatic recognition to determine the psychoneurological state of children from their speech is also assessed.
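The vowel articulation index mentioned above is commonly computed from the first two formant frequencies (F1, F2) of the corner vowels /i/, /u/, and /a/; the sketch below uses one widespread formulation (the inverse of the formant centralization ratio), which may differ from the exact variant used in the study, and the formant values are illustrative placeholders, not the study's data.

```python
# Sketch of the vowel articulation index (VAI); higher values indicate
# more peripheral (less centralized) vowel articulation. The formula and
# formant values below are illustrative assumptions, not the paper's data.

def vai(f1_i, f2_i, f1_u, f2_u, f1_a, f2_a):
    """VAI from F1/F2 (in Hz) of the corner vowels /i/, /u/, /a/."""
    return (f2_i + f1_a) / (f1_i + f1_u + f2_u + f2_a)

# Typical peripheral (well-articulated) corner vowels vs. a centralized set.
peripheral = vai(300, 2300, 350, 800, 750, 1300)
centralized = vai(400, 1900, 400, 1100, 650, 1400)
```

Centralized articulation pulls F1 and F2 of all vowels toward the middle of the vowel space, which shrinks the numerator and grows the denominator, so VAI drops below 1.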
Abstract: This paper proposes strategies in level crossing (LC) sampling and reconstruction that provide alias-free, high-fidelity signal reconstruction for speech signals without the number of samples growing exponentially with bit-depth. We introduce LC sampling methods that reduce the sampling rate to close to the Nyquist frequency even for large bit-depths. The results indicate that larger variation in the sampling intervals leads to an alias-free sampling scheme; this is achieved either by reducing the bit-depth or by adding jitter to the system at high bit-depths. In conjunction with windowing, the signal is reconstructed from the LC samples using an efficient Toeplitz reconstruction algorithm.
Abstract: This paper proposes strategies in level crossing (LC) sampling and reconstruction that provide high fidelity signal reconstruction for speech signals; these strategies circumvent the problem of exponentially increasing number of samples as the bit-depth is increased and hence are highly efficient. Specifically, the results indicate that the distribution of the intervals between samples is one of the key factors in the quality of signal reconstruction; including samples with short intervals does not improve the accuracy of the signal reconstruction, whilst samples with large intervals lead to numerical instability. The proposed sampling method, termed reduced conventional level crossing (RCLC) sampling, exploits redundancy between samples to improve the efficiency of the sampling without compromising performance. A reconstruction technique is also proposed that enhances the numerical stability through linear interpolation of samples separated by large intervals. Interpolation is demonstrated to improve the accuracy of the signal reconstruction in addition to the numerical stability. We further demonstrate that the RCLC and interpolation methods can give useful levels of signal recovery even if the average sampling rate is less than the Nyquist rate.
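The sample-count blow-up both abstracts above describe is easy to see in a minimal conventional LC sampler: a sample is recorded whenever the signal crosses one of the 2^B uniformly spaced amplitude levels, so doubling the level count roughly multiplies the samples for a smooth signal. Level placement and the crossing-time interpolation below are our own illustrative choices, not either paper's scheme.

```python
# Minimal conventional level-crossing (LC) sampler: record a (time, level)
# sample whenever the signal crosses one of 2**bit_depth uniform levels.
# Illustrative sketch only; not the papers' RCLC or jittered variants.
import math

def lc_sample(signal, bit_depth, amp=1.0):
    """Return (time, level) pairs where the signal crosses a quantization level."""
    n_levels = 2 ** bit_depth
    levels = [-amp + 2.0 * amp * k / (n_levels - 1) for k in range(n_levels)]
    samples = []
    for i in range(1, len(signal)):
        s0, s1 = signal[i - 1], signal[i]
        for lv in levels:
            if (s0 - lv) * (s1 - lv) < 0:              # sign change: a crossing
                t = (i - 1) + (lv - s0) / (s1 - s0)    # linearly interpolated time
                samples.append((t, lv))
    samples.sort()
    return samples

# A 5-cycle sine: the LC sample count grows quickly with bit-depth.
sig = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
s3 = lc_sample(sig, 3)
s6 = lc_sample(sig, 6)
```

Note the nonuniform sample times: intervals are short where the signal slews quickly and long near extrema, which is exactly the interval distribution the second abstract identifies as critical for reconstruction quality.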
Abstract: The separation of speech signals has become a research hotspot in the field of signal processing in recent years. It has many applications in teleconferencing, hearing aids, machine speech recognition, and so on. The sounds received are usually noisy, and identifying the sounds of interest and obtaining clean sounds in such an environment is a problem worth exploring: the problem of blind source separation. This paper focuses on under-determined blind source separation (UBSS), for which sparse component analysis is generally used. The method is divided into two parts: first, a clustering algorithm estimates the mixing matrix from the observed signals; then the signals are separated based on the estimated mixing matrix. This paper studies the mixing matrix estimation problem and proposes an improved algorithm to estimate the mixing matrix for speech signals in the UBSS model. The traditional potential function algorithm estimates the mixing matrix inaccurately, especially at low signal-to-noise ratio (SNR). In response to this problem, this paper develops an improved potential function method to estimate the mixing matrix. The algorithm not only avoids the influence of insufficient prior information in traditional clustering algorithms but also improves the estimation accuracy of the mixing matrix. This paper takes the mixing of four speech signals into two channels as an example. Simulation results show that the approach not only improves estimation accuracy but also applies to any mixing matrix.
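The potential-function idea referred to above can be sketched in a few lines: when the sources are sparse, observation points cluster along the directions of the mixing-matrix columns, and peaks of an angular "potential" (a kernel sum over point directions) locate those columns. The kernel width, energy threshold, and peak picking below are illustrative choices, not the paper's improved method; a three-source, two-channel toy stands in for the four-source case.

```python
# Toy potential-function estimation of a 2 x 3 mixing matrix (UBSS):
# peaks of an angular kernel-density "potential" recover the column
# directions. All parameter choices here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Three sparse, independent sources observed through two channels.
S = rng.laplace(size=(3, 5000)) * (rng.random((3, 5000)) < 0.1)
true_angles = np.array([0.3, 1.0, 2.0])                     # column directions
A = np.vstack([np.cos(true_angles), np.sin(true_angles)])   # 2 x 3 mixing matrix
X = A @ S

# Keep points with significant energy; fold directions into [0, pi).
pts = X[:, np.linalg.norm(X, axis=0) > 0.5]
theta = np.arctan2(pts[1], pts[0]) % np.pi

# Potential: each point contributes a Gaussian kernel over the angle grid.
grid = np.linspace(0.0, np.pi, 720, endpoint=False)
d = np.abs(grid[:, None] - theta[None, :])
d = np.minimum(d, np.pi - d)                                # circular distance
potential = np.exp(-(d / 0.05) ** 2).sum(axis=1)

# Dominant local maxima of the potential estimate the column directions.
peaks = [i for i in range(len(grid))
         if potential[i] > potential[i - 1]
         and potential[i] > potential[(i + 1) % len(grid)]
         and potential[i] > 0.2 * potential.max()]
est_angles = np.sort(grid[peaks])
```

This baseline works because time-frequency points where a single source dominates lie almost exactly on a column direction; the paper's contribution is making the peak structure reliable when noise smears these clusters.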
Abstract: This paper investigates MIMO (Multiple-Input
Multiple-Output) adaptive filtering techniques for the application
of supervised source separation in the context of convolutive
mixtures. From the observation that there is correlation among the
signals of the different mixtures, an improvement in the NSAF
(Normalized Subband Adaptive Filter) algorithm is proposed in
order to accelerate its convergence rate. Simulation results with
mixtures of speech signals in reverberant environments show the
superior performance of the proposed algorithm with respect to the
performances of the NLMS (Normalized Least-Mean-Square) and
conventional NSAF, considering both the convergence speed and
SIR (Signal-to-Interference Ratio) after convergence.
Abstract: We propose a system that addresses real environmental noise and channel mismatch for forensic speaker verification. The method suppresses various types of real environmental noise using an independent component analysis (ICA) algorithm. From the enhanced speech signal, mel-frequency cepstral coefficients (MFCC), with or without feature warping, are extracted to capture the essential characteristics of the speech signal. Channel effects are reduced using an intermediate vector (i-vector) representation with probabilistic linear discriminant analysis (PLDA) for classification. The proposed algorithm is evaluated on an Australian forensic voice comparison database combined with car, street, and home noises from QUT-NOISE at signal-to-noise ratios (SNR) ranging from -10 dB to 10 dB. Experimental results indicate that MFCC feature warping with ICA achieves reductions in equal error rate of about 48.22%, 44.66%, and 50.07% over MFCC feature warping alone when the test speech signals are corrupted with random sessions of street, car, and home noise, respectively, at -10 dB SNR.
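The feature-warping step used above can be sketched compactly: within a sliding window, each cepstral coefficient is replaced by the standard-normal value at its rank quantile, which flattens slowly varying channel and additive-noise effects. The window handling and the three-sample toy input are illustrative; real systems warp each MFCC dimension over roughly three-second windows.

```python
# Feature warping of one feature dimension over a window: replace each
# value by the standard-normal quantile of its rank. Window length and
# the toy input are illustrative choices.
from statistics import NormalDist

def feature_warp(window):
    """Map a window of one feature dimension to a standard-normal shape."""
    n = len(window)
    order = sorted(range(n), key=lambda i: window[i])
    warped = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Rank r out of n maps to the normal quantile at (r - 0.5) / n.
        warped[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return warped

w = feature_warp([5.0, 1.0, 3.0])
```

Only the ranks survive the mapping, which is why warping is robust: any monotonic distortion of the features (a channel gain, a compressive noise effect) leaves the warped output unchanged.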
Abstract: This paper investigates the problem of blind speech separation from a mixture of two speakers. A voice activity detector employing the Steered Response Power - Phase Transform (SRP-PHAT) is presented for detecting the activity of the speech sources, and the desired speech signals are then extracted from the mixture using an optimal beamformer. To evaluate the algorithm's effectiveness, a simulation using real speech recordings was performed in a double-talk situation where both speakers are active all the time. The evaluations show that the proposed blind speech separation algorithm offers a good level of interference suppression whilst maintaining low distortion of the desired signal.
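The correlation core of the SRP-PHAT localizer above is GCC-PHAT: the cross-spectrum of two microphone signals is whitened to unit magnitude so that its inverse transform peaks sharply at the time-difference of arrival (TDOA). The sketch below estimates an integer-sample TDOA; the test signal and delay are illustrative, and SRP-PHAT itself would sum such correlations over many steering directions and microphone pairs.

```python
# GCC-PHAT TDOA estimation between two channels; the phase transform
# divides out the magnitude so only phase (i.e., delay) information remains.
import numpy as np

def gcc_phat(x, y, eps=1e-12):
    """Estimate the delay of x relative to y, in whole samples."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    r = X * np.conj(Y)
    cc = np.fft.irfft(r / (np.abs(r) + eps), n)
    max_lag = len(x) - 1
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))   # lags -max..+max
    return int(np.argmax(cc)) - max_lag

rng = np.random.default_rng(5)
s = rng.standard_normal(4096)
delay = 7
x = np.concatenate((np.zeros(delay), s[:-delay]))   # x lags s by 7 samples
```

The whitening is what makes PHAT robust in reverberation: every frequency bin votes with equal weight, so strong spectral peaks of the source cannot dominate the correlation.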
Abstract: Speaker recognition is performed in high Additive White Gaussian Noise (AWGN) environments using principles of Computational Auditory Scene Analysis (CASA). CASA methods often classify sounds from images in the time-frequency (T-F) plane, using spectrograms or cochleagrams as the image. In this paper, atomic decomposition implemented by matching pursuit performs a transform from time-series speech signals to the T-F plane. The atomic decomposition creates a sparsely populated T-F vector in "weight space", where each populated T-F position contains an amplitude weight. The weight-space vector, along with the atomic dictionary, represents a denoised, compressed version of the original signal. The arrangement of the atomic indices in the T-F vector is used for classification. Unsupervised feature learning implemented by a sparse autoencoder learns a single dictionary of basis features from a collection of envelope samples from all speakers. The approach is demonstrated using pairs of speakers from the TIMIT data set. Pairs of speakers are selected randomly from a single dialect region. Each speaker has 10 sentences: two are used for training and eight for testing. Atomic index probabilities are created for each training sentence and for each test sentence, and classification is performed by finding the lowest Euclidean distance between the probabilities of the training sentences and those of the test sentences. Training is done at a 30 dB Signal-to-Noise Ratio (SNR), and testing is performed at SNRs of 0 dB, 5 dB, 10 dB, and 30 dB. The algorithm has a baseline classification accuracy of ~93% averaged over 10 pairs of speakers from the TIMIT data set. The baseline accuracy is attributable to the short sequences of training and test data as well as the overall simplicity of the classification algorithm. The accuracy is not affected by AWGN, remaining at ~93% at 0 dB SNR.
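Matching pursuit, the decomposition engine above, greedily picks the dictionary atom best correlated with the current residual, records its index and weight (the sparse "weight space"), and subtracts it. The sketch below uses a small orthonormal DCT-II dictionary as a stand-in for the learned time-frequency dictionary; with an orthonormal dictionary a two-atom signal is recovered exactly in two iterations.

```python
# Matching-pursuit sketch: greedy sparse decomposition over a dictionary.
# The DCT-II atoms are an illustrative stand-in for the learned dictionary.
import math

def matching_pursuit(signal, dictionary, n_atoms):
    residual = list(signal)
    picks = []                                     # (atom index, weight) pairs
    for _ in range(n_atoms):
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        k = max(range(len(corr)), key=lambda i: abs(corr[i]))
        picks.append((k, corr[k]))
        residual = [r - corr[k] * a for r, a in zip(residual, dictionary[k])]
    return picks, residual

# Orthonormal DCT-II atoms of length 32.
N = 32
dictionary = []
for k in range(N):
    c = math.sqrt((1.0 if k == 0 else 2.0) / N)
    dictionary.append([c * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N)])

# A signal built from two atoms is recovered exactly in two iterations.
signal = [2.0 * dictionary[3][n] - 1.0 * dictionary[7][n] for n in range(N)]
picks, residual = matching_pursuit(signal, dictionary, 2)
```

The classifier above then discards the weights' fine values and works with the distribution of the selected indices, which is what makes it insensitive to broadband noise: AWGN spreads energy thinly across atoms and rarely changes which atoms win.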
Abstract: This paper describes a method for AWGN (Additive White Gaussian Noise) variance estimation in noisy stochastic signals, referred to as Multiplicative-Noising Variance Estimation (MNVE). The aim was to develop an estimation algorithm with a minimal number of assumptions about the structure of the original signal. A MATLAB simulation and analysis of the method applied to speech signals showed higher accuracy than the standard AR (autoregressive) modeling noise estimation technique. In addition, strong performance was observed at very low signal-to-noise ratios, which in general represent the worst-case scenario for signal denoising methods. High execution time appears to be the only disadvantage of MNVE. After close examination of all the observed features of the proposed algorithm, it was concluded that the method is worth exploring further and that, with some adjustments and improvements, it can become notably powerful.
Abstract: Speech enhancement is a long-standing problem with numerous applications such as teleconferencing, VoIP, hearing aids, and speech recognition. The motivation behind this work is to obtain a clean speech signal of higher quality by applying an optimal noise cancellation technique. Real-time adaptive filtering algorithms are strong candidates among the categories of speech enhancement methods. In this paper, we propose a speech enhancement method based on a Recursive Least Squares (RLS) adaptive filter for speech signals. Experiments were performed on noisy data prepared by adding AWGN, babble, and pink noise to clean speech samples at -5 dB, 0 dB, 5 dB, and 10 dB SNR levels. We then compare the noise cancellation performance of the proposed RLS algorithm with the existing NLMS algorithm in terms of Mean Squared Error (MSE), Signal-to-Noise Ratio (SNR), and SNR loss. Based on the performance evaluation, the proposed RLS algorithm was found to be the better noise cancellation technique for speech signals.
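The RLS recursion above can be sketched in the classic noise-cancellation configuration: a reference noise signal is adaptively filtered to match the noise component of the corrupted observation, and the error output approximates the clean signal. Filter length, forgetting factor, and the sinusoid standing in for clean speech are all illustrative choices, not the paper's setup.

```python
# RLS adaptive noise canceller sketch: filter the noise reference so that
# the error e = d - w*ref approximates the clean signal. Parameters are
# illustrative.
import numpy as np

def rls_cancel(ref, d, taps=2, lam=0.999, delta=100.0):
    """Return the error signal (enhanced output) and the final weights."""
    w = np.zeros(taps)
    P = np.eye(taps) * delta                     # inverse correlation estimate
    e = np.zeros(len(ref))
    for n in range(taps - 1, len(ref)):
        u = ref[n - taps + 1:n + 1][::-1]        # most recent reference samples
        k = P @ u / (lam + u @ P @ u)            # gain vector
        e[n] = d[n] - w @ u                      # a priori error
        w = w + k * e[n]
        P = (P - np.outer(k, u @ P)) / lam
    return e, w

rng = np.random.default_rng(2)
t = np.arange(4000)
clean = np.sin(2 * np.pi * 0.01 * t)             # stand-in for clean speech
noise = rng.standard_normal(4000)
g = np.array([0.9, 0.4])                         # unknown noise path
d = clean + np.convolve(noise, g)[:4000]         # noisy observation
e, w = rls_cancel(noise, d)
```

RLS buys its fast convergence by maintaining the inverse input-correlation matrix P, an O(taps^2) cost per sample versus O(taps) for NLMS, which is the usual trade-off behind comparisons like the one above.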
Abstract: In this paper a novel method for the detection of
clipping in speech signals is described. It is shown that the new
method has better performance than known clipping detection
methods, is easy to implement, and is robust to changes in signal
amplitude, size of data, etc. Statistical simulation results are
presented.
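The abstract does not detail the proposed detector, so the sketch below shows a common baseline heuristic for context: hard clipping piles sample values into the outermost bins of the amplitude histogram, so an abnormal mass in those bins flags a clipped signal. The bin count, the threshold, and the Gaussian stand-in for speech are illustrative choices.

```python
# Histogram-edge heuristic for clipping detection (a baseline, not the
# paper's method): clipped signals concentrate probability mass in the
# two outermost amplitude bins.
import random

def edge_mass(signal, n_bins=50):
    """Fraction of samples falling in the two outermost amplitude bins."""
    lo, hi = min(signal), max(signal)
    width = (hi - lo) / n_bins
    edge = sum(1 for s in signal if s <= lo + width or s >= hi - width)
    return edge / len(signal)

random.seed(0)
sig = [random.gauss(0.0, 1.0) for _ in range(20000)]         # unclipped
clipped = [max(-1.0, min(1.0, s)) for s in sig]              # hard-clipped
```

Because the statistic is a fraction of the signal's own amplitude range, it is inherently insensitive to overall gain and to the amount of data, two of the robustness properties the abstract claims for its detector.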
Abstract: In this paper, we use Radial Basis Function Networks (RBFN) to solve the problem of environmental interference cancellation for speech signals. We show that the Second-Order Thin-Plate Spline (SOTPS) kernel cancels the interference effectively. For comparison, we also test two of the most commonly used RBFN kernels: the Gaussian and the First-Order TPS (FOTPS) basis functions. The speech signals used here were taken from the OGI Multi-Language Telephone Speech Corpus and were corrupted with six types of environmental noise from the NOISEX-92 database. Experimental results show that the SOTPS kernel considerably outperforms the Gaussian and FOTPS functions on the speech interference cancellation problem.
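The three kernels compared above can be sketched for a scalar distance r. We read FOTPS as r^2 log r and SOTPS as r^4 log r (the usual first- and second-order thin-plate splines); this is our interpretation of the abbreviations, not the paper's stated forms, and the Gaussian width is an illustrative choice.

```python
# The three RBFN kernels under comparison, plus a single-output RBFN
# forward pass. Kernel forms for FOTPS/SOTPS are our assumption.
import math

def gaussian(r, sigma=0.5):
    return math.exp(-(r / sigma) ** 2)

def fotps(r):
    """First-order thin-plate spline, assumed r^2 log r."""
    return 0.0 if r == 0.0 else r ** 2 * math.log(r)

def sotps(r):
    """Second-order thin-plate spline, assumed r^4 log r."""
    return 0.0 if r == 0.0 else r ** 4 * math.log(r)

def rbfn_output(x, centers, weights, kernel):
    """Forward pass of a single-output RBF network."""
    return sum(w * kernel(abs(x - c)) for w, c in zip(weights, centers))
```

The higher-order spline is flatter near its center and grows faster far from it, which changes how smoothly the network interpolates between centers; that difference in smoothness is a plausible reason for the performance gap the abstract reports.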
Abstract: Noise cancellation is an essential component of many DSP applications. Changes in real-time signals are quite rapid. In noise cancellation, a reference signal that approximates the noise signal (which corrupts the original information signal) is obtained and then subtracted from the noise-bearing signal to obtain a noise-free signal. This approximation of the noise signal is obtained through adaptive filters, which are self-adjusting. As the changes in real-time signals are abrupt, an adaptive algorithm that converges fast and is stable is needed. Least mean square (LMS) and normalized LMS (NLMS) are two widely used algorithms because of their simplicity in computation and implementation, but their convergence rates are slow. Adaptive averaging filters (AFA) are also used because they converge quickly, but they are less stable. This paper provides a comparative study of the LMS, NLMS, AFA, and new enhanced average adaptive (Average NLMS-ANLMS) filters for noise cancellation applications using speech signals.
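The LMS and NLMS baselines above differ by a single line: NLMS divides the step size by the instantaneous input power, which makes its convergence insensitive to the input level. The sketch below adapts both toward a short unknown path in a supervised setup; step sizes and the path are illustrative, and the AFA/ANLMS variants are not reproduced here.

```python
# LMS vs. NLMS in one routine: `normalized` toggles the power-normalized
# step. Parameters and the unknown path are illustrative.
import numpy as np

def adapt(x, d, taps, mu, normalized, eps=1e-8):
    w = np.zeros(taps)
    for n in range(taps - 1, len(x)):
        u = x[n - taps + 1:n + 1][::-1]           # most recent samples first
        e = d[n] - w @ u                          # a priori error
        step = mu / (u @ u + eps) if normalized else mu
        w = w + step * e * u
    return w

rng = np.random.default_rng(3)
x = rng.standard_normal(20000)
h = np.array([0.5, -0.2])                         # unknown path
d = np.convolve(x, h)[:len(x)]
w_lms = adapt(x, d, 2, mu=0.01, normalized=False)
w_nlms = adapt(x, d, 2, mu=0.5, normalized=True)
```

For speech, whose power varies wildly between voiced and unvoiced segments, the normalization is what lets NLMS use an aggressive step without diverging on loud frames.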
Abstract: Different pseudo-random or pseudo-noise (PN) and orthogonal sequences that can serve as spreading codes for code division multiple access (CDMA) cellular networks, or can be used for encrypting speech signals to reduce residual intelligibility, are investigated. We briefly review the theoretical background of direct-sequence CDMA systems and describe the main characteristics of the maximal-length, Gold, Barker, and Kasami sequences. We also discuss variable- and fixed-length orthogonal codes such as Walsh-Hadamard codes. The equivalence of PN and orthogonal codes is also derived. Finally, a new PN sequence is proposed which is shown to have certain better properties than the existing codes.
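The maximal-length sequences surveyed above come from a linear feedback shift register (LFSR); taps [3, 2] correspond to the primitive polynomial x^3 + x^2 + 1 and give the full period 2^3 - 1 = 7. Mapped to {+1, -1} chips, an m-sequence has the near-ideal periodic autocorrelation that makes these codes useful for spreading: N at zero lag and -1 at every other lag. The register size and seed below are illustrative.

```python
# Fibonacci LFSR m-sequence generator; taps are 1-indexed bit positions.
def lfsr_msequence(taps, nbits, seed=1):
    """Generate one period (2**nbits - 1 chips) of an m-sequence in {+1, -1}."""
    state = seed
    out = []
    for _ in range(2 ** nbits - 1):
        out.append(1 if (state & 1) else -1)
        fb = 0
        for t in taps:
            fb ^= (state >> (nbits - t)) & 1      # XOR of tapped bits
        state = (state >> 1) | (fb << (nbits - 1))
    return out

seq = lfsr_msequence([3, 2], 3)
```

The sequence also exhibits the balance property: exactly 2^(n-1) ones against 2^(n-1) - 1 zeros, so the chip sum over one period is +1.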
Abstract: The acoustic and articulatory properties of fricative speech sounds are studied using magnetic resonance imaging (MRI) and acoustic recordings from a single subject. Area functions were derived from a complete set of axial and coronal MR slices using two different methods: the Mermelstein technique and the Blum transform. The area functions derived from the two techniques were shown to differ significantly in some cases. Such differences lead to different acoustic predictions, and it is important to know which is the more accurate. The vocal tract acoustic transfer function (VTTF) was derived from these area functions for each fricative and compared with measured speech signals for the same fricative and the same subject. The VTTFs for /f/ in two vowel contexts and the corresponding acoustic spectra are derived here; the Blum transform appears to show a better match between prediction and measurement than the Mermelstein technique.
Abstract: This paper studies the effect of different compression constraints and schemes, presented in a new and flexible paradigm, for achieving high compression ratios and acceptable signal-to-noise ratios for Arabic speech signals. Compression parameters are computed for variable frame sizes of a level 5 to 7 Discrete Wavelet Transform (DWT) representation of the signals, for different analyzing mother wavelet functions. Results are obtained and compared for global-threshold and level-dependent-threshold techniques. The results also include comparisons of signal-to-noise ratio, peak signal-to-noise ratio, and normalized root-mean-square error.
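The threshold-based DWT compression scheme above can be illustrated with the simplest possible case: a single-level Haar transform with a global threshold. Detail coefficients below the threshold are zeroed (the compression), and the signal is reconstructed from what survives. The paper uses 5-7 decomposition levels and various mother wavelets; Haar and one level keep the sketch short, and the test signal and threshold are illustrative.

```python
# Single-level Haar DWT with global thresholding of detail coefficients.
import math

def haar_forward(x):
    s = 1.0 / math.sqrt(2.0)
    approx = [(a + b) * s for a, b in zip(x[0::2], x[1::2])]
    detail = [(a - b) * s for a, b in zip(x[0::2], x[1::2])]
    return approx, detail

def haar_inverse(approx, detail):
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) * s, (a - d) * s])
    return x

def compress(x, threshold):
    """Zero small detail coefficients; return (reconstruction, kept count)."""
    approx, detail = haar_forward(x)
    detail = [d if abs(d) > threshold else 0.0 for d in detail]
    kept = sum(1 for d in detail if d != 0.0)
    return haar_inverse(approx, detail), kept

x = [math.sin(2 * math.pi * t / 64) for t in range(128)]
perfect, _ = compress(x, 0.0)
approx_rec, kept = compress(x, 0.1)
```

For this smooth signal every detail coefficient falls below 0.1, so half the coefficients can be discarded with a reconstruction error bounded by half the largest sample-to-sample difference; a level-dependent threshold would apply a different cutoff at each decomposition level instead of one global value.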
Abstract: Although the level crossing concept has been the subject of intensive investigation over the last few years, certain problems of great interest remain unsolved. One of these concerns the distribution of threshold levels. This paper presents new threshold level allocation schemes for level-crossing-based nonuniform sampling. Intuitively, it is more reasonable if the information-rich regions of the signal are sampled finer and those with sparse information are sampled coarser. To achieve this objective, we propose non-linear quantization functions which dynamically assign the number of quantization levels depending on the importance of the given amplitude range. Two new approaches for determining the importance of a given amplitude segment are presented, based on exponential and logarithmic functions. Various aspects of the proposed techniques are discussed and experimentally validated, and their efficacy is investigated by comparison with uniform sampling.
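A logarithmic allocation in the spirit described above can be sketched by mapping a uniform grid through an inverse mu-law style expander: levels land densely near zero, where speech amplitudes concentrate, and sparsely near full scale. The mu value and level count are illustrative choices, not the paper's functions.

```python
# Nonuniform threshold-level allocation: uniform grid in [-1, 1] mapped
# through an inverse mu-law curve. mu and n_levels are illustrative.
import math

def log_levels(n_levels, mu=255.0, amp=1.0):
    """Nonuniformly spaced threshold levels in [-amp, amp], dense near 0."""
    levels = []
    for k in range(n_levels):
        u = -1.0 + 2.0 * k / (n_levels - 1)                  # uniform grid
        x = math.copysign((1.0 + mu) ** abs(u) - 1.0, u) / mu  # inverse mu-law
        levels.append(amp * x)
    return levels

levels = log_levels(16)
```

With these levels, a low-amplitude stretch of signal crosses many thresholds (fine sampling) while an equally long high-amplitude swing crosses few, which is precisely the importance weighting the abstract argues for.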
Abstract: In this paper, a new adaptive Fourier decomposition (AFD) based time-frequency speech analysis approach is proposed. Given that the fundamental frequency of speech signals often fluctuates, classical short-time Fourier transform (STFT) based spectrogram analysis suffers from the difficulty of window size selection. AFD is a newly developed signal decomposition theory designed to deal with time-varying, non-stationary signals. Its outstanding characteristic is that it provides the instantaneous frequency of each decomposed component, so time-frequency analysis becomes easier. Experiments are conducted on a sample sentence from the TIMIT Acoustic-Phonetic Continuous Speech Corpus. The results show that the AFD-based time-frequency distribution outperforms the STFT-based one.
Abstract: An algorithm for learning an overcomplete dictionary using a Cauchy mixture model for sparse decomposition of an under-determined mixing system is introduced. The mixture density function is derived from ratio samples of the observed mixture signals, where 1) there are at least two, but not necessarily more, observed mixture signals, 2) the source signals are statistically independent, and 3) the sources are sparse. The basis vectors of the dictionary are learned by optimizing the location parameters of the Cauchy mixture components, which is shown to be more accurate and robust than the conventional data mining methods usually employed for this task. Using a well-known sparse decomposition algorithm, we extract three speech signals from two mixtures based on the estimated dictionary. Further tests with additive Gaussian noise demonstrate the proposed algorithm's robustness to outliers.
Abstract: This paper describes an Independent Component Analysis (ICA) based fixed-point algorithm for the blind separation of convolutive mixtures of speech picked up by a linear microphone array. The proposed algorithm extracts independent sources by non-Gaussianizing the Time-Frequency Series of Speech (TFSS) in a deflationary manner, with the degree of non-Gaussianization measured by negentropy. The relative performance of the algorithm under random initialization and under Null-beamformer (NBF) based initialization is studied. It has been found that NBF-based initialization gives faster convergence as well as better separation performance.
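The fixed-point iteration itself is compact. Below is a one-unit sketch for an instantaneous (non-convolutive) two-source mixture, with negentropy approximated through the tanh (log-cosh) nonlinearity; in the deflationary scheme the abstract describes, this update would be repeated per frequency bin with orthogonalization against previously extracted components. The mixing matrix, sources, and iteration count are illustrative.

```python
# One-unit fixed-point ICA (FastICA-style) on a whitened 2-channel mixture.
import numpy as np

rng = np.random.default_rng(4)
S = rng.uniform(-1.0, 1.0, size=(2, 20000))      # independent non-Gaussian sources
A = np.array([[0.8, 0.6], [0.3, 0.9]])           # instantaneous mixing
X = A @ S
X = X - X.mean(axis=1, keepdims=True)

# Whitening: decorrelate and scale to unit variance.
cov = X @ X.T / X.shape[1]
vals, vecs = np.linalg.eigh(cov)
Z = (vecs * vals ** -0.5).T @ X

# Fixed-point update: w <- E[Z g(w'Z)] - E[g'(w'Z)] w, then renormalize.
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(200):
    wx = w @ Z
    g = np.tanh(wx)                              # negentropy contrast derivative
    w_new = (Z * g).mean(axis=1) - (1.0 - g ** 2).mean() * w
    w = w_new / np.linalg.norm(w_new)

y = w @ Z                                        # one extracted component
```

The NBF initialization studied in the paper replaces the random starting vector `w` with one derived from a null beamformer aimed at an interferer, which is why it reaches a fixed point in fewer iterations.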