Abstract: The paper presents a complete discrete statistical framework based on a novel vector quantization (VQ) front-end process. This new VQ approach performs an optimal distribution of VQ codebook components over the HMM states. This technique, which we call the distributed vector quantization (DVQ) of hidden Markov models, succeeds in unifying the acoustic micro-structure and the phonetic macro-structure when the HMM parameters are estimated. The DVQ technique is implemented in two variants. The first variant uses the K-means algorithm (K-means-DVQ) to optimize the VQ, while the second exploits the classification behavior of neural networks (NN-DVQ) for the same purpose. The proposed variants are compared with an HMM-based baseline system in experiments on the recognition of specific Arabic consonants. The results show that the distributed vector quantization technique increases the performance of the discrete HMM system.
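As a hedged illustration of the K-means VQ front-end that the K-means-DVQ variant builds on, the sketch below clusters acoustic feature frames into a codebook and emits the discrete symbols a discrete HMM consumes; the codebook size and the random stand-in for MFCC frames are assumptions, and the paper's DVQ state-distribution step is not reproduced.

```python
# Minimal sketch of a K-means VQ front-end for a discrete HMM
# (illustrative only; the DVQ distribution over HMM states is not shown).
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(features, n_codewords=64):
    """Cluster acoustic feature vectors (n_frames x n_dims) into a VQ codebook."""
    return KMeans(n_clusters=n_codewords, n_init=10, random_state=0).fit(features)

def quantize(codebook, features):
    """Map each frame to the index of its nearest codeword (discrete HMM symbols)."""
    return codebook.predict(features)

frames = np.random.randn(1000, 13)       # stand-in for MFCC frames
cb = train_codebook(frames)
symbols = quantize(cb, frames)           # one discrete observation per frame
```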
Abstract: In this work, we develop a speech denoising tool based on the discrete wavelet packet transform (DWPT), intended for recognition, coding and synthesis applications. For noise reduction, instead of applying the classical thresholding technique, some wavelet packet nodes are set to zero and the others are thresholded. To estimate the non-stationary noise level, we employ the spectral entropy. To evaluate our approach, we compare it with classical denoising methods based on thresholding and on spectral subtraction. The experimental implementation uses speech signals corrupted by two sorts of noise, white and Volvo noise. Listening tests show that our proposed technique outperforms spectral subtraction, and SNR computations show its superiority over the classical thresholding method using the modified hard thresholding function based on the μ-law algorithm.
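A hedged sketch of the node-wise DWPT scheme follows: low-energy packet nodes are zeroed and the remaining nodes are thresholded, with a spectral-entropy term scaling the threshold. The zeroing rule, the universal-threshold form and the entropy scaling are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
import pywt

def dwpt_denoise(x, wavelet='db8', level=4, keep_ratio=0.1):
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet, mode='symmetric', maxlevel=level)
    nodes = wp.get_level(level, order='freq')
    energies = np.array([np.sum(n.data ** 2) for n in nodes])
    p = energies / energies.sum()                       # normalized node energies
    entropy = -np.sum(p * np.log2(p + 1e-12))           # spectral entropy (noise cue)
    sigma = np.median(np.abs(nodes[-1].data)) / 0.6745  # rough noise level estimate
    thr = sigma * np.sqrt(2 * np.log(len(x))) * (entropy / np.log2(len(nodes)))
    cutoff = keep_ratio * energies.max()
    for n, e in zip(nodes, energies):
        if e < cutoff:
            wp[n.path].data = np.zeros_like(n.data)     # zero the weak nodes
        else:
            wp[n.path].data = pywt.threshold(n.data, thr, mode='soft')
    return wp.reconstruct(update=False)

y = dwpt_denoise(np.random.randn(4096))
```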
Abstract: Autism spectrum disorder is characterized by abnormalities in social communication, language abilities and repetitive behaviors. The present study focused on certain grammatical deficits in autistic children: we evaluated the impairment of the correct use of different Persian verb tenses in autistic children's speech. Two standardized language tests were administered, and the gathered data were analyzed. The main result of this study was a significant difference between the mean scores of correct responses to the present tense and to the past tense in Persian. This study demonstrated that tense is severely impaired in autistic children's speech. Our findings indicated that autistic children's production of the simple present/past tense opposition is better than their production of future and past periphrastic forms (past perfect, present perfect, past progressive).
Abstract: Background noise is particularly damaging to speech intelligibility for people with hearing loss, especially for patients with sensorineural loss. Several investigations on speech intelligibility have demonstrated that sensorineural loss patients need a 5-15 dB higher SNR than normal-hearing subjects. This paper describes a Discrete Cosine Transform Power Normalized Least Mean Square (DCT-LMS) algorithm to improve the SNR and the convergence rate of the LMS for sensorineural loss patients. Since it requires only real arithmetic, it achieves a faster convergence rate than the time-domain LMS, and the transformation improves the eigenvalue distribution of the input autocorrelation matrix of the LMS filter. The DCT has good orthonormality, separability, and energy compaction properties. Although the DCT does not separate frequencies, it is a powerful signal decorrelator; it is a real-valued function and thus can be used effectively in real-time operation. The advantages of DCT-LMS over the standard LMS algorithm are shown via SNR and eigenvalue ratio computations. Exploiting the symmetry of the basis functions, the DCT transform matrix [AN] can be factored into a series of ±1 butterflies and rotation angles. This factorization results in one of the fastest DCT implementations. There are different ways to obtain such factorizations; this work uses the fast factored DCT algorithm developed by Chen et al. The computer simulation results show the superior convergence characteristics of the proposed algorithm: the SNR improves by at least 10 dB for input SNRs at or below 0 dB, with a faster convergence speed and better time and frequency characteristics.
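A minimal sketch of the transform-domain power-normalized LMS idea follows: each input window is decorrelated by a DCT and each coefficient's step size is normalized by a running power estimate. The filter length, step size, smoothing factor and the clean-reference setup in the demo are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct

def dct_lms(x, d, n_taps=32, mu=0.05, beta=0.9, eps=1e-8):
    w = np.zeros(n_taps)            # adaptive weights in the DCT domain
    power = np.full(n_taps, eps)    # per-bin power estimates
    y = np.zeros(len(x))
    for n in range(n_taps, len(x)):
        u = dct(x[n - n_taps:n], type=2, norm='ortho')  # decorrelated input window
        power = beta * power + (1 - beta) * u ** 2      # track per-bin powers
        y[n] = w @ u
        e = d[n] - y[n]
        w += mu * e * u / (power + eps)                 # power-normalized update
    return y

# Demo: adapt toward a clean reference d from a noisy input x.
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.5 * np.random.randn(len(t))
out = dct_lms(noisy, clean)
```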
Abstract: Pattern recognition is the research area of Artificial Intelligence that studies the operation and design of systems that recognize patterns in data. Important application areas are image analysis, character recognition, fingerprint classification, speech analysis, DNA sequence identification, man-machine diagnostics, person identification and industrial inspection. The interest in improving classification systems for data analysis is independent of the application context: in many studies one has to recognize and distinguish groups of various objects, which requires valid instruments capable of performing this task. The objective of this article is to present several Artificial Intelligence methodologies for data classification applied to biomedical patterns. In particular, this work deals with the realization of a Computer-Aided Detection (CADe) system that is able to assist the radiologist in identifying types of mammary tumor lesions. As an additional biomedical application of the classification systems, we present a study conducted on blood samples which shows how these methods may help distinguish between carriers of Thalassemia (Mediterranean Anaemia) and healthy subjects.
Abstract: As the enormous amount of on-line text on the World-Wide Web grows, the development of methods for automatically summarizing this text becomes more important. The primary goal of this research is to create an efficient tool that is able to summarize large documents automatically. We propose an evolving connectionist system: an adaptive, incremental learning and knowledge representation system that evolves its structure and functionality. In this paper, we propose a novel approach to part-of-speech disambiguation using a recurrent neural network, a paradigm capable of dealing with sequential data. We observed that the connectionist approach to text summarization has a natural way of learning grammatical structures through experience. Experimental results show that our approach achieves acceptable performance.
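As a hedged sketch of the recurrent tagger the abstract describes for part-of-speech disambiguation, the model below assigns one tag score vector per token; the vocabulary size, tagset size and layer dimensions are illustrative assumptions, and the paper's evolving connectionist specifics are not reproduced.

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    def __init__(self, vocab_size=5000, n_tags=45, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.RNN(emb, hidden, batch_first=True)  # sequential context
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, tokens):                 # tokens: (batch, seq_len) word ids
        h, _ = self.rnn(self.emb(tokens))      # one hidden state per position
        return self.out(h)                     # per-token tag scores

tagger = RNNTagger()
sentence = torch.randint(0, 5000, (1, 12))     # a 12-token dummy sentence
tag_scores = tagger(sentence)                  # shape (1, 12, 45)
```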
Abstract: One major source of performance decline in speaker recognition systems is the channel mismatch between training and testing. This paper focuses on improving the channel robustness of a speaker recognition system in two respects: channel compensation techniques and channel-robust features. The system is a text-independent speaker identification system based on two-stage recognition. For channel compensation, this paper applies the maximum a posteriori (MAP) channel compensation technique, previously used in speech recognition, to speaker recognition. For channel-robust features, this paper introduces pitch-dependent features and a pitch-dependent speaker model for the second recognition stage. Based on the first-stage recognition of the test speech using a Gaussian Mixture Model (GMM), the system uses the GMM scores to decide whether a second recognition pass is needed. If it is, the system selects a few speakers from all of the speakers who participated in the first stage for the second-stage recognition. For each selected speaker, the system obtains three pitch-dependent results from the speaker's pitch-dependent model, and then uses an artificial neural network (ANN) to combine the three pitch-dependent results and one GMM score into a fused result. The system makes the second-stage recognition decision based on these fused results. The experiments show that, for the closed-set test, the correct rate of the two-stage recognition system based on the MAP channel compensation technique and pitch-dependent features is 41.7% better than that of the baseline system.
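A hedged sketch of the score-fusion step follows: for each candidate speaker, three pitch-dependent scores and one GMM score are fed to a small neural network that outputs a fused score. The network shape and the dummy training data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Training rows: [pitch_score1, pitch_score2, pitch_score3, gmm_score],
# labels 1 = correct speaker, 0 = impostor (dummy data for the sketch).
X_train = np.random.randn(200, 4)
y_train = np.random.randint(0, 2, 200)
fuser = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000).fit(X_train, y_train)

def fused_score(pitch_scores, gmm_score):
    x = np.append(pitch_scores, gmm_score).reshape(1, -1)
    return fuser.predict_proba(x)[0, 1]   # probability of a correct match

print(fused_score([0.1, -0.3, 0.7], 1.2))
```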
Abstract: Currently the electronic slide (e-slide) is one of the most common formats in educational presentation. Unfortunately, e-slides are of little use to the visually impaired, since they are unable to see the content of such slides, which is usually composed of text, images and animation. This paper proposes a model for presenting e-slides in a multimodal manner, i.e. using conventional slides concurrently with voicing, in both Malay and English. At the design level, the live multimedia presentation concept is used, while at the implementation level several components are combined: the text content of each slide is extracted using a COM component, the English text is voiced using the Microsoft Speech API, and the Malay text is voiced using a dictionary approach. To support accessibility, an auditory user interface is provided as an additional feature. A prototype of this model, named VSlide, has been developed and introduced.
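A hedged sketch of the extraction-and-voicing pipeline on Windows follows, assuming pywin32, Microsoft Office and SAPI are available; the file path is a placeholder, and the Malay dictionary lookup is not shown.

```python
import win32com.client

ppt = win32com.client.Dispatch("PowerPoint.Application")
pres = ppt.Presentations.Open("slides.pptx", WithWindow=False)
voice = win32com.client.Dispatch("SAPI.SpVoice")

for slide in pres.Slides:
    for shape in slide.Shapes:
        if shape.HasTextFrame and shape.TextFrame.HasText:
            text = shape.TextFrame.TextRange.Text  # extract slide text via COM
            voice.Speak(text)                      # voice it with the Speech API
pres.Close()
```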
Abstract: Emotion in speech is an issue that has been attracting the interest of the speech community for many years, both in the context of speech synthesis and in automatic speech recognition (ASR). In spite of the remarkable recent progress in large vocabulary recognition (LVR), it is still far behind the ultimate goal of recognising free conversational speech uttered by any speaker in any environment. Current experimental tests show that the error rate of state-of-the-art large vocabulary recognition systems increases substantially when they are applied to spontaneous/emotional speech. This paper shows that the recognition rate for emotionally coloured speech can be improved by using a language model based on an increased representation of emotional utterances.
Abstract: This paper presents a recognition system for isolated words such as robot commands, implemented with time delay neural networks (TDNNs). The aim is to teleoperate a robot for specific tasks such as turn, close, etc., in an industrial environment, taking into account the noise coming from the machinery. The choice of the TDNN is based on its generalization ability in terms of accuracy; moreover, it acts as a filter that passes certain desirable frequency characteristics of speech. The goal is to determine the parameters of this filter so as to make the system adaptable to the variability of the speech signal and, especially, to noise; to this end, the back-propagation technique was used in the learning phase. The approach was applied to commands pronounced separately in two languages, French and Arabic. The results on two test sets of 300 spoken words each are 87% and 97.6% in a neutral environment, and 77.67% and 92.67% when white Gaussian noise was added at an SNR of 35 dB.
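A minimal sketch of a TDNN for isolated-word classification follows: the time-delay layers are 1-D convolutions over feature frames, trained with back-propagation as the abstract states. The feature dimension, context widths and number of commands are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TDNN(nn.Module):
    def __init__(self, n_features=13, n_words=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=5), nn.ReLU(),  # 5-frame delay window
            nn.Conv1d(32, 32, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                              # pool over time
        )
        self.out = nn.Linear(32, n_words)

    def forward(self, x):                    # x: (batch, n_features, n_frames)
        return self.out(self.net(x).squeeze(-1))

model = TDNN()
scores = model(torch.randn(1, 13, 80))       # one 80-frame utterance
```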
Abstract: In this study, the use of the silicon NAM (Non-Audible Murmur) microphone in automatic speech recognition is presented. NAM microphones are special acoustic sensors which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (non-audible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise, and they might be used in special systems (speech recognition, speech conversion, etc.) for sound-impaired people. Using a small amount of training data and adaptation approaches, 93.9% word accuracy was achieved on a 20k-vocabulary Japanese dictation task. Non-audible murmur recognition in noisy environments is also investigated. In this study, further analysis of NAM speech has been carried out using distance measures between hidden Markov model (HMM) pairs. Using a metric distance, it has been shown that the spectral space of NAM speech is reduced; however, the locations of the different NAM phonemes are similar to the locations of the phonemes of normal speech, and the NAM sounds are well discriminated. Promising results using nonlinear features are also presented, especially under noisy conditions.
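As a hedged illustration of one common distance measure between HMM pairs (a Monte Carlo approximation of a KL-type divergence in the style of Juang and Rabiner), the sketch below compares two toy Gaussian HMMs standing in for phoneme models; the abstract does not specify which distance was actually used.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def hmm_distance(hmm_a, hmm_b, n_samples=2000):
    X, _ = hmm_a.sample(n_samples)              # observations drawn from model A
    # Average per-sample log-likelihood gap: D(A || B).
    return (hmm_a.score(X) - hmm_b.score(X)) / n_samples

rng = np.random.default_rng(0)
a = GaussianHMM(n_components=3).fit(rng.normal(0, 1, (500, 2)))
b = GaussianHMM(n_components=3).fit(rng.normal(2, 1, (500, 2)))
d_sym = 0.5 * (hmm_distance(a, b) + hmm_distance(b, a))  # symmetrized distance
```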
Abstract: The paper presents the design concept of a unit-selection text-to-speech synthesis system for the Slovenian language. Due to its modular and upgradable architecture, the system can be used in a variety of speech user interface applications, ranging from carrier-grade server voice portal applications and desktop user interfaces to specialized embedded devices.
Since memory and processing power requirements are important factors for a possible implementation in embedded devices, the lexica and speech corpora need to be reduced. We describe a simple and efficient implementation of a greedy subset selection algorithm that extracts a compact subset of high-coverage text sentences; a sketch of the idea follows. An experiment on a reference text corpus showed that the subset selection algorithm produced a compact sentence subset with little redundancy.
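A minimal sketch of greedy subset selection of this kind: repeatedly pick the sentence that adds the most not-yet-covered units. Here character diphones stand in for the paper's actual unit inventory, and the coverage target is an assumption.

```python
def diphones(sentence):
    s = sentence.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_select(sentences, target_coverage=0.95):
    all_units = set().union(*(diphones(s) for s in sentences))
    covered, subset = set(), []
    while len(covered) < target_coverage * len(all_units):
        best = max(sentences, key=lambda s: len(diphones(s) - covered))
        gain = diphones(best) - covered
        if not gain:
            break                      # no remaining sentence adds new units
        subset.append(best)
        covered |= gain
    return subset

corpus = ["the cat sat", "a dog barked", "the dog sat down"]
print(greedy_select(corpus))
```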
The adequacy of the spoken output was evaluated by several subjective tests, as recommended by the International Telecommunication Union (ITU).
Abstract: In this paper, we propose a method for altering duration in the frequency domain that controls prosody in real time after pitch alteration. A method for freely altering duration, one of the components of prosodic information, could be used in several fields, such as pronunciation correction for people with speech impediments or language study. Pitch alteration is commonly performed with the PSOLA synthesis method, which is a time-domain processing method; in this paper, however, the duration of the pitch-altered speech is changed in the frequency domain, using a duration alteration method based on the Fast Fourier Transform. Consequently, the intelligibility of speech whose pitch and duration are both controlled decreases slightly compared with the case where only the pitch is changed, but the proposed algorithm obtains a higher MOS score for naturalness.
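As a hedged sketch of duration alteration in the frequency domain, the example below stretches a signal with a phase vocoder over its FFT (STFT) representation; the paper's exact frequency-domain procedure may differ. A rate below 1 lengthens the signal, above 1 shortens it, without changing pitch.

```python
import numpy as np
import librosa

def alter_duration(y, rate):
    D = librosa.stft(y)                          # analysis: FFT frames
    D_stretch = librosa.phase_vocoder(D, rate=rate)
    return librosa.istft(D_stretch)              # synthesis back to the time domain

y = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000).astype(np.float32)
longer = alter_duration(y, rate=0.8)             # ~25% longer, same pitch
```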
Abstract: Electronic commerce is growing rapidly, with on-line sales already heading for hundreds of billions of dollars per year. Due to the huge amount of money transferred every day, an increased security level is required. In this work we present the architecture of an intelligent speaker verification system which is able to accurately verify the registered users of an e-commerce service using only their voices as input. According to the proposed architecture, a transaction-based e-commerce application should be complemented by a biometric server where each customer's unique set of speech models (voiceprint) is stored. The verification procedure asks the user to pronounce a personalized sequence of digits; the speech is captured and voice features are extracted at the client side, then sent to the biometric server. The biometric server uses pattern recognition to decide whether the received features match the stored voiceprint of the customer the user claims to be, and grants verification accordingly. The proposed architecture can provide e-commerce applications with a higher degree of certainty regarding the identity of a customer, and prevent impostors from executing fraudulent transactions.
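A hedged sketch of the biometric server's matching step follows, assuming a common GMM voiceprint scheme with a background model (the abstract does not name the modeling method): accept the claim if the claimed speaker's model out-scores the background model by a margin.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def verify(features, speaker_gmm, background_gmm, threshold=0.5):
    llr = speaker_gmm.score(features) - background_gmm.score(features)
    return llr > threshold            # grant or refuse verification

# Dummy stand-ins for enrollment and background feature data.
enroll = np.random.randn(500, 13)
background = np.random.randn(2000, 13)
spk = GaussianMixture(n_components=8).fit(enroll)
ubm = GaussianMixture(n_components=8).fit(background)
print(verify(np.random.randn(100, 13), spk, ubm))
```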
Abstract: Concatenative speech synthesis is a method that can produce speech with the naturalness and high individuality of a speaker by drawing on a large speech corpus. Based on this method, we propose a voice conversion method whose converted speech has high individuality and naturalness. We also conducted two subjective evaluation experiments to assess the individuality and sound quality of the converted speech. From the results, the following three facts were confirmed: (a) the proposed method converts the individuality of speakers well; (b) incorporating the unit-selection framework of concatenative speech synthesis (especially the join cost) into conventional voice conversion improves the sound quality of the converted speech; and (c) the proposed method is robust against a difference in gender between the source speaker and the target speaker.
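As a minimal sketch of the kind of join cost the abstract borrows from unit selection, the function below scores the spectral mismatch between the boundary frames of two candidate units, so that smooth joins get low cost; the MFCC representation is an illustrative assumption.

```python
import numpy as np

def join_cost(unit_a, unit_b):
    """unit_*: (n_frames, n_coeffs) feature matrices; compare the last frame
    of unit_a with the first frame of unit_b."""
    return np.linalg.norm(unit_a[-1] - unit_b[0])

a = np.random.randn(40, 13)   # candidate unit ending a segment
b = np.random.randn(35, 13)   # candidate unit that would follow it
print(join_cost(a, b))
```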
Abstract: The current speech interfaces in many military applications may be adequate for native speakers. However, the recognition rate drops considerably for non-native speakers (people with foreign accents). This is mainly because non-native speakers exhibit large temporal and intra-phoneme variations when they pronounce the same words. The problem is further complicated by the presence of strong environmental noise such as tank noise, helicopter noise, etc. In this paper, we propose a novel continuous acoustic feature adaptation algorithm for on-line accent and environmental adaptation. Implemented by incremental singular value decomposition (SVD), the algorithm captures local acoustic variation and runs in real time. This feature-based adaptation method is then integrated with the conventional model-based maximum likelihood linear regression (MLLR) algorithm. Extensive experiments have been performed on the NATO non-native speech corpus with a baseline acoustic model trained on native American English. The proposed feature-based adaptation algorithm improved the average recognition accuracy by 15%, while the MLLR model-based adaptation achieved an 11% improvement. The corresponding word error rate (WER) reductions were 25.8% and 2.73%, compared to the system without adaptation. The combined adaptation achieved an overall recognition accuracy improvement of 29.5% and a WER reduction of 31.8%, compared to the system without adaptation.
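A hedged sketch of incremental SVD subspace tracking of the kind used for on-line feature adaptation follows (a simplified Brand-style rank-one update); the paper's exact adaptation rule is not reproduced, and the feature dimension and rank are assumptions.

```python
import numpy as np

def incremental_svd_update(U, S, x, k):
    """Update a rank-k subspace (U: d x k, S: k singular values) with frame x."""
    p = U.T @ x
    r = x - U @ p                              # component outside the subspace
    rn = np.linalg.norm(r)
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = np.diag(S)
    K[:k, k] = p
    K[k, k] = rn
    Uk, Sk, _ = np.linalg.svd(K)               # small (k+1)x(k+1) SVD
    r_dir = r / rn if rn > 1e-10 else np.zeros_like(x)
    return (np.column_stack([U, r_dir]) @ Uk)[:, :k], Sk[:k]  # re-truncate to k

d, k = 39, 10                                  # e.g., 39-dim acoustic features
U, S = np.linalg.qr(np.random.randn(d, k))[0], np.ones(k)
for _ in range(100):                           # stream of incoming frames
    U, S = incremental_svd_update(U, S, np.random.randn(d), k)
```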
Abstract: A novel approach to speech coding using a hybrid architecture is presented. The advantages of parametric and perceptual coding methods are combined in order to create a speech coding algorithm that assures better signal quality than a traditional parametric CELP codec. Two approaches are discussed. The first is based on selecting voiced signal components, which are encoded using a parametric algorithm, unvoiced components, which are encoded perceptually, and transients, which remain unencoded. The second approach applies perceptual encoding to the residual signal of a CELP codec. The algorithm applied for precise transient selection is described. The signal quality achieved with the proposed hybrid codec is compared to the quality of several standard speech codecs.
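A hedged sketch of the voiced/unvoiced/transient split the hybrid codec relies on follows: frames are labeled from energy and zero-crossing measures. The paper's transient-selection algorithm is more precise; the thresholds here are illustrative assumptions.

```python
import numpy as np

def classify_frames(x, frame_len=256, e_jump=4.0, zcr_thr=0.15):
    labels, prev_energy = [], None
    for i in range(0, len(x) - frame_len, frame_len):
        f = x[i:i + frame_len]
        energy = np.mean(f ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2   # zero-crossing rate
        if prev_energy is not None and energy > e_jump * prev_energy:
            labels.append('transient')    # sudden energy rise -> left unencoded
        elif zcr < zcr_thr:
            labels.append('voiced')       # -> parametric encoding
        else:
            labels.append('unvoiced')     # -> perceptual encoding
        prev_energy = energy
    return labels
```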
Abstract: Unified Speech and Audio Coding (USAC), the latest MPEG standard for unified speech and audio coding, uses a speech/audio classification algorithm to distinguish the speech and audio segments of the input signal. Owing to a shortcoming of this system, introducing a well-designed orchestra/percussion classification and modifying the subsequent processing can considerably increase the quality of the recovered audio. This paper proposes an orchestra/percussion classification algorithm for the USAC system which extracts only 3 scales of Mel-Frequency Cepstral Coefficients (MFCCs) rather than the traditional 13 scales, and uses an Iterative Dichotomiser 3 (ID3) decision tree rather than more complex learning methods; the proposed algorithm therefore has a lower computational complexity than most existing algorithms. Considering that frequent switching of the classification decision may degrade the recovered audio signal, this paper also designs a modified subsequent process that helps the whole classification system reach an accuracy as high as 97%, which is comparable to the classical 99%.
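As a hedged illustration of the feature/classifier combination the abstract names (3 MFCC scales plus an ID3-style tree), the sketch below uses scikit-learn's CART tree with an entropy criterion as a stand-in for ID3; the segment data and labels are dummies.

```python
import numpy as np
import librosa
from sklearn.tree import DecisionTreeClassifier

def three_mfccs(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=3)   # only 3 scales, not 13
    return mfcc.mean(axis=1)                            # one 3-dim vector per segment

sr = 16000
segments = [np.random.randn(sr) for _ in range(20)]     # dummy audio segments
X = np.array([three_mfccs(s, sr) for s in segments])
y = np.random.randint(0, 2, 20)                         # 0 = orchestra, 1 = percussion
tree = DecisionTreeClassifier(criterion='entropy').fit(X, y)
print(tree.predict(X[:3]))
```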
Abstract: This work presents a fusion of the Log Gabor Wavelet (LGW) and the Maximum a Posteriori (MAP) estimator as a speech enhancement tool for acoustical background noise reduction. The probability density function (pdf) of the speech spectral amplitude is approximated by a Generalized Laplacian Distribution (GLD). Compared to earlier estimators, the proposed method models the underlying statistics more accurately by appropriately choosing the parameters of the GLD. Experimental results show that the proposed estimator yields a higher improvement in Segmental Signal-to-Noise Ratio (S-SNR) and a lower Log-Spectral Distortion (LSD) in two different noisy environments than other estimators.
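As a hedged sketch of the log-Gabor front-end component, the function below builds a 1-D log-Gabor magnitude response of the standard form exp(-(ln(f/f0))^2 / (2 (ln σ)^2)); the center frequency and bandwidth ratio are illustrative assumptions, and the MAP amplitude estimator itself is not shown.

```python
import numpy as np

def log_gabor(n_bins, f0=0.1, sigma_ratio=0.55):
    """Log-Gabor magnitude response over normalized frequencies (0, 0.5]."""
    f = np.linspace(1e-6, 0.5, n_bins)
    return np.exp(-np.log(f / f0) ** 2 / (2 * np.log(sigma_ratio) ** 2))

G = log_gabor(512)
filtered_spectrum = G * np.abs(np.fft.rfft(np.random.randn(1022)))  # 512 bins
```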
Abstract: In this paper, we propose a novel time-frequency distribution (TFD) for the analysis of multi-component signals. In particular, we use synthetic as well as real-life speech signals to demonstrate the superiority of the proposed TFD over some existing ones. In the comparison, we consider cross-term suppression and the concentration of the signal energy around its instantaneous frequency (IF).
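As a hedged illustration of the cross-term problem such comparisons address, the sketch below computes the classical Wigner-Ville distribution, whose cross-terms appear midway between the components of a two-tone signal; the proposed TFD itself is not reproduced here.

```python
import numpy as np
from scipy.signal import hilbert

def wigner_ville(x):
    z = hilbert(x)                          # analytic signal
    n = len(z)
    W = np.zeros((n, n))
    for t in range(n):
        taumax = min(t, n - 1 - t)
        tau = np.arange(-taumax, taumax + 1)
        acf = np.zeros(n, dtype=complex)
        acf[tau % n] = z[t + tau] * np.conj(z[t - tau])  # local autocorrelation
        W[:, t] = np.real(np.fft.fft(acf))  # frequency slice at time t
    return W

# Two-component signal: cross-terms appear between the two tones.
t = np.arange(256) / 256.0
x = np.cos(2 * np.pi * 32 * t) + np.cos(2 * np.pi * 96 * t)
W = wigner_ville(x)
```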