Abstract: The developed tool is one of a set of system tools for easier access to various scientific areas and for real-time interactive learning between a lecturer and hearing-impaired students. The lecturer is not required to know Sign Language (SL). Instead, the new software tools translate regular speech into SL, which is then transferred to the student. Conversely, the questions of the student (in SL) are translated and transferred to the lecturer as text or speech. The tool presented here is one of these: a tool for developing correct Speech Visemes as the root of a total-communication method for hearing-impaired students.
Abstract: Field Association (FA) terms are a limited set of discriminating terms that give us the knowledge to identify document fields; they are effective in document classification, similar-file retrieval and passage retrieval. However, there is no effective method to automatically extract relevant Arabic FA terms and build a comprehensive dictionary. Moreover, all previous studies are based on FA terms in English and Japanese, and extending FA terms to another language such as Arabic could definitely strengthen further research. This paper presents a new method to extract Arabic FA terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. Experimental evaluation was carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news, selecting an average of 2,825 FA terms (single and compound) per field. From the experimental results, recall and precision are 84% and 79% respectively. Therefore, this method selects a higher number of relevant Arabic FA terms at high precision and recall.
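The corpora-comparison step can be illustrated with a toy sketch. The scoring rule, threshold and smoothing below are assumptions for illustration, not the paper's actual criterion: a candidate term is kept when its relative frequency in the domain-specific corpus is much higher than in a general reference corpus.

```python
# Hypothetical sketch of corpora comparison for FA-term candidates.
# The threshold and add-one smoothing are illustrative assumptions.
def select_fa_terms(domain_counts, general_counts, threshold=5.0):
    """Keep terms whose relative frequency in the domain corpus
    exceeds their relative frequency in the general corpus by
    at least `threshold` times."""
    domain_total = sum(domain_counts.values())
    general_total = sum(general_counts.values())
    selected = []
    for term, count in domain_counts.items():
        p_domain = count / domain_total
        # Smooth unseen terms in the general corpus to avoid division by zero.
        p_general = (general_counts.get(term, 0) + 1) / (general_total + len(general_counts))
        if p_domain / p_general >= threshold:
            selected.append(term)
    return sorted(selected)
```

A domain-salient word such as a field term scores far above common function words, which appear with similar relative frequency in both corpora and are therefore rejected.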
Abstract: In this era of technology, fueled by the pervasive usage of the internet, security is a prime concern. The number of new attacks by so-called "bots", which are automated programs, is increasing at an alarming rate. They are most likely to attack online registration systems. A technology called "CAPTCHA" (Completely Automated Public Turing test to tell Computers and Humans Apart) exists, which can differentiate between automated programs and humans and prevent replay attacks. Traditionally, CAPTCHAs have been implemented as the challenge of recognizing textual images and reproducing them. We propose an approach where the visual challenge has to be read out, from which randomly selected keywords are used to verify the correctness of the spoken text and in turn detect the presence of a human. This is supplemented with a speaker recognition system which can also identify the speaker. Thus, this framework fulfills both objectives: it can determine whether the user is a human or not and, if it is a human, it can verify the user's identity.
Abstract: Bangla vowel characterization determines the spectral properties of Bangla vowels for efficient synthesis as well as recognition of Bangla vowels. In this paper, Bangla vowels in isolated words have been analyzed based on a speech production model within the framework of analysis-by-synthesis. This has led to the extraction of spectral parameters for the production model in order to produce different Bangla vowel sounds. The real and synthetic spectra are compared, and a weighted square error has been computed along with the error in the formant bandwidths for efficient representation of Bangla vowels. The extracted features produced a good representation of the targeted Bangla vowels. Such a representation also plays an essential role in low-bit-rate speech coding and vocoders.
Abstract: We report in this paper the design of an automatic speech recognition system based on dynamic programming techniques. Dynamic time warping is a technique used to synchronize two patterns for comparison. We show how this technique is adapted to the field of automatic speech recognition. We first present the theory of the warping function, which is used to compare and align an unknown pattern with a set of reference patterns constituting the vocabulary of the application. We then give the various algorithms necessary for its implementation on a machine. The algorithms presented here were tested on part of the Arabic-language word corpus Arabdic-10 [4] and gave fully satisfactory results. These algorithms are effective insofar as they are applied to small or medium vocabularies.
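The warping-function idea above can be sketched as a standard dynamic programming recursion. This is a minimal illustration, not the paper's exact algorithm; the function names, the Euclidean local distance and the symmetric path constraints are assumptions.

```python
# Minimal dynamic time warping (DTW) sketch: patterns are lists of
# equal-length feature vectors, one vector per speech frame.
def dtw_distance(ref, unknown):
    """Align an unknown pattern with a reference pattern and
    return the minimal cumulative warping cost."""
    n, m = len(ref), len(unknown)
    INF = float("inf")
    # D[i][j]: minimal cost of aligning ref[:i] with unknown[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance: Euclidean distance between frame features.
            d = sum((a - b) ** 2
                    for a, b in zip(ref[i - 1], unknown[j - 1])) ** 0.5
            # Symmetric local path constraints (insertion/deletion/match).
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def recognize(unknown, vocabulary):
    """Return the vocabulary word whose reference pattern warps
    onto the unknown pattern with the least cost."""
    return min(vocabulary,
               key=lambda word: dtw_distance(vocabulary[word], unknown))
```

Recognition over a small vocabulary then reduces to one DTW evaluation per reference pattern, which is why the approach scales only to small or medium vocabularies.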
Abstract: This paper presents a new method for estimating the non-stationary noise power spectral density given a noisy signal. The method is based on averaging the noisy speech power spectrum using time- and frequency-dependent smoothing factors. These factors are adjusted based on the signal-presence probability in individual frequency bins. Signal presence is determined by computing the ratio of the noisy speech power spectrum to its local minimum, which is updated continuously by averaging past values of the noisy speech power spectra with a look-ahead factor. This method adapts very quickly to highly non-stationary noise environments. The proposed method achieves significant improvements over a system that uses a voice activity detector (VAD) for noise estimation.
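A recursion of this general family can be sketched for a single frequency bin. This is a hedged illustration of minima-controlled noise tracking, not the paper's method; all parameter names and values below are assumptions.

```python
# Illustrative one-step update of a minima-controlled noise PSD tracker
# in a single frequency bin. Parameter values are assumptions.
def update_noise_estimate(P, Pmin, p_prev, D, alpha_s=0.7, beta=0.8,
                          gamma=0.998, ratio_threshold=2.0, alpha_p=0.2):
    """P:      smoothed noisy-speech power in this bin (current frame)
    Pmin:   running local minimum of P
    p_prev: previous speech-presence probability
    D:      current noise PSD estimate
    Returns (new Pmin, new presence probability, new noise estimate)."""
    # Continuously updated local minimum with a look-ahead (rise) factor,
    # so the minimum can slowly follow a rising noise floor.
    if Pmin < P:
        Pmin = gamma * Pmin + ((1.0 - gamma) / (1.0 - beta)) * (P - beta * Pmin)
    else:
        Pmin = P
    # Signal presence from the ratio of smoothed power to its minimum.
    Sr = P / max(Pmin, 1e-12)
    I = 1.0 if Sr > ratio_threshold else 0.0
    p = alpha_p * p_prev + (1.0 - alpha_p) * I  # smoothed presence probability
    # Time-frequency dependent smoothing: where speech is likely present,
    # alpha_d approaches 1 and the noise estimate is frozen.
    alpha_d = alpha_s + (1.0 - alpha_s) * p
    D = alpha_d * D + (1.0 - alpha_d) * P
    return Pmin, p, D
```

Under steady noise the estimate converges to the bin power, while a sudden speech burst drives the presence probability up and largely freezes the noise update, which is what lets the tracker follow non-stationary noise without a VAD.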
Abstract: The theatre-auditorium under investigation appears to be acoustically inadequate, owing to the highly reflective materials used in it (marble, painted wood, smooth plaster, etc.), the architectural and structural features of the building, and its intended, highly multifunctional use (auditorium, theatre, cinema, musicals, conference room). This emerges from the analysis of the existing state carried out with the acoustic simulation software Ramsete and supported by data obtained through a campaign of on-site acoustic measurements of the existing state performed with a Svantek SVAN 957 sound level meter. After the 3D model was completed according to the specifications required by the forecasting software, three simulations were run: an acoustic simulation of the existing state and acoustic simulations of two design solutions.
The acoustic improvement found in the first design solution, compared to the existing state, consists in lowering the Reverberation Time towards the most desirable values, while the indicators of Clarity, Barycentric Time, Lateral Efficiency, Bass Ratio (BR) and Speech Intelligibility improve significantly. The acoustic improvement found in the second design solution, compared to the first, consists mainly in a more uniform distribution of Leq and in lowering the Reverberation Time towards the optimum values. The indicators of Clarity and Lateral Efficiency improve further, but at the expense of a slightly worse BR value. The remaining indices vary only slightly.
Abstract: A set of Artificial Neural Network (ANN) based methods for the design of an effective system for speech recognition of numerals of the Assamese language, captured under varied recording conditions and moods, is presented here. The work concerns the formulation of several ANN models configured to use Linear Predictive Coding (LPC), Principal Component Analysis (PCA) and other features to tackle mood and gender variations in uttering numbers as part of an Automatic Speech Recognition (ASR) system for Assamese. The ANN models are designed using a combination of a Self-Organizing Map (SOM) and a Multi-Layer Perceptron (MLP) constituting a Learning Vector Quantization (LVQ) block, trained in a cooperative environment to handle male and female speech samples of numerals of Assamese, a language spoken by a sizeable population in the north-eastern part of India. The work provides a comparative evaluation of several such combinations when handling speech samples with gender-based differences captured by a microphone in four different conditions, viz. noiseless, noise-mixed, stressed and stress-free.
Abstract: A combined three-microphone voice activity detector (VAD) and noise-canceling system is studied to enhance speech recognition in an automobile environment. A previous experiment clearly showed the ability of the composite system to cancel a single noise source outside a defined zone. This paper investigates the performance of the composite system when there are frequently moving noise sources (noise sources coming from different locations but not always present at the same time), e.g. speech from another passenger or from a radio while the desired speech is present. To work in such an environment, whilst the three-microphone voice activity detector (VAD) detects voice within a "VAD valid zone", the three-microphone noise canceller uses a "noise canceller valid zone" defined in free space around the user's head. A desired voice should therefore lie in the intersection of the noise-canceller valid zone and the VAD valid zone, and all noise outside this intersection is suppressed. Experiments were performed in a real environment: all results were recorded in a car with omni-directional electret condenser microphones.
Abstract: In this paper we evaluate audio and speech quality with the help of a digital audio watermarking technique under different types of attacks (signal impairments), such as Gaussian noise, compression error and jitter. These attacks together are considered a hostile environment. Audio and speech quality evaluation is an important research topic. The traditional way to evaluate speech quality is through subjective tests. They are reliable, but very expensive, time-consuming, and cannot be used in certain applications such as online monitoring. Objective models, based on human perception, were developed to predict the results of subjective tests. The existing objective methods require either the original speech or a complicated computation model, which makes some applications of quality evaluation impossible.
Abstract: This paper proposes a novel approach that combines statistical models and support vector machines. A hybrid scheme which appropriately incorporates the advantages of both the generative and discriminative model paradigms is described and evaluated. Support vector machines (SVMs) are trained to divide the whole speaker space into small subsets of speakers within a hierarchical tree structure. During testing, a speech token is assigned to its corresponding group, and evaluation using Gaussian mixture models (GMMs) is then performed. Experimental results show that the proposed method can significantly improve the performance of the text-independent speaker identification task. We report improvements of up to a 50% reduction in identification error rate compared to the baseline statistical model.
Abstract: This paper presents a rule-based text-to-speech (TTS) synthesis system for Standard Malay (SM), namely SMaTTS. The proposed system uses a sinusoidal method and some pre-recorded wave files to generate speech. The use of a phone database significantly decreases the amount of computer memory used, making the system very light and embeddable. The overall system comprises two phases. The first is the Natural Language Processing (NLP) phase, which consists of the high-level processing of text analysis, phonetic analysis, text normalization and a morphophonemic module. The module was designed specially for SM to overcome a few problems in defining rules for the SM orthography system before text can be passed to the DSP module. The second phase is the Digital Signal Processing (DSP) phase, which operates on the low-level process of speech waveform generation. An intelligible and adequately natural-sounding formant-based speech synthesis system with a light and user-friendly Graphical User Interface (GUI) is introduced. A Standard Malay phoneme set and an inclusive phone database have been constructed carefully for this phone-based speech synthesizer. By applying generative phonology, a comprehensive set of letter-to-sound (LTS) rules and a pronunciation lexicon have been developed for SMaTTS. For the evaluation tests, a Diagnostic Rhyme Test (DRT) word list was compiled and several experiments were performed to evaluate the quality of the synthesized speech by analyzing the Mean Opinion Score (MOS) obtained. The overall performance of the system, as well as the room for improvement, is thoroughly discussed.
Abstract: In this article we present a methodology which enables preschool and primary-school unlanguaged children to remember words, phrases and texts with the help of graphic signs: letters, syllables and words. Reading becomes a support for the child's speech development. Teaching is based on the principle "from simple to complex": a letter, a syllable, a word, a sentence, a text. The availability of multi-level texts allows this methodology to be used with children who have different levels of speech development.
Abstract: This paper explores the scalability issues associated with solving the Named Entity Recognition (NER) problem using Support Vector Machines (SVM) and high-dimensional features. The performance results of a set of experiments conducted using binary and multi-class SVM with increasing training-data sizes are examined. The NER domain chosen for these experiments is the biomedical publications domain, selected due to its importance and inherent challenges. A simple machine learning approach is used that eliminates prior language knowledge such as part-of-speech or noun-phrase tagging, thereby allowing for its applicability across languages. No domain-specific knowledge is included. The accuracy measures achieved are comparable to those obtained using more complex approaches, which motivates investigating ways to improve the scalability of multi-class SVM in order to make the solution more practical and usable. Improving the training time of multi-class SVM would make support vector machines a more viable and practical machine learning solution for real-world problems with large datasets. An initial prototype achieves a great improvement in training time at the expense of memory requirements.
Abstract: In this work we introduce an efficient method to limit the impact of the hiding process on the quality of the cover speech. Vector quantization of the secret speech's spectral information drastically reduces the number of secret speech parameters to be embedded in the cover signal. Compared to scalar hiding, the vector quantization hiding technique provides a stego signal that is indistinguishable from the cover speech. The objective and subjective performance measures reveal that the proposed hiding technique attracts no suspicion about the presence of the secret message in the stego speech, while being able to recover an intelligible copy of the secret message at the receiver side.
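The core idea can be sketched under stated assumptions: the secret speech's spectral vectors are quantized against a small codebook, and only the resulting codebook indices are embedded, here (as a stand-in for the paper's actual embedding scheme) one bit per least significant bit of successive integer cover samples. All function names and the LSB embedding are illustrative.

```python
# Hypothetical sketch: vector quantization plus index embedding.
def nearest_index(vec, codebook):
    """Vector quantization: index of the closest codebook entry
    by squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(codebook[i], vec)))

def embed_indices(cover, indices, bits_per_index):
    """Hide each index, LSB-first, in the least significant bits
    of successive integer cover samples (one bit per sample)."""
    stego = list(cover)
    pos = 0
    for idx in indices:
        for b in range(bits_per_index):
            bit = (idx >> b) & 1
            stego[pos] = (stego[pos] & ~1) | bit
            pos += 1
    return stego

def extract_indices(stego, count, bits_per_index):
    """Recover the hidden codebook indices at the receiver side."""
    indices, pos = [], 0
    for _ in range(count):
        idx = 0
        for b in range(bits_per_index):
            idx |= (stego[pos] & 1) << b
            pos += 1
        indices.append(idx)
    return indices
```

Because only a few codebook indices per frame are hidden rather than the full scalar parameters, far fewer cover samples are perturbed, which is the intuition behind the reduced distortion of vector over scalar hiding.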
Abstract: The performance of any continuous speech recognition system is highly dependent on the performance of the acoustic models. Generally, development of robust spoken language technology relies on the availability of large amounts of data. A common way to cope with little data for training each state of the Markov models is tree-based state tying. This tying method applies contextual questions to tie states. The manual procedure for question generation suffers from human error and is time-consuming. Various automatically generated questions are used to construct the decision tree. There are three approaches to generating questions for decision-tree-based HMM construction: one is based on misrecognized phonemes, another basically uses a feature table, and the third is based on state distributions corresponding to context-independent subword units. In this paper, all these methods of automatic question generation are applied to the decision tree on the FARSDAT corpus in the Persian language, and their results are compared with those of manually generated questions. The results show that automatically generated questions yield much better results and can replace manually generated questions in the Persian language.
Abstract: Real-world Speaker Identification (SI) applications differ from ideal or laboratory conditions, causing perturbations that lead to a mismatch between the training and testing environments and degrade performance drastically. Many strategies have been adopted to cope with acoustical degradation; the wavelet-based Bayesian marginal model is one of them. But Bayesian marginal models cannot model the inter-scale statistical dependencies of different wavelet scales. Simple nonlinear estimators for wavelet-based denoising assume that the wavelet coefficients in different scales are independent in nature. However, wavelet coefficients have significant inter-scale dependency. This paper exploits this inter-scale dependency property through a Circularly Symmetric Probability Density Function (CS-PDF) related to the family of Spherically Invariant Random Processes (SIRPs) in the Log Gabor Wavelet (LGW) domain, and the corresponding joint shrinkage estimator is derived by Maximum a Posteriori (MAP) estimation. A framework based on these is proposed to denoise speech signals for automatic speaker identification problems. The robustness of the proposed framework is tested for the text-independent speaker identification application on 100 speakers of the POLYCOST and 100 speakers of the YOHO speech databases in three different noise environments. Experimental results show that the proposed estimator yields a higher improvement in identification accuracy than other estimators on the popular Gaussian Mixture Model (GMM) based speaker model with Mel-Frequency Cepstral Coefficient (MFCC) features.
Abstract: Human-friendly interaction is the key function of a human-centered system. Over the years, much attention has been devoted to developing convenient interaction through intention recognition. Intention recognition processes multimodal inputs including speech, face images, and body gestures. In this paper, we suggest a novel approach to intention recognition using a graph representation called the Intention Graph. A concept of valid intention is proposed as a target of intention recognition. Our approach has two phases: a goal recognition phase and an intention recognition phase. In the goal recognition phase, we generate an action graph based on the observed actions, and then the candidate goals and their plans are recognized. In the intention recognition phase, the intention is recognized from relevant goals and the user profile. We show that the algorithm has polynomial time complexity. The intention graph is applied to a simple briefcase domain to test our model.
Abstract: The acoustic and articulatory properties of fricative speech sounds are being studied using magnetic resonance imaging (MRI) and acoustic recordings from a single subject. Area functions were derived from a complete set of axial and coronal MR slices using two different methods: the Mermelstein technique and the Blum transform. Area functions derived from the two techniques were shown to differ significantly in some cases. Such differences will lead to different acoustic predictions and it is important to know which is the more accurate. The vocal tract acoustic transfer function (VTTF) was derived from these area functions for each fricative and compared with measured speech signals for the same fricative and same subject. The VTTFs for /f/ in two vowel contexts and the corresponding acoustic spectra are derived here; the Blum transform appears to show a better match between prediction and measurement than the Mermelstein technique.
Abstract: This paper studies the effect of different compression constraints and schemes presented in a new and flexible paradigm to achieve high compression ratios and acceptable signal-to-noise ratios for Arabic speech signals. Compression parameters are computed for variable frame sizes of a level-5 to level-7 Discrete Wavelet Transform (DWT) representation of the signals for different analyzing mother wavelet functions. Results are obtained and compared for global-threshold and level-dependent-threshold techniques. The results also include comparisons in terms of Signal-to-Noise Ratio, Peak Signal-to-Noise Ratio and Normalized Root Mean Square Error.
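The global-thresholding scheme can be sketched with a hedged toy example: a single-level Haar DWT stands in for the level-5 to level-7 decompositions and mother wavelets studied in the paper, and the threshold and signal values are illustrative only.

```python
# Illustrative sketch: one-level Haar DWT, global thresholding of
# detail coefficients, and an SNR figure of merit.
import math

def haar_dwt(x):
    """One analysis level: approximation and detail coefficients
    (x must have even length)."""
    s = math.sqrt(2.0)
    approx = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Exact inverse of haar_dwt."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x

def compress(x, threshold):
    """Global thresholding: zero out small detail coefficients,
    then reconstruct. Zeroed coefficients need not be stored,
    which is the source of the compression."""
    approx, detail = haar_dwt(x)
    detail = [d if abs(d) > threshold else 0.0 for d in detail]
    return haar_idwt(approx, detail)

def snr_db(original, reconstructed):
    """Signal-to-Noise Ratio of the reconstruction, in dB."""
    signal = sum(v * v for v in original)
    noise = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)
```

Raising the threshold zeroes more coefficients (higher compression ratio) at the cost of a lower SNR, which is exactly the trade-off the paper's comparisons quantify; a level-dependent scheme would simply apply a different threshold per decomposition level.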