Abstract: A simple adaptive voice activity detector (VAD) is
implemented using Gabor and gammatone atomic decomposition of
speech for high Gaussian noise environments. Matching pursuit is
used for atomic decomposition, and is shown to achieve optimal
speech detection capability at high data compression rates for low
signal-to-noise ratios. The most active dictionary elements found by
matching pursuit are used for the signal reconstruction so that the
algorithm adapts to the individual speaker's dominant time-frequency
characteristics. Speech has a high peak-to-average ratio, enabling
matching pursuit's greedy heuristic of selecting the highest inner products
to isolate high-energy speech components in high-noise environments. Gabor
and gammatone atoms are both investigated with identical
logarithmically spaced center frequencies, and similar bandwidths.
The algorithm performs equally well for both Gabor and gammatone
atoms with no significant statistical differences. The algorithm
achieves 70% accuracy at a 0 dB SNR, 90% accuracy at a 5 dB SNR,
and 98% accuracy at a 20 dB SNR, using a 30 dB SNR as the reference
for voice activity.
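As an illustration of the approach described above, here is a minimal Python sketch of matching pursuit over a Gabor dictionary used as a VAD criterion; the dictionary layout, frame length, and energy threshold are assumptions for the example and are not specified in the abstract.

```python
import numpy as np

def gabor_dictionary(n, freqs, sigma=0.01, fs=8000):
    """Unit-norm Gabor atoms at log-spaced centre frequencies (assumed layout)."""
    t = np.arange(n) / fs
    atoms = [np.exp(-((t - t.mean()) ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * f * t)
             for f in freqs]
    D = np.array(atoms)
    return D / np.linalg.norm(D, axis=1, keepdims=True)

def matching_pursuit(frame, D, n_atoms=10):
    """Greedy MP: repeatedly pick the atom with the largest |inner product|."""
    residual, recon = frame.copy(), np.zeros_like(frame)
    for _ in range(n_atoms):
        scores = D @ residual
        k = np.argmax(np.abs(scores))
        recon += scores[k] * D[k]
        residual -= scores[k] * D[k]
    return recon

def vad_decision(frame, D, threshold=0.5):
    """Declare speech if the sparse reconstruction keeps most of the frame energy."""
    recon = matching_pursuit(frame, D)
    return np.sum(recon ** 2) / (np.sum(frame ** 2) + 1e-12) > threshold

fs = 8000
freqs = np.logspace(np.log10(100), np.log10(3500), 40)   # assumed band layout
D = gabor_dictionary(256, freqs, fs=fs)
frame = np.random.randn(256)                              # stand-in for one speech frame
print(vad_decision(frame, D))
```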
Abstract: OPEN_EmoRec_II is an open multimodal corpus with
experimentally induced emotions. In the first half of the experiment,
emotions were induced with standardized picture material and in the
second half during a human-computer interaction (HCI), realized
with a wizard-of-oz design. The induced emotions are based on the
dimensional theory of emotions (valence, arousal and dominance).
These emotional sequences, recorded with multimodal data (facial
reactions, speech, audio and physiological reactions) in a
naturalistic-like HCI environment, can be used to improve classification
methods on a multimodal level.
This database is the result of an HCI experiment for which 30
subjects in total agreed to the publication of their data, including the
video material, for research purposes*. The now available open
corpus contains sensory signals of video, audio, physiology (SCL,
respiration, BVP, EMG Corrugator supercilii, EMG Zygomaticus
Major) and annotations of facial reactions.
Abstract: The present paper reviews the scholarly discussion
concerning speech impact, the peculiarities of its realization, and speech
strategies and techniques in particular. Proceeding from the viewpoints
of many prominent linguists, the paper suggests that manipulative
argumentation be viewed as a most pervasive speech strategy with a
certain set of techniques which are to be found in modern American
political discourse. The prevalence of their occurrence allows us to
regard them as pragmatic patterns of speech impact realization in
effective public speaking.
Abstract: The performance and analysis of a speech recognition
system are illustrated in this paper. An approach to recognizing the
English words corresponding to the digits (0-9), spoken by 2 different
speakers and captured in a noise-free environment, is presented. For
feature extraction, Mel frequency cepstral coefficients (MFCC) have
been used, which give a set of feature vectors from the recorded speech
samples. A neural network model is used to enhance the recognition
performance; a feed-forward neural network with the back-propagation
algorithm is used. However, other speech recognition techniques such
as HMM and DTW exist. All experiments are carried out in MATLAB.
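A minimal sketch of the described pipeline (MFCC features followed by a feed-forward network), using Python equivalents rather than MATLAB; the file list, labels, sampling rate, and network size are hypothetical.

```python
import numpy as np
import librosa                                    # MFCC extraction
from sklearn.neural_network import MLPClassifier  # feed-forward net trained with backprop

def digit_features(wav_path, n_mfcc=13):
    """Average the MFCC vectors over the utterance to get one fixed-length feature."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return mfcc.mean(axis=1)

# 'files' and 'labels' are hypothetical lists of recorded digit utterances (0-9)
# X = np.stack([digit_features(f) for f in files])
# clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000).fit(X, labels)
# print(clf.predict(X[:5]))
```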
Abstract: The 3D body movement signals captured during
human-human conversation include clues not only to the content of
people’s communication but also to their culture and personality.
This paper is concerned with automatic extraction of this information
from body movement signals. For the purpose of this research, we
collected a novel corpus from 27 subjects and arranged them into groups
according to their culture. We arranged each group into pairs, and
each pair communicated about different topics.
A state-of-the-art recognition system is applied to the problems of
person, culture, and topic recognition. We borrowed modeling,
classification, and normalization techniques from speech recognition.
We used Gaussian Mixture Modeling (GMM) as the main technique
for building our three systems, obtaining 77.78%, 55.47%, and
39.06% accuracy from the person, culture, and topic recognition systems
respectively. In addition, we combined the above GMM systems with
Support Vector Machines (SVM) to obtain 85.42%, 62.50%, and
40.63% accuracy for person, culture, and topic recognition
respectively.
Although direct comparison among these three recognition
systems is difficult, it seems that our person recognition system
performs best for both GMM and GMM-SVM, suggesting that inter-subject
differences (i.e. subjects' personality traits) are a major
source of variation. When removing these traits from culture and
topic recognition systems using the Nuisance Attribute Projection
(NAP) and the Intersession Variability Compensation (ISVC)
techniques, we obtained 73.44% and 46.09% accuracy from culture
and topic recognition systems respectively.
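A minimal sketch of the GMM part of such a system: one GMM per class, scored by log-likelihood. The feature extraction from 3D body-movement signals, the number of mixture components, and the toy data are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(features_by_class, n_components=8):
    """Fit one GMM per class (person / culture / topic) on its feature frames."""
    return {label: GaussianMixture(n_components=n_components, covariance_type="diag").fit(X)
            for label, X in features_by_class.items()}

def classify(gmms, X):
    """Pick the class whose GMM gives the highest average frame log-likelihood."""
    scores = {label: g.score(X) for label, g in gmms.items()}
    return max(scores, key=scores.get)

# Hypothetical per-subject feature matrices (frames x dims)
data = {f"subject_{i}": np.random.randn(200, 10) + i for i in range(3)}
gmms = train_gmms(data)
print(classify(gmms, np.random.randn(50, 10) + 1))
```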
Abstract: Speech segmentation is the detection of change points for
partitioning an input speech signal into regions, each of which
corresponds to only one speaker. In this paper, we apply two features
based on the multi-scale product (MP) of the clean speech,
namely the spectral centroid of the MP and the zero-crossing rate of the
MP. We focus on multi-scale product analysis as an important tool
for segmentation extraction. The MP is obtained by taking the product
of the speech wavelet transform coefficients (WTC) across scales. We have
evaluated our method on the Keele database. The results show the
effectiveness of our method; they indicate that the two features can find
word boundaries and extract the segments of the clean speech.
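A minimal sketch of the multi-scale product and the two features, assuming a stationary wavelet transform, a db2 wavelet, three scales, and a fixed frame size (none of which are specified in the abstract).

```python
import numpy as np
import pywt

def multiscale_product(x, wavelet="db2", levels=3):
    """Product of stationary wavelet detail coefficients across scales (MP)."""
    n = len(x)
    pad = (-n) % (2 ** levels)                    # swt needs a length divisible by 2**levels
    x = np.pad(x, (0, pad))
    coeffs = pywt.swt(x, wavelet, level=levels)   # list of (approx, detail) per level
    mp = np.prod([d for _, d in coeffs], axis=0)
    return mp[:n]

def frame_features(mp, frame=512, fs=8000):
    """Spectral centroid and zero-crossing rate of the MP, per frame (assumed frame size)."""
    feats = []
    for i in range(0, len(mp) - frame, frame):
        seg = mp[i:i + frame]
        spec = np.abs(np.fft.rfft(seg))
        freqs = np.fft.rfftfreq(frame, 1 / fs)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        zcr = np.mean(np.abs(np.diff(np.sign(seg)))) / 2
        feats.append((centroid, zcr))
    return np.array(feats)

x = np.random.randn(8000)                         # stand-in for clean speech
print(frame_features(multiscale_product(x)).shape)
```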
Abstract: Speech enhancement is a long-standing problem with
numerous applications like teleconferencing, VoIP, hearing aids and
speech recognition. The motivation behind this research work is to
obtain a clean speech signal of higher quality by applying the optimal
noise cancellation technique. Real-time adaptive filtering algorithms
seem to be the best candidates among all categories of the speech
enhancement methods. In this paper, we propose a speech
enhancement method based on a Recursive Least Squares (RLS)
adaptive filter for speech signals. Experiments were performed on
noisy data which was prepared by adding AWGN, Babble and Pink
noise to clean speech samples at -5dB, 0dB, 5dB and 10dB SNR
levels. We then compare the noise cancellation performance of the
proposed RLS algorithm with the existing NLMS algorithm in terms of
Mean Squared Error (MSE), Signal-to-Noise Ratio (SNR) and SNR
Loss. Based on the performance evaluation, the proposed RLS
algorithm was found to be a better optimal noise cancellation
technique for speech signals.
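A minimal sketch of a generic RLS adaptive noise canceller of the kind the abstract refers to; the filter order, forgetting factor, and idealized noise reference are assumptions, not the paper's settings.

```python
import numpy as np

def rls_filter(d, x, order=8, lam=0.999, delta=0.01):
    """Generic RLS adaptive filter: predicts d from the reference x.
    For noise cancellation, d is the noisy speech and x a correlated noise
    reference; the a priori error e = d - y is the enhanced speech estimate."""
    w = np.zeros(order)
    P = np.eye(order) / delta
    e = np.zeros(len(d))
    for n in range(order, len(d)):
        u = x[n - order:n][::-1]             # most recent reference samples
        k = P @ u / (lam + u @ P @ u)        # gain vector
        e[n] = d[n] - w @ u                  # a priori error (enhanced sample)
        w = w + k * e[n]                     # coefficient update
        P = (P - np.outer(k, u @ P)) / lam   # inverse correlation matrix update
    return e

# Toy example with AWGN: the reference is the noise itself (idealized setup)
rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)   # stand-in for clean speech
noise = rng.normal(scale=0.5, size=8000)
enhanced = rls_filter(s + noise, noise)
```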
Abstract: The paper deals with the usage of speech acts and
politeness strategies in an EFL classroom in Georgia (Rep of). It
explores the students’ and the teachers’ practice of the politeness
strategies and the speech acts of apology, thanking, request,
compliment / encouragement, command, agreeing / disagreeing,
addressing and code switching. The research method includes
observation as well as a questionnaire. The target group involves the
students from Georgian public schools and two certified, experienced
local English teachers. The analysis is based on Searle’s Speech Act
Theory and Brown and Levinson’s politeness strategies. The findings
show that the students have certain knowledge regarding politeness,
yet they fail to apply it in English communication. In addition,
most of the speech acts in the classroom interaction are used by
the teachers and not the students. Therefore, it is suggested that
teachers should cultivate the students’ communicative competence
and attempt to give them opportunities to practise more English
speech acts than they do today.
Abstract: Code-mixing in spontaneous speech has been widely
discussed, but not in virtual situations, especially in the context of
third-language learning students. Thus, this study is an attempt to
explore the linguistic characteristics of the mixing of Japanese,
English and Thai in a mobile Line chat room by students with their
background of English as L2, Japanese as L3 and Thai as mother
tongue. The results show that the insertion of Thai content words is a very
common linguistic phenomenon, embedded with the other two
languages within sentences. As chatting tends to be ‘relational’ or
‘interactional’, it shifted the style of lexical choices towards being speech-like,
more personal and emotionally related. A personal pronoun in
Japanese is often mixed into the sentences. The Japanese
sentence-final question particle か “ka” was added to the end of the
sentence based on Thai grammar rules. Some unique characteristics
were created while chatting.
Abstract: In this study, we propose a novel technique for acoustic
echo suppression (AES) during speech recognition under barge-in
conditions. Conventional AES methods based on spectral subtraction
apply fixed weights to the estimated echo path transfer function
(EPTF) at the current signal segment and to the EPTF estimated until
the previous time interval. However, the effects of echo path changes
should be considered for eliminating the undesired echoes. We
describe a new approach that adaptively updates weight parameters in
response to abrupt changes in the acoustic environment due to
background noises or double-talk. Furthermore, we devised a voice
activity detector and an initial time-delay estimator for barge-in speech
recognition in communication networks. The initial time delay is
estimated using a log-spectral distance measure, as well as
cross-correlation coefficients. The experimental results show that the
developed techniques can be successfully applied in barge-in speech
recognition systems.
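A generic sketch of spectral-subtraction echo suppression with a recursively smoothed EPTF whose weight adapts to the amount of change; this illustrates the general idea only and is not the authors' exact update rule, and all parameters are assumptions.

```python
import numpy as np

def aes_spectral_subtraction(far_end, mic, frame=256, alpha_min=0.05, alpha_max=0.9):
    """Generic spectral-subtraction AES sketch: the echo path transfer function
    (EPTF) magnitude is tracked recursively, and the smoothing weight alpha is
    raised when the instantaneous estimate deviates strongly from the running one
    (a crude proxy for echo-path changes or double-talk)."""
    n_bins = frame // 2 + 1
    eptf = np.zeros(n_bins)
    out = np.zeros_like(mic)
    for i in range(0, len(mic) - frame, frame):
        X = np.fft.rfft(far_end[i:i + frame])
        D = np.fft.rfft(mic[i:i + frame])
        inst = np.abs(D) / (np.abs(X) + 1e-8)          # instantaneous EPTF estimate
        change = np.mean(np.abs(inst - eptf)) / (np.mean(eptf) + 1e-8)
        alpha = np.clip(change, alpha_min, alpha_max)  # adapt the weight to the change
        eptf = alpha * inst + (1 - alpha) * eptf
        echo_mag = eptf * np.abs(X)
        gain = np.maximum(np.abs(D) - echo_mag, 0.0) / (np.abs(D) + 1e-8)
        out[i:i + frame] = np.fft.irfft(gain * D, n=frame)
    return out

# Toy usage with random stand-ins for the far-end and microphone signals
rng = np.random.default_rng(3)
far = rng.normal(size=16000)
mic = 0.3 * far + 0.05 * rng.normal(size=16000)        # echo plus a little noise
cleaned = aes_spectral_subtraction(far, mic)
```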
Abstract: In this paper, a Fuzzy C-Means clustering with
Expectation Maximization-Gaussian Mixture Model based hybrid
modeling algorithm is proposed for Continuous Tamil Speech
Recognition. Speech sentences from various speakers are used in
the training and testing phases, and objective measures are computed
between the proposed and existing Continuous Speech Recognition algorithms.
From the simulated results, it is observed that the proposed algorithm
improves the recognition accuracy and F-measure up to 3% as
compared to that of the existing algorithms for the speech signal from
various speakers. In addition, it reduces the Word Error Rate, Error
Rate and Error up to 4% as compared to that of the existing
algorithms. In all aspects, the proposed hybrid modeling for Tamil
speech recognition provides significant improvements for speech-to-text
conversion in various applications.
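A minimal sketch of one way to combine Fuzzy C-Means with an EM-trained GMM (FCM centres used to initialize the GMM means); the paper's exact hybrid formulation is not given in the abstract, so this is only an assumed illustration on hypothetical features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy C-means: returns the c cluster centres."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                       # fuzzy memberships
    for _ in range(n_iter):
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return centres

# Hypothetical MFCC-like feature matrix (frames x dims) for one acoustic class
X = np.random.randn(500, 13)
centres = fuzzy_cmeans(X, c=4)
gmm = GaussianMixture(n_components=4, means_init=centres).fit(X)  # EM refinement
```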
Abstract: In this paper, a Least Mean Square (LMS) adaptive
noise reduction algorithm is proposed to enhance the speech signal
from noisy speech. In this algorithm, the speech signal is enhanced by
varying the step size as a function of the input signal. Objective and
subjective measures are made under various noises for the proposed
and existing algorithms. From the experimental results, it is seen that
the proposed LMS adaptive noise reduction algorithm reduces the Mean
Squared Error (MSE) and Log Spectral Distance (LSD) as compared to
that of the earlier methods under various noise conditions with
different input SNR levels. In addition, the proposed algorithm
increases the Peak Signal to Noise Ratio (PSNR) and Segmental SNR
improvement (ΔSNRseg) values and improves the Mean Opinion Score
(MOS) as compared to that of the various existing LMS adaptive
noise reduction algorithms. From these experimental results, it is
observed that the proposed LMS adaptive noise reduction algorithm
reduces the speech distortion and residual noise as compared to that
of the existing methods.
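For illustration, a minimal normalized-LMS sketch, one well-known way of making the effective step size a function of the input signal; it is not the paper's specific step-size rule, and the filter order and step constant are assumptions.

```python
import numpy as np

def nlms_filter(d, x, order=8, mu=0.5, eps=1e-6):
    """Normalized LMS: the step size is scaled by the inverse input power per
    update, i.e. it varies as a function of the input. This is a generic sketch,
    not the paper's variable-step-size rule."""
    w = np.zeros(order)
    e = np.zeros(len(d))
    for n in range(order, len(d)):
        u = x[n - order:n][::-1]
        e[n] = d[n] - w @ u
        w += (mu / (eps + u @ u)) * e[n] * u   # input-dependent step size
    return e

# Toy usage: d = noisy speech, x = correlated noise reference (idealized here)
rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 150 * np.arange(8000) / 8000)
noise = rng.normal(scale=0.4, size=8000)
enhanced = nlms_filter(clean + noise, noise)
```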
Abstract: The paper presents combined automatic speech
recognition (ASR) of English and machine translation (MT) for
the English-Croatian and Croatian-English language pairs in the
domain of business correspondence. The first part presents results of
training a commercial ASR system on English data sets, enriched
by error analysis. The second part presents results of machine
translation performed by a free online tool for the English-Croatian
and Croatian-English language pairs. Human evaluation in terms of
usability is conducted, and internal consistency is calculated by
Cronbach's alpha coefficient, enriched by error analysis. Automatic
evaluation is performed by WER (Word Error Rate) and PER
(Position-independent word Error Rate) metrics, followed by
investigation of Pearson’s correlation with human evaluation.
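A minimal sketch of the two automatic metrics: WER via word-level edit distance, and a simple bag-of-words formulation of PER (PER definitions vary slightly across papers, so that part is an assumption). The example sentences are hypothetical.

```python
import numpy as np
from collections import Counter

def wer(ref, hyp):
    """Word Error Rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(r), len(h)] / len(r)

def per(ref, hyp):
    """Position-independent error rate: a simple bag-of-words formulation."""
    r, h = Counter(ref.split()), Counter(hyp.split())
    matches = sum((r & h).values())
    return (max(sum(r.values()), sum(h.values())) - matches) / sum(r.values())

print(wer("please send the invoice today", "please send invoice to day"))
print(per("please send the invoice today", "please send invoice to day"))
```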
Abstract: Frequent, continuous speech training has proven to be
a necessary part of a successful speech therapy process, but
constraints of traveling time and employment dispensation become
key obstacles especially for individuals living in remote areas or for
dependent children who have working parents. In order to ameliorate
speech difficulties with ample guidance from speech therapists, a
website has been developed that supports speech therapy and training
for people with articulation disorders in the standard Thai language.
This web-based program has the ability to record speech training
exercises for each speech trainee. The records will be stored in a
database for the speech therapist to investigate, evaluate, compare
and keep track of all trainees’ progress in detail. Speech trainees can
request live discussions via video conference call when needed.
Communication through this web-based program facilitates and
reduces training time in comparison to walk-in training or
appointments. This type of training also allows people with
articulation disorders to practice speech lessons whenever or
wherever is convenient for them, which can lead to a more regular
training process.
Abstract: This action research accentuates the outcome of a development in English pronunciation, using principles of phonetics for English major students at Loei Rajabhat University. The research is split into 5 separate modules: 1) Organs of Speech and How to Produce Sounds, 2) Monophthongs, 3) Diphthongs, 4) Consonant Sounds, and 5) Suprasegmental Features. Each module followed a 4-step action research process: 1) Planning, 2) Acting, 3) Observing, and 4) Reflecting. The research targeted 2nd-year students who were majoring in English Education at Loei Rajabhat University during the academic year of 2011. A mixed methodology employing both quantitative and qualitative research was used, which put theory into action, taking segmental features up to suprasegmental features. Multiple tools were employed which included the following documents: pre-test and post-test papers, evaluation and assessment papers, group work assessment forms, a presentation grading form, an observation of participants form and a participant self-reflection form.
All 5 modules for the target group showed that results from the post-tests were higher than those of the pre-tests, at the 0.01 level of statistical significance. All target groups attained results ranging from low to moderate and from moderate to high performance. The participants who attained low to moderate results had to re-sit the second round. During the first development stage, participants attended classes with group participation, in which they addressed planning through mutual co-operation and sharing of responsibility. Analytic induction of strong points for this operation illustrated that learner cognition, comprehension, application, and group practices were all present, whereas the weak results of some participants could be attributed to biological differences, differences in life and learning, or individual differences in responsiveness and self-discipline.
Participants who were required to be re-treated in Spiral 2 received the same treatment again. Results of tests from the 5 modules after the 2nd treatment were that the participants attained higher scores than those attained in the pre-test. Their assessment and development stages also showed improved results. They showed greater confidence at participating in activities, produced higher quality work, and correctly followed instructions for each activity. Analytic induction of strong and weak points for this operation remains the same as for Spiral 1, though there were improvements to problems which existed prior to undertaking the second treatment.
Abstract: The paper deals with a cross-gender and cross-linguistic comparison of the pitch characteristics of Tuvinian with two other Turkic languages - Uzbek and Azerbaijani, based on the results of statistical analysis of pitch parameter values and intonation patterns used by male and female speakers.
The main goal of our work is to obtain the ranges of pitch parameter values typical for Tuvinian speakers for the purpose of automatic language identification. We also propose a cross-gender analysis of declarative intonation in the poorly studied Tuvinian language.
The ranges of pitch parameter values were obtained by means of specially developed software that deals with the distribution of pitch values and allows us to obtain statistical language-specific pitch intervals.
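A minimal sketch of one way to derive a statistical pitch interval from F0 values (central percentiles over voiced frames); the paper's own software is not described in detail, so the formulation, percentile bounds, and toy data are assumptions.

```python
import numpy as np

def pitch_interval(f0_values, low=5, high=95):
    """Speaker/language pitch interval as central percentiles of voiced F0 values."""
    f0 = np.asarray(f0_values)
    f0 = f0[f0 > 0]                          # keep voiced frames only
    return np.percentile(f0, [low, high])

# Hypothetical per-speaker F0 track (Hz); zeros mark unvoiced frames
male_f0 = np.concatenate([np.zeros(50), np.random.normal(120, 20, 500)])
print(pitch_interval(male_f0))
```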
Abstract: We consider one of the biggest challenges in speech recognition: noise reduction. Traditionally, detected transient noise pulses are removed from the corrupted speech using pulse models. In this paper we propose to cope with the problem directly in the Dynamic Time Warping domain. A bidirectional Dynamic Time Warping algorithm for the recognition of isolated words impacted by transient noise pulses is proposed. It uses a simple transient noise pulse detector, employs bidirectional computation of dynamic time warping, and directly manipulates the warping results. Experimental investigation with several alternative solutions confirms the effectiveness of the proposed algorithm in reducing the impact of noise on the recognition process: a 3.9% increase in noisy speech recognition accuracy is achieved.
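For reference, a minimal sketch of classic (unidirectional) DTW template matching for isolated words; the bidirectional variant and the transient-pulse handling proposed in the paper are not reproduced here, and the feature sequences are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two feature sequences
    (frames x dims), as used for isolated-word template matching."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(test, templates):
    """Pick the template word with the smallest DTW distance to the test utterance."""
    return min(templates, key=lambda w: dtw_distance(test, templates[w]))

# Hypothetical MFCC-like feature sequences for two reference words and a test
templates = {"yes": np.random.randn(40, 13), "no": np.random.randn(35, 13)}
print(recognize(np.random.randn(38, 13), templates))
```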
Abstract: In this paper, a novel method for the detection of
clipping in speech signals is described. It is shown that the new
method has better performance than known clipping detection
methods, is easy to implement, and is robust to changes in signal
amplitude, size of data, etc. Statistical simulation results are
presented.
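A minimal sketch of a simple histogram-based clipping indicator, shown only to illustrate the task; it is not the paper's method, and the bin counts and test signal are assumptions.

```python
import numpy as np

def clipping_score(x, n_bins=200, edge_bins=2):
    """Simple histogram-based clipping indicator (not the paper's method): clipped
    signals concentrate probability mass in the outermost amplitude bins."""
    hist, _ = np.histogram(x, bins=n_bins, range=(x.min(), x.max()))
    edge_mass = hist[:edge_bins].sum() + hist[-edge_bins:].sum()
    return edge_mass / hist.sum()

# Toy check: hard-clip a wide-dynamic-range signal and compare scores
rng = np.random.default_rng(0)
clean = rng.normal(scale=0.3, size=16000)        # stand-in for an unclipped signal
clipped = np.clip(clean, -0.35, 0.35)
print(clipping_score(clean), clipping_score(clipped))
```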
Abstract: In this work, a method of time delay estimation for
dual-channel acoustic signals (speech, music, etc.) recorded under
reverberant conditions is investigated. Standard methods based on
cross-correlation of the signals show poor results in cases involving
strong reverberation, large distances between microphones and
asynchronous recordings. Under similar conditions, a method based
on cross-correlation of temporal envelopes of the signals delivers a
delay estimation of acceptable quality. This method and its properties
are described and investigated in detail, including its limits of
applicability. The method’s optimal parameter estimation and a
comparison with other known methods of time delay estimation are
also provided.
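A minimal sketch of delay estimation by cross-correlating temporal envelopes rather than raw waveforms; the envelope definition (Hilbert magnitude with moving-average smoothing) and the toy signals are assumptions.

```python
import numpy as np
from scipy.signal import hilbert, correlate

def envelope_delay(x, y, fs, smooth=0.01):
    """Estimate the delay of y relative to x (in seconds) by cross-correlating the
    smoothed temporal envelopes rather than the raw waveforms."""
    def envelope(s):
        env = np.abs(hilbert(s))                                  # analytic-signal magnitude
        win = int(smooth * fs)
        return np.convolve(env, np.ones(win) / win, mode="same")  # simple smoothing
    ex, ey = envelope(x), envelope(y)
    ex, ey = ex - ex.mean(), ey - ey.mean()
    corr = correlate(ey, ex, mode="full")
    lag = np.argmax(corr) - (len(ex) - 1)
    return lag / fs

# Toy usage: the same envelope-modulated noise, delayed by 50 ms in the second channel
fs, delay = 16000, int(0.05 * 16000)
rng = np.random.default_rng(2)
src = rng.normal(size=fs) * np.abs(np.sin(2 * np.pi * 3 * np.arange(fs) / fs))
ch1 = src
ch2 = np.concatenate([np.zeros(delay), src])[:fs]
print(envelope_delay(ch1, ch2, fs))   # should be close to +0.05 s
```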