Abstract: Machine learning is a rapidly growing and exciting area of
artificial intelligence. It is a valuable, time- and cost-effective
approach, and it is not a narrow learning approach: it encompasses a
wide range of methods and techniques that can be applied to complex
real-world problems across many domains. Biological image classification,
adaptive testing, computer vision, natural language processing, object
detection, cancer detection, face recognition, handwriting
recognition, speech recognition, and many other applications of
machine learning are widely used in research, industry, and
government. Every day, more data are generated, and conventional
machine learning techniques are becoming obsolete as users move to
distributed and real-time operations. By providing fundamental
knowledge of machine learning tools and research opportunities in
the field, this article aims to serve as both a comprehensive
overview and a guide. This survey presents a diverse set of machine
learning resources and contrasts their key features.
Abstract: Speech to text in the Malay language is a system that converts Malay speech into text. Malay language recognition systems are still limited; thus, this paper aims to investigate the recognition performance of ten Malay words obtained from online Malay news. The methodology consists of three stages: preprocessing, feature extraction, and speech classification. In the preprocessing stage, the speech samples are filtered using pre-emphasis. After that, feature extraction is applied to the samples using the Mel Frequency Cepstrum Coefficient (MFCC). Lastly, speech classification is performed using a Feedforward Neural Network (FFNN). The accuracy of the classification is further investigated with respect to the hidden layer size. From experimentation, the classifier with 40 hidden neurons shows the highest classification rate, which is 94%.
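As a rough illustration of the three-stage pipeline (pre-emphasis, MFCC, FFNN), the following Python sketch uses common defaults (a 0.97 pre-emphasis coefficient, 13 coefficients, librosa and scikit-learn); the library choices and parameter values are assumptions, not taken from the paper:

```python
# A minimal sketch of the described pipeline: pre-emphasis -> MFCC -> FFNN.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def extract_features(signal, sr):
    # Stage 1: pre-emphasis filter y[n] = x[n] - 0.97*x[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Stage 2: MFCC features, averaged over frames to one vector per word
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

# Stage 3: feedforward network with 40 hidden neurons (the best size reported).
# X: one feature vector per utterance; y: one of the ten word labels.
def train_classifier(X, y):
    clf = MLPClassifier(hidden_layer_sizes=(40,), max_iter=2000)
    clf.fit(X, y)
    return clf
```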
Abstract: The separation of speech signals has become a research
hotspot in the field of signal processing in recent years. It has
many applications and influences in teleconferencing, hearing aids,
speech recognition of machines and so on. The sounds received are
usually noisy. The issue of identifying the sounds of interest and
obtaining clear sounds in such an environment becomes a problem
worth exploring, that is, the problem of blind source separation.
This paper focuses on the under-determined blind source separation
(UBSS). Sparse component analysis is generally used for the problem
of under-determined blind source separation. The method is mainly
divided into two parts. Firstly, the clustering algorithm is used to
estimate the mixing matrix according to the observed signals. Then
the signal is separated based on the known mixing matrix. In this
paper, the problem of mixing matrix estimation is studied. This paper
proposes an improved algorithm to estimate the mixing matrix for
speech signals in the UBSS model. The traditional potential-function
algorithm is not accurate for mixing matrix estimation, especially at
low signal-to-noise ratio (SNR). In response to this problem, this
paper adopts an improved potential function method to estimate the
mixing matrix. The algorithm not only avoids the influence of
insufficient prior information in traditional clustering algorithms,
but also improves the estimation accuracy of the mixing matrix. This
paper takes the mixing of four speech signals into two channels as
an example. The results of simulations show that the approach in this
paper not only improves the accuracy of estimation, but also applies
to any mixing matrix.
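For readers unfamiliar with sparse-component-based estimation, the following minimal Python sketch shows the general two-part scheme described above, using plain k-means clustering of high-energy time-frequency points instead of the paper's improved potential-function method; the energy threshold and library choices are illustrative assumptions:

```python
# Mixing matrix estimation by clustering sparse time-frequency points.
import numpy as np
from scipy.signal import stft
from sklearn.cluster import KMeans

def estimate_mixing_matrix(x, n_sources, fs=16000):
    # x: (2, T) two-channel mixture; at TF points where one source
    # dominates, [X1(t,f), X2(t,f)] points along a mixing matrix column.
    _, _, X1 = stft(x[0], fs=fs)
    _, _, X2 = stft(x[1], fs=fs)
    pts = np.vstack([X1.ravel().real, X2.ravel().real]).T
    # keep only high-energy points, where single-source dominance is likely
    norms = np.linalg.norm(pts, axis=1)
    pts = pts[norms > np.percentile(norms, 95)]
    # merge antipodal directions and normalise to the unit circle
    pts = pts * np.where(pts[:, :1] >= 0, 1.0, -1.0)
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    centers = KMeans(n_clusters=n_sources, n_init=10).fit(pts).cluster_centers_
    # each cluster centre estimates one column of the 2 x n_sources matrix
    return (centers / np.linalg.norm(centers, axis=1, keepdims=True)).T
```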
Abstract: In this paper, we propose an optimized brain computer
interface (BCI) system for unspoken speech recognition, based on
the fact that the construction of unspoken words relies strongly on the
Wernicke area, situated in the temporal lobe. Our BCI system has four
modules: (i) the EEG Acquisition module based on a non-invasive
headset with 14 electrodes; (ii) the Preprocessing module to remove
noise and artifacts, using the Common Average Reference method;
(iii) the Features Extraction module, using Wavelet Packet Transform
(WPT); (iv) the Classification module based on a one-hidden layer
artificial neural network. The present study consists of comparing
the recognition accuracy of 5 Arabic words, when using all the
headset electrodes or only the 4 electrodes situated near the Wernicke
area, as well as the selection effect of the subbands produced by
the WPT module. After applying the artificial neural network to the
produced database, we obtain, on the test dataset, an accuracy of
83.4% with all the electrodes and all the subbands of 8 levels of the
WPT decomposition. However, by using only the 4 electrodes near
Wernicke Area and the 6 middle subbands of the WPT, we obtain
a high reduction of the dataset size, equal to approximately 19% of
the total dataset, with a 67.5% accuracy rate. This reduction is
particularly important for the design of a low-cost, simple-to-use
BCI trained for several words.
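The following minimal Python sketch illustrates the preprocessing and feature-extraction modules (Common Average Reference, then wavelet packet subband energies); the 'db4' wavelet and the log-energy feature are assumptions, not the paper's exact settings:

```python
# CAR preprocessing and wavelet-packet subband features for EEG.
import numpy as np
import pywt

def common_average_reference(eeg):
    # eeg: (channels, samples); subtract the instantaneous mean of all
    # channels from each channel to remove common-mode noise.
    return eeg - eeg.mean(axis=0, keepdims=True)

def wpt_features(channel, level=8):
    # decompose one channel into 2**level subbands and keep the log
    # energy of each as a feature (the paper later selects subbands).
    wp = pywt.WaveletPacket(data=channel, wavelet='db4', maxlevel=level)
    nodes = wp.get_level(level, order='freq')
    return np.array([np.log(np.sum(n.data ** 2) + 1e-12) for n in nodes])
```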
Abstract: The paper presents the statement of the automatic speech
recognition problem, the tasks of speech recognition, and its
application fields. The principles of building a speech recognition
system for Azerbaijani speech, as well as the problems arising in
such a system, are investigated. Algorithms for computing speech features, which form the main part
of a speech recognition system, are analyzed. From this point of view,
algorithms for determining the Mel Frequency Cepstral Coefficients
(MFCC) and Linear Predictive Coding (LPC) coefficients that express
the basic speech features are developed. Combined use of MFCC and
LPC cepstra is suggested to improve the reliability of the speech
recognition system. To this end, the recognition system is divided
into MFCC- and LPC-based recognition subsystems. The training and
recognition processes are carried out separately in both subsystems,
and the system accepts a decision only when both subsystems return
the same result. This decreases the
error rate during recognition. The training and recognition processes in the automatic speech
recognition system are realized by artificial neural networks trained
with the conjugate gradient method. The paper investigates the
problems that the number of speech features poses when training the
neural networks of the MFCC- and LPC-based recognition
subsystems. The variability of results of neural networks trained from different
initial points is analyzed. A methodology for the combined use of
neural networks trained from different initial points is suggested to
improve the reliability of the recognition system and increase
recognition quality, and the obtained practical results are shown.
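A minimal sketch of the combined-decision idea, assuming two already-trained classifiers with a scikit-learn-style predict method; the feature settings (13 MFCCs, LPC order 12) are common defaults rather than the paper's:

```python
# MFCC-based and LPC-based subsystems each classify the utterance; a
# result is accepted only when the two decisions agree.
import numpy as np
import librosa

def mfcc_features(y, sr):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

def lpc_features(y, order=12):
    return librosa.lpc(y, order=order)[1:]  # drop the leading 1.0

def combined_decision(y, sr, mfcc_net, lpc_net):
    # mfcc_net / lpc_net: any trained classifiers with a .predict method
    w1 = mfcc_net.predict([mfcc_features(y, sr)])[0]
    w2 = lpc_net.predict([lpc_features(y)])[0]
    # output a word only when both subsystems give the same result,
    # otherwise reject; this lowers the recognition error rate
    return w1 if w1 == w2 else None
```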
Abstract: In this paper, we present a wavelet coefficients masking
based on Local Binary Patterns (WLBP) approach to enhance the
temporal spectra of the wavelet coefficients for speech enhancement.
This technique exploits the wavelet denoising scheme, which splits
the degraded speech into pyramidal subband components and extracts
frequency information without losing temporal information. Speech
enhancement in each high-frequency subband is performed by binary
labels through the local binary pattern masking that encodes the ratio
between the original value of each coefficient and the values of the
neighbour coefficients. This approach enhances the high-frequency
spectra of the wavelet transform instead of eliminating them through
a threshold. A comparative analysis is carried out with conventional
speech enhancement algorithms, demonstrating that the proposed
technique achieves significant improvements in terms of PESQ, an
international recommendation for objectively estimating subjective
speech quality. Informal listening tests also show that the proposed
method improves the quality of speech in an acoustic context,
avoiding the annoying musical noise present in other
speech enhancement techniques. Experimental results obtained with a
DNN based speech recognizer in noisy environments corroborate the
superiority of the proposed scheme in the robust speech recognition
scenario.
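A loose, simplified Python interpretation of the masking idea (compare each detail coefficient with its neighbours and attenuate, rather than zero, the non-dominant ones); the radius, wavelet, and attenuation factor are invented for illustration and are not the paper's parameters:

```python
# LBP-style masking on wavelet detail coefficients instead of thresholding.
import numpy as np
import pywt

def lbp_mask_subband(d, radius=2, atten=0.3):
    mag = np.abs(d)
    padded = np.pad(mag, radius, mode='edge')
    # count how many of the 2*radius neighbours each coefficient exceeds
    wins = np.zeros_like(mag)
    for k in range(1, radius + 1):
        wins += (mag >= padded[radius - k:radius - k + len(d)])
        wins += (mag >= padded[radius + k:radius + k + len(d)])
    # keep locally dominant coefficients, attenuate (not zero) the rest
    keep = wins >= radius
    return np.where(keep, d, atten * d)

def wlbp_enhance(noisy, wavelet='db8', level=4):
    coeffs = pywt.wavedec(noisy, wavelet, level=level)
    # apply the mask to every high-frequency (detail) subband
    coeffs[1:] = [lbp_mask_subband(d) for d in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)
```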
Abstract: Speech recognition makes an important contribution to promoting new technologies for human-computer interaction. Today, there is a growing need to employ speech technology in daily life and business activities. However, speech recognition is a challenging task that requires different stages before obtaining the desired output. Among the components of automatic speech recognition (ASR) is the feature extraction process, which parameterizes the speech signal to produce the corresponding feature vectors. The feature extraction process aims at approximating the linguistic content conveyed by the input speech signal. In the speech processing field, there are several methods to extract speech features; however, Mel Frequency Cepstral Coefficients (MFCC) is the most popular technique. It has long been observed that MFCC is dominantly used in well-known recognizers such as the Carnegie Mellon University (CMU) Sphinx and the Hidden Markov Model Toolkit (HTK). Hence, this paper focuses on the MFCC method as the standard choice to identify the different speech segments in order to obtain the language phonemes for further training and decoding steps. Due to the good performance of MFCC, previous studies show that it dominates Arabic ASR research. In this paper, we demonstrate MFCC as well as the intermediate steps that are performed to obtain these coefficients using the HTK toolkit.
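The intermediate steps mentioned above can be summarized in a compact numpy sketch (pre-emphasis, framing, windowing, FFT, mel filterbank, log, DCT); the frame sizes and filter counts below are common defaults, not HTK's exact configuration:

```python
# The classic MFCC pipeline, step by step.
import numpy as np
from scipy.fft import rfft, dct

def mfcc(signal, sr, frame_len=400, hop=160, nfft=512, n_filters=26, n_ceps=13):
    # pre-emphasis: boost high frequencies
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # framing and Hamming windowing
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # power spectrum of each frame
    power = np.abs(rfft(frames, n=nfft)) ** 2 / nfft
    # triangular mel-spaced filterbank from 0 Hz to the Nyquist frequency
    mel = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((nfft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # log filterbank energies, then DCT to decorrelate -> cepstral coefficients
    logfb = np.log(power @ fbank.T + 1e-12)
    return dct(logfb, type=2, axis=1, norm='ortho')[:, :n_ceps]
```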
Abstract: Recently, Automatic Speech Recognition (ASR) systems have been used to assist children in language acquisition, as they have the ability to detect human speech signals. Despite the benefits offered by ASR systems, there is a lack of ASR systems for Malay-speaking children. One of the contributing factors is the lack of a continuous speech database for the target users. Though cross-lingual adaptation is a common solution for developing ASR systems for under-resourced languages, it is not viable for children, as there are very limited speech databases to serve as a source model. In this research, we propose a two-stage adaptation for the development of an ASR system for Malay-speaking children using a very limited database. The two-stage adaptation comprises cross-lingual adaptation (first stage) and cross-age adaptation (second stage). In the first stage, a well-known speech database that is phonetically rich and balanced is adapted to a medium-sized Malay adult speech database using supervised MLLR. The second-stage adaptation uses the acoustic model generated by the first adaptation, and the target database is a small database of the target users. We have measured the performance of the proposed technique using the word error rate and compared it with a conventional benchmark adaptation. The two-stage adaptation proposed in this research has better recognition accuracy than the benchmark adaptation in recognizing children’s speech.
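As a heavily simplified illustration of the MLLR idea used at each stage, the sketch below estimates a global affine transform of the Gaussian means by least squares; real MLLR weights the statistics by state occupancies and covariances, which is omitted here:

```python
# Simplified global MLLR mean adaptation: find (A, b) moving the source
# model's Gaussian means towards the adaptation data.
import numpy as np

def mllr_mean_transform(means, targets):
    # means:   (N, d) Gaussian means of the source acoustic model
    # targets: (N, d) mean of adaptation frames aligned to each Gaussian
    N, d = means.shape
    xi = np.hstack([np.ones((N, 1)), means])          # extended mean [1; mu]
    W, *_ = np.linalg.lstsq(xi, targets, rcond=None)  # solves xi @ W ~ targets
    return W                                          # (d+1, d): row 0 is b, rest A

def adapt_means(means, W):
    xi = np.hstack([np.ones((len(means), 1)), means])
    return xi @ W   # adapted means mu' = A mu + b
```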
Abstract: Over the past few years, a lot of research has been
conducted to bring Automatic Speech Recognition (ASR) into various
areas of Air Traffic Control (ATC), such as air traffic control
simulation and training, monitoring live operators with the aim of
improving safety, measuring air traffic controller workload, and
analyzing large quantities of controller-pilot speech.
Due to the high accuracy requirements of the ATC context and its
unique challenges, automatic speech recognition has not been widely
adopted in this field. With the aim of providing a good starting
point for researchers who are interested in bringing automatic speech
recognition into ATC, this paper gives an overview of possibilities
and challenges of applying automatic speech recognition in air traffic
control. To provide this overview, we present an updated literature
review of speech recognition technologies in general, as well as
specific approaches relevant to the ATC context. Based on this
literature review, criteria for selecting speech recognition approaches
for the ATC domain are presented, and remaining challenges and
possible solutions are discussed.
Abstract: This research presents a retrospective study of speech
recognition systems and artificial intelligence.
Speech recognition has become one of the most widely used technologies,
as it offers great opportunity to interact and communicate with
automated machines. Precisely, it can be affirmed that speech
recognition facilitates its users and helps them to perform their daily
routine tasks, in a more convenient and effective manner. This
research intends to present the illustration of recent technological
advancements, which are associated with artificial intelligence.
Recent research has revealed that the decoding of speech is the
foremost issue in speech recognition. In
order to overcome these issues, different statistical models were
developed by the researchers. Some of the most prominent statistical
models include acoustic model (AM), language model (LM), lexicon
model, and hidden Markov models (HMM). The research will help in
understanding all of these statistical models of speech recognition.
Researchers have also formulated different decoding methods, which
are being utilized for realistic decoding tasks and constrained
artificial languages. These decoding methods include pattern
recognition, acoustic-phonetic, and artificial intelligence
approaches. Artificial intelligence has been recognized as the most
efficient and reliable of the methods used in speech recognition.
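Of the statistical models listed above, the HMM is the easiest to illustrate compactly; the following sketch of the standard forward algorithm shows how such a model scores an observation sequence (the matrices here are generic placeholders, not tied to any system in the paper):

```python
# HMM forward algorithm: likelihood of an observation sequence given
# transitions A, emissions B and initial distribution pi.
import numpy as np

def forward_likelihood(pi, A, B, obs):
    # pi: (S,), A: (S, S), B: (S, V), obs: sequence of symbol indices
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # predict, then weight by emission
    return alpha.sum()                  # P(obs | model)
```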
Abstract: The performance and analysis of a speech recognition
system are illustrated in this paper. An approach to recognizing the
English words corresponding to the digits (0-9), spoken by 2 different
speakers and recorded in a noise-free environment, is presented. For
feature extraction, Mel frequency cepstral coefficients (MFCC) have
been used, which give a set of feature vectors from the recorded
speech samples. A neural network model is used to enhance the
recognition performance: a feedforward neural network trained with
the backpropagation algorithm. Other speech recognition techniques,
such as HMM and DTW, also exist. All experiments are carried out in
MATLAB.
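A minimal numpy sketch of one training step of the feedforward-with-backpropagation setup described above (one hidden layer, sigmoid units, squared error); the layer sizes and learning rate are illustrative, not the paper's settings:

```python
# One backpropagation step for a one-hidden-layer feedforward network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, W2, x, t, lr=0.1):
    # forward pass: input x -> hidden h -> output y
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    # backward pass: squared-error gradients through the sigmoids
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2
```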
Abstract: The 3D body movement signals captured during
human-human conversation include clues not only to the content of
people’s communication but also to their culture and personality.
This paper is concerned with automatic extraction of this information
from body movement signals. For the purpose of this research, we
collected a novel corpus from 27 subjects, arranged into groups
according to their culture. Each group was arranged into pairs, and
each pair communicated about different topics.
A state-of-the-art recognition system is applied to the problems of
person, culture, and topic recognition. We borrowed modeling,
classification, and normalization techniques from speech recognition.
We used Gaussian Mixture Modeling (GMM) as the main technique
for building our three systems, obtaining 77.78%, 55.47%, and
39.06% from the person, culture, and topic recognition systems
respectively. In addition, we combined the above GMM systems with
Support Vector Machines (SVM) to obtain 85.42%, 62.50%, and
40.63% accuracy for person, culture, and topic recognition
respectively.
Although direct comparison among these three recognition
systems is difficult, it seems that our person recognition system
performs best for both GMM and GMM-SVM, suggesting that
inter-subject differences (i.e., subjects’ personality traits) are a major
source of variation. When removing these traits from culture and
topic recognition systems using the Nuisance Attribute Projection
(NAP) and the Intersession Variability Compensation (ISVC)
techniques, we obtained 73.44% and 46.09% accuracy from culture
and topic recognition systems respectively.
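A minimal sketch of a GMM back-end of the kind described: one mixture model per class (person, culture, or topic), with a test sequence of movement feature frames assigned to the class whose model scores it highest; the component count is an assumption:

```python
# GMM classification back-end: one GaussianMixture per class label.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(frames_per_class, n_components=16):
    # frames_per_class: dict label -> (n_frames, dim) training frames
    return {label: GaussianMixture(n_components).fit(X)
            for label, X in frames_per_class.items()}

def classify(gmms, test_frames):
    scores = {label: g.score(test_frames)  # mean log-likelihood per frame
              for label, g in gmms.items()}
    return max(scores, key=scores.get)
```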
Abstract: Speech enhancement is a long-standing problem with
numerous applications such as teleconferencing, VoIP, hearing aids, and
speech recognition. The motivation behind this research work is to
obtain a clean speech signal of higher quality by applying the optimal
noise cancellation technique. Real-time adaptive filtering algorithms
seem to be the best candidates among all categories of speech
enhancement methods. In this paper, we propose a speech
enhancement method based on Recursive Least Squares (RLS)
adaptive filtering of speech signals. Experiments were performed on
noisy data which was prepared by adding AWGN, Babble and Pink
noise to clean speech samples at -5dB, 0dB, 5dB and 10dB SNR
levels. We then compare the noise cancellation performance of the
proposed RLS algorithm with the existing NLMS algorithm in terms of
Mean Squared Error (MSE), Signal-to-Noise Ratio (SNR), and SNR
Loss. Based on the performance evaluation, the proposed RLS
algorithm was found to be the better noise cancellation
technique for speech signals.
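For reference, a standard RLS adaptive noise canceller looks roughly as follows; the filter order, forgetting factor, and initialization below are typical textbook values, not the paper's settings:

```python
# RLS adaptive noise canceller: a reference noise input is filtered to
# match the noise in the primary (noisy speech) channel; the error
# signal is the enhanced speech.
import numpy as np

def rls_enhance(primary, reference, order=16, lam=0.999, delta=0.01):
    # primary and reference are assumed equal-length 1-D signals
    w = np.zeros(order)
    P = np.eye(order) / delta
    out = np.zeros(len(primary))
    for n in range(order, len(primary)):
        u = reference[n - order:n][::-1]          # reference tap vector
        k = P @ u / (lam + u @ P @ u)             # gain vector
        e = primary[n] - w @ u                    # error = enhanced speech
        w = w + k * e                             # weight update
        P = (P - np.outer(k, u @ P)) / lam        # inverse-correlation update
        out[n] = e
    return out
```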
Abstract: In this study, we propose a novel technique for acoustic
echo suppression (AES) during speech recognition under barge-in
conditions. Conventional AES methods based on spectral subtraction
apply fixed weights to the estimated echo path transfer function
(EPTF) at the current signal segment and to the EPTF estimated until
the previous time interval. However, the effects of echo path changes
should be considered for eliminating the undesired echoes. We
describe a new approach that adaptively updates weight parameters in
response to abrupt changes in the acoustic environment due to
background noises or double-talk. Furthermore, we devised a voice
activity detector and an initial time-delay estimator for barge-in speech
recognition in communication networks. The initial time delay is
estimated using a log-spectral distance measure, as well as
cross-correlation coefficients. The experimental results show that the
developed techniques can be successfully applied in barge-in speech
recognition systems.
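The cross-correlation part of the initial time-delay estimation can be sketched as follows (the log-spectral distance component is omitted, and equal-length signals are assumed):

```python
# Estimate the echo delay as the lag maximising the normalised
# cross-correlation between the far-end signal and the microphone signal.
import numpy as np

def estimate_delay(far_end, mic, max_delay):
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_delay):
        seg = mic[lag:lag + len(far_end) - max_delay]
        ref = far_end[:len(seg)]
        c = np.dot(seg, ref) / (np.linalg.norm(seg) * np.linalg.norm(ref) + 1e-12)
        if c > best_corr:
            best_lag, best_corr = lag, c
    return best_lag
```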
Abstract: In this paper, a hybrid modeling algorithm based on Fuzzy
C-Means clustering with an Expectation Maximization-Gaussian
Mixture Model is proposed for Continuous Tamil Speech
Recognition. Speech sentences from various speakers are used in the
training and testing phases, and objective measures are compared
between the proposed and existing Continuous Speech Recognition algorithms.
From the simulated results, it is observed that the proposed algorithm
improves the recognition accuracy and F-measure by up to 3% as
compared to that of the existing algorithms for the speech signal from
various speakers. In addition, it reduces the Word Error Rate and
related error measures by up to 4% as compared to the existing
algorithms. In all aspects, the proposed hybrid modeling for Tamil
speech recognition provides significant improvements for
speech-to-text conversion in various applications.
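A minimal numpy sketch of the Fuzzy C-Means step of such a hybrid model, whose soft clusters could then initialize the EM-GMM stage; the fuzzifier m = 2 is a common default, not the paper's value:

```python
# Fuzzy C-Means: alternate updates of soft memberships and centres.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100):
    N = len(X)
    U = np.random.dirichlet(np.ones(c), size=N)     # soft memberships (N, c)
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # standard FCM membership update: u_ik proportional to d_ik^(-2/(m-1))
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U
```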
Abstract: The paper presents combined automatic speech
recognition (ASR) of English and machine translation (MT) for
the English-Croatian and Croatian-English language pairs in the
domain of business correspondence. The first part presents results of
training the commercial ASR system on English data sets, enriched
by error analysis. The second part presents results of machine
translation performed by a free online tool for the English-Croatian
and Croatian-English language pairs. Human evaluation in terms of
usability is conducted, and internal consistency is calculated using
Cronbach's alpha coefficient, enriched by error analysis. Automatic
evaluation is performed by WER (Word Error Rate) and PER
(Position-independent word Error Rate) metrics, followed by
investigation of Pearson’s correlation with human evaluation.
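The WER metric used in the automatic evaluation is the word-level Levenshtein distance normalized by the reference length, for example:

```python
# Word Error Rate via dynamic-programming edit distance.
import numpy as np

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    D = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    D[:, 0] = np.arange(len(r) + 1)
    D[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = D[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            D[i, j] = min(sub, D[i - 1, j] + 1, D[i, j - 1] + 1)  # sub/del/ins
    return D[len(r), len(h)] / max(len(r), 1)
```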
Abstract: We consider the biggest challenge in speech recognition – noise reduction. Traditionally, detected transient noise pulses are removed, together with the corrupted speech, using pulse models. In this paper we propose to cope with the problem directly in the Dynamic Time Warping domain. A Bidirectional Dynamic Time Warping algorithm for the recognition of isolated words impacted by transient noise pulses is proposed. It uses a simple transient noise pulse detector, employs bidirectional computation of dynamic time warping, and directly manipulates the warping results. Experimental investigation with several alternative solutions confirms the effectiveness of the proposed algorithm in reducing the impact of noise on the recognition process – a 3.9% increase in noisy speech recognition accuracy is achieved.
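For context, standard (unidirectional) DTW, which the proposed bidirectional variant computes from both ends of the utterance, can be sketched as follows; Euclidean frame distance is an assumption:

```python
# Classic dynamic time warping distance between two feature sequences.
import numpy as np

def dtw_distance(A, B):
    # A: (n, d), B: (m, d) sequences of feature frames (e.g. MFCC)
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```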
Abstract: A robust still image face localization algorithm
capable of operating in an unconstrained visual environment is
proposed. First, construction of a robust skin classifier within a
shifted HSV color space is described. Then various filtering
operations are performed to better isolate face candidates and
mitigate the effect of substantial non-skin regions. Finally, a novel
Bhattacharyya-based face detection algorithm is used to compare
candidate regions of interest with a unique illumination-dependent
face model probability distribution function approximation.
Experimental results show a 90% face detection success rate despite
the demands of the visually noisy environment.
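The Bhattacharyya comparison at the core of the detector measures the overlap between a candidate region's distribution and the face-model distribution, e.g.:

```python
# Bhattacharyya coefficient/distance between two histograms.
import numpy as np

def bhattacharyya_coefficient(p, q):
    # p, q: histograms over the same bins; normalise to distributions
    p = p / (p.sum() + 1e-12)
    q = q / (q.sum() + 1e-12)
    return np.sum(np.sqrt(p * q))   # 1.0 means identical distributions

def bhattacharyya_distance(p, q):
    return -np.log(bhattacharyya_coefficient(p, q) + 1e-12)
```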
Abstract: The goal of speech parameterization is to extract the relevant information about what is being spoken from the audio signal. In speech recognition systems, Mel-Frequency Cepstral Coefficients (MFCC) and Relative Spectral Mel-Frequency Cepstral Coefficients (RASTA-MFCC) are the two main techniques used. This paper presents some modifications to the original MFCC method. In our work, the effectiveness of the proposed changes to MFCC, called Modified Function Cepstral Coefficients (MODFCC), was tested and compared against the original MFCC and RASTA-MFCC features. Prosodic features such as jitter and shimmer are added to the baseline spectral features. The above-mentioned techniques were tested on impulsive signals under various noisy conditions within the AURORA databases.
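The jitter and shimmer features added to the spectral baseline are simple relative-variation measures over pitch periods and peak amplitudes, e.g. (assuming the period and amplitude tracks have already been extracted):

```python
# Local jitter and shimmer: mean relative change between consecutive
# pitch periods / peak amplitudes.
import numpy as np

def jitter(periods):
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer(amplitudes):
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
```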
Abstract: This paper describes a 3D modeling system in
Augmented Reality environment, named 3DARModeler. It can be
considered a simple version of 3D Studio Max with necessary
functions for a modeling system such as creating objects, applying
texture, adding animation, estimating real light sources and casting
shadows. The 3DARModeler introduces convenient and effective
human-computer interaction to build 3D models by combining both
the traditional input method (mouse/keyboard) and the tangible input
method (markers). It has the ability to align a new virtual object with
the existing parts of a model. The 3DARModeler targets non-technical
users, who do not need much knowledge of
computer graphics and modeling techniques. All they have to do is
select basic objects, customize their attributes, and put them together
to build a 3D model in a simple and intuitive way, as if they were
doing so in the real world. Using the hierarchical modeling technique,
the users are able to group several basic objects to manage them as a
unified, complex object. The system can also connect with other 3D
systems by importing and exporting VRML/3ds Max files. A
module of speech recognition is included in the system to provide
flexible user interfaces.
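As a minimal illustration of the hierarchical modeling idea (grouped objects managed as one unified, complex object), a scene-graph node might compose transforms as follows; this is a generic sketch, not 3DARModeler's actual implementation:

```python
# Scene-graph grouping: a transform on a group node propagates to all
# of its children, so the group behaves as a single complex object.
import numpy as np

class SceneNode:
    def __init__(self, name, local=None):
        self.name = name
        self.local = np.eye(4) if local is None else local  # 4x4 transform
        self.children = []

    def group(self, *nodes):
        self.children.extend(nodes)

    def world_transforms(self, parent=np.eye(4)):
        # each node's world transform composes its parent's with its own
        world = parent @ self.local
        yield self.name, world
        for child in self.children:
            yield from child.world_transforms(world)
```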