Abstract: As a popular rank-reduced vector space approach,
Latent Semantic Indexing (LSI) has been used in information
retrieval and other applications. In this paper, an LSI-based content
vector model for text classification is presented, which constructs
multiple augmented category LSI spaces and classifies texts by their
content. The model integrates the class discriminative information
from the training data and is equipped with several pertinent feature
selection and text classification algorithms. The proposed classifier
has been applied to email classification and its experiments on a
benchmark spam testing corpus (PU1) have shown that the approach
represents a competitive alternative to other email classifiers based
on the well-known SVM and naïve Bayes algorithms.
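The rank-reduction step underlying LSI can be sketched with a truncated SVD. The following is a minimal illustration of plain latent-space cosine retrieval, not the paper's augmented category-space classifier; the matrix and all names are illustrative:

```python
import numpy as np

def lsi_space(term_doc, k):
    """Build a rank-k LSI space from a term-document matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k]          # term basis and singular values

def project(doc_vec, U_k, s_k):
    """Fold a document vector into the k-dimensional latent space."""
    return (doc_vec @ U_k) / s_k

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([[2., 0., 1.],
              [1., 0., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])
U_k, s_k = lsi_space(A, k=2)
q = project(np.array([2., 1., 0., 0.]), U_k, s_k)   # query folded into the space
docs = np.array([project(A[:, j], U_k, s_k) for j in range(A.shape[1])])
sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
```

A category-space variant of the kind the abstract describes would build one such space per class from its training documents and assign a text to the class whose space reconstructs it best.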
Abstract: In this paper, a new adaptive Fourier decomposition
(AFD) based time-frequency speech analysis approach is proposed.
Given that the fundamental frequency of speech signals often
undergoes fluctuation, classical short-time Fourier transform (STFT)
based spectrogram analysis suffers from the difficulty of window size
selection. AFD is a newly developed signal decomposition theory
designed to deal with time-varying non-stationary signals. Its
outstanding characteristic is that it provides an instantaneous frequency
for each decomposed component, which makes time-frequency analysis
easier. Experiments are conducted on a sample sentence from the
TIMIT Acoustic-Phonetic Continuous Speech Corpus. The results
show that the AFD based time-frequency distribution outperforms the
STFT based one.
Abstract: Software Architecture plays a key role in software development, but the absence of a formal description of Software Architecture causes various impediments in software development. To cope with these difficulties, an ontology has been used as an artifact. This paper proposes an ontology for Software Architectural design based on the IEEE model for architecture description and the Kruchten 4+1 model for viewpoint classification. For the categorization of styles and views, ISO/IEC 42010 has been used. A corpus-based method has been used to evaluate the ontology. The main aim of the proposed ontology is to classify and locate Software Architectural design information.
Abstract: 28 healthy adult Maradi bucks were used to evaluate
sperm production and sperm storage capacity in the breed. Daily
sperm production (DSP) averaged 0.55±0.05 × 10⁹, while the daily
sperm production/g (DSP/g) was 1.37±0.12 × 10⁷. Gonadal sperm
reserve was 1.99±0.18 × 10⁹, while the caput, upper corpus and lower
corpus averaged 0.58±0.04 × 10⁹, 0.36±0.02 × 10⁹ and 0.33±0.08 × 10⁹
respectively. The proximal cauda, mid cauda, distal cauda and ductus
deferens had values of 0.68±0.10 × 10⁹, 1.23±0.16 × 10⁹, 1.87±0. × 10⁹
and 0.17±0.05 × 10⁹ respectively. The relative contributions of
the respective epididymal sections and ductus deferens to the total
extragonadal sperm reserves were 11.11%, 6.89%, 6.32%, 13.03%,
23.56%, 35.82% and 3.26% respectively. Gonadal sperm reserves
were significantly higher (p < 0.05) compared to
mid cauda and distal cauda epididymal reserves.
Abstract: Text categorization - the assignment of natural language documents to one or more predefined categories based on their semantic content - is an important component in many information organization and management tasks. The performance of neural network learning is known to be sensitive to the initial weights and architecture. This paper discusses the use of multilayer neural network initialization with a decision tree classifier for improving text categorization accuracy. An adaptation of the algorithm is proposed in which a decision tree, from the root node to a final leaf, is used to initialize the multilayer neural network. The experimental evaluation demonstrates that this approach provides better classification accuracy on the Reuters-21578 corpus, one of the standard benchmarks for text categorization tasks. We present results comparing the accuracy of this approach with a multilayer neural network initialized by the traditional random method and with decision tree classifiers.
Abstract: This paper introduces a measure of similarity between
two clusterings of the same dataset produced by two different
algorithms, or even by the same algorithm (K-means, for instance, with
different initializations usually produces different results in clustering
the same dataset). We then apply the measure to calculate the
similarity between pairs of clusterings, with special interest directed
at comparing the similarity between various machine clusterings and
human clustering of datasets. The similarity measure thus can be used
to identify the best (in terms of most similar to human) clustering
algorithm for a specific problem at hand. Experimental results
pertaining to the text categorization problem of a Portuguese corpus
(wherein a translation-into-English approach is used) are presented, as well as results on the well-known benchmark IRIS dataset. The
significance and other potential applications of the proposed measure
are discussed.
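The measure's definition is not reproduced here, but the general idea of comparing two clusterings of the same dataset can be illustrated with a standard pair-counting similarity, the Rand index. This is a sketch of that classical measure, not the one proposed in the abstract:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Pair-counting similarity between two clusterings of the same items:
    the fraction of item pairs on which the clusterings agree
    (pair grouped together in both, or separated in both)."""
    assert len(labels_a) == len(labels_b)
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
    return agree / len(pairs)

# Two clusterings of six items, e.g. K-means runs with different seeds.
c1 = [0, 0, 1, 1, 2, 2]
c2 = [1, 1, 0, 0, 0, 2]   # the label values themselves don't matter
print(rand_index(c1, c1))  # identical clusterings -> 1.0
print(rand_index(c1, c2))  # -> 0.8
```

Any clustering, machine- or human-produced, reduces to such a label vector, which is what makes a measure of this kind applicable to the machine-versus-human comparisons described above.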
Abstract: This work concerns the evolution and the maintenance
of an ontological resource in relation to the evolution of the corpus
of texts from which it had been built.
The knowledge forming a text corpus, especially in dynamic domains,
is in continuous evolution. When a change in the corpus occurs, the
domain ontology must evolve accordingly. Most methods manage
ontology evolution independently of the corpus from which the ontology is
built; in addition, they treat evolution merely as a process of knowledge
addition, not considering other knowledge changes. We propose a
methodology for managing an evolving ontology from a text corpus
that evolves over time, while preserving the consistency and the
persistence of this ontology.
Our methodology is based on the changes made on the corpus to
reflect the evolution of the considered domain - augmented surgery
in our case. In this context, the results of text mining techniques,
as well as the ARCHONTE method slightly modified, are used to
support the evolution process.
Abstract: We report in this paper the model adopted by our
continuous speech recognition system for the Arabic language, SySRA,
and the results obtained so far. This system uses the Arabdic-10
database, a word corpus for the Arabic language
which was manually segmented. Phonetic decoding is performed
by an expert system whose knowledge base is expressed in the
form of production rules. This expert system transforms a vocal
signal into a phonetic lattice. The higher level of the system handles
the recognition of the lattice thus obtained, rendering it in
the form of written sentences (orthographic form). This level
first contains the lexical analyzer, which is none other than the
recognition module. We subjected this analyzer to a set of
spectrograms obtained by dictating a score of sentences in the Arabic
language. The recognition rate for these sentences is about 70%,
which is, to our knowledge, the best result for the recognition of the
Arabic language. The test set consists of twenty sentences from four
speakers who did not take part in the training.
Abstract: Term Extraction, a key data preparation step in Text
Mining, extracts the terms, i.e. relevant collocations of words,
attached to specific concepts (e.g. genetic-algorithms and decision-trees
are terms associated with the concept “Machine Learning”). In
this paper, the task of extracting interesting collocations is achieved
through a supervised learning algorithm, exploiting a few
collocations manually labelled as interesting/not interesting. From
these examples, the ROGER algorithm learns a numerical function,
inducing some ranking on the collocations. This ranking is optimized
using genetic algorithms, maximizing the trade-off between the false
positive and true positive rates (Area Under the ROC curve). This
approach uses a particular representation for the word collocations,
namely the vector of values corresponding to the standard statistical
interestingness measures attached to this collocation. As this
representation is general (across corpora and natural languages),
generality tests were performed by applying the ranking
function learned from an English corpus in Biology to a French
corpus of Curricula Vitae, and vice versa, showing a good
robustness of the approach compared to the state-of-the-art Support
Vector Machine (SVM).
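The quantity ROGER maximizes, the Area Under the ROC Curve, can be illustrated in a few lines. This is a generic AUC computation over scored, labelled examples, not the ROGER algorithm itself, and the scores below are made up for illustration:

```python
def auc(scores, labels):
    """AUC of a ranking: the probability that a randomly chosen positive
    example is ranked above a randomly chosen negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Scores assigned by some ranking function to labelled collocations
# (1 = interesting, 0 = not interesting); values are illustrative only.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1,   0,   1,   0,   0]
print(auc(scores, labels))  # -> 0.8333... (5 of 6 positive/negative pairs ordered correctly)
```

A genetic algorithm of the kind described would treat the parameters of the ranking function as the genome and use such an AUC value directly as the fitness to maximize.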
Abstract: In this paper, we present a novel statistical approach to
corpus-based speech synthesis. Classically, phonetic information is
defined and treated as an acoustic reference to be respected. In this
vein, many studies have addressed acoustic unit classification.
This type of classification separates units according to their
symbolic characteristics. Indeed, a target cost and a concatenation cost
are classically defined for unit selection.
In corpus-based speech synthesis systems, when using large text
corpora, cost functions have been limited to a juxtaposition of symbolic
criteria, and the acoustic information of units is not exploited in the
definition of the target cost.
In this manuscript, we take into consideration the phonetic
information of units, corresponding to acoustic information. This is realized
by defining a probabilistic linguistic bigram model
used for unit selection. The selected units are extracted from
the English TIMIT corpus.
Abstract: The Automatic Speech Recognition (ASR) applied to
Arabic language is a challenging task. This is mainly due to
language specificities that confront researchers with multiple
difficulties, such as insufficient linguistic resources and the very
limited number of available transcribed Arabic speech corpora. In
this paper, we are interested in the development of a HMM-based
ASR system for Standard Arabic (SA) language. Our fundamental
research goal is to select the most appropriate acoustic parameters
describing each audio frame, acoustic models and speech recognition
unit. To achieve this purpose, we analyze the effect of varying frame
windowing (size and period), acoustic parameter number resulting
from features extraction methods traditionally used in ASR, speech
recognition unit, Gaussian number per HMM state and number of
embedded re-estimations of the Baum-Welch Algorithm. To evaluate
the proposed ASR system, a multi-speaker SA connected-digits
corpus is collected, transcribed and used throughout all experiments.
A further evaluation is conducted on a speaker-independent continuous
SA speech corpus. The phoneme recognition rate is 94.02%, which is
relatively high compared with another ASR system
evaluated on the same corpus.
Abstract: In historical science and social science, the influence
of natural disasters upon society is a matter of great interest. In
recent years, archives of natural disasters have been compiled by
hand, but this is inefficient and wasteful. We therefore propose a
computer system to create a historical natural disaster archive. As
the target of this analysis, we consider newspaper articles. The news
articles are considered to be typical examples that capture the
temporal relations of events in a natural disaster. To perform this
analysis, we identify the occurrences in newspaper articles by
index entries, considering the events which are specific to natural
disasters, and show the temporal relations between natural disasters.
We designed and implemented an automatic system for the “extraction
of the occurrences of natural disasters” and a “temporal relation table
for natural disasters.”
Abstract: Oxyeleotris marmorata is considered an
undomesticated fish, and its culture occasionally faces the problem of
food deprivation. The present study aims to evaluate the alteration of
hematological indices and blood chemistry associated with liver function
during 4 weeks of fasting. Non-linear relationships between fasting
days and hematological parameters were demonstrated (red blood cell
number: y = -0.002x² + 0.041x + 1.249, R² = 0.915, P > 0.05; mean
corpuscular volume: y = -0.180x² + 2.183x + 149.61, R² = 0.732,
P > 0.05; mean corpuscular hemoglobin: y = -0.041x² + 0.862x + 29.864,
R² = 0.818, P > 0.05; and mean corpuscular hemoglobin concentration:
y = -0.044x² + 0.711x + 21.580, R² = 0.730, P > 0.05).
A significant change in hematocrit (Ht) during the fasting period was
observed. Ht increased sharply in the first week of the fasting
period. Elevated Ht was also detected during weeks 2-4 of the fasting period.
A significant reduction of the hepatosomatic index was observed
(y = -0.007x² - 0.096x + 1.414, R² = 0.968, P < 0.05; serum glutamic
oxaloacetic transaminase: y = 0.005x³ - 0.201x² + 1.297x + 33.256,
R² = 1, P < 0.05). Taken together, prolonged fasting has
deleterious effects on hematological indices, liver mass and enzymes
associated with liver function. The marked adverse effects occurred
after the first week of the fasting state.
Abstract: Accounts of language acquisition differ significantly in their treatment of the role of prediction in language learning. In particular, nativist accounts posit that probabilistic learning about words and word sequences has little to do with how children come to use language. The accuracy of this claim was examined by testing whether distributional probabilities and frequency contributed to how well 3-4 year olds repeat simple word chunks. Corresponding chunks were the same length, expressed similar content, and were all grammatically acceptable, yet the results of the study showed marked differences in performance when overall distributional frequency varied. A distributional model of language predicted the empirical findings better than a number of other models, replicating earlier findings and showing that children attend to distributional probabilities in an adult corpus. This suggests that language is more prediction-and-error based than based on the abstract rules that nativist accounts propose.
Abstract: Hypernetworks are a generalized graph structure
representing higher-order interactions between variables. We present a
method for self-organizing hypernetworks to learn an associative
memory of sentences and to recall the sentences from this memory.
This learning method is inspired by the “mental chemistry” model of
cognition and the “molecular self-assembly” technology in
biochemistry. Simulation experiments are performed on a corpus of
natural-language dialogues of approximately 300K sentences
collected from TV drama captions. We report on the sentence
completion performance as a function of the order of word-interaction
and the size of the learning corpus, and discuss the plausibility of this
architecture as a cognitive model of language learning and memory.
Abstract: Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is nowadays considered fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports on the development of a NER system for Bengali and Hindi using Support Vector Machines (SVM). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for Indian languages (ILs) is very new. The system makes use of the different contextual information of the words along with a variety of features that are helpful in predicting the four different named entity (NE) classes: Person name, Location name, Organization name and Miscellaneous name. We have used annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi tagged with the twelve different NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). In addition, we have manually annotated 150K wordforms of the Bengali news corpus, developed from the web archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm in order to generate lexical context patterns from a part of the unlabeled Bengali news corpus. Lexical patterns have been used as features of the SVM in order to improve the system performance. The NER system has been tested with gold standard test sets of 35K and 60K tokens for Bengali and Hindi, respectively. Evaluation results have demonstrated recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. Results show an improvement in the f-score by 5.13% with the use of context patterns.
A statistical analysis (ANOVA) is also performed to compare the performance of the proposed NER system with that of an existing HMM-based system for both languages.
Abstract: We present a method to create special domain
collections from news sites. The method only requires a single
sample article as a seed. No prior corpus statistics are needed and the
method is applicable to multiple languages. We examine various
similarity measures and the creation of document collections for
English and Japanese. The main contributions are as follows. First,
the algorithm can build special domain collections from as little as
one sample document. Second, unlike other algorithms it does not
require a second “general" corpus to compute statistics. Third, in our
testing the algorithm outperformed others in creating collections
made up of highly relevant articles.
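The accept/reject step of such a seed-based method can be sketched as follows. The cosine-over-term-frequency measure, whitespace tokenization, and fixed threshold are assumptions for illustration, not the specific measures the paper evaluates:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def grow_collection(seed_text, candidates, threshold=0.3):
    """Keep candidate articles similar enough to the single seed article.
    No second 'general' corpus is needed: only the seed's own term counts."""
    seed = Counter(seed_text.lower().split())
    return [c for c in candidates
            if cosine(seed, Counter(c.lower().split())) >= threshold]

seed = "volcano eruption ash evacuation"
candidates = [
    "ash cloud from the volcano eruption forces evacuation",
    "stock market closes higher on tech earnings",
]
print(grow_collection(seed, candidates))  # keeps only the volcano article
```

In a real crawler this filter would be applied repeatedly to newly fetched news articles, growing the special domain collection from the one seed.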
Abstract: The paper presents the design concept of a unit-selection
text-to-speech synthesis system for the Slovenian language.
Due to its modular and upgradable architecture, the system can be
used in a variety of speech user interface applications, ranging from
carrier-grade server voice portal applications and desktop user interfaces
to specialized embedded devices.
Since memory and processing power requirements are important
factors for a possible implementation in embedded devices, lexica
and speech corpora need to be reduced. We describe a simple and
efficient implementation of a greedy subset selection algorithm that
extracts a compact subset of high coverage text sentences. The
experiment on a reference text corpus showed that the subset
selection algorithm produced a compact sentence subset with a small
redundancy.
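A greedy subset selection of this kind can be sketched as a set-cover-style loop. The toy "units" below (letter pairs standing in for diphones) and the stopping rule are illustrative assumptions, not the paper's exact algorithm:

```python
def greedy_subset(sentences, units_of):
    """Greedily pick sentences until the selection covers every unit
    occurring in the full corpus, preferring at each step the sentence
    that adds the most still-uncovered units."""
    uncovered = set().union(*(units_of(s) for s in sentences))
    chosen = []
    while uncovered:
        best = max(sentences, key=lambda s: len(units_of(s) & uncovered))
        gain = units_of(best) & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen

# Toy example: "units" are distinct letter pairs (stand-ins for diphones).
def pairs(s):
    t = s.replace(" ", "")
    return {t[i:i + 2] for i in range(len(t) - 1)}

corpus = ["abab", "bc", "abc", "cd"]
subset = greedy_subset(corpus, pairs)  # -> ['abab', 'bc', 'cd']
```

The greedy choice is what keeps the selected subset compact: each kept sentence is justified by fresh coverage, so redundancy stays small even though the result is not guaranteed optimal.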
The adequacy of the spoken output was evaluated by several
subjective tests, as recommended by the International
Telecommunication Union (ITU).
Abstract: This paper deals with automatic sentence modality
recognition in French. In this work, only prosodic features are
considered. The sentences are recognized according to the three
following modalities: declarative, interrogative and exclamatory
sentences. This information will be used to animate a talking head for
deaf and hearing-impaired children. We first statistically study a real
radio corpus in order to assess the feasibility of the automatic
modeling of sentence types. Then, we test two sets of prosodic
features as well as two different classifiers and their combination. We
further focus our attention on question recognition, as this modality
is certainly the most important one for the target application.
Abstract: Concatenative speech synthesis is a method that can
produce speech with the naturalness and high individuality of a
speaker by drawing on a large speech corpus. Based on this method, in
this paper we propose a voice conversion method whose converted
speech has high individuality and naturalness. We also conducted
two subjective evaluation experiments to evaluate the individuality and
sound quality of the converted speech. From the results, the following three
facts were confirmed: (a) the proposed method can convert the
individuality of speakers well, (b) incorporating the unit selection
framework (especially the join cost) of concatenative speech synthesis into
conventional voice conversion improves the sound quality of the
converted speech, and (c) the proposed method is robust against a
difference in gender between the source speaker and the target speaker.