Abstract: Sentiment analysis (SA) has received growing
attention in Arabic language research. However, few studies have
applied SA directly to Arabic, owing to the lack of a publicly available
dataset for this language. This paper partially bridges this gap by
focusing on one Arabic dialect, the Saudi dialect. It presents an
annotated data set of 4700 for Saudi dialect sentiment
analysis with (K = 0.807). Our future work is to extend this corpus and
create a large-scale lexicon for the Saudi dialect from it.
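The abstract does not define the agreement figure K. Assuming it denotes Cohen's kappa between two annotators, the computation could be sketched as follows. The function name `cohen_kappa` and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy annotations (not from the corpus).
ann_1 = ["pos", "neg", "neg", "pos", "neu", "pos"]
ann_2 = ["pos", "neg", "pos", "pos", "neu", "neg"]
kappa = cohen_kappa(ann_1, ann_2)
```

Values close to 1 indicate agreement well above chance; values near 0 indicate agreement no better than chance.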
Abstract: Let us consider that the entire universe is composed of
a single hydrogen atom within which the electron is moving around
the proton. In this case, according to classical theories of physics,
radiation (that is, photons) should be absorbed by the electron.
The number of photons absorbed determines the electron's radius of
rotation around the proton. Until now, the principle of
photon absorption by electrons and the electron's transition to a new
energy level, namely to a larger radius of rotation around the proton,
has not been clarified in physics. This paper aims to demonstrate that
radiation (that is, photons) has mass and a negative electrostatic
charge, similar to electrons but infinitely smaller. The experiments
which demonstrate this theory are simple: thermal expansion,
photoelectric effect and thermonuclear reaction.
Abstract: The growth over centuries in the volume of text data such as
books and articles in libraries has made it necessary to establish
effective mechanisms to locate them. Early techniques such as
abstraction, indexing and the use of classification categories
marked the birth of a new field of research called "Information
Retrieval". Information Retrieval (IR) can be defined as the task of
defining models and systems whose purpose is to facilitate access to
a set of documents in electronic form (corpus), allowing users to find
those relevant to them, that is to say, the content that matches
their information needs. This paper presents a new
semantic indexing approach for a document corpus. The indexing
process starts with a term-weighting phase to determine the
importance of terms in the documents. Then the use of a
thesaurus such as WordNet allows moving to the conceptual level.
Each candidate concept is evaluated by determining how well it
represents the document, that is to say, the importance of the
concept in relation to the other concepts of the document. Finally, the
semantic index is constructed by attaching to each concept of the
ontology the corpus documents in which it is
found.
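The pipeline described above (term weighting, mapping terms to concepts, scoring concepts per document, attaching documents to concepts) can be sketched roughly as below. This is only an illustration under stated assumptions: the abstract does not name the weighting scheme, so TF-IDF is assumed; the dictionary `TERM_TO_CONCEPT` is a hypothetical stand-in for a thesaurus lookup; and a concept's score is simply taken as the sum of the weights of its terms.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a thesaurus lookup (term -> concept).
TERM_TO_CONCEPT = {
    "car": "vehicle", "truck": "vehicle",
    "apple": "fruit", "pear": "fruit",
}

def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            t: (count / len(tokens)) * math.log(n_docs / df[t])
            for t, count in tf.items()
        }
    return weights

def semantic_index(docs, min_score=0.0):
    """Map each concept to the documents in which it scores above min_score."""
    index = defaultdict(dict)
    for doc_id, term_weights in tf_idf(docs).items():
        # Assumed concept score: sum of the weights of the terms mapped to it.
        concept_scores = defaultdict(float)
        for term, w in term_weights.items():
            concept = TERM_TO_CONCEPT.get(term)
            if concept is not None:
                concept_scores[concept] += w
        for concept, score in concept_scores.items():
            if score > min_score:
                index[concept][doc_id] = score
    return dict(index)

# Toy corpus of already tokenized documents.
docs = {
    "d1": ["car", "truck", "road"],
    "d2": ["apple", "pear", "car"],
    "d3": ["road", "map"],
}
index = semantic_index(docs)
```

Here `index["vehicle"]` lists both `d1` and `d2`, with `d1` scoring higher because two of its terms map to that concept.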
Abstract: OPEN_EmoRec_II is an open multimodal corpus with
experimentally induced emotions. In the first half of the experiment,
emotions were induced with standardized picture material and in the
second half during a human-computer interaction (HCI), realized
with a wizard-of-oz design. The induced emotions are based on the
dimensional theory of emotions (valence, arousal and dominance).
With these emotional sequences, recorded as multimodal data (facial
reactions, speech, audio and physiological reactions) in a
naturalistic-like HCI environment, one can improve classification
methods on a multimodal level.
This database is the result of an HCI experiment, for which 30
subjects in total agreed to the publication of their data, including the
video material, for research purposes. The now available open
corpus contains the following sensor signals: video, audio, physiology (SCL,
respiration, BVP, EMG Corrugator supercilii, EMG Zygomaticus
Major) and facial reaction annotations.
Abstract: The increasing prevalence of childhood obesity has
increased interest in early and late indicators of weight gain.
Blood cell counts may be indicators of pro-inflammatory states. The
aim was to evaluate associations of hematological parameters,
including hematocrit (HTC), hemoglobin, blood cell counts and their
indices, with the degree of obesity in a pediatric population. A total of
249 children, 139 morbidly obese (MO), 82 healthy normal weight (NW) and
28 overweight (OW), were included in the study. WHO BMI-for-age
percentiles were used to form age- and sex-matched
groups. Informed consent forms and Ethics Committee
approval were obtained. Anthropometric measurements were
performed. Hematological parameters were determined. Statistical
analyses were performed using SPSS. The threshold for statistical
significance was p≤0.05. Significant differences (p=0.000) between
the waist-to-hip ratios and head-to-neck ratios (hnrs) of MO and NW
children were detected. A significant difference between the hnrs of OW
and MO children (p=0.000) was observed. Red cell distribution width
(RDW) was higher in OW children than in the NW group (p=0.030). No such
difference was detected between the MO and NW groups. Increased
RDW was prominent in OW children. The decrease in mean
corpuscular hemoglobin concentration (MCHC) values in MO
children was sharper than that in OW children (p=0.006 vs
p=0.042) compared to the NW group. Statistically higher HTC
levels were observed in the MO-NW comparison (p=0.014), but not in the
OW-NW comparison. Though the cause-effect relationship between obesity and
erythrocyte indices still needs further investigation, alterations in
RDW, HTC and MCHC during obesity may be of significance in
early life.
Abstract: The 3D body movement signals captured during
human-human conversation include clues not only to the content of
people’s communication but also to their culture and personality.
This paper is concerned with automatic extraction of this information
from body movement signals. For the purpose of this research, we
collected a novel corpus from 27 subjects and arranged them into groups
according to their culture. We arranged each group into pairs, and
each pair conversed about different topics.
A state-of-the-art recognition system is applied to the problems of
person, culture, and topic recognition. We borrowed modeling,
classification, and normalization techniques from speech recognition.
We used Gaussian Mixture Modeling (GMM) as the main technique
for building our three systems, obtaining 77.78%, 55.47%, and
39.06% from the person, culture, and topic recognition systems
respectively. In addition, we combined the above GMM systems with
Support Vector Machines (SVM) to obtain 85.42%, 62.50%, and
40.63% accuracy for person, culture, and topic recognition
respectively.
Although direct comparison among these three recognition
systems is difficult, it seems that our person recognition system
performs best for both GMM and GMM-SVM, suggesting that inter-subject
differences (i.e. subjects' personality traits) are a major
source of variation. After removing these traits from the culture and
topic recognition systems using the Nuisance Attribute Projection
(NAP) and the Intersession Variability Compensation (ISVC)
techniques, we obtained 73.44% and 46.09% accuracy from the culture
and topic recognition systems respectively.
Abstract: The growth over centuries in the volume of text data such as
books and articles in libraries has made it necessary to establish
effective mechanisms to locate them. Early techniques such as
abstraction, indexing and the use of classification categories
marked the birth of a new field of research called "Information
Retrieval". Information Retrieval (IR) can be defined as the task of
defining models and systems whose purpose is to facilitate access to
a set of documents in electronic form (corpus), allowing users to find
those relevant to them, that is to say, the content that matches
their information needs.
Most information retrieval models index a corpus with a specific data
structure called an "inverted file" or "inverted
index".
This inverted file collects information on every term in the corpus
documents: the identifiers of the documents that contain the
term, the frequency of the term in each document of the
corpus, the positions of its occurrences...
In this paper we use an object-oriented database (db4o) instead of
the inverted file; that is to say, instead of searching for a term in the
inverted file, we search for it in the db4o database.
The purpose of this work is a comparative study to see whether
object-oriented databases can compete with the inverted index
in terms of access speed and resource consumption on a large
volume of data.
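For reference, the inverted file used as the baseline in this comparison can be sketched as a small in-memory structure. This is a minimal illustration only: the function names are hypothetical, whitespace tokenization is a simplification, and the db4o side of the comparison is not shown.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map term -> {doc_id: [positions]}; the frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Simplification: lower-case and split on whitespace.
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def lookup(index, term):
    """Return {doc_id: (frequency, positions)} for a term, or {} if absent."""
    postings = index.get(term.lower(), {})
    return {doc_id: (len(p), p) for doc_id, p in postings.items()}

# Toy corpus keyed by document identifier.
docs = {1: "the cat sat on the mat", 2: "the dog chased the cat"}
index = build_inverted_index(docs)
result = lookup(index, "the")
```

A lookup touches only the postings of the queried term, which is the access pattern any replacement store would have to match.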
Abstract: Textual data plays an important role in the modern
world. The possibilities of applying data mining techniques to
uncover hidden information present in large text
collections are immense. The Growing Self Organizing Map (GSOM)
is a highly successful member of the Self Organising Map family
and has been used as a clustering and visualisation tool across a wide
range of disciplines to discover hidden patterns present in the data.
A comprehensive analysis of the GSOM’s capabilities as a text
clustering and visualisation tool has so far not been published. These
functionalities, namely map visualisation capabilities, automatic
cluster identification and hierarchical clustering capabilities are
presented in this paper and are further demonstrated with experiments
on a benchmark text corpus.
Abstract: Corpus luteum cross-sectional area (by ultrasonography) and plasma progesterone (by DELFIA) were estimated in early pregnant and non-pregnant cows on day 14 and on days 20 to 23 post insemination. On day 14, corpus luteum sectional area was 348.43 mm2 in pregnant and 387.84 mm2 in non-pregnant cows. On days 20 to 23, corpus luteum sectional area ranged between 342.06 and 367.90 mm2 in pregnant and between 193.85 and 270.69 mm2 in non-pregnant cows. Plasma progesterone level was 2.43 ng/ml in pregnant and 2.46 ng/ml in non-pregnant cows on day 14, while on days 20 to 23 the level ranged between 2.47 and 2.84 ng/ml in pregnant and between 0.53 and 1.17 ng/ml in non-pregnant cows. Results for both luteal tissue area and plasma progesterone level were highly significantly different (P
Abstract: Email has become a fast and cheap means of online
communication. The main threat to email is Unsolicited Bulk Email
(UBE), commonly called spam email. The current work aims at
identifying unigrams in more than 2700 UBEs that advertise
body-enhancement drugs. The identification is based on the
requirement that the unigram is neither present in a dictionary nor a
slang term. The motives of the paper are manifold. This is an
attempt to analyze spamming behaviour and the use of the word-mutation
technique. On the sidelines of the paper, we have
attempted to better understand spam, slang and their interplay.
The problem has been addressed by employing a tokenization
technique and a unigram bag-of-words (BOW) model. We found that non-lexicon
words constitute nearly 66% of the total number of lexis in the corpus,
whereas non-slang words constitute nearly 2.4% of the non-lexicon
words. Further, non-lexicon non-slang unigrams composed of 2
lexicon words form more than 71% of the total number of such
unigrams. To the best of our knowledge, this is the first attempt to
analyze the usage of non-lexicon non-slang unigrams in any kind of
UBE.
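The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word lists `LEXICON` and `SLANG`, the regular-expression tokenizer and the sample emails are all hypothetical, and a real study would use a full dictionary and slang list.

```python
import re
from collections import Counter

# Hypothetical word lists, for illustration only.
LEXICON = {"buy", "cheap", "pills", "now", "body", "gain", "mass", "best", "work"}
SLANG = {"gonna", "wanna"}

def tokenize(text):
    """Lower-case the text and extract alphanumeric unigrams."""
    return re.findall(r"[a-z0-9]+", text.lower())

def non_lexicon_non_slang(emails):
    """Bag-of-words counts of unigrams found in neither word list."""
    bow = Counter(tok for email in emails for tok in tokenize(email))
    return Counter({t: c for t, c in bow.items()
                    if t not in LEXICON and t not in SLANG})

def splits_into_two_lexicon_words(token):
    """True if the token is the concatenation of two lexicon words."""
    return any(token[:i] in LEXICON and token[i:] in LEXICON
               for i in range(1, len(token)))

emails = [
    "Buy cheap pills now",
    "bodygain pills gonna work",
    "Best massgain v1agra",
]
unknown = non_lexicon_non_slang(emails)
compounds = sorted(t for t in unknown if splits_into_two_lexicon_words(t))
```

In this toy input, `bodygain` and `massgain` are flagged as concatenations of two lexicon words, while `v1agra` is a mutated token of a different kind.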
Abstract: In this paper, we propose a method of resolving dependency ambiguities of Korean subordinate clauses based on Support Vector Machines (SVMs). Dependency analysis of clauses is well known to be one of the most difficult tasks in parsing sentences, especially in Korean. To solve this problem, we assume that the dependency relation of Korean subordinate clauses is the dependency relation among the verb phrases, verbs and endings in the clauses. As a result, the problem is represented as a binary classification task. To apply SVMs to this problem, we selected two kinds of features: static and dynamic. The experimental results on the STEP2000 corpus show that our system achieves an accuracy of 73.5%.
Abstract: This work proposes an approach to address automatic
text summarization. This approach is a trainable summarizer, which
takes into account several features, including sentence position,
positive keyword, negative keyword, sentence centrality, sentence
resemblance to the title, sentence inclusion of named entities, sentence
inclusion of numerical data, sentence relative length, bushy path of
the sentence and aggregated similarity for each sentence, to generate
summaries. First we investigate the effect of each sentence feature on
the summarization task. Then we use the score function over all features to
train genetic algorithm (GA) and mathematical regression (MR)
models to obtain a suitable combination of feature weights. The
performance of the proposed approach is measured at several compression
rates on a data corpus composed of 100 English religious articles.
The results of the proposed approach are promising.
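A feature-weighted sentence scorer of the kind described above can be sketched as follows. Only four of the listed features are included, the weights in `WEIGHTS` are hypothetical fixed values (in the abstract they are learned by the GA or MR models), and the feature definitions are simplified assumptions rather than the paper's formulas.

```python
import re

# Hypothetical weights; the abstract learns these with GA or regression models.
WEIGHTS = {"position": 0.4, "title": 0.3, "numeric": 0.1, "length": 0.2}

def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def features(sentence, idx, n_sentences, title_words, max_len):
    """Simplified versions of four of the sentence features."""
    toks = words(sentence)
    return {
        "position": 1.0 - idx / n_sentences,          # earlier is better
        "title": (len(set(toks) & title_words) / len(title_words)
                  if title_words else 0.0),           # overlap with the title
        "numeric": 1.0 if any(t.isdigit() for t in toks) else 0.0,
        "length": len(toks) / max_len if max_len else 0.0,
    }

def summarize(title, sentences, compression_rate):
    """Keep the top-scoring sentences, returned in their original order."""
    title_words = set(words(title))
    max_len = max(len(words(s)) for s in sentences)
    scored = []
    for i, s in enumerate(sentences):
        f = features(s, i, len(sentences), title_words, max_len)
        scored.append((sum(WEIGHTS[k] * f[k] for k in WEIGHTS), i))
    keep = max(1, round(compression_rate * len(sentences)))
    chosen = sorted(i for _, i in sorted(scored, reverse=True)[:keep])
    return [sentences[i] for i in chosen]

title = "Solar power growth"
sentences = [
    "Solar power capacity grew quickly last year.",
    "Many people like sunny weather.",
    "Growth reached 20 percent in some regions.",
    "The weather was nice.",
]
summary = summarize(title, sentences, compression_rate=0.5)
```

With a compression rate of 0.5, the sketch keeps the first and third sentences, which score highest on position, title overlap and numerical content.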
Abstract: In this paper, we use Radial Basis Function Networks
(RBFN) to solve the problem of environmental interference
cancellation in speech signals. We show that the Second Order Thin-Plate
Spline (SOTPS) kernel cancels the interference effectively.
For comparison, we also test two of the most commonly
used conventional RBFN kernels: the Gaussian and First Order TPS (FOTPS)
basis functions. The speech signals used here were taken from the
OGI Multi-Language Telephone Speech Corpus database and were
corrupted with six types of environmental noise from the NOISEX-92
database. Experimental results show that the SOTPS kernel can
considerably outperform the Gaussian and FOTPS functions on
the speech interference cancellation problem.
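The abstract does not give the kernel formulas. As a sketch only, the code below assumes the generalised thin-plate spline form r^(2k) ln r, with k = 1 taken as the first-order and k = 2 as the second-order kernel, alongside a standard Gaussian; these definitions and the one-dimensional `rbfn_output` helper are assumptions and may differ from the paper's.

```python
import math

def gaussian(r, sigma=1.0):
    """Gaussian basis function of the distance r."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))

def thin_plate_spline(r, order=1):
    """Assumed form r**(2*order) * ln(r), taken as 0 at r = 0."""
    if r < 0:
        raise ValueError("distance must be non-negative")
    if r == 0:
        return 0.0
    return r ** (2 * order) * math.log(r)

def rbfn_output(x, centers, weights, kernel):
    """Output of a one-dimensional RBF network: weighted sum of kernels."""
    return sum(w * kernel(abs(x - c)) for c, w in zip(centers, weights))

# Hypothetical centres and weights, for illustration only.
centers = [0.0, 1.0, 2.0]
weights = [0.5, -1.0, 0.5]
y_gauss = rbfn_output(0.5, centers, weights, gaussian)
y_sotps = rbfn_output(0.5, centers, weights,
                      lambda r: thin_plate_spline(r, order=2))
```

Swapping the `kernel` argument is the only change needed to compare basis functions on the same centres and weights.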
Abstract: This paper summarizes the results of experiments to find effective features for the disambiguation of Turkish verbs. Word sense disambiguation is a current area of investigation in which verbs have the dominant role. On average, verbs have more senses than other word types, so detecting these features for verbs may lead to improvements for other word types. In this paper we consider only the syntactic features that can be obtained from the corpus, and test them using several well-known machine learning algorithms.
Abstract: This paper deals with questions as important
components of information behavior in school. By analyzing the
Corpus Schola2010, the state of contemporary education in terms of
questioning is shown to be unsatisfactory: 80% of the questions are asked
by teachers; most teachers' questions are asked at the beginning of
the first grade, then their number decreases and settles at
80±10 questions per lesson. The average number of questions per pupil
within one lesson is generally less than one.
The highest values are achieved in the first, sixth, eighth and tenth
grades, i.e. in the transition years in which pupils move into
higher levels of education, and the number declines in every following year. We
can state that Czech schools do not support the questioning skills
of their pupils; typical Czech schools thereby neglect the
development of thinking, reasoning and cooperation in their pupils.
Abstract: We report in this paper the procedure of an
automatic speech recognition system based on dynamic
programming techniques. Temporal retiming is a technique
used to synchronize two forms to be compared. We will see how
this technique is adapted to the field of automatic speech
recognition. We first present the theory of the
retiming function, which is used to compare and align an
unknown form with a set of reference forms constituting the
vocabulary of the application. We then give
the various algorithms needed to implement it on a machine.
The algorithms we present were tested on part of the
Arabic-language word corpus Arabdic-10 [4] and gave fully
satisfactory results. These algorithms are effective insofar as they are
applied to small or medium-sized vocabularies.
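Assuming that "temporal retiming" corresponds to what is commonly called dynamic time warping (DTW), the comparison of an unknown form with a set of reference forms can be sketched as below. The one-dimensional feature sequences, the absolute difference as local cost and the toy vocabulary are simplifications introduced here, not details from the paper.

```python
def dtw_distance(a, b):
    """Dynamic-programming alignment cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed)
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a[i-1]
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def recognize(unknown, references):
    """Return the vocabulary entry whose reference is closest under DTW."""
    return min(references, key=lambda w: dtw_distance(unknown, references[w]))

# Hypothetical reference forms (one feature value per frame).
references = {"one": [1, 3, 4, 3, 1], "two": [5, 5, 2, 1, 1]}
unknown = [1, 1, 3, 4, 4, 3, 1]  # a time-stretched version of "one"
best = recognize(unknown, references)
```

Each comparison costs time proportional to the product of the two sequence lengths, and one comparison is needed per reference form, which is consistent with the remark that the approach suits small or medium-sized vocabularies.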
Abstract: Word sense disambiguation is one of the most important open problems in natural language processing applications such as information retrieval and machine translation. Many strategies can be employed to resolve word ambiguity with a reasonable degree of accuracy. These strategies are knowledge-based, corpus-based, and hybrid. This paper focuses on the corpus-based strategy, which employs an unsupervised learning method for disambiguation. We report our investigation of Latent Semantic Indexing (LSI), an information retrieval technique and unsupervised learning method, applied to the task of Thai noun and verb word sense disambiguation. Latent Semantic Indexing has been shown to be efficient and effective for information retrieval. For the purposes of this research, we report experiments on two Thai polysemous words, namely /hua4/ and /kep1/, which are used as representatives of Thai nouns and verbs respectively. The results of these experiments demonstrate the effectiveness and indicate the potential of applying vector-based distributional information measures to semantic disambiguation.
Abstract: Nowadays, ontologies are the only widely accepted paradigm for the management of sharable and reusable knowledge in a way that allows its automatic interpretation. They are collaboratively created across the Web and used to index, search and annotate documents. The vast majority of the ontology based approaches, however, focus on indexing texts at document level. Recently, with the advances in ontological engineering, it became clear that information indexing can largely benefit from the use of general purpose ontologies which aid the indexing of documents at word level. This paper presents a concept indexing algorithm, which adds ontology information to words and phrases and allows full text to be searched, browsed and analyzed at different levels of abstraction. This algorithm uses a general purpose ontology, OntoRo, and an ontologically tagged corpus, OntoCorp, both developed for the purpose of this research. OntoRo and OntoCorp are used in a two-stage supervised machine learning process aimed at generating ontology tagging rules. The first experimental tests show a tagging accuracy of 78.91% which is encouraging in terms of the further improvement of the algorithm.
Abstract: The performance of any continuous speech recognition system is highly dependent on the performance of its acoustic models. Generally, the development of robust spoken language technology relies on the availability of large amounts of data. A common way to cope with little data for training each state of the Markov models is tree-based state tying. This tying method applies contextual questions to tie states. The manual procedure for question generation suffers from human errors and is time consuming. Various automatically generated questions are used to construct the decision tree. There are three approaches to generating questions for constructing HMMs based on a decision tree. One approach is based on misrecognized phonemes, another uses a feature table, and the third is based on state distributions corresponding to context-independent subword units. In this paper, all these methods of automatic question generation are applied to the decision tree on the FARSDAT corpus in the Persian language, and their results are compared with those of manually generated questions. The results show that automatically generated questions yield much better results and can replace manually generated questions in the Persian language.