Abstract: This paper presents and benchmarks a number of
end-to-end Deep Learning based models for metaphor detection in
Greek. We bring Convolutional Neural Networks and Recurrent
Neural Networks, together with representation learning, to bear on the
metaphor detection problem for the Greek language. The models presented
achieve high accuracy, significantly improving on the previous
state-of-the-art result of 0.82 accuracy. Furthermore, no special
preprocessing, feature engineering or
linguistic knowledge is used in this work. The methods presented
achieve an accuracy of 0.92 and an F-score of 0.92 with Convolutional
Neural Networks (CNNs) and bidirectional Long Short-Term Memory
networks (LSTMs). Comparable results of 0.91 accuracy and 0.91
F-score are also achieved with bidirectional Gated Recurrent Units
(GRUs) and Convolutional Recurrent Neural Nets (CRNNs). The
models are trained and evaluated only on the basis of training tuples,
the related sentences and their labels. The outcome is a state-of-the-art
collection of metaphor detection models, trained on limited labelled
resources, which can be extended to other languages and similar
tasks.
Abstract: This proposal aims for semantic enrichment between
glossaries using the Simple Knowledge Organization System (SKOS)
vocabulary to discover synonyms, hyponyms and hyperonyms
semiautomatically, in Brazilian Portuguese, generating new semantic
relationships based on WordNet. To evaluate the quality of the
proposed model, experiments were performed using two sets
of new relations, one generated automatically and the
other mapped manually by a domain expert. The applied evaluation
metrics were precision, recall, f-score, and confidence interval. The
results obtained demonstrate that the method, applied to the field of
Oil Production and Extraction (E&P), is effective, which suggests that
it can be used to improve the quality of terminological mappings.
The procedure, although more complex to set up, can be
reproduced in other domains.
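The comparison between automatically generated relations and the expert mapping can be illustrated with a minimal sketch. It assumes relations are stored as (term, SKOS property, term) triples and that the expert set is treated as the gold standard; the triples below are hypothetical and do not come from the paper, and the confidence interval is not shown.

```python
def precision_recall_f1(predicted, gold):
    """Compare two sets of relations and return (precision, recall, F1)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical relations, for illustration only.
automatic = {
    ("well", "skos:broader", "borehole"),
    ("crude", "skos:exactMatch", "petroleum"),
    ("rig", "skos:related", "platform"),
    ("pump", "skos:narrower", "valve"),      # not confirmed by the expert
}
expert = {
    ("well", "skos:broader", "borehole"),
    ("crude", "skos:exactMatch", "petroleum"),
    ("rig", "skos:related", "platform"),
    ("riser", "skos:narrower", "pipe"),      # missed by the automatic step
}

precision, recall, f1 = precision_recall_f1(automatic, expert)
print(precision, recall, f1)
```

Three of the four automatic relations are confirmed and three of the four expert relations are recovered, so all three metrics come out at 0.75 for this toy input.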
Abstract: A major challenge in medical studies, especially longitudinal ones, is the problem of missing measurements, which hinders the effective application of many machine learning algorithms. Furthermore, recent Alzheimer's Disease (AD) studies have focused on the delineation of Early Mild Cognitive Impairment (EMCI) and Late Mild Cognitive Impairment (LMCI) from cognitively normal controls (CN), which is essential for developing effective and early treatment methods. To address these challenges, this paper explores the potential of the eXtreme Gradient Boosting (XGBoost) algorithm for handling missing values in multiclass classification. We seek a generalized classification scheme in which all prodromal stages of the disease are considered simultaneously in the classification and decision-making processes. Given the large number of subjects (1631) included in this study and the presence of almost 28% missing values, we investigated the performance of XGBoost on the classification of the four classes AD, CN, EMCI, and LMCI. Using 10-fold cross-validation, XGBoost is shown to outperform other state-of-the-art classification algorithms by 3% in terms of accuracy and F-score. Our model achieved an accuracy of 80.52%, a precision of 80.62% and a recall of 80.51%, supporting the more natural and promising multiclass classification.
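As a rough illustration of the 10-fold cross-validation protocol mentioned above, the sketch below partitions 1631 subject indices into ten folds. It is a plain contiguous split written for this note; the abstract does not say whether the authors shuffled or stratified the folds, so neither is assumed here.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(1631, k=10))
for train, test in folds:
    # Each subject is held out exactly once; a model would be fit on `train`
    # and scored on `test` here.
    print(len(train), len(test))
```

In practice the indices would usually be shuffled, and for a four-class problem stratified by class, before splitting.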
Abstract: With recent trends in Big Data and advancements
in Information and Communication Technologies, the healthcare
industry is transitioning from clinician-oriented to
technology-oriented practice. Many people around the world die of cancer
because the disease was not diagnosed at an early stage.
Nowadays, computational methods in the form of Machine
Learning (ML) are used to develop automated decision support
systems that can diagnose cancer with high confidence in a timely
manner. This paper aims to carry out the comparative evaluation
of a selected set of ML classifiers on two existing datasets: breast
cancer and cervical cancer. The ML classifiers compared in this study
are Decision Tree (DT), Support Vector Machine (SVM), k-Nearest
Neighbor (k-NN), Logistic Regression, Ensemble (Bagged Tree) and
Artificial Neural Networks (ANN). The evaluation is carried out based
on standard evaluation metrics Precision (P), Recall (R), F1-score and
Accuracy. The experimental results
show that ANN achieved the highest accuracy (99.4%) when
tested on the breast cancer dataset. On the other hand, when these
ML classifiers are tested on the cervical cancer dataset, the Ensemble
(Bagged Tree) technique gave higher accuracy (93.1%) than the
other classifiers.
Abstract: The present approach deals with the identification of emotions and the classification of emotional patterns at phrase level with respect to positive and negative orientation. The proposed approach considers emotion-triggered terms, their co-occurring terms, and the associated sentences for recognizing emotions. It uses Part-of-Speech Tagging and Emotion Actifiers for classification. Sentence patterns are broken into phrases, and a Neuro-Fuzzy model is used to classify them, which results in 16 patterns of emotional phrases. Suitable intensities are assigned to capture the degree of emotional content present in the semantics of the patterns. These emotional phrases are assigned weights, which help in deciding the positive or negative orientation of the emotions. The approach uses web documents for the experiments; the proposed classification approach performs well and achieves good F-scores.
Abstract: Feature selection is important for effective classification in cancer diagnosis. However, in microarray gene expression datasets, the large number of features relative to the number of samples makes classification computationally very hard and prone to errors. In this paper, we present an innovative method for selecting highly informative gene subsets from gene expression data, so that the cancer data can be effectively classified into tumorous and non-tumorous. The hybrid gene selection technique combines Mutual Information and Fisher score to select informative genes. The gene selection is validated by classification using a Support Vector Machine (SVM), a supervised learning algorithm capable of solving complex classification problems. The improved Mutual Information and F-Score selection, with SVM as the classifier, produced efficient results.
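To make the Fisher score part of the gene selection concrete, here is a minimal sketch using one common formulation (between-class scatter of the class means divided by the pooled within-class variance). The abstract does not give the authors' exact formula, the mutual information step is not shown, and the expression values are invented for illustration.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Fisher score of one feature: between-class over within-class scatter."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (mean(group) - overall) ** 2
        within += len(group) * pvariance(group)
    return between / within if within > 0 else float("inf")


# Hypothetical expression levels for two genes across six samples.
labels = [0, 0, 0, 1, 1, 1]            # 0 = non-tumorous, 1 = tumorous
genes = {
    "gene_a": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # separates the classes well
    "gene_b": [1.0, 8.0, 3.0, 7.0, 2.0, 9.0],   # overlaps heavily
}

scores = {name: fisher_score(vals, labels) for name, vals in genes.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Genes with higher scores would be kept as candidates; in a hybrid scheme such as the one described, this ranking would be combined with a mutual information ranking before training the SVM.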
Abstract: This paper presents a classifier ensemble approach for
predicting the survivability of the breast cancer patients using the
latest database version of the Surveillance, Epidemiology, and End
Results (SEER) Program of the National Cancer Institute. The system
consists of two main components: a feature selection component and a
classifier ensemble component. The feature selection component divides the
features in the SEER database into four groups. It then tries to find
the most important features among the four groups that maximize the
weighted average F-score of a certain classification algorithm. The
ensemble component uses three different classifiers, each of which
models a different set of features from SEER obtained through the feature
selection module. On top of them, another classifier is used to give
the final decision based on the output decisions and confidence
scores from each of the underlying classifiers. Different classification
algorithms have been examined; the best setup found uses the
decision tree, Bayesian network, and Naïve Bayes algorithms for the
underlying classifiers and Naïve Bayes for the classifier ensemble
step. The system outperforms all published systems to date when
evaluated against the exact same data of SEER (period of 1973-2002).
It gives 87.39% weighted average F-score compared to 85.82% and
81.34% of the other published systems. By increasing the data size to
cover the whole database (period of 1973-2014), the overall weighted
average F-score jumps to 92.4% on the held out unseen test set.
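The weighted average F-score used above can be sketched as follows, assuming the usual convention in which each class's F-score is weighted by its share of the true labels. The abstract does not define the weighting, and the labels below are made up for illustration.

```python
def weighted_f_score(y_true, y_pred):
    """Average per-class F1, weighting each class by its number of true samples."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * (tp + fn) / total   # (tp + fn) is the class support
    return score


# Hypothetical survivability labels: 6 "survived", 4 "not_survived".
y_true = ["survived"] * 6 + ["not_survived"] * 4
y_pred = ["survived"] * 5 + ["not_survived"] + ["not_survived"] * 3 + ["survived"]

result = weighted_f_score(y_true, y_pred)
print(round(result, 4))
```

Weighting by support keeps the majority class from being under-represented in the summary figure, which matters when the survival classes are imbalanced.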
Abstract: Information sharing and gathering are important in this era of rapid technological advancement. The World Wide Web has caused an explosion of information. Readers are overloaded with lengthy text documents when they would prefer shorter versions. The oil and gas industry has not escaped this predicament. In this paper, we develop an Automated Text Summarization System known as AutoTextSumm to extract the salient points of oil and gas drilling articles by incorporating a statistical approach, keyword identification, synonym words and sentence position. In this study, we conducted interviews with Petroleum Engineering experts and English Language experts to identify the most commonly used keywords in the oil and gas drilling domain. The performance of AutoTextSumm is evaluated using precision, recall and F-score. Based on the experimental results, AutoTextSumm achieves satisfactory performance, with an F-score of 0.81.
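A toy version of extractive scoring that combines keywords, synonyms and sentence position might look like the sketch below. It is not the AutoTextSumm implementation: the scoring rule, the position bonus, the keyword list and the sample sentences are all assumptions made for illustration.

```python
def summarize(sentences, keywords, synonyms, top_n=2):
    """Pick the top_n sentences by keyword hits plus a bonus for the opening sentence."""
    # Synonyms count as hits for the keyword they map to.
    vocabulary = set(keywords) | set(synonyms)
    scored = []
    for position, sentence in enumerate(sentences):
        words = [w.strip(".,;:").lower() for w in sentence.split()]
        keyword_hits = sum(1 for w in words if w in vocabulary)
        position_bonus = 1.0 if position == 0 else 0.0   # assumed weighting
        scored.append((keyword_hits + position_bonus, position))
    best = sorted(scored, key=lambda item: (-item[0], item[1]))[:top_n]
    # Return the chosen sentences in their original order.
    return [sentences[pos] for _, pos in sorted(best, key=lambda item: item[1])]


article = [
    "Drilling mud cools the bit.",
    "The crew arrived early.",
    "Drilling fluid also supports the casing.",
    "Lunch was served at noon.",
]
keywords = {"mud", "bit", "casing"}      # hypothetical domain keywords
synonyms = {"fluid": "mud"}              # hypothetical synonym mapping

summary = summarize(article, keywords, synonyms, top_n=2)
print(summary)
```

A statistical component such as term frequency could be added to the score in the same way as the position bonus.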
Abstract: Automatic extraction of event information from
social text streams (emails, social network sites, blogs, etc.) is a vital
requirement for many applications like Event Planning and
Management systems and security applications. The key information
components needed from Event related text are Event title, location,
participants, date and time. Emails differ from
other social text streams in layout, format,
and conversation style, and are the most commonly used
communication channel for broadcasting and planning events.
Therefore, we have chosen emails as our dataset. In our work, we
have employed two statistical NLP methods, namely Finite State
Machines (FSM) and Hidden Markov Models (HMM), for the
extraction of event related contextual information. An application
has been developed that compares the two methods
on the event extraction task. It comprises two modules, one for
each method, and works with both bulk and direct user input.
The results are evaluated using Precision, Recall and F-Score.
Experiments show that both methods achieve high performance and
accuracy; HMM performed well on Title extraction, while
FSM proved to be better for Venue, Date, and Time.
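The finite state approach to date extraction can be illustrated with a small hand-written machine over tokens. This is only a sketch of the general idea, with states and transitions chosen for this note (START, DAY, MONTH, with an optional four-digit year); the authors' actual machines and the HMM module are not described in the abstract and are not reproduced here.

```python
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}


def extract_dates(text):
    """Scan tokens with a small state machine: START -> DAY -> MONTH (-> year)."""
    tokens = text.replace(",", " ").split()
    dates, state, current = [], "START", []
    for token in tokens + [""]:            # empty sentinel flushes a pending match
        word = token.lower().strip(".")
        if state == "DAY" and word in MONTHS:
            current.append(token.strip("."))
            state = "MONTH"
        elif state == "MONTH" and word.isdecimal() and len(word) == 4:
            current.append(word)
            dates.append(" ".join(current))
            state, current = "START", []
        else:
            if state == "MONTH":           # accept "day month" without a year
                dates.append(" ".join(current))
            state, current = "START", []
            if word.isdecimal() and 1 <= int(word) <= 31:
                state, current = "DAY", [word]
    return dates


email = "The workshop is on 12 March 2015 at the main hall, and the dinner on 3 April."
found = extract_dates(email)
print(found)
```

Separate machines of the same shape could be written for times and venues; an HMM would instead learn the tag transitions from annotated emails.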
Abstract: Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is nowadays considered fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports on the development of a NER system for Bengali and Hindi using a Support Vector Machine (SVM). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for Indian languages (ILs) is very new. The system makes use of the different contextual information of the words along with a variety of features that are helpful in predicting the four named entity (NE) classes: Person name, Location name, Organization name and Miscellaneous name. We have used annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi tagged with the twelve different NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). In addition, we have manually annotated 150K wordforms of the Bengali news corpus, developed from the web archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm to generate lexical context patterns from a part of the unlabeled Bengali news corpus. The lexical patterns have been used as features of the SVM in order to improve system performance. The NER system has been tested with gold standard test sets of 35K and 60K tokens for Bengali and Hindi, respectively. Evaluation results have demonstrated recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. Results show an improvement in the f-score of 5.13% with the use of context patterns. A statistical analysis (ANOVA) is also performed to compare the performance of the proposed NER system with that of the existing HMM-based system for both languages.
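The reported f-score values are consistent with the harmonic mean of the reported precision and recall, which a short check can confirm. The sketch assumes the balanced F-score (equal weight on precision and recall), which the abstract does not state explicitly.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall percentages as reported in the abstract.
bengali = round(f_score(80.12, 88.61), 2)
hindi = round(f_score(74.34, 80.23), 2)
print(bengali, hindi)   # matches the reported 84.15 and 77.17
```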