Abstract: As email communications have no consistent
authentication procedure to ensure authenticity, we present an
investigative analysis approach for detecting forged emails based on
Random Forest and Naïve Bayes classifiers. Instead of investigating
the email headers, we use the body content to extract a unique writing
style for each possible suspect. Our approach consists of four main
steps: (1) the cybercrime investigator extracts different effective
features, including structural, lexical, linguistic, and syntactic
evidence, from previous emails of all the possible suspects; (2) the
extracted feature vectors are normalized to increase the accuracy
rate; (3) the normalized features are then used to train the learning
engine; (4) upon receiving the anonymous email M, we apply the
feature extraction process to produce a feature vector. Finally, using
the machine learning classifiers, the email is assigned to the
suspect whose writing style most closely matches M. Experimental
results on real data sets show the improved performance of the
proposed method and its ability to identify the authors with a
very limited number of features.
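The four-step pipeline can be illustrated with a minimal stand-in: a few toy lexical features, min-max normalization, and a nearest-centroid assignment in place of the Random Forest and Naïve Bayes classifiers. The feature choices and the classifier here are illustrative assumptions, not the paper's exact method.

```python
import math

def stylometric_features(text):
    """Toy structural/lexical features for an email body (step 1, illustrative)."""
    words = text.split()
    n = max(len(words), 1)
    avg_word_len = sum(len(w) for w in words) / n
    vocab_richness = len(set(w.lower() for w in words)) / n
    punct_rate = sum(text.count(c) for c in ",.;:!?") / max(len(text), 1)
    return [avg_word_len, vocab_richness, punct_rate]

def normalize(vectors):
    """Min-max normalize each feature dimension to [0, 1] (step 2)."""
    dims = list(zip(*vectors))
    lo, hi = [min(d) for d in dims], [max(d) for d in dims]
    return [[(v - a) / (b - a) if b > a else 0.0
             for v, a, b in zip(vec, lo, hi)] for vec in vectors]

def nearest_suspect(train_vecs, labels, query_vec):
    """Assign the anonymous email M to the suspect whose average
    writing-style vector is closest (a stand-in for steps 3 and 4)."""
    by_suspect = {}
    for vec, lab in zip(train_vecs, labels):
        by_suspect.setdefault(lab, []).append(vec)
    best, best_d = None, float("inf")
    for lab, vecs in by_suspect.items():
        centroid = [sum(col) / len(col) for col in zip(*vecs)]
        d = math.dist(centroid, query_vec)
        if d < best_d:
            best, best_d = lab, d
    return best
```

In practice each suspect's emails would be vectorized together with M so that the normalization (step 2) is applied consistently before classification.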
Abstract: Designing an efficient algorithm for a given specific
problem has long been a subject of study. However, an alternative
approach, orthogonal to this one, has emerged: reduction. In general,
for a given specific problem, the reduction approach studies how to
convert the original problem into subproblems. This paper proposes
a formal modeling language to support this reduction approach. We
show three examples from the wide area of learning problems. The
benefit is fast prototyping of algorithms for a given new problem.
Abstract: Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is nowadays considered fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems, and others. This paper reports on the development of a NER system for Bengali and Hindi using a Support Vector Machine (SVM). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for Indian languages (ILs) is very new. The system makes use of the different contextual information of the words along with a variety of features that are helpful in predicting the four different named entity (NE) classes: Person name, Location name, Organization name, and Miscellaneous name. We have used the annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi, tagged with the twelve different NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). In addition, we have manually annotated 150K wordforms of the Bengali news corpus, developed from the web archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm to generate lexical context patterns from a part of the unlabeled Bengali news corpus. The lexical patterns have been used as features of the SVM in order to improve system performance. The NER system has been tested with gold standard test sets of 35K and 60K tokens for Bengali and Hindi, respectively. Evaluation results have demonstrated recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. Results show an improvement in the f-score of 5.13% with the use of context patterns.
A statistical analysis (ANOVA) is also performed to compare the performance of the proposed NER system with that of the existing HMM-based system for both languages.
Abstract: Intelligent systems based on machine learning
techniques, such as classification and clustering, are gaining
widespread popularity in real-world applications. This paper presents
work on developing a software system for predicting crop yield, for
example oil-palm yield, from climate and plantation data. At the core
of our system is a method for unsupervised partitioning of data to
find spatio-temporal patterns in climate data using kernel methods,
which offer the strength to deal with complex data. This work draws
inspiration from the notion that a non-linear transformation of data
into some high-dimensional feature space increases the possibility of
linear separability of the patterns in the transformed space, and
therefore simplifies exploration of the associated structure in the
data. Kernel methods implicitly perform a non-linear mapping of the
input data into a high-dimensional feature space by replacing the
inner products with an appropriate positive definite function. In this
paper we present a robust weighted kernel k-means algorithm
incorporating spatial constraints for clustering the data. The
proposed algorithm can effectively handle noise, outliers, and
auto-correlation in the spatial data, enabling effective and efficient
data analysis by exploring patterns and structures in the data, and
thus can be used for predicting oil-palm yield by analyzing the
various factors affecting the yield.
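The kernel trick described above, replacing inner products with a positive definite function so that all geometry lives in the implicit feature space, can be sketched as a simplified weighted kernel k-means. This version omits the paper's spatial constraints and robustness refinements; the RBF kernel and the deterministic initialization are illustrative choices.

```python
import math

def rbf(x, y, gamma=1.0):
    """A positive definite kernel standing in for the inner product
    in the implicit high-dimensional feature space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_kernel_kmeans(X, k, weights=None, iters=20):
    """Weighted kernel k-means: distances to cluster means are computed
    entirely from kernel evaluations, never from explicit coordinates."""
    n = len(X)
    w = weights or [1.0] * n
    K = [[rbf(X[i], X[j]) for j in range(n)] for i in range(n)]
    labels = [i % k for i in range(n)]          # deterministic toy init
    for _ in range(iters):
        members = {c: [i for i in range(n) if labels[i] == c] for c in range(k)}
        new = []
        for i in range(n):
            best, best_d = labels[i], float("inf")
            for c, idx in members.items():
                if not idx:
                    continue
                s = sum(w[j] for j in idx)
                # ||phi(x_i) - m_c||^2 expanded via the kernel trick
                d = (K[i][i]
                     - 2 * sum(w[j] * K[i][j] for j in idx) / s
                     + sum(w[j] * w[l] * K[j][l] for j in idx for l in idx) / s ** 2)
                if d < best_d:
                    best, best_d = c, d
            new.append(best)
        if new == labels:
            break
        labels = new
    return labels
```

The weights would let spatially auto-correlated or noisy observations count less toward each cluster mean, which is the lever the paper's robust variant exploits.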
Abstract: In the recent past, Learning Classifier Systems have
been successfully used for data mining. A Learning Classifier System
(LCS) is basically a machine learning technique which combines
evolutionary computing, reinforcement learning, supervised or
unsupervised learning, and heuristics to produce adaptive systems. An
LCS learns by interacting with an environment from which it
receives feedback in the form of numerical reward. Learning is
achieved by trying to maximize the amount of reward received. All
LCS models, more or less, comprise four main components: a finite
population of condition-action rules, called classifiers; the
performance component, which governs the interaction with the
environment; the credit assignment component, which distributes the
reward received from the environment to the classifiers accountable
for the rewards obtained; and the discovery component, which is
responsible for discovering better rules and improving existing ones
through a genetic algorithm. In one approach, the concatenation of the
production rules in the LCS forms the genotype, and the GA therefore
operates on a population of classifier systems; this is known as the
'Pittsburgh' style of classifier systems. Other LCSs, which perform
their GA at the rule level within a single population, are known as
'Michigan' style classifier systems. The most predominant
representation of the discovered knowledge is the standard production
rule (PR) of the form IF P THEN D. PRs, however, are unable to handle
exceptions and do not exhibit variable precision. Censored Production
Rules (CPRs), an extension of PRs proposed by Michalski and
Winston, exhibit variable precision and support an efficient
mechanism for handling exceptions. A CPR is an augmented
production rule of the form IF P THEN D UNLESS C, where the
censor C is an exception to the rule. Such rules are employed in
situations in which the conditional statement IF P THEN D holds
frequently and the assertion C holds rarely. By using a rule of this
type, we are free to ignore the exception conditions when the
resources needed to establish their presence are tight or there is
simply no information available as to whether they hold or not. Thus,
the IF P THEN D part of a CPR expresses important information, while
the UNLESS C part acts only as a switch that changes the polarity of
D to ~D. In this paper, a Pittsburgh-style LCS approach is used for
the automated discovery of CPRs. An appropriate encoding scheme is
suggested to represent a chromosome consisting of a fixed-size set of
CPRs. Suitable genetic operators are designed for the set of CPRs
and for individual CPRs, and an appropriate fitness function is
proposed that incorporates basic constraints on CPRs. Experimental
results are presented to demonstrate the performance of the proposed
learning classifier system.
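The IF P THEN D UNLESS C semantics described above can be captured in a few lines; treating the censor as three-valued (holds, fails, unknown) makes the "free to ignore the exception" behavior explicit. This is a sketch of the rule semantics only, not of the paper's chromosome encoding or genetic operators.

```python
def apply_cpr(premise, decision, censor):
    """Evaluate a Censored Production Rule: IF P THEN D UNLESS C.

    censor is True, False, or None (unknown, or too costly to establish).
    An unknown censor is ignored and the rule concludes D; a censor
    known to hold switches the polarity of D to ~D.
    """
    if not premise:
        return None           # premise fails: the rule does not fire
    if censor is True:
        return not decision   # UNLESS C flips D to ~D
    return decision           # C false or unknown: conclude D
```

For the classic "IF bird THEN flies UNLESS penguin" rule, an unknown censor still yields "flies", while an established censor flips the conclusion.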
Abstract: This paper discusses the design of knowledge
integration for clinical information extracted from distributed
medical ontologies in order to improve a machine learning-based
multilabel coding assignment system. The proposed approach is
implemented using a decision tree machine learning technique
on university hospital data for patients with Coronary Heart
Disease (CHD). The preliminary results obtained indicate
that the use of medical ontologies improves overall
system performance.
Abstract: As the Internet continues to grow at a rapid pace as
the primary medium for communications and commerce and as
telecommunication networks and systems continue to expand their
global reach, digital information has become the most popular and
important information resource and our dependence upon the
underlying cyber infrastructure has been increasing significantly.
Unfortunately, as our dependency has grown, so has the threat to the
cyber infrastructure from spammers, attackers and criminal
enterprises. In this paper, we propose a new machine learning based
network intrusion detection framework for cyber security. The
detection process of the framework consists of two stages: model
construction and intrusion detection. In the model construction stage,
a semi-supervised machine learning algorithm is applied to a
collected set of network audit data to generate a profile of normal
network behavior; in the intrusion detection stage, input network
events are analyzed and compared with the patterns gathered in the
profile, and those sufficiently far from the expected normal
behavior are flagged as anomalies. The
proposed framework is particularly applicable to the situations where
there is only a small amount of labeled network training data
available, which is very typical in real world network environments.
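The two-stage scheme can be caricatured in a few lines: summarize (mostly unlabeled) normal audit records as a profile, then flag events that fall sufficiently far outside it. The centroid-plus-radius profile and the slack factor are illustrative simplifications, not the framework's actual semi-supervised algorithm.

```python
import math

def build_profile(normal_events):
    """Model construction stage: summarize normal traffic (as numeric
    feature vectors) by a centroid plus the largest observed deviation."""
    dims = list(zip(*normal_events))
    centroid = [sum(d) / len(d) for d in dims]
    radius = max(math.dist(e, centroid) for e in normal_events)
    return centroid, radius

def detect(event, profile, slack=1.5):
    """Intrusion detection stage: an event is an anomaly when it is
    sufficiently far from the expected normal behavior."""
    centroid, radius = profile
    return math.dist(event, centroid) > slack * radius
```

Because the profile is built only from normal behavior, no labeled attack data is required, matching the small-labeled-data setting the abstract describes.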
Abstract: Knowledge is indispensable, but voluminous knowledge becomes a bottleneck for efficient processing. A great challenge for data mining is the generation of a large number of potential rules as a result of the mining process; in fact, the result is sometimes comparable in size to the original data. Traditional data mining pruning measures such as support do not sufficiently reduce the huge rule space. Moreover, many practical applications are characterized by continual change of data and knowledge, making knowledge more voluminous with each change. The most predominant representation of the discovered knowledge is the standard Production Rule (PR) of the form If P Then D. Michalski and Winston proposed Censored Production Rules (CPRs), an extension of production rules that exhibit variable precision and support an efficient mechanism for handling exceptions. A CPR is an augmented production rule of the form If P Then D Unless C, where C (the censor) is an exception to the rule. Such rules are employed in situations in which the conditional statement 'If P Then D' holds frequently and the assertion C holds rarely. By using a rule of this type, we are free to ignore the exception conditions when the resources needed to establish their presence are tight or there is simply no information available as to whether they hold or not. Thus the 'If P Then D' part of the CPR expresses important information, while the Unless C part acts only as a switch that changes the polarity of D to ~D. In this paper, a scheme based on a Dempster-Shafer Theory (DST) interpretation of a CPR is suggested for discovering CPRs from already discovered flat PRs. The discovery of CPRs from flat rules results in a considerable reduction of the already discovered rules. The proposed scheme incrementally incorporates new knowledge and also reduces the size of the knowledge base considerably with each episode. Examples are given to demonstrate the behaviour of the proposed scheme.
The suggested cumulative learning scheme would be useful in mining data streams.
Abstract: Support vector machines (SVMs) have shown
superior performance compared to other machine learning techniques,
especially in classification problems. Yet one limitation of SVMs is
the lack of an explanation capability which is crucial in some
applications, e.g. in the medical and security domains. In this paper, a
novel approach for eclectic rule-extraction from support vector
machines is presented. This approach utilizes the knowledge acquired
by the SVM and represented in its support vectors as well as the
parameters associated with them. The approach comprises three stages:
training, propositional rule-extraction, and rule quality evaluation.
Results from four different experiments have demonstrated the value
of the approach for extracting comprehensible rules of high accuracy
and fidelity.
Abstract: This paper presents a novel methodology for Maximum Power Point Tracking (MPPT) of a grid-connected 20 kW Photovoltaic (PV) system using a neuro-fuzzy network. The proposed method predicts the reference PV voltage guaranteeing optimal power transfer between the PV generator and the main utility grid. The neuro-fuzzy network is composed of a fuzzy rule-based classifier and three Radial Basis Function Neural Networks (RBFNN). The inputs of the network (irradiance and temperature) are classified before they are fed into the appropriate RBFNN for either the training or the estimation process, while the output is the reference voltage. The main advantage of the proposed methodology, compared to a conventional single neural network-based approach, is its distinct generalization ability with respect to the nonlinear and dynamic behavior of a PV generator. In fact, the neuro-fuzzy network is a neural-network-based multi-model machine learning approach that defines a set of local models emulating the complex and non-linear behavior of a PV generator under a wide range of operating conditions. Simulation results under several rapid irradiance variations show that the proposed MPPT method achieves higher efficiency than a conventional single neural network.
Abstract: Major Depressive Disorder has become a burden on
medical expenses in Taiwan, as it has around the world.
Major Depressive Disorder can be defined into different categories
based on previous human activities. Using machine learning, we can
classify the emotion expressed in textual language in advance, which
can help medical diagnosis recognize variants of Major Depressive
Disorder automatically. Association language features capture the
characteristic words and relationships among words discovered in
sentences. Classification, however, suffers from an
overlapping-category problem. In this paper, we aim to improve
classification performance on the principle that categories should
not overlap. We present an approach that discovers words in
sentences which occur with high frequency yet do not overlap
across categories, called Association Language Features by its
Category (ALFC). Experimental results show that ALFC discriminates
well among categories of Major Depressive Disorder and achieves
better performance. We also compare the approach with a baseline and
with mutual information approaches that use single words alone or a
correlation measure.
Abstract: Interpretation of aerial images is an important task in
various applications. Image segmentation can be viewed as the essential
step for extracting information from aerial images. Among many
developed segmentation methods, the technique of clustering has been
extensively investigated and used. However, determining the number
of clusters in an image is inherently a difficult problem, especially
when a priori information on the aerial image is unavailable. This
study proposes a support vector machine approach for clustering
aerial images. Three cluster validity indices, distance-based index,
Davies-Bouldin index, and Xie-Beni index, are utilized as quantitative
measures of the quality of clustering results. Comparisons of the
effectiveness of these indices and of various parameter settings of
the proposed method are conducted. Experimental results are provided
to illustrate the feasibility of the proposed approach.
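Of the three validity indices, the Davies-Bouldin index is easy to state compactly: for each cluster, take the worst ratio of summed within-cluster scatters to centroid separation, then average over clusters; lower values indicate better clusterings. A minimal version, assuming Euclidean distance and centroid-based scatter:

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index: average, over clusters, of the worst-case
    ratio of within-cluster scatter to between-centroid separation."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    cents = {l: [sum(c) / len(c) for c in zip(*pts)]
             for l, pts in clusters.items()}
    scatter = {l: sum(math.dist(p, cents[l]) for p in pts) / len(pts)
               for l, pts in clusters.items()}
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)
```

A candidate segmentation of an aerial image can then be scored by evaluating this index over the pixel feature vectors and cluster labels it produces.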
Abstract: This paper applies Bayesian Networks to support
information extraction from unstructured, ungrammatical, and
incoherent data sources for semantic annotation. A tool has been
developed that combines ontologies, machine learning, and
information extraction and probabilistic reasoning techniques to
support the extraction process. Data acquisition is performed with the
aid of knowledge specified in the form of an ontology. Due to the
variable size of information available on different data sources, it is
often the case that the extracted data contains missing values for
certain variables of interest. It is desirable in such situations to
predict the missing values. The methodology, presented in this paper,
first learns a Bayesian network from the training data and then uses it
to predict missing data and to resolve conflicts. Experiments have
been conducted to analyze the performance of the presented
methodology. The results look promising, as the methodology
achieves a high degree of precision and recall for information
extraction and reasonably good accuracy for predicting missing
values.
Abstract: Knowledge Discovery in Databases (KDD) has
evolved into an important and active area of research because of
theoretical challenges and practical applications associated with the
problem of discovering (or extracting) interesting and previously
unknown knowledge from very large real-world databases. Rough
Set Theory (RST) is a mathematical formalism for representing
uncertainty that can be considered an extension of the classical set
theory. It has been used in many different research areas, including
those related to inductive machine learning and reduction of
knowledge in knowledge-based systems. One important concept
related to RST is that of a rough relation. In this paper we present
the current status of research on applying rough set theory to KDD,
which will be helpful for handling the characteristics of real-world
databases. The main aim is to show how rough sets and rough set
analysis can be effectively used to extract knowledge from large
databases.
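The rough set machinery referred to above rests on lower and upper approximations of a concept under an indiscernibility relation. A minimal sketch over a toy decision table follows; the attribute names and objects are invented for illustration.

```python
def approximations(objects, attrs, target):
    """Lower/upper approximation of a target set of objects under the
    indiscernibility relation induced by the chosen attributes.

    objects: dict name -> dict of attribute values
    attrs:   attribute names defining indiscernibility
    target:  set of object names (the concept to approximate)
    """
    # Partition objects into equivalence classes of indiscernible objects.
    classes = {}
    for name, vals in objects.items():
        key = tuple(vals[a] for a in attrs)
        classes.setdefault(key, set()).add(name)
    lower, upper = set(), set()
    for eq in classes.values():
        if eq <= target:
            lower |= eq          # certainly in the concept
        if eq & target:
            upper |= eq          # possibly in the concept
    return lower, upper
```

The gap between the two approximations (the boundary region) is exactly the uncertainty that rough-set-based KDD methods quantify when extracting rules from inconsistent data.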
Abstract: The availability of high-dimensional biological datasets, such as those from gene expression, proteomic, and metabolic experiments, can be leveraged for the diagnosis and prognosis of diseases. Many classification methods in this area have been studied to predict disease states and separate between predefined classes, such as patients with a particular disease versus healthy controls. However, most of the existing research focuses only on a specific dataset; there is a lack of generic comparisons between classifiers that might provide a guideline for biologists or bioinformaticians to select the proper algorithm for new datasets. In this study, we compare the performance of popular classifiers, namely Support Vector Machine (SVM), Logistic Regression, k-Nearest Neighbor (k-NN), Naive Bayes, Decision Tree, and Random Forest, on mock datasets. We mimic common biological scenarios, simulating various proportions of real discriminating biomarkers and different effect sizes thereof. The results show that SVM performs quite stably and reaches a higher AUC than the other methods, which may be explained by the ability of SVM to minimize the probability of error. Moreover, Decision Tree, with its good applicability for diagnosis and prognosis, shows good performance in our experimental setup. Logistic Regression and Random Forest, however, depend strongly on the proportion of discriminating biomarkers and perform better when there is a higher number of discriminators.
Abstract: Bone remodeling occurs by the balanced action of
bone resorbing osteoclasts (OC) and bone-building osteoblasts.
Increased bone resorption by excessive OC activity contributes
to malignant and non-malignant diseases including osteoporosis.
To study OC differentiation and function, OC formed in
in vitro cultures are currently counted manually, a tedious
procedure which is prone to inter-observer differences. Aiming
for an automated OC-quantification system, classification of
OC and precursor cells was done on fluorescence microscope
images based on the distinct appearance of fluorescent nuclei.
Following ellipse fitting to nuclei, a combination of eight
features enabled clustering of OC and precursor cell nuclei.
After evaluating different machine-learning techniques, logistic
regression (LOGREG) achieved 74% correctly classified OC and
precursor cell nuclei, outperforming human experts (best expert:
55%). In combination with the automated detection of total cell
areas, this system makes it possible to measure various cell
parameters and, most importantly, to quantify proteins involved in
osteoclastogenesis.