Abstract: Most greenhouse growers want a predictable yield so that they can accurately meet market requirements. The purpose of this paper is to model a simple but often satisfactory supervised classification method. The original naive Bayes has a serious weakness: it is affected by redundant predictors. In this paper, a regularization technique is used to obtain a computationally efficient classifier based on naive Bayes. The suggested construction, which uses an L1 penalty, is capable of removing redundant predictors, and a modification of the LARS algorithm is devised to solve the resulting problem, making the method applicable to a wide range of data. In the experimental section, a study is conducted to examine the effect of redundant and irrelevant predictors, and the method is tested on the WSG data set for tomato yields, where there are many more predictors than observations and the goal is to predict weekly yield. Finally, the modified approach is compared with several naive Bayes variants and other classification algorithms (SVM and kNN), and is shown to perform fairly well.
Abstract: Sensor-based activity recognition systems usually consider which sensors have been activated to perform an activity. The system then combines the conditional probabilities of those sensors to represent different activities and makes its decision on that basis. However, information about the sensors that are not activated may also be of great help in deciding which activity has been performed. This paper proposes an approach in which sensor data related to both the usage and non-usage of objects are used to classify activities. Experimental results show the promising performance of the proposed method.
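The idea of using non-activated sensors as evidence can be illustrated with a Bernoulli-style naive Bayes model. The abstract does not describe the authors' actual model, so the following is only a minimal sketch under stated assumptions: the sensor names, activities, training samples, and the `use_absence` switch are all invented for illustration.

```python
import math
from collections import defaultdict

def train(samples, sensors, alpha=1.0):
    """Estimate priors and P(sensor fires | activity) with Laplace smoothing."""
    class_counts = defaultdict(int)
    fired = defaultdict(lambda: defaultdict(int))
    for active, label in samples:
        class_counts[label] += 1
        for s in active:
            fired[label][s] += 1
    priors = {c: n / len(samples) for c, n in class_counts.items()}
    p_fire = {
        c: {s: (fired[c][s] + alpha) / (class_counts[c] + 2 * alpha) for s in sensors}
        for c in class_counts
    }
    return priors, p_fire

def log_score(active, label, priors, p_fire, use_absence=True):
    """Log of prior times likelihood; optionally include silent sensors."""
    score = math.log(priors[label])
    for s, p in p_fire[label].items():
        if s in active:
            score += math.log(p)        # evidence from a sensor that fired
        elif use_absence:
            score += math.log(1.0 - p)  # evidence from a sensor that stayed silent
    return score

def classify(active, priors, p_fire, use_absence=True):
    return max(priors, key=lambda c: log_score(active, c, priors, p_fire, use_absence))

# Toy data, invented for illustration only.
SENSORS = ["kettle", "cup", "stove"]
samples = [({"kettle", "cup"}, "make_tea")] * 3 + [({"kettle", "cup", "stove"}, "cook")] * 4
priors, p_fire = train(samples, SENSORS)
observed = {"kettle", "cup"}

print(classify(observed, priors, p_fire, use_absence=False))  # cook
print(classify(observed, priors, p_fire, use_absence=True))   # make_tea
```

In this toy setting, the activated sensors alone favor the more frequent activity, while the silent stove sensor tips the decision the other way.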
Abstract: Microaneurysms are a key indicator of diabetic retinopathy, which can potentially damage the retina. Early detection and automatic quantification are key to preventing further damage. In this paper, which focuses on automatic microaneurysm (MA) detection in images acquired through non-dilated pupils, we present a series of experiments on feature selection and automatic microaneurysm pixel classification. We found that the best feature set is a combination of 10 features: the pixel's intensity in the shade-corrected image, the pixel hue, the standard deviation of the shade-corrected image, DoG4, the area of the candidate MA, the perimeter of the candidate MA, the eccentricity of the candidate MA, the circularity of the candidate MA, the mean intensity of the candidate MA in the shade-corrected image, and the ratio of the major axis length to the minor axis length of the candidate MA. The overall sensitivity, specificity, precision, and accuracy are 84.82%, 99.99%, 89.01%, and 99.99%, respectively.
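The four reported measures follow the standard confusion-matrix definitions. The sketch below computes them from hypothetical pixel counts; the counts are invented and are not the paper's data. They are chosen only to show how specificity and accuracy sit close to 100% when negative pixels vastly outnumber positive ones, which is why sensitivity and precision are usually the more informative figures in such a setting.

```python
def metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures for a binary pixel classifier."""
    return {
        "sensitivity": tp / (tp + fn),              # share of true MA pixels found
        "specificity": tn / (tn + fp),              # share of background pixels rejected
        "precision": tp / (tp + fp),                # share of detections that are correct
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for a heavily imbalanced image (illustration only).
m = metrics(tp=85, fp=10, tn=99_890, fn=15)
for name, value in m.items():
    print(f"{name}: {value:.4%}")
```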
Abstract: Naïve Bayes classifiers are simple probabilistic
classifiers. Classification extracts patterns from a data file with a set
of labeled training examples and is currently one of the most
significant areas in data mining. However, Naïve Bayes assumes
independence among the features. Structural learning among the
features thus helps in the classification problem. In this study,
structural learning in a Bayesian network is proposed for cases where
there are relationships between the features when using Naïve Bayes.
Structural learning is shown to improve classification when
relationships exist between the features, that is, when they are not
independent.
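The independence assumption and its relaxation can be written out as follows. The abstract gives no formulas, so these are the standard textbook forms rather than the authors' own notation; the parent set symbol \mathrm{pa}(x_i) is an assumed notation for the features that structure learning links to x_i.

```latex
% Naive Bayes: features x_1, ..., x_n are conditionally independent given class c
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)

% With learned structure: each feature may also depend on its parent features
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P\bigl(x_i \mid c, \mathrm{pa}(x_i)\bigr)
```

When every parent set is empty, the second expression reduces to the first, so the structured model can only differ from Naïve Bayes where dependencies between features are actually found.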
Abstract: Mobile agents are a powerful approach to developing distributed systems, since they migrate to hosts that have the resources they need to execute individual tasks. In a dynamic environment such as a peer-to-peer network, agents have to be generated frequently and dispatched to the network, so they consume a certain amount of bandwidth on each link. If too many agents migrate through one or several links at the same time, they introduce too much transfer overhead; eventually these links become busy and indirectly block network traffic. There is therefore a need for routing algorithms that take traffic load into account. In this paper we seek to combine a probabilistic scheme, driven by a quality measure of the network traffic situation, with the agent's decision about which hop to migrate to next, made using decision tree learning algorithms.
Abstract: Data mining can be described as a technique for extracting
information from data. It is the process of obtaining hidden
information and then turning it into qualified knowledge by statistical
and artificial intelligence techniques. One of its application areas is
medicine, where decision support systems for diagnosis are built by
deriving meaningful information from given medical data. This
study presents a decision support system for the diagnosis of illness
that makes use of data mining and three different artificial intelligence
classifier algorithms, namely Multilayer Perceptron, Naive Bayes
Classifier and J.48. The Pima Indian dataset of the UCI Machine Learning
Repository was used. This dataset includes urinary and blood test
results of 768 patients. These test results consist of 8 different features.
The classification results obtained were compared with previous studies.
Suggestions for future studies are presented.
Abstract: Network security attacks are violations of
information security policy that have received much attention from the
computational intelligence community in recent decades. Data mining
has become a very useful technique for detecting network intrusions
by extracting useful knowledge from large volumes of network data
or logs. The naïve Bayesian classifier is one of the most popular data
mining algorithms for classification, and provides an optimal way
to predict the class of an unknown example. Tests have shown that
a single set of probabilities derived from the data is not enough to
achieve a good classification rate. In this paper, we propose a new
learning algorithm for mining network logs to detect network intrusions
through a naïve Bayesian classifier, which first clusters the network
logs into several groups based on the similarity of the logs, and then
calculates the prior and conditional probabilities for each group of
logs. To classify a new log, the algorithm checks which cluster
the log belongs to and then uses that cluster's probability set to classify
the new log. We tested the performance of our proposed algorithm on
the KDD99 benchmark network intrusion detection dataset,
and the experimental results show that it improves detection rates
and reduces false positives for different types of network
intrusions.
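The two-step scheme (cluster the logs, then keep one probability set per cluster) can be sketched as follows. The abstract does not say how similarity is measured, so `cluster_of` below is a deliberately crude stand-in that groups records by protocol; the feature names and the six training records are invented for illustration and are not taken from KDD99.

```python
import math
from collections import Counter, defaultdict

def cluster_of(record):
    # Assumption: a stand-in for similarity-based clustering, grouping by protocol.
    return record["protocol"]

def train_nb(rows, features, alpha=1.0):
    """Collect the counts needed for one cluster's prior and conditional probabilities."""
    labels = Counter(label for _, label in rows)
    counts = defaultdict(Counter)   # (label, feature) -> counts of feature values
    values = defaultdict(set)       # feature -> set of values seen in this cluster
    for rec, label in rows:
        for f in features:
            counts[(label, f)][rec[f]] += 1
            values[f].add(rec[f])
    return {"labels": labels, "counts": counts, "values": values,
            "n": len(rows), "alpha": alpha, "features": features}

def predict_nb(model, rec):
    """Return the label with the highest Laplace-smoothed log posterior."""
    best, best_score = None, -math.inf
    for label, n_label in model["labels"].items():
        score = math.log(n_label / model["n"])
        for f in model["features"]:
            k = len(model["values"][f])
            seen = model["counts"][(label, f)][rec[f]]
            score += math.log((seen + model["alpha"]) / (n_label + model["alpha"] * k))
        if score > best_score:
            best, best_score = label, score
    return best

def train_clustered(rows, features):
    groups = defaultdict(list)
    for rec, label in rows:
        groups[cluster_of(rec)].append((rec, label))
    return {cid: train_nb(group, features) for cid, group in groups.items()}

def classify(models, rec):
    # Raises KeyError for a cluster never seen in training; a real system needs a fallback.
    return predict_nb(models[cluster_of(rec)], rec)

# Toy records, invented for illustration only.
FEATURES = ["flag", "size"]
rows = [
    ({"protocol": "tcp", "flag": "S0", "size": "small"}, "attack"),
    ({"protocol": "tcp", "flag": "SF", "size": "small"}, "normal"),
    ({"protocol": "tcp", "flag": "SF", "size": "large"}, "normal"),
    ({"protocol": "icmp", "flag": "SF", "size": "large"}, "attack"),
    ({"protocol": "icmp", "flag": "SF", "size": "large"}, "attack"),
    ({"protocol": "icmp", "flag": "SF", "size": "small"}, "normal"),
]
models = train_clustered(rows, FEATURES)
print(classify(models, {"protocol": "tcp", "flag": "SF", "size": "large"}))   # normal
print(classify(models, {"protocol": "icmp", "flag": "SF", "size": "large"}))  # attack
```

The same feature values receive different labels in the two clusters, which is the effect a single global probability set cannot capture.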
Abstract: Similar-document search and document
management are important topics in text mining. One of the
most important parts of similar-document research is the
process of classifying or clustering the documents. In this study, a
similar-document search approach is developed that also addresses the
case of documents belonging to multiple categories (the multiple
categories problem). The proposed method, which is based on Fuzzy
Similarity Classification (FSC), has been compared with the Rocchio
algorithm and the naive Bayes method, which are widely used in text
mining. Empirical results show that the proposed method is quite
successful and can be applied effectively. In the second stage, a
multiple-categories vector method is used that is based on how
frequently categories occur together.
Empirical results show that performance almost doubles
when the proposed method is compared with the classical approach.
Abstract: Naive Bayes Nearest Neighbor (NBNN) and its variants, i.e., local NBNN and the NBNN kernels, are local feature-based classifiers that have achieved impressive performance in image classification. By exploiting instance-to-class (I2C) distances (an instance is an image or video in image or video classification), they avoid the quantization errors of local image descriptors in the bag of words (BoW) model. However, the performance of NBNN, local NBNN and the NBNN kernels has not been validated on video analysis. In this paper, we introduce these three classifiers into human action recognition and conduct comprehensive experiments on the benchmark KTH and the realistic HMDB datasets. The results show that these I2C-based classifiers consistently outperform the SVM classifier with the BoW model.
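The instance-to-class distance behind plain NBNN can be sketched in a few lines: each local descriptor of the query is matched to its nearest descriptor in a class's pooled training descriptors, and the squared distances are summed. This is a minimal sketch of the basic NBNN rule only, not of local NBNN or the NBNN kernels, and the two-dimensional descriptors and class names are invented for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def i2c_distance(descriptors, class_pool):
    """Sum, over query descriptors, of the squared distance to the nearest
    descriptor pooled from all training instances of one class."""
    return sum(min(sq_dist(d, p) for p in class_pool) for d in descriptors)

def nbnn_classify(descriptors, pools):
    """Pick the class with the smallest instance-to-class distance."""
    return min(pools, key=lambda c: i2c_distance(descriptors, pools[c]))

# Toy descriptor pools, invented for illustration only.
pools = {
    "walking": [(0.0, 0.0), (1.0, 0.0)],
    "running": [(5.0, 5.0), (6.0, 5.0)],
}
query = [(0.1, 0.0), (0.9, 0.1)]
print(nbnn_classify(query, pools))  # walking
```

No descriptor is ever quantized into a visual word here, which is the property the abstract contrasts with the BoW model.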
Abstract: As a popular rank-reduced vector space approach,
Latent Semantic Indexing (LSI) has been used in information
retrieval and other applications. In this paper, an LSI-based content
vector model for text classification is presented, which constructs
multiple augmented category LSI spaces and classifies texts by their
content. The model integrates the class discriminative information
from the training data and is equipped with several pertinent feature
selection and text classification algorithms. The proposed classifier
has been applied to email classification, and experiments on a
benchmark spam testing corpus (PU1) have shown that the approach
represents a competitive alternative to other email classifiers based
on the well-known SVM and naïve Bayes algorithms.
Abstract: In this study, a high-accuracy protein-protein interaction
prediction method is developed. The key advantage of the proposed
method is that it uses only the sequence information of proteins when
predicting interactions. The method extracts phylogenetic profiles of
proteins by using their sequence information. Combining the phylogenetic
profiles of two proteins by checking existence of homologs
in different species and fitting this combined profile into a statistical
model, it is possible to make predictions about the interaction status
of two proteins.
For this purpose, we apply a collection of pattern recognition
techniques to the dataset of combined phylogenetic profiles of protein
pairs. Support Vector Machines, Feature Extraction using ReliefF,
Naive Bayes Classification, K-Nearest Neighbor Classification,
Decision Trees, and Random Forest Classification are the methods
we applied to find the classification method that best predicts
the interaction status of protein pairs. Random Forest Classification
outperformed all other methods, with a prediction accuracy of 76.93%.
Abstract: Keystroke authentication is a new access control approach
that identifies legitimate users via their typing behavior. In this paper,
machine learning techniques are adapted for keystroke authentication.
Seven learning methods are used to build models to differentiate user
keystroke patterns. The selected classification methods are Decision
Tree, Naive Bayesian, Instance Based Learning, Decision Table, One
Rule, Random Tree and K-star. Three of these methods are studied in
more detail. The results show that machine learning is a feasible
alternative for keystroke authentication. Compared to the conventional
Nearest Neighbour method used in recent research, learning methods,
especially Decision Tree, can be more accurate. In addition, the
experimental results reveal that 3-grams are more accurate than
2-grams and 4-grams for feature extraction. Also, combinations of
attributes tend to result in higher accuracy.
Abstract: Recent years have seen a growing trend towards the
integration of multiple information sources to support large-scale
prediction of protein-protein interaction (PPI) networks in model
organisms. Despite advances in computational approaches, the
combination of multiple “omic” datasets representing the same type
of data, e.g. different gene expression datasets, has not been
rigorously studied. Furthermore, there is a need to further investigate
the inference capability of powerful approaches, such as fully connected
Bayesian networks, in the context of the prediction of PPI
networks. This paper addresses these limitations by proposing a
Bayesian approach to integrate multiple datasets, some of which
encode the same type of “omic” data to support the identification of
PPI networks. The case study reported involved the combination of
three gene expression datasets relevant to human heart failure (HF).
In comparison with two traditional methods, Naive Bayesian and
maximum likelihood ratio approaches, the proposed technique can
accurately identify known PPI and can be applied to infer potentially
novel interactions.
Abstract: In this paper, a new learning approach for network
intrusion detection using naïve Bayesian classifier and ID3 algorithm
is presented, which identifies effective attributes from the training
dataset, calculates the conditional probabilities for the best attribute
values, and then correctly classifies all the examples of training and
testing dataset. Most of the current intrusion detection datasets are
dynamic, complex and contain a large number of attributes. Some of
the attributes may be redundant or contribute little to detection.
Tests have shown that selecting significant attributes
is important for designing real-world intrusion detection
systems (IDS). The purpose of this study is to identify effective
attributes from the training dataset to build a classifier for network
intrusion detection using data mining algorithms. The experimental
results on the KDD99 benchmark intrusion detection dataset demonstrate
that this new approach achieves high classification rates and reduces
false positives using limited computational resources.
Abstract: The problem of spam has seriously troubled the Internet community during the last few years and has now reached an alarming scale. Observations made at CERN (European Organization for Nuclear Research, located in Geneva, Switzerland) show that spam can constitute up to 75% of daily SMTP traffic. A naïve Bayesian classifier based on a bag-of-words representation of an email is widely used to stop this unwanted flood, as it combines good performance with simplicity of the training and classification processes. However, given the constantly changing patterns of spam, it is necessary to ensure that the classifier can adapt online. This work proposes combining such a classifier with another naïve Bayesian classifier (NBC) based on pairs of adjacent words. Only the latter is retrained with examples of spam reported by users. Tests are performed on considerable sets of mails from both public spam archives and CERN mailboxes. They suggest that this architecture can increase spam recall without the loss of precision that occurs when only the NBC based on single words is retrained.
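The two-classifier arrangement can be sketched with one small naive Bayes class that is instantiated twice, once over single words and once over pairs of adjacent words. The abstract does not say how the two scores are combined or how tokens are extracted, so the tokenizer, the `TinyNB` class, and the sample messages below are all assumptions made for illustration. The sketch only demonstrates the retraining policy: user-reported spam updates the pair-based model and leaves the word-based model untouched.

```python
import math
import re
from collections import Counter

def words(text):
    """Lowercased word tokens (a simplistic tokenizer, assumed here)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_pairs(text):
    """Pairs of adjacent words, joined into single tokens."""
    w = words(text)
    return [f"{a} {b}" for a, b in zip(w, w[1:])]

class TinyNB:
    """Multinomial naive Bayes over tokens, updatable one message at a time."""

    def __init__(self, extract):
        self.extract = extract
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.docs = Counter()

    def update(self, text, label):
        self.docs[label] += 1
        self.tokens[label].update(self.extract(text))

    def log_odds(self, text):
        """Laplace-smoothed log odds of spam versus ham; positive favors spam."""
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) + 1
        total = {c: sum(self.tokens[c].values()) for c in ("spam", "ham")}
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for t in self.extract(text):
            p_spam = (self.tokens["spam"][t] + 1) / (total["spam"] + vocab)
            p_ham = (self.tokens["ham"][t] + 1) / (total["ham"] + vocab)
            score += math.log(p_spam / p_ham)
        return score

# Initial training of both models on invented messages.
word_nb, pair_nb = TinyNB(words), TinyNB(word_pairs)
for model in (word_nb, pair_nb):
    model.update("win money now", "spam")
    model.update("meeting notes attached", "ham")

reported = "cheap watches online"
before = pair_nb.log_odds(reported)
pair_nb.update(reported, "spam")   # only the pair-based model is retrained
after = pair_nb.log_odds(reported)
print(before, after, word_nb.log_odds(reported))
```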
Abstract: In this paper a combined feature selection method is
proposed which takes advantage of sample domain filtering,
resampling and feature subset evaluation methods to reduce
dimensions of huge datasets and select reliable features. This method
utilizes both feature space and sample domain to improve the process
of feature selection and uses a combination of Chi squared with
Consistency attribute evaluation methods to seek reliable features.
This method consists of two phases. The first phase filters and
resamples the sample domain and the second phase adopts a hybrid
procedure to find the optimal feature space by applying Chi squared,
Consistency subset evaluation methods and genetic search.
Experiments on various sized datasets from UCI Repository of
Machine Learning databases show that the performance of five
classifiers (Naïve Bayes, Logistic, Multilayer Perceptron, Best First
Decision Tree and JRIP) improves simultaneously and the
classification error for these classifiers decreases considerably. The
experiments also show that this method outperforms other feature
selection methods.
Abstract: This paper proposes a technique to protect against
email bombing. The technique employs a statistical approach, Naïve
Bayes (NB), and Neural Networks to show that it is possible to
differentiate between good and bad traffic to protect against email
bombing attacks. Neural networks and Naïve Bayes can be trained
by utilizing many email messages that include both input and output
data for legitimate and non-legitimate emails. The input to the model
includes the contents of the body of the messages, the subject, and
the headers. This information will be used to determine if the email
is normal or an attack email. Preliminary tests suggest that Naïve
Bayes can be trained to produce an accurate response to confirm
which email represents an attack.
Abstract: As email communications have no consistent
authentication procedure to ensure authenticity, we present an
investigative analysis approach for detecting forged emails based on
Random Forests and Naïve Bayes classifiers. Instead of investigating
the email headers, we use the body content to extract a unique writing
style for each of the possible suspects. Our approach consists of four main
steps: (1) the cybercrime investigator extracts different effective
features, including structural, lexical, linguistic, and syntactic
evidence, from previous emails of all the possible suspects; (2) the
extracted feature vectors are normalized to increase the accuracy
rate; (3) the normalized features are then used to train the learning
engine; (4) upon receiving the anonymous email (M), we apply the
feature extraction process to produce a feature vector. Finally, using
the machine learning classifiers, the email is assigned to the
suspect whose writing style most closely matches M. Experimental
results on real data sets show the improved performance of the
proposed method and its ability to identify authors with a
very limited number of features.
Abstract: This paper presents a semi-supervised learning algorithm called Iterative-Cross Training (ICT) to solve the Web page classification problem. We apply inductive logic programming (ILP) as a strong learner in ICT. The objective of this research is to evaluate the potential of the strong learner to boost the performance of the weak learner of ICT. We compare the result with supervised Naive Bayes, a well-known algorithm for the text classification problem. The performance of our learning algorithm is also compared with other semi-supervised learning algorithms, namely Co-Training and EM. The experimental results show that the ICT algorithm outperforms those algorithms and that the performance of the weak learner can be enhanced by the ILP system.
Abstract: In this paper, we present a new learning algorithm for
anomaly-based network intrusion detection using an improved
self-adaptive naïve Bayesian tree (NBTree), which induces a hybrid of
a decision tree and a naïve Bayesian classifier. The proposed approach
improves the balance of detection across different attack types and keeps
false positives at an acceptable level in intrusion detection. In
large, complex and dynamic intrusion detection datasets, the detection
accuracy of the naïve Bayesian classifier does not scale up as well as
that of a decision tree. It has been successfully shown in other problem
domains that the naïve Bayesian tree improves classification rates on
large datasets. In a naïve Bayesian tree, the internal nodes hold splits
as in regular decision trees, but the leaves contain naïve Bayesian
classifiers. The experimental results on the KDD99 benchmark network
intrusion detection dataset demonstrate that this new approach improves
the detection rates for different attack types and reduces false
positives in network intrusion detection.