Abstract: We assume an IoT-based smart-home environment in which the on-off status of each electrical appliance, including the room lights, can be recognized in real time by monitoring and analyzing the smart meter data. At any moment in such an environment, we can recognize what the household or the user is doing by referring to the status data of the appliances. In this paper, we focus on a smart-home service that activates a robot vacuum cleaner at the right time by recognizing the user's situation, which requires a situation-aware model that can distinguish situations that allow vacuum cleaning (Yes) from those that do not (No). As our candidate models, we learn a few classifiers, such as naïve Bayes, decision tree, and logistic regression, that map the appliance-status data into Yes and No situations. Our training and test data are obtained from simulations of user behaviors, in which a sequence of user situations such as cooking, eating, and dish washing is generated, with the status of the relevant appliances changing in accordance with the situation changes. During the simulation, both the situation transitions and the resulting appliance status are determined stochastically. To compare the performance of the aforementioned classifiers, we obtain their learning curves for different types of users through simulations. Our empirical study reveals that naïve Bayes achieves slightly better classification accuracy than the other classifiers compared.
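As a hedged illustration of the kind of mapping such a model learns, here is a minimal Bernoulli naïve Bayes sketch over binary appliance-status vectors; the appliance names and toy data below are invented for the example, not taken from the paper.

```python
import math
from collections import defaultdict

def train_naive_bayes(samples, labels, alpha=1.0):
    """Fit per-appliance Bernoulli likelihoods with Laplace smoothing.

    samples: list of dicts mapping appliance name -> 0/1 (off/on);
    labels: "Yes" (cleaning allowed) or "No" for each sample.
    """
    counts = {l: defaultdict(int) for l in set(labels)}
    totals = {l: 0 for l in set(labels)}
    for x, y in zip(samples, labels):
        totals[y] += 1
        for appliance, on in x.items():
            counts[y][appliance] += on

    def predict(x):
        best, best_lp = None, -math.inf
        n = len(labels)
        for y, t in totals.items():
            lp = math.log(t / n)  # log class prior
            for appliance, on in x.items():
                p_on = (counts[y][appliance] + alpha) / (t + 2 * alpha)
                lp += math.log(p_on if on else 1.0 - p_on)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

    return predict

# Toy usage: TV on suggests a situation that permits cleaning.
train_x = [{"tv": 1, "stove": 0}, {"tv": 0, "stove": 1},
           {"tv": 1, "stove": 0}, {"tv": 0, "stove": 1}]
train_y = ["Yes", "No", "Yes", "No"]
predict = train_naive_bayes(train_x, train_y)
print(predict({"tv": 1, "stove": 0}))  # → Yes
```

The same interface would accept decision-tree or logistic-regression learners in place of `train_naive_bayes`, which is how the paper's comparison is framed.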
Abstract: Texture is an important characteristic in real and
synthetic scenes. Texture analysis plays a critical role in inspecting
surfaces and provides important techniques in a variety of
applications. Although several descriptors have been presented to
extract texture features, the development of object recognition is still a
difficult task due to the complex aspects of texture. Recently, many
robust and scaling-invariant image features such as SIFT, SURF and
ORB have been successfully used in image retrieval and object
recognition. In this paper, we compare the performance of these
feature descriptors for texture classification using k-means
clustering. Different classifiers, including K-NN, Naive Bayes, Back
Propagation Neural Network, Decision Tree, and Kstar, were applied
to three texture image sets: UIUCTex, KTH-TIPS, and Brodatz.
Experimental results reveal that SIFT achieves the best average
accuracy on UIUCTex and KTH-TIPS, while SURF performs best on
the Brodatz texture set. Among all the classifiers used, the BP neural
network works best in the test set classification.
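The k-means step above builds a bag-of-visual-words representation: local descriptors are clustered into "visual words" and each image becomes a word histogram. A minimal pure-Python sketch follows; the 2-D synthetic descriptors stand in for SIFT/SURF/ORB vectors, which in practice are 128- or 64-dimensional.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # recompute each non-empty cluster's centroid
                centers[i] = tuple(sum(d) / len(cl) for d in zip(*cl))
    return centers

def bow_histogram(descriptors, centers):
    """Quantize each descriptor to its nearest visual word and count."""
    hist = [0] * len(centers)
    for p in descriptors:
        i = min(range(len(centers)), key=lambda c: sum((a - b) ** 2
                                                       for a, b in zip(p, centers[c])))
        hist[i] += 1
    return hist

rng = random.Random(1)
descs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + \
        [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(50)]
centers = kmeans(descs, k=2)
hist = bow_histogram(descs, centers)
print(hist)  # e.g. [50, 50] for two well-separated descriptor clusters
```

The resulting histograms are what the K-NN, Naive Bayes, and other classifiers named above would consume as feature vectors.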
Abstract: Liver cancer is one of the common diseases that cause death. Early detection is important for diagnosis and for reducing mortality. Improvements in medical imaging and image processing techniques have significantly enhanced the interpretation of medical images. Computer-Aided Diagnosis (CAD) systems based on these techniques play a vital role in the early detection of liver disease and hence reduce the liver cancer death rate. This paper presents an automated CAD system consisting of three stages: first, automatic liver segmentation and lesion detection; second, feature extraction; and finally, classification of liver lesions into benign and malignant using a novel contrasting feature-difference approach. Several types of intensity and texture features are extracted from both the lesion area and its surrounding normal liver tissue. The difference between the features of the two areas is then used as the new lesion descriptor. Machine learning classifiers are trained on the new descriptors to automatically classify liver lesions as benign or malignant. The experimental results show promising improvements. Moreover, the proposed approach can overcome the problems of varying ranges of intensity and texture across patients, demographics, and imaging devices and settings.
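The contrasting feature-difference idea can be sketched very simply: compute the same statistics inside the lesion and in the surrounding normal tissue, and use their difference as the descriptor. The statistics and pixel values below are illustrative placeholders for the paper's fuller intensity and texture feature set.

```python
import statistics

def region_features(pixels):
    """Toy intensity features of a region: mean, std, min, max."""
    return [statistics.mean(pixels), statistics.pstdev(pixels),
            min(pixels), max(pixels)]

def difference_descriptor(lesion_pixels, surround_pixels):
    """Descriptor = lesion features minus surrounding-tissue features."""
    lf = region_features(lesion_pixels)
    sf = region_features(surround_pixels)
    return [a - b for a, b in zip(lf, sf)]

lesion = [120, 130, 125, 135]     # brighter lesion region
surround = [80, 85, 90, 75]       # surrounding normal tissue
print(difference_descriptor(lesion, surround))  # → [45.0, 0.0, 45, 45]
```

Because both regions come from the same patient and scanner, absolute intensity offsets cancel in the subtraction, which is the mechanism behind the robustness to varying intensity ranges claimed in the abstract.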
Abstract: Advances in spatial and spectral resolution of satellite
images have led to tremendous growth in large image databases. The
data we acquire through satellites, radars, and sensors consists of
important geographical information that can be used for remote
sensing applications such as region planning and disaster management.
Spatial data classification and object recognition are important tasks
for many applications. However, classifying objects and identifying
them manually from images is a difficult task. Object recognition is
often considered a classification problem, and this task can be
performed using machine-learning techniques. Among the many
machine-learning algorithms, classification is commonly done using
supervised classifiers such as Support Vector Machines (SVM) when the
area of interest is known. We propose a classification method
that considers neighboring pixels in a region for feature extraction
and evaluates classifications according to neighboring
classes for semantic interpretation of the region of interest (ROI). A
dataset has been created for training and testing purposes; we
generated the attributes by considering pixel intensity values and
mean reflectance values. We demonstrate the benefits of applying
knowledge discovery and data-mining techniques to
image data for accurate information extraction and classification from
high spatial resolution remote sensing imagery.
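A minimal sketch of the neighborhood-based feature idea, assuming a single-band image stored as a list of rows: each pixel's feature is the mean intensity of its 3x3 neighborhood, clipped at the image border.

```python
def neighborhood_mean(image, r, c):
    """Mean intensity of the 3x3 neighborhood around (r, c), border-clipped."""
    rows, cols = len(image), len(image[0])
    vals = [image[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))]
    return sum(vals) / len(vals)

img = [[10, 10, 10],
       [10, 100, 10],
       [10, 10, 10]]
print(neighborhood_mean(img, 1, 1))  # → 20.0 (the bright pixel is smoothed)
```

Per-pixel features like this, stacked with mean reflectance values, would form the attribute vectors fed to an SVM or similar supervised classifier.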
Abstract: As smartphones are equipped with various sensors,
there have been many studies focused on using these sensors to create
valuable applications. Human activity recognition is one such
application motivated by various welfare applications, such as the
support for the elderly, measurement of calorie consumption, lifestyle
and exercise patterns analyses, and so on. One of the challenges one
faces when using smartphone sensors for activity recognition is that
the number of sensors should be minimized to save battery power. In
this paper, we show that a fairly accurate classifier can be built to
distinguish ten different activities using data from only a single
sensor, the smartphone accelerometer. The approach that we
adopt to deal with this ten-class problem combines several methods.
The features used for classifying these activities include not only the
magnitude of the acceleration vector at each time point, but also the
maximum, the minimum, and the standard deviation of the vector
magnitude within a time window. The experiments compared the
performance of four kinds of basic multi-class classifiers and the
performance of four kinds of ensemble learning methods based on
three kinds of basic multi-class classifiers. The results show that
the method with the highest accuracy is ECOC based on
Random Forest.
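The window features named above can be sketched directly; the sample readings are made up, and the window length and any further features are the paper's design choices, not shown here.

```python
import math
import statistics

def window_features(samples):
    """Per-window features from (ax, ay, az) accelerometer readings:
    max, min, and standard deviation of the acceleration magnitude."""
    mags = [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in samples]
    return {
        "max": max(mags),
        "min": min(mags),
        "std": statistics.pstdev(mags),
    }

# Toy window: two near-stationary readings (gravity only) and one motion burst.
window = [(0, 0, 9.8), (0.6, 0.8, 9.8), (3, 4, 0)]
f = window_features(window)
print(f["min"], f["max"])
```

Feature dicts like this, one per sliding window, are the kind of input a multi-class classifier or ECOC ensemble would be trained on.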
Abstract: To help the expert validate association rules
extracted from data, several quality measures have been proposed in the
literature. We distinguish two categories: objective and subjective
measures. The first depends on a fixed threshold and on the quality
of the data from which the rules are extracted. The second consists
of providing the expert with tools to explore and
visualize rules during the evaluation step. However, the number of
extracted rules to validate remains high, so manually mining
the rules is a very hard task. To solve this problem, we propose in this
paper a semi-automatic method to assist the expert during
association rule validation. Our method uses rule-based
classification as follows: (i) we transform association rules into
classification rules (classifiers); (ii) we use the generated classifiers
for data classification; (iii) we visualize association rules with their
classification quality to inform the expert and assist him
during the validation process.
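Steps (i) and (ii) can be sketched as treating each association rule "antecedent → class" as a classification rule and firing the first rule whose antecedent is contained in the transaction; the rule format and item names below are illustrative.

```python
def classify(transaction, rules, default="unknown"):
    """rules: ordered list of (antecedent_set, predicted_class) pairs.
    Returns the class of the first rule whose antecedent is satisfied."""
    for antecedent, label in rules:
        if antecedent <= transaction:  # every antecedent item is present
            return label
    return default

rules = [({"bread", "butter"}, "breakfast"), ({"soda"}, "snack")]
print(classify({"bread", "butter", "jam"}, rules))  # → breakfast
print(classify({"chips", "soda"}, rules))           # → snack
```

The accuracy of each rule-turned-classifier on held-out data is then the "classification quality" that step (iii) would visualize for the expert.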
Abstract: Phonocardiography is important in appraisal of
congenital heart disease and pulmonary hypertension as it reflects the
duration of right ventricular systoles. The systolic murmur in patients
with intra-cardiac shunt decreases as pulmonary hypertension
develops and may eventually disappear completely as the pulmonary
pressure reaches systemic level. Phonocardiography and auscultation
are non-invasive, low-cost, and accurate methods to assess heart
disease. In this work, an objective signal processing tool based on
wavelets is proposed to extract information from the
phonocardiography signal and classify a murmur as normal or
abnormal. Since the feature vector is large, Binary Particle Swarm
Optimization (PSO) with mutation is proposed for feature selection.
The selected features improve the classification accuracy and were
tested across various classifiers, including Naïve Bayes, kNN, C4.5, and SVM.
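One level of the simplest wavelet decomposition, the Haar transform, illustrates the kind of feature extraction involved; the wavelet family actually used in the paper is not specified here, and the signal values are invented.

```python
def haar_step(signal):
    """One level of the Haar wavelet transform (signal length must be even):
    pairwise averages (approximation) and differences (detail), scaled by √2."""
    s2 = 2 ** 0.5
    approx = [(signal[i] + signal[i + 1]) / s2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s2 for i in range(0, len(signal), 2)]
    return approx, detail

sig = [4, 6, 10, 12, 8, 6, 5, 5]  # toy stand-in for a PCG segment
a, d = haar_step(sig)
print([round(x, 3) for x in a], [round(x, 3) for x in d])
```

Statistics of such approximation and detail coefficients across levels form the large feature vector that the binary PSO step would then prune.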
Abstract: With advances in technology, the expression of
opinions has shifted to the digital world. The
domain of politics, one of the hottest topics of opinion mining
research, is merged here with behavior analysis for affiliation
determination in texts, which constitutes the subject of this paper.
This study aims to classify the text in news/blogs either as
Republican or Democrat with the minimum number of features. As
an initial set, 68 features, 64 of which were Linguistic
Inquiry and Word Count (LIWC) features, were tested against 14
benchmark classification algorithms. In later experiments, the
dimensionality of the feature vector was reduced using 7 feature
selection algorithms. The results show that the “Decision Tree”,
“Rule Induction” and “M5 Rule” classifiers, when used with the “SVM”
and “IGR” feature selection algorithms, performed best, with up to
82.5% accuracy on the given dataset. Further tests on a single feature
and on linguistic-based feature sets showed similar results. The
feature “Function”, as an aggregate feature of the linguistic category,
was found as the most differentiating feature among the 68 features
with the accuracy of 81% in classifying articles either as Republican
or Democrat.
Abstract: The paper presents new results concerning selection of
optimal information fusion formula for ensembles of C-OTDR
channels. The goal of information fusion is to create an integral
classifier designed for effective classification of seismoacoustic
target events. The LPBoost (LP-β and LP-B variants), Multiple
Kernel Learning, and Weighing Inversely as Lipschitz Constants
(WILC) approaches were compared. WILC is a brand-new
approach to optimal fusion of Lipschitz classifier ensembles.
Results of practical usage are presented.
Abstract: The Margin-Based Principle was proposed long
ago, and it has been proven that this principle can reduce the
structural risk and improve performance in both theoretical
and practical respects. Meanwhile, the feed-forward neural network is
a traditional classifier that is currently very popular in deeper
architectures. However, the training algorithm of the feed-forward neural
network derives from the Widrow-Hoff principle, which
minimizes the squared error. In this paper, we propose
a new training algorithm for feed-forward neural networks based
on the Margin-Based Principle, which effectively improves the
accuracy and generalization ability of neural network classifiers
with fewer labelled samples and a flexible network. We have conducted
experiments on four UCI open datasets and achieved good results
as expected. In conclusion, our model can handle sparsely
labelled and high-dimensional datasets with high accuracy, while
migrating from an old ANN method to ours is easy and requires
almost no work.
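The contrast between squared-error and margin-based training can be shown in miniature with a single linear unit standing in for the network (an assumption of this sketch, not the paper's architecture): a margin-based update only corrects examples that violate the margin, as in hinge-loss training, instead of shrinking every residual.

```python
def train_margin(data, lr=0.1, margin=1.0, epochs=100, dim=2):
    """Perceptron-style training with a margin: update only when
    y * score < margin (the hinge-loss subgradient step)."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:  # labels y in {-1, +1}
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < margin:  # margin violated: take a gradient step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data.
data = [((2, 2), 1), ((3, 1), 1), ((-2, -1), -1), ((-1, -3), -1)]
w, b = train_margin(data)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
         for x, _ in data]
print(preds)  # → [1, 1, -1, -1]
```

A Widrow-Hoff (LMS) trainer would instead update on every sample proportionally to the squared-error residual, even for confidently correct ones; the margin rule leaves those alone, which is the intuition behind the reduced structural risk.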
Abstract: The problems arising from unbalanced data sets
generally appear in real world applications. Due to unequal class
distribution, many researchers have found that the performance of
existing classifiers tends to be biased towards the majority class. The
k-nearest neighbors’ nonparametric discriminant analysis is a method
that was proposed for classifying unbalanced classes with good
performance. In this study, the methods of discriminant analysis are
of interest in investigating misclassification error rates for
class-imbalanced data of three diabetes risk groups. The purpose of this
study was to compare the classification performance between
parametric discriminant analysis and nonparametric discriminant
analysis in a three-class classification of class-imbalanced data of
diabetes risk groups. Data from a project maintaining healthy
conditions for 599 employees of a government hospital in Bangkok
were obtained for the classification problem. The employees were
divided into three diabetes risk groups: non-risk (90%), risk (5%),
and diabetic (5%). The original data including the variables of
diabetes risk group, age, gender, blood glucose, and BMI were
analyzed and bootstrapped for 50 and 100 samples, 599 observations
per sample, for additional estimation of the misclassification error
rate. Each data set was explored for the departure of multivariate
normality and the equality of covariance matrices of the three risk
groups. Both the original data and the bootstrap samples showed
non-normality and unequal covariance matrices. The parametric linear
discriminant function, quadratic discriminant function, and the
nonparametric k-nearest neighbors’ discriminant function were
performed over 50 and 100 bootstrap samples and applied to the
original data. In searching for the optimal classification rule, the choices of
prior probabilities were set to both equal proportions (0.33:0.33:0.33)
and unequal proportions (0.90:0.05:0.05), (0.80:0.10:0.10),
and (0.70:0.15:0.15). The results from 50 and 100 bootstrap samples
indicated that the k-nearest neighbors approach when k=3 or k=4 and
the defined prior probabilities of non-risk: risk: diabetic as 0.90:
0.05:0.05 or 0.80:0.10:0.10 gave the smallest error rate of
misclassification. The k-nearest neighbors approach would be
suggested for classifying a three-class-imbalanced data of diabetes
risk groups.
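A simplified sketch of k-nearest-neighbors classification with class priors follows. Note this weights the neighborhood vote by the chosen prior, which is an illustrative simplification of the nonparametric discriminant rule; the feature values and labels are invented.

```python
import math
from collections import Counter

def knn_with_priors(train, query, k, priors):
    """train: list of (feature_tuple, label); priors: label -> prior prob.
    Votes of the k nearest neighbors are weighted by the class priors."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return max(votes, key=lambda label: votes[label] * priors[label])

train = [((1, 1), "non-risk"), ((1, 2), "non-risk"), ((2, 1), "non-risk"),
         ((5, 5), "risk"), ((5, 6), "risk")]
priors = {"non-risk": 0.90, "risk": 0.05, "diabetic": 0.05}
pred = knn_with_priors(train, (1.5, 1.5), k=3, priors=priors)
print(pred)  # → non-risk
```

Varying `k` (e.g. 3 or 4) and the prior vector (0.90:0.05:0.05 vs 0.80:0.10:0.10) is the search over classification rules that the study describes.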
Abstract: In the past few years, the amount of malicious software
increased exponentially and, therefore, machine learning algorithms
became instrumental in identifying clean and malware files through
(semi)-automated classification. When working with very large
datasets, the major challenge is to reach both a very high malware
detection rate and a very low false positive rate. Another challenge
is to minimize the time needed for the machine learning algorithm to
do so. This paper presents a comparative study between different
machine learning techniques such as linear classifiers, ensembles,
decision trees, and various hybrids thereof. The training dataset consists
of approximately 2 million clean files and 200,000 infected files,
which is a realistic quantitative mixture. The paper investigates the
above-mentioned methods with respect to both their performance
(detection rate and false positive rate) and their practicability.
Abstract: This paper introduces an original method for
guaranteed estimation of the accuracy for an ensemble of Lipschitz
classifiers. The solution was obtained as a finite closed set of
alternative hypotheses, which contains an object of classification with
probability of not less than the specified value. Thus, the
classification is represented by a set of hypothetical classes. In this
case, the smaller the cardinality of the discrete set of hypothetical
classes is, the higher the classification accuracy. Experiments have
shown that if the cardinality of the classifier ensemble is increased, then
the cardinality of this set of hypothetical classes is reduced. The
problem of guaranteed estimation of the accuracy of an ensemble
of Lipschitz classifiers is relevant in multichannel classification of
target events in C-OTDR monitoring systems. Results of the practical
usage of the suggested approach for accuracy control in C-OTDR
monitoring systems are presented.
Abstract: This paper introduces an original method of
parametric optimization of the structure of a multimodal
decision-level fusion scheme that combines the partial
classification results obtained from an assembly of mono-modal
classifiers. As a result, a multimodal fusion classifier with the
minimum total error rate has been obtained.
Abstract: Neurons in the nervous system communicate with
each other by producing electrical signals called spikes. To
investigate the physiological function of the nervous system, it is essential
to study the activity of neurons by detecting and sorting spikes in the
recorded signal. In this paper, a method is proposed for the spike
sorting problem based on the nonlinear modeling
of spikes using an exponential autoregressive model. The genetic
algorithm is utilized for model parameter estimation. In this regard
some selected model coefficients are used as features for sorting
purposes. For optimal selection of model coefficients, self-organizing
feature map is used. The results show that modeling spikes with the
nonlinear autoregressive model outperforms its linear counterpart.
Also, the features extracted from the coefficients of the exponential
autoregressive model are better than wavelet-based features,
yielding more compact and well-separated clusters. In the case of
spikes differing in small-scale structure, where principal component
analysis fails to produce separated clouds in the feature space, the
proposed method obtains well-separated clusters, which removes
the necessity of applying complex classifiers.
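In an exponential autoregressive (ExpAR) model, the AR coefficients depend nonlinearly on the previous sample, which is what lets it capture spike-like waveforms a linear AR model cannot. A generation-only sketch of an order-2 model with illustrative parameter values (the paper estimates such parameters with a genetic algorithm, which is not shown here):

```python
import math
import random

def expar_next(history, phi, pi_, gamma, noise):
    """Next sample of an ExpAR(p) process:
    x_t = sum_i (phi_i + pi_i * exp(-gamma * x_{t-1}^2)) * x_{t-i} + noise."""
    x_prev = history[-1]
    coeffs = [p + q * math.exp(-gamma * x_prev * x_prev)
              for p, q in zip(phi, pi_)]
    lags = reversed(history[-len(phi):])  # x_{t-1}, x_{t-2}, ...
    return sum(c * x for c, x in zip(coeffs, lags)) + noise

rng = random.Random(0)
x = [0.1, 0.2]  # initial conditions
for _ in range(50):
    x.append(expar_next(x, phi=[0.5, -0.3], pi_=[0.4, 0.2],
                        gamma=1.0, noise=rng.gauss(0, 0.05)))
print(len(x))  # → 52
```

Fitting the `phi`, `pi_`, and `gamma` parameters to recorded spikes and using selected coefficients as features is the sorting pipeline the abstract describes.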
Abstract: A Distributed Denial of Service (DDoS) attack is a
major threat to cyber security. It originates from the network layer or
the application layer of compromised/attacker systems which are
connected to the network. The impact of this attack ranges from
simple inconvenience in using a particular service to major
failures at the targeted server. When there is heavy traffic flow to a
target server, it is necessary to classify the legitimate access and
attacks. In this paper, a novel method is proposed to detect DDoS
attacks from traces of traffic flow. An access matrix is created
from the traces. As the access matrix is multi-dimensional, Principal
Component Analysis (PCA) is used to reduce the attributes used for
detection. Two classifiers, Naive Bayes and K-Nearest Neighbors,
are used to classify the traffic as normal or abnormal. The
performance of the classifiers with PCA-selected attributes and with the
actual attributes of the access matrix is compared by detection rate and
False Positive Rate (FPR).
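The PCA reduction step can be sketched in pure Python with power iteration for the top principal component; the tiny "access matrix" below is invented, and a real pipeline would keep several components and feed the projections to the Naive Bayes or k-NN classifier.

```python
def top_component(rows, iters=100):
    """Mean-center the rows, then find the leading principal direction
    by power iteration on the (implicit) covariance: v <- X^T (X v)."""
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    v = [1.0] * dim
    for _ in range(iters):
        xv = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
        v = [sum(centered[i][j] * xv[i] for i in range(len(rows)))
             for j in range(dim)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return means, v

def project(row, means, v):
    """1-D PCA score of a row: centered dot product with the component."""
    return sum((row[j] - means[j]) * v[j] for j in range(len(v)))

rows = [[1, 2, 1], [2, 4, 2], [3, 6, 3], [4, 8, 4]]  # toy access matrix
means, v = top_component(rows)
scores = [project(r, means, v) for r in rows]
print([round(s, 2) for s in scores])
```

Because the toy rows lie on a single line, one component captures all the variance; real traffic matrices would need the usual explained-variance criterion to choose how many components to retain.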
Abstract: Artificial Immune Systems (AIS), inspired by the
human immune system, are algorithms and mechanisms which are
self-adaptive and self-learning classifiers capable of recognizing and
classifying through learning, long-term memory, and association. Unlike
other techniques inspired by human systems, such as genetic algorithms and
neural networks, AIS comprises a range of algorithms modeled on
different immune mechanisms of the body. In this paper, a mechanism
of a human immune system based on apoptosis is adopted to build an
Intrusion Detection System (IDS) to protect computer networks.
Features are selected from network traffic using Fisher Score. Based
on the selected features, the record/connection is classified as either
an attack or normal traffic by the proposed methodology. Simulation
results demonstrate that the proposed apoptosis-based AIS
performs better than existing AIS for intrusion detection.
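The Fisher Score feature-selection step can be sketched in its two-class form, F(j) = (μ₁ − μ₂)² / (σ₁² + σ₂²): features that separate the classes score high. The feature values below are synthetic.

```python
import statistics

def fisher_score(feature_values, labels):
    """Two-class Fisher Score of one feature: squared mean gap over
    the sum of within-class variances (higher = more discriminative)."""
    g1 = [v for v, y in zip(feature_values, labels) if y == 1]
    g0 = [v for v, y in zip(feature_values, labels) if y == 0]
    num = (statistics.mean(g1) - statistics.mean(g0)) ** 2
    den = statistics.pvariance(g1) + statistics.pvariance(g0)
    return num / den if den else float("inf")

labels = [1, 1, 1, 0, 0, 0]                     # attack vs normal
informative = [5.0, 5.5, 6.0, 1.0, 1.5, 2.0]    # separates the classes
noisy = [3.0, 1.0, 2.0, 2.5, 1.5, 3.5]          # overlapping values
print(fisher_score(informative, labels) > fisher_score(noisy, labels))  # → True
```

Ranking all traffic features by this score and keeping the top few is the selection stage that precedes the apoptosis-based classification.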
Abstract: Different strategies and tools are available at the oil
and gas industry for detecting and analyzing tension and possible
fractures in borehole walls. Most of these techniques are based on
manual observation of the captured borehole images. While this
strategy may be feasible and convenient with small images and little
data, it becomes difficult and error-prone when large
databases of images must be treated. Moreover, the patterns may differ
across the image area, depending on many characteristics (drilling
strategy, rock components, rock strength, etc.). In this work, we
propose the inclusion of data-mining classification strategies in order
to create a knowledge database of the segmented curves. These
classifiers allow the system, after some time of use in which parts of
borehole images corresponding to tension regions and breakout areas
are manually marked, to automatically indicate and suggest
new candidate regions with higher accuracy. We suggest the use of
different classification methods in order to achieve different knowledge
dataset configurations.
Abstract: ‘Steganalysis’ is one of the challenging and attractive interests for researchers following the development of information hiding techniques. It is the procedure of detecting hidden information in a stego object created by a known steganographic algorithm. In this paper, a novel feature-based image steganalysis technique is proposed. Various statistical moments have been used along with some similarity metrics. The proposed steganalysis technique is based on transformation in four wavelet domains: Haar, Daubechies, Symlets, and Biorthogonal. Each domain is subjected to various classifiers, namely K-nearest-neighbor, K* classifier, Locally weighted learning, Naive Bayes, Neural networks, Decision trees, and Support vector machines. The experiments are performed on a large set of pictures freely available in image databases. The system also makes predictions for different message length definitions.
Abstract: A brief review of the empirical studies on stock market decision support methodology indicates that they are at the threshold of validating the accuracy of traditional models against fuzzy, artificial neural network, and decision tree models. Many researchers have attempted to compare these models using various data sets worldwide. However, the research community has yet to reach conclusive confidence in the emerging models. This paper uses automotive sector stock prices from the National Stock Exchange (NSE), India, and analyzes them for intra-sectorial support for stock market decisions. The study identifies the significant variables, and their lags, that affect stock prices using OLS analysis and decision tree classifiers.