Abstract: Computer-aided diagnosis systems provide vital assistance
to radiologists in the detection of early signs of breast cancer
in mammogram images. Architectural distortions, masses, and
microcalcifications are the major abnormalities. In this paper, a
computer-aided diagnosis system is proposed for distinguishing
abnormal mammograms containing architectural distortion from normal
mammograms. Four types of texture features (GLCM, GLRLM, fractal,
and spectral) are extracted for the regions of suspicion. A support
vector machine is used as the classifier in this study. The proposed
system yielded an overall sensitivity of 96.47% and an accuracy of
96% on mammogram images collected from the Digital Database for
Screening Mammography (DDSM).
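As a rough illustration of the texture stage, the sketch below computes a gray-level co-occurrence matrix (GLCM) and four Haralick-style descriptors in plain NumPy; the 8-level quantization, single pixel offset, and constant toy patch are illustrative assumptions rather than the paper's actual configuration, and the SVM stage is omitted:

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Gray-Level Co-occurrence Matrix for one pixel offset (dx, dy)."""
    m = np.zeros((levels, levels), dtype=float)
    h, w = img.shape
    for yy in range(h - dy):
        for xx in range(w - dx):
            m[img[yy, xx], img[yy + dy, xx + dx]] += 1
    return m / m.sum()  # normalize counts to joint probabilities

def glcm_features(p):
    """Haralick-style descriptors: contrast, energy, homogeneity, entropy."""
    i, j = np.indices(p.shape)
    contrast = np.sum(p * (i - j) ** 2)
    energy = np.sum(p ** 2)
    homogeneity = np.sum(p / (1.0 + np.abs(i - j)))
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return np.array([contrast, energy, homogeneity, entropy])

# Toy 8-level "region of suspicion": a constant patch has zero contrast.
patch = np.zeros((16, 16), dtype=int)
feats = glcm_features(glcm(patch))
print(feats[0])  # contrast of a constant patch -> 0.0
```

In practice one such feature vector per suspicious region, pooled over several offsets and angles, would be fed to the SVM.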
Abstract: One of the most critical decision points in the design of a
face recognition system is the choice of an appropriate face representation.
Effective feature descriptors are expected to convey sufficient, invariant
and non-redundant facial information. In this work, we propose a set of
Hahn moments as a new approach to feature description. Hahn moments
have been widely used in image analysis due to their invariance,
non-redundancy, and ability to extract features both globally and locally.
To assess the applicability of Hahn moments to Face Recognition we
conduct two experiments on the Olivetti Research Laboratory (ORL)
database and University of Notre-Dame (UND) X1 biometric collection.
Fusion of the global features with features from local facial
regions is used as input to a conventional k-NN classifier. The
method reaches an accuracy of 93% correctly recognized subjects on
the ORL database and 94% on the UND database.
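A minimal sketch of the fusion-plus-k-NN stage is given below, assuming precomputed moment vectors; the `fuse` helper and the synthetic feature values are hypothetical stand-ins for actual Hahn moments of facial regions:

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=1):
    """Conventional k-NN: majority vote among the k nearest Euclidean neighbours."""
    d = np.linalg.norm(train_X - x, axis=1)
    votes = train_y[np.argsort(d)[:k]]
    return np.bincount(votes).argmax()

def fuse(global_feats, local_feats):
    """Fusion by concatenating global moments with per-region moments."""
    return np.concatenate([global_feats] + local_feats)

# Synthetic gallery: two "subjects" with well-separated 12-D fused descriptors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5, 12)), rng.normal(4, 1, (5, 12))])
y = np.array([0] * 5 + [1] * 5)

# Probe: global vector fused with two hypothetical local-region vectors.
probe = fuse(np.full(4, 4.0), [np.full(4, 4.0), np.full(4, 4.0)])
print(knn_predict(X, y, probe, k=3))  # -> 1 (matches the second subject)
```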
Abstract: Speaker Identification (SI) is the task of establishing
identity of an individual based on his/her voice characteristics. The SI
task is typically achieved by two-stage signal processing: training and
testing. The training process calculates speaker specific feature
parameters from the speech and generates speaker models
accordingly. In the testing phase, speech samples from unknown
speakers are compared with the models and classified. Even though
the performance of speaker identification systems has improved due to
recent advances in speech processing techniques, there is still a need
for improvement. In this paper, a Closed-Set Text-Independent Speaker
Identification System (CISI) based on a Multiple Classifier System
(MCS) is proposed, using Mel Frequency Cepstrum Coefficient
(MFCC) as feature extraction and suitable combination of vector
quantization (VQ) and Gaussian Mixture Model (GMM) together
with Expectation Maximization algorithm (EM) for speaker
modeling. The use of Voice Activity Detector (VAD) with a hybrid
approach based on Short Time Energy (STE) and Statistical
Modeling of Background Noise in the pre-processing step of the
feature extraction yields a better and more robust automatic speaker
identification system. Additionally, using the Linde-Buzo-Gray (LBG)
clustering algorithm to initialize the GMM before estimating the
underlying parameters in the EM step improved the convergence rate
and the system's performance. The system also uses a relative index
as a confidence measure when the GMM and VQ identifications
contradict each other. Simulations carried out on the voxforge.org
speech database using MATLAB highlight the efficacy of the
proposed method compared to earlier work.
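The train/test structure of the GMM back-end can be sketched as follows; scikit-learn's `GaussianMixture` fits parameters with EM (its default k-means initialization plays a role analogous to LBG), and the Gaussian noise frames are stand-ins for real MFCC vectors, not actual speech features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Stand-ins for 13-D MFCC frames of two enrolled speakers (illustrative).
train = {"spk_a": rng.normal(0.0, 1.0, (300, 13)),
         "spk_b": rng.normal(3.0, 1.0, (300, 13))}

# Training stage: one GMM per speaker, parameters fitted with EM.
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(f)
          for spk, f in train.items()}

def identify(frames):
    """Closed-set decision: speaker whose model gives the highest
    average log-likelihood over the test frames."""
    return max(models, key=lambda spk: models[spk].score(frames))

test_frames = rng.normal(3.0, 1.0, (50, 13))
print(identify(test_frames))  # -> "spk_b"
```

A full system would add the VAD front-end and the VQ codebooks alongside the GMMs.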
Abstract: In order to help the expert to validate association rules
extracted from data, several quality measures have been proposed in
the literature. We distinguish two categories: objective and
subjective measures. The former depend on a fixed threshold and on
the quality of the data from which the rules are extracted. The
latter consist of providing the expert with tools to explore and
visualize rules during the evaluation step. However, the number of
extracted rules to validate remains high, so mining rules manually
is very laborious. To solve this problem, we propose, in this paper,
a semi-automatic method to assist the expert during association rule
validation. Our method uses rule-based classification as follows:
(i) we transform association rules into classification rules
(classifiers); (ii) we use the generated classifiers for data
classification; (iii) we visualize association rules with their
classification quality to give the expert an overview and assist him
during the validation process.
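Steps (i) and (ii) of the method can be sketched as follows; the itemset encoding, the `class=` label convention, and the first-match firing policy are illustrative assumptions, not the paper's exact transformation:

```python
# Toy association rules as (antecedent_itemset, consequent_itemset) pairs.
rules = [
    ({"milk", "bread"}, {"class=buyer"}),
    ({"milk"}, {"butter"}),            # consequent is not a class label: dropped
    ({"bread"}, {"class=non_buyer"}),
]

def to_classifiers(rules, class_prefix="class="):
    """Step (i): keep only rules whose consequent is a single class label."""
    return [(ante, next(iter(cons)).split("=", 1)[1])
            for ante, cons in rules
            if len(cons) == 1 and next(iter(cons)).startswith(class_prefix)]

def classify(itemset, classifiers, default="unknown"):
    """Step (ii): the first rule whose antecedent is contained
    in the transaction fires and assigns its class."""
    for ante, label in classifiers:
        if ante <= itemset:
            return label
    return default

clf = to_classifiers(rules)
print(classify({"milk", "bread", "eggs"}, clf))  # -> "buyer"
```

The resulting per-rule classification quality is what step (iii) would visualize for the expert.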
Abstract: Phonocardiography is important in appraisal of
congenital heart disease and pulmonary hypertension as it reflects the
duration of right ventricular systoles. The systolic murmur in patients
with intra-cardiac shunt decreases as pulmonary hypertension
develops and may eventually disappear completely as the pulmonary
pressure reaches systemic level. Phonocardiography and auscultation
are non-invasive, low-cost, and accurate methods to assess heart
disease. In this work, an objective signal processing tool that
extracts information from the phonocardiography signal using wavelets
is proposed to classify murmurs as normal or abnormal. Since the
feature vector is large, a Binary Particle Swarm Optimization (PSO)
with mutation for feature selection is proposed. The extracted
features improve the classification accuracy and were tested across
various classifiers including Naïve Bayes, kNN, C4.5, and SVM.
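A minimal binary PSO with bit-flip mutation might look like the sketch below; the surrogate fitness function, swarm size, and coefficients are illustrative assumptions, standing in for wrapper evaluation with the actual classifiers:

```python
import numpy as np

rng = np.random.default_rng(2)
n_feat, n_particles = 10, 8
relevant = {0, 3, 7}  # hypothetical informative features

def fitness(mask):
    """Stand-in for wrapper classifier accuracy: reward the relevant
    features, lightly penalize subset size."""
    return sum(mask[i] for i in relevant) - 0.05 * mask.sum()

X = rng.integers(0, 2, (n_particles, n_feat))   # binary particle positions
V = rng.normal(0, 1, (n_particles, n_feat))     # real-valued velocities
pbest = X.copy()
pbest_f = np.array([fitness(x) for x in X])
gbest = pbest[pbest_f.argmax()].copy()
f0 = pbest_f.max()                              # best fitness before search

for _ in range(40):
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = 0.7 * V + 1.5 * r1 * (pbest - X) + 1.5 * r2 * (gbest - X)
    # Sigmoid transfer: velocity gives the probability that a bit is 1.
    X = (rng.random(X.shape) < 1 / (1 + np.exp(-V))).astype(int)
    flip = rng.random(X.shape) < 0.02           # mutation: random bit flips
    X[flip] = 1 - X[flip]
    f = np.array([fitness(x) for x in X])
    better = f > pbest_f
    pbest[better], pbest_f[better] = X[better], f[better]
    gbest = pbest[pbest_f.argmax()].copy()

print(sorted(np.flatnonzero(gbest)))            # selected feature subset
```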
Abstract: This paper presents an approach for the classification of
an unstructured format description for identification of file formats.
The main contribution of this work is the employment of data mining
techniques to support file format selection with just the unstructured
text description that comprises the most important format features for
a particular organisation. Subsequently, the file format identification
method employs a file format classifier and associated configurations
to provide digital preservation experts with an estimate of the
required file format. Our goal is to make use of a format specification
knowledge base aggregated from different Web sources in order to select
a file format for a particular institution. Using the naive Bayes
method, the decision support system recommends a file format to the
expert for his institution. The proposed methods facilitate the
selection of a file format and improve the quality of the digital
preservation process. The
presented approach is meant to facilitate decision making for the
preservation of digital content in libraries and archives using domain
expert knowledge and specifications of file formats. To facilitate
decision-making, the aggregated information about the file formats is
presented as a file format vocabulary that comprises most common
terms that are characteristic for all researched formats. The goal is to
suggest a particular file format based on this vocabulary for analysis
by an expert. The sample file format calculation and the calculation
results including probabilities are presented in the evaluation section.
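The naive Bayes recommendation step can be sketched as below, assuming bag-of-words vocabulary descriptions; the example descriptions and format labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical vocabulary-style descriptions aggregated from Web sources.
docs = [
    "lossless raster image wide support open specification",
    "lossless raster image open specification transparency",
    "page layout text vector print archival standard",
    "page layout document print archival standard metadata",
]
formats = ["PNG", "PNG", "PDF/A", "PDF/A"]

# Bag-of-words counts feeding a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, formats)

query = "archival document standard with text and metadata"
print(model.predict([query])[0])           # recommended format
print(model.predict_proba([query]).max())  # probability shown to the expert
```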
Abstract: With the evolution of technology, the expression of
opinions has shifted to the digital world. The domain of politics,
one of the hottest topics in opinion mining research, is merged here
with behavioral analysis for determining political affiliation in
texts, which constitutes the subject of this paper. This study aims
to classify text in news/blogs as either Republican or Democrat with
the minimum number of features. As an initial set, 68 features, 64 of
which came from Linguistic Inquiry and Word Count (LIWC), were tested
against 14 benchmark classification algorithms. In later experiments,
the dimensionality of the feature vector was reduced using 7 feature
selection algorithms. The results show that the "Decision Tree",
"Rule Induction" and "M5 Rule" classifiers, when used with the "SVM"
and "IGR" feature selection algorithms, performed best, reaching up
to 82.5% accuracy on the given dataset. Further tests on a single
feature and on the linguistics-based feature sets showed similar
results. The feature "Function", an aggregate feature of the
linguistic category, was found to be the most differentiating feature
among the 68, classifying articles as either Republican or Democrat
with 81% accuracy.
Abstract: Thousands of organisations store important and
confidential information related to them, their customers, and their
business partners in databases all across the world. The stored data
ranges from less sensitive (e.g. first name, last name, date of birth) to
more sensitive data (e.g. password, pin code, and credit card
information). Losing data, disclosing confidential information or
even changing the value of data are the severe damages that
Structured Query Language injection (SQLi) attack can cause on a
given database. It is a code injection technique where malicious SQL
statements are inserted into a given SQL database by simply using a
web browser. In this paper, we propose an effective pattern
recognition neural network model for detection and classification of
SQLi attacks. The proposed model is built from three main elements:
a Uniform Resource Locator (URL) generator to produce thousands of
malicious and benign URLs; a URL classifier to 1) label each
generated URL as either benign or malicious and 2) assign the
malicious URLs to different SQLi attack categories; and an NN model
to 1) detect whether a given URL is malicious or benign and
2) identify the type of SQLi attack for each malicious URL. The
model is first
trained and then evaluated by employing thousands of benign and
malicious URLs. The results of the experiments are presented in
order to demonstrate the effectiveness of the proposed approach.
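One plausible shape for the URL-detection element is sketched below; the hand-crafted payload indicators, the URL templates, and the small multilayer perceptron are illustrative assumptions, not the paper's generator or network:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def url_features(url):
    """Hand-crafted indicators of SQLi payloads (illustrative, not exhaustive)."""
    u = url.lower()
    return [u.count("'"), u.count("--"),
            int(" or " in u or "%20or%20" in u),
            int("union" in u), int("select" in u),
            len(u) / 100.0]

# Synthetic benign URLs and two common SQLi payload families.
benign = [f"http://site.com/page?id={i}" for i in range(30)]
malicious = [f"http://site.com/page?id={i}' OR '1'='1" for i in range(15)] + \
            [f"http://site.com/page?id={i} UNION SELECT password FROM users--"
             for i in range(15)]

X = np.array([url_features(u) for u in benign + malicious], dtype=float)
y = np.array([0] * 30 + [1] * 30)   # 0 = benign, 1 = SQLi

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X, y)
probe = "http://site.com/item?id=7' OR '1'='1"
print(clf.predict([url_features(probe)])[0])  # -> 1 (flagged as SQLi)
```

The same feature vector could feed a second, multiclass model for assigning attack categories.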
Abstract: The paper presents new results concerning selection of
optimal information fusion formula for ensembles of C-OTDR
channels. The goal of information fusion is to create an integral
classifier for effective classification of seismoacoustic target
events. The LPBoost (LP-β and LP-B variants), Multiple Kernel
Learning, and Weighing Inversely as Lipschitz Constants (WILC)
approaches were compared. WILC is a new approach to the optimal
fusion of Lipschitz classifier ensembles. Results of its practical
use are presented.
Abstract: The margin-based principle was proposed long ago, and it
has been proved, both theoretically and practically, that this
principle can reduce structural risk and improve performance.
Meanwhile, the feed-forward neural network is a traditional
classifier that is currently very popular in deeper architectures.
However, the standard training algorithm for feed-forward neural
networks derives from the Widrow-Hoff principle, i.e., it minimizes
the squared error. In this paper, we propose a new training algorithm
for feed-forward neural networks based on the margin-based principle,
which effectively improves the accuracy and generalization ability of
neural network classifiers with fewer labelled samples and a flexible
network. We have conducted experiments on four UCI open datasets and
achieved good results, as expected. In conclusion, our model can
handle sparsely labelled, high-dimensional datasets with high
accuracy, while migrating from the old ANN method to ours is easy and
requires almost no work.
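A margin-based (hinge-loss) update for a small feed-forward network, in contrast to the squared-error Widrow-Hoff update, might be sketched as follows; the architecture, learning rate, and toy data are illustrative, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two-class toy data with labels in {-1, +1}, as margin-based losses expect.
X = np.vstack([rng.normal(-2, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([-1.0] * 40 + [1.0] * 40)

W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros(8)  # one hidden layer, 8 units
w2, b2 = rng.normal(0, 0.5, 8), 0.0

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ w2 + b2

lr = 0.05
for _ in range(200):
    H, out = forward(X)
    active = (y * out < 1).astype(float)  # hinge: only violated margins update
    g_out = -y * active / len(X)          # gradient of mean hinge loss w.r.t. out
    g_H = np.outer(g_out, w2) * (1 - H ** 2)
    w2 -= lr * (H.T @ g_out)
    b2 -= lr * g_out.sum()
    W1 -= lr * (X.T @ g_H)
    b1 -= lr * g_H.sum(axis=0)

_, out = forward(X)
acc = np.mean(np.sign(out) == y)
print(acc)  # training accuracy under the margin objective
```

The only change from a squared-error trainer is the output gradient `g_out`; the backpropagation machinery is untouched, which is what makes the migration cheap.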
Abstract: The problems arising from unbalanced data sets
generally appear in real world applications. Due to unequal class
distribution, many researchers have found that the performance of
existing classifiers tends to be biased towards the majority class. The
k-nearest neighbors’ nonparametric discriminant analysis is a method
that was proposed for classifying unbalanced classes with good
performance. In this study, the methods of discriminant analysis are
of interest in investigating misclassification error rates for
class-imbalanced data of three diabetes risk groups. The purpose of this
study was to compare the classification performance between
parametric discriminant analysis and nonparametric discriminant
analysis in a three-class classification of class-imbalanced data of
diabetes risk groups. Data from a project maintaining healthy
conditions for 599 employees of a government hospital in Bangkok
were obtained for the classification problem. The employees were
divided into three diabetes risk groups: non-risk (90%), risk (5%),
and diabetic (5%). The original data including the variables of
diabetes risk group, age, gender, blood glucose, and BMI were
analyzed and bootstrapped for 50 and 100 samples, 599 observations
per sample, for additional estimation of the misclassification error
rate. Each data set was explored for the departure of multivariate
normality and the equality of covariance matrices of the three risk
groups. Both the original data and the bootstrap samples showed
non-normality and unequal covariance matrices. The parametric linear
discriminant function, quadratic discriminant function, and the
nonparametric k-nearest neighbors’ discriminant function were
performed over 50 and 100 bootstrap samples and applied to the
original data. In searching for the optimal classification rule, the
prior probabilities were set both to equal proportions
(0.33:0.33:0.33) and to unequal proportions of (0.90:0.05:0.05),
(0.80:0.10:0.10), and (0.70:0.15:0.15). The results from 50 and 100
bootstrap samples
indicated that the k-nearest neighbors approach when k=3 or k=4 and
the defined prior probabilities of non-risk: risk: diabetic as 0.90:
0.05:0.05 or 0.80:0.10:0.10 gave the smallest error rate of
misclassification. The k-nearest neighbors approach is therefore
suggested for classifying the three-class-imbalanced data of diabetes
risk groups.
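The comparison between a parametric discriminant rule with explicit priors and the nonparametric k-NN rule can be sketched as below; the synthetic two-feature data merely mimics the 90/5/5 imbalance and is not the hospital dataset:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
# Synthetic stand-in for the three diabetes risk groups (90% / 5% / 5%).
n = (540, 30, 30)
means = ([0, 0], [3, 3], [6, 6])
X = np.vstack([rng.normal(m, 1.0, (k, 2)) for k, m in zip(n, means)])
y = np.repeat([0, 1, 2], n)

# Parametric rule with priors matched to the class imbalance ...
lda = LinearDiscriminantAnalysis(priors=[0.9, 0.05, 0.05]).fit(X, y)
# ... versus the nonparametric k-NN rule (k=3, as favoured in the study).
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

for name, clf in [("LDA", lda), ("kNN", knn)]:
    err = np.mean(clf.predict(X) != y)  # apparent misclassification rate
    print(name, round(err, 3))
```

In the study itself the error rates were estimated over bootstrap samples rather than on the training data.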
Abstract: This study investigates the use of a time-series of
MODIS NDVI data to identify agricultural land cover change on an
annual time step (2007-2012) and characterize the trend. Following
an ISODATA classification of the MODIS imagery to mask areas that
were neither agricultural nor semi-natural, NDVI signatures were
created to identify areas of cereals and vineyards with the aid of
ancillary, pictometry, and field sample data for 2010. The NDVI
signature curve and training samples were used to create a decision
tree model in WEKA 3.6.9 using the decision tree classifier (J48)
algorithm: Model 1 including the ISODATA classification and Model 2
excluding it. These two models were then used to classify all data for the
study area for 2010, producing land cover maps with classification
accuracies of 77% and 80% for Model 1 and 2 respectively. Model 2
was subsequently used to create land cover classification and change
detection maps for all other years. Subtle changes and areas of
consistency (unchanged) were observed in the agricultural classes
and crop practices over the years, as predicted by the land cover
classification. Forty-one percent of the catchment comprised cereals,
with 35% possibly following a crop rotation system. Vineyards largely
remained constant, with only one percent converted to vineyard from
other land cover classes.
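The signature-based decision tree step might be sketched as follows; the NDVI curves and class set are invented for illustration, and scikit-learn's entropy-based tree stands in for WEKA's J48 (both descend from C4.5-style information-gain splitting):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
# Illustrative annual NDVI signatures (12 monthly composites per pixel):
# cereals peak in spring and then senesce; vineyards stay moderate year-round.
months = np.arange(12)
cereal_curve = 0.2 + 0.6 * np.exp(-((months - 4) ** 2) / 4.0)
vine_curve = np.full(12, 0.45)

X = np.vstack([cereal_curve + rng.normal(0, 0.03, (40, 12)),
               vine_curve + rng.normal(0, 0.03, (40, 12))])
y = np.array(["cereal"] * 40 + ["vineyard"] * 40)

# Entropy criterion approximates J48's information-gain splitting.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
probe = cereal_curve + rng.normal(0, 0.03, 12)
print(tree.predict([probe])[0])  # -> "cereal"
```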
Abstract: In the past few years, the amount of malicious software
increased exponentially and, therefore, machine learning algorithms
became instrumental in identifying clean and malware files through
(semi)-automated classification. When working with very large
datasets, the major challenge is to reach both a very high malware
detection rate and a very low false positive rate. Another challenge
is to minimize the time needed for the machine learning algorithm to
do so. This paper presents a comparative study between different
machine learning techniques such as linear classifiers, ensembles,
decision trees, and various hybrids thereof. The training dataset
consists of approximately 2 million clean files and 200,000 infected
files, which is a realistic quantitative mixture. The paper
investigates the above-mentioned methods with respect to both their performance
(detection rate and false positive rate) and their practicability.
Abstract: This paper introduces an original method for
guaranteed estimation of the accuracy for an ensemble of Lipschitz
classifiers. The solution was obtained as a finite closed set of
alternative hypotheses, which contains an object of classification with
probability of not less than the specified value. Thus, the
classification is represented by a set of hypothetical classes. In this
case, the smaller the cardinality of the discrete set of hypothetical
classes is, the higher the classification accuracy. Experiments have
shown that increasing the cardinality of the classifier ensemble
reduces the cardinality of this set of hypothetical classes. The
problem of guaranteed accuracy estimation for an ensemble of
Lipschitz classifiers is relevant to the multichannel classification
of target events in C-OTDR monitoring systems. Results of the
practical use of the suggested approach for accuracy control in
C-OTDR monitoring systems are presented.
Abstract: In this paper, we used data mining to extract
biomedical knowledge. In general, complex biomedical data
collected in population studies are treated with statistical methods;
although these are robust, they are not sufficient in themselves to
harness the potential wealth of the data. To that end, we used two
learning algorithms: decision trees and Support Vector Machines
(SVM). These supervised classification methods are used to diagnose
thyroid disease. In this context, we propose to promote the study and
use of symbolic data mining techniques.
Abstract: The performance and analysis of speech recognition
system is illustrated in this paper. An approach to recognizing
English words corresponding to the digits (0-9), spoken by 2
different speakers and recorded in a noise-free environment, is
presented. For feature extraction, Mel frequency cepstral
coefficients (MFCC) have been used, which give a set of feature
vectors from the recorded speech samples. A neural network model is
used to enhance recognition performance; specifically, a feed-forward
neural network with the back-propagation algorithm is used, although
other speech recognition techniques such as HMM and DTW exist. All
experiments are carried out in MATLAB.
Abstract: The goal of image segmentation is to cluster pixels
into salient image regions. Segmentation could be used for object
recognition, occlusion boundary estimation within motion or stereo
systems, image compression, image editing, or image database lookup.
In this paper, we present color image segmentation using
support vector machine (SVM) pixel classification. First, the
pixel-level color and texture features of the image are extracted and
used as input to the SVM classifier. These features are extracted
using the homogeneity model and Gabor filters. With the extracted
pixel-level features, the SVM classifier is trained using FCM
(Fuzzy C-Means). The image segmentation thus takes advantage of both
the pixel-level information of the image and the discriminative
ability of the SVM classifier. Experiments show that the proposed
method produces very good segmentation results with better
efficiency, increasing the quality of the image segmentation compared
with other segmentation methods proposed in the literature.
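A minimal sketch of training an SVM from fuzzy C-means labels is given below; the tiny FCM implementation and the synthetic three-dimensional "pixel features" are illustrative assumptions, not the paper's homogeneity/Gabor features:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
# Per-pixel features (e.g. colour + texture responses), two latent regions.
X = np.vstack([rng.normal(0.2, 0.05, (200, 3)),
               rng.normal(0.8, 0.05, (200, 3))])

def fcm(X, c=2, m=2.0, iters=30):
    """Minimal fuzzy C-means: returns hard labels from fuzzy memberships."""
    U = rng.dirichlet(np.ones(c), len(X))          # random initial memberships
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)) *
                   np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
    return U.argmax(axis=1)

labels = fcm(X)                 # unsupervised labels for training the SVM
svm = SVC(kernel="rbf").fit(X, labels)
print(svm.score(X, labels))     # agreement with the FCM partition
```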
Abstract: This paper introduces an original method of parametric
optimization of the structure of a multimodal decision-level fusion
scheme, which combines the partial classification results obtained
from an assembly of mono-modal classifiers. As a result, a multimodal
fusion classifier with the minimum total error rate has been
obtained.
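One simple instance of parametric decision-level fusion is a grid search for the score weights that minimize the fused ensemble's total error, as sketched below with hypothetical mono-modal classifier scores:

```python
import numpy as np

rng = np.random.default_rng(9)
y = rng.integers(0, 2, 500)
# Scores from three hypothetical mono-modal classifiers of varying quality.
scores = np.stack([y + rng.normal(0, s, 500) for s in (0.4, 0.7, 1.2)])

def error(weights):
    """Total error rate of the weighted-sum (decision-level) fusion."""
    fused = np.tensordot(weights, scores, axes=1)
    return np.mean((fused > 0.5 * weights.sum()) != y)

# Parametric optimization: exhaustive search over the weight simplex.
best_w, best_e = None, 1.0
for w1 in np.linspace(0, 1, 21):
    for w2 in np.linspace(0, 1 - w1, 21):
        w = np.array([w1, w2, 1 - w1 - w2])
        e = error(w)
        if e < best_e:
            best_w, best_e = w, e

print(best_w.round(2), best_e)  # fused ensemble's minimum total error rate
```

By construction the fused minimum can be no worse than the best single channel, since single-channel weightings lie on the grid.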
Abstract: Neurons in the nervous system communicate with
each other by producing electrical signals called spikes. To
investigate the physiological function of nervous system it is essential
to study the activity of neurons by detecting and sorting spikes in the
recorded signal. In this paper, a method is proposed for addressing
the spike sorting problem based on the nonlinear modeling of spikes
using an exponential autoregressive model. The genetic
algorithm is utilized for model parameter estimation. In this regard
some selected model coefficients are used as features for sorting
purposes. For optimal selection of model coefficients, self-organizing
feature map is used. The results show that modeling spikes with a
nonlinear autoregressive model outperforms its linear counterpart.
The features extracted from the coefficients of the exponential
autoregressive model are also better than wavelet-based features,
yielding more compact and better-separated clusters. For spikes that
differ only in small-scale structure, where principal component
analysis fails to produce separated clouds in the feature space, the
proposed method obtains well-separated clusters, removing the need
for complex classifiers.
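The model-fitting step can be sketched as below: a toy exponential AR(1) series and a minimal genetic algorithm estimating its two coefficients. The GA operators and the model order are simplified assumptions, and the self-organizing-map coefficient selection is omitted:

```python
import numpy as np

rng = np.random.default_rng(6)
# Simulate a series from an exponential AR(1) model:
#   x[t] = (a + b * exp(-x[t-1]**2)) * x[t-1] + noise
a_true, b_true = 0.5, 0.4
x = np.zeros(200)
for t in range(1, 200):
    x[t] = (a_true + b_true * np.exp(-x[t - 1] ** 2)) * x[t - 1] \
           + rng.normal(0, 0.5)

def sse(params):
    """One-step-ahead squared prediction error for candidate (a, b)."""
    a, b = params
    pred = (a + b * np.exp(-x[:-1] ** 2)) * x[:-1]
    return np.sum((x[1:] - pred) ** 2)

# Minimal genetic algorithm: truncation selection, elitism, Gaussian mutation.
pop = rng.uniform(-1, 1, (40, 2))
for _ in range(60):
    pop = pop[np.argsort([sse(p) for p in pop])]  # rank by fitness (low SSE)
    parents = pop[:10]                            # elitist survivors
    kids = parents[rng.integers(0, 10, 30)] + rng.normal(0, 0.05, (30, 2))
    pop = np.vstack([parents, kids])

a_est, b_est = pop[np.argmin([sse(p) for p in pop])]
print(round(a_est, 2), round(b_est, 2))           # estimated (a, b)
```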
Abstract: A Distributed Denial of Service (DDoS) attack is a
major threat to cyber security. It originates from the network layer or
the application layer of compromised/attacker systems which are
connected to the network. The impact of this attack ranges from the
simple inconvenience of using a particular service to major failures
at the targeted server. When there is heavy traffic flow to a
target server, it is necessary to classify the legitimate access and
attacks. In this paper, a novel method is proposed to detect DDoS
attacks from the traces of traffic flow. An access matrix is created
from the traces. As the access matrix is multi-dimensional, Principal
Component Analysis (PCA) is used to reduce the attributes used for
detection. Two classifiers, Naive Bayes and K-Nearest Neighbors, are
used to classify the traffic as normal or abnormal. The performance
of the classifiers with PCA-selected attributes and with the actual
attributes of the access matrix is compared in terms of detection
rate and False Positive Rate (FPR).
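The PCA-plus-classifier pipeline can be sketched as below; the Poisson "access matrix" is an invented stand-in for real traffic traces, and the component count is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
# Toy access matrix: rows = traffic flows, columns = per-source request counts.
normal = rng.poisson(5, (100, 20))
ddos = rng.poisson(5, (100, 20))
ddos[:, :3] += rng.poisson(60, (100, 3))   # a few sources flood the target

X = np.vstack([normal, ddos]).astype(float)
y = np.array([0] * 100 + [1] * 100)        # 0 = normal, 1 = attack

# Reduce the multi-dimensional access matrix to a few principal components.
Xr = PCA(n_components=3).fit_transform(X)

for name, clf in [("NaiveBayes", GaussianNB()),
                  ("kNN", KNeighborsClassifier())]:
    acc = clf.fit(Xr, y).score(Xr, y)
    print(name, acc)
```

A real evaluation would report detection rate and FPR on held-out traces, for both PCA-reduced and raw attributes.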