Abstract: We propose a novel graphical technique (SVision) for
intrusion detection, which pictures the network as a community of
hosts independently roaming in a 3D space defined by the set of
services that they use. The aim of SVision is to graphically cluster
the hosts into normal and abnormal ones, highlighting only the ones
that are considered as a threat to the network. Our experimental
results using DARPA 1999 and 2000 intrusion detection and
evaluation datasets show the proposed technique as a good candidate
for the detection of various threats of the network such as vertical
and horizontal scanning, Denial of Service (DoS), and Distributed
DoS (DDoS) attacks.
Abstract: An adaptive spatial Gaussian mixture model is proposed for clustering based color image segmentation. A new clustering objective function which incorporates the spatial information is introduced in the Bayesian framework. The weighting parameter for controlling the importance of spatial information is made adaptive to the image content to augment the smoothness towards piecewisehomogeneous region and diminish the edge-blurring effect and hence the name adaptive spatial finite mixture model. The proposed approach is compared with the spatially variant finite mixture model for pixel labeling. The experimental results with synthetic and Berkeley dataset demonstrate that the proposed method is effective in improving the segmentation and it can be employed in different practical image content understanding applications.
Abstract: The purpose of this paper is to detect human in images.
This paper proposes a method for extracting human body feature descriptors consisting of projected edge component series. The feature descriptor can express appearances and shapes of human with local
and global distribution of edges. Our method evaluated with a linear SVM classifier on Daimler-Chrysler pedestrian dataset, and test with
various sub-region size. The result shows that the accuracy level of
proposed method similar to Histogram of Oriented Gradients(HOG)
feature descriptor and feature extraction process is simple and faster than existing methods.
Abstract: User-based Collaborative filtering (CF), one of the
most prevailing and efficient recommendation techniques, provides
personalized recommendations to users based on the opinions of other
users. Although the CF technique has been successfully applied in
various applications, it suffers from serious sparsity problems. The
cloud-model approach addresses the sparsity problems by
constructing the user-s global preference represented by a cloud
eigenvector. The user-based CF approach works well with dense
datasets while the cloud-model CF approach has a greater
performance when the dataset is sparse. In this paper, we present a
hybrid approach that integrates the predictions from both the
user-based CF and the cloud-model CF approaches. The experimental
results show that the proposed hybrid approach can ameliorate the
sparsity problem and provide an improved prediction quality.
Abstract: In order to develop forest management strategies in
tropical forest in Malaysia, surveying the forest resources and
monitoring the forest area affected by logging activities is essential.
There are tremendous effort has been done in classification of land
cover related to forest resource management in this country as it is a
priority in all aspects of forest mapping using remote sensing and
related technology such as GIS. In fact classification process is a
compulsory step in any remote sensing research. Therefore, the main
objective of this paper is to assess classification accuracy of
classified forest map on Landsat TM data from difference number of
reference data (200 and 388 reference data). This comparison was
made through observation (200 reference data), and interpretation
and observation approaches (388 reference data). Five land cover
classes namely primary forest, logged over forest, water bodies, bare
land and agricultural crop/mixed horticultural can be identified by
the differences in spectral wavelength. Result showed that an overall
accuracy from 200 reference data was 83.5 % (kappa value
0.7502459; kappa variance 0.002871), which was considered
acceptable or good for optical data. However, when 200 reference
data was increased to 388 in the confusion matrix, the accuracy
slightly improved from 83.5% to 89.17%, with Kappa statistic
increased from 0.7502459 to 0.8026135, respectively. The accuracy
in this classification suggested that this strategy for the selection of
training area, interpretation approaches and number of reference data
used were importance to perform better classification result.
Abstract: In this paper we describe a hybrid technique of Minimax search and aggregate Mahalanobis distance function synthesis to evolve Awale game player. The hybrid technique helps to suggest a move in a short amount of time without looking into endgame database. However, the effectiveness of the technique is heavily dependent on the training dataset of the Awale strategies utilized. The evolved player was tested against Awale shareware program and the result is appealing.
Abstract: Obtaining labeled data in supervised learning is often
difficult and expensive, and thus the trained learning algorithm tends
to be overfitting due to small number of training data. As a result,
some researchers have focused on using unlabeled data which may
not necessary to follow the same generative distribution as the labeled
data to construct a high-level feature for improving performance on
supervised learning tasks. In this paper, we investigate the impact of
the relationship between unlabeled and labeled data for classification
performance. Specifically, we will apply difference unlabeled data
which have different degrees of relation to the labeled data for
handwritten digit classification task based on MNIST dataset. Our
experimental results show that the higher the degree of relation
between unlabeled and labeled data, the better the classification
performance. Although the unlabeled data that is completely from
different generative distribution to the labeled data provides the lowest
classification performance, we still achieve high classification performance.
This leads to expanding the applicability of the supervised
learning algorithms using unsupervised learning.
Abstract: Gene expression profiling is rapidly evolving into a
powerful technique for investigating tumor malignancies. The
researchers are overwhelmed with the microarray-based platforms
and methods that confer them the freedom to conduct large-scale
gene expression profiling measurements. Simultaneously,
investigations into cross-platform integration methods have started
gaining momentum due to their underlying potential to help
comprehend a myriad of broad biological issues in tumor diagnosis,
prognosis, and therapy. However, comparing results from different
platforms remains to be a challenging task as various inherent
technical differences exist between the microarray platforms. In this
paper, we explain a simple ratio-transformation method, which can
provide some common ground for cDNA and Affymetrix platform
towards cross-platform integration. The method is based on the
characteristic data attributes of Affymetrix- and cDNA- platform. In
the work, we considered seven childhood leukemia patients and their
gene expression levels in either platform. With a dataset of 822
differentially expressed genes from both these platforms, we carried
out a specific ratio-treatment to Affymetrix data, which subsequently
showed an improvement in the relationship with the cDNA data.
Abstract: This paper presents a hand vein authentication system
using fast spatial correlation of hand vein patterns. In order to
evaluate the system performance, a prototype was designed and a
dataset of 50 persons of different ages above 16 and of different
gender, each has 10 images per person was acquired at different
intervals, 5 images for left hand and 5 images for right hand. In
verification testing analysis, we used 3 images to represent the
templates and 2 images for testing. Each of the 2 images is matched
with the existing 3 templates. FAR of 0.02% and FRR of 3.00 %
were reported at threshold 80. The system efficiency at this threshold
was found to be 99.95%. The system can operate at a 97% genuine
acceptance rate and 99.98 % genuine reject rate, at corresponding
threshold of 80. The EER was reported as 0.25 % at threshold 77. We
verified that no similarity exists between right and left hand vein
patterns for the same person over the acquired dataset sample.
Finally, this distinct 100 hand vein patterns dataset sample can be
accessed by researchers and students upon request for testing other
methods of hand veins matching.
Abstract: Automatic face detection is a complex problem in
image processing. Many methods exist to solve this problem such as
template matching, Fisher Linear Discriminate, Neural Networks,
SVM, and MRC. Success has been achieved with each method to
varying degrees and complexities. In proposed algorithm we used
upright, frontal faces for single gray scale images with decent
resolution and under good lighting condition. In the field of face
recognition technique the single face is matched with single face
from the training dataset. The author proposed a neural network
based face detection algorithm from the photographs as well as if any
test data appears it check from the online scanned training dataset.
Experimental result shows that the algorithm detected up to 95%
accuracy for any image.
Abstract: Software effort estimation is the process of predicting
the most realistic use of effort required to develop or maintain
software based on incomplete, uncertain and/or noisy input. Effort
estimates may be used as input to project plans, iteration plans,
budgets. There are various models like Halstead, Walston-Felix,
Bailey-Basili, Doty and GA Based models which have already used
to estimate the software effort for projects. In this study Statistical
Models, Fuzzy-GA and Neuro-Fuzzy (NF) Inference Systems are
experimented to estimate the software effort for projects. The
performances of the developed models were tested on NASA
software project datasets and results are compared with the Halstead,
Walston-Felix, Bailey-Basili, Doty and Genetic Algorithm Based
models mentioned in the literature. The result shows that the NF
Model has the lowest MMRE and RMSE values. The NF Model
shows the best results as compared with the Fuzzy-GA based hybrid
Inference System and other existing Models that are being used for
the Effort Prediction with lowest MMRE and RMSE values.
Abstract: Recent years have seen a growing trend towards the
integration of multiple information sources to support large-scale
prediction of protein-protein interaction (PPI) networks in model
organisms. Despite advances in computational approaches, the
combination of multiple “omic" datasets representing the same type
of data, e.g. different gene expression datasets, has not been
rigorously studied. Furthermore, there is a need to further investigate
the inference capability of powerful approaches, such as fullyconnected
Bayesian networks, in the context of the prediction of PPI
networks. This paper addresses these limitations by proposing a
Bayesian approach to integrate multiple datasets, some of which
encode the same type of “omic" data to support the identification of
PPI networks. The case study reported involved the combination of
three gene expression datasets relevant to human heart failure (HF).
In comparison with two traditional methods, Naive Bayesian and
maximum likelihood ratio approaches, the proposed technique can
accurately identify known PPI and can be applied to infer potentially
novel interactions.
Abstract: In this paper, we present a new algorithm for clustering data in large datasets using image processing approaches. First the dataset is mapped into a binary image plane. The synthesized image is then processed utilizing efficient image processing techniques to cluster the data in the dataset. Henceforth, the algorithm avoids exhaustive search to identify clusters. The algorithm considers only a small set of the data that contains critical boundary information sufficient to identify contained clusters. Compared to available data clustering techniques, the proposed algorithm produces similar quality results and outperforms them in execution time and storage requirements.
Abstract: Concrete strength evaluated from compression tests
on cores is affected by several factors causing differences from the
in-situ strength at the location from which the core specimen was
extracted. Among the factors, there is the damage possibly occurring
during the drilling phase that generally leads to underestimate the
actual in-situ strength. In order to quantify this effect, in this study
two wide datasets have been examined, including: (i) about 500 core
specimens extracted from Reinforced Concrete existing structures,
and (ii) about 600 cube specimens taken during the construction of
new structures in the framework of routine acceptance control. The
two experimental datasets have been compared in terms of
compression strength and specific weight values, accounting for the
main factors affecting a concrete property, that is type and amount of
cement, aggregates' grading, type and maximum size of aggregates,
water/cement ratio, placing and curing modality, concrete age. The
results show that the magnitude of the strength reduction due to
drilling damage is strongly affected by the actual properties of
concrete, being inversely proportional to its strength. Therefore, the
application of a single value of the correction coefficient, as generally
suggested in the technical literature and in structural codes, appears
inappropriate. A set of values of the drilling damage coefficient is
suggested as a function of the strength obtained from compressive
tests on cores.
Abstract: Sickness absence represents a major economic and
social issue. Analysis of sick leave data is a recurrent challenge to analysts because of the complexity of the data structure which is
often time dependent, highly skewed and clumped at zero. Ignoring these features to make statistical inference is likely to be inefficient
and misguided. Traditional approaches do not address these problems. In this study, we discuss model methodologies in terms of statistical techniques for addressing the difficulties with sick leave data. We also introduce and demonstrate a new method by performing a longitudinal assessment of long-term absenteeism using
a large registration dataset as a working example available from the Helsinki Health Study for municipal employees from Finland during the period of 1990-1999. We present a comparative study on model
selection and a critical analysis of the temporal trends, the occurrence
and degree of long-term sickness absences among municipal employees. The strengths of this working example include the large
sample size over a long follow-up period providing strong evidence in supporting of the new model. Our main goal is to propose a way to
select an appropriate model and to introduce a new methodology for analysing sickness absence data as well as to demonstrate model
applicability to complicated longitudinal data.
Abstract: Rule Discovery is an important technique for mining knowledge from large databases. Use of objective measures for discovering interesting rules lead to another data mining problem, although of reduced complexity. Data mining researchers have studied subjective measures of interestingness to reduce the volume of discovered rules to ultimately improve the overall efficiency of KDD process. In this paper we study novelty of the discovered rules as a subjective measure of interestingness. We propose a hybrid approach that uses objective and subjective measures to quantify novelty of the discovered rules in terms of their deviations from the known rules. We analyze the types of deviation that can arise between two rules and categorize the discovered rules according to the user specified threshold. We implement the proposed framework and experiment with some public datasets. The experimental results are quite promising.
Abstract: Building intelligent traffic guide systems has been an
interesting subject recently. A good system should be able to observe
all important visual information to be able to analyze the context of
the scene. To do so, signs in general, and traffic signs in particular,
are usually taken into account as they contain rich information to
these systems. Therefore, many researchers have put an effort on
sign recognition field. Sign localization or sign detection is the most
important step in the sign recognition process. This step filters out
non informative area in the scene, and locates candidates in later
steps. In this paper, we apply a new approach in detecting sign
locations using a new color invariant model. Experiments are carried
out with different datasets introduced in other works where authors
claimed the difficulty in detecting signs under unfavorable imaging
conditions. Our method is simple, fast and most importantly it gives
a high detection rate in locating signs.
Abstract: This work presents a neural network model for the
clustering analysis of data based on Self Organizing Maps (SOM).
The model evolves during the training stage towards a hierarchical
structure according to the input requirements. The hierarchical structure
symbolizes a specialization tool that provides refinements of the
classification process. The structure behaves like a single map with
different resolutions depending on the region to analyze. The benefits
and performance of the algorithm are discussed in application to the
Iris dataset, a classical example for pattern recognition.
Abstract: Free and open source software is gaining popularity at
an unprecedented rate of growth. Organizations despite some
concerns about the quality have been using them for various
purposes. One of the biggest concerns about free and open source
software is post release software defects and their fixing. Many
believe that there is no appropriate support available to fix the bugs.
On the contrary some believe that due to the active involvement of
internet user in online forums, they become a major source of
communicating the identification and fixing of defects in open source
software. The research model of this empirical investigation
establishes and studies the relationship between open source software
defects and online public forums. The results of this empirical study
provide evidence about the realities of software defects myths of
open source software. We used a dataset consist of 616 open source
software projects covering a broad range of categories to study the
research model of this investigation. The results of this investigation
show that online forums play a significant role identifying and fixing
the defects in open source software.
Abstract: In this paper, a new learning approach for network
intrusion detection using naïve Bayesian classifier and ID3 algorithm
is presented, which identifies effective attributes from the training
dataset, calculates the conditional probabilities for the best attribute
values, and then correctly classifies all the examples of training and
testing dataset. Most of the current intrusion detection datasets are
dynamic, complex and contain large number of attributes. Some of
the attributes may be redundant or contribute little for detection
making. It has been successfully tested that significant attribute
selection is important to design a real world intrusion detection
systems (IDS). The purpose of this study is to identify effective
attributes from the training dataset to build a classifier for network
intrusion detection using data mining algorithms. The experimental
results on KDD99 benchmark intrusion detection dataset demonstrate
that this new approach achieves high classification rates and reduce
false positives using limited computational resources.