Svision: Visual Identification of Scanning and Denial of Service Attacks

We propose a novel graphical technique (SVision) for intrusion detection, which pictures the network as a community of hosts independently roaming in a 3D space defined by the set of services that they use. The aim of SVision is to graphically cluster the hosts into normal and abnormal ones, highlighting only the ones that are considered as a threat to the network. Our experimental results using DARPA 1999 and 2000 intrusion detection and evaluation datasets show the proposed technique as a good candidate for the detection of various threats of the network such as vertical and horizontal scanning, Denial of Service (DoS), and Distributed DoS (DDoS) attacks.

Color Image Segmentation using Adaptive Spatial Gaussian Mixture Model

An adaptive spatial Gaussian mixture model is proposed for clustering based color image segmentation. A new clustering objective function which incorporates the spatial information is introduced in the Bayesian framework. The weighting parameter for controlling the importance of spatial information is made adaptive to the image content to augment the smoothness towards piecewisehomogeneous region and diminish the edge-blurring effect and hence the name adaptive spatial finite mixture model. The proposed approach is compared with the spatially variant finite mixture model for pixel labeling. The experimental results with synthetic and Berkeley dataset demonstrate that the proposed method is effective in improving the segmentation and it can be employed in different practical image content understanding applications.

Human Detection using Projected Edge Feature

The purpose of this paper is to detect human in images. This paper proposes a method for extracting human body feature descriptors consisting of projected edge component series. The feature descriptor can express appearances and shapes of human with local and global distribution of edges. Our method evaluated with a linear SVM classifier on Daimler-Chrysler pedestrian dataset, and test with various sub-region size. The result shows that the accuracy level of proposed method similar to Histogram of Oriented Gradients(HOG) feature descriptor and feature extraction process is simple and faster than existing methods.

A Hybrid Recommender System based on Collaborative Filtering and Cloud Model

User-based Collaborative filtering (CF), one of the most prevailing and efficient recommendation techniques, provides personalized recommendations to users based on the opinions of other users. Although the CF technique has been successfully applied in various applications, it suffers from serious sparsity problems. The cloud-model approach addresses the sparsity problems by constructing the user-s global preference represented by a cloud eigenvector. The user-based CF approach works well with dense datasets while the cloud-model CF approach has a greater performance when the dataset is sparse. In this paper, we present a hybrid approach that integrates the predictions from both the user-based CF and the cloud-model CF approaches. The experimental results show that the proposed hybrid approach can ameliorate the sparsity problem and provide an improved prediction quality.

Satellite Data Classification Accuracy Assessment Based from Reference Dataset

In order to develop forest management strategies in tropical forest in Malaysia, surveying the forest resources and monitoring the forest area affected by logging activities is essential. There are tremendous effort has been done in classification of land cover related to forest resource management in this country as it is a priority in all aspects of forest mapping using remote sensing and related technology such as GIS. In fact classification process is a compulsory step in any remote sensing research. Therefore, the main objective of this paper is to assess classification accuracy of classified forest map on Landsat TM data from difference number of reference data (200 and 388 reference data). This comparison was made through observation (200 reference data), and interpretation and observation approaches (388 reference data). Five land cover classes namely primary forest, logged over forest, water bodies, bare land and agricultural crop/mixed horticultural can be identified by the differences in spectral wavelength. Result showed that an overall accuracy from 200 reference data was 83.5 % (kappa value 0.7502459; kappa variance 0.002871), which was considered acceptable or good for optical data. However, when 200 reference data was increased to 388 in the confusion matrix, the accuracy slightly improved from 83.5% to 89.17%, with Kappa statistic increased from 0.7502459 to 0.8026135, respectively. The accuracy in this classification suggested that this strategy for the selection of training area, interpretation approaches and number of reference data used were importance to perform better classification result.

Investigating the Performance of Minimax Search and Aggregate Mahalanobis Distance Function in Evolving an Ayo/Awale Player

In this paper we describe a hybrid technique of Minimax search and aggregate Mahalanobis distance function synthesis to evolve Awale game player. The hybrid technique helps to suggest a move in a short amount of time without looking into endgame database. However, the effectiveness of the technique is heavily dependent on the training dataset of the Awale strategies utilized. The evolved player was tested against Awale shareware program and the result is appealing.

Analysis of Relation between Unlabeled and Labeled Data to Self-Taught Learning Performance

Obtaining labeled data in supervised learning is often difficult and expensive, and thus the trained learning algorithm tends to be overfitting due to small number of training data. As a result, some researchers have focused on using unlabeled data which may not necessary to follow the same generative distribution as the labeled data to construct a high-level feature for improving performance on supervised learning tasks. In this paper, we investigate the impact of the relationship between unlabeled and labeled data for classification performance. Specifically, we will apply difference unlabeled data which have different degrees of relation to the labeled data for handwritten digit classification task based on MNIST dataset. Our experimental results show that the higher the degree of relation between unlabeled and labeled data, the better the classification performance. Although the unlabeled data that is completely from different generative distribution to the labeled data provides the lowest classification performance, we still achieve high classification performance. This leads to expanding the applicability of the supervised learning algorithms using unsupervised learning.

A Simple Affymetrix Ratio-transformation Method Yields Comparable Expression Level Quantifications with cDNA Data

Gene expression profiling is rapidly evolving into a powerful technique for investigating tumor malignancies. The researchers are overwhelmed with the microarray-based platforms and methods that confer them the freedom to conduct large-scale gene expression profiling measurements. Simultaneously, investigations into cross-platform integration methods have started gaining momentum due to their underlying potential to help comprehend a myriad of broad biological issues in tumor diagnosis, prognosis, and therapy. However, comparing results from different platforms remains to be a challenging task as various inherent technical differences exist between the microarray platforms. In this paper, we explain a simple ratio-transformation method, which can provide some common ground for cDNA and Affymetrix platform towards cross-platform integration. The method is based on the characteristic data attributes of Affymetrix- and cDNA- platform. In the work, we considered seven childhood leukemia patients and their gene expression levels in either platform. With a dataset of 822 differentially expressed genes from both these platforms, we carried out a specific ratio-treatment to Affymetrix data, which subsequently showed an improvement in the relationship with the cDNA data.

Biometric Authentication Using Fast Correlation of Near Infrared Hand Vein Patterns

This paper presents a hand vein authentication system using fast spatial correlation of hand vein patterns. In order to evaluate the system performance, a prototype was designed and a dataset of 50 persons of different ages above 16 and of different gender, each has 10 images per person was acquired at different intervals, 5 images for left hand and 5 images for right hand. In verification testing analysis, we used 3 images to represent the templates and 2 images for testing. Each of the 2 images is matched with the existing 3 templates. FAR of 0.02% and FRR of 3.00 % were reported at threshold 80. The system efficiency at this threshold was found to be 99.95%. The system can operate at a 97% genuine acceptance rate and 99.98 % genuine reject rate, at corresponding threshold of 80. The EER was reported as 0.25 % at threshold 77. We verified that no similarity exists between right and left hand vein patterns for the same person over the acquired dataset sample. Finally, this distinct 100 hand vein patterns dataset sample can be accessed by researchers and students upon request for testing other methods of hand veins matching.

Neural Network Based Approach for Face Detection cum Face Recognition

Automatic face detection is a complex problem in image processing. Many methods exist to solve this problem such as template matching, Fisher Linear Discriminate, Neural Networks, SVM, and MRC. Success has been achieved with each method to varying degrees and complexities. In proposed algorithm we used upright, frontal faces for single gray scale images with decent resolution and under good lighting condition. In the field of face recognition technique the single face is matched with single face from the training dataset. The author proposed a neural network based face detection algorithm from the photographs as well as if any test data appears it check from the online scanned training dataset. Experimental result shows that the algorithm detected up to 95% accuracy for any image.

A Model for Estimation of Efforts in Development of Software Systems

Software effort estimation is the process of predicting the most realistic use of effort required to develop or maintain software based on incomplete, uncertain and/or noisy input. Effort estimates may be used as input to project plans, iteration plans, budgets. There are various models like Halstead, Walston-Felix, Bailey-Basili, Doty and GA Based models which have already used to estimate the software effort for projects. In this study Statistical Models, Fuzzy-GA and Neuro-Fuzzy (NF) Inference Systems are experimented to estimate the software effort for projects. The performances of the developed models were tested on NASA software project datasets and results are compared with the Halstead, Walston-Felix, Bailey-Basili, Doty and Genetic Algorithm Based models mentioned in the literature. The result shows that the NF Model has the lowest MMRE and RMSE values. The NF Model shows the best results as compared with the Fuzzy-GA based hybrid Inference System and other existing Models that are being used for the Effort Prediction with lowest MMRE and RMSE values.

An Integrative Bayesian Approach to Supporting the Prediction of Protein-Protein Interactions: A Case Study in Human Heart Failure

Recent years have seen a growing trend towards the integration of multiple information sources to support large-scale prediction of protein-protein interaction (PPI) networks in model organisms. Despite advances in computational approaches, the combination of multiple “omic" datasets representing the same type of data, e.g. different gene expression datasets, has not been rigorously studied. Furthermore, there is a need to further investigate the inference capability of powerful approaches, such as fullyconnected Bayesian networks, in the context of the prediction of PPI networks. This paper addresses these limitations by proposing a Bayesian approach to integrate multiple datasets, some of which encode the same type of “omic" data to support the identification of PPI networks. The case study reported involved the combination of three gene expression datasets relevant to human heart failure (HF). In comparison with two traditional methods, Naive Bayesian and maximum likelihood ratio approaches, the proposed technique can accurately identify known PPI and can be applied to infer potentially novel interactions.

IMDC: An Image-Mapped Data Clustering Technique for Large Datasets

In this paper, we present a new algorithm for clustering data in large datasets using image processing approaches. First the dataset is mapped into a binary image plane. The synthesized image is then processed utilizing efficient image processing techniques to cluster the data in the dataset. Henceforth, the algorithm avoids exhaustive search to identify clusters. The algorithm considers only a small set of the data that contains critical boundary information sufficient to identify contained clusters. Compared to available data clustering techniques, the proposed algorithm produces similar quality results and outperforms them in execution time and storage requirements.

Experimental Evaluation of Drilling Damage on the Strength of Cores Extracted from RC Buildings

Concrete strength evaluated from compression tests on cores is affected by several factors causing differences from the in-situ strength at the location from which the core specimen was extracted. Among the factors, there is the damage possibly occurring during the drilling phase that generally leads to underestimate the actual in-situ strength. In order to quantify this effect, in this study two wide datasets have been examined, including: (i) about 500 core specimens extracted from Reinforced Concrete existing structures, and (ii) about 600 cube specimens taken during the construction of new structures in the framework of routine acceptance control. The two experimental datasets have been compared in terms of compression strength and specific weight values, accounting for the main factors affecting a concrete property, that is type and amount of cement, aggregates' grading, type and maximum size of aggregates, water/cement ratio, placing and curing modality, concrete age. The results show that the magnitude of the strength reduction due to drilling damage is strongly affected by the actual properties of concrete, being inversely proportional to its strength. Therefore, the application of a single value of the correction coefficient, as generally suggested in the technical literature and in structural codes, appears inappropriate. A set of values of the drilling damage coefficient is suggested as a function of the strength obtained from compressive tests on cores.

On Methodologies for Analysing Sickness Absence Data: An Insight into a New Method

Sickness absence represents a major economic and social issue. Analysis of sick leave data is a recurrent challenge to analysts because of the complexity of the data structure which is often time dependent, highly skewed and clumped at zero. Ignoring these features to make statistical inference is likely to be inefficient and misguided. Traditional approaches do not address these problems. In this study, we discuss model methodologies in terms of statistical techniques for addressing the difficulties with sick leave data. We also introduce and demonstrate a new method by performing a longitudinal assessment of long-term absenteeism using a large registration dataset as a working example available from the Helsinki Health Study for municipal employees from Finland during the period of 1990-1999. We present a comparative study on model selection and a critical analysis of the temporal trends, the occurrence and degree of long-term sickness absences among municipal employees. The strengths of this working example include the large sample size over a long follow-up period providing strong evidence in supporting of the new model. Our main goal is to propose a way to select an appropriate model and to introduce a new methodology for analysing sickness absence data as well as to demonstrate model applicability to complicated longitudinal data.

A Hybrid Approach for Quantification of Novelty in Rule Discovery

Rule Discovery is an important technique for mining knowledge from large databases. Use of objective measures for discovering interesting rules lead to another data mining problem, although of reduced complexity. Data mining researchers have studied subjective measures of interestingness to reduce the volume of discovered rules to ultimately improve the overall efficiency of KDD process. In this paper we study novelty of the discovered rules as a subjective measure of interestingness. We propose a hybrid approach that uses objective and subjective measures to quantify novelty of the discovered rules in terms of their deviations from the known rules. We analyze the types of deviation that can arise between two rules and categorize the discovered rules according to the user specified threshold. We implement the proposed framework and experiment with some public datasets. The experimental results are quite promising.

A Fast Sign Localization System Using Discriminative Color Invariant Segmentation

Building intelligent traffic guide systems has been an interesting subject recently. A good system should be able to observe all important visual information to be able to analyze the context of the scene. To do so, signs in general, and traffic signs in particular, are usually taken into account as they contain rich information to these systems. Therefore, many researchers have put an effort on sign recognition field. Sign localization or sign detection is the most important step in the sign recognition process. This step filters out non informative area in the scene, and locates candidates in later steps. In this paper, we apply a new approach in detecting sign locations using a new color invariant model. Experiments are carried out with different datasets introduced in other works where authors claimed the difficulty in detecting signs under unfavorable imaging conditions. Our method is simple, fast and most importantly it gives a high detection rate in locating signs.

Hierarchical Clustering Analysis with SOM Networks

This work presents a neural network model for the clustering analysis of data based on Self Organizing Maps (SOM). The model evolves during the training stage towards a hierarchical structure according to the input requirements. The hierarchical structure symbolizes a specialization tool that provides refinements of the classification process. The structure behaves like a single map with different resolutions depending on the region to analyze. The benefits and performance of the algorithm are discussed in application to the Iris dataset, a classical example for pattern recognition.

Defects in Open Source Software: The Role of Online Forums

Free and open source software is gaining popularity at an unprecedented rate of growth. Organizations despite some concerns about the quality have been using them for various purposes. One of the biggest concerns about free and open source software is post release software defects and their fixing. Many believe that there is no appropriate support available to fix the bugs. On the contrary some believe that due to the active involvement of internet user in online forums, they become a major source of communicating the identification and fixing of defects in open source software. The research model of this empirical investigation establishes and studies the relationship between open source software defects and online public forums. The results of this empirical study provide evidence about the realities of software defects myths of open source software. We used a dataset consist of 616 open source software projects covering a broad range of categories to study the research model of this investigation. The results of this investigation show that online forums play a significant role identifying and fixing the defects in open source software.

Adaptive Network Intrusion Detection Learning: Attribute Selection and Classification

In this paper, a new learning approach for network intrusion detection using naïve Bayesian classifier and ID3 algorithm is presented, which identifies effective attributes from the training dataset, calculates the conditional probabilities for the best attribute values, and then correctly classifies all the examples of training and testing dataset. Most of the current intrusion detection datasets are dynamic, complex and contain large number of attributes. Some of the attributes may be redundant or contribute little for detection making. It has been successfully tested that significant attribute selection is important to design a real world intrusion detection systems (IDS). The purpose of this study is to identify effective attributes from the training dataset to build a classifier for network intrusion detection using data mining algorithms. The experimental results on KDD99 benchmark intrusion detection dataset demonstrate that this new approach achieves high classification rates and reduce false positives using limited computational resources.