Scaling up Detection Rates and Reducing False Positives in Intrusion Detection using NBTree

In this paper, we present a new learning algorithm for anomaly based network intrusion detection using improved self adaptive naïve Bayesian tree (NBTree), which induces a hybrid of decision tree and naïve Bayesian classifier. The proposed approach scales up the balance detections for different attack types and keeps the false positives at acceptable level in intrusion detection. In complex and dynamic large intrusion detection dataset, the detection accuracy of naïve Bayesian classifier does not scale up as well as decision tree. It has been successfully tested in other problem domains that naïve Bayesian tree improves the classification rates in large dataset. In naïve Bayesian tree nodes contain and split as regular decision-trees, but the leaves contain naïve Bayesian classifiers. The experimental results on KDD99 benchmark network intrusion detection dataset demonstrate that this new approach scales up the detection rates for different attack types and reduces false positives in network intrusion detection.

Human Action Recognition Based on Ridgelet Transform and SVM

In this paper, a novel algorithm based on Ridgelet Transform and support vector machine is proposed for human action recognition. The Ridgelet transform is a directional multi-resolution transform and it is more suitable for describing the human action by performing its directional information to form spatial features vectors. The dynamic transition between the spatial features is carried out using both the Principal Component Analysis and clustering algorithm K-means. First, the Principal Component Analysis is used to reduce the dimensionality of the obtained vectors. Then, the kmeans algorithm is then used to perform the obtained vectors to form the spatio-temporal pattern, called set-of-labels, according to given periodicity of human action. Finally, a Support Machine classifier is used to discriminate between the different human actions. Different tests are conducted on popular Datasets, such as Weizmann and KTH. The obtained results show that the proposed method provides more significant accuracy rate and it drives more robustness in very challenging situations such as lighting changes, scaling and dynamic environment

Determination and Assessment of Ground Motion and Spectral Parameters for Iran

Many studies have been conducted for derivation of attenuation relationships worldwide, however few relationships have been developed to use for the seismic region of Iranian plateau and only few of these studies have been conducted for derivation of attenuation relationships for parameters such as uniform duration. Uniform duration is the total time during which the acceleration is larger than a given threshold value (default is 5% of PGA). In this study, the database was same as that used previously by Ghodrati Amiri et al. (2007) with same correction methods for earthquake records in Iran. However in this study, records from earthquakes with MS< 4.0 were excluded from this database, each record has individually filtered afterward, and therefore the dataset has been expanded. These new set of attenuation relationships for Iran are derived based on tectonic conditions with soil classification into rock and soil. Earthquake parameters were chosen to be hypocentral distance and magnitude in order to make it easier to use the relationships for seismic hazard analysis. Tehran is the capital city of Iran wit ha large number of important structures. In this study, a probabilistic approach has been utilized for seismic hazard assessment of this city. The resulting uniform duration against return period diagrams are suggested to be used in any projects in the area.

The Response Relation between Climate Change and NDVI over the Qinghai-Tibet plateau

Based on a long-term vegetation index dataset of NDVI and meteorological data from 68 meteorological stations in the Qinghai-Tibet plateau and their relations with major climate factors were analyzed. The results show the following: 1) The linear trends of temperature in the Qinghai-Tibet plateau indicate that the temperature in the plateau generally increased, but it rose faster in the last 20 years. 2) The most significant NDVI increase occurred in the eastern and southern plateau. However, the western and northern plateau demonstrate a decreasing trend. 3) There is a significant positive linear correlation between NDVI and temperature and a negative correlation between NDVI and mean wind speed. However, no significant statistical relationship was found between NDVI and relative humidity, precipitation or sunshine duration.4) The changes in NDVI for the plateau are driven by temperature-precipitation, but for the desert and forest areas, the relation changes to precipitation-temperature-wind velocity and wind velocity-temperature-precipitation.

Genetic Programming Approach for Multi-Category Pattern Classification Appliedto Network Intrusions Detection

This paper describes a new approach of classification using genetic programming. The proposed technique consists of genetically coevolving a population of non-linear transformations on the input data to be classified, and map them to a new space with a reduced dimension, in order to get a maximum inter-classes discrimination. The classification of new samples is then performed on the transformed data, and so become much easier. Contrary to the existing GP-classification techniques, the proposed one use a dynamic repartition of the transformed data in separated intervals, the efficacy of a given intervals repartition is handled by the fitness criterion, with a maximum classes discrimination. Experiments were first performed using the Fisher-s Iris dataset, and then, the KDD-99 Cup dataset was used to study the intrusion detection and classification problem. Obtained results demonstrate that the proposed genetic approach outperform the existing GP-classification methods [1],[2] and [3], and give a very accepted results compared to other existing techniques proposed in [4],[5],[6],[7] and [8].

Improving Classification Accuracy with Discretization on Datasets Including Continuous Valued Features

This study analyzes the effect of discretization on classification of datasets including continuous valued features. Six datasets from UCI which containing continuous valued features are discretized with entropy-based discretization method. The performance improvement between the dataset with original features and the dataset with discretized features is compared with k-nearest neighbors, Naive Bayes, C4.5 and CN2 data mining classification algorithms. As the result the classification accuracies of the six datasets are improved averagely by 1.71% to 12.31%.

Mining Implicit Knowledge to Predict Political Risk by Providing Novel Framework with Using Bayesian Network

Nowadays predicting political risk level of country has become a critical issue for investors who intend to achieve accurate information concerning stability of the business environments. Since, most of the times investors are layman and nonprofessional IT personnel; this paper aims to propose a framework named GECR in order to help nonexpert persons to discover political risk stability across time based on the political news and events. To achieve this goal, the Bayesian Networks approach was utilized for 186 political news of Pakistan as sample dataset. Bayesian Networks as an artificial intelligence approach has been employed in presented framework, since this is a powerful technique that can be applied to model uncertain domains. The results showed that our framework along with Bayesian Networks as decision support tool, predicted the political risk level with a high degree of accuracy.

Improving Worm Detection with Artificial Neural Networks through Feature Selection and Temporal Analysis Techniques

Computer worm detection is commonly performed by antivirus software tools that rely on prior explicit knowledge of the worm-s code (detection based on code signatures). We present an approach for detection of the presence of computer worms based on Artificial Neural Networks (ANN) using the computer's behavioral measures. Identification of significant features, which describe the activity of a worm within a host, is commonly acquired from security experts. We suggest acquiring these features by applying feature selection methods. We compare three different feature selection techniques for the dimensionality reduction and identification of the most prominent features to capture efficiently the computer behavior in the context of worm activity. Additionally, we explore three different temporal representation techniques for the most prominent features. In order to evaluate the different techniques, several computers were infected with five different worms and 323 different features of the infected computers were measured. We evaluated each technique by preprocessing the dataset according to each one and training the ANN model with the preprocessed data. We then evaluated the ability of the model to detect the presence of a new computer worm, in particular, during heavy user activity on the infected computers.

Artificial Neural Network based Web Application Firewall for SQL Injection

In recent years with the rapid development of Internet and the Web, more and more web applications have been deployed in many fields and organizations such as finance, military, and government. Together with that, hackers have found more subtle ways to attack web applications. According to international statistics, SQL Injection is one of the most popular vulnerabilities of web applications. The consequences of this type of attacks are quite dangerous, such as sensitive information could be stolen or authentication systems might be by-passed. To mitigate the situation, several techniques have been adopted. In this research, a security solution is proposed using Artificial Neural Network to protect web applications against this type of attacks. The solution has been experimented on sample datasets and has given promising result. The solution has also been developed in a prototypic web application firewall called ANNbWAF.

Solving Facility Location Problem on Cluster Computing

Computation of facility location problem for every location in the country is not easy simultaneously. Solving the problem is described by using cluster computing. A technique is to design parallel algorithm by using local search with single swap method in order to solve that problem on clusters. Parallel implementation is done by the use of portable parallel programming, Message Passing Interface (MPI), on Microsoft Windows Compute Cluster. In this paper, it presents the algorithm that used local search with single swap method and implementation of the system of a facility to be opened by using MPI on cluster. If large datasets are considered, the process of calculating a reasonable cost for a facility becomes time consuming. The result shows parallel computation of facility location problem on cluster speedups and scales well as problem size increases.

An ensemble of Weighted Support Vector Machines for Ordinal Regression

Instead of traditional (nominal) classification we investigate the subject of ordinal classification or ranking. An enhanced method based on an ensemble of Support Vector Machines (SVM-s) is proposed. Each binary classifier is trained with specific weights for each object in the training data set. Experiments on benchmark datasets and synthetic data indicate that the performance of our approach is comparable to state of the art kernel methods for ordinal regression. The ensemble method, which is straightforward to implement, provides a very good sensitivity-specificity trade-off for the highest and lowest rank.

A Survey: Clustering Ensembles Techniques

The clustering ensembles combine multiple partitions generated by different clustering algorithms into a single clustering solution. Clustering ensembles have emerged as a prominent method for improving robustness, stability and accuracy of unsupervised classification solutions. So far, many contributions have been done to find consensus clustering. One of the major problems in clustering ensembles is the consensus function. In this paper, firstly, we introduce clustering ensembles, representation of multiple partitions, its challenges and present taxonomy of combination algorithms. Secondly, we describe consensus functions in clustering ensembles including Hypergraph partitioning, Voting approach, Mutual information, Co-association based functions and Finite mixture model, and next explain their advantages, disadvantages and computational complexity. Finally, we compare the characteristics of clustering ensembles algorithms such as computational complexity, robustness, simplicity and accuracy on different datasets in previous techniques.

Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm

With the rapid development in the field of life sciences and the flooding of genomic information, the need for faster and scalable searching methods has become urgent. One of the approaches that were investigated is indexing. The indexing methods have been categorized into three categories which are the lengthbased index algorithms, transformation-based algorithms and mixed techniques-based algorithms. In this research, we focused on the transformation based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied the parallel methods to speed up the index building time and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the use of N-Gram transformation algorithm is an economical solution; it saves time and space too. The result shows that the size of the index is smaller than the size of the dataset when the size of N-Gram is 5 and 6. The parallel N-Gram transformation algorithm-s results indicate that the uses of parallel programming with large dataset are promising which can be improved further.

Ensemble Learning with Decision Tree for Remote Sensing Classification

In recent years, a number of works proposing the combination of multiple classifiers to produce a single classification have been reported in remote sensing literature. The resulting classifier, referred to as an ensemble classifier, is generally found to be more accurate than any of the individual classifiers making up the ensemble. As accuracy is the primary concern, much of the research in the field of land cover classification is focused on improving classification accuracy. This study compares the performance of four ensemble approaches (boosting, bagging, DECORATE and random subspace) with a univariate decision tree as base classifier. Two training datasets, one without ant noise and other with 20 percent noise was used to judge the performance of different ensemble approaches. Results with noise free data set suggest an improvement of about 4% in classification accuracy with all ensemble approaches in comparison to the results provided by univariate decision tree classifier. Highest classification accuracy of 87.43% was achieved by boosted decision tree. A comparison of results with noisy data set suggests that bagging, DECORATE and random subspace approaches works well with this data whereas the performance of boosted decision tree degrades and a classification accuracy of 79.7% is achieved which is even lower than that is achieved (i.e. 80.02%) by using unboosted decision tree classifier.

An Efficient and Generic Hybrid Framework for High Dimensional Data Clustering

Clustering in high dimensional space is a difficult problem which is recurrent in many fields of science and engineering, e.g., bioinformatics, image processing, pattern reorganization and data mining. In high dimensional space some of the dimensions are likely to be irrelevant, thus hiding the possible clustering. In very high dimensions it is common for all the objects in a dataset to be nearly equidistant from each other, completely masking the clusters. Hence, performance of the clustering algorithm decreases. In this paper, we propose an algorithmic framework which combines the (reduct) concept of rough set theory with the k-means algorithm to remove the irrelevant dimensions in a high dimensional space and obtain appropriate clusters. Our experiment on test data shows that this framework increases efficiency of the clustering process and accuracy of the results.

Annual Power Load Forecasting Using Support Vector Regression Machines: A Study on Guangdong Province of China 1985-2008

Load forecasting has always been the essential part of an efficient power system operation and planning. A novel approach based on support vector machines is proposed in this paper for annual power load forecasting. Different kernel functions are selected to construct a combinatorial algorithm. The performance of the new model is evaluated with a real-world dataset, and compared with two neural networks and some traditional forecasting techniques. The results show that the proposed method exhibits superior performance.

Automated Knowledge Engineering

This article outlines conceptualization and implementation of an intelligent system capable of extracting knowledge from databases. Use of hybridized features of both the Rough and Fuzzy Set theory render the developed system flexibility in dealing with discreet as well as continuous datasets. A raw data set provided to the system, is initially transformed in a computer legible format followed by pruning of the data set. The refined data set is then processed through various Rough Set operators which enable discovery of parameter relationships and interdependencies. The discovered knowledge is automatically transformed into a rule base expressed in Fuzzy terms. Two exemplary cancer repository datasets (for Breast and Lung Cancer) have been used to test and implement the proposed framework.

Impact of Environmental Factors on Profit Efficiency of Rice Production: A Study in Vietnam-s Red River Delta

Environmental factors affect agriculture production productivity and efficiency resulted in changing of profit efficiency. This paper attempts to estimate the impacts of environmental factors to profitability of rice farmers in the Red River Delta of Vietnam. The dataset was extracted from 349 rice farmers using personal interviews. Both OLS and MLE trans-log profit functions were used in this study. Five production inputs and four environmental factors were included in these functions. The estimation of the stochastic profit frontier with a two-stage approach was used to measure profitability. The results showed that the profit efficiency was about 75% on the average and environmental factors change profit efficiency significantly beside farm specific characteristics. Plant disease, soil fertility, irrigation apply and water pollution were the four environmental factors cause profit loss in rice production. The result indicated that farmers should reduce household size, farm plots, apply row seeding technique and improve environmental factors to obtain high profit efficiency with special consideration is given for irrigation water quality improvement.

Fuzzy Relatives of the CLARANS Algorithm With Application to Text Clustering

This paper introduces new algorithms (Fuzzy relative of the CLARANS algorithm FCLARANS and Fuzzy c Medoids based on randomized search FCMRANS) for fuzzy clustering of relational data. Unlike existing fuzzy c-medoids algorithm (FCMdd) in which the within cluster dissimilarity of each cluster is minimized in each iteration by recomputing new medoids given current memberships, FCLARANS minimizes the same objective function minimized by FCMdd by changing current medoids in such away that that the sum of the within cluster dissimilarities is minimized. Computing new medoids may be effected by noise because outliers may join the computation of medoids while the choice of medoids in FCLARANS is dictated by the location of a predominant fraction of points inside a cluster and, therefore, it is less sensitive to the presence of outliers. In FCMRANS the step of computing new medoids in FCMdd is modified to be based on randomized search. Furthermore, a new initialization procedure is developed that add randomness to the initialization procedure used with FCMdd. Both FCLARANS and FCMRANS are compared with the robust and linearized version of fuzzy c-medoids (RFCMdd). Experimental results with different samples of the Reuter-21578, Newsgroups (20NG) and generated datasets with noise show that FCLARANS is more robust than both RFCMdd and FCMRANS. Finally, both FCMRANS and FCLARANS are more efficient and their outputs are almost the same as that of RFCMdd in terms of classification rate.

Solving Machine Loading Problem in Flexible Manufacturing Systems Using Particle Swarm Optimization

In this paper, a particle swarm optimization (PSO) algorithm is proposed to solve machine loading problem in flexible manufacturing system (FMS), with bicriterion objectives of minimizing system unbalance and maximizing system throughput in the occurrence of technological constraints such as available machining time and tool slots. A mathematical model is used to select machines, assign operations and the required tools. The performance of the PSO is tested by using 10 sample dataset and the results are compared with the heuristics reported in the literature. The results support that the proposed PSO is comparable with the algorithms reported in the literature.