Neural Networks Learning Improvement using the K-Means Clustering Algorithm to Detect Network Intrusions

In the present work, we propose a new technique to enhance the learning capabilities and reduce the computation intensity of a competitive learning multi-layered neural network using the K-means clustering algorithm. The proposed model use multi-layered network architecture with a back propagation learning mechanism. The K-means algorithm is first applied to the training dataset to reduce the amount of samples to be presented to the neural network, by automatically selecting an optimal set of samples. The obtained results demonstrate that the proposed technique performs exceptionally in terms of both accuracy and computation time when applied to the KDD99 dataset compared to a standard learning schema that use the full dataset.

Ezilla Cloud Service with Cassandra Database for Sensor Observation System

The main mission of Ezilla is to provide a friendly interface to access the virtual machine and quickly deploy the high performance computing environment. Ezilla has been developed by Pervasive Computing Team at National Center for High-performance Computing (NCHC). Ezilla integrates the Cloud middleware, virtualization technology, and Web-based Operating System (WebOS) to form a virtual computer in distributed computing environment. In order to upgrade the dataset and speedup, we proposed the sensor observation system to deal with a huge amount of data in the Cassandra database. The sensor observation system is based on the Ezilla to store sensor raw data into distributed database. We adopt the Ezilla Cloud service to create virtual machines and login into virtual machine to deploy the sensor observation system. Integrating the sensor observation system with Ezilla is to quickly deploy experiment environment and access a huge amount of data with distributed database that support the replication mechanism to protect the data security.

Dataset Analysis Using Membership-Deviation Graph

Classification is one of the primary themes in computational biology. The accuracy of classification strongly depends on quality of a dataset, and we need some method to evaluate this quality. In this paper, we propose a new graphical analysis method using 'Membership-Deviation Graph (MDG)' for analyzing quality of a dataset. MDG represents degree of membership and deviations for instances of a class in the dataset. The result of MDG analysis is used for understanding specific feature and for selecting best feature for classification.

Folksonomy-based Recommender Systems with User-s Recent Preferences

Social bookmarking is an environment in which the user gradually changes interests over time so that the tag data associated with the current temporal period is usually more important than tag data temporally far from the current period. This implies that in the social tagging system, the newly tagged items by the user are more relevant than older items. This study proposes a novel recommender system that considers the users- recent tag preferences. The proposed system includes the following stages: grouping similar users into clusters using an E-M clustering algorithm, finding similar resources based on the user-s bookmarks, and recommending the top-N items to the target user. The study examines the system-s information retrieval performance using a dataset from del.icio.us, which is a famous social bookmarking web site. Experimental results show that the proposed system is better and more effective than traditional approaches.

Improvement in Power Transformer Intelligent Dissolved Gas Analysis Method

Non-Destructive evaluation of in-service power transformer condition is necessary for avoiding catastrophic failures. Dissolved Gas Analysis (DGA) is one of the important methods. Traditional, statistical and intelligent DGA approaches have been adopted for accurate classification of incipient fault sources. Unfortunately, there are not often enough faulty patterns required for sufficient training of intelligent systems. By bootstrapping the shortcoming is expected to be alleviated and algorithms with better classification success rates to be obtained. In this paper the performance of an artificial neural network, K-Nearest Neighbour and support vector machine methods using bootstrapped data are detailed and shown that while the success rate of the ANN algorithms improves remarkably, the outcome of the others do not benefit so much from the provided enlarged data space. For assessment, two databases are employed: IEC TC10 and a dataset collected from reported data in papers. High average test success rate well exhibits the remarkable outcome.

GeNS: a Biological Data Integration Platform

The scientific achievements coming from molecular biology depend greatly on the capability of computational applications to analyze the laboratorial results. A comprehensive analysis of an experiment requires typically the simultaneous study of the obtained dataset with data that is available in several distinct public databases. Nevertheless, developing a centralized access to these distributed databases rises up a set of challenges such as: what is the best integration strategy, how to solve nomenclature clashes, how to solve database overlapping data and how to deal with huge datasets. In this paper we present GeNS, a system that uses a simple and yet innovative approach to address several biological data integration issues. Compared with existing systems, the main advantages of GeNS are related to its maintenance simplicity and to its coverage and scalability, in terms of number of supported databases and data types. To support our claims we present the current use of GeNS in two concrete applications. GeNS currently contains more than 140 million of biological relations and it can be publicly downloaded or remotely access through SOAP web services.

Error Effects on SAR Image Resolution using Range Doppler Imaging Algorithm

Synthetic Aperture Radar (SAR) is an imaging radar form by taking full advantage of the relative movement of the antenna with respect to the target. Through the simultaneous processing of the radar reflections over the movement of the antenna via the Range Doppler Algorithm (RDA), the superior resolution of a theoretical wider antenna, termed synthetic aperture, is obtained. Therefore, SAR can achieve high resolution two dimensional imagery of the ground surface. In addition, two filtering steps in range and azimuth direction provide accurate enough result. This paper develops a simulation in which realistic SAR images can be generated. Also, the effect of velocity errors in the resulting image has also been investigated. Taking some velocity errors into account, the simulation results on the image resolution would be presented. Most of the times, algorithms need to be adjusted for particular datasets, or particular applications.

Classifier Based Text Mining for Neural Network

Text Mining is around applying knowledge discovery techniques to unstructured text is termed knowledge discovery in text (KDT), or Text data mining or Text Mining. In Neural Network that address classification problems, training set, testing set, learning rate are considered as key tasks. That is collection of input/output patterns that are used to train the network and used to assess the network performance, set the rate of adjustments. This paper describes a proposed back propagation neural net classifier that performs cross validation for original Neural Network. In order to reduce the optimization of classification accuracy, training time. The feasibility the benefits of the proposed approach are demonstrated by means of five data sets like contact-lenses, cpu, weather symbolic, Weather, labor-nega-data. It is shown that , compared to exiting neural network, the training time is reduced by more than 10 times faster when the dataset is larger than CPU or the network has many hidden units while accuracy ('percent correct') was the same for all datasets but contact-lences, which is the only one with missing attributes. For contact-lences the accuracy with Proposed Neural Network was in average around 0.3 % less than with the original Neural Network. This algorithm is independent of specify data sets so that many ideas and solutions can be transferred to other classifier paradigms.

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Multiple sequence alignment is a fundamental part in many bioinformatics applications such as phylogenetic analysis. Many alignment methods have been proposed. Each method gives a different result for the same data set, and consequently generates a different phylogenetic tree. Hence, the chosen alignment method affects the resulting tree. However in the literature, there is no evaluation of multiple alignment methods based on the comparison of their phylogenetic trees. This work evaluates the following eight aligners: ClustalX, T-Coffee, SAGA, MUSCLE, MAFFT, DIALIGN, ProbCons and Align-m, based on their phylogenetic trees (test trees) produced on a given data set. The Neighbor-Joining method is used to estimate trees. Three criteria, namely, the dNNI, the dRF and the Id_Tree are established to test the ability of different alignment methods to produce closer test tree compared to the reference one (true tree). Results show that the method which produces the most accurate alignment gives the nearest test tree to the reference tree. MUSCLE outperforms all aligners with respect to the three criteria and for all datasets, performing particularly better when sequence identities are within 10-20%. It is followed by T-Coffee at lower sequence identity (30%), trees scores of all methods become similar.

Medical Image Segmentation Using Deformable Model and Local Fitting Binary: Thoracic Aorta

This paper presents an application of level sets for the segmentation of abdominal and thoracic aortic aneurysms in CTA datasets. An important challenge in reliably detecting aortic is the need to overcome problems associated with intensity inhomogeneities. Level sets are part of an important class of methods that utilize partial differential equations (PDEs) and have been extensively applied in image segmentation. A kernel function in the level set formulation aids the suppression of noise in the extracted regions of interest and then guides the motion of the evolving contour for the detection of weak boundaries. The speed of curve evolution has been significantly improved with a resulting decrease in segmentation time compared with previous implementations of level sets, and are shown to be more effective than other approaches in coping with intensity inhomogeneities. We have applied the Courant Friedrichs Levy (CFL) condition as stability criterion for our algorithm.

Designing Early Warning System: Prediction Accuracy of Currency Crisis by Using k-Nearest Neighbour Method

Developing a stable early warning system (EWS) model that is capable to give an accurate prediction is a challenging task. This paper introduces k-nearest neighbour (k-NN) method which never been applied in predicting currency crisis before with the aim of increasing the prediction accuracy. The proposed k-NN performance depends on the choice of a distance that is used where in our analysis; we take the Euclidean distance and the Manhattan as a consideration. For the comparison, we employ three other methods which are logistic regression analysis (logit), back-propagation neural network (NN) and sequential minimal optimization (SMO). The analysis using datasets from 8 countries and 13 macro-economic indicators for each country shows that the proposed k-NN method with k = 4 and Manhattan distance performs better than the other methods.

A Subtractive Clustering Based Approach for Early Prediction of Fault Proneness in Software Modules

In this paper, subtractive clustering based fuzzy inference system approach is used for early detection of faults in the function oriented software systems. This approach has been tested with real time defect datasets of NASA software projects named as PC1 and CM1. Both the code based model and joined model (combination of the requirement and code based metrics) of the datasets are used for training and testing of the proposed approach. The performance of the models is recorded in terms of Accuracy, MAE and RMSE values. The performance of the proposed approach is better in case of Joined Model. As evidenced from the results obtained it can be concluded that Clustering and fuzzy logic together provide a simple yet powerful means to model the earlier detection of faults in the function oriented software systems.

Aliveness Detection of Fingerprints using Multiple Static Features

Fake finger submission attack is a major problem in fingerprint recognition systems. In this paper, we introduce an aliveness detection method based on multiple static features, which derived from a single fingerprint image. The static features are comprised of individual pore spacing, residual noise and several first order statistics. Specifically, correlation filter is adopted to address individual pore spacing. The multiple static features are useful to reflect the physiological and statistical characteristics of live and fake fingerprint. The classification can be made by calculating the liveness scores from each feature and fusing the scores through a classifier. In our dataset, we compare nine classifiers and the best classification rate at 85% is attained by using a Reduced Multivariate Polynomial classifier. Our approach is faster and more convenient for aliveness check for field applications.

Event Information Extraction System (EIEE): FSM vs HMM

Automatic Extraction of Event information from social text stream (emails, social network sites, blogs etc) is a vital requirement for many applications like Event Planning and Management systems and security applications. The key information components needed from Event related text are Event title, location, participants, date and time. Emails have very unique distinctions over other social text streams from the perspective of layout and format and conversation style and are the most commonly used communication channel for broadcasting and planning events. Therefore we have chosen emails as our dataset. In our work, we have employed two statistical NLP methods, named as Finite State Machines (FSM) and Hidden Markov Model (HMM) for the extraction of event related contextual information. An application has been developed providing a comparison among the two methods over the event extraction task. It comprises of two modules, one for each method, and works for both bulk as well as direct user input. The results are evaluated using Precision, Recall and F-Score. Experiments show that both methods produce high performance and accuracy, however HMM was good enough over Title extraction and FSM proved to be better for Venue, Date, and time.

Protein Secondary Structure Prediction Using Parallelized Rule Induction from Coverings

Protein 3D structure prediction has always been an important research area in bioinformatics. In particular, the prediction of secondary structure has been a well-studied research topic. Despite the recent breakthrough of combining multiple sequence alignment information and artificial intelligence algorithms to predict protein secondary structure, the Q3 accuracy of various computational prediction algorithms rarely has exceeded 75%. In a previous paper [1], this research team presented a rule-based method called RT-RICO (Relaxed Threshold Rule Induction from Coverings) to predict protein secondary structure. The average Q3 accuracy on the sample datasets using RT-RICO was 80.3%, an improvement over comparable computational methods. Although this demonstrated that RT-RICO might be a promising approach for predicting secondary structure, the algorithm-s computational complexity and program running time limited its use. Herein a parallelized implementation of a slightly modified RT-RICO approach is presented. This new version of the algorithm facilitated the testing of a much larger dataset of 396 protein domains [2]. Parallelized RTRICO achieved a Q3 score of 74.6%, which is higher than the consensus prediction accuracy of 72.9% that was achieved for the same test dataset by a combination of four secondary structure prediction methods [2].

A Bayesian Hierarchical 13COBT to Correct Estimates Associated with a Delayed Gastric Emptying

The use of a Bayesian Hierarchical Model (BHM) to interpret breath measurements obtained during a 13C Octanoic Breath Test (13COBT) is demonstrated. The statistical analysis was implemented using WinBUGS, a commercially available computer package for Bayesian inference. A hierarchical setting was adopted where poorly defined parameters associated with a delayed Gastric Emptying (GE) were able to "borrow" strength from global distributions. This is proved to be a sufficient tool to correct model's failures and data inconsistencies apparent in conventional analyses employing a Non-linear least squares technique (NLS). Direct comparison of two parameters describing gastric emptying ng ( tlag -lag phase, t1/ 2 -half emptying time) revealed a strong correlation between the two methods. Despite our large dataset ( n = 164 ), Bayesian modeling was fast and provided a successful fitting for all subjects. On the contrary, NLS failed to return acceptable estimates in cases where GE was delayed.

Model Parameters Estimating on Lyman–Kutcher–Burman Normal Tissue Complication Probability for Xerostomia on Head and Neck Cancer

The purpose of this study is to derive parameters estimating for the Lyman–Kutcher–Burman (LKB) normal tissue complication probability (NTCP) model using analysis of scintigraphy assessments and quality of life (QoL) measurement questionnaires for the parotid gland (xerostomia). In total, 31 patients with head-and-neck (HN) cancer were enrolled. Salivary excretion factor (SEF) and EORTC QLQ-H&N35 questionnaires datasets are used for the NTCP modeling to describe the incidence of grade 4 xerostomia. Assuming that n= 1, NTCP fitted parameters are given as TD50= 43.6 Gy, m= 0.18 in SEF analysis, and as TD50= 44.1 Gy, m= 0.11 in QoL measurements, respectively. SEF and QoL datasets can validate the Quantitative Analyses of Normal Tissue Effects in the Clinic (QUANTEC) guidelines well, resulting in NPV-s of 100% for the both datasets and suggests that the QUANTEC 25/20Gy gland-spared guidelines are suitable for clinical used for the HN cohort to effectively avoid xerostomia.

An SVM based Classification Method for Cancer Data using Minimum Microarray Gene Expressions

This paper gives a novel method for improving classification performance for cancer classification with very few microarray Gene expression data. The method employs classification with individual gene ranking and gene subset ranking. For selection and classification, the proposed method uses the same classifier. The method is applied to three publicly available cancer gene expression datasets from Lymphoma, Liver and Leukaemia datasets. Three different classifiers namely Support vector machines-one against all (SVM-OAA), K nearest neighbour (KNN) and Linear Discriminant analysis (LDA) were tested and the results indicate the improvement in performance of SVM-OAA classifier with satisfactory results on all the three datasets when compared with the other two classifiers.

Image Clustering Framework for BAVM Segmentation in 3DRA Images: Performance Analysis

Brain ArterioVenous Malformation (BAVM) is an abnormal tangle of brain blood vessels where arteries shunt directly into veins with no intervening capillary bed which causes high pressure and hemorrhage risk. The success of treatment by embolization in interventional neuroradiology is highly dependent on the accuracy of the vessels visualization. In this paper the performance of clustering techniques on vessel segmentation from 3- D rotational angiography (3DRA) images is investigated and a new technique of segmentation is proposed. This method consists in: preprocessing step of image enhancement, then K-Means (KM), Fuzzy C-Means (FCM) and Expectation Maximization (EM) clustering are used to separate vessel pixels from background and artery pixels from vein pixels when possible. A post processing step of removing false-alarm components is applied before constructing a three-dimensional volume of the vessels. The proposed method was tested on six datasets along with a medical assessment of an expert. Obtained results showed encouraging segmentations.

Virulent-GO: Prediction of Virulent Proteins in Bacterial Pathogens Utilizing Gene Ontology Terms

Prediction of bacterial virulent protein sequences can give assistance to identification and characterization of novel virulence-associated factors and discover drug/vaccine targets against proteins indispensable to pathogenicity. Gene Ontology (GO) annotation which describes functions of genes and gene products as a controlled vocabulary of terms has been shown effectively for a variety of tasks such as gene expression study, GO annotation prediction, protein subcellular localization, etc. In this study, we propose a sequence-based method Virulent-GO by mining informative GO terms as features for predicting bacterial virulent proteins. Each protein in the datasets used by the existing method VirulentPred is annotated by using BLAST to obtain its homologies with known accession numbers for retrieving GO terms. After investigating various popular classifiers using the same five-fold cross-validation scheme, Virulent-GO using the single kind of GO term features with an accuracy of 82.5% is slightly better than VirulentPred with 81.8% using five kinds of sequence-based features. For the evaluation of independent test, Virulent-GO also yields better results (82.0%) than VirulentPred (80.7%). When evaluating single kind of feature with SVM, the GO term feature performs much well, compared with each of the five kinds of features.