Abstract: In the present work, we propose a new technique to
enhance the learning capabilities and reduce the computation
intensity of a competitive learning multi-layered neural network
using the K-means clustering algorithm. The proposed model uses a
multi-layered network architecture with a back-propagation learning
mechanism. The K-means algorithm is first applied to the training
dataset to reduce the amount of samples to be presented to the neural
network, by automatically selecting an optimal set of samples. The
obtained results demonstrate that the proposed technique performs
exceptionally in terms of both accuracy and computation time when
applied to the KDD99 dataset, compared with a standard learning
scheme that uses the full dataset.
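The sample-reduction step can be illustrated with a rough sketch (not the authors' code: the two-blob data, the farthest-point initialisation, and k are invented for the demo). K-means collapses the training set to a small set of representative samples that are then presented to the network:

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    # Farthest-point initialisation: deterministic and spreads the seeds out.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        # Recompute centroids as cluster means (keep the old one if a cluster is empty).
        new = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids

# 200 raw samples collapse to k = 2 representative samples for the network.
rng = random.Random(1)
data = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(100)] + \
       [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(100)]
reduced = kmeans(data, 2)
print(len(reduced))
```

The network then trains on the 2 representatives instead of the 200 raw samples, which is the source of the reported reduction in computation time.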
Abstract: The main mission of Ezilla is to provide a friendly
interface to access the virtual machine and quickly deploy the high
performance computing environment. Ezilla has been developed by
Pervasive Computing Team at National Center for High-performance
Computing (NCHC). Ezilla integrates the Cloud middleware,
virtualization technology, and Web-based Operating System (WebOS)
to form a virtual computer in a distributed computing environment. To
scale the dataset and improve speed, we propose a sensor observation
system that handles large volumes of data in a Cassandra database.
The sensor observation system builds on Ezilla to store raw sensor
data in a distributed database. We adopt the Ezilla Cloud service to
create virtual machines and log in to them to deploy the sensor
observation system. Integrating the sensor observation system with
Ezilla makes it possible to quickly deploy an experimental
environment and to access large volumes of data in a distributed
database whose replication mechanism protects data security.
Abstract: Classification is one of the primary themes in
computational biology. The accuracy of classification strongly
depends on the quality of the dataset, and some method is needed to
evaluate this quality. In this paper, we propose a new graphical
analysis method, the Membership-Deviation Graph (MDG), for
analyzing the quality of a dataset. An MDG represents the degree of
membership and the deviations for instances of a class in the dataset.
The result of an MDG analysis is used to understand specific features
and to select the best feature for classification.
Abstract: Social bookmarking is an environment in which
the user gradually changes interests over time so that the tag
data associated with the current temporal period is usually more
important than tag data temporally far from the current period.
This implies that in the social tagging system, the newly tagged
items by the user are more relevant than older items. This study
proposes a novel recommender system that considers the user's
recent tag preferences. The proposed system includes the
following stages: grouping similar users into clusters using an
E-M clustering algorithm, finding similar resources based on
the user's bookmarks, and recommending the top-N items to
the target user. The study examines the system's information
retrieval performance using a dataset from del.icio.us, which is
a well-known social bookmarking website. Experimental results
show that the proposed system is more effective than traditional
approaches.
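The cluster-then-recommend stages can be sketched as follows (a minimal illustration with invented users and bookmarks; the E-M clustering described above is assumed to have already produced the cluster, and only the top-N step is shown):

```python
from collections import Counter

def top_n(target, cluster_users, bookmarks, n=2):
    """Recommend the n items most bookmarked by the target's cluster peers,
    excluding items the target has already bookmarked."""
    counts = Counter()
    for u in cluster_users:
        if u != target:
            counts.update(bookmarks[u])
    seen = set(bookmarks[target])
    ranked = [item for item, _ in counts.most_common() if item not in seen]
    return ranked[:n]

# Hypothetical users and bookmarked tags, all in one E-M cluster.
bookmarks = {
    "alice": ["python", "ml"],
    "bob": ["python", "ml", "nosql"],
    "carol": ["ml", "nosql", "webdev"],
}
print(top_n("alice", ["alice", "bob", "carol"], bookmarks))
```

Newly popular items among cluster peers rank first, which matches the temporal intuition stated above: recent tagging behaviour carries more weight than old bookmarks.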
Abstract: Non-Destructive evaluation of in-service power
transformer condition is necessary for avoiding catastrophic failures.
Dissolved Gas Analysis (DGA) is one of the important methods.
Traditional, statistical and intelligent DGA approaches have been
adopted for accurate classification of incipient fault sources.
Unfortunately, there are often not enough fault patterns to train
intelligent systems sufficiently. Bootstrapping is expected to
alleviate this shortcoming and to yield algorithms with better
classification success rates. In this paper, the performance of
artificial neural network (ANN), K-nearest neighbour, and support
vector machine methods trained on bootstrapped data is detailed; it is
shown that while the success rate of the ANN algorithms improves
remarkably, the other methods do not benefit as much from the
enlarged data space. For assessment, two databases are employed:
IEC TC10 and a dataset collected from data reported in papers. The
high average test success rates illustrate this outcome.
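The bootstrapping of scarce fault patterns can be sketched as follows (the fault labels are invented; only the resampling-with-replacement idea that enlarges the data space is shown):

```python
import random

def bootstrap(samples, size, seed=0):
    """Draw `size` samples with replacement from a small pattern set."""
    rng = random.Random(seed)
    return [rng.choice(samples) for _ in range(size)]

# Hypothetical (gas signature, fault class) pairs standing in for DGA patterns.
faults = [("H2-rich", "arcing"), ("C2H4-rich", "thermal"), ("CO-rich", "cellulose")]
enlarged = bootstrap(faults, 30)
print(len(enlarged))  # 30 resampled patterns from 3 originals
```

The classifier is then trained on the enlarged set; as the abstract notes, how much this helps depends on the learning algorithm.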
Abstract: The scientific achievements coming from molecular
biology depend greatly on the capability of computational
applications to analyze the laboratorial results. A comprehensive
analysis of an experiment requires typically the simultaneous study
of the obtained dataset with data that is available in several distinct
public databases. Nevertheless, developing centralized access to
these distributed databases raises a set of challenges: what the best
integration strategy is, how to resolve nomenclature clashes, how to
handle data that overlaps across databases, and how to deal with huge
datasets. In this paper we present GeNS, a system that uses a simple
yet innovative approach to address several biological data integration
issues. Compared with existing systems, the main advantages of GeNS
are its simplicity of maintenance and its coverage and scalability in
terms of the number of supported databases and data types. To support
our claims we present the current use of GeNS in two concrete
applications. GeNS currently contains more than 140 million
biological relations, and it can be publicly downloaded or remotely
accessed through SOAP web services.
Abstract: Synthetic Aperture Radar (SAR) is an imaging radar formed by taking full advantage of the relative movement between the antenna and the target. Through the simultaneous processing of the radar reflections collected over the movement of the antenna via the Range Doppler Algorithm (RDA), the superior resolution of a theoretically wider antenna, termed a synthetic aperture, is obtained. SAR can therefore achieve high-resolution two-dimensional imagery of the ground surface. In addition, two filtering steps, in the range and azimuth directions, provide sufficiently accurate results. This paper develops a simulation in which realistic SAR images can be generated. The effect of velocity errors on the resulting image has also been investigated, and simulation results on image resolution under such errors are presented. Most of the time, algorithms need to be adjusted for particular datasets or particular applications.
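The matched-filter compression at the core of the RDA can be illustrated with a toy one-dimensional sketch (the chirp rate, pulse length, and target delay are invented; this is not the paper's simulator). A chirp echo buried at a known delay is compressed to a sharp peak at that delay:

```python
import cmath

N, DELAY = 64, 20
# Transmitted linear-FM chirp (quadratic phase).
chirp = [cmath.exp(1j * 0.05 * n * n) for n in range(N)]

# Received signal: the chirp echoed back with a 20-sample delay.
rx = [0j] * (N + DELAY)
for n in range(N):
    rx[n + DELAY] += chirp[n]

# Time-domain matched filter: correlate with the conjugate replica.
out = []
for lag in range(len(rx) - N + 1):
    out.append(abs(sum(rx[lag + n] * chirp[n].conjugate() for n in range(N))))

print(out.index(max(out)))  # peak at the target delay
```

The same compression is applied in the range direction and again in azimuth, which is what yields the synthetic-aperture resolution gain.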
Abstract: Text mining is the application of knowledge discovery techniques to unstructured text, also termed knowledge discovery in text (KDT) or text data mining. In neural networks that address classification problems, the training set, the testing set, and the learning rate are key elements: the collections of input/output patterns used to train the network and to assess its performance, and the rate at which weight adjustments are made. This paper describes a proposed back-propagation neural-net classifier that performs cross-validation on the original neural network, in order to optimize classification accuracy and reduce training time. The feasibility and benefits of the proposed approach are demonstrated on five datasets: contact-lenses, cpu, weather symbolic, weather, and labor-nega-data. It is shown that, compared with the existing neural network, training is more than 10 times faster when the dataset is larger than cpu or the network has many hidden units, while accuracy ('percent correct') was the same for all datasets except contact-lenses, the only one with missing attributes. For contact-lenses, the accuracy of the proposed neural network was on average around 0.3% lower than that of the original neural network. The algorithm is independent of specific datasets, so many of its ideas and solutions can be transferred to other classifier paradigms.
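The cross-validation loop can be sketched as follows (a majority-vote stand-in replaces the back-propagation network, and the labelled points are invented; only the k-fold split-and-score structure is the point here):

```python
def k_fold_accuracy(data, k, train_and_score):
    """Split data into k folds; train on k-1 folds, score on the held-out one."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(train, test_fold))
    return sum(scores) / k

def majority_vote(train, test_fold):
    # Stand-in "classifier": always predict the majority training label.
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test_fold if y == guess) / len(test_fold)

data = [(i, "a") for i in range(8)] + [(i, "b") for i in range(2)]
print(k_fold_accuracy(data, 5, majority_vote))
```

In the paper, `train_and_score` would train the back-propagation network on the k-1 folds and report its percent-correct on the held-out fold.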
Abstract: Multiple sequence alignment is a fundamental part in
many bioinformatics applications such as phylogenetic analysis.
Many alignment methods have been proposed. Each method gives a
different result for the same data set, and consequently generates a
different phylogenetic tree. Hence, the chosen alignment method
affects the resulting tree. However, in the literature, there is no
evaluation of multiple alignment methods based on the comparison of
their phylogenetic trees. This work evaluates the following eight
aligners: ClustalX, T-Coffee, SAGA, MUSCLE, MAFFT, DIALIGN,
ProbCons and Align-m, based on their phylogenetic trees (test trees)
produced on a given data set. The Neighbor-Joining method is used
to estimate trees. Three criteria, namely, the dNNI, the dRF and the
Id_Tree are established to test the ability of different alignment
methods to produce a test tree closer to the reference one (the true
tree). Results show that the method which produces the most
accurate alignment gives the nearest test tree to the reference tree.
MUSCLE outperforms all aligners with respect to the three criteria
and for all datasets, performing particularly well when sequence
identities are within 10-20%. It is followed by T-Coffee. At higher
sequence identities (above 30%), the tree scores of all methods
become similar.
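The dRF criterion can be illustrated with a simplified, rooted Robinson-Foulds-style count (real implementations compare unrooted bipartitions; the trees here are toy nested tuples, not trees from the study):

```python
def clades(tree):
    """Return (set of internal-node leaf-sets, this subtree's leaf-set)."""
    if not isinstance(tree, tuple):          # a leaf
        return set(), frozenset([tree])
    all_clades, leaves = set(), frozenset()
    for child in tree:
        c, l = clades(child)
        all_clades |= c
        leaves |= l
    all_clades.add(leaves)
    return all_clades, leaves

def rf_distance(t1, t2):
    # Count clades present in exactly one of the two trees.
    c1, _ = clades(t1)
    c2, _ = clades(t2)
    return len(c1 ^ c2)

ref = (("A", "B"), ("C", "D"))
alt = (("A", "C"), ("B", "D"))
same = (("A", "B"), ("C", "D"))
print(rf_distance(ref, same), rf_distance(ref, alt))
```

A distance of 0 means the test tree recovered the reference topology exactly; larger values mean more clades disagree.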
Abstract: This paper presents an application of level sets for the segmentation of abdominal and thoracic aortic aneurysms in CTA
datasets. An important challenge in reliably detecting aortic aneurysms is the
need to overcome problems associated with intensity
inhomogeneities. Level sets are part of an important class of methods
that utilize partial differential equations (PDEs) and have been extensively applied in image segmentation. A kernel function in the
level set formulation aids the suppression of noise in the extracted
regions of interest and then guides the motion of the evolving contour
for the detection of weak boundaries. The speed of curve evolution
has been significantly improved, with a resulting decrease in
segmentation time compared with previous implementations of level
sets, and the method is shown to be more effective than other approaches in
coping with intensity inhomogeneities. We have applied the
Courant-Friedrichs-Lewy (CFL) condition as the stability criterion for
our algorithm.
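The CFL criterion amounts to bounding the time step by the grid spacing divided by the maximum front speed; a minimal sketch with illustrative values (the safety factor is an assumption, not a value from the paper):

```python
def cfl_time_step(max_speed, dx, safety=0.5):
    """Largest stable explicit time step: dt = safety * dx / |F|max,
    so the evolving contour never crosses more than one grid cell per step."""
    return safety * dx / max_speed

dt = cfl_time_step(max_speed=2.0, dx=1.0)
print(dt)  # 0.25
```

Choosing dt this way keeps the explicit level-set update stable while allowing the largest step the speed field permits.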
Abstract: Developing a stable early warning system (EWS)
model that is capable of giving an accurate prediction is a challenging
task. This paper introduces the k-nearest neighbour (k-NN) method,
which has never before been applied to predicting currency crises,
with the aim of increasing prediction accuracy. The performance of the
proposed k-NN depends on the choice of distance; in our analysis, we
consider the Euclidean and the Manhattan distances. For comparison,
we employ three other methods,
which are logistic regression analysis (logit), back-propagation neural
network (NN) and sequential minimal optimization (SMO). The
analysis using datasets from 8 countries and 13 macro-economic
indicators for each country shows that the proposed k-NN method
with k = 4 and Manhattan distance performs better than the other
methods.
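The proposed k-NN vote with k = 4 and the Manhattan distance can be sketched as follows (the indicator vectors and labels are invented; 1 marks a crisis period, 0 a calm one):

```python
from collections import Counter

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=4):
    """Majority vote among the k training samples nearest to the query."""
    nearest = sorted(train, key=lambda s: manhattan(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-indicator vectors (the paper uses 13 macro-economic indicators).
train = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.15, 0.25), 0),
         ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.85, 0.95), 1), ((0.05, 0.1), 0)]
print(knn_predict(train, (0.88, 0.9)))  # 1: the crisis neighbours dominate
```

Swapping `manhattan` for a Euclidean distance is the only change needed to reproduce the paper's other distance variant.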
Abstract: In this paper, a subtractive-clustering-based fuzzy inference system approach is used for early detection of faults in function-oriented software systems. The approach has been tested on real-world defect datasets of the NASA software projects PC1 and CM1. Both the code-based model and the joined model (a combination of the requirement-based and code-based metrics) of the datasets are used for training and testing the proposed approach. The performance of the models is recorded in terms of accuracy, MAE, and RMSE values, and is better for the joined model. As evidenced by the results obtained, it can be concluded that clustering and fuzzy logic together provide a simple yet powerful means to model early detection of faults in function-oriented software systems.
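The subtractive-clustering step underlying the approach can be sketched as follows (the radius and the data points are illustrative, and only the selection of the first, highest-potential cluster centre is shown):

```python
import math

def first_center(points, ra=1.0):
    """Each point's potential is the summed Gaussian influence of all points
    within radius ra; the highest-potential point becomes the first centre."""
    alpha = 4.0 / (ra * ra)
    def potential(p):
        return sum(math.exp(-alpha * sum((a - b) ** 2 for a, b in zip(p, q)))
                   for q in points)
    return max(points, key=potential)

# Three points in a dense cluster plus one outlier (invented metric values).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(first_center(pts))  # a point inside the dense cluster, not the outlier
```

In the full algorithm, the chosen centre's influence is subtracted from every point's potential and the process repeats, and the resulting centres seed the fuzzy inference rules.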
Abstract: Fake-finger submission attacks are a major problem in fingerprint recognition systems. In this paper, we introduce a liveness detection method based on multiple static features derived from a single fingerprint image. The static features comprise individual pore spacing, residual noise, and several first-order statistics. Specifically, a correlation filter is adopted to measure individual pore spacing. The multiple static features reflect the physiological and statistical characteristics of live and fake fingerprints. The classification is made by calculating a liveness score from each feature and fusing the scores through a classifier. On our dataset, we compare nine classifiers, and the best classification rate of 85% is attained using a Reduced Multivariate Polynomial classifier. Our approach is fast and convenient for liveness checks in field applications.
Abstract: Automatic extraction of event information from
social text streams (emails, social network sites, blogs, etc.) is a
vital requirement for many applications, such as event planning and
management systems and security applications. The key information
components needed from event-related text are the event title,
location, participants, date, and time. Emails differ from other
social text streams in layout, format, and conversation style, and
are the most commonly used communication channel for broadcasting
and planning events.
Therefore, we have chosen emails as our dataset. In our work, we
have employed two statistical NLP methods, namely Finite State
Machines (FSM) and Hidden Markov Models (HMM), for the
extraction of event-related contextual information. An application
has been developed providing a comparison between the two methods
over the event extraction task. It comprises two modules, one for
each method, and works for both bulk and direct user input.
The results are evaluated using Precision, Recall and F-Score.
Experiments show that both methods achieve high performance and
accuracy; however, HMM performed better for title extraction, while
FSM proved better for venue, date, and time.
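A finite-state extraction of a single field can be sketched as follows (a toy two-state machine for a "Month Day" date pattern; this is not the paper's FSM, and the email text is invented):

```python
MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

def extract_date(tokens):
    """Two-state machine: START -> MONTH on a month name,
    MONTH -> accept on a following day number."""
    state, month = "START", None
    for tok in tokens:
        if state == "START" and tok in MONTHS:
            state, month = "MONTH", tok          # saw a month name
        elif state == "MONTH" and tok.rstrip(",").isdigit():
            return f"{month} {tok.rstrip(',')}"  # month followed by a day
        else:
            state, month = "START", None         # reset on anything else
    return None

email = "Team meetup is planned for Mar 14, venue TBD".split()
print(extract_date(email))  # Mar 14
```

The paper's FSMs work the same way at a larger scale, with states and transitions for each event field (title, venue, date, time).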
Abstract: Protein 3D structure prediction has always been an
important research area in bioinformatics. In particular, the
prediction of secondary structure has been a well-studied research
topic. Despite the recent breakthrough of combining multiple
sequence alignment information and artificial intelligence algorithms
to predict protein secondary structure, the Q3 accuracy of various
computational prediction algorithms has rarely exceeded 75%. In a
previous paper [1], this research team presented a rule-based method
called RT-RICO (Relaxed Threshold Rule Induction from Coverings)
to predict protein secondary structure. The average Q3 accuracy on
the sample datasets using RT-RICO was 80.3%, an improvement
over comparable computational methods. Although this demonstrated
that RT-RICO might be a promising approach for predicting
secondary structure, the algorithm's computational complexity and
program running time limited its use. Herein a parallelized
implementation of a slightly modified RT-RICO approach is
presented. This new version of the algorithm facilitated the testing of
a much larger dataset of 396 protein domains [2]. Parallelized
RT-RICO achieved a Q3 score of 74.6%, which is higher than the
consensus prediction accuracy of 72.9% that was achieved for the
same test dataset by a combination of four secondary structure
prediction methods [2].
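The Q3 score quoted above is simply the fraction of residues whose predicted three-state label (helix H, strand E, coil C) matches the observed one; a minimal sketch with invented sequences:

```python
def q3(predicted, observed):
    """Fraction of residues with the correct H/E/C secondary-structure state."""
    assert len(predicted) == len(observed)
    hits = sum(1 for p, o in zip(predicted, observed) if p == o)
    return hits / len(observed)

print(q3("HHHEECCC", "HHHEECCH"))  # 7 of 8 residues correct -> 0.875
```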
Abstract: The use of a Bayesian Hierarchical Model (BHM) to interpret breath measurements obtained during a 13C Octanoic Breath Test (13COBT) is demonstrated. The statistical analysis was implemented using WinBUGS, a commercially available computer package for Bayesian inference. A hierarchical setting was adopted in which poorly defined parameters associated with delayed Gastric Emptying (GE) were able to "borrow" strength from global distributions. This proved to be a sufficient tool to correct the model failures and data inconsistencies apparent in conventional analyses employing a non-linear least squares (NLS) technique. Direct comparison of two parameters describing gastric emptying (tlag, the lag phase, and t1/2, the half-emptying time) revealed a strong correlation between the two methods. Despite our large dataset (n = 164), Bayesian modeling was fast and provided a successful fit for all subjects. In contrast, NLS failed to return acceptable estimates in cases where GE was delayed.
Abstract: The purpose of this study is to derive parameter
estimates for the Lyman-Kutcher-Burman (LKB) normal tissue
complication probability (NTCP) model from analyses of scintigraphy
assessments and quality of life (QoL) questionnaires for the parotid
gland (xerostomia). In total, 31 patients with
head-and-neck (HN) cancer were enrolled. Salivary excretion factor
(SEF) and EORTC QLQ-H&N35 questionnaire datasets are used for
the NTCP modeling to describe the incidence of grade 4 xerostomia.
Assuming n = 1, the fitted NTCP parameters are TD50 = 43.6 Gy,
m = 0.18 for the SEF analysis and TD50 = 44.1 Gy, m = 0.11 for the QoL
measurements. The SEF and QoL datasets validate the Quantitative
Analyses of Normal Tissue Effects in the Clinic (QUANTEC) guidelines
well, yielding NPVs of 100% for both datasets, and suggest that the
QUANTEC 25/20 Gy gland-sparing guidelines are suitable for clinical
use in the HN cohort to effectively avoid xerostomia.
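With n = 1 the LKB model reduces the dose distribution to the mean dose, and NTCP becomes a normal CDF of the dose; a sketch using the SEF parameters quoted above (TD50 = 43.6 Gy, m = 0.18):

```python
import math

def lkb_ntcp(mean_dose, td50=43.6, m=0.18):
    """LKB NTCP with n = 1 (generalised EUD equals the mean dose):
    NTCP = Phi(t), t = (D - TD50) / (m * TD50)."""
    t = (mean_dose - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))  # standard normal CDF

print(round(lkb_ntcp(43.6), 2))  # 0.5 at the 50% tolerance dose
```

By construction the complication probability is 50% at TD50, and the slope parameter m controls how sharply it rises with dose.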
Abstract: This paper gives a novel method for improving
classification performance for cancer classification with very little
microarray gene expression data. The method employs classification
with individual gene ranking and gene subset ranking, using the same
classifier for both selection and classification. The method is
applied to three publicly available cancer gene expression datasets:
Lymphoma, Liver, and Leukaemia. Three different classifiers, namely
support vector machines one-against-all (SVM-OAA), K-nearest
neighbour (KNN), and linear discriminant analysis (LDA), were tested.
The results indicate improved performance of the SVM-OAA classifier,
with satisfactory results on all three datasets compared with the
other two classifiers.
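The individual gene ranking step can be sketched with a simple signal-to-noise score (the scoring function and the expression values are invented for illustration; the paper ranks genes with the classifier itself rather than with a filter score):

```python
import statistics as st

def snr(expr_a, expr_b):
    """Signal-to-noise ratio between two classes: mean gap over pooled spread."""
    return abs(st.mean(expr_a) - st.mean(expr_b)) / (st.stdev(expr_a) + st.stdev(expr_b))

# Hypothetical per-gene expression values for two patient classes.
genes = {
    "g1": ([1.0, 1.1, 0.9], [3.0, 3.2, 2.9]),   # well separated classes
    "g2": ([1.0, 2.0, 3.0], [1.1, 2.1, 2.9]),   # overlapping classes
}
ranked = sorted(genes, key=lambda g: snr(*genes[g]), reverse=True)
print(ranked)  # ['g1', 'g2']
```

Top-ranked genes are then combined into candidate subsets, and the same classifier scores both the individual genes and the subsets.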
Abstract: A Brain ArterioVenous Malformation (BAVM) is an abnormal tangle of brain blood vessels in which arteries shunt directly into veins with no intervening capillary bed, causing high pressure and a risk of hemorrhage. The success of treatment by embolization in interventional neuroradiology is highly dependent on the accuracy of vessel visualization. In this paper, the performance of clustering techniques for vessel segmentation from 3-D rotational angiography (3DRA) images is investigated and a new segmentation technique is proposed. The method consists of a preprocessing step of image enhancement, after which K-Means (KM), Fuzzy C-Means (FCM), and Expectation Maximization (EM) clustering are used to separate vessel pixels from the background and, where possible, artery pixels from vein pixels. A post-processing step that removes false-alarm components is applied before constructing a three-dimensional volume of the vessels. The proposed method was tested on six datasets along with a medical assessment by an expert. The results obtained showed encouraging segmentations.
Abstract: Prediction of bacterial virulent protein sequences can
assist in the identification and characterization of novel
virulence-associated factors and in the discovery of drug/vaccine
targets against proteins indispensable to pathogenicity. Gene
Ontology (GO) annotation, which describes the functions of genes and
gene products with a controlled vocabulary of terms, has been shown
to be effective for a variety of tasks such as gene expression
studies, GO annotation prediction, and protein subcellular
localization. In this study, we
propose a sequence-based method Virulent-GO by mining informative
GO terms as features for predicting bacterial virulent proteins.
Each protein in the datasets used by the existing method
VirulentPred is annotated using BLAST to find homologues with known
accession numbers, from which GO terms are retrieved. After
investigating various popular classifiers using the same five-fold
cross-validation scheme, Virulent-GO using the single kind of GO
term features with an accuracy of 82.5% is slightly better than
VirulentPred with 81.8% using five kinds of sequence-based features.
For the evaluation of independent test, Virulent-GO also yields better
results (82.0%) than VirulentPred (80.7%). When single kinds of
features are evaluated with SVM, the GO term feature performs much
better than each of the five kinds of sequence-based features.