Abstract: In this study we focus on improving the performance
of a cue-based Motor Imagery Brain Computer Interface (BCI).
For this purpose, a data fusion approach is applied to the outputs
of different classifiers to make the best decision. In the first step,
the Distinction Sensitive Learning Vector Quantization method is
used for feature selection to determine the most informative
frequencies in the recorded signals, and its performance is
evaluated against a frequency search method. The informative
features are then extracted by the wavelet packet transform. In the
next step, five different classification methods are applied. The
methodologies are tested on BCI Competition II dataset III; the
best obtained accuracy is 85% and the best kappa value is 0.8. In
the final step, the ordered weighted averaging (OWA) method is
used to properly aggregate the classifier outputs. Using OWA
enhanced the system accuracy to 95% and the kappa value to 0.9,
and applying OWA takes only 50 milliseconds of computation.
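The OWA aggregation step described above can be sketched as follows; this is a minimal illustration with made-up scores and weights, not the weight vector used in the study:

```python
import numpy as np

def owa(scores, weights):
    """Ordered weighted averaging: sort the scores in descending
    order, then take the weighted sum with fixed positional weights."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    w = np.asarray(weights, dtype=float)
    assert len(s) == len(w) and np.isclose(w.sum(), 1.0)
    return float(np.dot(s, w))

# Aggregate five classifiers' confidence scores for one class
# (both the scores and the weights are illustrative).
scores = [0.9, 0.6, 0.8, 0.7, 0.5]
weights = [0.3, 0.25, 0.2, 0.15, 0.1]
print(owa(scores, weights))
```

Because the weights attach to sorted positions rather than to particular classifiers, OWA can interpolate between the max, the mean, and the min of the classifier outputs depending on how the weight mass is distributed.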
Abstract: In this study, a high accuracy protein-protein interaction
prediction method is developed. The importance of the proposed
method is that it only uses sequence information of proteins while
predicting interaction. The method extracts phylogenetic profiles of
proteins by using their sequence information. Combining the phylogenetic
profiles of two proteins by checking existence of homologs
in different species and fitting this combined profile into a statistical
model, it is possible to make predictions about the interaction status
of two proteins.
For this purpose, we apply a collection of pattern recognition
techniques on the dataset of combined phylogenetic profiles of protein
pairs. Support Vector Machines, Feature Extraction using ReliefF,
Naive Bayes Classification, K-Nearest Neighbor Classification,
Decision Trees, and Random Forest Classification are the methods
we applied for finding the classification method that best predicts
the interaction status of protein pairs. Random Forest Classification
outperformed all other methods with a prediction accuracy of 76.93%.
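The core idea of combining two phylogenetic profiles by checking homolog existence across species can be sketched as below; the binary profiles are toy data and the four-state encoding is one plausible scheme, not necessarily the paper's exact one:

```python
def combine_profiles(p1, p2):
    """Combine two binary phylogenetic profiles (1 = a homolog of the
    protein exists in that species) into a joint feature vector: each
    species maps to one of four states (00, 01, 10, 11), encoded 0-3."""
    assert len(p1) == len(p2)
    return [2 * a + b for a, b in zip(p1, p2)]

# Profiles over five species (toy data, not from the paper).
protein_a = [1, 0, 1, 1, 0]
protein_b = [1, 1, 0, 1, 0]
features = combine_profiles(protein_a, protein_b)
print(features)  # one joint state per species
```

A vector of such joint states per protein pair is then what a classifier such as Random Forest would consume as input.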
Abstract: This paper investigates the problem of sampling from transactional data streams. We introduce CFISDS, a content-based sampling algorithm that works on a landmark window model of data streams and preserves a more informative sample in the sample space. The algorithm is based on closed frequent itemset mining: it first initiates a concept lattice using the initial data, then updates the lattice structure using an incremental mechanism that inserts, updates, and deletes nodes in the concept lattice in a batch manner. The presented algorithm extracts the final samples on the user's demand. Experimental results show the accuracy of CFISDS on synthetic and real datasets, although CFISDS is not faster than existing sampling algorithms such as Z and DSS.
Abstract: Linear Discriminant Analysis (LDA) is a linear
solution for the classification of two classes. In this paper, we
propose a variant of LDA for the multi-class problem which
redefines the between-class and within-class scatter matrices by
incorporating a weight function into each of them. The aim is to
separate the classes as much as possible; when one class is already
well separated from the others, that class should have little
influence on the classification. We alleviate the influence of
well-separated classes by adding a weight to the between-class and
within-class scatter matrices. To obtain a simple and effective
weight function, ordinary LDA between every two classes is used to
find the Fisher discrimination value, which is passed as an input to
two weight functions that redefine the between-class and
within-class scatter matrices. Experimental results show that our
new LDA method improves the classification rate on the glass, iris,
and wine datasets in comparison to different versions of LDA.
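One way the weighted between-class scatter described above can be assembled is sketched below; the pooled within-class covariance, the pseudo-inverse, and the generic `weight_fn` hook are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def weighted_between_scatter(X, y, weight_fn):
    """Weighted between-class scatter: every pair of class means
    contributes an outer product, scaled by a weight derived from the
    pairwise Fisher discriminant value. A monotone decreasing
    weight_fn down-weights class pairs that are already well
    separated."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    Sw = sum(np.cov(X[y == c], rowvar=False) for c in classes)
    Sw_inv = np.linalg.pinv(Sw)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            diff = (means[ci] - means[cj]).reshape(-1, 1)
            fisher = float(diff.T @ Sw_inv @ diff)  # pairwise separability
            Sb += weight_fn(fisher) * (diff @ diff.T)
    return Sb
```

With `weight_fn = lambda f: 1.0` this reduces to the ordinary pairwise between-class scatter; a decreasing function of the Fisher value implements the down-weighting of well-separated pairs.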
Abstract: The lack of an inherent "natural" dissimilarity measure
between objects in categorical datasets presents special difficulties
in clustering analysis. However, each categorical attribute from a
given dataset provides natural probability and information in the
sense of Shannon. In this paper, we propose a novel method which
heuristically converts categorical attributes to numerical values by
exploiting such associated information. We conduct an experimental
study with real-life categorical datasets. The experiments
demonstrate the effectiveness of our approach.
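A minimal sketch of one such heuristic conversion, assuming each category is mapped to its Shannon information content (the paper's exact encoding may differ):

```python
import math
from collections import Counter

def to_information_values(column):
    """Heuristically map categorical values to numbers using their
    Shannon information content: rarer categories carry more
    information (-log2 of their empirical probability)."""
    counts = Counter(column)
    n = len(column)
    return [-math.log2(counts[v] / n) for v in column]

# Toy categorical attribute (illustrative data).
col = ["red", "red", "blue", "red", "green", "blue", "red", "red"]
print(to_information_values(col))
```

Once every attribute is numeric, standard distance-based clustering algorithms apply directly.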
Abstract: A new automatic system for the recognition and reconstruction of rescaled and/or rotated partially occluded objects is presented. The objects to be recognized are described by 2D views, and each view is occluded by several half-planes. The whole object views and their visible parts (linear cuts) are then stored in a database. To establish whether a region R of an input image represents a possibly occluded object, the system generates a set of linear cuts of R and compares them with the elements in the database. Each linear cut of R is associated with the most similar database linear cut. R is recognized as an instance of the object O if the majority of the linear cuts of R are associated with a linear cut of views of O. In the case of recognition, the system reconstructs the occluded part of R and determines the scale factor and the orientation in the image plane of the recognized object view. The system has been tested on two different datasets of objects, showing good performance both in terms of recognition and reconstruction accuracy.
Abstract: The fault-proneness of a software module is the
probability that the module contains faults. To predict the
fault-proneness of modules, different techniques have been proposed,
including statistical methods, machine learning techniques, neural
network techniques, and clustering techniques. The aim of the
proposed study is to explore whether metrics available in the early
lifecycle (i.e., requirement metrics), metrics available in the late
lifecycle (i.e., code metrics), and a combination of the two can be
used to identify fault-prone modules using a Genetic Algorithm
technique. This approach has been tested with real defect datasets of
NASA software projects written in the C programming language. The
results show that the fusion of requirement and code metrics is the
best prediction model for detecting faults, as compared with the
commonly used code-based model.
Abstract: An evolutionary method whose selection and recombination
operations are based on generalization error-bounds of
support vector machine (SVM) can select a subset of potentially
informative genes for an SVM classifier very efficiently [7]. In this
paper, we will use the derivative of error-bound (first-order criteria)
to select and recombine gene features in the evolutionary process,
and compare the performance of the derivative of error-bound with
the error-bound itself (zero-order) in the evolutionary process. We
also investigate several error-bounds and their derivatives to compare
the performance, and find the best criteria for gene selection
and classification. We use 7 cancer-related human gene expression
datasets to evaluate the performance of the zero-order and first-order
criteria of error-bounds. Though both criteria follow the same
strategy in theory, the experimental results identify the best
criterion for microarray gene expression data.
Abstract: The study of proteomics has reached unexpected levels of
interest as a direct consequence of its discovered influence over
complex biological phenomena, such as problematic diseases like
cancer. This paper presents the authors' latest achievements
regarding the analysis of networks of proteins (interactome
networks) by computing the betweenness centrality measure more
efficiently. The paper introduces the concept of betweenness
centrality, and then describes how betweenness computation can help
interactome network analysis. Current sequential implementations of
the betweenness computation do not perform satisfactorily in terms
of execution time. The paper's main contribution is a speedup
technique for the betweenness computation, based on modified
shortest path algorithms for sparse graphs. Three optimized generic
algorithms for betweenness computation are described and
implemented, and their performance is tested against real biological
data from the IntAct dataset.
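Betweenness centrality on sparse unweighted graphs is commonly computed with Brandes' algorithm, which shortest-path-based speedups of this kind build upon; a minimal sketch (one BFS per source plus a dependency-accumulation pass):

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for betweenness centrality on an unweighted
    graph given as {node: [neighbors]}: one BFS per source counts
    shortest paths, then a reverse pass accumulates pair
    dependencies, which avoids enumerating all node pairs."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                      # BFS from s
            v = q.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
        # (for undirected graphs, halve the final scores if desired)
    return bc

# Path graph a-b-c: b lies on the only shortest path between a and c.
print(betweenness({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))
```

Each source costs O(V + E) on unweighted graphs, which is why exploiting sparsity in the shortest-path step dominates the overall running time.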
Abstract: The goal of Gene Expression Analysis is to understand the processes that underlie the regulatory networks and pathways controlling inter-cellular and intra-cellular activities. In recent times, microarray datasets have been extensively used for this purpose. The scope of such analysis has broadened towards the reconstruction of gene networks and other holistic approaches of Systems Biology. Evolutionary methods are proving to be successful in such problems, and a number of such methods have been proposed. However, all these methods are based on processing genotypic information. To this end, there is a need to develop evolutionary methods that address phenotypic interactions together with genotypic interactions. We present a novel evolutionary approach, called the Phenomic algorithm, wherein the focus is on phenotypic interactions. We use the expression profiles of genes to model the interactions between them at the phenotypic level. We apply this algorithm to the yeast sporulation dataset and show that the algorithm can identify gene networks with relative ease.
Abstract: The segmentation of mouth and lips is a fundamental
problem in facial image analysis. In this paper we propose a method
for lip segmentation based on the rg-color histogram. Statistical
analysis shows that the rg-color space is optimal for a purely
color-based segmentation. Initially, a rough adaptive threshold
selects a histogram region that ensures that all pixels in that
region are skin pixels. Based on these pixels, we build a Gaussian
model that represents the skin pixel distribution and is utilized to
obtain a refined, optimal threshold. We do not incorporate shape or
edge information. In experiments we show the performance of our lip
pixel segmentation method compared to the ground truth of our
dataset and a conventional watershed algorithm.
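The conversion from an RGB pixel to the rg-color space in which the histogram is built can be sketched as:

```python
def to_rg(pixel):
    """Map an (R, G, B) pixel to normalized rg-chromaticity space,
    which discards intensity and keeps only pure color information;
    b is redundant since r + g + b = 1."""
    r, g, b = pixel
    s = r + g + b
    if s == 0:            # a black pixel has no defined chromaticity
        return (1 / 3, 1 / 3)
    return (r / s, g / s)

print(to_rg((200, 100, 50)))  # a skin-like tone: high r, moderate g
```

Because intensity is factored out, the same skin or lip color maps to nearby (r, g) bins under varying illumination brightness, which is what makes a pure color histogram workable here.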
Abstract: Yeast cells live in a constantly changing environment that requires the continuous adaptation of their genomic program in order to sustain their homeostasis, survive and proliferate. Due to the advancement of high throughput technologies, there is currently a large amount of data such as gene expression, gene deletion and protein-protein interactions for S. cerevisiae under various environmental conditions. Mining these datasets requires efficient computational methods capable of integrating different types of data, identifying inter-relations between different components and inferring functional groups or 'modules' that shape intracellular processes. This study uses computational methods to delineate some of the mechanisms used by yeast cells to respond to environmental changes. The GRAM algorithm is first used to integrate gene expression data and ChIP-chip data in order to find modules of co-expressed and co-regulated genes as well as the transcription factors (TFs) that regulate these modules. Since transcription factors are themselves transcriptionally regulated, a three-layer regulatory cascade consisting of the TF-regulators, the TFs and the regulated modules is subsequently considered. This three-layer cascade is then modeled quantitatively using artificial neural networks (ANNs), where the input layer corresponds to the expression of the up-stream transcription factors (TF-regulators) and the output layer corresponds to the expression of genes within each module. This work shows that (a) the expression of at least 33 genes over time and for different stress conditions is well predicted by the expression of the top-layer transcription factors, including cases in which the effect of up-stream regulators is shifted in time, and (b) at least 6 novel regulatory interactions exist that were not previously associated with stress-induced changes in gene expression.
These findings suggest that the combination of gene expression and protein-DNA interaction data with artificial neural networks can successfully model biological pathways and capture quantitative dependencies between distant regulators and downstream genes.
Abstract: Data clustering is an important data exploration technique
with many applications in data mining. We present an enhanced
version of the well-known single link clustering algorithm, which we
refer to as DCBOR. The proposed algorithm alleviates the chain
effect by removing the outliers from the given dataset, so it
provides outlier detection and data clustering simultaneously. The
algorithm does not need to update the distance matrix, since it
merges the k-nearest objects in one step and a cluster continues to
grow as long as possible under a specified condition. The algorithm
therefore consists of two phases: in the first phase, it removes the
outliers from the input dataset; in the second phase, it performs
the clustering process. The algorithm discovers clusters of
different shapes, sizes, and densities, and requires only one input
parameter, a threshold for outlier points whose value ranges from 0
to 1. The algorithm supports the user in determining an appropriate
value for this parameter. We have tested the algorithm on different
datasets containing outliers and clusters connected by chains of
dense points, and the algorithm discovers the correct clusters. The
results of our experiments demonstrate the effectiveness and the
efficiency of DCBOR.
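The outlier-removal phase can be sketched as follows; the k-nearest-neighbor distance score and its normalization to [0, 1] are illustrative assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def remove_outliers(X, k=3, t=0.5):
    """Sketch of a DCBOR-style first phase: score each point by its
    mean distance to its k nearest neighbors, normalize the scores to
    [0, 1], and drop points whose score exceeds the single threshold
    parameter t."""
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distances, with self-distance masked out
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    score = knn_mean / knn_mean.max()   # normalize to [0, 1]
    keep = score <= t
    return X[keep], X[~keep]

# Dense unit square plus one far-away point (toy data).
pts = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
inliers, outliers = remove_outliers(pts, k=2, t=0.5)
print(len(inliers), len(outliers))
```

Removing such points before single-link merging is exactly what breaks the chains of stray points that would otherwise bridge distinct clusters.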
Abstract: The National Biodiversity Database System (NBIDS) has
been developed for collecting Thai biodiversity data. The goal of
this project is to provide advanced tools for querying, analyzing,
modeling, and visualizing patterns of species distribution for
researchers and scientists. NBIDS records two types of datasets:
biodiversity data and environmental data. Biodiversity data are
species presence data and species status. The attributes of
biodiversity data can be further classified into two groups:
universal and project-specific attributes. Universal attributes are
common to all of the records, e.g., X/Y coordinates, year, and
collector name. Project-specific attributes are unique to one or a
few projects, e.g., flowering stage. Environmental data include
atmospheric data, hydrology data, soil data, and land cover data
collected using GLOBE protocols. We have developed web-based tools
for data entry. Google Earth KML and ArcGIS were used as tools for
map visualization. webMathematica was used for simple data
visualization and also for advanced data analysis and visualization,
e.g., spatial interpolation and statistical analysis. NBIDS will be
used by park rangers at Khao Nan National Park, as well as by
researchers.
Abstract: Agricultural products are in increasing demand in today's
market. Automating their production can be very helpful for
increasing productivity. The purpose of this work is to measure and
determine the ripeness and quality of watermelons. The texture of
the watermelon skin is captured using a digital camera, and the
images are filtered using image processing techniques. The gathered
information is used to train an ANN to determine watermelon
ripeness. Initial results show that the best model produced an
accuracy of 86.51%, measured at 32 hidden units with a balanced
percentage rate of the training dataset.
Abstract: Sequential pattern mining is a challenging task in the data mining area with many applications. One of those applications is mining patterns from weblogs. Weblogs are highly dynamic, and some entries may become obsolete over time. In addition, users may frequently change the threshold value during the data mining process until acquiring the required output or mining interesting rules. Some of the recently proposed algorithms for mining weblogs build the tree with two scans and always consume a large amount of time and space. In this paper, we build a Revised PLWAP with Non-frequent Items (RePLNI-tree) with a single scan for all items. While mining sequential patterns, the links related to the non-frequent items are not considered; hence, it is not required to delete or maintain the information of nodes while revising the tree for mining updated transactions. The algorithm supports both incremental and interactive mining. It is not required to re-compute the patterns each time the weblog is updated or the minimum support is changed. The performance of the proposed tree is better even when the size of the incremental database is more than 50% of the existing one. For evaluation purposes, we have used a benchmark weblog dataset and found that the performance of the proposed tree is encouraging compared to some recently proposed approaches.
Abstract: Graphs have become increasingly important in modeling
complicated structures and schemaless data such as proteins, chemical
compounds, and XML documents. Given a graph query, it is desirable
to retrieve graphs quickly from a large database via graph-based
indices. Different from the existing methods, our approach, called
VFM (Vertex to Frequent Feature Mapping), makes use of vertices
and decision features as the basic indexing feature. VFM constructs
two mappings between vertices and frequent features to answer graph
queries. The VFM approach not only provides an elegant solution to
the graph indexing problem, but also demonstrates how database
indexing and query processing can benefit from data mining,
especially frequent pattern mining. The results show that the
proposed method not only avoids enumerating the subgraphs of the
query graph, but also effectively reduces the number of subgraph
isomorphism tests between the query graph and the graphs in the
candidate answer set in the verification stage.
Abstract: Recent scientific investigations indicate that
multimodal biometrics overcome the technical limitations of
unimodal biometrics, making them ideally suited for everyday life
applications that require a reliable authentication system. However,
for a successful adoption of multimodal biometrics, such systems
would require large heterogeneous datasets with complex multimodal
fusion and privacy schemes spanning various distributed
environments. From experimental investigations of current
multimodal systems, this paper reports the various issues related to
speed, error-recovery and privacy that impede the diffusion of such
systems in real-life. This calls for a robust mechanism that caters to
the desired real-time performance, robust fusion schemes,
interoperability and adaptable privacy policies.
The main objective of this paper is to present a framework that
addresses the abovementioned issues by leveraging on the
heterogeneous resource sharing capacities of Grid services and the
efficient machine learning capabilities of artificial neural networks
(ANN). Hence, this paper proposes a Grid-based neural network
framework for adopting multimodal biometrics with the view of
overcoming the barriers of performance, privacy and risk issues that
are associated with shared heterogeneous multimodal data centres.
The framework combines the concept of Grid services for reliable
brokering and privacy policy management of shared biometric
resources along with a momentum back propagation ANN (MBPANN)
model of machine learning for efficient multimodal fusion and
authentication schemes. Real-life applications would be able to adopt
the proposed framework to cater to the varying business requirements
and user privacies for a successful diffusion of multimodal
biometrics in various day-to-day transactions.
Abstract: Biclustering aims at identifying several biclusters that
reveal potential local patterns in a microarray matrix. A bicluster
is a sub-matrix of the microarray consisting of a subset of genes
co-regulated in a subset of conditions. In this study, we extend the
idea of subspace clustering to present a K-biclusters clustering
(KBC) algorithm for the microarray biclustering problem. Besides
minimizing the dissimilarities between genes and bicluster centers
within all biclusters, the objective function of the KBC algorithm
additionally minimizes the residues within all biclusters based on
the mean square residue model. In addition, the objective function
maximizes the entropy of conditions to encourage more conditions to
contribute to the identification of biclusters. The KBC algorithm
adopts a K-means type clustering process to efficiently optimize the
partition into K biclusters. A set of experiments on a practical
microarray dataset demonstrates the performance of the proposed KBC
algorithm.
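The mean square residue that the objective minimizes can be sketched as follows, assuming the standard Cheng-Church residue definition, which "mean square residue model" commonly refers to:

```python
import numpy as np

def mean_squared_residue(B):
    """Mean squared residue of a bicluster sub-matrix B: the residue
    of entry (i, j) is b_ij - row_mean_i - col_mean_j + overall_mean,
    so a perfectly additive (coherent) bicluster scores zero."""
    B = np.asarray(B, dtype=float)
    r = (B - B.mean(axis=1, keepdims=True)
           - B.mean(axis=0, keepdims=True) + B.mean())
    return float((r ** 2).mean())

# Rows that are constant shifts of each other form an additive
# pattern, so the residue is zero.
perfect = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(mean_squared_residue(perfect))
```

A low residue thus certifies that the selected genes move coherently across the selected conditions, independently of their absolute expression levels.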
Abstract: We propose an enhanced collaborative filtering method
using Hofstede's cultural dimensions, calculated for 111 countries.
We employ 4 of these dimensions, which are correlated to customers'
buying behavior, in order to detect users' preferences for items. In
addition, several advantages of this method are demonstrated for
data sparseness and cold-start users, which are important challenges
in collaborative filtering. We present experiments using a real
dataset, the Book Crossing Dataset. Experimental results show that
the proposed algorithm provides significant advantages in terms of
improving recommendation quality.
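One hedged sketch of how cultural-dimension distance could enter a user similarity measure (the dimension scores below are toy values and the formula is illustrative, not the paper's exact one):

```python
import math

def cultural_similarity(dims_a, dims_b):
    """Similarity of two users' country profiles on Hofstede's
    cultural dimensions (0-100 scales), as 1 minus the normalized
    Euclidean distance between the profiles."""
    d = math.dist(dims_a, dims_b)
    d_max = math.sqrt(len(dims_a)) * 100   # largest possible distance
    return 1 - d / d_max

# Four illustrative dimension scores per country (toy values).
germany = [67, 66, 65, 83]
japan   = [46, 95, 54, 92]
print(cultural_similarity(germany, japan))
```

Such a score can supplement rating-based similarity for cold-start users, whose rating history is too sparse for conventional neighborhood formation.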