Abstract: The prediction of transmembrane helical segments
(TMHs) in membrane proteins is an important field in
bioinformatics research. In this paper, a new method based on the discrete
wavelet transform (DWT) is developed to predict the number
and location of TMHs in membrane proteins. The protein with PDB code 1KQG
is used as an example to describe how the method predicts the number and
location of TMHs. To
assess the method, 80 proteins with known 3D structure
are chosen at random from the Mptopo database as test objects
(325 TMHs in total), of which 308 are predicted accurately; the
average prediction accuracy is 96.3%. In addition, the 80
membrane proteins are divided into 13 groups according to their
function and type. The prediction results for the TMHs
of all 13 groups are satisfactory.
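The general idea of wavelet-based smoothing of a hydropathy profile can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the wavelet, the hydropathy scale, the decomposition level, or the threshold, so the Haar wavelet, the Kyte-Doolittle scale, `levels=3`, and `threshold=1.6` are all choices of this sketch, and the function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy scale (an assumption; the abstract names no scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def haar_step(signal):
    """One level of the Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        signal = signal + [signal[-1]]  # pad odd lengths by repeating the last value
    s = math.sqrt(2.0)
    approx = [(a + b) / s for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / s for a, b in zip(signal[0::2], signal[1::2])]
    return approx, detail


def smooth_profile(sequence, levels=3):
    """Hydropathy profile with the detail coefficients discarded at each level.

    Without padding, each output value is the mean hydropathy of a block
    of 2**levels consecutive residues.
    """
    approx = [KD[res] for res in sequence]
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    scale = math.sqrt(2.0) ** levels  # undo the Haar normalisation
    return [a / scale for a in approx]


def candidate_blocks(sequence, levels=3, threshold=1.6):
    """(start, end) residue ranges whose smoothed hydropathy exceeds the threshold."""
    block = 2 ** levels
    profile = smooth_profile(sequence, levels)
    return [(i * block, min((i + 1) * block, len(sequence)))
            for i, value in enumerate(profile) if value > threshold]
```

For example, `candidate_blocks("LLLLLLLLDDDDDDDD")` flags only the leucine-rich first block. A real predictor would also need to merge adjacent blocks and refine helix boundaries, which this sketch does not attempt.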
Abstract: Tumor classification is a key area of research in the
field of bioinformatics. Microarray technology is commonly used in
the study of disease diagnosis using gene expression levels. The
main drawback of gene expression data is that it contains thousands
of genes but very few samples. Feature selection methods are used
to select the informative genes from the microarray. These methods
considerably improve the classification accuracy. In the proposed
method, Genetic Algorithm (GA) is used for effective feature
selection. Informative genes are identified based on the T-Statistics,
Signal-to-Noise Ratio (SNR) and F-Test values. The initial candidate
solutions of GA are obtained from top-m informative genes. The
classification accuracy of k-Nearest Neighbor (kNN) method is used
as the fitness function for GA. In this work, kNN and Support Vector
Machine (SVM) are used as the classifiers. The experimental results
show that the proposed work is suitable for effective feature
selection. With the selected genes, the GA-kNN method
achieves 100% accuracy on 4 datasets and the GA-SVM method
does so on 5 out of 10 datasets. GA combined with kNN or SVM
is thus shown to be an accurate approach for microarray-based
tumor classification.
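The filter step that seeds the GA can be illustrated with a signal-to-noise ranking. This is a minimal sketch that assumes the common two-class definition SNR = (mean_a - mean_b) / (sd_a + sd_b). The abstract does not state the formula, and the function names and data layout are hypothetical. The GA itself and the kNN fitness function are not shown.

```python
from statistics import mean, stdev


def snr(values_a, values_b):
    """Signal-to-noise ratio of one gene between two sample classes.

    Needs at least two samples per class and a non-zero combined
    standard deviation.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def top_m_genes(expression, labels, m):
    """Return the m genes with the largest absolute SNR.

    expression: dict mapping gene name -> list of expression values (one per sample)
    labels:     list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expression.items():
        class_a = [v for v, y in zip(values, labels) if y == 0]
        class_b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(snr(class_a, class_b))
    return sorted(scores, key=scores.get, reverse=True)[:m]
```

The returned gene list would then serve as the pool from which the initial GA candidate solutions are drawn.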
Abstract: Proteomics is one of the largest areas of research for
bioinformatics and medical science. An ambitious goal of proteomics
is to elucidate the structure, interactions and functions of all proteins
within cells and organisms. Predicting Protein-Protein Interaction
(PPI) is one of the crucial and decisive problems in current research.
Genomic data offer a great opportunity and at the same time a lot of
challenges for the identification of these interactions. Many methods
have already been proposed in this regard. For in-silico
identification, most methods require both positive and negative
examples of protein interaction, and the quality of these examples
is crucial for the final prediction accuracy. Positive
examples are relatively easy to obtain from well-known databases, but
the generation of negative examples is not a trivial task. Current PPI
identification methods generate negative examples based on some
assumptions, which are likely to affect their prediction accuracy.
Hence, if more reliable negative examples are used, the PPI prediction
methods may achieve even higher accuracy. Focusing on this issue, a
graph based negative example generation method is proposed, which
is simple and more accurate than the existing approaches. An
interaction graph of the protein sequences is created. The basic
assumption is that the longer the shortest path between two
protein sequences in the interaction graph, the lower the likelihood of
their interaction. A well-established PPI detection algorithm is
run with our negative examples, and in most cases its
accuracy increases by more than 10% compared with the negative pair
selection method used in the paper describing that algorithm.
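The shortest-path assumption can be sketched with a breadth-first search over the interaction graph. The distance cutoff `min_distance`, the adjacency-dict representation, and the function names are assumptions of this sketch. The abstract does not say how far apart two proteins must be to count as a negative example.

```python
from collections import deque
from itertools import combinations


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def negative_pairs(graph, min_distance):
    """Protein pairs whose shortest path has at least `min_distance` edges.

    graph: dict mapping each protein to the set of its known interaction partners.
    Pairs in different connected components are skipped (a choice of this sketch).
    """
    dists = {protein: shortest_path_lengths(graph, protein) for protein in graph}
    return [(a, b) for a, b in combinations(sorted(graph), 2)
            if dists[a].get(b, 0) >= min_distance]
```

On a chain of interactions A-B-C-D, `negative_pairs(graph, 3)` yields only the pair (A, D), the two proteins farthest apart in the graph.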
Abstract: A computational platform is presented in this
contribution. It has been designed as a virtual laboratory to be used
for exploring optimization algorithms in biological problems. This
platform is built on a blackboard-based agent architecture. As a test
case, the version of the platform presented here is devoted to the
study of protein folding, initially with a bead-like description of the
chain and with the widely used model of hydrophobic and polar
residues (HP model). Some details of the platform design are
presented along with its capabilities, and some
explorations of the protein folding problem with different types of
discrete space are reviewed. The capability of the platform to
incorporate specific tools for the structural analysis of the runs, in
order to understand and improve the optimization process, is also shown.
Accordingly, the results obtained demonstrate that assembling
computational tools into a single platform is worthwhile by itself,
since experiments developed on it can be designed to provide different
levels of information in a self-consistent fashion. Work is under way to
explore how an experiment design can be used to create a
computational agent to be included within the platform. These
inclusions of designed agents (or software pieces) are useful for the
better accomplishment of the tasks to be developed by the platform.
As the number of agents increases, each new version of the
virtual laboratory gains in robustness and functionality.
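The HP model mentioned above is usually scored by counting hydrophobic contacts. The sketch below assumes the standard energy function (each pair of H residues that are lattice neighbours but not chain neighbours contributes -1) on a 2D square lattice. The abstract does not specify the energy function or the lattice used on the platform, and the function name is hypothetical.

```python
def hp_energy(sequence, coords):
    """Energy of an HP chain placed on a 2D square lattice.

    sequence: string of 'H' (hydrophobic) and 'P' (polar) residues
    coords:   list of (x, y) lattice sites, one per residue, in chain order
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coordinates differ in length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must occupy adjacent sites")

    site = {coord: index for index, coord in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look right and up only, so each contact is counted exactly once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = site.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy
```

For the four-residue chain `HPPH` folded into a unit square, the two terminal H residues touch and the energy is -1. The same chain laid out in a straight line has energy 0.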
Abstract: The similarity comparison of RNA secondary
structures is important in studying the functions of RNAs. In recent
years, most existing tools represent the secondary structures by a
tree-based representation and calculate the similarity by tree alignment
distance. Unlike previous approaches, we propose a new method
based on a maximum clique detection algorithm to extract the maximum
common structural elements in the compared RNA secondary structures.
A new graph-based similarity measurement and maximum common
subgraph detection procedures for comparing purely RNA secondary
structures are introduced. Given two RNA secondary structures, the
proposed algorithm consists of a process to determine the score of the
structural similarity, followed by comparing the vertex labelling, the
labelled edges and the exact degree of each vertex. The proposed
algorithm also consists of a process to extract the common structural
elements between compared secondary structures based on the proposed
maximum clique detection formulation of the problem. This graph-based model
can also work with the NC-IUB code to perform pattern-based
searching. Therefore, it can be used to identify functional RNA motifs
from a database or to extract common substructures between complex
RNA secondary structures. We have demonstrated the performance of the
proposed algorithm by experimental results. It provides a new idea for
comparing RNA secondary structures. This tool is helpful to those
who are interested in structural bioinformatics.
Abstract: A gene network describes the regulatory
relationships among the genes. Each gene has its activators and
inhibitors that regulate its expression positively and negatively
respectively. Genes themselves are believed to act as activators and
inhibitors of other genes. They can even activate one set of genes and
inhibit another set. Identifying gene networks is one of the most
crucial and challenging problems in bioinformatics. Most work done
so far assumes either that there is no time delay in gene regulation or
that there is a constant time delay. Here we propose a Dynamic
Time-Lagged Correlation Based Method (DTCBM) to learn gene
networks. It uses time-lagged correlation to find the potential
gene interactions, then uses a post-processing stage to remove
false gene interactions due to common parents, and finally uses dynamic
correlation thresholds for each gene to construct the gene network.
DTCBM finds correlation between gene expression signals shifted in
time, and therefore takes into consideration the multiple time delay
relationships among the genes. The method is implemented
in MATLAB, and experimental results on Saccharomyces
cerevisiae gene expression data, together with a comparison with other
methods, indicate that it performs better.
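The time-lagged correlation step can be illustrated as follows. This is a sketch in Python rather than the MATLAB of the original implementation. It covers only the search for the best delay between two expression series and assumes Pearson correlation. The post-processing stage and the dynamic per-gene thresholds are not shown, and the function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_lag(regulator, target, max_lag):
    """Return (lag, r) for the delay with the strongest absolute correlation.

    At each lag, regulator[t] is paired with target[t + lag], so a positive
    lag means the target responds `lag` time steps after the regulator.
    """
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        x = regulator[:len(regulator) - lag]
        y = target[lag:]
        r = pearson(x, y)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

If the target series is simply the regulator delayed by two time points, `best_lag` returns a lag of 2 with a correlation of essentially 1.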
Abstract: Annotation of a protein sequence is pivotal for the understanding of its function. The accuracy of manual annotation provided by curators is still questionable owing to its lesser evidence strength, and manual annotation is a hard and time-consuming task. A number of computational methods and tools have been developed to tackle this challenging task. However, they require high-cost hardware, are difficult for bioscientists to set up, or depend on time-intensive and blind sequence similarity searches such as the Basic Local Alignment Search Tool (BLAST). This paper introduces a new method of assigning highly correlated Gene Ontology terms of annotated protein sequences to partially annotated or newly discovered protein sequences. This method is based entirely on Gene Ontology data and annotations. Two problems had to be solved to realize this method. The first problem relates to splitting the single monolithic Gene Ontology RDF/XML file into a set of smaller files that are easy to access and process. These files can then be enriched with protein sequences and Inferred from Electronic Annotation evidence associations. The second problem involves searching for a set of Gene Ontology terms that are semantically similar to a given query. The macro and micro problems involved, their solutions, and the objective of this study are described in detail. This paper also describes protein sequence annotation and the Gene Ontology. The methodology of this study and the Gene Ontology based protein sequence annotation tool, named extended UTMGO, are presented. Furthermore, its basic version, a Gene Ontology browser based on semantic similarity search, is also introduced.
Abstract: A number of competing methodologies have been developed
to identify genes and classify DNA sequences into coding
and non-coding sequences. This classification process is fundamental
in gene finding and gene annotation tools and is one of the most
challenging tasks in bioinformatics and computational biology. An
information theory measure based on mutual information has shown
good accuracy in classifying DNA sequences into coding and non-coding.
In this paper we describe a species-independent iterative
approach that distinguishes coding from non-coding sequences using
the mutual information measure (MIM). A set of sixty prokaryotes is
used to extract universal training data. To facilitate comparisons with
the published results of other researchers, a test set of 51 bacterial
and archaeal genomes was used to evaluate MIM. These results
demonstrate that MIM produces superior results while remaining
species independent.
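The quantity underlying such a measure can be sketched as the mutual information between nucleotides a fixed distance apart. The abstract does not give the exact MIM formula, the distances used, or the iterative training procedure, so the function below is only the textbook definition with a hypothetical name, not the classifier itself.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Mutual information, in bits, between nucleotides k positions apart.

    I(k) = sum over (a, b) of p_ab * log2(p_ab / (p_a * p_b)), where p_ab is
    the frequency of the pair (seq[i], seq[i + k]) and p_a, p_b are the
    marginal frequencies of its first and second member.
    """
    pairs = Counter(zip(seq, seq[k:]))
    n = sum(pairs.values())
    first, second = Counter(), Counter()
    for (a, b), count in pairs.items():
        first[a] += count
        second[b] += count
    return sum((count / n) * math.log2(count * n / (first[a] * second[b]))
               for (a, b), count in pairs.items())
```

A strictly periodic sequence such as `"ACGT" * 10` gives 2 bits at distance 4, since each nucleotide fully determines the one four positions later, whereas a homopolymer gives 0 bits at any distance.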
Abstract: The purpose of my research proposal is to
demonstrate that there is a relationship between EEG and
endometrial cancer.
The above relationship is based on an Aristotelian Syllogism;
since it is known that the 14-3-3 protein is related to the electrical
activity of the brain via control of the flow of Na+ and K+ ions and
since it is also known that many types of cancer are associated with
14-3-3 protein, it is possible that there is a relationship between EEG
and cancer. This research will be carried out using well-defined
diagnostic indicators, obtained from the EEG by signal processing
procedures and pattern recognition tools such as neural networks, in
order to recognize the endometrial cancer type. The current research
shall compare the findings from EEG and hysteroscopy performed on
women of a wide age range. Moreover, this practice could be
expanded to other types of cancer. The implementation of this
methodology will be completed with the creation of an ontology.
This ontology shall define the concepts existing in this research's
domain and the relationships between them. It will represent the
types of relationships between hysteroscopy and EEG findings.
Abstract: The HIV genome is highly heterogeneous, so its features vary over a wide range.
Hence, the infection ability of the virus changes with the chemokine receptors it uses.
R5 and X4 HIV viruses use the CCR5 and CXCR4 coreceptors, respectively, while R5X4 viruses can utilize both coreceptors. Recently, in bioinformatics, the classification of R5X4 viruses
by means of the coreceptors of the HIV genome has been studied.
The aim of this study is to develop the optimal Multilayer
Perceptron (MLP) for high classification accuracy of HIV sub-type viruses. To accomplish this, the number of units in the hidden layer
was incremented one by one, from one to a particular number. The statistical data of R5X4, R5 and X4 viruses were preprocessed by
signal processing methods. Accessible residues of these virus sequences were extracted and modeled by an Auto-Regressive (AR) model
because the dimension of the residues is large and differs between sequences. Finally, the preprocessed dataset was used to evolve MLPs with various numbers of hidden units to identify R5X4
viruses. Furthermore, ROC analysis was used to determine the optimal MLP structure.
Abstract: Gamboge disorder (GD), or fruit damage by yellow sap, is a major problem in mangosteen. Mangosteen plants vary in the level of GD, from very low or non-GD to low, moderate and high GD. However, it is difficult to differentiate between GD and non-GD plants because evaluation of the disorder is strongly influenced by the environment. In this study we investigated the usefulness of a primer designed using bioinformatics and related to cell wall strength, termed MCWS, for predicting GD. The plant materials were 28 mangosteen plants selected on the basis of their percentage of GD, categorized as high, moderate, low and very low or non-GD. The results showed that the specific DNA fragments were absent in the high-GD accessions. MCWS is suggested as a novel polymorphic marker for GD in mangosteen, as well as a marker for detecting variability in mangosteen as an apomictic plant.
Abstract: In the last few years, the Semantic Web has gained scientific acceptance as a means of identifying relationships in knowledge bases, widely known as semantic associations. Querying complex relationships between entities is a strong requirement for many applications in analytical domains. In bioinformatics, for example, it is critical to extract exchanges between proteins. Currently, the usual result of such queries is a set of paths between connected entities in the data graph. However, these do not always give good results when the user needs the best association or a limited set of best associations, because they consider all existing paths but ignore path evaluation. In this paper, we present an approach for supporting association discovery queries. Our proposal includes (i) a query language, PmSPRQL, which provides multiparadigm query expressions for association extraction and (ii) some quantification measures that ease the process of association ranking. The originality of our proposal is demonstrated by a performance evaluation of our approach on real-world datasets.
Abstract: Discovering new biological knowledge from high-throughput biological data is a major challenge to bioinformatics today. To address this challenge, we developed a new approach for protein classification. Proteins that are evolutionarily, and thereby functionally, related are said to belong to the same classification. Identifying protein classification is of fundamental importance for documenting the diversity of the known protein universe. It also provides a means to determine the functional roles of newly discovered protein sequences. Our goal is to predict the functional classification of novel protein sequences based on a set of features extracted from each protein sequence. The proposed technique uses datasets extracted from the Structural Classification of Proteins (SCOP) database. A set of spectral domain features based on the Fast Fourier Transform (FFT) is used. The proposed classifier uses a multilayer back propagation (MLBP) neural network for protein classification. The maximum classification accuracy is about 91% when applying the classifier to the full four levels of the SCOP database. However, it reaches a maximum of 96% when limiting the classification to the family level. The classification results reveal that the spectral domain contains information that can be used for classification with high accuracy. In addition, the results emphasize that sequence similarity measures are of great importance, especially at the family level.
Abstract: Nowadays scientific data is inevitably digital and
stored in a wide variety of formats in heterogeneous systems.
Scientists need to access an integrated view of remote or local
heterogeneous data sources with advanced data access, analysis,
and visualization tools. This research suggests the use of Service
Oriented Architecture (SOA) to integrate biological data from
different data sources. This work shows that SOA can solve the problems
facing the integration process and allows biologists to
access biological data more easily. There are several methods to
implement SOA, but web services are the most popular. The
Microsoft .NET Framework is used to implement the proposed architecture.
Abstract: The multi-agent system approach has proven to be an effective and appropriate abstraction level for constructing whole models of a diversity of biological problems, integrating aspects found in both "micro" and "macro" approaches to modeling this type of phenomena. Taking these considerations into account, this paper presents the important computational characteristics to be gathered into a novel bioinformatics framework built upon a multi-agent architecture. The version of the tool presented herein allows the study and exploration of complex problems belonging principally to structural biology, such as protein folding. The bioinformatics framework is used as a virtual laboratory to explore a minimalist model of protein folding as a test case. To show the laboratory concept of the platform as well as its flexibility and adaptability, we studied the folding of two particular sequences, a 45-mer and a 64-mer, both described by an HP model (only hydrophobic and polar residues) on a coarse-grained 2D square lattice. As explained in the discussion section, these two sequences were chosen as breaking points for the platform, in order to determine which tools need to be created or improved to meet the needs of computing and analyzing a given tough sequence. The underlying philosophy is that the continued study of sequences itself reveals what should be added to the platform to keep improving its efficiency, as is demonstrated herein.
Abstract: Due to the ever growing amount of publications about
protein-protein interactions, information extraction from text is
increasingly recognized as one of the crucial technologies in
bioinformatics. This paper presents a Protein Interaction Extraction
System using a Link Grammar Parser from biomedical abstracts
(PIELG). PIELG uses linkage given by the Link Grammar Parser to
start a case based analysis of contents of various syntactic roles as
well as their linguistically significant and meaningful combinations.
The system uses phrasal-prepositional verb patterns to overcome
preposition combination problems. The recall and precision are
74.4% and 62.65%, respectively. Experimental comparisons with two
other state-of-the-art extraction systems indicate that the PIELG system
achieves better performance. For further evaluation, the system is
augmented with a graphical package (Cytoscape) for extracting
protein interaction information from sequence databases. The result
shows that the performance is remarkably promising.
Abstract: Multiple sequence alignment is a fundamental part of
many bioinformatics applications such as phylogenetic analysis.
Many alignment methods have been proposed. Each method gives a
different result for the same data set, and consequently generates a
different phylogenetic tree. Hence, the chosen alignment method
affects the resulting tree. However in the literature, there is no
evaluation of multiple alignment methods based on the comparison of
their phylogenetic trees. This work evaluates the following eight
aligners: ClustalX, T-Coffee, SAGA, MUSCLE, MAFFT, DIALIGN,
ProbCons and Align-m, based on their phylogenetic trees (test trees)
produced on a given data set. The Neighbor-Joining method is used
to estimate trees. Three criteria, namely the dNNI, the dRF and the
Id_Tree, are established to test the ability of different alignment
methods to produce a test tree close to the reference one
(true tree). Results show that the method which produces the most
accurate alignment gives the test tree nearest to the reference tree.
MUSCLE outperforms all aligners with respect to the three criteria
and for all datasets, performing particularly better when sequence
identities are within 10-20%. It is followed by T-Coffee at lower
sequence identity (30%), trees scores of all methods
become similar.
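Of the three criteria, dRF presumably denotes the Robinson-Foulds distance, which counts the groupings present in one tree but not in the other. The sketch below makes several simplifying assumptions that are not in the abstract: trees are rooted and written as nested tuples of leaf names, the distance is the unnormalised size of the symmetric difference of clade sets, and the function names are hypothetical. Neighbor-Joining trees are unrooted, so a faithful implementation would compare bipartitions instead of clades.

```python
def clades(tree):
    """Non-trivial clades of a rooted tree given as nested tuples.

    Example input: ((("A", "B"), "C"), "D"). Each clade is returned as a
    frozenset of leaf names. Single leaves and the full leaf set are omitted.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: clades found in exactly one of the trees."""
    return len(clades(tree_a) ^ clades(tree_b))
```

Two trees that differ only in whether B or C groups with A, such as `((("A", "B"), "C"), "D")` and `((("A", "C"), "B"), "D")`, are at distance 2.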
Abstract: The protein domain structure has been widely used as the most informative sequence feature to computationally predict protein-protein interactions. However, in a recent study, a research group has reported a very high accuracy of 94% using the hydrophobicity feature. Therefore, in this study we compare and verify the usefulness of protein domain structure and hydrophobicity properties as sequence features. Using Support Vector Machines (SVM) as the learning system, our results indicate that both features achieve an accuracy of nearly 80%. Furthermore, domain structure had a receiver operating characteristic (ROC) score of 0.8480 with a running time of 34 seconds, while hydrophobicity had a ROC score of 0.8159 with a running time of 20,571 seconds (5.7 hours). These results indicate that protein-protein interaction can be predicted from domain structure with reliable accuracy and acceptable running time.
Abstract: The National Agricultural Biotechnology Information
Center (NABIC) plays a leading role in the biotechnology information
database for agricultural plants in Korea. Since 2002, we have
concentrated on functional genomics of major crops, building an
integrated biotechnology database for agro-biotech information that
focuses on bioinformatics of major agricultural resources such as rice,
Chinese cabbage, and microorganisms. The NABIC
integrated biotechnology database provides useful information
through a user-friendly web interface that allows analysis of genome
infrastructure, multiple plants, microbial resources, and living
modified organisms.
Abstract: Protein 3D structure prediction has always been an
important research area in bioinformatics. In particular, the
prediction of secondary structure has been a well-studied research
topic. Despite the recent breakthrough of combining multiple
sequence alignment information and artificial intelligence algorithms
to predict protein secondary structure, the Q3 accuracy of various
computational prediction algorithms has rarely exceeded 75%. In a
previous paper [1], this research team presented a rule-based method
called RT-RICO (Relaxed Threshold Rule Induction from Coverings)
to predict protein secondary structure. The average Q3 accuracy on
the sample datasets using RT-RICO was 80.3%, an improvement
over comparable computational methods. Although this demonstrated
that RT-RICO might be a promising approach for predicting
secondary structure, the algorithm's computational complexity and
program running time limited its use. Herein a parallelized
implementation of a slightly modified RT-RICO approach is
presented. This new version of the algorithm facilitated the testing of
a much larger dataset of 396 protein domains [2]. Parallelized RT-RICO
achieved a Q3 score of 74.6%, which is higher than the
consensus prediction accuracy of 72.9% that was achieved for the
same test dataset by a combination of four secondary structure
prediction methods [2].
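For reference, the Q3 score quoted above is the percentage of residues assigned to the correct one of three states. A minimal sketch follows, assuming the usual one-letter encoding H (helix), E (strand) and C (coil). The function name is hypothetical, and the code shows only the metric, not RT-RICO itself.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For instance, `q3_accuracy("HHHEECCC", "HHHEEECC")` gives 87.5, since seven of the eight residues agree.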