On the Prediction of Transmembrane Helical Segments in Membrane Proteins Based on Wavelet Transform

The prediction of transmembrane helical segments (TMHs) in membrane proteins is an important field in bioinformatics research. In this paper, a new method based on the discrete wavelet transform (DWT) is developed to predict the number and location of TMHs in membrane proteins. The protein with PDB code 1KQG is used as an example to describe how the method predicts the number and location of TMHs. To assess the method, 80 proteins with known 3D structure (325 TMHs in total) were chosen at random from the Mptopo database as test objects; 308 of the TMHs were predicted accurately, and the average prediction accuracy is 96.3%. In addition, the 80 membrane proteins were divided into 13 groups according to their function and type, and the prediction results for each of the 13 groups are satisfactory.

Performance Analysis of Genetic Algorithm with kNN and SVM for Feature Selection in Tumor Classification

Tumor classification is a key area of research in the field of bioinformatics. Microarray technology is commonly used to study disease diagnosis from gene expression levels. The main drawback of gene expression data is that it contains thousands of genes but very few samples. Feature selection methods are used to select the informative genes from the microarray, and these methods considerably improve the classification accuracy. In the proposed method, a Genetic Algorithm (GA) is used for effective feature selection. Informative genes are identified based on their T-Statistics, Signal-to-Noise Ratio (SNR) and F-Test values. The initial candidate solutions of the GA are obtained from the top-m informative genes. The classification accuracy of the k-Nearest Neighbor (kNN) method is used as the fitness function for the GA. In this work, kNN and Support Vector Machine (SVM) are used as the classifiers. The experimental results show that the proposed work is suitable for effective feature selection. With the selected genes, the GA-kNN method achieves 100% accuracy in 4 of the 10 datasets and the GA-SVM method achieves it in 5. GA combined with kNN and SVM is thus demonstrated to be an accurate approach for microarray-based tumor classification.
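The gene-ranking step can be illustrated with a minimal sketch. The abstract does not define its SNR, so the code below assumes the definition commonly used for two-class microarray data, (mean_a - mean_b) / (sd_a + sd_b); the function names and the toy expression values are hypothetical.

```python
from statistics import mean, stdev

def snr_score(class_a, class_b):
    """Signal-to-noise ratio of one gene: (mean_a - mean_b) / (sd_a + sd_b).

    Assumed definition; raises ZeroDivisionError if both classes are constant.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))

def top_m_genes(expr_a, expr_b, m):
    """Rank genes by |SNR| and return the indices of the m highest-scoring genes.

    expr_a[g] and expr_b[g] hold the expression values of gene g in the
    samples of class A and class B respectively.
    """
    scores = [abs(snr_score(a, b)) for a, b in zip(expr_a, expr_b)]
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return order[:m]

# Toy data (made up): 3 genes, 3 samples per class.
tumor = [[5.0, 5.2, 4.8], [1.0, 3.0, 2.0], [2.0, 2.2, 2.4]]
normal = [[1.0, 1.2, 0.8], [1.5, 2.5, 2.0], [2.0, 1.8, 1.6]]
print(top_m_genes(tumor, normal, 2))
```

A list such as this could seed the initial GA population with the top-m genes; how the paper combines the three statistics is not stated in the abstract.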

Selecting Negative Examples for Protein-Protein Interaction

Proteomics is one of the largest areas of research in bioinformatics and medical science. An ambitious goal of proteomics is to elucidate the structure, interactions and functions of all proteins within cells and organisms. Predicting Protein-Protein Interaction (PPI) is one of the crucial problems in current research. Genomic data offer a great opportunity, and at the same time many challenges, for the identification of these interactions. Many methods have already been proposed in this regard. For in-silico identification, most methods require both positive and negative examples of protein interaction, and the quality of these examples is crucial for the final prediction accuracy. Positive examples are relatively easy to obtain from well-known databases, but generating negative examples is not a trivial task. Current PPI identification methods generate negative examples based on assumptions that are likely to affect their prediction accuracy. Hence, with more reliable negative examples, PPI prediction methods may achieve higher accuracy. Focusing on this issue, a graph-based negative example generation method is proposed, which is simple and more accurate than the existing approaches. An interaction graph of the protein sequences is created. The basic assumption is that the longer the shortest path between two protein sequences in the interaction graph, the lower the possibility of their interaction. A well-established PPI detection algorithm is run with our negative examples, and in most cases its accuracy increases by more than 10% compared with the negative pair selection method used in the original paper.
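The shortest-path assumption can be sketched as follows. This is only an illustration, not the authors' procedure: the distance threshold, the handling of disconnected pairs (skipped here) and the protein names are assumptions.

```python
from collections import deque
from itertools import combinations

def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def negative_candidates(graph, min_distance):
    """Return (a, b, distance) for pairs at least min_distance hops apart.

    Pairs with no connecting path are skipped (an assumption of this sketch).
    """
    all_dist = {p: shortest_path_lengths(graph, p) for p in graph}
    pairs = []
    for a, b in combinations(sorted(graph), 2):
        d = all_dist[a].get(b)
        if d is not None and d >= min_distance:
            pairs.append((a, b, d))
    return pairs

# Toy interaction graph (hypothetical proteins): a chain A-B-C-D-E.
ppi = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "E"}, "E": {"D"}}
print(negative_candidates(ppi, 3))
```

On the toy chain, only pairs three or more hops apart are returned as negative candidates.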

Exploring Dimensionality, Systematic Mutations and Number of Contacts in Simple HP ab-initio Protein Folding Using a Blackboard-based Agent Platform

This contribution presents a computational platform designed as a virtual laboratory for exploring optimization algorithms in biological problems. The platform is built on a blackboard-based agent architecture. As a test case, the version presented here is devoted to the study of protein folding, initially with a bead-like description of the chain and the widely used model of hydrophobic and polar residues (HP model). Some details of the platform design are presented along with its capabilities, and some explorations of the protein folding problem with different types of discrete space are reviewed. The platform's ability to incorporate specific tools for the structural analysis of the runs, in order to understand and improve the optimization process, is also shown. The results obtained demonstrate that assembling the computational tools into a single platform is worthwhile in itself, since experiments developed on it can be designed to provide different levels of information in a self-consistent fashion. Work is currently under way to explore how an experiment design can be used to create a computational agent to be included within the platform. Including such designed agents (or software pieces) helps the platform accomplish its tasks. As the number of agents increases, each new version of the virtual laboratory gains in robustness and functionality.

Maximum Common Substructure Extraction in RNA Secondary Structures Using Clique Detection Approach

Comparing the similarity of RNA secondary structures is important in studying the functions of RNAs. Most existing tools represent secondary structures as trees and calculate similarity by tree alignment distance. In contrast to these approaches, we propose a new method based on a maximum clique detection algorithm to extract the maximum common structural elements of the compared RNA secondary structures. A new graph-based similarity measure and a maximum common subgraph detection procedure for comparing RNA secondary structures alone are introduced. Given two RNA secondary structures, the proposed algorithm first determines the structural similarity score by comparing the vertex labels, the labelled edges and the exact degree of each vertex. It then extracts the common structural elements between the compared secondary structures using the proposed maximum clique formulation of the problem. The graph-based model can also work with the NC-IUB code to perform pattern-based searching. It can therefore be used to identify functional RNA motifs in a database or to extract common substructures between complex RNA secondary structures. The performance of the proposed algorithm is demonstrated by experimental results. The method offers a new way of comparing RNA secondary structures, and the tool should be helpful to those interested in structural bioinformatics.

A Dynamic Time-Lagged Correlation based Method to Learn Multi-Time Delay Gene Networks

A gene network describes the regulatory relationships among genes. Each gene has activators and inhibitors that regulate its expression positively and negatively, respectively. Genes themselves are believed to act as activators and inhibitors of other genes; they can even activate one set of genes and inhibit another. Identifying gene networks is one of the most crucial and challenging problems in bioinformatics. Most work done so far assumes either that there is no time delay in gene regulation or that the time delay is constant. We propose a Dynamic Time-Lagged Correlation Based Method (DTCBM) to learn gene networks. It uses time-lagged correlation to find potential gene interactions, then applies a post-processing stage to remove false gene interactions caused by common parents, and finally uses dynamic correlation thresholds for each gene to construct the gene network. DTCBM finds correlations between gene expression signals shifted in time, and therefore takes into account multi-time-delay relationships among the genes. The method is implemented in MATLAB. Experimental results on Saccharomyces cerevisiae gene expression data, and comparison with other methods, indicate that it performs better.
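The core idea of correlating expression signals shifted in time can be sketched briefly. The paper's implementation is in MATLAB and includes post-processing and per-gene thresholds that are not reproduced here; this Python sketch only shows a plain Pearson correlation evaluated at several lags, with hypothetical function names and made-up data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def lagged_correlation(regulator, target, lag):
    """Correlate regulator[t] with target[t + lag] for a non-negative lag."""
    if lag == 0:
        return pearson(regulator, target)
    return pearson(regulator[:-lag], target[lag:])

def best_lag(regulator, target, max_lag):
    """Return (lag, correlation) with the largest absolute correlation."""
    scores = {lag: lagged_correlation(regulator, target, lag)
              for lag in range(max_lag + 1)}
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]

# Made-up expression profiles: g2 repeats g1 two time points later.
g1 = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 1.0]
g2 = [9.0, 9.0] + g1[:-2]
print(best_lag(g1, g2, 3))
```

Scanning several lags per gene pair, rather than fixing one, is what lets such a method accommodate different delays for different interactions.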

Computational Method for Annotation of Protein Sequence According to Gene Ontology Terms

Annotation of a protein sequence is pivotal for understanding its function. The accuracy of manual annotation provided by curators is still questionable owing to its lesser evidence strength, and the task remains hard and time consuming. A number of computational methods and tools have been developed to tackle this challenging task. However, they require high-cost hardware, are difficult for bioscientists to set up, or depend on time-intensive and blind sequence similarity searches such as the Basic Local Alignment Search Tool. This paper introduces a new method of assigning highly correlated Gene Ontology terms of annotated protein sequences to partially annotated or newly discovered protein sequences. The method is based entirely on Gene Ontology data and annotations. Two problems had to be solved to realize this method. The first relates to splitting the single monolithic Gene Ontology RDF/XML file into a set of smaller files that are easy to access and process, so that these files can be enriched with protein sequences and Inferred from Electronic Annotation evidence associations. The second involves searching for a set of Gene Ontology terms that are semantically similar to a given query. The macro and micro problems involved, their solutions, and the objective of this study are described in detail. This paper also describes protein sequence annotation and the Gene Ontology. The methodology of this study and a Gene Ontology based protein sequence annotation tool, namely extended UTMGO, are presented. Its basic version, a Gene Ontology browser based on semantic similarity search, is also introduced.

MIM: A Species Independent Approach for Classifying Coding and Non-Coding DNA Sequences in Bacterial and Archaeal Genomes

A number of competing methodologies have been developed to identify genes and classify DNA sequences into coding and non-coding sequences. This classification process is fundamental in gene finding and gene annotation tools and is one of the most challenging tasks in bioinformatics and computational biology. An information theory measure based on mutual information has shown good accuracy in classifying DNA sequences into coding and non-coding. In this paper we describe a species-independent iterative approach that distinguishes coding from non-coding sequences using the mutual information measure (MIM). A set of sixty prokaryotes is used to extract universal training data. To facilitate comparisons with the published results of other researchers, a test set of 51 bacterial and archaeal genomes was used to evaluate MIM. These results demonstrate that MIM produces superior results while remaining species independent.
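To make the underlying quantity concrete, the sketch below computes the mutual information between bases that lie k positions apart in a sequence, using I(k) = sum over base pairs of p_ab(k) * log2(p_ab(k) / (p_a * p_b)). How MIM turns this quantity into a coding or non-coding decision is not described in the abstract, so only the measure itself is shown; the function name and sequences are illustrative.

```python
from collections import Counter
from math import log2

def mutual_information(seq, k):
    """Mutual information in bits between bases k positions apart in seq."""
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    joint = Counter(pairs)                      # counts of (base, base k later)
    left = Counter(a for a, _ in pairs)         # marginal of the first base
    right = Counter(b for _, b in pairs)        # marginal of the second base
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

# A homopolymer carries no information; a strict 4-base repeat is fully
# predictable at distance 4 and reaches the 2-bit maximum for 4 symbols.
print(mutual_information("A" * 20, 3))
print(mutual_information("ACGT" * 10, 4))
```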

Endometrial Cancer Recognition via EEG Dependent upon 14-3-3 Protein Leading to an Ontological Diagnosis

The purpose of my research proposal is to demonstrate that there is a relationship between EEG and endometrial cancer. This relationship is based on an Aristotelian syllogism: since it is known that the 14-3-3 protein is related to the electrical activity of the brain via control of the flow of Na+ and K+ ions, and since it is also known that many types of cancer are associated with the 14-3-3 protein, it is possible that there is a relationship between EEG and cancer. The research will use well-defined diagnostic indicators, obtained via the EEG with signal processing procedures and pattern recognition tools such as neural networks, to recognize the endometrial cancer type. It will compare the findings from EEG and hysteroscopy performed on women of a wide age range. Moreover, this practice could be extended to other types of cancer. The implementation of this methodology will be completed with the creation of an ontology. This ontology will define the concepts existing in this research's domain and the relationships between them. It will represent the types of relationships between hysteroscopy and EEG findings.

Optimal Multilayer Perceptron Structure For Classification of HIV Sub-Type Viruses

The HIV genome is highly heterogeneous, so its features span a wide range. Consequently, the infectivity of the virus varies with the chemokine receptors involved: R5 and X4 HIV viruses use the CCR5 and CXCR4 coreceptors, respectively, while R5X4 viruses can use both. Recently, bioinformatics studies have sought to classify R5X4 viruses using the coreceptors of the HIV genome. The aim of this study is to develop the optimal Multilayer Perceptron (MLP) for accurate classification of HIV sub-type viruses. To this end, the number of units in the hidden layer was incremented one at a time, from one up to a chosen maximum. The statistical data of R5X4, R5 and X4 viruses were preprocessed with signal processing methods. Accessible residues of these virus sequences were extracted and modeled with an Auto-Regressive (AR) model, because the residue sets are large and differ in dimension from one sequence to another. Finally, the preprocessed dataset was used to build MLPs with various numbers of hidden units to identify R5X4 viruses. ROC analysis was then used to determine the optimal MLP structure.

Polymorphic Marker Designed from Bioinformatics Sequences Related to Cell Wall Strength for Discrimination of Mangosteen (Garcinia mangostana L.) Clones Resistant to Gamboge Disorder

Gamboge disorder (GD), or fruit damage by yellow sap, is a major problem in mangosteen. Mangosteen plants vary in their level of GD, from very low or non GD to low, moderate and high GD. However, it is difficult to differentiate between GD and non GD plants because evaluation of the disorder is strongly influenced by the environment. In this study we investigated the usefulness of a primer designed from bioinformatics sequences related to cell wall strength, termed MCWS, for predicting GD. The plant materials were 28 mangosteen plants selected based on their percentage of GD, categorized as high, moderate, low, and very low or non GD. The results showed that the specific DNA fragments were absent in the high GD accessions. The MCWS marker is suggested as a novel polymorphic marker for GD in mangosteen, as well as a marker for detecting variability in mangosteen as an apomictic plant.

PmSPARQL: Extended SPARQL for Multi-paradigm Path Extraction

In the last few years, the Semantic Web has gained scientific acceptance as a means of identifying relationships in knowledge bases, widely known as semantic association. Querying complex relationships between entities is a strong requirement for many applications in analytical domains. In bioinformatics, for example, it is critical to extract exchanges between proteins. Currently, such queries typically return the paths between connected entities in the data graph. However, they do not always give good results when the user needs the best association or a limited set of best associations, because they return all existing paths without evaluating them. In this paper, we present an approach for supporting association discovery queries. Our proposal includes (i) a query language, PmSPARQL, which provides multi-paradigm query expressions for association extraction and (ii) quantification measures that ease the process of association ranking. The originality of our proposal is demonstrated by a performance evaluation of our approach on real-world datasets.

A Novel Approach for Protein Classification Using Fourier Transform

Discovering new biological knowledge from high-throughput biological data is a major challenge for bioinformatics today. To address this challenge, we developed a new approach for protein classification. Proteins that are evolutionarily, and thereby functionally, related are said to belong to the same classification. Identifying protein classification is of fundamental importance for documenting the diversity of the known protein universe. It also provides a means to determine the functional roles of newly discovered protein sequences. Our goal is to predict the functional classification of novel protein sequences based on a set of features extracted from each protein sequence. The proposed technique uses datasets extracted from the Structural Classification of Proteins (SCOP) database. A set of spectral domain features based on the Fast Fourier Transform (FFT) is used. The proposed classifier uses a multilayer back propagation (MLBP) neural network for protein classification. The maximum classification accuracy is about 91% when the classifier is applied to the full four levels of the SCOP database, and it reaches a maximum of 96% when classification is limited to the family level. The classification results reveal that the spectral domain contains information that can be used for classification with high accuracy. In addition, the results emphasize that sequence similarity measures are of great importance, especially at the family level.
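A minimal sketch of how spectral features might be derived from a protein sequence is given below. The abstract does not say how residues are mapped to numbers or which spectral features are kept, so the binary hydrophobic/non-hydrophobic encoding and the residue set are purely assumptions. A naive discrete Fourier transform is used for self-containment; an FFT computes the same values faster.

```python
import cmath

# Hypothetical encoding: 1.0 for residues treated as hydrophobic, else 0.0.
HYDROPHOBIC = set("AVLIMFWC")

def encode(protein):
    """Map a protein sequence to a numeric signal (assumed binary encoding)."""
    return [1.0 if aa in HYDROPHOBIC else 0.0 for aa in protein]

def power_spectrum(signal):
    """Power |X[k]|^2 of the discrete Fourier transform for k = 0 .. N//2."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, v in enumerate(signal))
        spectrum.append(abs(x_k) ** 2)
    return spectrum

# A strictly alternating pattern concentrates its power at k = 0 and k = N/2.
features = power_spectrum(encode("ASASASAS"))
print([round(p, 6) for p in features])
```

A feature vector of this kind could then be fed to a neural network classifier, as the abstract describes for MLBP.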

Biological Data Integration using SOA

Nowadays scientific data is inevitably digital and stored in a wide variety of formats in heterogeneous systems. Scientists need access to an integrated view of remote or local heterogeneous data sources, together with advanced tools for data access, analysis, and visualization. This research suggests the use of Service Oriented Architecture (SOA) to integrate biological data from different data sources. This work shows that SOA can address the problems facing the integration process and examines whether biologists can access the biological data more easily. There are several ways to implement SOA, but web services are the most popular. The Microsoft .Net Framework is used to implement the proposed architecture.

Multi-Agent Systems Applied in the Modeling and Simulation of Biological Problems: A Case Study in Protein Folding

The multi-agent system approach has proven to be an effective and appropriate level of abstraction for constructing whole models of a variety of biological problems, integrating aspects found in both "micro" and "macro" approaches to modeling this type of phenomenon. With these considerations in mind, this paper presents the important computational characteristics gathered into a novel bioinformatics framework built upon a multi-agent architecture. The version of the tool presented herein allows the study and exploration of complex problems belonging principally to structural biology, such as protein folding. The bioinformatics framework is used as a virtual laboratory to explore a minimalist model of protein folding as a test case. To show the laboratory concept of the platform as well as its flexibility and adaptability, we studied the folding of two particular sequences, a 45-mer and a 64-mer, both described by an HP model (only hydrophobic and polar residues) on a coarse-grained 2D square lattice. As explained in the discussion section, these two sequences were chosen to push the platform to its limits, in order to determine which tools must be created or improved to meet the needs of computing and analyzing a given tough sequence. The underlying philosophy is that the continued study of sequences itself reveals important features to be added to the platform to improve its efficiency, as is demonstrated herein.
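For readers unfamiliar with the HP model on a 2D square lattice, the sketch below evaluates the energy of a given conformation under the usual convention of -1 per contact between hydrophobic residues that are lattice neighbours but not chain neighbours. It is independent of the platform described above; the function name and the short example chain are illustrative only.

```python
def hp_energy(sequence, coords):
    """Energy of an HP chain on a 2D square lattice: -1 per H-H contact.

    sequence is a string of 'H' and 'P'; coords lists the (x, y) site of
    each residue in chain order.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    position = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for site in ((x + 1, y), (x, y + 1)):
            j = position.get(site)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts

# A 4-residue chain folded into a unit square brings its two H ends together.
folded = [(0, 0), (1, 0), (1, 1), (0, 1)]
extended = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(hp_energy("HPPH", folded), hp_energy("HPPH", extended))
```

An optimization agent would search over self-avoiding conformations for the one minimizing this energy.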

PIELG: A Protein Interaction Extraction System using a Link Grammar Parser from Biomedical Abstracts

Due to the ever-growing number of publications about protein-protein interactions, information extraction from text is increasingly recognized as one of the crucial technologies in bioinformatics. This paper presents a Protein Interaction Extraction System using a Link Grammar Parser from biomedical abstracts (PIELG). PIELG uses the linkage given by the Link Grammar Parser to start a case-based analysis of the contents of various syntactic roles as well as their linguistically significant and meaningful combinations. The system uses phrasal-prepositional verb patterns to overcome problems with preposition combinations. The recall and precision are 74.4% and 62.65%, respectively. Experimental comparison with two other state-of-the-art extraction systems indicates that PIELG achieves better performance. For further evaluation, the system is augmented with a graphical package (Cytoscape) for extracting protein interaction information from sequence databases. The results show that the performance is promising.

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Multiple sequence alignment is a fundamental part of many bioinformatics applications such as phylogenetic analysis. Many alignment methods have been proposed. Each method gives a different result for the same data set and consequently generates a different phylogenetic tree, so the chosen alignment method affects the resulting tree. However, the literature contains no evaluation of multiple alignment methods based on a comparison of their phylogenetic trees. This work evaluates eight aligners, ClustalX, T-Coffee, SAGA, MUSCLE, MAFFT, DIALIGN, ProbCons and Align-m, based on the phylogenetic trees (test trees) they produce on a given data set. The Neighbor-Joining method is used to estimate trees. Three criteria, namely the dNNI, the dRF and the Id_Tree, are established to test the ability of the different alignment methods to produce a test tree close to the reference one (true tree). Results show that the method which produces the most accurate alignment gives the test tree nearest to the reference tree. MUSCLE outperforms all aligners with respect to the three criteria and for all datasets, performing particularly well when sequence identities are within 10-20%. It is followed by T-Coffee at lower sequence identity (30%), tree scores of all methods become similar.
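Assuming that dRF denotes the Robinson-Foulds distance (the abstract does not expand the abbreviation), comparing a test tree with a reference tree can be sketched as counting the leaf bipartitions found in one tree but not the other. Trees are written here as nested tuples of leaf names, a representation chosen only for this illustration, and the root position is ignored so that the comparison is unrooted.

```python
def leaf_set(tree):
    """All leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial leaf bipartitions induced by the internal edges of a tree."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)          # fixed leaf used to orient every split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        if anchor in side:            # canonical form: the side without anchor
            side = all_leaves - side
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaves")
    return len(splits(tree_a) ^ splits(tree_b))

reference = ((("A", "B"), "C"), ("D", "E"))
test_tree = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(reference, test_tree))
```

A distance of zero means the two topologies agree on every internal branch; larger values mean the aligner's tree departs further from the reference.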

Comparison of Domain and Hydrophobicity Features for the Prediction of Protein-Protein Interactions using Support Vector Machines

Protein domain structure has been widely used as the most informative sequence feature for computationally predicting protein-protein interactions. However, a recent study reported a very high accuracy of 94% using the hydrophobicity feature. Therefore, in this study we compare and verify the usefulness of protein domain structure and hydrophobicity properties as sequence features. Using Support Vector Machines (SVM) as the learning system, our results indicate that both features achieve an accuracy of nearly 80%. Furthermore, domain structure had a receiver operating characteristic (ROC) score of 0.8480 with a running time of 34 seconds, while hydrophobicity had a ROC score of 0.8159 with a running time of 20,571 seconds (5.7 hours). These results indicate that protein-protein interactions can be predicted from domain structure with reliable accuracy and acceptable running time.

An Integrated Biotechnology Database of the National Agricultural Information Center in Korea

The National Agricultural Biotechnology Information Center (NABIC) plays a leading role in the biotechnology information database for agricultural plants in Korea. Since 2002, we have concentrated on functional genomics of major crops, building an integrated biotechnology database for agro-biotech information that focuses on bioinformatics of major agricultural resources such as rice, Chinese cabbage, and microorganisms. The NABIC integrated biotechnology database provides useful information through a user-friendly web interface that allows analysis of genome infrastructure, multiple plants, microbial resources, and living modified organisms.

Protein Secondary Structure Prediction Using Parallelized Rule Induction from Coverings

Protein 3D structure prediction has always been an important research area in bioinformatics. In particular, the prediction of secondary structure has been a well-studied research topic. Despite the recent breakthrough of combining multiple sequence alignment information and artificial intelligence algorithms to predict protein secondary structure, the Q3 accuracy of various computational prediction algorithms has rarely exceeded 75%. In a previous paper [1], this research team presented a rule-based method called RT-RICO (Relaxed Threshold Rule Induction from Coverings) to predict protein secondary structure. The average Q3 accuracy on the sample datasets using RT-RICO was 80.3%, an improvement over comparable computational methods. Although this demonstrated that RT-RICO might be a promising approach for predicting secondary structure, the algorithm's computational complexity and program running time limited its use. Herein a parallelized implementation of a slightly modified RT-RICO approach is presented. This new version of the algorithm made it possible to test a much larger dataset of 396 protein domains [2]. Parallelized RT-RICO achieved a Q3 score of 74.6%, which is higher than the consensus prediction accuracy of 72.9% that was achieved for the same test dataset by a combination of four secondary structure prediction methods [2].
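The Q3 score quoted above is the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. A minimal sketch of the calculation follows; the function name and the two label strings are made up for illustration and are unrelated to the RT-RICO datasets.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("label strings must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Made-up example: 8 of 10 residues match.
observed = "CHHHHEEECC"
predicted = "CHHHCEEECH"
print(q3_accuracy(predicted, observed))
```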