Computational Method for Annotation of Protein Sequence According to Gene Ontology Terms

Annotating a protein sequence is pivotal to understanding its function. Manual annotation by curators is a hard and time-consuming task, and its accuracy remains questionable because of its weaker evidence strength. A number of computational methods and tools have been developed to tackle this challenging task. However, they require high-cost hardware, are difficult for bioscientists to set up, or depend on time-intensive, blind sequence similarity searches such as the Basic Local Alignment Search Tool. This paper introduces a new method for assigning highly correlated Gene Ontology terms of annotated protein sequences to partially annotated or newly discovered protein sequences. The method is based entirely on Gene Ontology data and annotations. Two problems had to be solved to realize it. The first is splitting the single monolithic Gene Ontology RDF/XML file into a set of smaller files that are easy to access and process, so that these files can be enriched with protein sequences and Inferred from Electronic Annotation evidence associations. The second is searching for a set of Gene Ontology terms that are semantically similar to a given query. The macro and micro problems involved, their solutions, and the objective of this study are described. This paper also describes protein sequence annotation and the Gene Ontology. The methodology of the study and the Gene Ontology based protein sequence annotation tool, extended UTMGO, are presented. Its basic version, a Gene Ontology browser based on semantic similarity search, is also introduced.
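The abstract does not say how semantic similarity between Gene Ontology terms is measured, so the following is only a minimal sketch of one common idea: two terms are similar when they share most of their ancestors in the is-a graph. The toy graph, the term names, and the function names are all hypothetical and are not taken from UTMGO.

```python
# Toy is-a graph: each term maps to its direct parents.
# Term names are hypothetical placeholders, not real GO identifiers.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
    "metal_ion_binding": {"ion_binding"},
    "kinase": {"catalysis"},
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def similarity(a, b, parents=PARENTS):
    """Jaccard overlap of the two ancestor sets (assumed measure, 1.0 = same term)."""
    anc_a, anc_b = ancestors(a, parents), ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def most_similar(query, k=3, parents=PARENTS):
    """Rank all other terms by similarity to the query, best first."""
    scored = [(similarity(query, t, parents), t) for t in parents if t != query]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]
```

Calling `most_similar("metal_ion_binding")` ranks `ion_binding` first because the two terms share three of their four combined ancestors. A real search would load the term graph from the split ontology files rather than from a hard-coded dictionary.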

SeqWord Gene Island Sniffer: a Program to Study the Lateral Genetic Exchange among Bacteria

SeqWord Gene Island Sniffer, a new program for the identification of mobile genetic elements in sequences of bacterial chromosomes, is presented. This program is based on the analysis of oligonucleotide usage variations in DNA sequences. In total, 3,518 mobile genetic elements were identified in 637 bacterial genomes and further analyzed by sequence similarity and the functionality of encoded proteins. The results of this study are stored in an open database (http://anjie.bi.up.ac.za/geidb/geidbhome.php). The program and the database provide information valuable for further investigation of the distribution of mobile genetic elements and virulence factors among bacteria. The program is available for download at www.bi.up.ac.za/SeqWord/sniffer/index.html.
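To illustrate what an analysis of oligonucleotide usage variation can look like, the sketch below compares the tetranucleotide frequencies of each sliding window with those of the whole sequence and reports how far each window deviates. It is a simplified stand-in, not the statistics used by SeqWord Gene Island Sniffer; the window size, step, distance measure, and function names are assumptions chosen for the example.

```python
from collections import Counter


def kmer_profile(seq, k=4):
    """Relative frequency of every overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def profile_distance(p, q):
    """Sum of absolute frequency differences (0 = identical, 2 = disjoint)."""
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in set(p) | set(q))


def window_deviations(genome, window=40, step=20, k=4):
    """Return (start, distance) for each window compared with the whole sequence."""
    reference = kmer_profile(genome, k)
    results = []
    for start in range(0, len(genome) - window + 1, step):
        local = kmer_profile(genome[start:start + window], k)
        results.append((start, profile_distance(local, reference)))
    return results


if __name__ == "__main__":
    # Synthetic data: a repetitive "host" with a compositionally different insert.
    host = "ATGCGTACGTTAGC" * 20
    island = "GGGGCCCCGGGGCCCC" * 5
    genome = host[:140] + island + host[140:]
    for start, dist in window_deviations(genome):
        print(start, round(dist, 3))
```

Windows that fall inside the inserted region stand out with the largest distances, which is the kind of signal a candidate mobile element would produce. Real genomes need much larger windows than this toy example uses.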

A Novel Approach for Protein Classification Using Fourier Transform

Discovering new biological knowledge from high-throughput biological data is a major challenge for bioinformatics today. To address this challenge, we developed a new approach for protein classification. Proteins that are evolutionarily, and thereby functionally, related are said to belong to the same classification. Identifying protein classification is of fundamental importance for documenting the diversity of the known protein universe. It also provides a means to determine the functional roles of newly discovered protein sequences. Our goal is to predict the functional classification of novel protein sequences based on a set of features extracted from each protein sequence. The proposed technique uses datasets extracted from the Structural Classification of Proteins (SCOP) database. A set of spectral domain features based on the Fast Fourier Transform (FFT) is used. The proposed classifier uses a multilayer back-propagation (MLBP) neural network for protein classification. The maximum classification accuracy is about 91% when applying the classifier to the full four levels of the SCOP database. However, it reaches a maximum of 96% when the classification is limited to the family level. The classification results reveal that the spectral domain contains information that can be used for classification with high accuracy. In addition, the results emphasize that sequence similarity measures are of great importance, especially at the family level.
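As a rough illustration of FFT-based spectral features, the sketch below maps each residue to a number, pads the signal to a fixed length, and keeps the magnitude spectrum as a fixed-size vector that could be fed to a neural network. The abstract does not state which numeric encoding the authors use; the Kyte-Doolittle hydropathy scale, the signal length of 64, and the function names here are assumptions.

```python
import cmath

# Kyte-Doolittle hydropathy values, used here only as an example encoding.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectral_features(sequence, size=64):
    """Encode a sequence and return the first size // 2 spectrum magnitudes.

    The signal is truncated to `size` residues, mean-centred, and zero-padded.
    `size` must be a power of two; unknown residues are encoded as 0.0.
    """
    signal = [HYDROPATHY.get(res, 0.0) for res in sequence.upper()][:size]
    mean = sum(signal) / len(signal) if signal else 0.0
    signal = [x - mean for x in signal]
    signal += [0.0] * (size - len(signal))
    return [abs(c) for c in fft(signal)[:size // 2]]
```

Every sequence, whatever its length, yields a vector of `size // 2` values, so proteins of different lengths can share one classifier input layer. Truncating long sequences is a simplification made for brevity.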

Sequence Relationships Similarity of Swine Influenza A (H1N1) Virus

In April 2009, a new variant of Influenza A virus subtype H1N1 emerged in Mexico and spread all over the world. Influenza A has three subtypes in humans (H1N1, H1N2 and H3N2). Types B and C influenza tend to be associated with local or regional epidemics. Preliminary genetic characterization of the influenza viruses has identified them as swine influenza A (H1N1) viruses. Nucleotide sequence analysis of the haemagglutinin (HA) and neuraminidase (NA) genes shows that the majority of the genes are similar to those of swine influenza viruses, and that the two genes coding for the neuraminidase (NA) and matrix (M) proteins are similar to the corresponding genes of swine influenza. Sequence similarity between the 2009 A (H1N1) virus and its nearest relatives indicates that its gene segments have been circulating undetected for an extended period. Using nucleic acid sequence Maximum Likelihood (MCL) with empirical DNA base frequencies, we examined the phylogenetic relationships among the HA genes of H1N1 viruses deposited in GenBank that have high nucleotide sequence homology. In this paper, we used 16 HA nucleotide sequences from NCBI to compute the sequence relationship similarity of swine influenza A virus. The results are 28% for MCL, 36.64% for the optimal tree with the sum of branch length, 35.62% for the interior branch phylogeny Neighbor-Joining tree, 1.85% for the overall transition/transversion, and 8.28% for the overall mean distance.
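Two of the reported quantities, the overall transition/transversion count and the overall mean distance, can be illustrated on aligned sequences with a few lines of code. The sketch below uses the simple uncorrected proportion of differing sites (p-distance); the abstract does not say which software or substitution model produced its figures, so this should not be expected to reproduce them. All function names are hypothetical.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_substitutions(a, b):
    """Return (transitions, transversions, compared_sites) for two aligned sequences."""
    transitions = transversions = sites = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions, sites


def p_distance(a, b):
    """Proportion of compared sites at which the two sequences differ."""
    ts, tv, sites = count_substitutions(a, b)
    return (ts + tv) / sites if sites else 0.0


def overall_mean_distance(seqs):
    """Average p-distance over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)
```

For example, `count_substitutions("ACGT", "GCTT")` finds one transition (A to G) and one transversion (G to T) over four sites. The input sequences must already be aligned to equal length, for instance by a multiple alignment of the 16 HA sequences.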

Detecting Remote Protein Evolutionary Relationships via String Scoring Method

The amount of information produced by the field of biology has grown manifold and now requires the extensive use of computational techniques for its management. Within this sea of biological information, protein sequence similarity is key to detecting protein evolutionary relationships. Protein sequence similarity typically implies homology, which in turn may imply structural and functional similarities. In this work, we propose a learning method for detecting remote protein homology. The proposed method uses a transformation that converts protein sequences into fixed-dimensional representative feature vectors. Each feature vector records the sensitivity of a protein sequence to a set of amino acid substrings generated from the protein sequences of interest. These features are then used in conjunction with support vector machines for the detection of remote protein homology. The proposed method is tested and evaluated on two different benchmark protein datasets, and it is able to deliver improvements over most of the existing homology detection methods.
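The transformation from a sequence to a fixed-dimensional vector can be sketched as follows: collect substrings from a set of reference sequences, then score how well each substring matches the sequence being encoded. The abstract does not define its string scoring function, so the scoring rule below (best fraction of identical positions over all ungapped placements), the substring length, and the function names are assumptions made for illustration.

```python
def substrings(sequences, k=3):
    """Distinct length-k substrings of the reference sequences, in sorted order."""
    found = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            found.add(seq[i:i + k])
    return sorted(found)


def best_match_score(sequence, pattern):
    """Best fraction of identical positions between pattern and any window of sequence."""
    k = len(pattern)
    best = 0
    for i in range(len(sequence) - k + 1):
        matches = sum(1 for a, b in zip(sequence[i:i + k], pattern) if a == b)
        best = max(best, matches)
    return best / k


def feature_vector(sequence, patterns):
    """One score per pattern, so every sequence maps to len(patterns) features."""
    return [best_match_score(sequence, p) for p in patterns]


if __name__ == "__main__":
    references = ["MKVLA", "MKTLA"]  # toy sequences, not benchmark data
    patterns = substrings(references)
    print(patterns)
    print(feature_vector("GKVLT", patterns))
```

Because the pattern list is fixed once, sequences of any length produce vectors of the same dimension, which is what a support vector machine needs as input.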

Parallelization of Protein Sequence Similarity Algorithms using Remote Method Interface

One of the major problems in genomics is performing sequence comparison on DNA and protein sequences, which is a computationally intensive task. Sequence comparison is the basic step of all protein sequence similarity algorithms. Parallel computing is an attractive way to provide the computational power needed to speed up the lengthy process of sequence comparison. Our main aim is to enhance a protein sequence algorithm based on the dynamic programming method. In our approach, we parallelize the dynamic programming algorithm with a multithreaded program that performs the sequence comparison, and we also develop a protein database distributed among many PCs using Remote Method Interface (RMI). We show how different sizes of protein sequence data, and the computation of the scoring matrix of these protein sequences on different numbers of processors, affect processing time and speed compared with sequential processing.
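The general pattern of comparing one query against many database sequences in parallel can be sketched as below. This is not the authors' Java RMI implementation: it assumes a Needleman-Wunsch global alignment score with arbitrary example scoring parameters, and it uses a thread pool on a single machine in place of a distributed database. Function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping only two matrix rows."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def search(query, database, workers=4):
    """Score the query against every database entry in parallel; best hit first."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda seq: alignment_score(query, seq), database))
    return sorted(zip(scores, database), key=lambda pair: -pair[0])


if __name__ == "__main__":
    toy_database = ["MKV", "MKL", "AAA"]  # toy sequences, not real proteins
    for score, seq in search("MKV", toy_database):
        print(score, seq)
```

Each comparison is independent, so the work divides cleanly across workers. In CPython, threads do not speed up this CPU-bound loop because of the global interpreter lock; a process pool such as `concurrent.futures.ProcessPoolExecutor` (with a module-level function instead of the lambda) is the usual way to use several processors.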