Abstract: A number of competing methodologies have been developed
to identify genes and classify DNA sequences into coding
and non-coding sequences. This classification process is fundamental
in gene finding and gene annotation tools and is one of the most
challenging tasks in bioinformatics and computational biology. An
information-theoretic measure based on mutual information has shown
good accuracy in classifying DNA sequences as coding or non-coding.
In this paper we describe a species-independent, iterative
approach that distinguishes coding from non-coding sequences using
the mutual information measure (MIM). A set of 60 prokaryotes was
used to extract universal training data. To facilitate comparisons with
the published results of other researchers, a test set of 51 bacterial
and archaeal genomes was used to evaluate MIM. The evaluation
shows that MIM outperforms these published results while remaining
species independent.
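
The abstract does not define the measure itself, so the following is only a minimal sketch of the underlying quantity: the mutual information between nucleotides that lie k positions apart in a sequence. The function name `mutual_information` and the choice of lag are illustrative assumptions, not the authors' MIM or their iterative training procedure.

```python
from collections import Counter
from math import log2


def mutual_information(seq: str, k: int) -> float:
    """Mutual information (in bits) between symbols k positions apart.

    Hypothetical helper for illustration; probabilities are plain
    frequency estimates taken from the sequence itself.
    """
    seq = seq.upper()
    # All (symbol at i, symbol at i + k) pairs; empty if the sequence is too short.
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (left[a] * right[b]))
    return mi


if __name__ == "__main__":
    # Toy sequence with a strict period of three, evaluated at several lags.
    toy = "ACG" * 50
    for lag in (1, 2, 3):
        print(lag, round(mutual_information(toy, lag), 3))
```

A classifier built on this quantity would compare such values (for example, across several lags) between a candidate sequence and training data; how MIM does this is not specified here.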
Abstract: Tandem mass spectrometry (MS/MS) is the engine
driving high-throughput protein identification. Protein mixtures, possibly
containing thousands of proteins from multiple species, are
treated with proteolytic enzymes that cut the proteins into smaller
peptides, which are then analyzed to generate MS/MS spectra. The
task of determining the identity of the peptide from its spectrum
is currently the weak point in the process. Current approaches to de
novo sequencing are able to compute candidate peptides efficiently.
The problem lies in the limitations of current scoring functions. In this
paper we introduce the concept of a proteome signature. By examining
proteins and compiling proteome signatures (amino acid usage), it is
possible to characterize likely combinations of amino acids and better
distinguish between candidate peptides. Our results strongly support
the hypothesis that a scoring function that considers amino acid usage
patterns is better able to distinguish between candidate peptides. This
in turn leads to higher accuracy in peptide prediction.
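
As a rough illustration of the idea, the sketch below compiles a signature as smoothed single-residue frequencies and ranks candidate peptides by their mean log-probability under it. This is an assumption-laden simplification: the abstract describes the signature only as amino acid usage, and the authors' signature and scoring function may be richer (for example, capturing residue combinations). The names `proteome_signature` and `usage_score`, the pseudocount, and the toy sequences are all hypothetical.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def proteome_signature(proteins, pseudocount=1.0):
    """Smoothed amino acid usage frequencies for a set of protein sequences.

    Hypothetical helper: the pseudocount keeps unseen residues from
    receiving zero probability.
    """
    counts = Counter()
    for protein in proteins:
        counts.update(ch for ch in protein.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def usage_score(peptide, signature):
    """Mean log-probability per residue of a peptide under a signature."""
    residues = [ch for ch in peptide.upper() if ch in signature]
    if not residues:
        return float("-inf")
    return sum(log(signature[ch]) for ch in residues) / len(residues)


if __name__ == "__main__":
    # Toy "proteome" and candidates, for illustration only.
    signature = proteome_signature(["MKLLLA", "MALLKK"])
    candidates = ["LLK", "WWC"]
    ranked = sorted(candidates, key=lambda p: usage_score(p, signature), reverse=True)
    print(ranked)
```

In practice such a usage term would be combined with a spectrum-matching score rather than used on its own; dividing by peptide length keeps candidates of different lengths comparable.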