An Information Theoretic Approach to Rescoring Peptides Produced by De Novo Peptide Sequencing

Tandem mass spectrometry (MS/MS) is the engine driving high-throughput protein identification. Protein mixtures, possibly containing thousands of proteins from multiple species, are digested with proteolytic enzymes, which cut the proteins into smaller peptides that are then analyzed to generate MS/MS spectra. Determining the identity of a peptide from its spectrum is currently the weak point in this process. Current approaches to de novo sequencing compute candidate peptides efficiently; the problem lies in the limitations of the scoring functions used to rank those candidates. In this paper we introduce the concept of a proteome signature. By examining proteins and compiling proteome signatures (amino acid usage), it is possible to characterize likely combinations of amino acids and so better distinguish between candidate peptides. Our results strongly support the hypothesis that a scoring function that considers amino acid usage patterns is better able to distinguish between candidate peptides, which in turn leads to higher accuracy in peptide prediction.
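To make the idea of rescoring by amino acid usage concrete, the following is a minimal sketch and not the scoring function developed in the paper. It assumes that a proteome signature can be approximated by single-residue usage frequencies and that a candidate peptide can be scored by its mean log-odds against uniform usage. The function names (proteome_signature, usage_score, rescore), the pseudocount, and the uniform background are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def proteome_signature(proteins, pseudocount=1.0):
    """Estimate relative amino acid usage from protein sequences.

    Assumption: the signature is a single-residue frequency table.
    The pseudocount keeps unseen residues from getting zero probability.
    """
    counts = Counter()
    for seq in proteins:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def usage_score(peptide, signature):
    """Mean log2-odds of the peptide's residues versus uniform usage.

    Assumption: residues are scored independently; the peptide must
    contain only standard one-letter codes and must not be empty.
    """
    uniform = 1.0 / len(AMINO_ACIDS)
    return sum(math.log2(signature[a] / uniform) for a in peptide) / len(peptide)


def rescore(candidates, signature):
    """Order candidate peptides from most to least typical usage."""
    return sorted(candidates, key=lambda p: usage_score(p, signature), reverse=True)


if __name__ == "__main__":
    # Toy "proteome" rich in alanine and lysine (illustrative data only).
    signature = proteome_signature(["AAAAKK", "AAKA"])
    for peptide in rescore(["WWK", "AAK"], signature):
        print(peptide, round(usage_score(peptide, signature), 3))
```

In practice such a usage term would presumably be combined with the spectrum-based score produced by the de novo sequencer rather than replace it; how the two are weighted is left open here.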



