Prediction of Protein Subchloroplast Locations using Random Forests

The subchloroplast location of a protein is correlated with its function. Although a large number of protein sequences are available, far less is known about their locations and functions. Experimental identification of protein locations and functions is costly and time-consuming, so accurate prediction of subchloroplast locations can accelerate the study of chloroplast protein function. This study proposes a Random Forest based method, ChloroRF, that predicts protein subchloroplast locations from interpretable physicochemical properties. In addition to achieving high prediction accuracy, ChloroRF can select the physicochemical properties that matter most for prediction. These important properties are also analyzed to provide insight into the underlying mechanisms.
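
As an illustration of what "interpretable physicochemical properties" can mean in practice, the following minimal sketch encodes a protein sequence by averaging a per-residue property over its amino acids. The abstract does not state how ChloroRF builds its features, so the averaging scheme, the function names `mean_property` and `encode`, and the choice of the Kyte-Doolittle hydropathy scale are all assumptions made for illustration only. Feature vectors of this kind could then be passed to a random forest classifier.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property.
# The choice of this scale is an assumption, not taken from the paper.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_property(sequence, index):
    """Average a per-residue property over the residues covered by `index`.

    Residues missing from `index` (for example 'X') are skipped.
    """
    values = [index[aa] for aa in sequence.upper() if aa in index]
    if not values:
        raise ValueError("sequence contains no residues covered by the index")
    return sum(values) / len(values)


def encode(sequence, indices):
    """Build one feature per property index (hypothetical encoding)."""
    return [mean_property(sequence, index) for index in indices]


# Two toy sequences (not real proteins): one hydrophobic, one polar/charged.
hydrophobic = "LLIVAVL"
polar = "DEKRNQS"

print(encode(hydrophobic, [KYTE_DOOLITTLE]))
print(encode(polar, [KYTE_DOOLITTLE]))
```

Each feature here corresponds to a single named property, so an importance score that a random forest assigns to a feature can be read directly as the importance of that property.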



