Dimensionality Reduction of PSSM Matrix and its Influence on Secondary Structure and Relative Solvent Accessibility Predictions

State-of-the-art methods for secondary structure prediction (Porter, Psi-PRED, SAM-T99sec, Sable) and solvent accessibility prediction (Sable, ACCpro) use evolutionary profiles represented by the position-specific scoring matrix (PSSM). Evolutionary profiles have been shown to be the most important features for these predictions. Unfortunately, using the PSSM leads to high-dimensional feature spaces, which can cause problems with parameter optimization and generalization. Several recently published studies suggest that applying feature extraction to the PSSM may improve secondary structure prediction. However, none of the top-performing methods considered here uses dimensionality reduction to improve generalization. In the present study, we use simple and fast feature selection methods (t-statistics, information gain) that reduce the dimensionality of the PSSM by 75% and improve generalization in secondary structure prediction compared with the Sable server.
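
As an illustration of the kind of filter-based feature selection described above, the following minimal sketch ranks the columns of a feature matrix by a t-statistic or by information gain and keeps the top 25%. It is not the implementation used in this study. The assumptions are: a binary one-vs-rest label (for example helix versus non-helix), the Welch form of the t-statistic, a median threshold to discretize each feature before computing information gain, a hypothetical 15-residue window giving 15 x 20 = 300 PSSM features, and synthetic data in place of real profiles. All function names are illustrative.

```python
import numpy as np


def welch_t_scores(X, y):
    """Absolute Welch t-statistic of each feature column for a binary label y."""
    pos, neg = X[y == 1], X[y == 0]
    mean_diff = pos.mean(axis=0) - neg.mean(axis=0)
    # Standard error of the difference; the small constant avoids division by zero.
    se = np.sqrt(pos.var(axis=0, ddof=1) / len(pos)
                 + neg.var(axis=0, ddof=1) / len(neg))
    return np.abs(mean_diff) / (se + 1e-12)


def information_gain_scores(X, y):
    """Information gain of each feature after thresholding it at its median (assumption)."""
    def entropy(labels):
        if labels.size == 0:
            return 0.0
        p = np.bincount(labels) / labels.size
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    base = entropy(y)
    gains = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        above = X[:, j] > np.median(X[:, j])
        w = above.mean()  # fraction of samples above the threshold
        gains[j] = base - w * entropy(y[above]) - (1 - w) * entropy(y[~above])
    return gains


def top_fraction(scores, keep=0.25):
    """Sorted indices of the highest-scoring features, keeping a `keep` fraction."""
    k = max(1, int(round(keep * scores.size)))
    return np.sort(np.argsort(scores)[::-1][:k])


# Synthetic stand-in for windowed PSSM features: 400 residues x 300 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 300))
y = rng.integers(0, 2, size=400)   # hypothetical one-vs-rest class label
X[:, :10] += 2.0 * y[:, None]      # make the first 10 features informative

selected_t = top_fraction(welch_t_scores(X, y))
selected_ig = top_fraction(information_gain_scores(X, y))
X_reduced = X[:, selected_t]       # 75% of the columns are discarded
print(X_reduced.shape)
```

Both scores are computed per feature and independently of the downstream predictor, which is what makes this kind of selection fast. For a three-state secondary structure problem, one option under the same assumptions would be to score each class against the rest and combine the rankings.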

