Weighted k-Nearest-Neighbor Techniques for High Throughput Screening Data

The k-nearest neighbors (kNN) method is a simple but effective classification technique. In this paper we present an extended version of this technique for chemical compounds used in High Throughput Screening, in which the distances to the nearest neighbors are taken into account. Our algorithm uses kernel weight functions to guide the assignment of activity in screening data. The proposed kernel weight function aims to combine properties of the graphical structure and the molecular descriptors of screening compounds. We apply the modified kNN method to several experimental data sets from biological screens. The experimental results confirm the effectiveness of the proposed method.
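To make the idea of distance-weighted voting concrete, the following is a minimal sketch, not the algorithm of this paper. It assumes that compounds are represented only by numeric descriptor vectors, that Euclidean distance is used, and that a triangular kernel is applied to distances normalized by the (k+1)-th neighbor. The names `triangular_kernel` and `weighted_knn_predict` are hypothetical, and the graph-structure component of the proposed kernel is not modeled here.

```python
import math
from collections import defaultdict


def triangular_kernel(d):
    """Illustrative weight: 1 at distance 0, falling linearly to 0 at d = 1."""
    return max(0.0, 1.0 - d)


def weighted_knn_predict(train_X, train_y, query, k=3, kernel=triangular_kernel):
    """Predict a label by kernel-weighted voting among the k nearest neighbors."""
    # Distance from the query to every training compound (descriptor space only).
    dists = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    neighbors = dists[:k]
    # Normalize by the (k+1)-th neighbor distance so weights do not depend on units.
    # Assumption: fall back to the k-th distance if there is no (k+1)-th neighbor.
    scale = dists[k][0] if len(dists) > k else neighbors[-1][0]
    scale = scale or 1.0  # guard against division by zero
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += kernel(d / scale)
    return max(votes, key=votes.get)


# Toy data (invented for illustration): one close active, three distant inactives.
train_X = [(0.2, 0.0), (3.0, 0.0), (0.0, 3.0), (0.0, -3.2)]
train_y = ["active", "inactive", "inactive", "inactive"]
query = (0.0, 0.0)

weighted = weighted_knn_predict(train_X, train_y, query, k=3)
unweighted = weighted_knn_predict(train_X, train_y, query, k=3, kernel=lambda d: 1.0)
print(weighted, unweighted)
```

In this toy case the plain majority vote follows the two distant inactive neighbors, whereas the weighted vote follows the single very close active neighbor, which is the behavior that distance weighting is meant to provide.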




