A Text Clustering System Based on k-Means Type Subspace Clustering and Ontology

This paper presents a text clustering system built on a k-means type subspace clustering algorithm for clustering large, high-dimensional, sparse text data. The algorithm adds a new step to the k-means clustering process that automatically computes the weight of each keyword in each cluster, so that the important words of a cluster can be identified by their weight values. To make the clustering results understandable and interpretable, a few keywords that best represent the semantic topic of each cluster are extracted. Two methods are used to extract these representative words. Candidate words are first selected according to the weights computed by the new algorithm; the candidates are then passed to WordNet to identify the nouns among them and to consolidate synonyms and hyponyms. Experimental results show that the clustering algorithm outperforms other subspace clustering algorithms, such as PROCLUS and HARP, as well as k-means type algorithms such as Bisecting-KMeans. Furthermore, the word extraction method is effective in selecting words that represent the topics of the clusters.
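The keyword-weighting step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `weighted_kmeans`, the exponential dispersion-based weight update, and the `beta` parameter are choices made for this sketch; the paper's actual update rule may differ. The key idea it demonstrates is shared with the abstract: alongside the usual assignment and centroid updates, per-cluster feature weights are recomputed so that features (keywords) with low within-cluster dispersion receive high weight and can be read off as the cluster's important words.

```python
import numpy as np

def weighted_kmeans(X, k, beta=0.5, n_iter=20, init=None, seed=0):
    """Sketch of a feature-weighting k-means: each cluster keeps its own
    feature-weight vector, updated from within-cluster dispersion.
    (Illustrative only; the weight update here is an assumption.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = (X[rng.choice(n, k, replace=False)] if init is None
               else np.asarray(init, dtype=float)).copy()
    weights = np.full((k, d), 1.0 / d)   # start with uniform feature weights
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assignment step: weighted squared Euclidean distance to each center.
        dist = np.stack([((X - centers[j]) ** 2 * weights[j]).sum(axis=1)
                         for j in range(k)], axis=1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            centers[j] = members.mean(axis=0)
            # Weight step: features with low dispersion inside this cluster
            # get large weights; beta controls how sharply they concentrate.
            disp = ((members - centers[j]) ** 2).mean(axis=0)
            w = np.exp(-(disp - disp.min()) / beta)
            weights[j] = w / w.sum()
    return labels, centers, weights
```

In a text-clustering setting, each column of `X` would be a keyword dimension of the document-term matrix, and the top-weighted columns of each row of `weights` would serve as that cluster's candidate representative words, to be filtered and consolidated via WordNet.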

References:
[1] "Textminer." (Online). Available:
http://www.sas.com/technologies/analyticsdatamining/textminer/
[2] "Textanalyst." (Online). Available:
http://www.megaputer.com/products/ta/index.php3
[3] "gcluto." (Online). Available:
http://www-users.cs.umn.edu/~karypis/cluto/gcluto/
[4] "Refviz." (Online). Available: http://www.refviz.com
[5] W. Fan, L. Wallace, S. Rich, and Z. Zhang, "Tapping into the power of
text mining," Communications of the ACM, 2005.
[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
[7] A. Hotho, A. Maedche, and S. Staab, "Ontology-based text document
clustering," Seattle, USA, 2001.
[8] L. Jing, M. Ng, J. Xu, and Z. Huang, "Subspace clustering of text
documents with feature weighting k-means algorithm," 2005, pp. 802-
812.
[9] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database, The MIT Press,
1998.
[10] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park, "Fast algorithms
for projected clustering," Proc. ACM SIGMOD, pp. 61-72, 1999.
[11] K. Y. Yip, D. W. Cheung, and M. K. Ng, "A practical projected clustering
algorithm," IEEE Transactions on Knowledge and Data Engineering,
vol. 16, no. 11, pp. 1387-1397, 2004.
[12] "20-newsgroups." (Online). Available:
http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
[13] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review,"
ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[14] J. Han, M. Kamber, and A. Tung, "Spatial clustering methods in data
mining: a survey," Geographic Data Mining and Knowledge Discovery,
2001.
[15] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high
dimensional data: a review," SIGKDD Explorations, vol. 6, no. 1, pp.
90-105, 2004.
[16] J. MacQueen, "Some methods for classification and analysis of multivariate
observations," in Proc. 5th Berkeley Symposium on Mathematical Statistics
and Probability, 1967, pp. 281-297.
[17] I. S. Dhillon, J. Fan, and Y. Guan, "Efficient clustering of very large
document collections," Data Mining for Scientific and Engineering Applications,
pp. 357-381, 2001.
[18] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document
clustering techniques," in Proc. KDD Workshop on Text Mining, 2000.
[19] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley, 2000.
[20] Y. Zhao and G. Karypis, "Comparison of agglomerative and partitional
document clustering algorithms," Technical report 02-014, University
of Minnesota, 2002.
[21] C. Aggarwal and P. Yu, "Finding generalized projected clusters in high
dimensional spaces," Proc. ACM SIGMOD, pp. 70-81, 2000.
[22] H. Frigui and O. Nasraoui, "Unsupervised learning of prototypes and
attribute weights," Pattern Recognition, vol. 37, no. 3, pp. 567-581, 2004.
[23] E. Y. Chan, W. K. Ching, M. K. Ng, and J. Z. Huang, "An optimization algorithm for
clustering using weighted dissimilarity measures," Pattern Recognition,
vol. 37, no. 5, pp. 943-952, 2004.
[24] M. W. Berry, S. T. Dumais, and G. W. O'Brien, "Using linear algebra for intelligent
information retrieval," SIAM Review, vol. 37, pp. 573-595, 1995.
[25] I. Herman, G. Melancon, and M. Marshall, "Graph visualization and
navigation in information visualization: a survey," IEEE Transactions on
Visualization and Computer Graphics, vol. 6, no. 1, pp. 24-43, 2000.
[26] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text
data using clustering," Machine Learning, vol. 42, no. 1, pp. 143-175,
2001.
[27] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, "Distributional
word clusters vs. words for text categorization," Journal of Machine
Learning Research, vol. 3, no. 7-8, pp. 1183-1208, 2003.
[28] A. McCallum, "Bow: A toolkit for statistical language modeling,
text retrieval, classification and clustering," 1996. (Online). Available:
http://www.cs.cmu.edu/~mccallum/bow
[29] J. Bezdek, "A convergence theorem for the fuzzy ISODATA clustering
algorithms," IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol. 2, pp. 1-8, 1980.
[30] P. Hague and P. Harris, Sampling and statistics, London, 1993.
[31] S. Selim and M. Ismail, "k-means type algorithms: a generalized
convergence theorem and characterization of local optimality," IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 6, no. 1, pp.
81-87, 1984.
[32] W. Li, "Random texts exhibit Zipf's-law-like word frequency distribution,"
IEEE Transactions on Information Theory, vol. 38, no. 6, pp.
1842-1845, 1992.
[33] S. Zhong and J. Ghosh, "A comparative study of generative models for
document clustering," San Francisco, CA, May 2003.
[34] I. Katsavounidis, C. Kuo, and Z. Zhang, "A new initialization technique
for generalized Lloyd iteration," IEEE Signal Processing Letters, vol. 1, no. 10,
pp. 144-146, 1994.
[35] D. Swanson, "Medical literature as a potential source of new knowledge,"
Bull Med Libr Assoc., vol. 78, no. 1, pp. 29-37, 1990.