Clustering Unstructured Text Documents Using Fading Function
Clustering unstructured text documents is an
important problem in the data mining community and has a number of
applications, such as document archive filtering, document
organization, and topic detection and subject tracing. In the real
world, some already clustered documents may lose their
importance while new, more significant documents emerge.
Most work to date on clustering unstructured text
documents overlooks this aspect of clustering. This paper addresses
the issue by using a Fading Function. The unstructured text
documents are clustered, and for each cluster a statistical structure
called a Cluster Profile (CP) is implemented. The Cluster Profile
incorporates the Fading Function, which keeps
track of the time-dependent importance of the cluster. The work
proposes a novel algorithm, the Clustering n-ary Merge Algorithm
(CnMA), for unstructured text documents, which uses the Cluster Profile
and the Fading Function. Experimental results illustrating the
effectiveness of the proposed technique are also included.
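
The abstract does not define the Fading Function or the contents of the Cluster Profile. As a minimal sketch only, the Python below assumes an exponential fading function of the form 2^(-lambda * t), a common choice in the stream-clustering literature, and a hypothetical profile that stores a faded document weight and faded term counts. The names fading, ClusterProfile, decay_rate, fade_to and add_document are illustrative assumptions, not the paper's definitions, and the sketch does not implement CnMA.

```python
from collections import Counter
from dataclasses import dataclass, field


def fading(elapsed: float, decay_rate: float) -> float:
    """Assumed exponential fading weight: halves every 1/decay_rate time units."""
    return 2.0 ** (-decay_rate * elapsed)


@dataclass
class ClusterProfile:
    """Hypothetical cluster summary (not the paper's CP definition)."""
    decay_rate: float
    last_update: float = 0.0
    weight: float = 0.0  # faded number of documents in the cluster
    term_weights: Counter = field(default_factory=Counter)  # faded term counts

    def fade_to(self, now: float) -> None:
        """Decay all stored statistics from last_update up to `now`."""
        factor = fading(now - self.last_update, self.decay_rate)
        self.weight *= factor
        for term in self.term_weights:
            self.term_weights[term] *= factor
        self.last_update = now

    def add_document(self, terms: list[str], now: float) -> None:
        """Fade the existing statistics, then add the new document at full weight."""
        self.fade_to(now)
        self.weight += 1.0
        self.term_weights.update(terms)


profile = ClusterProfile(decay_rate=0.5)
profile.add_document(["text", "mining"], now=0.0)
profile.add_document(["text"], now=2.0)
print(profile.weight, dict(profile.term_weights))
```

Under these assumptions, a cluster that stops receiving documents sees its weight shrink toward zero, so a threshold on weight could be used to retire clusters that are no longer important.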
[1] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A.
Saarela, "Self organization of a massive document collection", IEEE
Trans. Neural Networks, vol. 11, 2000, pp. 574-585.
[2] J. Tantrum, A. Murua, W. Stuetzle, "Hierarchical model-based clustering
of large datasets through fractionation and refractionation", Proc. 8th
ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2002,
pp. 183-190.
[3] I. S. Dhillon, D. S. Modha, "Concept decompositions for large sparse
text data using clustering", Machine Learning, vol. 42, 2001, pp. 143-
175.
[4] M. Steinbach, G. Karypis, V. Kumar, "A comparison of document
clustering techniques", KDD Workshop on Text Mining, 2000, pp. 109-
110.
[5] S. Vaithyanathan, B. Dom, "Model-based hierarchical clustering", Proc.
16th Conf. Uncertainty in Artificial Intelligence, 2000, pp. 599-608.
[6] M. Meila, D. Heckerman, "An experimental comparison of model-based
clustering methods", Machine Learning, vol. 42, 2001, pp. 9-29.
[7] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, "Streaming data
algorithms for high-quality clustering", In Proc. ICDE, San Jose, CA,
February 2002, pp. 685-704.
[8] S. Guha, N. Mishra, R. Motwani, L. O'Callaghan, "Clustering data
streams", In Proc. FOCS, California, November 2000, pp. 359-366.
[9] C. C. Aggarwal, J. Han, J. Wang, P. S. Yu, "A framework for clustering
evolving data streams", In Proc. VLDB, Berlin, September 2003, pp. 81-
92.
[10] C. C. Aggarwal, P. S. Yu, "A framework for clustering massive text and
categorical data streams", In Proc. SIAM Conference on Data Mining,
Bethesda, MD, April 2006, pp. 407-411.
[11] Y. B. Liu, J. R. Cai, J. Yin, "Clustering text data streams", Journal of
Computer Science and Technology, vol. 23(1), Jan. 2008, pp. 112-128.
[12] http://www.nsf.gov/awardsearch
@article{"International Journal of Information, Control and Computer Sciences:61594",
  author = "Pallav Roxy and Durga Toshniwal",
  title = "Clustering Unstructured Text Documents Using Fading Function",
  journal = "International Journal of Information, Control and Computer Sciences",
  abstract = "Clustering unstructured text documents is an
important problem in the data mining community and has a number of
applications, such as document archive filtering, document
organization, and topic detection and subject tracing. In the real
world, some already clustered documents may lose their
importance while new, more significant documents emerge.
Most work to date on clustering unstructured text
documents overlooks this aspect of clustering. This paper addresses
the issue by using a Fading Function. The unstructured text
documents are clustered, and for each cluster a statistical structure
called a Cluster Profile (CP) is implemented. The Cluster Profile
incorporates the Fading Function, which keeps
track of the time-dependent importance of the cluster. The work
proposes a novel algorithm, the Clustering n-ary Merge Algorithm
(CnMA), for unstructured text documents, which uses the Cluster Profile
and the Fading Function. Experimental results illustrating the
effectiveness of the proposed technique are also included.",
  keywords = "Clustering, Text Mining, Unstructured Text Documents, Fading Function",
  volume = "3",
  number = "4",
  pages = "1147-8",
}