Applying Clustering of Hierarchical K-means-like Algorithm on Arabic Language

This study implements a clustering technique: a K-means-like algorithm with a hierarchical initial set (HKM). The goal is to show that clustering a document set improves the precision of an information retrieval system, as Bellot and El-Bèze demonstrated for French [6]. A traditional information retrieval system is compared with a clustered one, and the effect of increasing the number of clusters on precision is also studied. Documents are indexed with Term Frequency * Inverse Document Frequency (TF * IDF) weighting. On 242 Arabic abstracts from the Saudi Arabian National Computer Conference, HKM clustering with 3 clusters gives significantly better results than the traditional information retrieval system without clustering. Additionally, increasing the number of clusters was found not to be necessary for further gains in precision.
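The pipeline summarized above (TF * IDF indexing, a hierarchical step that supplies the initial set, then K-means-like refinement) can be illustrated with a minimal Python sketch. It is only one possible reading of the approach, not the implementation used in the study: the TF * IDF variant (raw term frequency times log(N/df)), the cosine similarity measure, the centroid-based merging rule, and all function names (`tf_idf`, `hierarchical_seeds`, `kmeans_like`) are assumptions introduced for illustration, and the toy English tokens merely stand in for preprocessed Arabic terms.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.
    Assumed variant: raw term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: f * math.log(n / df[t]) for t, f in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def hierarchical_seeds(vectors, k):
    """Agglomerative step: repeatedly merge the two groups whose centroids
    are most similar until k groups remain; return their centroids."""
    groups = [[v] for v in vectors]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = cosine(centroid(groups[i]), centroid(groups[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        groups[i] += groups.pop(j)
    return [centroid(g) for g in groups]

def kmeans_like(vectors, seeds, max_iter=20):
    """K-means-like refinement: assign each document to its most similar
    center, recompute the centers, and stop when assignments are stable."""
    centers = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(len(centers)),
                          key=lambda c: cosine(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels

# Toy corpus (hypothetical tokens, two obvious topics).
docs = [
    ["network", "router", "packet"],
    ["network", "packet", "protocol"],
    ["database", "query", "index"],
    ["database", "index", "table"],
]
vectors = tf_idf(docs)
labels = kmeans_like(vectors, hierarchical_seeds(vectors, k=2))
print(labels)
```

In a clustered retrieval setting, a query would be weighted with the same TF * IDF scheme and compared against the cluster centers first, so that only documents in the closest cluster are ranked. The pairwise merging step is quadratic in the number of documents at each pass, which is acceptable for a collection of a few hundred abstracts but would need a sampled subset or a faster linkage method for larger corpora.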





References:
[1] A. McCallum and K. Nigam, "A Comparison of Event Models for Naive
Bayes Text Classification", in Proc. of the AAAI-98/ICML-98
Workshop on Learning for Text Categorization (AAAI), Madison,
1998, pp. 71-74.
[2] D. Fragoudis, D. Meretakis and S. Likothanassis, Integrating Feature
and Instance Selection for Text Classification, 2000, pp. 27-37.
[3] K. Nigam, A. K. McCallum, S. Thrun and T. Mitchell, Text Classification
from Labeled and Unlabeled Documents using EM. Kluwer Academic
Publishers, Boston, 1999.
[4] K. Thompson and R. Nickolov, "A Clustering-Based Algorithm for
Automatic Document Separation", in Proc. of the SIGIR 2002,
Workshop on Information Retrieval, 2002, pp. 38-43.
[5] N. Slonim and N. Tishby, "The Power of Word Clusters for Text
Classification", in Proc. of the 23rd European Colloquium on
Information Retrieval Research, 2001, pp. 1-12.
[6] P. Bellot and M. El-Bèze, "Clustering by means of Unsupervised
Decision Trees or Hierarchical and K-means-like Algorithm", in Proc. of
RIAO 2000, pp. 344-363.
[7] P. Dai, U. Iurgel and G. Rigoll, "A Novel Feature Combination
Approach for Spoken Document Classification with Support Vector
Machines", in Proc. Multimedia Information Retrieval Workshop in
conjunction, 2003, pp. 1-5.
[8] R. Ghani, "Using error-correcting codes for text classification", in Proc.
17th International Conference on Machine Learning (ICML-00),
Stanford, CA, 2000, pp. 303-310.
[9] R. Ramakrishnan and J. Gehrke, Database Management Systems.
McGraw-Hill, 2002.
[10] T. Theeramunkong and V. Lertnattee, "Multi-Dimensional Text
Classification", in Proc. of the 19th International Conference on
Computational Linguistics, Taipei, 2002, pp. 34-38.
[11] Y. Fang, S. Parthasarathy, and F. Schwartz, "Using Clustering to Boost
Text Classification", in Proc. of the IEEE International Conference on
Data Mining, California, USA, 2001, pp. 123-127.