Advanced Information Extraction with n-gram based LSI

The number of documents being created grows at an increasing pace, yet most of them address already known topics and only a few introduce new concepts. This has opened a new era in information retrieval (IR) with its own particular requirements: digging into topics and concepts to discover subtopics and the relations between topics. Until now, IR research has focused on retrieving documents about a general topic or clustering documents under generic subjects. These conventional approaches cannot go deep into the content of documents, which makes it difficult for people to reach the documents they are searching for. New ways of mining document sets are therefore needed, where the critical point is to know as much as possible about the contents of the documents. As a solution, we propose enhancing latent semantic indexing (LSI), one of the proven IR techniques, by supporting its vector space with n-gram forms of words. The positive results we have obtained are shown in two application areas of the IR domain: querying a document database and clustering the documents in that database.
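The idea of supporting the LSI vector space with n-grams can be illustrated with a minimal sketch. It is not the implementation evaluated in this work. It assumes character trigrams with `_` padding, raw counts as weights, a rank of k = 2, and the standard LSI fold-in of a query. The function names (`char_ngrams`, `build_matrix`, `lsi_fit`, `fold_in`) and the toy documents are hypothetical and chosen only for illustration.

```python
import numpy as np

def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with '_' at both ends."""
    padded = "_" + word.lower() + "_"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_matrix(docs, n=3):
    """Build an (n-gram x document) count matrix and the n-gram index."""
    vocab = {}
    columns = []
    for doc in docs:
        counts = {}
        for word in doc.split():
            for gram in char_ngrams(word, n):
                idx = vocab.setdefault(gram, len(vocab))
                counts[idx] = counts.get(idx, 0) + 1
        columns.append(counts)
    A = np.zeros((len(vocab), len(docs)))
    for j, counts in enumerate(columns):
        for i, c in counts.items():
            A[i, j] = c
    return A, vocab

def lsi_fit(A, k):
    """Truncated SVD: keep the k largest singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T  # rows of the last item are documents

def fold_in(query, vocab, Uk, sk, n=3):
    """Project a query into the k-dimensional space: q^T U_k S_k^{-1}."""
    q = np.zeros(Uk.shape[0])
    for word in query.split():
        for gram in char_ngrams(word, n):
            if gram in vocab:  # n-grams unseen in the collection are ignored
                q[vocab[gram]] += 1
    return (q @ Uk) / sk

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy collection (illustrative only, not data from this work).
docs = [
    "clustering documents by topic",
    "document clusters and topics",
    "stock market prices fell",
    "market price of stocks",
]
A, vocab = build_matrix(docs)
Uk, sk, doc_vecs = lsi_fit(A, k=2)

# "clustered" appears in no document, but it shares trigrams with
# "clustering" and "clusters", so the query still lands near documents 0 and 1.
q_hat = fold_in("clustered document", vocab, Uk, sk)
scores = np.array([cosine(q_hat, d) for d in doc_vecs])
ranking = np.argsort(-scores).tolist()
print(ranking, np.round(scores, 3))
```

In this sketch the n-grams take over the role that stemming usually plays: word forms that share a stem share most of their n-grams, so they map to nearby directions in the reduced space without a language-specific stemmer. The same document vectors (`doc_vecs`) could also be passed to a clustering algorithm, which corresponds to the second application area mentioned above.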



