Advanced Information Extraction with n-gram based LSI
The number of documents being created is growing at an
increasing pace, yet most of them cover already known topics
and few introduce new concepts. This has started a new era in the
information retrieval (IR) discipline, one with its own requirements:
digging into topics and concepts and finding subtopics or relations
between topics. Until now, IR research has focused on retrieving
documents about a general topic or clustering documents under
generic subjects. However, these conventional approaches cannot go
deep into the content of documents, which makes it difficult for
people to reach the documents they are searching for. We therefore
need new ways of mining document sets, where the critical point is
to know much about the contents of the documents. As a solution, we
propose enhancing LSI, one of the proven IR techniques, by
supporting its vector space with n-gram forms of words. The positive
results we have obtained are shown in two application areas of the
IR domain: querying a document database and clustering documents
in the document database.
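
The abstract does not say how the n-gram forms are built, so the following is only a minimal sketch under one assumption: that "n-gram forms of words" means character n-grams of each word, added as extra dimensions next to the whole-word terms. The function names (`char_ngrams`, `features`, `cosine`), the `_` padding, and the toy texts are illustrative and not taken from the paper. The SVD step of LSI is deliberately left out; in a full pipeline these count vectors would be the columns of the term-document matrix that LSI factorizes.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with '_' at both ends."""
    padded = "_" + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def features(text, n=3):
    """Count whole words plus their character n-grams (assumed feature set)."""
    counts = Counter()
    for word in text.lower().split():
        counts["w:" + word] += 1          # the word itself
        for gram in char_ngrams(word, n):
            counts["g:" + gram] += 1      # its n-gram forms
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data, invented for illustration only.
docs = ["clustering documents", "stock market prices"]
query = "document clusters"

# Whole-word vectors: the query shares no exact word with either document.
word_only = [cosine(Counter(query.split()), Counter(d.split())) for d in docs]

# Word + n-gram vectors: shared n-grams link "document" to "documents".
with_ngrams = [cosine(features(query), features(d)) for d in docs]
```

With whole words only, the query matches neither document because "document" and "documents" are different terms. Once the n-gram dimensions are added, the first document gains a nonzero similarity while the unrelated one stays at zero, which is the kind of extra signal the enhanced vector space is meant to provide.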
[1] Bellot, P. and El-Beze, M., A Clustering Method for Information
Retrieval, Technical Report IR-0199, Laboratoire d'Informatique
d'Avignon, France, 1999.
[2] Berry, M. W., Drmac, Z. and Jessup E. R.: Matrices, Vector Spaces, and
Information Retrieval, SIAM Review, v.41 n.2, p.335-362, June 1999.
[3] Boley D., Principal direction divisive partitioning. Data Mining and
Knowledge Discovery, 2(4), 1998.
[4] Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C. and
Mercer, R. L., "Class-based n-gram models of natural language",
Computational Linguistics, vol. 18, pp. 467-479, 1992.
[5] Croft, W. B. and Xu, J.: Corpus-specific stemming using word form
co-occurrence. In Proceedings of the Fourth Annual Symposium on
Document Analysis and Information Retrieval (pp. 147-159), Las Vegas,
Nevada, 1995.
[6] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and
Harshman, R.: (1990). Indexing by latent semantic analysis. Journal of
the American Society for Information Science, 41(6), 391-407.
[7] Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification. Wiley,
New York, 2001.
[8] Ekmekcioglu, F. C., Lynch, M. F. and Willett, P. (1996): Stemming and
N-gram Matching for Term Conflation in Turkish Texts. Inf. Research,
Vol. 2, No. 2.
[9] Kohonen, T., "The Self-Organizing Map," Proceedings of the IEEE, vol.
78, no. 9, 1990, pp. 1464-1479.
[10] Lingpipe NLP Library http://www.aliasi.com/lingpipe
[11] Salton, G. and McGill, M. J.: Introduction to Modern Information
Retrieval. McGraw-Hill.
[12] Willett, P., Recent trends in hierarchical document clustering: a critical
review. Information Processing and Management, vol. 24(5), pages
577-597, 1988.
[13] Zemberek Turkish NLP Library: https://zemberek.dev.java.net/
[14] Reuters-21578 collection,
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
[15] Porter Stemmer, http://www.tartarus.org/martin/PorterStemmer/
[16] Manning, C. D. and Schütze, H., Foundations of Statistical Natural
Language Processing.
[17] Özgür, A. and Alpaydın, E., Unsupervised Machine Learning Techniques
for Text Document Clustering.
@article{Guven:59935,
  author = "Ahmet Güven and Ö. Özgür Bozkurt and Oya Kalıpsız",
  title = "Advanced Information Extraction with n-gram based LSI",
  journal = "International Journal of Information, Control and Computer Sciences",
  abstract = "The number of documents being created is growing at an increasing pace, yet most of them cover already known topics and few introduce new concepts. This has started a new era in the information retrieval (IR) discipline, one with its own requirements: digging into topics and concepts and finding subtopics or relations between topics. Until now, IR research has focused on retrieving documents about a general topic or clustering documents under generic subjects. However, these conventional approaches cannot go deep into the content of documents, which makes it difficult for people to reach the documents they are searching for. We therefore need new ways of mining document sets, where the critical point is to know much about the contents of the documents. As a solution, we propose enhancing LSI, one of the proven IR techniques, by supporting its vector space with n-gram forms of words. The positive results we have obtained are shown in two application areas of the IR domain: querying a document database and clustering documents in the document database.",
  keywords = "Document clustering, Information Extraction, Information Retrieval, LSI, n-gram",
  volume = "2",
  number = "5",
  pages = "1650-6",
}