Information Filtering using Index Word Selection based on the Topics

We propose an information filtering system that selects index words according to the topics contained in a document set. The method narrows the vocabulary down to the words that are particularly characteristic of the document set, and the topics are obtained by Sparse Non-negative Matrix Factorization. In information filtering, a document is often represented as a vector whose elements are the weights of the index words. As the number of documents grows, the dimension of this vector grows too, so the index is likely to include words that are useless for filtering. To address this problem, the dimension needs to be reduced. Our proposal reduces the dimension by selecting index words based on the topics included in the document set, which we obtain by applying Sparse Non-negative Matrix Factorization to that set. Filtering is carried out against the centroid of the learning document set, which is regarded as the user's interest. The centroid is represented as a document vector whose elements are the weights of the selected index words. We evaluate the proposal on the English test collection MEDLINE. The results confirm that, when an appropriate number of index words is selected, the proposed selection improves recommendation accuracy over previous methods. In addition, we examined the index words selected by our proposal and found that it is able to select index words that cover some minor topics included in the document set.
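
The pipeline described above (topic extraction, index word selection, centroid-based filtering) can be illustrated with a minimal sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy term weights, the function names, the L1-penalized multiplicative update standing in for Sparse Non-negative Matrix Factorization, the rule of taking the top-weighted words of each topic, and the use of cosine similarity to compare a candidate with the centroid.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def sparse_nmf(v, k, sparsity=0.1, iters=200, seed=0, eps=1e-9):
    """Factor V (words x docs) ~ W (words x topics) H (topics x docs).

    Assumption: an L1 penalty on H inside multiplicative updates stands in
    for the sparse NMF named in the text.
    """
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[t][j] * num[t][j] / (den[t][j] + sparsity + eps)
              for j in range(n)] for t in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][t] * num[i][t] / (den[i][t] + eps)
              for t in range(k)] for i in range(m)]
        # Rescale each topic column of W to unit length so the penalty on H
        # cannot be dodged by inflating W.
        norms = [math.sqrt(sum(w[i][t] ** 2 for i in range(m))) or 1.0
                 for t in range(k)]
        w = [[w[i][t] / norms[t] for t in range(k)] for i in range(m)]
    return w, h

def select_index_words(w, per_topic):
    """Union of the highest-weighted word indices of each topic (assumed rule)."""
    selected = set()
    for t in range(len(w[0])):
        ranked = sorted(range(len(w)), key=lambda i: w[i][t], reverse=True)
        selected.update(ranked[:per_topic])
    return sorted(selected)

def project(doc, index):
    """Keep only the weights of the selected index words."""
    return [doc[i] for i in index]

def centroid(docs):
    return [sum(col) / len(docs) for col in zip(*docs)]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Hypothetical toy data: one row per learning document, one column per word.
vocab = ["gene", "protein", "heart", "artery", "study", "market"]
learning_docs = [
    [3, 2, 0, 0, 1, 0],
    [2, 3, 0, 0, 1, 0],
    [0, 0, 3, 2, 1, 0],
    [0, 0, 2, 3, 1, 0],
]

# Topics are extracted from the word-by-document matrix of the learning set.
w, h = sparse_nmf(transpose(learning_docs), k=2)
index = select_index_words(w, per_topic=2)

# The user's interest is the centroid of the learning set over selected words.
interest = centroid([project(d, index) for d in learning_docs])

candidates = {"relevant": [2, 1, 0, 0, 1, 0], "unrelated": [0, 0, 0, 0, 1, 3]}
scores = {name: cosine(project(doc, index), interest)
          for name, doc in candidates.items()}

print("selected index words:", [vocab[i] for i in index])
print("scores:", scores)
```

In this sketch the reduced vector has at most `k * per_topic` dimensions, so the number of topics and the number of words taken per topic together play the role of the "number of index words" discussed above. A candidate whose weight sits mostly on unselected words projects to a short or zero vector, so one would expect it to score no higher than a candidate that shares vocabulary with the learning set.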



