Information Filtering using Index Word Selection based on the Topics
We propose an information filtering system that selects index
words from a document set based on the topics the set
contains. The method narrows the index down to the words that
are particularly characteristic of the document set, with the
topics obtained by Sparse Non-negative Matrix Factorization.
In information filtering, a document is often represented as a
vector whose elements are the weights of the index words, and
the dimension of this vector grows as the number of documents
increases. The index may therefore include words that are
useless for filtering, so the dimension needs to be reduced.
Our proposal reduces the dimension by selecting index words
based on the topics included in the document set, which we
obtain by applying Sparse Non-negative Matrix Factorization to
the set. Filtering is carried out using the centroid of the
learning document set, which is regarded as the user's
interest. The centroid is represented as a document vector
whose elements are the weights of the selected index words.
We evaluated the proposal on the English test collection
MEDLINE. With an appropriate number of index words, the
proposed selection improved recommendation accuracy over
previous methods. We also examined the index words selected by
our proposal and found that they covered some minor topics
included in the document set.
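
The pipeline described in the abstract (topics from a sparse factorization, index word selection per topic, a centroid profile, and similarity-based filtering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the L1-penalized multiplicative update stands in for the Sparse NMF used in the paper, cosine similarity is assumed as the matching score, and the toy matrix, the vocabulary, and the names sparse_nmf, select_index_words and score are all hypothetical.

```python
import numpy as np


def sparse_nmf(V, k, l1=0.1, iters=300, seed=0):
    """Approximate V (words x docs) by W @ H with non-negative factors.

    Assumption: an L1 penalty on H added to the standard multiplicative
    updates; this is a simple stand-in, not the paper's exact Sparse NMF.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 1e-3
    H = rng.random((k, V.shape[1])) + 1e-3
    eps = 1e-9
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + l1 + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Rescale so each topic (column of W) has unit length; W @ H is unchanged.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H


def select_index_words(W, per_topic):
    """Union of the highest-weighted words of every topic (row indices of V)."""
    top = np.argsort(-W, axis=0)[:per_topic]
    return np.unique(top)


def score(doc, profile, selected):
    """Cosine similarity between a document and the profile on selected words."""
    d = doc[selected]
    denom = np.linalg.norm(d) * np.linalg.norm(profile)
    return float(d @ profile / denom) if denom > 0 else 0.0


# Hypothetical toy data: rows are words, columns are learning documents.
vocab = ["heart", "artery", "blood", "tumor", "cancer", "cell", "the", "study"]
V = np.array([
    [3, 2, 4, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [1, 2, 2, 0, 0, 1],
    [0, 0, 0, 3, 2, 4],
    [0, 0, 0, 2, 4, 1],
    [0, 1, 0, 2, 1, 2],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 1],
], dtype=float)

W, H = sparse_nmf(V, k=2)
selected = select_index_words(W, per_topic=2)

# The user profile is the centroid of the learning documents,
# restricted to the selected index words.
profile = V[selected].mean(axis=1)

print("selected index words:", [vocab[i] for i in selected])
print("score of first learning document:", score(V[:, 0], profile, selected))
```

In this sketch the number of topics k and the number of words kept per topic together control the reduced dimension, which corresponds to the "appropriate number of index words" the abstract refers to. A new document would be recommended when its score exceeds a threshold chosen by the user of the system.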
@article{"International Journal of Information, Control and Computer Sciences:63745",
  author = "Takeru YOKOI and Hidekazu YANAGIMOTO and Sigeru OMATU",
  title = "Information Filtering using Index Word Selection based on the Topics",
  keywords = "Information Filtering, Sparse NMF, Index Word Selection, User Profile, Chi-squared Measure",
  volume = "3",
  number = "2",
  pages = "480-7"
}