Abstract: In this paper a one-dimension Self Organizing Map
algorithm (SOM) to perform feature selection is presented. The
algorithm is based on a first classification of the input dataset on a
similarity space. From this classification for each class a set of
positive and negative features is computed. This set of features is
selected as result of the procedure. The procedure is evaluated on an
in-house dataset from a Knowledge Discovery from Text (KDT)
application and on a set of publicly available datasets used in
international feature selection competitions. These datasets come
from KDT applications, drug discovery as well as other applications.
The knowledge of the correct classification available for the training
and validation datasets is used to optimize the parameters for positive
and negative feature extractions. The process becomes feasible for
large and sparse datasets, as the ones obtained in KDT applications,
by using both compression techniques to store the similarity matrix
and speed up techniques of the Kohonen algorithm that take
advantage of the sparsity of the input matrix. These improvements
make it feasible, by using the grid, the application of the
methodology to massive datasets.
Abstract: In this paper, a model for an information retrieval
system is proposed which takes into account that knowledge about
documents and information need of users are dynamic. Two
methods are combined, one qualitative or symbolic and the other
quantitative or numeric, which are deemed suitable for many
clustering contexts, data analysis, concept exploring and
knowledge discovery. These two methods may be classified as
inductive learning techniques. In this model, they are introduced to
build “long term" knowledge about past queries and concepts in a
collection of documents. The “long term" knowledge can guide
and assist the user to formulate an initial query and can be
exploited in the process of retrieving relevant information. The
different kinds of knowledge are organized in different points of
view. This may be considered an enrichment of the exploration
level which is coherent with the concept of document/query
structure.