Abstract: The growth in the volume of text data such as books
and articles in libraries for centuries has imposed to establish
effective mechanisms to locate them. Early techniques such as
abstraction, indexing and the use of classification categories have
marked the birth of a new field of research called "Information
Retrieval". Information Retrieval (IR) can be defined as the task of
defining models and systems whose purpose is to facilitate access to
a set of documents in electronic form (corpus) to allow a user to find
the relevant ones for him, that is to say, the contents which matches
with the information needs of the user. This paper presents a new
semantic indexing approach of a documentary corpus. The indexing
process starts first by a term weighting phase to determine the
importance of these terms in the documents. Then the use of a
thesaurus like Wordnet allows moving to the conceptual level.
Each candidate concept is evaluated by determining its level of
representation of the document, that is to say, the importance of the
concept in relation to other concepts of the document. Finally, the
semantic index is constructed by attaching to each concept of the
ontology, the documents of the corpus in which these concepts are
found.
Abstract: The growth in the volume of text data such as books
and articles in libraries for centuries has imposed to establish
effective mechanisms to locate them. Early techniques such as
abstraction, indexing and the use of classification categories have
marked the birth of a new field of research called "Information
Retrieval". Information Retrieval (IR) can be defined as the task of
defining models and systems whose purpose is to facilitate access to
a set of documents in electronic form (corpus) to allow a user to find
the relevant ones for him, that is to say, the contents which matches
with the information needs of the user.
Most of the models of information retrieval use a specific data
structure to index a corpus which is called "inverted file" or "reverse
index".
This inverted file collects information on all terms over the corpus
documents specifying the identifiers of documents that contain the
term in question, the frequency of each term in the documents of the
corpus, the positions of the occurrences of the word...
In this paper we use an oriented object database (db4o) instead of
the inverted file, that is to say, instead to search a term in the inverted
file, we will search it in the db4o database.
The purpose of this work is to make a comparative study to see if
the oriented object databases may be competing for the inverse index
in terms of access speed and resource consumption using a large
volume of data.