Abstract: Search engine plays an important role in internet, to
retrieve the relevant documents among the huge number of web
pages. However, it retrieves more number of documents, which are
all relevant to your search topics. To retrieve the most meaningful
documents related to search topics, ranking algorithm is used in
information retrieval technique. One of the issues in data miming is
ranking the retrieved document. In information retrieval the ranking
is one of the practical problems. This paper includes various Page
Ranking algorithms, page segmentation algorithms and compares
those algorithms used for Information Retrieval. Diverse Page Rank
based algorithms like Page Rank (PR), Weighted Page Rank (WPR),
Weight Page Content Rank (WPCR), Hyperlink Induced Topic
Selection (HITS), Distance Rank, Eigen Rumor, Distance Rank Time
Rank, Tag Rank, Relational Based Page Rank and Query Dependent
Ranking algorithms are discussed and compared.
Abstract: This paper aims to analyze the role of natural
language processing (NLP). The paper will discuss the role in the
context of automated data retrieval, automated question answer, and
text structuring. NLP techniques are gaining wider acceptance in real
life applications and industrial concerns. There are various
complexities involved in processing the text of natural language that
could satisfy the need of decision makers. This paper begins with the
description of the qualities of NLP practices. The paper then focuses
on the challenges in natural language processing. The paper also
discusses major techniques of NLP. The last section describes
opportunities and challenges for future research.
Abstract: Nowadays, the Web has become one of the most
pervasive platforms for information change and retrieval. It collects
the suitable and perfectly fitting information from websites that one
requires. Data mining is the form of extracting data’s available in the
internet. Web mining is one of the elements of data mining
Technique, which relates to various research communities such as
information recovery, folder managing system and simulated
intellects. In this Paper we have discussed the concepts of Web
mining. We contain generally focused on one of the categories of
Web mining, specifically the Web Content Mining and its various
farm duties. The mining tools are imperative to scanning the many
images, text, and HTML documents and then, the result is used by
the various search engines. We conclude by presenting a comparative
table of these tools based on some pertinent criteria.
Abstract: Search is the most obvious application of information
retrieval. The variety of widely obtainable biomedical data is
enormous and is expanding fast. This expansion makes the existing
techniques are not enough to extract the most interesting patterns
from the collection as per the user requirement. Recent researches are
concentrating more on semantic based searching than the traditional
term based searches. Algorithms for semantic searches are
implemented based on the relations exist between the words of the
documents. Ontologies are used as domain knowledge for identifying
the semantic relations as well as to structure the data for effective
information retrieval. Annotation of data with concepts of ontology is
one of the wide-ranging practices for clustering the documents. In
this paper, indexing based on concept and annotation are proposed
for clustering the biomedical documents. Fuzzy c-means (FCM)
clustering algorithm is used to cluster the documents. The
performances of the proposed methods are analyzed with traditional
term based clustering for PubMed articles in five different diseases
communities. The experimental results show that the proposed
methods outperform the term based fuzzy clustering.
Abstract: The analysis of scientific collaboration networks has contributed significantly to improving the understanding of how does the process of collaboration between researchers and also to understand how the evolution of scientific production of researchers or research groups occurs. However, the identification of collaborations in large scientific databases is not a trivial task given the high computational cost of the methods commonly used. This paper proposes a method for identifying collaboration in large data base of curriculum researchers. The proposed method has low computational cost with satisfactory results, proving to be an interesting alternative for the modeling and characterization of large scientific collaboration networks.
Abstract: Genetic Algorithm (GA) is a powerful technique for solving optimization problems. It follows the idea of survival of the fittest - Better and better solutions evolve from previous generations until a near optimal solution is obtained. GA uses the main three operations, the selection, crossover and mutation to produce new generations from the old ones. GA has been widely used to solve optimization problems in many applications such as traveling salesman problem, airport traffic control, information retrieval (IR), reactive power optimization, job shop scheduling, and hydraulics systems such as water pipeline systems. In water pipeline systems we need to achieve some goals optimally such as minimum cost of construction, minimum length of pipes and diameters, and the place of protection devices. GA shows high performance over the other optimization techniques, moreover, it is easy to implement and use. Also, it searches a limited number of solutions.
Abstract: Information retrieval has become an important field of study and research under computer science due to explosive growth of information available in the form of full text, hypertext, administrative text, directory, numeric or bibliographic text. The research work is going on various aspects of information retrieval systems so as to improve its efficiency and reliability. This paper presents a comprehensive study, which discusses not only emergence and evolution of information retrieval but also includes different information retrieval models and some important aspects such as document representation, similarity measure and query expansion.
Abstract: Reformulating the user query is a technique that aims to improve the performance of an Information Retrieval System (IRS) in terms of precision and recall. This paper tries to evaluate the technique of query reformulation guided by an external resource for Arabic texts. To do this, various precision and recall measures were conducted and two corpora with different external resources like Arabic WordNet (AWN) and the Arabic Dictionary (thesaurus) of Meaning (ADM) were used. Examination of the obtained results will allow us to measure the real contribution of this reformulation technique in improving the IRS performance.
Abstract: The research on ARCS for critical information retrieval development aimed to (1) investigate conditions of critical information retrieval skill of the Mathematics pre-service teachers before applying ARCS model in learning activities, (2) study and analyze the development of critical information retrieval skill of the Mathematics pre-service teachers after utilizing ARCS model in learning activities, and (3) evaluate the Mathematics pre-service teachers’ satisfaction on using ARCS model in learning activities as a tool to development critical information retrieval skill. Forty-one of 4th year Mathematics pre-service teachers who have enrolled in the subject of Research for Learning Development of semester 2 in 2012 were purposively selected as the research cohort. The research tools were self-report and interview questionnaire that was approved as content validity and reliability (IOC=.66-1.00, α =.834). The research found that critical information retrieval skill of the research samples before using ARCS model in learning activities was in the normal high level. According to the in-depth interview and focus group, the result however showed that the pre-service teachers still lack inadequate and effective knowledge in information retrieval. Additionally, critical information retrieval skill of the research cohort after applying ARCS model in learning activities appeared to be high level. The result revealed that the pre-service teachers are able to explain the method of searching, extraction, and selecting information as well as evaluating quality of information, and effectively making decision in accepting information. Moreover, the research discovered that the pre-service teachers showed normal high to highest level of satisfaction on using ARCS model in learning activities as a tool to development their critical information retrieval skill.
Abstract: Rapid progress in audio compression technology has contributed to the explosive growth of music available in digital form today. In a reversal of ideas, this work makes use of a recently proposed efficient audio compression scheme to develop three important applications in the context of Music Information Retrieval (MIR) for the effective manipulation of large music databases, namely automatic music recommendation (AMR), digital rights management (DRM) and audio finger-printing for song identification. The performance of these three applications has been evaluated with respect to a database of songs collected from a diverse set of genres.
Abstract: In this paper, we present symbolic recognition models to extract knowledge characterized by document structures. Focussing on the extraction and the meticulous exploitation of the semantic structure of documents, we obtain a meaningful contextual tagging corresponding to different unit types (title, chapter, section, enumeration, etc.).
Abstract: Ever increasing capacities of contemporary storage devices
inspire the vision to accumulate (personal) information without
the need of deleting old data over a long time-span. Hence the target
of SemanticLIFE project is to create a Personal Information Management
system for a human lifetime data. One of the most important
characteristics of the system is its dedication to retrieve information
in a very efficient way. By adopting user demands regarding the
reduction of ambiguities, our approach aims at a user-oriented and
yet powerful enough system with a satisfactory query performance.
We introduce the query system of SemanticLIFE, the Virtual Query
System, which uses emerging Semantic Web technologies to fulfill
users- requirements.
Abstract: The information on the Web increases tremendously.
A number of search engines have been developed for searching Web
information and retrieving relevant documents that satisfy the
inquirers needs. Search engines provide inquirers irrelevant
documents among search results, since the search is text-based rather
than semantic-based. Information retrieval research area has
presented a number of approaches and methodologies such as
profiling, feedback, query modification, human-computer interaction,
etc for improving search results. Moreover, information retrieval has
employed artificial intelligence techniques and strategies such as
machine learning heuristics, tuning mechanisms, user and system
vocabularies, logical theory, etc for capturing user's preferences and
using them for guiding the search based on the semantic analysis
rather than syntactic analysis. Although a valuable improvement has
been recorded on search results, the survey has shown that still
search engines users are not really satisfied with their search results.
Using ontologies for semantic-based searching is likely the key
solution. Adopting profiling approach and using ontology base
characteristics, this work proposes a strategy for finding the exact
meaning of the query terms in order to retrieve relevant information
according to user needs. The evaluation of conducted experiments
has shown the effectiveness of the suggested methodology and
conclusion is presented.
Abstract: In this paper, an intelligent algorithm for optimal
document archiving is presented. It is kown that electronic archives
are very important for information system management. Minimizing
the size of the stored data in electronic archive is a main issue to
reduce the physical storage area. Here, the effect of different types of
Arabic fonts on electronic archives size is discussed. Simulation
results show that PDF is the best file format for storage of the Arabic
documents in electronic archive. Furthermore, fast information
detection in a given PDF file is introduced. Such approach uses fast
neural networks (FNNs) implemented in the frequency domain. The
operation of these networks relies on performing cross correlation in
the frequency domain rather than spatial one. It is proved
mathematically and practically that the number of computation steps
required for the presented FNNs is less than that needed by
conventional neural networks (CNNs). Simulation results using
MATLAB confirm the theoretical computations.
Abstract: XML is a markup language which is becoming the
standard format for information representation and data exchange. A
major purpose of XML is the explicit representation of the logical
structure of a document. Much research has been performed to
exploit logical structure of documents in information retrieval in
order to precisely extract user information need from large
collections of XML documents. In this paper, we describe an XML
information retrieval weighting scheme that tries to find the most
relevant elements in XML documents in response to a user query.
We present this weighting model for information retrieval systems
that utilize plausible inferences to infer the relevance of elements in
XML documents. We also add to this model the Dempster-Shafer
theory of evidence to express the uncertainty in plausible inferences
and Dempster-Shafer rule of combination to combine evidences
derived from different inferences.
Abstract: Music Information Retrieval (MIR) and modern data mining techniques are applied to identify style markers in midi music for stylometric analysis and author attribution. Over 100 attributes are extracted from a library of 2830 songs then mined using supervised learning data mining techniques. Two attributes are identified that provide high informational gain. These attributes are then used as style markers to predict authorship. Using these style markers the authors are able to correctly distinguish songs written by the Beatles from those that were not with a precision and accuracy of over 98 per cent. The identification of these style markers as well as the architecture for this research provides a foundation for future research in musical stylometry.
Abstract: Question answering (QA) aims at retrieving precise information from a large collection of documents. Most of the Question Answering systems composed of three main modules: question processing, document processing and answer processing. Question processing module plays an important role in QA systems to reformulate questions. Moreover answer processing module is an emerging topic in QA systems, where these systems are often required to rank and validate candidate answers. These techniques aiming at finding short and precise answers are often based on the semantic relations and co-occurrence keywords. This paper discussed about a new model for question answering which improved two main modules, question processing and answer processing which both affect on the evaluation of the system operations. There are two important components which are the bases of the question processing. First component is question classification that specifies types of question and answer. Second one is reformulation which converts the user's question into an understandable question by QA system in a specific domain. The objective of an Answer Validation task is thus to judge the correctness of an answer returned by a QA system, according to the text snippet given to support it. For validating answers we apply candidate answer filtering, candidate answer ranking and also it has a final validation section by user voting. Also this paper described new architecture of question and answer processing modules with modeling, implementing and evaluating the system. The system differs from most question answering systems in its answer validation model. This module makes it more suitable to find exact answer. Results show that, from total 50 asked questions, evaluation of the model, show 92% improving the decision of the system.
Abstract: The purposes of this study are 1) to identify
learning styles of university students in Bangkok, and 2) to study
the frequency of the relevant instructional context of the identified
learning styles. Learning Styles employed in this study are those of
Honey and Mumford, which include 1) Reflectors, 2) Theorists, 3)
Pragmatists, and 4) Activists. The population comprises 1383
students and 5 lecturers. Research tools are 2 questionnaires – one
used for identifying students- learning styles, and the other used for
identifying the frequency of the relevant instructional context of
the identified learning styles.
The research findings reveal that 32.30 percent - are Activists,
while 28.10 percent are Theorists, 20.10 are Reflectors, and 19.50
are Pragmatists. In terms of the relevant instructional context of the
identified 4 learning styles, it is found that the frequency level of
the instructional context is totally in high level. Moreover, 2 lists of
the context being conducted most frequently are 'Lead'in activity
to review background knowledge,- and 'Information retrieval
report.' And these two activities serve the learning styles of
theorists and activists. It is, therefore, suggested that more
instructional context supporting the activists, the majority of the
population, learning best by doing, as well as emotional learning
situation should be added.
Abstract: In the world of Peer-to-Peer (P2P) networking
different protocols have been developed to make the resource sharing
or information retrieval more efficient. The SemPeer protocol is a
new layer on Gnutella that transforms the connections of the nodes
based on semantic information to make information retrieval more
efficient. However, this transformation causes high clustering in the
network that decreases the number of nodes reached, therefore the
probability of finding a document is also decreased. In this paper we
describe a mathematical model for the Gnutella and SemPeer
protocols that captures clustering-related issues, followed by a
proposition to modify the SemPeer protocol to achieve moderate
clustering. This modification is a sort of link management for the
individual nodes that allows the SemPeer protocol to be more
efficient, because the probability of a successful query in the P2P
network is reasonably increased. For the validation of the models, we
evaluated a series of simulations that supported our results.
Abstract: Field Association (FA) terms are a limited set of discriminating terms that give us the knowledge to identify document fields which are effective in document classification, similar file retrieval and passage retrieval. But the problem lies in the lack of an effective method to extract automatically relevant Arabic FA Terms to build a comprehensive dictionary. Moreover, all previous studies are based on FA terms in English and Japanese, and the extension of FA terms to other language such Arabic could be definitely strengthen further researches. This paper presents a new method to extract, Arabic FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. Experimental evaluation is carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news selected average of 2,825 FA Terms (single and compound) per field. From the experimental results, recall and precision are 84% and 79% respectively. Therefore, this method selects higher number of relevant Arabic FA Terms at high precision and recall.