Abstract: Since dealing with high dimensional data is
computationally complex and sometimes even intractable, recently
several feature reductions methods have been developed to reduce
the dimensionality of the data in order to simplify the calculation
analysis in various applications such as text categorization, signal
processing, image retrieval, gene expressions and etc. Among feature
reduction techniques, feature selection is one the most popular
methods due to the preservation of the original features.
In this paper, we propose a new unsupervised feature selection
method which will remove redundant features from the original
feature space by the use of probability density functions of various
features. To show the effectiveness of the proposed method, popular
feature selection methods have been implemented and compared.
Experimental results on the several datasets derived from UCI
repository database, illustrate the effectiveness of our proposed
methods in comparison with the other compared methods in terms of
both classification accuracy and the number of selected features.
Abstract: The evaluation of conversational agents or chatterbots question answering systems is a major research area that needs much attention. Before the rise of domain-oriented conversational agents based on natural language understanding and reasoning, evaluation is never a problem as information retrieval-based metrics are readily available for use. However, when chatterbots began to become more domain specific, evaluation becomes a real issue. This is especially true when understanding and reasoning is required to cater for a wider variety of questions and at the same time to achieve high quality responses. This paper discusses the inappropriateness of the existing measures for response quality evaluation and the call for new standard measures and related considerations are brought forward. As a short-term solution for evaluating response quality of conversational agents, and to demonstrate the challenges in evaluating systems of different nature, this research proposes a blackbox approach using observation, classification scheme and a scoring mechanism to assess and rank three example systems, AnswerBus, START and AINI.
Abstract: The overall objective of this paper is to retrieve soil
surfaces parameters namely, roughness and soil moisture related to
the dielectric constant by inverting the radar backscattered signal
from natural soil surfaces.
Because the classical description of roughness using statistical
parameters like the correlation length doesn't lead to satisfactory
results to predict radar backscattering, we used a multi-scale
roughness description using the wavelet transform and the Mallat
algorithm. In this description, the surface is considered as a
superposition of a finite number of one-dimensional Gaussian
processes each having a spatial scale. A second step in this study
consisted in adapting a direct model simulating radar backscattering
namely the small perturbation model to this multi-scale surface
description. We investigated the impact of this description on radar
backscattering through a sensitivity analysis of backscattering
coefficient to the multi-scale roughness parameters.
To perform the inversion of the small perturbation multi-scale
scattering model (MLS SPM) we used a multi-layer neural network
architecture trained by backpropagation learning rule. The inversion
leads to satisfactory results with a relative uncertainty of 8%.
Abstract: The paper proposes a unified model for multimedia data retrieval which includes data representatives, content representatives, index structure, and search algorithms. The multimedia data are defined as k-dimensional signals indexed in a multidimensional k-tree structure. The benefits of using the k-tree unified model were demonstrated by running the data retrieval application on a six networked nodes test bed cluster. The tests were performed with two retrieval algorithms, one that allows parallel searching using a single feature, the second that performs a weighted cascade search for multiple features querying. The experiments show a significant reduction of retrieval time while maintaining the quality of results.
Abstract: One of the major challenges in the Information
Retrieval field is handling the massive amount of information
available to Internet users. Existing ranking techniques and strategies
that govern the retrieval process fall short of expected accuracy.
Often relevant documents are buried deep in the list of documents
returned by the search engine. In order to improve retrieval accuracy
we examine the issue of language effect on the retrieval process.
Then, we propose a solution for a more biased, user-centric relevance
for retrieved data. The results demonstrate that using indices based
on variations of the same language enhances the accuracy of search
engines for individual users.
Abstract: Automatic reusability appraisal could be helpful in
evaluating the quality of developed or developing reusable software
components and in identification of reusable components from
existing legacy systems; that can save cost of developing the software
from scratch. But the issue of how to identify reusable components
from existing systems has remained relatively unexplored. In this
paper, we have mentioned two-tier approach by studying the
structural attributes as well as usability or relevancy of the
component to a particular domain. Latent semantic analysis is used
for the feature vector representation of various software domains. It
exploits the fact that FeatureVector codes can be seen as documents
containing terms -the idenifiers present in the components- and so
text modeling methods that capture co-occurrence information in
low-dimensional spaces can be used. Further, we devised Neuro-
Fuzzy hybrid Inference System, which takes structural metric values
as input and calculates the reusability of the software component.
Decision tree algorithm is used to decide initial set of fuzzy rules for
the Neuro-fuzzy system. The results obtained are convincing enough
to propose the system for economical identification and retrieval of
reusable software components.
Abstract: Number of documents being created increases at an
increasing pace while most of them being in already known topics
and little of them introducing new concepts. This fact has started a
new era in information retrieval discipline where the requirements
have their own specialties. That is digging into topics and concepts
and finding out subtopics or relations between topics. Up to now IR
researches were interested in retrieving documents about a general
topic or clustering documents under generic subjects. However these
conventional approaches can-t go deep into content of documents
which makes it difficult for people to reach to right documents they
were searching. So we need new ways of mining document sets
where the critic point is to know much about the contents of the
documents. As a solution we are proposing to enhance LSI, one of
the proven IR techniques by supporting its vector space with n-gram
forms of words. Positive results we have obtained are shown in two
different application area of IR domain; querying a document
database, clustering documents in the document database.
Abstract: The development of distributed systems has been affected by the need to accommodate an increasing degree of flexibility, adaptability, and autonomy. The Mobile Agent technology is emerging as an alternative to build a smart generation of highly distributed systems. In this work, we investigate the performance aspect of agent-based technologies for information retrieval. We present a comparative performance evaluation model of Mobile Agents versus Remote Method Invocation by means of an analytical approach. We demonstrate the effectiveness of mobile agents for dynamic code deployment and remote data processing by reducing total latency and at the same time producing minimum network traffic. We argue that exploiting agent-based technologies significantly enhances the performance of distributed systems in the domain of information retrieval.
Abstract: We present here the results for a comparative study of
some techniques, available in the literature, related to the relevance
feedback mechanism in the case of a short-term learning. Only one
method among those considered here is belonging to the data mining
field which is the K-nearest neighbors algorithm (KNN) while the
rest of the methods is related purely to the information retrieval field
and they fall under the purview of the following three major axes:
Shifting query, Feature Weighting and the optimization of the
parameters of similarity metric. As a contribution, and in addition to
the comparative purpose, we propose a new version of the KNN
algorithm referred to as an incremental KNN which is distinct from
the original version in the sense that besides the influence of the
seeds, the rate of the actual target image is influenced also by the
images already rated. The results presented here have been obtained
after experiments conducted on the Wang database for one iteration
and utilizing color moments on the RGB space. This compact
descriptor, Color Moments, is adequate for the efficiency purposes
needed in the case of interactive systems. The results obtained allow
us to claim that the proposed algorithm proves good results; it even
outperforms a wide range of techniques available in the literature.
Abstract: Software reuse can be considered as the most realistic
and promising way to improve software engineering productivity and
quality. Automated assistance for software reuse involves the
representation, classification, retrieval and adaptation of components.
The representation and retrieval of components are important to
software reuse in Component-Based on Software Development
(CBSD). However, current industrial component models mainly focus
on the implement techniques and ignore the semantic information
about component, so it is difficult to retrieve the components that
satisfy user-s requirements. This paper presents a method of business
component retrieval based on specification matching to solve the
software reuse of enterprise information system. First, a business
component model oriented reuse is proposed. In our model, the
business data type is represented as sign data type based on XML,
which can express the variable business data type that can describe the
variety of business operations. Based on this model, we propose
specification match relationships in two levels: business operation
level and business component level. In business operation level, we
use input business data types, output business data types and the
taxonomy of business operations evaluate the similarity between
business operations. In the business component level, we propose five
specification matches between business components. To retrieval
reusable business components, we propose the measure of similarity
degrees to calculate the similarities between business components.
Finally, a business component retrieval command like SQL is
proposed to help user to retrieve approximate business components
from component repository.
Abstract: Methods for organizing web data into groups in order
to analyze web-based hypertext data and facilitate data availability
are very important in terms of the number of documents available
online. Thereby, the task of clustering web-based document structures
has many applications, e.g., improving information retrieval on the
web, better understanding of user navigation behavior, improving web
users requests servicing, and increasing web information accessibility.
In this paper we investigate a new approach for clustering web-based
hypertexts on the basis of their graph structures. The hypertexts will
be represented as so called generalized trees which are more general
than usual directed rooted trees, e.g., DOM-Trees. As a important
preprocessing step we measure the structural similarity between the
generalized trees on the basis of a similarity measure d. Then,
we apply agglomerative clustering to the obtained similarity matrix
in order to create clusters of hypertext graph patterns representing
navigation structures. In the present paper we will run our approach
on a data set of hypertext structures and obtain good results in
Web Structure Mining. Furthermore we outline the application of
our approach in Web Usage Mining as future work.
Abstract: Increasing growth of information volume in the
internet causes an increasing need to develop new (semi)automatic
methods for retrieval of documents and ranking them according to
their relevance to the user query. In this paper, after a brief review
on ranking models, a new ontology based approach for ranking
HTML documents is proposed and evaluated in various
circumstances. Our approach is a combination of conceptual,
statistical and linguistic methods. This combination reserves the
precision of ranking without loosing the speed. Our approach
exploits natural language processing techniques for extracting
phrases and stemming words. Then an ontology based conceptual
method will be used to annotate documents and expand the query.
To expand a query the spread activation algorithm is improved so
that the expansion can be done in various aspects. The annotated
documents and the expanded query will be processed to compute
the relevance degree exploiting statistical methods. The outstanding
features of our approach are (1) combining conceptual, statistical
and linguistic features of documents, (2) expanding the query with
its related concepts before comparing to documents, (3) extracting
and using both words and phrases to compute relevance degree, (4)
improving the spread activation algorithm to do the expansion based
on weighted combination of different conceptual relationships and
(5) allowing variable document vector dimensions. A ranking
system called ORank is developed to implement and test the
proposed model. The test results will be included at the end of the
paper.
Abstract: In the field of concepts, the measure of Wu and Palmer [1] has the advantage of being simple to implement and have good performances compared to the other similarity measures [2]. Nevertheless, the Wu and Palmer measure present the following disadvantage: in some situations, the similarity of two elements of an IS-A ontology contained in the neighborhood exceeds the similarity value of two elements contained in the same hierarchy. This situation is inadequate within the information retrieval framework. To overcome this problem, we propose a new similarity measure based on the Wu and Palmer measure. Our objective is to obtain realistic results for concepts not located in the same way. The obtained results show that compared to the Wu and Palmer approach, our measure presents a profit in terms of relevance and execution time.
Abstract: This paper presents a dominant color descriptor
technique for medical image retrieval. The medical image system
will collect and store into medical database. The purpose of
dominant color descriptor (DCD) technique is to retrieve medical
image and to display similar image using queried image. First, this
technique will search and retrieve medical image based on keyword
entered by user. After image is found, the system will assign this
image as a queried image. DCD technique will calculate the image
value of dominant color. Then, system will search and retrieve again
medical image based on value of dominant color query image.
Finally, the system will display similar images with the queried
image to user. Simple application has been developed and tested
using dominant color descriptor. Result based on experiment
indicates this technique is effective and can be used for medical
image retrieval.
Abstract: In this contribution an innovative platform is being
presented that integrates intelligent agents and evolutionary
computation techniques in legacy e-learning environments. It
introduces the design and development of a scalable and
interoperable integration platform supporting:
I) various assessment agents for e-learning environments,
II) a specific resource retrieval agent for the provision of
additional information from Internet sources matching the
needs and profile of the specific user and
III) a genetic algorithm designed to extract efficient information
(classifying rules) based on the students- answering input
data.
The agents are implemented in order to provide intelligent
assessment services based on computational intelligence techniques
such as Bayesian Networks and Genetic Algorithms.
The proposed Genetic Algorithm (GA) is used in order to extract
efficient information (classifying rules) based on the students-
answering input data. The idea of using a GA in order to fulfil this
difficult task came from the fact that GAs have been widely used in
applications including classification of unknown data.
The utilization of new and emerging technologies like web
services allows integrating the provided services to any web based
legacy e-learning environment.
Abstract: Documents retrieval in Information Retrieval
Systems (IRS) is generally about understanding of
information in the documents concern. The more the system
able to understand the contents of documents the more
effective will be the retrieval outcomes. But understanding of the
contents is a very complex task. Conventional IRS apply algorithms
that can only approximate the meaning of document contents through
keywords approach using vector space model. Keywords may be
unstemmed or stemmed. When keywords are stemmed and conflated
in retrieving process, we are a step forwards in applying semantic
technology in IRS. Word stemming is a process in morphological
analysis under natural language processing, before syntactic and
semantic analysis. We have developed algorithms for Malay and
Arabic and incorporated stemming in our experimental systems in
order to measure retrieval effectiveness. The results have shown that
the retrieval effectiveness has increased when stemming is used in
the systems.
Abstract: The work presented in this paper focus on Knowledge Management services enabling CSCW (Computer Supported Cooperative Work) applications to provide an appropriate adaptation to the user and the situation in which the user is working. In this paper, we explain how a knowledge management system can be designed to support users in different situations exploiting contextual data, users' preferences, and profiles of involved artifacts (e.g., documents, multimedia files, mockups...). The presented work roots in the experience we had in the MILK project and early steps made in the MAIS project.
Abstract: Today, Genetic Algorithm has been used to solve
wide range of optimization problems. Some researches conduct on
applying Genetic Algorithm to text classification, summarization
and information retrieval system in text mining process. This
researches show a better performance due to the nature of Genetic
Algorithm. In this paper a new algorithm for using Genetic
Algorithm in concept weighting and topic identification, based on
concept standard deviation will be explored.
Abstract: All Text processing systems allow their users to
search a pattern of string from a given text. String matching is
fundamental to database and text processing applications. Every text
editor must contain a mechanism to search the current document for
arbitrary strings. Spelling checkers scan an input text for words in the
dictionary and reject any strings that do not match. We store our
information in data bases so that later on we can retrieve the same
and this retrieval can be done by using various string matching
algorithms. This paper is describing a new string matching algorithm
for various applications. A new algorithm has been designed with the
help of Rabin Karp Matcher, to improve string matching process.
Abstract: As the web continues to grow exponentially, the idea
of crawling the entire web on a regular basis becomes less and less
feasible, so the need to include information on specific domain,
domain-specific search engines was proposed. As more information
becomes available on the World Wide Web, it becomes more difficult
to provide effective search tools for information access. Today,
people access web information through two main kinds of search
interfaces: Browsers (clicking and following hyperlinks) and Query
Engines (queries in the form of a set of keywords showing the topic
of interest) [2]. Better support is needed for expressing one's
information need and returning high quality search results by web
search tools. There appears to be a need for systems that do reasoning
under uncertainty and are flexible enough to recover from the
contradictions, inconsistencies, and irregularities that such reasoning
involves. In a multi-view problem, the features of the domain can be
partitioned into disjoint subsets (views) that are sufficient to learn the
target concept. Semi-supervised, multi-view algorithms, which
reduce the amount of labeled data required for learning, rely on the
assumptions that the views are compatible and uncorrelated. This
paper describes the use of semi-structured machine learning approach
with Active learning for the “Domain Specific Search Engines". A
domain-specific search engine is “An information access system that
allows access to all the information on the web that is relevant to a
particular domain. The proposed work shows that with the help of
this approach relevant data can be extracted with the minimum
queries fired by the user. It requires small number of labeled data and
pool of unlabelled data on which the learning algorithm is applied to
extract the required data.