Abstract: In this paper, we propose an efficient hierarchical DNA
sequence search method that improves search speed while keeping
accuracy constant. For a given query DNA sequence, a fast local
search method using histogram features is first applied as a
filtering mechanism before the sequences in the database are scanned.
An overlapping processing step is added to improve the robustness
of the algorithm. A large number of DNA sequences with low
similarity are thereby excluded from later searching. The Smith-Waterman
algorithm is then applied to each remaining sequence. Experimental
results using GenBank sequence data show that the proposed method,
which combines histogram information with the Smith-Waterman
algorithm, is more efficient for DNA sequence search.
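The filter-then-align idea of the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base-count histogram, the L1 distance, the half-window overlap, the `max_dist` threshold, and the scoring parameters are all assumptions chosen for illustration, and a filter of this kind may wrongly exclude a true match if the threshold is too tight.

```python
from collections import Counter

def histogram(seq):
    """Counts of the bases A, C, G, T in seq (assumed histogram feature)."""
    counts = Counter(seq)
    return [counts.get(base, 0) for base in "ACGT"]

def l1(h1, h2):
    """L1 distance between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def min_window_distance(query, target):
    """Smallest histogram distance between the query and any overlapping
    window of the target (window = query length, stride = half a window)."""
    w = len(query)
    qh = histogram(query)
    if len(target) <= w:
        return l1(qh, histogram(target))
    step = max(1, w // 2)
    starts = list(range(0, len(target) - w + 1, step))
    if starts[-1] != len(target) - w:
        starts.append(len(target) - w)  # make sure the tail is covered
    return min(l1(qh, histogram(target[s:s + w])) for s in starts)

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def search(query, database, max_dist):
    """Stage 1: drop sequences whose histograms are too far from the query.
    Stage 2: score the survivors with Smith-Waterman, best first."""
    survivors = [name for name, seq in database.items()
                 if min_window_distance(query, seq) <= max_dist]
    return sorted(((smith_waterman(query, database[name]), name)
                   for name in survivors), reverse=True)
```

Only sequences that pass the cheap histogram test reach the quadratic-time alignment step, which is where the speed-up would come from.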
Abstract: This paper proposes a new model to support user
queries on postgraduate research information at Universiti Tenaga
Nasional. The ontology to be developed will contribute towards
shareable and reusable domain knowledge that makes knowledge
assets intelligently accessible to both people and software. This work
adapts a methodology for ontology development based on the
framework proposed by Uschold and King. The concepts and
relations in this domain are represented in a class diagram using the
Protégé software. The ontology will be used to support a menu-driven
query system for assisting students in searching for
information related to postgraduate research at the university.
Abstract: The paper proposes a unified model for multimedia data retrieval which includes data representatives, content representatives, index structure, and search algorithms. The multimedia data are defined as k-dimensional signals indexed in a multidimensional k-tree structure. The benefits of using the k-tree unified model were demonstrated by running the data retrieval application on a test-bed cluster of six networked nodes. The tests were performed with two retrieval algorithms: one that allows parallel searching using a single feature, and a second that performs a weighted cascade search for queries on multiple features. The experiments show a significant reduction of retrieval time while maintaining the quality of results.
Abstract: The number of documents being created is growing at an
increasing pace, yet most of them cover already known topics
and few introduce new concepts. This has started a
new era in the information retrieval (IR) discipline, one with its
own specific requirements: digging into topics and concepts
and finding subtopics or relations between topics. Until now, IR
research has focused on retrieving documents about a general
topic or clustering documents under generic subjects. However, these
conventional approaches cannot go deep into the content of documents,
which makes it difficult for people to reach the documents they
are searching for. We therefore need new ways of mining document sets
in which the critical point is to know much about the contents of the
documents. As a solution, we propose enhancing LSI, one of
the proven IR techniques, by supporting its vector space with n-gram
forms of words. The positive results we have obtained are shown in two
different application areas of the IR domain: querying a document
database and clustering documents in the document database.
Abstract: This paper presents data annotation models at
five levels of granularity (database, relation, column, tuple, and cell) of relational data to address the fact that most relational databases are poorly suited to expressing annotations. These models
do not require any structural or schematic changes to the
underlying database. These models are also flexible, extensible,
customizable, database-neutral, and platform-independent. This paper also presents an SQL-like query language, named Annotation Query Language (AnQL), to query annotation documents. AnQL is simple to understand and exploits the wide existing knowledge and skill set around SQL.
Abstract: We present the results of a comparative study of
several techniques from the literature related to the relevance
feedback mechanism in the case of short-term learning. Only one
of the methods considered here belongs to the data mining
field, namely the K-nearest neighbors (KNN) algorithm, while the
remaining methods belong purely to the information retrieval field
and fall under three major axes:
query shifting, feature weighting, and optimization of the
parameters of the similarity metric. As a contribution beyond
the comparison itself, we propose a new version of the KNN
algorithm, referred to as incremental KNN, which differs from
the original version in that, besides the influence of the
seeds, the rating of the current target image is also influenced by the
images already rated. The results presented here were obtained
from experiments conducted on the Wang database for one iteration,
using color moments in the RGB space. This compact
descriptor, color moments, is adequate for the efficiency
needed in interactive systems. The results allow
us to claim that the proposed algorithm gives good results; it even
outperforms a wide range of techniques available in the literature.
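As an illustration of the compact descriptor mentioned in the abstract above, the following Python sketch computes color moments in RGB space using the common three-moment definition (mean, standard deviation, and skewness per channel). Whether the study used exactly these three moments is an assumption, and the function name `color_moments` is hypothetical.

```python
import math

def color_moments(pixels):
    """Return nine values for a list of (r, g, b) pixels: the mean,
    standard deviation, and skewness of each channel. Skewness is taken
    as the signed cube root of the third central moment (assumed form)."""
    n = len(pixels)
    features = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        skewness = math.copysign(abs(third) ** (1 / 3), third)
        features.extend([mean, math.sqrt(variance), skewness])
    return features
```

Nine numbers per image keep the distance computations cheap, which is consistent with the efficiency argument made for interactive systems.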
Abstract: The rapid expansion of the web is causing the
constant growth of information, leading to several problems such as
the increased difficulty of extracting potentially useful knowledge. Web
content mining confronts this problem by gathering explicit information
from different web sites for access and knowledge discovery.
Query interfaces of web databases share common building blocks.
After extracting information with a parsing approach, we use a new
data mining algorithm to match a large number of schemas in
databases at a time. Using this algorithm increases the speed of
information matching. In addition, instead of simple 1:1 matching,
it performs complex (m:n) matching between query interfaces. In this
paper we present a novel correlation mining algorithm that matches
correlated attributes at a lower cost. This algorithm uses the Jaccard
measure to distinguish positively and negatively correlated attributes.
The system then matches the user query against the different query
interfaces in a specific domain and finally chooses the query
interface nearest to the user query to answer it.
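The Jaccard measure named in the abstract above can be sketched in a few lines of Python. The sketch treats each query interface as a set of attribute names; the thresholds `pos` and `neg` and the function names are assumptions for illustration, not values or names taken from the paper.

```python
def jaccard(a, b, schemas):
    """Jaccard measure of attributes a and b over a list of schemas:
    (schemas containing both) / (schemas containing at least one)."""
    both = sum(1 for s in schemas if a in s and b in s)
    either = sum(1 for s in schemas if a in s or b in s)
    return both / either if either else 0.0

def correlation(a, b, schemas, pos=0.5, neg=0.1):
    """Label an attribute pair by its Jaccard value (assumed thresholds)."""
    score = jaccard(a, b, schemas)
    if score >= pos:
        return "positive"    # attributes tend to appear together
    if score <= neg:
        return "negative"    # attributes rarely appear together
    return "uncorrelated"

# Hypothetical book-search interfaces, each reduced to its attribute set.
schemas = [
    {"author", "title"},
    {"author", "title", "isbn"},
    {"writer", "title"},
    {"writer", "title", "isbn"},
]
```

In this toy data, `author` and `writer` never occur in the same interface, so they come out as negatively correlated, which is the kind of signal a matcher could use to propose them as alternatives for the same concept.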
Abstract: Data Warehousing tools have become very popular, and many of them have now moved to Web-based user interfaces to make the tools easier to access and use. The next step is to enable these tools to be used within a portal framework. The portal framework consists of pages having several small windows that contain individual data warehouse query results. There are several issues that need to be considered when designing the architecture for a portal-enabled data warehouse query tool. Some issues need special techniques that can overcome the limitations imposed by the nature of data warehouse queries. Issues such as single sign-on, query result caching and sharing, customization, scheduling and authorization need to be considered. This paper discusses such issues and suggests an architecture to support data warehouse queries within Web portal frameworks.
Abstract: The increasing volume of information on the
internet creates a growing need for new (semi)automatic
methods for retrieving documents and ranking them according to
their relevance to the user query. In this paper, after a brief review
of ranking models, a new ontology-based approach for ranking
HTML documents is proposed and evaluated in various
circumstances. Our approach is a combination of conceptual,
statistical and linguistic methods. This combination preserves the
precision of ranking without losing speed. Our approach
exploits natural language processing techniques for extracting
phrases and stemming words. An ontology-based conceptual
method is then used to annotate documents and expand the query.
To expand a query, the spread activation algorithm is improved so
that the expansion can be done in various aspects. The annotated
documents and the expanded query are processed to compute
the relevance degree using statistical methods. The outstanding
features of our approach are (1) combining conceptual, statistical
and linguistic features of documents, (2) expanding the query with
its related concepts before comparing to documents, (3) extracting
and using both words and phrases to compute relevance degree, (4)
improving the spread activation algorithm to do the expansion based
on weighted combination of different conceptual relationships and
(5) allowing variable document vector dimensions. A ranking
system called ORank is developed to implement and test the
proposed model. The test results will be included at the end of the
paper.
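Feature (4) of the abstract above, query expansion by spread activation over weighted conceptual relationships, can be sketched as follows in Python. This is not ORank's algorithm: the toy ontology, the relation weights, the `decay` factor, and the `threshold` are all assumptions made for illustration.

```python
def expand_query(terms, ontology, weights, decay=0.5, threshold=0.2):
    """Spread activation from the query terms through the ontology.

    ontology: {concept: [(relation, neighbour), ...]}
    weights:  {relation: weight between 0 and 1}
    Returns {concept: activation}; query terms start at 1.0 and each hop
    multiplies the activation by the relation weight and the decay."""
    activation = {t: 1.0 for t in terms}
    frontier = list(terms)
    while frontier:
        concept = frontier.pop()
        for relation, neighbour in ontology.get(concept, []):
            value = activation[concept] * weights.get(relation, 0.0) * decay
            # Keep only activations that are strong enough and that
            # improve on what the neighbour already has.
            if value >= threshold and value > activation.get(neighbour, 0.0):
                activation[neighbour] = value
                frontier.append(neighbour)
    return activation

# Hypothetical ontology fragment and relation weights.
ontology = {
    "car": [("is_a", "vehicle"), ("has_part", "wheel")],
    "vehicle": [("is_a", "machine")],
}
weights = {"is_a": 0.9, "has_part": 0.3}
```

Because every hop shrinks the activation, concepts far from the query or reached through weakly weighted relations fall below the threshold and are left out of the expanded query.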
Abstract: The paper describes the design of an ontology in the
financial domain for mutual funds. The design of this ontology
consists of four steps, namely, specification, knowledge acquisition,
implementation and semantic query. Specification includes a
description of the taxonomy and the different types of mutual funds and
their scope. Knowledge acquisition involves information
extraction from heterogeneous resources. Implementation describes
the conceptualization and encoding of this data. Finally, semantic
query permits complex queries over the integrated data by mapping
database entities to ontological concepts.
Abstract: This paper presents a dominant color descriptor
technique for medical image retrieval. The system
collects medical images and stores them in a medical database. The purpose of
the dominant color descriptor (DCD) technique is to retrieve medical
images and to display images similar to a queried image. First, the
technique searches for and retrieves a medical image based on a keyword
entered by the user. Once an image is found, the system assigns this
image as the queried image. The DCD technique then calculates the
dominant color value of the image. Next, the system searches for and retrieves
medical images based on the dominant color value of the queried image.
Finally, the system displays the images similar to the queried
image to the user. A simple application has been developed and tested
using the dominant color descriptor. Experimental results
indicate that this technique is effective and can be used for medical
image retrieval.
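The second retrieval stage described in the abstract above (compute a dominant color for the queried image, then rank stored images by it) can be sketched in Python. The sketch swaps in a simple uniform quantization to find the dominant color; the paper does not state how its DCD value is computed, so the quantization, the `levels` parameter, the Euclidean distance, and all function names are assumptions.

```python
from collections import Counter

def dominant_color(pixels, levels=4):
    """Quantize each RGB channel into `levels` bins and return the centre
    of the most frequent bin as the dominant color (assumed method)."""
    step = 256 // levels
    bins = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    (rb, gb, bb), _ = bins.most_common(1)[0]
    half = step // 2
    return (rb * step + half, gb * step + half, bb * step + half)

def color_distance(c1, c2):
    """Euclidean distance between two RGB colors."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

def most_similar(query_pixels, images, k=3):
    """Names of the k stored images whose dominant color is closest to
    that of the queried image. images: {name: list of (r, g, b)}."""
    q = dominant_color(query_pixels)
    ranked = sorted(images,
                    key=lambda name: color_distance(q, dominant_color(images[name])))
    return ranked[:k]
```

In a full system the keyword search would first select the queried image, and the dominant colors of stored images would be precomputed rather than recalculated on every query.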
Abstract: As the web continues to grow exponentially, the idea
of crawling the entire web on a regular basis becomes less and less
feasible, so domain-specific search engines, which cover information
on a specific domain, were proposed. As more information
becomes available on the World Wide Web, it becomes more difficult
to provide effective search tools for information access. Today,
people access web information through two main kinds of search
interfaces: Browsers (clicking and following hyperlinks) and Query
Engines (queries in the form of a set of keywords showing the topic
of interest) [2]. Better support is needed for expressing one's
information need and returning high quality search results by web
search tools. There appears to be a need for systems that do reasoning
under uncertainty and are flexible enough to recover from the
contradictions, inconsistencies, and irregularities that such reasoning
involves. In a multi-view problem, the features of the domain can be
partitioned into disjoint subsets (views) that are sufficient to learn the
target concept. Semi-supervised, multi-view algorithms, which
reduce the amount of labeled data required for learning, rely on the
assumptions that the views are compatible and uncorrelated. This
paper describes the use of a semi-structured machine learning approach
with active learning for domain-specific search engines. A
domain-specific search engine is “an information access system that
allows access to all the information on the web that is relevant to a
particular domain”. The proposed work shows that with the help of
this approach relevant data can be extracted with a minimum number of
queries fired by the user. It requires a small amount of labeled data and a
pool of unlabeled data on which the learning algorithm is applied to
extract the required data.
Abstract: National Biodiversity Database System (NBIDS) has
been developed for collecting Thai biodiversity data. The goal of this
project is to provide advanced tools for querying, analyzing,
modeling, and visualizing patterns of species distribution for
researchers and scientists. NBIDS records two types of datasets:
biodiversity data and environmental data. Biodiversity data are
species presence data and species status. The attributes of biodiversity
data can be further classified into two groups: universal and project-specific
attributes. Universal attributes are attributes that are common
to all of the records, e.g., X/Y coordinates, year, and collector name.
Project-specific attributes are attributes that are unique to one or a
few projects, e.g., flowering stage. Environmental data include
atmospheric data, hydrology data, soil data, and land cover data
collected using GLOBE protocols. We have developed web-based
tools for data entry. Google Earth KML and ArcGIS were used
as tools for map visualization. webMathematica was used for simple
data visualization and also for advanced data analysis and
visualization, e.g., spatial interpolation and statistical analysis.
NBIDS will be used by park rangers at Khao Nan National Park and
by researchers.
Abstract: Object localization is one of the major
challenges in creating intelligent transportation. Unfortunately, in
densely built-up urban areas, localization based on GPS alone
produces a large error or simply becomes impossible. New
opportunities for localization arise from the rapidly emerging
concept of the wireless ad-hoc network. Such a network allows
the distance between objects to be estimated by measuring the
received signal level, and a graph of distances to be constructed in which
the nodes are the objects to be localized and the edges are estimates of the
distances between pairs of nodes. Given the known coordinates of
individual nodes (anchors), it is possible to determine the location of
all (or part) of the remaining nodes of the graph. Moreover, a road
map available in digital format can provide localization routines
with valuable additional information to narrow the node location search.
However, despite the abundance of well-known algorithms for solving
the localization problem and significant research efforts, there are
still many issues that are currently addressed only partially. In this
paper, we propose a localization approach based on mapping the graph of
distances onto digital road map data. In effect, the problem is
reduced to embedding the distance graph into the graph representing the area's
geolocation data. This makes it possible to localize objects, in some cases
even if only one reference point is available. We propose a simple
embedding algorithm and a sample implementation as spatial queries
over sensor network data stored in a spatial database, which allows
effective use of spatial indexing, optimized spatial search
routines and geometry functions.
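The remark in the abstract above that a single reference point can sometimes suffice can be illustrated with a much-simplified Python sketch: if a node is known to lie on a road and its distance to one anchor has been estimated, its candidate positions are the points where a circle around the anchor crosses the road. This is only a special case of the embedding idea, not the paper's algorithm; planar coordinates, the polyline road model, and the function name are assumptions.

```python
import math

def candidates_on_road(anchor, distance, road, tol=1e-9):
    """Points on a road polyline that lie at `distance` from `anchor`.

    anchor: (x, y); road: list of (x, y) vertices in planar coordinates.
    A crossing exactly at a shared vertex may be reported twice."""
    ax, ay = anchor
    points = []
    for (x1, y1), (x2, y2) in zip(road, road[1:]):
        dx, dy = x2 - x1, y2 - y1
        fx, fy = x1 - ax, y1 - ay
        a = dx * dx + dy * dy
        if a == 0:
            continue  # degenerate segment
        # Solve |P1 + t * (P2 - P1) - anchor|^2 = distance^2 for t.
        b = 2 * (fx * dx + fy * dy)
        c = fx * fx + fy * fy - distance * distance
        disc = b * b - 4 * a * c
        if disc < 0:
            continue  # the circle misses this segment's line
        root = math.sqrt(disc)
        for t in sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)}):
            if -tol <= t <= 1 + tol:
                points.append((x1 + t * dx, y1 + t * dy))
    return points
```

When the circle crosses the road network only once, the node is localized from a single anchor; with several crossings, additional distance edges in the graph would be needed to choose between the candidates.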
Abstract: Graphs have become increasingly important in modeling
complicated structures and schemaless data such as proteins, chemical
compounds, and XML documents. Given a graph query, it is desirable
to retrieve graphs quickly from a large database via graph-based
indices. Unlike existing methods, our approach, called
VFM (Vertex to Frequent Feature Mapping), makes use of vertices
and decision features as the basic indexing features. VFM constructs
two mappings between vertices and frequent features to answer graph
queries. The VFM approach not only provides an elegant solution to
the graph indexing problem, but also demonstrates how database
indexing and query processing can benefit from data mining,
especially frequent pattern mining. The results show that the proposed
method not only avoids enumerating the subgraphs
of the query graph, but also effectively reduces the subgraph isomorphism
tests between the query graph and the graphs in the candidate answer set in
the verification stage.
Abstract: Processing data by computers and performing
reasoning tasks is an important aim in Computer Science. The Semantic
Web is one step towards it. The use of ontologies to enhance
information semantically is the current trend. A huge amount of
domain-specific, unstructured on-line data needs to be expressed in a
machine-understandable and semantically searchable format.
Currently users are often forced to search manually through the results
returned by keyword-based search services. They also want to use
their native languages to express what they are searching for. In this paper, an
ontology-based automated question answering system for the software
test documents domain is presented. The system allows users to enter
a question about the domain in natural language and
returns the exact answer to the question. Converting the natural
language question into an ontology-based query is the challenging
part of the system. To achieve this, a new algorithm
for converting free text into an ontology-based search engine query
is proposed. The algorithm is based on determining the suitable
question type and parsing the words of the question sentence.
Abstract: In today's information age, a number of organizations
are still debating how to capitalize on the value of Information Technology
(IT) and Knowledge Management (KM) so that individuals can
benefit and effective communication among them can
be established. IT enables positive improvement in
communication among knowledge workers (k-workers) through a
number of social network technology domains in the workplace. The
acceptance of digital discourse for sharing knowledge and
facilitating knowledge and information flows in most
organizations indeed imposes a culture of knowledge sharing in
Digital Social Networks (DSN). Therefore, this study examines
whether k-workers with an IT background have an effect on
the three knowledge characteristics: conceptual, contextual, and
operational. Derived from these three knowledge characteristics, five
potential factors will be examined for their effects on knowledge
exchange, with the e-mail domain as the chosen query. It is expected that
the results could provide a parameter for exploring how DSN
contributes to supporting the k-workers' virtues, performance and
qualities, as well as revealing the mutual point between IT and KM.
Abstract: The purpose of this paper is to propose a framework for constructing correct parallel processing programs based on the Equivalent Transformation Framework (ETF). In the framework, a problem's domain knowledge and a query are described in definite clauses, and computation is regarded as transformation of the definite clauses. Its meaning is defined by a model of the set of definite clauses, and the transformation rules generated must preserve meaning. We have proposed a parallel processing method based on “specialization”, a part of the operation in the transformations, which resembles substitution in logic programming. The method requires a “Memo-tree”, a history of specialization, to maintain correctness. In this paper we propose a new method for specialization-based parallel processing without a Memo-tree.
Abstract: Annotation of a protein sequence is pivotal for the understanding of its function. The accuracy of manual annotation provided by curators is still questionable because of its weaker evidence strength, and manual annotation is a hard and time-consuming task. A number of computational methods, including tools, have been developed to tackle this challenging task. However, they require high-cost hardware, are difficult for bioscientists to set up, or depend on time-intensive and blind sequence similarity searches such as the Basic Local Alignment Search Tool. This paper introduces a new method of assigning highly correlated Gene Ontology terms of annotated protein sequences to partially annotated or newly discovered protein sequences. This method is fully based on Gene Ontology data and annotations. Two problems had to be solved to achieve this method. The first problem relates to splitting the single monolithic Gene Ontology RDF/XML file into a set of smaller files that are easy to assess and process. These files can then be enriched with protein sequences and Inferred from Electronic Annotation evidence associations. The second problem involves searching for a set of Gene Ontology terms that are semantically similar to a given query. The details of the macro and micro problems involved and their solutions, together with the objective of this study, are described. This paper also describes protein sequence annotation and the Gene Ontology. The methodology of this study and a Gene Ontology based protein sequence annotation tool, namely extended UTMGO, are presented. Furthermore, its basic version, a Gene Ontology browser based on semantic similarity search, is also introduced.
Abstract: In this paper, a model for an information retrieval
system is proposed which takes into account that knowledge about
documents and the information needs of users are dynamic. Two
methods are combined, one qualitative or symbolic and the other
quantitative or numeric, which are deemed suitable for many
clustering contexts, data analysis, concept exploring and
knowledge discovery. These two methods may be classified as
inductive learning techniques. In this model, they are introduced to
build “long term” knowledge about past queries and concepts in a
collection of documents. The “long term” knowledge can guide
and assist the user in formulating an initial query and can be
exploited in the process of retrieving relevant information. The
different kinds of knowledge are organized in different points of
view. This may be considered an enrichment of the exploration
level which is coherent with the concept of document/query
structure.