Abstract: Web usage mining is an interesting application of data
mining which provides insight into customer behaviour on the Internet. An important technique to discover user access and navigation trails is based on sequential pattern mining. One of the
key challenges for web access pattern mining is tackling the problem
of mining richly structured patterns. This paper proposes a novel
model called Web Access Patterns Graph (WAP-Graph) to represent all of the access patterns from web mining graphically. WAP-Graph
also motivates the search for new structural relation patterns, i.e. Concurrent Access Patterns (CAP), to identify and predict more
complex web page requests. Corresponding CAP mining and modelling methods are proposed and shown to be effective in the
search for and representation of concurrency between access patterns
on the web. From experiments conducted on large-scale synthetic
sequence data as well as real web access data, it is demonstrated that
CAP mining provides a powerful method for structural knowledge discovery, which can be visualised through the CAP-Graph model.
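Sequential pattern mining, on which WAP and CAP mining build, centers on counting how many access sequences contain a candidate pattern as an ordered subsequence. A minimal sketch of support counting (the sessions and the pattern are hypothetical examples, not the paper's data):

```python
def is_subsequence(pattern, sequence):
    """Check whether pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(page in it for page in pattern)  # 'in' consumes the iterator

def support(pattern, sessions):
    """Fraction of sessions containing the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sessions) / len(sessions)

# Hypothetical web access sessions (page identifiers).
sessions = [
    ["home", "search", "item", "cart"],
    ["home", "item", "cart"],
    ["search", "home", "item"],
]
print(support(["home", "item"], sessions))  # 1.0: every session visits home then item
```

Patterns whose support exceeds a user-chosen threshold are the frequent access patterns from which a WAP-Graph can be assembled.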
Abstract: Expression data analysis is based mostly on
statistical approaches that are indispensable for the study of
biological systems. Large amounts of multidimensional data resulting
from the high-throughput technologies are not completely served by
biostatistical techniques and are usually complemented with visual,
knowledge discovery and other computational tools. In many cases,
in biological systems we only speculate on the processes that are
causing the changes, and it is the visual explorative analysis of data
during which a hypothesis is formed. We would like to show the
usability of multidimensional visualization tools and promote their
use in the life sciences. We survey and show some of the
multidimensional visualization tools in the process of data
exploration, such as parallel coordinates and RadViz, and we extend
them by combining them with the self-organizing map algorithm. We
use a time course data set of transitional cell carcinoma of the bladder
in our examples. Analysis of data with these tools has the potential to
uncover additional relationships and non-trivial structures.
Abstract: Automated discovery of hierarchical structures in
large data sets has been an active research area in the recent past.
This paper focuses on the issue of mining generalized rules with a crisp
hierarchical structure using a Genetic Programming (GP) approach to
knowledge discovery. The post-processing scheme presented in this
work uses flat rules as initial individuals of GP and discovers
hierarchical structure. Suitable genetic operators are proposed for the
suggested encoding. Based on the Subsumption Matrix (SM), an
appropriate fitness function is suggested. Finally, Hierarchical
Production Rules (HPRs) are generated from the discovered
hierarchy. Experimental results are presented to demonstrate the
performance of the proposed algorithm.
Abstract: The drug discovery process starts with protein
identification because proteins are responsible for many functions
required for the maintenance of life. Protein identification further requires
determination of protein function. The proposed method develops a
classifier for human protein function prediction. The model uses a
decision tree for the classification process. The protein function is
predicted on the basis of matched sequence-derived features for each
protein function. The research work includes the development of a
tool which determines sequence-derived features by analyzing
different parameters. The other sequence-derived features are
determined using various web-based tools.
Abstract: Although the Internet is a mass medium that has
become quite common in recent years, the relationship of
advertisement with Television and Cinema, which have always
drawn the attention of researchers as basic media where visual use is
in the foreground, has also become the subject of various studies.
Based on the assumption that the known fundamental effects of
advertisements on consumers are closely related to the creative
process of advertisements as well as the nature and characteristics of
the medium where they are used, these basic mass media (Television
and Cinema) and the consumer motivations for the advertisements
they broadcast have become a focus of study.
Given that the viewers of the mass media in question have shifted
from a passive position to a more active one, especially in recent years,
and approach the content of advertisements, as they do all content, in a
more critical and “pitiless” manner, it is possible to say that
individuals make more use of advertisements than in the past and
combine their individual goals with the goals of the advertisements.
This study, which aims at finding out what the goals of this new
individual advertisement use are, how they are shaped by the distinct
characteristics of Television and Cinema, where visuality takes
precedence as basic mass media, and what kind of place they occupy
in the minds of consumers, has determined consumers' motivations
as: “Entertainment”, “Escapism”, “Play”, “Monitoring/Discovery”,
“Opposite Sex” and “Aspirations and Role Models”.
This study intends to reveal the differences or similarities among
the needs and hence the gratifications of viewers who consume
advertisements on Television or at the Cinema, which are two basic
media where visuality is prioritized.
Abstract: Sensitive and predictive DILI (Drug Induced Liver
Injury) biomarkers are needed in drug R&D to improve early
detection of hepatotoxicity. The discovery of DILI biomarkers that
demonstrate the predictive power to identify individuals at risk to
DILI would represent a major advance in the development of
personalized healthcare approaches. In this healthy volunteer
acetaminophen study (4 g/day for 7 days, with 3 monitored non-treatment
days before and 4 after), 450 serum samples from 32
subjects were analyzed using protein profiling by antibody
suspension bead arrays. Multiparallel protein profiles were generated
using a DILI target protein array with 300 antibodies, where the
antibodies were selected based on previous literature findings of
putative DILI biomarkers and a screening process using pre-dose
samples from the same cohort. Of the 32 subjects, 16 were found to
develop an elevated ALT value (2× baseline; responders). Using the
plasma profiling approach together with multivariate statistical
analysis, some novel findings linked to lipid metabolism were made
and, more importantly, endogenous protein profiles in baseline samples
(prior to treatment) with predictive power for ALT elevations were
identified.
Abstract: A Censored Production Rule (CPR) is an extension of the standard
production rule that addresses the problems of reasoning with
incomplete information subject to resource constraints and of
reasoning efficiently with exceptions. A CPR has the form: IF A
(Condition) THEN B (Action) UNLESS C (Censor), where C is the
exception condition. Fuzzy CPRs are obtained by augmenting the
ordinary fuzzy production rule “If X is A then Y is B” with an
exception condition and are written in the form “If X is A then Y is B
Unless Z is C”. Such rules are employed in situations in which the
fuzzy conditional statement “If X is A then Y is B” holds frequently
and the exception condition “Z is C” holds rarely. Thus the “If X is A
then Y is B” part of the fuzzy CPR expresses important information,
while the Unless part acts only as a switch that changes the polarity of
“Y is B” to “Y is not B” when the assertion “Z is C” holds. The
proposed approach is an attempt to discover fuzzy censored
production rules from a set of discovered fuzzy if-then rules in the
form:
A(X) → B(Y) || C(Z).
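The switch semantics of the Unless part can be illustrated with a small sketch. The membership degrees and the 0.5 threshold for "holds" are illustrative assumptions, not values from the paper:

```python
def fuzzy_cpr(mu_a, mu_c, holds=0.5):
    """Evaluate 'If X is A then Y is B Unless Z is C' (illustrative semantics).

    mu_a: membership degree of 'X is A'; mu_c: membership degree of 'Z is C'.
    Returns the conclusion and its degree; when the censor 'Z is C' holds
    (membership above the threshold), the polarity flips to 'Y is not B'.
    """
    mu_b = mu_a                   # fire the rule: B inherits A's degree
    if mu_c > holds:              # censor holds: switch polarity
        return ("Y is not B", 1.0 - mu_b)
    return ("Y is B", mu_b)

print(fuzzy_cpr(mu_a=0.8, mu_c=0.1))  # censor rarely holds -> ('Y is B', 0.8)
print(fuzzy_cpr(mu_a=0.8, mu_c=0.9))  # censor holds: polarity flips to 'Y is not B'
```

This matches the abstract's description of the Unless part as a polarity switch rather than an additional condition to be satisfied.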
Abstract: The purpose of this paper is to develop models that would enable predicting student success. These models could improve the allocation of students among colleges and optimize the newly introduced model of government subsidies for higher education. To collect data, an anonymous survey was carried out among the final-year undergraduate student population using a random sampling method. Decision trees were created, of which the two most successful in predicting student success were chosen, based on two criteria: Grade Point Average (GPA) and the time a student needs to finish the undergraduate program (time-to-degree). Decision trees have been shown to be a good method for classifying student success, and they could be further improved by increasing the survey sample and developing specialized decision trees for each type of college. These methods have great potential for use in decision support systems.
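The core of building such a decision tree is choosing the survey attribute that best separates the outcome classes, commonly by information gain. A minimal sketch with hypothetical survey records (the attributes and classes are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label list."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, label="gpa_class"):
    """Entropy reduction from splitting the records on one survey attribute."""
    base = entropy([r[label] for r in records])
    split = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[label] for r in records if r[attr] == value]
        split += len(subset) / len(records) * entropy(subset)
    return base - split

# Hypothetical survey answers with a GPA outcome class.
students = [
    {"study_hours": "high", "employed": "no",  "gpa_class": "good"},
    {"study_hours": "high", "employed": "yes", "gpa_class": "good"},
    {"study_hours": "low",  "employed": "no",  "gpa_class": "poor"},
    {"study_hours": "low",  "employed": "yes", "gpa_class": "poor"},
]
# study_hours separates the classes perfectly; employed not at all.
print(information_gain(students, "study_hours"))  # 1.0
print(information_gain(students, "employed"))     # 0.0
```

The tree is grown by splitting on the highest-gain attribute at each node; the same machinery applies to the time-to-degree criterion with a different label column.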
Abstract: This paper describes the NEAR (Navigating Exhibitions, Annotations and Resources) panel, a novel interactive visualization technique designed to help people navigate and interpret groups of resources, exhibitions and annotations by revealing hidden relations such as similarities and references. NEAR is implemented on A•VI•RE, an extended online information repository. A•VI•RE supports a semi-structured collection of exhibitions containing various resources and annotations. Users are encouraged to contribute, share, annotate and interpret resources in the system by building their own exhibitions and annotations. However, it is hard to navigate smoothly and efficiently in A•VI•RE because of its high capacity and complexity. We present a visual panel that implements new navigation and communication approaches that support discovery of implied relations. By quickly scanning and interacting with NEAR, users can see not only implied relations but also potential connections among different data elements. NEAR was tested by several users in the A•VI•RE system and shown to be a supportive navigation tool. In the paper, we further analyze the design, report the evaluation and consider its usage in other applications.
Abstract: Frequent pattern discovery over a data stream is a hard
problem because the continuously generated nature of a stream does not
allow revisiting each data element. Furthermore, the pattern discovery
process must be fast to produce timely results. Based on these
requirements, we propose an approximate approach to tackle the
problem of discovering frequent patterns over a continuous stream.
Our approximation algorithm is intended to be applied to process a
stream prior to the pattern discovery process. The results of
approximate frequent pattern discovery are reported in the
paper.
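The abstract does not name its approximation technique; a standard baseline for one-pass approximate frequency counting over a stream is Lossy Counting, sketched here with an illustrative error bound:

```python
from math import ceil

def lossy_count(stream, epsilon=0.1):
    """Lossy Counting: one-pass approximate item frequencies.

    Counts are undercounted by at most epsilon * N, so every item whose
    true frequency exceeds epsilon * N survives in the summary.
    """
    width = ceil(1 / epsilon)        # bucket width
    counts, deltas = {}, {}          # delta = maximum possible undercount
    for n, item in enumerate(stream, start=1):
        bucket = ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1
        if n % width == 0:           # bucket boundary: prune infrequent items
            for key in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[key], deltas[key]
    return counts

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "a"]
print(lossy_count(stream, epsilon=0.2))  # -> {'a': 6}
```

A summary like this, maintained while the stream arrives, is the kind of preprocessing stage the abstract describes applying before the full pattern discovery process.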
Abstract: Nowadays social media are important tools for web
resource discovery. The performance and capabilities of web searches
are vital, especially search results from social research paper
bookmarking. This paper proposes a new ranking algorithm,
CSTRank, that combines similarity ranking with paper posted
time. The paper posted time is a static ranking for
improving search results. In this study, the paper posted
time is combined with similarity ranking to produce a better ranking
than methods such as similarity ranking alone (SimRank). The
retrieval performance of the combined rankings is evaluated using
mean NDCG values. The experiments indicate
that CSTRank, using weight scores at a ratio of 90:10,
can improve the efficiency of research paper searching on social
bookmarking websites.
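The combination the abstract describes can be sketched as a weighted sum of the similarity score and a recency (posted-time) score at the reported 90:10 ratio, with NDCG as the evaluation metric. The paper scores below are hypothetical:

```python
from math import log2

def cst_score(similarity, recency, w_sim=0.9):
    """Combine similarity with a posted-time recency score (both in [0, 1])."""
    return w_sim * similarity + (1 - w_sim) * recency

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain of a ranked relevance list."""
    rel = relevances[:k]
    dcg = sum(r / log2(i + 2) for i, r in enumerate(rel))
    ideal = sum(r / log2(i + 2) for i, r in enumerate(sorted(rel, reverse=True)))
    return dcg / ideal if ideal else 0.0

# Hypothetical papers: (id, similarity to query, recency score, graded relevance).
papers = [("p1", 0.9, 0.2, 3), ("p2", 0.7, 0.9, 1), ("p3", 0.6, 0.8, 2)]
ranked = sorted(papers, key=lambda p: cst_score(p[1], p[2]), reverse=True)
print([p[0] for p in ranked])          # ranking under the 90:10 weighting
print(ndcg([p[3] for p in ranked]))    # quality of that ranking
```

Averaging `ndcg` over a query set gives the mean NDCG figure the abstract uses to compare CSTRank against similarity ranking alone.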
Abstract: Knowledge of an organization does not merely reside
in structured form of information and data; it is also embedded in
unstructured form. The discovery of such knowledge is particularly
difficult as it is dynamic, scattered, massive and
multiplying at high speed. Conventional methods of managing
unstructured information are considered too resource-demanding and
time-consuming to cope with the rapid information growth.
In this paper, a Multi-faceted and Automatic Knowledge
Elicitation System (MAKES) is introduced for the purpose of
discovery and capture of organizational knowledge. A trial
implementation has been conducted in a public organization to
achieve the objective of decision capture and navigation from a
number of meeting minutes, which are autonomously organized,
classified and presented in a multi-faceted taxonomy map at both
the document and content levels. Key concepts such as critical decisions
made, key knowledge workers, knowledge flows and the relationships
among them are elicited and displayed in predefined knowledge
models and maps. Hence, the structured knowledge can be retained,
shared and reused.
Conducting Knowledge Management with MAKES reduces the work
of searching for and retrieving the target decision, saves a great deal of
time and manpower, and also enables an organization to keep pace
with the knowledge life cycle. This is particularly important when
the amount of unstructured information and data grows extremely
quickly. This system approach of knowledge management can
accelerate value extraction and creation cycles of organizations.
Abstract: In the current Grid environment, efficient workload
management presents a significant challenge, for which there is an
exorbitant number of de facto standards encompassing resource discovery,
brokerage, and data transfer, among others. In addition, the real-time
resource status, essential for an optimal resource allocation strategy,
is often not readily accessible. To address these issues and provide a
cleaner abstraction of the Grid with the potential of generalizing to
arbitrary resource-sharing environments, this paper proposes a new
Condor-based pilot mechanism applied in the PanDA architecture,
PanDA-PF WMS, with the goal of providing a more generic yet
efficient resource allocation strategy. In this architecture, the PanDA
server primarily acts as a repository of user jobs, responding to pilot
requests from distributed, remote resources. Scheduling decisions are
subsequently made according to the real-time resource information
reported by pilots. Pilot Factory is a Condor-inspired solution for
scalable pilot dissemination and effectively functions as a resource
provisioning mechanism through which the user-job server, PanDA,
reaches out to the candidate resources only on demand.
Abstract: In this paper, a model for an information retrieval
system is proposed which takes into account that knowledge about
documents and the information needs of users are dynamic. Two
methods are combined, one qualitative or symbolic and the other
quantitative or numeric, which are deemed suitable for many
clustering contexts, data analysis, concept exploration and
knowledge discovery. These two methods may be classified as
inductive learning techniques. In this model, they are introduced to
build “long term” knowledge about past queries and concepts in a
collection of documents. The “long term” knowledge can guide
and assist the user in formulating an initial query and can be
exploited in the process of retrieving relevant information. The
different kinds of knowledge are organized into different points of
view. This may be considered an enrichment of the exploration
level which is coherent with the concept of document/query
structure.
Abstract: This paper proposes a method that discovers time series event patterns from textual data with time information. The patterns are composed of sequences of events, each event being extracted from the textual data, where an event is characteristic content included in the textual data, such as a company name, an action, or an impression of a customer. The method introduces 7 types of time constraints based on the analysis of the textual data. The method also evaluates these constraints when the frequency of a time series event pattern is calculated. We can flexibly define time constraints for interesting combinations of events and can discover valid time series event patterns which satisfy these conditions. The paper applies the method to daily business reports collected by a sales force automation system and verifies its effectiveness through numerical experiments.
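A time constraint of the kind the method evaluates, for instance a maximum gap between consecutive events, can be sketched when checking pattern occurrences. The report events and the 3-day gap are illustrative assumptions, and the greedy left-to-right scan is a simplification of full occurrence counting:

```python
def occurs(pattern, events, max_gap):
    """Check whether pattern (event types, in order) occurs in a timestamped
    event sequence with each consecutive gap at most max_gap.

    events: list of (time, event_type) sorted by time.
    Greedy scan: matches the earliest candidate for each pattern step.
    """
    last_time, i = None, 0
    for time, kind in events:
        if i < len(pattern) and kind == pattern[i]:
            if last_time is not None and time - last_time > max_gap:
                return False  # gap constraint violated by the greedy match
            last_time, i = time, i + 1
    return i == len(pattern)

# One salesperson's timestamped report events (day number, event type).
events = [(1, "visit"), (2, "demo"), (9, "order")]
print(occurs(["visit", "demo"], events, max_gap=3))   # True: gap of 1 day
print(occurs(["demo", "order"], events, max_gap=3))   # False: gap of 7 days
```

Counting how many reports satisfy such a constrained occurrence yields the constrained frequency on which the method's pattern validity is judged.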
Abstract: In the last few years, the Semantic Web has gained scientific acceptance as a means of identifying relationships in knowledge bases, widely known as semantic associations. Querying complex relationships between entities is a strong requirement for many applications in analytical domains. In bioinformatics, for example, it is critical to extract exchanges between proteins. Currently, the widely known result of such queries is to provide paths between connected entities in the data graph. However, they do not always give good results when the user needs the best association, or a limited set of best associations, because they consider all existing paths but ignore path evaluation. In this paper, we present an approach for supporting association discovery queries. Our proposal includes (i) a query language, PmSPRQL, which provides multiparadigm query expressions for association extraction, and (ii) quantification measures that ease the process of association ranking. The originality of our proposal is demonstrated by a performance evaluation of our approach on real-world datasets.
Abstract: This research presents a system for post processing of
data that takes mined flat rules as input and discovers crisp as well as
fuzzy hierarchical structures using Learning Classifier System
approach. Learning Classifier System (LCS) is basically a machine
learning technique that combines evolutionary computing,
reinforcement learning, supervised or unsupervised learning and
heuristics to produce adaptive systems. A LCS learns by interacting
with an environment from which it receives feedback in the form of
numerical reward. Learning is achieved by trying to maximize the
amount of reward received. A crisp description of a concept usually
cannot represent human knowledge completely and practically. In the
proposed Learning Classifier System, the initial population is constructed
as a random collection of HPR-trees (related production rules) and
crisp/fuzzy hierarchies are evolved. A fuzzy subsumption relation is
suggested for the proposed system and, based on the Subsumption Matrix
(SM), a suitable fitness function is proposed. Suitable genetic
operators are proposed for the chosen chromosome representation
method. For implementing reinforcement, a suitable reward and
punishment scheme is also proposed. Experimental results are
presented to demonstrate the performance of the proposed system.
Abstract: Biological data has several characteristics that strongly differentiate it from typical business data. It is much more complex, usually large in size, and continuously changing. Until recently, business data has been the main target for discovering trends, patterns or future expectations. However, with the recent rise in biotechnology, the powerful technology that was used for analyzing business data is now being applied to biological data. With the advanced technology at hand, the main trend in biological research is rapidly changing from structural DNA analysis to understanding the cellular functions of DNA sequences. DNA chips are now being used to perform experiments, and DNA analysis processes are being used by researchers. Clustering is one of the important processes used for grouping together similar entities. There are many clustering algorithms, such as hierarchical clustering, self-organizing maps, K-means clustering and so on. In this paper, we propose a clustering algorithm that imitates the ecosystem, taking into account the features of biological data. We implemented the system using an Ant-Colony clustering algorithm. The system decides the number of clusters automatically. The system processes the input biological data, runs the Ant-Colony algorithm, draws the Topic Map, assigns clusters to the genes and displays the output. We tested the algorithm with test data of 100 to 1000 genes and 24 samples and show promising results for applying this algorithm to clustering DNA chip data.
Abstract: Chronic hepatitis B can evolve to cirrhosis and liver
cancer. Interferon is the only effective treatment, for carefully selected
patients, but it is very expensive. Some of the selection criteria are
based on liver biopsy, an invasive, costly and painful medical procedure.
Therefore, developing efficient non-invasive selection systems
would be in the patients' benefit and would also save money. We investigated
the possibility of creating intelligent systems to assist the Interferon
therapeutic decision, mainly by predicting with acceptable accuracy
the results of the biopsy. We applied knowledge discovery to integrated
medical data (imaging, clinical, and laboratory data). The resulting
intelligent systems, tested on 500 patients with chronic hepatitis
B and based on C5.0 decision trees and boosting, predict with 100%
accuracy the results of the liver biopsy. Also, by integrating the other
patient selection criteria, they offer non-invasive support for the
correct Interferon therapeutic decision. To the best of our knowledge, these
decision systems outperform all similar systems published in the
literature, and offer a realistic opportunity to replace liver biopsy in
this medical context.
Abstract: The theory of Groebner Bases, which has recently been
honored with the ACM Paris Kanellakis Theory and Practice Award,
has become a crucial building block to computer algebra, and is
widely used in science, engineering, and computer science. It is well-known
that Groebner basis computation is EXPSPACE in the general
setting. In this paper, we give an algorithm to show that Groebner
basis computation is PSPACE in Boolean rings. We also show that,
with this discovery, the Groebner bases method can theoretically be
as efficient as other methods for the automated verification of hardware
and software. Additionally, Groebner bases have many useful and interesting
properties, including the ability to efficiently convert bases between
different orders of variables, making Groebner bases a promising
method in automated verification.
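In a Boolean ring, coefficients lie in GF(2) and every variable satisfies x² = x, which keeps polynomials, and hence Groebner basis computations, compact. A minimal sketch of this arithmetic (the set-based representation is an illustrative choice, not the paper's algorithm):

```python
# A Boolean polynomial as a set of monomials; a monomial as a frozenset of
# variables (so the monomial product is a set union, and x*x = x for free).
def add(p, q):
    """Addition in GF(2): XOR (symmetric difference) of the monomial sets."""
    return p ^ q

def mul(p, q):
    """Multiplication: cross products of monomials, folded mod 2 via XOR."""
    result = set()
    for m1 in p:
        for m2 in q:
            result ^= {m1 | m2}   # union enforces x^2 = x; XOR folds coefficients
    return result

ONE = frozenset()                 # the empty monomial is the constant 1
x = {frozenset({"x"})}            # the polynomial x
x_plus_1 = {frozenset({"x"}), ONE}

# (x + 1)^2 = x^2 + 2x + 1 = x + 1 in a Boolean ring (char 2, x^2 = x).
print(mul(x_plus_1, x_plus_1) == x_plus_1)  # True
```

Because squares collapse and coefficients cancel in pairs, intermediate polynomials stay multilinear, which is the structural reason a space-efficient Groebner basis procedure is plausible in this setting.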