Abstract: The goal of data mining algorithms is to discover useful information embedded in large databases. One of the most important data mining problems is the discovery of frequently occurring patterns in sequential data. In a multidimensional sequence, each event depends on more than one dimension. The search space is quite large, and serial algorithms do not scale to very large datasets. To address this, it is necessary to study scalable parallel implementations of sequence mining algorithms. In this paper, we present a model for multidimensional sequences and describe a parallel algorithm based on data parallelism. Simulation experiments show good load balancing and scalable, acceptable speedup across different numbers of processors and problem sizes, and demonstrate that our approach can work efficiently in a real parallel computing environment.
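As a rough illustration of the data-parallel strategy described above (not the authors' algorithm), the sketch below partitions a sequence database across worker processes, counts candidate subsequences locally, and merges the counts; all names and the toy data are invented.

```python
# Minimal sketch of data parallelism for sequence mining: each worker
# counts candidate subsequences on its partition; counts are then merged.
from multiprocessing import Pool

def contains(sequence, pattern):
    """Check whether `pattern` occurs in `sequence` as a subsequence."""
    it = iter(sequence)
    return all(event in it for event in pattern)

def local_counts(args):
    partition, candidates = args
    return {p: sum(contains(s, p) for s in partition) for p in candidates}

def parallel_support(sequences, candidates, n_workers=4):
    chunk = (len(sequences) + n_workers - 1) // n_workers
    parts = [sequences[i:i + chunk] for i in range(0, len(sequences), chunk)]
    with Pool(n_workers) as pool:
        partials = pool.map(local_counts, [(p, candidates) for p in parts])
    merged = {p: 0 for p in candidates}        # global reduction of counts
    for d in partials:
        for p, c in d.items():
            merged[p] += c
    return merged

if __name__ == "__main__":
    seqs = [("a", "b", "c"), ("a", "c"), ("b", "c"), ("a", "b")]
    cands = [("a", "b"), ("a", "c"), ("b", "c")]
    print(parallel_support(seqs, cands, n_workers=2))
```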
Abstract: In an era of intense competition, understanding and satisfying customers' requirements are critical tasks for a company seeking to make a profit. Customer relationship management (CRM) has thus become an important business issue. With the help of data mining techniques, managers can explore and analyze large quantities of data to discover meaningful patterns and rules. Among these methods, the well-known association rule is the most commonly used. This paper builds on the Apriori algorithm and combines it with genetic algorithms in a data mining method to discover fuzzy classification rules. The mined results can be applied in CRM to help decision makers make correct business decisions about marketing strategies.
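To make the Apriori component concrete, here is a minimal sketch of the frequent-itemset step that such a method builds on (the genetic-algorithm and fuzzy-rule stages are not reproduced); the toy baskets are invented.

```python
# Minimal sketch of the Apriori frequent-itemset step on market-basket data.
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (as frozensets) with their support counts."""
    items = {frozenset([i]) for t in transactions for i in t}
    frequent, current = {}, items
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        current = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in current})
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        current = {a | b for a, b in combinations(current, 2)
                   if len(a | b) == len(a) + 1}
    return frequent

baskets = [frozenset(t) for t in
           [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"}]]
print(apriori(baskets, min_support=2))
```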
Abstract: In many data mining applications, it is known a priori that the target function should satisfy certain constraints imposed by, for example, economic theory or a human decision maker. In this paper we consider partially monotone prediction problems, where the target variable depends monotonically on some of the input variables but not on all. We propose a novel method to construct prediction models in which monotone dependencies with respect to some of the input variables are preserved by construction. Our method belongs to the class of mixture models. The basic idea is to convolute monotone neural networks with weight (kernel) functions to make predictions. Using simulation and real case studies, we demonstrate the application of our method. To obtain a sound assessment of the performance of our approach, we use standard neural networks with weight decay and partially monotone linear models as benchmark methods for comparison. The results show that our approach outperforms partially monotone linear models in terms of accuracy. Furthermore, the incorporation of partial monotonicity constraints not only leads to models that are in accordance with the decision maker's expertise, but also considerably reduces the model variance in comparison to standard neural networks with weight decay.
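One standard device for obtaining monotonicity by construction, which monotone neural networks of this kind plausibly rely on, is to constrain the weights attached to monotone inputs to be non-negative. The minimal sketch below uses squared parameters for this; all values are arbitrary, and it is an illustration of the device rather than the authors' architecture.

```python
# Sketch: a one-hidden-layer network made non-decreasing in its input
# by construction, using non-negative (squared) weights and an
# increasing activation. Parameter values here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)
W2, b2 = rng.normal(size=8), rng.normal()

def monotone_net(x):
    h = np.tanh(x * (W1 ** 2).ravel() + b1)   # non-negative input weights
    return h @ (W2 ** 2) + b2                 # non-negative output weights

xs = np.linspace(-3, 3, 7)
ys = [monotone_net(x) for x in xs]
assert all(a <= b + 1e-9 for a, b in zip(ys, ys[1:]))  # non-decreasing
```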
Abstract: Diabetes is one of the most prevalent diseases worldwide, with an increasing number of complications, retinopathy being one of the most common. This paper describes how data mining and case-based reasoning were integrated to predict retinopathy prevalence among diabetes patients in Malaysia. The required knowledge base was built from literature reviews and interviews with medical experts. Data from 140 diabetes patients were used to train the prediction system. A voting mechanism selects the best prediction result from the two techniques. The results show that both data mining and case-based reasoning can be used for retinopathy prediction, with an improved accuracy of 85%.
Abstract: Knowledge is indispensable, but voluminous knowledge becomes a bottleneck for efficient processing. A great challenge for data mining is the generation of a large number of potential rules as a result of the mining process; in fact, the result is sometimes comparable in size to the original data. Traditional data mining pruning criteria such as support do not sufficiently reduce the huge rule space. Moreover, many practical applications are characterized by continual change of data and knowledge, making the knowledge more voluminous with each change. The most predominant representation of discovered knowledge is the standard Production Rule (PR) of the form If P Then D. Michalski & Winston proposed Censored Production Rules (CPRs) as an extension of production rules that exhibits variable precision and supports an efficient mechanism for handling exceptions. A CPR is an augmented production rule of the form If P Then D Unless C, where C (the censor) is an exception to the rule. Such rules are employed in situations where the conditional statement 'If P Then D' holds frequently and the assertion C holds rarely. Using a rule of this type, we are free to ignore the exception conditions when the resources needed to establish their presence are tight or when there is simply no information available as to whether they hold. Thus the 'If P Then D' part of the CPR expresses important information, while the Unless C part acts only as a switch that changes the polarity of D to ~D. In this paper, a scheme based on a Dempster-Shafer Theory (DST) interpretation of CPRs is suggested for discovering CPRs from previously discovered flat PRs. The discovery of CPRs from flat rules results in a considerable reduction of the already discovered rules. The proposed scheme incrementally incorporates new knowledge and also reduces the size of the knowledge base considerably with each episode. Examples are given to demonstrate the behaviour of the proposed scheme. The suggested cumulative learning scheme would be useful in mining data streams.
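A minimal sketch of how a CPR behaves at inference time, covering the cases named above (premise false, censor unavailable, censor checkable); the bird/penguin example is the classic illustration of exception handling, not taken from the paper.

```python
# Sketch of evaluating a Censored Production Rule "If P Then D Unless C".
# When the censor C cannot be checked (resources tight / information
# unavailable), the rule falls back to the frequent conclusion D.
def apply_cpr(p_holds, censor_check=None):
    """Return D's polarity: True (D), False (~D), or None (rule silent)."""
    if not p_holds:
        return None               # premise fails: rule says nothing
    if censor_check is None:
        return True               # censor unavailable: assume D
    return not censor_check()     # censor holds: switch D to ~D

# "If bird Then flies Unless penguin"
print(apply_cpr(True))                 # True: assume it flies
print(apply_cpr(True, lambda: True))   # False: penguin, so ~flies
```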
Abstract: The vast amount of information hidden in huge databases has created tremendous interest in the field of data mining. This paper examines the possibility of using data clustering techniques in oral medicine to identify functional relationships between different attributes and to classify similar patient examinations. Commonly used data clustering algorithms are reviewed, and several interesting results are reported.
Abstract: Support vector machines (SVMs) have shown superior performance compared to other machine learning techniques, especially in classification problems. Yet one limitation of SVMs is their lack of an explanation capability, which is crucial in some applications, e.g. in the medical and security domains. In this paper, a novel approach for eclectic rule extraction from support vector machines is presented. This approach utilizes the knowledge acquired by the SVM and represented in its support vectors, as well as the parameters associated with them. The approach includes three stages: training, propositional rule extraction, and rule quality evaluation. Results from four different experiments demonstrate the value of the approach for extracting comprehensible rules of high accuracy and fidelity.
Abstract: Lung cancer accounts for the most cancer-related deaths for both men and women. The identification of cancer-associated genes and the related pathways is essential for the prevention of many types of cancer. In this work, two filter approaches, namely information gain and the biomarker identifier (BMI), are used for the identification of different types of small-cell and non-small-cell lung cancer. A new method to determine the BMI thresholds is proposed that prioritizes genes (i.e., primary, secondary and tertiary) using a k-means clustering approach. Sets of key genes were identified that can be found in several pathways. It turned out that the modified BMI is well suited for microarray data, and BMI is therefore proposed as a powerful tool for the search for new and so far undiscovered genes related to cancer.
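The thresholding idea can be illustrated as follows: rank genes by an information-gain-style score and let k-means on the scores form the three priority tiers. BMI itself is not reimplemented here, and the dataset is a stand-in for the microarray data.

```python
# Sketch: rank features (genes) by information gain, then cluster the
# scores with k-means to split them into primary/secondary/tertiary
# tiers, mirroring the thresholding idea (BMI itself omitted).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)      # stand-in for microarray data
scores = mutual_info_classif(X, y, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores.reshape(-1, 1))
order = np.argsort(-km.cluster_centers_.ravel())           # high score first
tier_name = {c: t for t, c in enumerate(order, start=1)}   # 1 = primary
for i in np.argsort(-scores)[:5]:
    print(f"gene {i}: score={scores[i]:.3f}, tier={tier_name[km.labels_[i]]}")
```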
Abstract: Artificial Intelligence (AI) methods are increasingly being used for problem solving. This paper concerns using AI-type learning machines for the power quality problem, which is of general interest to power systems that must provide quality power to all appliances. Electrical power of good quality is essential for the proper operation of electronic equipment such as computers and PLCs. Malfunction of such equipment may lead to loss of production or disruption of critical services, resulting in huge financial and other losses. It is therefore necessary that critical loads be supplied with electricity of acceptable quality. Recognizing the presence of a disturbance and classifying it into a particular type is the first step in combating the problem. In this work, two classes of AI methods for power quality data mining are studied: Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs). We show that SVMs are superior to ANNs in two critical respects: SVMs train and run an order of magnitude faster, and SVMs give higher classification accuracy.
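The comparison protocol can be sketched as follows, with a synthetic dataset standing in for the power-quality disturbance features; timings and accuracies will of course differ from the paper's.

```python
# Sketch of the ANN-vs-SVM comparison protocol: train both classifiers
# on the same labeled data and compare training time and accuracy.
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for name, model in [("ANN", MLPClassifier(max_iter=500, random_state=0)),
                    ("SVM", SVC())]:
    t0 = time.perf_counter()
    model.fit(Xtr, ytr)
    print(f"{name}: train {time.perf_counter() - t0:.2f}s, "
          f"accuracy {model.score(Xte, yte):.3f}")
```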
Abstract: This study analyzes the effect of discretization on the classification of datasets containing continuous-valued features. Six datasets from the UCI repository that contain continuous-valued features are discretized with an entropy-based discretization method. The performance on the datasets with original features and with discretized features is compared using the k-nearest neighbors, Naive Bayes, C4.5 and CN2 data mining classification algorithms. As a result, the classification accuracies on the six datasets improve on average by 1.71% to 12.31%.
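For concreteness, here is a minimal sketch of the entropy-based cut-point selection that such discretization methods use: a single binary split in the Fayyad-Irani style, without the recursive partitioning or stopping criterion.

```python
# Sketch of entropy-based cut-point selection: choose the threshold
# that minimizes the weighted class entropy of the induced binary split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        cost = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(pairs)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        best = min(best, (cost, cut))
    return best  # (weighted class entropy, cut point)

print(best_cut([1.0, 1.2, 3.1, 3.3, 5.0], ["a", "a", "b", "b", "b"]))
```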
Abstract: Nowadays, predicting the political risk level of a country has become a critical issue for investors who need accurate information concerning the stability of business environments. Since investors are most often laymen rather than professional IT personnel, this paper proposes a framework named GECR to help non-experts discover political risk stability over time based on political news and events. To achieve this goal, the Bayesian Network approach was applied to a sample dataset of 186 political news items from Pakistan. Bayesian Networks, an artificial intelligence approach, were employed in the presented framework because they are a powerful technique for modeling uncertain domains. The results showed that our framework, with Bayesian Networks as the decision support tool, predicted the political risk level with a high degree of accuracy.
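A toy sketch of the kind of inference a Bayesian network performs; the two-node structure and all probabilities below are invented purely for illustration and are not taken from GECR.

```python
# Toy Bayesian-network style inference: P(risk | evidence) by
# enumeration over a two-node network with invented probabilities.
P_risk = {"high": 0.3, "low": 0.7}            # prior over risk level
P_unrest_given = {"high": 0.8, "low": 0.2}    # P(unrest news | risk)

def posterior(unrest_observed=True):
    joint = {r: P_risk[r] * (P_unrest_given[r] if unrest_observed
                             else 1 - P_unrest_given[r]) for r in P_risk}
    z = sum(joint.values())
    return {r: p / z for r, p in joint.items()}

print(posterior(True))   # {'high': ~0.632, 'low': ~0.368}
```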
Abstract: Data mining and knowledge engineering have become difficult tasks due to the large amount of data available on the web nowadays. The validity and reliability of data have also become a main debate in knowledge acquisition. In addition, acquiring knowledge from different languages has become another concern. Many language translators and corpora have been developed, but their functions are usually limited to certain languages and domains. Furthermore, results from search engines with the traditional 'keyword' approach are no longer satisfying; more intelligent knowledge engineering agents are needed. To address these problems, a system known as the Multilingual Word Semantic Network is proposed. This system adapts a semantic network to organize words according to concepts and relations. The system also follows an open-source development philosophy to enable native language speakers and experts to contribute their knowledge to the system. The contributed words are then defined and linked using lexical and semantic relations, so that related words and derivatives can be identified and linked. The implemented system contributes to the development of the semantic web and of knowledge engineering.
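A toy sketch of the underlying data structure: words in different languages linked through shared concepts and typed relations. All entries below are invented examples, not the system's actual content.

```python
# Toy semantic network: words across languages map to shared concepts,
# and concepts carry typed relations. All entries are invented.
semantic_net = {
    "concepts": {"DOG": {"is_a": "ANIMAL"}},
    "words": {
        ("en", "dog"): "DOG",
        ("ms", "anjing"): "DOG",   # Malay word mapped to the same concept
    },
}

def translations(word, lang):
    """Find words in other languages linked to the same concept."""
    concept = semantic_net["words"].get((lang, word))
    return [w for (l, w), c in semantic_net["words"].items()
            if c == concept and l != lang]

print(translations("dog", "en"))   # ['anjing']
```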
Abstract: Clustering is one of the most interesting data mining topics and can be applied in many fields. Recently, the problem of cluster analysis has been formulated as a problem of nonsmooth, nonconvex optimization, and an algorithm for solving the cluster analysis problem based on nonsmooth optimization techniques has been developed. This optimization problem has a number of characteristics that make it challenging: it has many local minima, the optimization variables can be either continuous or categorical, and there are no exact analytical derivatives. In this study we show how to apply a particular class of optimization methods, known as pattern search methods, to address these challenges. These methods do not explicitly use derivatives, an important feature that has not been addressed in previous studies. Results of numerical experiments are presented which demonstrate the effectiveness of the proposed method.
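A minimal sketch of a derivative-free compass (pattern) search applied to the nonsmooth clustering objective (the sum over points of the distance to the nearest center); it illustrates the class of methods the abstract names, not the paper's exact algorithm.

```python
# Derivative-free compass (pattern) search on the nonsmooth clustering
# objective: poll +/- steps along each coordinate of each center and
# shrink the mesh when no direction improves.
import numpy as np

def objective(centers, points):
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()

def compass_search(points, k, steps=200, step=1.0, shrink=0.5):
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), k, replace=False)].copy()
    best = objective(centers, points)
    for _ in range(steps):
        improved = False
        for i in range(k):
            for j in range(points.shape[1]):
                for s in (+step, -step):
                    trial = centers.copy()
                    trial[i, j] += s
                    f = objective(trial, points)
                    if f < best:
                        centers, best, improved = trial, f, True
        if not improved:
            step *= shrink   # no poll direction improved: shrink mesh
    return centers, best

pts = np.vstack([np.random.default_rng(1).normal(m, 0.3, (30, 2))
                 for m in (0, 5)])
print(compass_search(pts, k=2)[1])
```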
Abstract: In this paper we use data mining techniques to identify outlier patients who use large amounts of drugs over a long period of time. Any healthcare or health insurance system must deal with the quantities of drugs utilized by chronic disease patients. In the Kingdom of Bahrain, about 20% of the health budget is spent on medications. Managers of healthcare systems do not have enough information about drug utilization by chronic disease patients, whether there is any misuse, or whether there are outlier patients. In this work, carried out in cooperation with the information department of the Bahrain Defence Force hospital, we selected the data of cardiac patients for the period from 1 January 2008 to 31 December 2008 as the data for the model in this paper. We used three techniques to analyze drug utilization by cardiac patients: first a clustering technique, followed by a measurement of clustering validity, and finally a decision tree as the classification algorithm. The clustering partitions the 1603 patients, who received 15,806 prescriptions during this period, into three groups according to drug utilization; 23 patients (2.59%), who received 1316 prescriptions (8.32%), are classified as outliers. The classification algorithm shows that average drug utilization together with the age and gender of the patient can be considered the main predictive factors in the induced model.
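The shape of this cluster-then-classify pipeline can be sketched as follows, with synthetic utilization data standing in for the hospital records and the smallest high-utilization cluster taken as the outlier group.

```python
# Sketch of the pipeline shape: cluster patients on drug utilization,
# label the high-utilization cluster as outliers, then fit a decision
# tree on utilization/age/gender. All data here are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
util = np.concatenate([rng.normal(10, 2, 700), rng.normal(25, 3, 250),
                       rng.normal(60, 5, 50)])          # prescriptions/year
age = rng.integers(30, 85, len(util))
gender = rng.integers(0, 2, len(util))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(util.reshape(-1, 1))
outlier_cluster = int(np.argmax(km.cluster_centers_.ravel()))
is_outlier = km.labels_ == outlier_cluster
print(f"outliers: {is_outlier.sum()} patients ({is_outlier.mean():.1%})")

X = np.column_stack([util, age, gender])
tree = DecisionTreeClassifier(max_depth=3).fit(X, is_outlier)
print("feature importances (util, age, gender):", tree.feature_importances_)
```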
Abstract: Data mining comprises a group of statistical methods used to analyze a set of information, or a data set. It operates with models and algorithms, which are powerful tools with great potential. They can help people understand the patterns in a given chunk of information, so it is clear that data mining tools have a wide area of application. For example, in theoretical chemistry, data mining tools can be used to predict molecule properties or improve computer-assisted drug design. Classification analysis is one of the major data mining methodologies. The aim of this contribution is to create a classification model able to deal with a huge data set with high accuracy. For this purpose, logistic regression, Bayesian logistic regression and random forest models were built using the R software. A Bayesian logistic regression model was created in the Latent GOLD software as well. These classification methods belong to the supervised learning methods. It was necessary to reduce the dimension of the data matrix before constructing the models, and thus factor analysis (FA) was used. The models were applied to predict the biological activity of molecules, potential new drug candidates.
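A sketch of the equivalent workflow in Python (the study itself used R and Latent GOLD): factor analysis for dimension reduction followed by the classifiers, on a stand-in dataset in place of the molecular descriptors.

```python
# Sketch of the described workflow: reduce the descriptor matrix with
# factor analysis, then classify with logistic regression and a random
# forest, scored by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import FactorAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for molecule descriptors
for clf in (LogisticRegression(max_iter=1000),
            RandomForestClassifier(random_state=0)):
    pipe = make_pipeline(StandardScaler(), FactorAnalysis(n_components=10), clf)
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{clf.__class__.__name__}: CV accuracy {score:.3f}")
```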
Abstract: Clustering is a very well known technique in data mining. One of the most widely used clustering techniques is the k-means algorithm. Solutions obtained by this technique depend on the initialization of the cluster centers, and the final solution converges to local minima. To overcome these shortcomings of the K-means algorithm, this paper proposes a hybrid evolutionary algorithm based on the combination of PSO, SA and K-means, called PSO-SA-K, which can find a better cluster partition. The performance is evaluated on several benchmark data sets. The simulation results show that the proposed algorithm outperforms previous approaches, such as PSO, SA and K-means, for the partitional clustering problem.
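A simplified hybrid in the same spirit, though not the authors' exact PSO-SA-K: K-means refinement combined with simulated-annealing jumps that occasionally accept worse centroid sets, so the search can escape the local minima that plain K-means gets stuck in.

```python
# Simplified SA + K-means hybrid: perturb the centers (SA move), refine
# with one K-means run, and accept worse solutions with a
# temperature-controlled probability.
import numpy as np
from sklearn.cluster import KMeans

def sse(centers, X):
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    return (d.min(axis=1) ** 2).sum()

def sa_kmeans(X, k=3, iters=50, temp=1.0, cool=0.9, seed=0):
    rng = np.random.default_rng(seed)
    centers = KMeans(n_clusters=k, n_init=1,
                     random_state=seed).fit(X).cluster_centers_
    best, best_f = centers, sse(centers, X)
    cur, cur_f = best, best_f
    for _ in range(iters):
        trial = cur + rng.normal(0, 0.5, cur.shape)        # SA perturbation
        trial = KMeans(n_clusters=k, init=trial,
                       n_init=1).fit(X).cluster_centers_   # K-means refinement
        f = sse(trial, X)
        if f < cur_f or rng.random() < np.exp((cur_f - f) / temp):
            cur, cur_f = trial, f                          # accept the move
        if cur_f < best_f:
            best, best_f = cur, cur_f
        temp *= cool
    return best, best_f

X = np.vstack([np.random.default_rng(i).normal(m, 0.4, (40, 2))
               for i, m in enumerate((0, 4, 8))])
print(sa_kmeans(X)[1])
```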
Abstract: Chess is an indoor game that improves human confidence, concentration, planning skills and knowledge. The main objective of this paper is to help chess players improve their chess openings using data mining techniques. Budding chess players usually practice by analyzing various existing openings; when they analyze and correlate thousands of openings, the task becomes tedious and complex. The work in this paper analyzes the best lines of the Blackmar-Diemer Gambit (BDG), which opens with White's d4, using data mining analysis. The analysis is carried out on a collection of winning games by applying association rules. The first step of this analysis is assigning variables to each distinct sequence of moves. In the second step, sequence association rules are generated to calculate the support and confidence factors, which help find the best subsequent chess moves that may lead to a winning position.
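The support/confidence computation over move sequences can be sketched as follows; the toy games and the "prefix implies next move" rule form are illustrative, not the paper's actual data or encoding.

```python
# Sketch of support/confidence for sequence rules over opening moves:
# score the rule "prefix -> next move" on a toy collection of games.
games = [["d4", "d5", "e4", "dxe4", "Nc3"],
         ["d4", "d5", "e4", "dxe4", "f3"],
         ["d4", "d5", "e4", "e6"],
         ["d4", "Nf6", "Nc3"]]

def support_confidence(prefix, nxt):
    has_prefix = [g for g in games if g[:len(prefix)] == prefix]
    has_both = [g for g in has_prefix
                if len(g) > len(prefix) and g[len(prefix)] == nxt]
    support = len(has_both) / len(games)
    confidence = len(has_both) / len(has_prefix) if has_prefix else 0.0
    return support, confidence

s, c = support_confidence(["d4", "d5", "e4"], "dxe4")
print(f"support={s:.2f}, confidence={c:.2f}")   # 0.50 and 0.67
```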
Abstract: In recent years, much research has been actively conducted to mine the exploding Web world, especially User Generated Content (UGC) such as weblogs, for knowledge about various phenomena and events in the physical world, and Web services based on this Web-mined knowledge have begun to be developed for the public. However, there are few detailed investigations of how accurately Web-mined data reflect physical-world data, and it is problematic to utilize Web-mined data uncritically in public Web services without sufficiently ensuring their accuracy. Therefore, this paper introduces the simplest Web Sensor and a spatiotemporally normalized Web Sensor to extract spatiotemporal data about a target phenomenon from weblogs retrieved by keyword(s) representing the target phenomenon, and tries to validate the potential and reliability of the Web-sensed spatiotemporal data through four kinds of granularity analyses of the correlation coefficients with the per-day, per-region temperature, rainfall, snowfall, and earthquake statistics of the Japan Meteorological Agency as physical-world data: spatial granularity (a region's population density), temporal granularity (time period, e.g., per day vs. per week), representation granularity (e.g., "rain" vs. "heavy rain"), and media granularity (weblogs vs. microblogs such as tweets).
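The core validation step reduces to correlating two daily time series; a minimal sketch with invented stand-in data follows.

```python
# Sketch of the core validation step: correlate daily counts of
# keyword-matching weblog posts with official per-day statistics.
# Both series here are invented stand-ins.
import numpy as np

blog_posts_per_day = np.array([12, 30, 45, 8, 50, 33, 20])  # posts matching "rain"
rainfall_mm_per_day = np.array([1.0, 8.5, 14.2, 0.0, 16.0, 9.1, 4.3])

r = np.corrcoef(blog_posts_per_day, rainfall_mm_per_day)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```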
Abstract: Web usage currently generates a huge amount of data from user activity. In general, a proxy server is a system that supports users' web access, and its performance can be managed using hit rates. This research tries to improve hit rates in a proxy system by applying data mining techniques. The data sets are collected from proxy servers at the university, and relationships are investigated based on several features. The model is used to predict which websites will be accessed in the future. The association rule technique is applied to obtain the relations among Date, Time, Main Group web, Sub Group web, and Domain name to create the model. The results show that this technique can predict web content for the next day; moreover, the prediction of future website accesses improved from 38.15% to 85.57%. This model can predict web page accesses, which in turn tends to increase the efficiency of proxy servers. In addition, the performance of internet access will improve, helping to reduce network traffic.
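One way to act on such rules is sketched below: mine frequent (hour, domain) pairs from the proxy logs and use them as a prefetching plan for the next day; the log entries and support threshold are invented.

```python
# Sketch of the prefetching idea: mine frequent (hour, domain) pairs
# from proxy logs and prefetch the frequent domains for each hour.
from collections import Counter

log = [(9, "news.example"), (9, "mail.example"), (9, "news.example"),
       (14, "video.example"), (14, "video.example"), (14, "news.example")]

pair_counts = Counter(log)
min_support = 2
prefetch_plan = {}
for (hour, domain), n in pair_counts.items():
    if n >= min_support:
        prefetch_plan.setdefault(hour, []).append(domain)
print(prefetch_plan)   # {9: ['news.example'], 14: ['video.example']}
```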
Abstract: Property investment in the real estate industry carries high risk, due both to the uncertainty factors that affect the decisions made and to high cost. The analytic hierarchy process has long been used, relying on experts' opinions to measure the uncertainty of the risk factors for risk analysis. However, different levels of expert experience produce different opinions and lead to conflicts among the experts in the field. The objective of this paper is to propose a new technique for measuring the uncertainty of the risk factors, based on a multidimensional data model and data mining techniques, as a deterministic approach. The proposed technique consists of a basic framework that includes four modules: user, technology, end-user access tools, and applications. The property investment risk analysis is defined as a micro-level analysis, as the features of the property are considered in the analysis in this paper.