Abstract: The exponential increase in the volume of medical image database has imposed new challenges to clinical routine in maintaining patient history, diagnosis, treatment and monitoring. With the advent of data mining and machine learning techniques it is possible to automate and/or assist physicians in clinical diagnosis. In this research a medical image classification framework using data mining techniques is proposed. It involves feature extraction, feature selection, feature discretization and classification. In the classification phase, the performance of the traditional kNN k nearest neighbor classifier is improved using a feature weighting scheme and a distance weighted voting instead of simple majority voting. Feature weights are calculated using the interestingness measures used in association rule mining. Experiments on the retinal fundus images show that the proposed framework improves the classification accuracy of traditional kNN from 78.57 % to 92.85 %.
Abstract: The most important subtype of non-Hodgkin-s
lymphoma is the Diffuse Large B-Cell Lymphoma. Approximately
40% of the patients suffering from it respond well to therapy,
whereas the remainder needs a more aggressive treatment, in order to
better their chances of survival. Data Mining techniques have helped
to identify the class of the lymphoma in an efficient manner. Despite
that, thousands of genes should be processed to obtain the results.
This paper presents a comparison of the use of various attribute
selection methods aiming to reduce the number of genes to be
searched, looking for a more effective procedure as a whole.
Abstract: Estimates of temperature values at a specific time of day, from daytime and daily profiles, are needed for a number of environmental, ecological, agricultural and technical applications, ranging from natural hazards assessments, crop growth forecasting to design of solar energy systems. The scope of this research is to investigate the efficiency of data mining techniques in estimating minimum, maximum and mean temperature values. For this reason, a number of experiments have been conducted with well-known regression algorithms using temperature data from the city of Patras in Greece. The performance of these algorithms has been evaluated using standard statistical indicators, such as Correlation Coefficient, Root Mean Squared Error, etc.
Abstract: Website plays a significant role in success of an e-business. It is the main start point of any organization and corporation for its customers, so it's important to customize and design it according to the visitors' preferences. Also, websites are a place to introduce services of an organization and highlight new service to the visitors and audiences. In this paper, we will use web usage mining techniques, as a new field of research in data mining and knowledge discovery, in an Iranian government website. Using the results, a framework for web content layour is proposed. An agent is designed to dynamically update and improve web links locations and layout. Then, we will explain how it is used to directly enable top managers of the organization to influence on the arrangement of web contents and also to enhance customization of web site navigation due to online users' behaviors.
Abstract: Music Information Retrieval (MIR) and modern data mining techniques are applied to identify style markers in midi music for stylometric analysis and author attribution. Over 100 attributes are extracted from a library of 2830 songs then mined using supervised learning data mining techniques. Two attributes are identified that provide high informational gain. These attributes are then used as style markers to predict authorship. Using these style markers the authors are able to correctly distinguish songs written by the Beatles from those that were not with a precision and accuracy of over 98 per cent. The identification of these style markers as well as the architecture for this research provides a foundation for future research in musical stylometry.
Abstract: Text Mining is around applying knowledge discovery
techniques to unstructured text is termed knowledge discovery in text
(KDT), or Text data mining or Text Mining. In decision tree
approach is most useful in classification problem. With this
technique, tree is constructed to model the classification process.
There are two basic steps in the technique: building the tree and
applying the tree to the database. This paper describes a proposed
C5.0 classifier that performs rulesets, cross validation and boosting
for original C5.0 in order to reduce the optimization of error ratio.
The feasibility and the benefits of the proposed approach are
demonstrated by means of medial data set like hypothyroid. It is
shown that, the performance of a classifier on the training cases from
which it was constructed gives a poor estimate by sampling or using a
separate test file, either way, the classifier is evaluated on cases that
were not used to build and evaluate the classifier are both are large. If
the cases in hypothyroid.data and hypothyroid.test were to be
shuffled and divided into a new 2772 case training set and a 1000
case test set, C5.0 might construct a different classifier with a lower
or higher error rate on the test cases. An important feature of see5 is
its ability to classifiers called rulesets. The ruleset has an error rate
0.5 % on the test cases. The standard errors of the means provide an
estimate of the variability of results. One way to get a more reliable
estimate of predictive is by f-fold –cross- validation. The error rate of
a classifier produced from all the cases is estimated as the ratio of the
total number of errors on the hold-out cases to the total number of
cases. The Boost option with x trials instructs See5 to construct up to
x classifiers in this manner. Trials over numerous datasets, large and
small, show that on average 10-classifier boosting reduces the error
rate for test cases by about 25%.
Abstract: For the past one decade, biclustering has become popular data mining technique not only in the field of biological data analysis but also in other applications like text mining, market data analysis with high-dimensional two-way datasets. Biclustering clusters both rows and columns of a dataset simultaneously, as opposed to traditional clustering which clusters either rows or columns of a dataset. It retrieves subgroups of objects that are similar in one subgroup of variables and different in the remaining variables. Firefly Algorithm (FA) is a recently-proposed metaheuristic inspired by the collective behavior of fireflies. This paper provides a preliminary assessment of discrete version of FA (DFA) while coping with the task of mining coherent and large volume bicluster from web usage dataset. The experiments were conducted on two web usage datasets from public dataset repository whereby the performance of FA was compared with that exhibited by other population-based metaheuristic called binary Particle Swarm Optimization (PSO). The results achieved demonstrate the usefulness of DFA while tackling the biclustering problem.
Abstract: Biclustering is a very useful data mining technique for
identifying patterns where different genes are co-related based on a
subset of conditions in gene expression analysis. Association rules
mining is an efficient approach to achieve biclustering as in
BIMODULE algorithm but it is sensitive to the value given to its
input parameters and the discretization procedure used in the
preprocessing step, also when noise is present, classical association
rules miners discover multiple small fragments of the true bicluster,
but miss the true bicluster itself. This paper formally presents a
generalized noise tolerant bicluster model, termed as μBicluster. An
iterative algorithm termed as BIDENS based on the proposed model
is introduced that can discover a set of k possibly overlapping
biclusters simultaneously. Our model uses a more flexible method to
partition the dimensions to preserve meaningful and significant
biclusters. The proposed algorithm allows discovering biclusters that
hard to be discovered by BIMODULE. Experimental study on yeast,
human gene expression data and several artificial datasets shows that
our algorithm offers substantial improvements over several
previously proposed biclustering algorithms.
Abstract: There are several approaches in trying to solve the
Quantitative 1Structure-Activity Relationship (QSAR) problem.
These approaches are based either on statistical methods or on
predictive data mining. Among the statistical methods, one should
consider regression analysis, pattern recognition (such as cluster
analysis, factor analysis and principal components analysis) or partial
least squares. Predictive data mining techniques use either neural
networks, or genetic programming, or neuro-fuzzy knowledge. These
approaches have a low explanatory capability or non at all. This
paper attempts to establish a new approach in solving QSAR
problems using descriptive data mining. This way, the relationship
between the chemical properties and the activity of a substance
would be comprehensibly modeled.
Abstract: The purpose of this research aims to discover the
knowledge for analysis student motivation behavior on e-Learning
based on Data Mining Techniques, in case of the Information
Technology for Communication and Learning Course at Suan
Sunandha Rajabhat University. The data mining techniques was
applied in this research including association rules, classification
techniques. The results showed that using data mining technique can
indicate the important variables that influence the student motivation
behavior on e-Learning.
Abstract: Evidence-based medicine is a new direction in modern healthcare. Its task is to prevent, diagnose and medicate diseases using medical evidence. Medical data about a large patient population is analyzed to perform healthcare management and medical research. In order to obtain the best evidence for a given disease, external clinical expertise as well as internal clinical experience must be available to the healthcare practitioners at right time and in the right manner. External evidence-based knowledge can not be applied directly to the patient without adjusting it to the patient-s health condition. We propose a data warehouse based approach as a suitable solution for the integration of external evidence-based data sources into the existing clinical information system and data mining techniques for finding appropriate therapy for a given patient and a given disease. Through integration of data warehousing, OLAP and data mining techniques in the healthcare area, an easy to use decision support platform, which supports decision making process of care givers and clinical managers, is built. We present three case studies, which show, that a clinical data warehouse that facilitates evidence-based medicine is a reliable, powerful and user-friendly platform for strategic decision making, which has a great relevance for the practice and acceptance of evidence-based medicine.
Abstract: Methods of clustering which were developed in the
data mining theory can be successfully applied to the investigation of
different kinds of dependencies between the conditions of
environment and human activities. It is known, that environmental
parameters such as temperature, relative humidity, atmospheric
pressure and illumination have significant effects on the human
mental performance. To investigate these parameters effect, data
mining technique of clustering using entropy and Information Gain
Ratio (IGR) K(Y/X) = (H(X)–H(Y/X))/H(Y) is used, where
H(Y)=-ΣPi ln(Pi). This technique allows adjusting the boundaries of
clusters. It is shown that the information gain ratio (IGR) grows
monotonically and simultaneously with degree of connectivity
between two variables. This approach has some preferences if
compared, for example, with correlation analysis due to relatively
smaller sensitivity to shape of functional dependencies. Variant of an
algorithm to implement the proposed method with some analysis of
above problem of environmental effects is also presented. It was
shown that proposed method converges with finite number of steps.
Abstract: Generalized Center String (GCS) problem are
generalized from Common Approximate Substring problem
and Common substring problems. GCS are known to be
NP-hard allowing the problems lies in the explosion of
potential candidates. Finding longest center string without
concerning the sequence that may not contain any motifs is
not known in advance in any particular biological gene
process. GCS solved by frequent pattern-mining techniques
and known to be fixed parameter tractable based on the
fixed input sequence length and symbol set size. Efficient
method known as Bpriori algorithms can solve GCS with
reasonable time/space complexities. Bpriori 2 and Bpriori
3-2 algorithm are been proposed of any length and any
positions of all their instances in input sequences. In this
paper, we reduced the time/space complexity of Bpriori
algorithm by Constrained Based Frequent Pattern mining
(CBFP) technique which integrates the idea of Constraint
Based Mining and FP-tree mining. CBFP mining technique
solves the GCS problem works for all center string of any
length, but also for the positions of all their mutated copies
of input sequence. CBFP mining technique construct TRIE
like with FP tree to represent the mutated copies of center
string of any length, along with constraints to restraint
growth of the consensus tree. The complexity analysis for
Constrained Based FP mining technique and Bpriori
algorithm is done based on the worst case and average case
approach. Algorithm's correctness compared with the
Bpriori algorithm using artificial data is shown.
Abstract: Data mining techniques have been used in medical
research for many years and have been known to be effective. In order
to solve such problems as long-waiting time, congestion, and delayed
patient care, faced by emergency departments, this study concentrates
on building a hybrid methodology, combining data mining techniques
such as association rules and classification trees. The methodology is
applied to real-world emergency data collected from a hospital and is
evaluated by comparing with other techniques. The methodology is
expected to help physicians to make a faster and more accurate
classification of chest pain diseases.
Abstract: Application of Information Technology (IT) has
revolutionized the functioning of business all over the world. Its
impact has been felt mostly among the information of dependent
industries. Tourism is one of such industry. The conceptual
framework in this study represents an innovation of travel
information searching system on mobile devices which is used as
tools to deliver travel information (such as hotels, restaurants, tourist
attractions and souvenir shops) for each user by travelers
segmentation based on data mining technique to segment the tourists-
behavior patterns then match them with tourism products and
services. This system innovation is designed to be a knowledge
incremental learning. It is a marketing strategy to support business to
respond traveler-s demand effectively.
Abstract: This paper proposes an auto-classification algorithm
of Web pages using Data mining techniques. We consider the
problem of discovering association rules between terms in a set of
Web pages belonging to a category in a search engine database, and
present an auto-classification algorithm for solving this problem that
are fundamentally based on Apriori algorithm. The proposed
technique has two phases. The first phase is a training phase where
human experts determines the categories of different Web pages, and
the supervised Data mining algorithm will combine these categories
with appropriate weighted index terms according to the highest
supported rules among the most frequent words. The second phase is
the categorization phase where a web crawler will crawl through the
World Wide Web to build a database categorized according to the
result of the data mining approach. This database contains URLs and
their categories.
Abstract: This work concerns the evolution and the maintenance
of an ontological resource in relation with the evolution of the corpus
of texts from which it had been built.
The knowledge forming a text corpus, especially in dynamic domains,
is in continuous evolution. When a change in the corpus occurs, the
domain ontology must evolve accordingly. Most methods manage
ontology evolution independently from the corpus from which it is
built; in addition, they treat evolution just as a process of knowledge
addition, not considering other knowledge changes. We propose a
methodology for managing an evolving ontology from a text corpus
that evolves over time, while preserving the consistency and the
persistence of this ontology.
Our methodology is based on the changes made on the corpus to
reflect the evolution of the considered domain - augmented surgery
in our case. In this context, the results of text mining techniques,
as well as the ARCHONTE method slightly modified, are used to
support the evolution process.
Abstract: this article proposed a methodology for computer
numerical control (CNC) machine scoring. The case study company
is a manufacturer of hard disk drive parts in Thailand. In this
company, sample of parts manufactured from CNC machine are
usually taken randomly for quality inspection. These inspection data
were used to make a decision to shut down the machine if it has
tendency to produce parts that are out of specification. Large amount
of data are produced in this process and data mining could be very
useful technique in analyzing them. In this research, data mining
techniques were used to construct a machine scoring model called
'machine priority assessment model (MPAM)'. This model helps to
ensure that the machine with higher risk of producing defective parts
be inspected before those with lower risk. If the defective prone
machine is identified sooner, defective part and rework could be
reduced hence improving the overall productivity. The results
showed that the proposed method can be successfully implemented
and approximately 351,000 baht of opportunity cost could have
saved in the case study company.
Abstract: This study proposes a novel recommender system to
provide the advertisements of context-aware services. Our proposed
model is designed to apply a modified collaborative filtering (CF)
algorithm with regard to the several dimensions for the personalization
of mobile devices – location, time and the user-s needs type. In
particular, we employ a classification rule to understand user-s needs
type using a decision tree algorithm. In addition, we collect primary
data from the mobile phone users and apply them to the proposed
model to validate its effectiveness. Experimental results show that the
proposed system makes more accurate and satisfactory advertisements
than comparative systems.
Abstract: The paper gives the pilot results of the project that is
oriented on the use of data mining techniques and knowledge
discoveries from production systems through them. They have been
used in the management of these systems. The simulation models of
manufacturing systems have been developed to obtain the necessary
data about production. The authors have developed the way of
storing data obtained from the simulation models in the data
warehouse. Data mining model has been created by using specific
methods and selected techniques for defined problems of production
system management. The new knowledge has been applied to
production management system. Gained knowledge has been tested
on simulation models of the production system. An important benefit
of the project has been proposal of the new methodology. This
methodology is focused on data mining from the databases that store
operational data about the production process.