Abstract: Support vector machines (SVMs) are considered to be
the best machine learning algorithms for minimizing the predictive
probability of misclassification. However, their drawback is that for
large data sets the computation of the optimal decision boundary is a
time consuming function of the size of the training set. Hence several
methods have been proposed to speed up the SVM algorithm. Here
three methods used to speed up the computation of the SVM
classifiers are compared experimentally using a musical genre
classification problem. The simplest method pre-selects a random
sample of the data before the application of the SVM algorithm. Two
additional methods use proximity graphs to pre-select data that are
near the decision boundary. One uses k-Nearest Neighbor graphs and
the other Relative Neighborhood Graphs to accomplish the task.
Abstract: The tagging data of (users, tags and resources) constitutes a folksonomy that is the user-driven and bottom-up approach to organizing and classifying information on the Web. Tagging data stored in the folksonomy include a lot of very useful information and knowledge. However, appropriate approach for analyzing tagging data and discovering hidden knowledge from them still remains one of the main problems on the folksonomy mining researches. In this paper, we have proposed a folksonomy data mining approach based on FCA for discovering hidden knowledge easily from folksonomy. Also we have demonstrated how our proposed approach can be applied in the collaborative tagging system through our experiment. Our proposed approach can be applied to some interesting areas such as social network analysis, semantic web mining and so on.
Abstract: It is important to predict yield in semiconductor test process in order to increase yield. In this study, yield prediction means finding out defective die, wafer or lot effectively. Semiconductor test process consists of some test steps and each test includes various test items. In other world, test data has a big and complicated characteristic. It also is disproportionably distributed as the number of data belonging to FAIL class is extremely low. For yield prediction, general data mining techniques have a limitation without any data preprocessing due to eigen properties of test data. Therefore, this study proposes an under-sampling method using support vector machine (SVM) to eliminate an imbalanced characteristic. For evaluating a performance, randomly under-sampling method is compared with the proposed method using actual semiconductor test data. As a result, sampling method using SVM is effective in generating robust model for yield prediction.
Abstract: Intrusion Detection Systems are increasingly a key
part of systems defense. Various approaches to Intrusion Detection
are currently being used, but they are relatively ineffective. Artificial
Intelligence plays a driving role in security services. This paper
proposes a dynamic model Intelligent Intrusion Detection System,
based on specific AI approach for intrusion detection. The
techniques that are being investigated includes neural networks and
fuzzy logic with network profiling, that uses simple data mining
techniques to process the network data. The proposed system is a
hybrid system that combines anomaly, misuse and host based
detection. Simple Fuzzy rules allow us to construct if-then rules that
reflect common ways of describing security attacks. For host based
intrusion detection we use neural-networks along with self
organizing maps. Suspicious intrusions can be traced back to its
original source path and any traffic from that particular source will
be redirected back to them in future. Both network traffic and system
audit data are used as inputs for both.
Abstract: In this paper we use data mining techniques to investigate factors that contribute significantly to enhancing the risk of acute coronary syndrome. We assume that the dependent variable is diagnosis – with dichotomous values showing presence or absence of disease. We have applied binary regression to the factors affecting the dependent variable. The data set has been taken from two different cardiac hospitals of Karachi, Pakistan. We have total sixteen variables out of which one is assumed dependent and other 15 are independent variables. For better performance of the regression model in predicting acute coronary syndrome, data reduction techniques like principle component analysis is applied. Based on results of data reduction, we have considered only 14 out of sixteen factors.
Abstract: The purpose of determining impact significance is to
place value on impacts. Environmental impact assessment review is a
process that judges whether impact significance is acceptable or not in
accordance with the scientific facts regarding environmental,
ecological and socio-economical impacts described in environmental
impact statements (EIS) or environmental impact assessment reports
(EIAR). The first aim of this paper is to summarize the criteria of
significance evaluation from the past review results and accordingly
utilize fuzzy logic to incorporate these criteria into scientific facts. The
second aim is to employ data mining technique to construct an EIS or
EIAR prediction model for reviewing results which can assist
developers to prepare and revise better environmental management
plans in advance. The validity of the previous prediction model
proposed by authors in 2009 is 92.7%. The enhanced validity in this
study can attain 100.0%.
Abstract: Recommender systems are usually regarded as an
important marketing tool in the e-commerce. They use important
information about users to facilitate accurate recommendation. The
information includes user context such as location, time and interest
for personalization of mobile users. We can easily collect information
about location and time because mobile devices communicate with the
base station of the service provider. However, information about user
interest can-t be easily collected because user interest can not be
captured automatically without user-s approval process. User interest
usually represented as a need. In this study, we classify needs into two
types according to prior research. This study investigates the
usefulness of data mining techniques for classifying user need type for
recommendation systems. We employ several data mining techniques
including artificial neural networks, decision trees, case-based
reasoning, and multivariate discriminant analysis. Experimental
results show that CHAID algorithm outperforms other models for
classifying user need type. This study performs McNemar test to
examine the statistical significance of the differences of classification
results. The results of McNemar test also show that CHAID performs
better than the other models with statistical significance.
Abstract: In this paper a novel algorithm is proposed that integrates the process of fuzzy hierarchy generation and rule discovery for automated discovery of Production Rules with Fuzzy Hierarchy (PRFH) in large databases.A concept of frequency matrix (Freq) introduced to summarize large database that helps in minimizing the number of database accesses, identification and removal of irrelevant attribute values and weak classes during the fuzzy hierarchy generation.Experimental results have established the effectiveness of the proposed algorithm.
Abstract: Clustering techniques have been used by many intelligent software agents to group similar access patterns of the Web users into high level themes which express users intentions and interests. However, such techniques have been mostly focusing on one salient feature of the Web document visited by the user, namely the extracted keywords. The major aim of these techniques is to come up with an optimal threshold for the number of keywords needed to produce more focused themes. In this paper we focus on both keyword and similarity thresholds to generate themes with concentrated themes, and hence build a more sound model of the user behavior. The purpose of this paper is two fold: use distance based clustering methods to recognize overall themes from the Proxy log file, and suggest an efficient cut off levels for the keyword and similarity thresholds which tend to produce more optimal clusters with better focus and efficient size.
Abstract: This paper investigates the issue of building decision
trees from data with imprecise class values where imprecision is
encoded in the form of possibility distributions. The Information
Affinity similarity measure is introduced into the well-known gain
ratio criterion in order to assess the homogeneity of a set of
possibility distributions representing instances-s classes belonging to
a given training partition. For the experimental study, we proposed an
information affinity based performance criterion which we have used
in order to show the performance of the approach on well-known
benchmarks.
Abstract: The aim of this paper is to present a methodology in
three steps to forecast supply chain demand. In first step, various data
mining techniques are applied in order to prepare data for entering
into forecasting models. In second step, the modeling step, an
artificial neural network and support vector machine is presented
after defining Mean Absolute Percentage Error index for measuring
error. The structure of artificial neural network is selected based on
previous researchers' results and in this article the accuracy of
network is increased by using sensitivity analysis. The best forecast
for classical forecasting methods (Moving Average, Exponential
Smoothing, and Exponential Smoothing with Trend) is resulted based
on prepared data and this forecast is compared with result of support
vector machine and proposed artificial neural network. The results
show that artificial neural network can forecast more precisely in
comparison with other methods. Finally, forecasting methods'
stability is analyzed by using raw data and even the effectiveness of
clustering analysis is measured.
Abstract: The prevalence of non organic constipation differs
from country to country and the reliability of the estimate rates is
uncertain. Moreover, the clinical relevance of subdividing the
heterogeneous functional constipation disorders into pre-defined
subgroups is largely unknown.. Aim: to estimate the prevalence of
constipation in a population-based sample and determine whether
clinical subgroups can be identified. An age and gender stratified
sample population from 5 Italian cities was evaluated using a
previously validated questionnaire. Data mining by cluster analysis
was used to determine constipation subgroups. Results: 1,500
complete interviews were obtained from 2,083 contacted households
(72%). Self-reported constipation correlated poorly with symptombased
constipation found in 496 subjects (33.1%). Cluster analysis
identified four constipation subgroups which correlated to subgroups
identified according to pre-defined symptom criteria. Significant
differences in socio-demographics and lifestyle were observed
among subgroups.
Abstract: Bagging and boosting are among the most popular resampling ensemble methods that generate and combine a diversity of classifiers using the same learning algorithm for the base-classifiers. Boosting algorithms are considered stronger than bagging on noisefree data. However, there are strong empirical indications that bagging is much more robust than boosting in noisy settings. For this reason, in this work we built an ensemble using a voting methodology of bagging and boosting ensembles with 10 subclassifiers in each one. We performed a comparison with simple bagging and boosting ensembles with 25 sub-classifiers, as well as other well known combining methods, on standard benchmark datasets and the proposed technique was the most accurate.
Abstract: Diagnosis can be achieved by building a model of a
certain organ under surveillance and comparing it with the real time
physiological measurements taken from the patient. This paper deals
with the presentation of the benefits of using Data Mining techniques
in the computer-aided diagnosis (CAD), focusing on the cancer
detection, in order to help doctors to make optimal decisions quickly
and accurately. In the field of the noninvasive diagnosis techniques,
the endoscopic ultrasound elastography (EUSE) is a recent elasticity
imaging technique, allowing characterizing the difference between
malignant and benign tumors. Digitalizing and summarizing the main
EUSE sample movies features in a vector form concern with the use
of the exploratory data analysis (EDA). Neural networks are then
trained on the corresponding EUSE sample movies vector input in
such a way that these intelligent systems are able to offer a very
precise and objective diagnosis, discriminating between benign and
malignant tumors. A concrete application of these Data Mining
techniques illustrates the suitability and the reliability of this
methodology in CAD.
Abstract: With the extensive inclusion of document, especially
text, in the business systems, data mining does not cover the full
scope of Business Intelligence. Data mining cannot deliver its impact
on extracting useful details from the large collection of unstructured
and semi-structured written materials based on natural languages.
The most pressing issue is to draw the potential business intelligence
from text. In order to gain competitive advantages for the business, it
is necessary to develop the new powerful tool, text mining, to expand
the scope of business intelligence.
In this paper, we will work out the strong points of text mining in
extracting business intelligence from huge amount of textual
information sources within business systems. We will apply text
mining to each stage of Business Intelligence systems to prove that
text mining is the powerful tool to expand the scope of BI. After
reviewing basic definitions and some related technologies, we will
discuss the relationship and the benefits of these to text mining. Some
examples and applications of text mining will also be given. The
motivation behind is to develop new approach to effective and
efficient textual information analysis. Thus we can expand the scope
of Business Intelligence using the powerful tool, text mining.
Abstract: Self-organizing map (SOM) is a well known data
reduction technique used in data mining. It can reveal structure in
data sets through data visualization that is otherwise hard to detect
from raw data alone. However, interpretation through visual
inspection is prone to errors and can be very tedious. There are
several techniques for the automatic detection of clusters of code
vectors found by SOM, but they generally do not take into account
the distribution of code vectors; this may lead to unsatisfactory
clustering and poor definition of cluster boundaries, particularly
where the density of data points is low. In this paper, we propose the
use of an adaptive heuristic particle swarm optimization (PSO)
algorithm for finding cluster boundaries directly from the code
vectors obtained from SOM. The application of our method to
several standard data sets demonstrates its feasibility. PSO algorithm
utilizes a so-called U-matrix of SOM to determine cluster boundaries;
the results of this novel automatic method compare very favorably to
boundary detection through traditional algorithms namely k-means
and hierarchical based approach which are normally used to interpret
the output of SOM.
Abstract: Text Mining is around applying knowledge discovery techniques to unstructured text is termed knowledge discovery in text (KDT), or Text data mining or Text Mining. In Neural Network that address classification problems, training set, testing set, learning rate are considered as key tasks. That is collection of input/output patterns that are used to train the network and used to assess the network performance, set the rate of adjustments. This paper describes a proposed back propagation neural net classifier that performs cross validation for original Neural Network. In order to reduce the optimization of classification accuracy, training time. The feasibility the benefits of the proposed approach are demonstrated by means of five data sets like contact-lenses, cpu, weather symbolic, Weather, labor-nega-data. It is shown that , compared to exiting neural network, the training time is reduced by more than 10 times faster when the dataset is larger than CPU or the network has many hidden units while accuracy ('percent correct') was the same for all datasets but contact-lences, which is the only one with missing attributes. For contact-lences the accuracy with Proposed Neural Network was in average around 0.3 % less than with the original Neural Network. This algorithm is independent of specify data sets so that many ideas and solutions can be transferred to other classifier paradigms.
Abstract: Currently, slider process of Hard Disk Drive Industry
become more complex, defective diagnosis for yield improvement
becomes more complicated and time-consumed. Manufacturing data
analysis with data mining approach is widely used for solving that
problem. The existing mining approach from combining of the KMean
clustering, the machine oriented Kruskal-Wallis test and the
multivariate chart were applied for defective diagnosis but it is still
be a semiautomatic diagnosis system. This article aims to modify an
algorithm to support an automatic decision for the existing approach.
Based on the research framework, the new approach can do an
automatic diagnosis and help engineer to find out the defective
factors faster than the existing approach about 50%.
Abstract: This paper discusses the use of explorative data
mining tools that allow the educator to explore new relationships
between reported learning experiences and actual activities,
even if there are multiple dimensions with a large number
of measured items. The underlying technology is based on
the so-called Compendium Platform for Reproducible Computing
(http://www.freestatistics.org) which was built on top the computational
R Framework (http://www.wessa.net).
Abstract: This research aims to create a model for analysis of student motivation behavior on e-Learning based on association rule mining techniques in case of the Information Technology for Communication and Learning Course at Suan Sunandha Rajabhat University. The model was created under association rules, one of the data mining techniques with minimum confidence. The results showed that the student motivation behavior model by using association rule technique can indicate the important variables that influence the student motivation behavior on e-Learning.