Abstract: Choosing the right metadata is critical, as good
information (metadata) attached to an image makes it easier to
find in a pile of other images. The image's value is enhanced
not only by the quality of the attached metadata but also by the search
technique. This study proposes a simple but efficient technique
to predict a single human image from a website using the
basic image data and the embedded metadata of the image's content
appearing on web pages. The result is very encouraging, with a
prediction accuracy of 95%. This technique may become a great
aid to librarians, researchers and many others for automatically and
efficiently identifying a set of human images out of a larger set of
images.
Abstract: With the extensive inclusion of documents, especially
text, in business systems, data mining does not cover the full
scope of Business Intelligence. Data mining cannot deliver its impact
on extracting useful details from large collections of unstructured
and semi-structured written materials based on natural languages.
The most pressing issue is to draw the potential business intelligence
from text. To gain competitive advantages for the business, it
is necessary to develop a powerful new tool, text mining, to expand
the scope of business intelligence.
In this paper, we work out the strong points of text mining in
extracting business intelligence from huge amounts of textual
information sources within business systems. We apply text
mining to each stage of Business Intelligence systems to show that
text mining is a powerful tool for expanding the scope of BI. After
reviewing basic definitions and some related technologies, we
discuss their relationship and benefits to text mining. Some
examples and applications of text mining are also given. The
motivation is to develop a new approach to effective and
efficient textual information analysis. Thus we can expand the scope
of Business Intelligence using text mining.
Abstract: The self-organizing map (SOM) is a well-known data
reduction technique used in data mining. It can reveal structure in
data sets through data visualization that is otherwise hard to detect
from raw data alone. However, interpretation through visual
inspection is prone to errors and can be very tedious. There are
several techniques for the automatic detection of clusters of code
vectors found by SOM, but they generally do not take into account
the distribution of code vectors; this may lead to unsatisfactory
clustering and poor definition of cluster boundaries, particularly
where the density of data points is low. In this paper, we propose the
use of an adaptive heuristic particle swarm optimization (PSO)
algorithm for finding cluster boundaries directly from the code
vectors obtained from SOM. The application of our method to
several standard data sets demonstrates its feasibility. The PSO algorithm
utilizes the so-called U-matrix of the SOM to determine cluster boundaries;
the results of this novel automatic method compare very favorably to
boundary detection through traditional algorithms, namely k-means
and hierarchical approaches, which are normally used to interpret
the output of SOM.
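To make the U-matrix mentioned in the abstract above concrete, the following is a minimal sketch under stated assumptions: a rectangular SOM grid, 4-connected neighbours, and Euclidean distance. The function name `u_matrix` and the toy code vectors are hypothetical and are not taken from the paper. High U-matrix values mark units whose code vectors are far from their neighbours, which is where cluster boundaries tend to lie.

```python
import math


def u_matrix(codebook):
    """Return the U-matrix of a rectangular SOM grid.

    codebook[r][c] is the code vector (list of floats) of the unit at
    row r, column c. Each output cell is the mean Euclidean distance
    between that unit's code vector and its 4-connected grid neighbours.
    """
    rows, cols = len(codebook), len(codebook[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(codebook[r][c], codebook[nr][nc]))
            # A 1x1 grid has no neighbours; leave its value at 0.0.
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result


# Hypothetical 1x4 grid: two tight groups of code vectors with a gap between.
toy_codebook = [[[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]]]
toy_u = u_matrix(toy_codebook)
```

In this toy grid the two middle units receive the largest values, because each of them has one neighbour on the far side of the gap.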
Abstract: Applying knowledge discovery techniques to unstructured text is termed knowledge discovery in text (KDT), text data mining, or text mining. In neural networks that address classification problems, the training set, the testing set and the learning rate are key elements: they are, respectively, the collection of input/output patterns used to train the network, the collection used to assess network performance, and the rate of weight adjustments. This paper describes a proposed back-propagation neural network classifier that performs cross-validation for the original neural network, in order to optimize classification accuracy and reduce training time. The feasibility and benefits of the proposed approach are demonstrated on five data sets: contact-lenses, cpu, weather symbolic, Weather, and labor-nega-data. It is shown that, compared to the existing neural network, training is more than 10 times faster when the data set is larger than the cpu data set or the network has many hidden units, while accuracy ('percent correct') was the same for all data sets except contact-lenses, which is the only one with missing attributes. For contact-lenses, the accuracy of the proposed neural network was on average around 0.3% lower than that of the original neural network. The algorithm is independent of specific data sets, so many ideas and solutions can be transferred to other classifier paradigms.
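The abstract above does not say how its cross-validation is organised. As a generic illustration only, the sketch below shows plain k-fold index splitting; the function name `k_fold_indices` and the sizes used are assumptions. Each sample lands in exactly one test fold, and the classifier would be trained on the remaining indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Folds are contiguous; the first n_samples % k folds get one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical example: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
```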
Abstract: Currently, as the slider process in the hard disk drive industry
becomes more complex, defect diagnosis for yield improvement
becomes more complicated and time-consuming. Manufacturing data
analysis with a data mining approach is widely used to solve this
problem. The existing mining approach, which combines K-Means
clustering, the machine-oriented Kruskal-Wallis test and the
multivariate chart, has been applied to defect diagnosis, but it is still
a semiautomatic diagnosis system. This article aims to modify an
algorithm to support automatic decisions in the existing approach.
Based on the research framework, the new approach can perform
automatic diagnosis and help engineers find the defective
factors about 50% faster than the existing approach.
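The Kruskal-Wallis test named in the abstract above compares several groups by ranks. The sketch below computes the standard H statistic, H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), with tied values given their average rank and no tie correction. The per-machine measurements are hypothetical, and how the paper applies the test to machines is not specified here.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (without tie correction).

    groups is a list of samples, e.g. one list of measurements per machine.
    Tied values receive the average of the ranks they span.
    """
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        i = j
    weighted = sum(rs * rs / len(sample) for rs, sample in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


# Hypothetical defect measurements from two machines.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6]])
```

A large H suggests that at least one group differs from the others; in practice the statistic is compared against a chi-squared distribution with (number of groups - 1) degrees of freedom.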
Abstract: This research aims to create a model for analyzing student motivation behavior in e-Learning based on association rule mining techniques, using the case of the Information Technology for Communication and Learning course at Suan Sunandha Rajabhat University. The model was created with association rules, a data mining technique, subject to a minimum confidence threshold. The results showed that the student motivation behavior model built with the association rule technique can indicate the important variables that influence student motivation behavior in e-Learning.
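As a hedged illustration of the minimum-confidence filtering mentioned in the abstract above, the sketch below mines single-item rules A -> B from a list of transactions. The activity names, the thresholds and the function name `rules_with_min_confidence` are all hypothetical; the study's actual variables and settings are not given in the abstract.

```python
from itertools import combinations


def rules_with_min_confidence(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for rules A -> B.

    transactions is a list of sets of items. Support is the fraction of
    transactions containing both items; confidence is
    support(A and B) / support(A).
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for ante, cons in ((a, b), (b, a)):
            confidence = pair_support / support({ante})
            if confidence >= min_confidence:
                rules.append((ante, cons, pair_support, confidence))
    return rules


# Hypothetical e-Learning activity logs, one set per student session.
sessions = [
    {"forum_post", "quiz_done", "video_watched"},
    {"quiz_done", "video_watched"},
    {"forum_post", "quiz_done"},
    {"video_watched"},
]
rules = rules_with_min_confidence(sessions, min_support=0.5, min_confidence=0.8)
```

With these toy sessions only the rule forum_post -> quiz_done passes both thresholds, since every session with a forum post also has a completed quiz.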
Abstract: With the proliferation of the World Wide Web, the
development of web-based technologies and the growth in web
content, the structure of a website becomes more complex and web
navigation becomes a critical issue for both web designers and users.
In this paper we define the content and web pages as two important
and influential factors in website navigation, and frame the
enhancement of website navigation as making useful
changes in the link structure of the website based on the
aforementioned factors. We then suggest a new method for
proposing the changes, using a fuzzy approach to optimize the website
architecture. Applying the proposed method to the real case of the Iranian
Civil Aviation Organization (CAO) website, we discuss the results of
the novel approach in the final section.
Abstract: Information is power. Geographical information is an
emerging science that is advancing the development of knowledge to
further help in the understanding of the relationship of "place" with
other disciplines such as crime. The researchers used crime data for
the years 2004 to 2007 from the Baguio City Police Office to
determine the incidence and actual locations of crime hotspots.
Combined qualitative and quantitative research methodology was
employed through extensive fieldwork and observation, geographic
visualization with Geographic Information Systems (GIS) and Global
Positioning Systems (GPS), and data mining. The paper discusses
emerging geographic visualization and data mining tools and
methodologies that can be used to generate baseline data for
environmental initiatives such as urban renewal and rejuvenation.
The study was able to demonstrate that crime hotspots can be
computed and were seen to occur in select places in the
Central Business District (CBD) of Baguio City. It was observed that
some characteristics of the hotspot places, namely physical design and milieu,
may play an important role in creating opportunities for crime. A list
of these environmental attributes was generated. This derived
information may be used to guide the design or redesign of the urban
environment of the City to reduce crime and at the same
time improve it physically.
Abstract: Phishing, or stealing of sensitive information on the
web, has dealt a major blow to Internet security in recent times. Most
of the existing anti-phishing solutions fail to handle the fuzziness
involved in phish detection, thus leading to a large number of false
positives. This fuzziness is attributed to the use of HTML, a highly flexible and,
at the same time, highly ambiguous language. We introduce a
new perspective against phishing that tries to systematically prove
whether a given page is phished or not, using the corresponding
original page as the basis of comparison. It analyzes the layout of
the pages under consideration to determine the percentage distortion
between them, indicative of any form of malicious alteration. The
system design represents an intelligent system employing dynamic
assessment, which accurately identifies brand new phishing attacks
and will prove effective in reducing the number of false positives.
This framework could potentially be used as a knowledge base for
educating internet users against phishing.
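The abstract above does not define how its "percentage distortion" between two page layouts is calculated. The sketch below is one possible stand-in, not the paper's method: it reduces each page to its sequence of opening tags and measures dissimilarity with `difflib.SequenceMatcher`. The class `TagCollector`, the function names and the two HTML snippets are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def percent_distortion(original_html, suspect_html):
    """Assumed measure: 100 * (1 - similarity ratio of the two tag sequences)."""
    matcher = SequenceMatcher(None, tag_sequence(original_html), tag_sequence(suspect_html))
    return 100.0 * (1.0 - matcher.ratio())


# Hypothetical pages: the suspect page wraps the form and adds an input field.
original_page = "<html><body><form><input></form></body></html>"
suspect_page = "<html><body><div><form><input><input></form></div></body></html>"
distortion = percent_distortion(original_page, suspect_page)
```

A real system would likely also compare attributes, styles and rendered geometry; the tag sequence is only the simplest layout proxy.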
Abstract: This paper presents a system for discovering
association rules from collections of unstructured documents, called
EART (Extract Association Rules from Text). The EART system
treats text only, not images or figures. EART discovers association
rules amongst keywords labeling the collection of textual documents.
The main characteristic of EART is that the system integrates XML
technology (to transform unstructured documents into structured
documents) with an Information Retrieval scheme (TF-IDF) and a Data
Mining technique for association rule extraction. EART depends on
word features to extract association rules. It consists of four phases:
structure phase, index phase, text mining phase and visualization
phase. Our work depends on the analysis of the keywords in the
extracted association rules through the co-occurrence of the keywords
in one sentence in the original text and the existence of the keywords
in one sentence without co-occurrence. Experiments were conducted on a
collection of scientific documents selected from MEDLINE that are
related to the outbreak of H5N1 avian influenza virus.
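The TF-IDF scheme named in the abstract above weights a keyword by how often it occurs in a document and how rare it is across the collection. The sketch below uses one common variant, tf = count / document length and idf = ln(N / document frequency); EART's exact weighting formula is not given in the abstract, and the tokenised documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents is a list of token lists. Weight = (count / doc length)
    * ln(number of documents / number of documents containing the term).
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenised documents.
docs = [
    ["h5n1", "virus", "outbreak"],
    ["virus", "vaccine"],
    ["virus", "poultry", "h5n1"],
]
doc_weights = tf_idf(docs)
```

A term that appears in every document, such as "virus" here, gets weight zero, so it would not be selected as a keyword for rule extraction.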
Abstract: Mining sequential patterns in large databases has become
an important data mining task with broad applications. It
describes potential
sequenced relationships among items in a database. Many
different algorithms have been introduced for this task. Conventional algorithms
can find the exact optimal sequential pattern rule, but they take a
long time, particularly when applied to large databases.
Recently, some evolutionary algorithms, such as Particle Swarm
Optimization and Genetic Algorithm, have been proposed and
applied to solve this problem. This paper introduces a new kind
of hybrid evolutionary algorithm that combines the Genetic Algorithm
(GA) with Particle Swarm Optimization (PSO) to mine sequential
patterns, in order to improve the convergence speed of evolutionary
algorithms. This algorithm is referred to as SP-GAPSO.
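Any sequential pattern miner, evolutionary or not, needs a way to score a candidate pattern. The sketch below computes the support of a pattern as the fraction of sequences that contain it as an order-preserving, not necessarily contiguous, subsequence. It assumes single-item elements rather than itemsets, and it is not the SP-GAPSO fitness function, which the abstract above does not specify.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    # Each membership test advances the iterator past the matched item.
    return all(item in remaining for item in pattern)


def pattern_support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database) / len(database)


# Hypothetical sequence database.
sequence_db = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a"],
    ["a", "b"],
]
support_ab = pattern_support(["a", "b"], sequence_db)
```

The third sequence contains both items but in the wrong order, so it does not count toward the support of the pattern a, b.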
Abstract: Web usage mining algorithms have been widely
utilized for modeling user web navigation behavior. In this study we
advance a model for mining users' navigation patterns. The model
builds a user model based on the expectation-maximization (EM)
algorithm. An EM algorithm is used in statistics for finding maximum
likelihood estimates of parameters in probabilistic models, where the
model depends on unobserved latent variables. The experimental
results show that as the number of clusters decreases, the log
likelihood converges toward lower values, and that the probability of the
largest cluster decreases as the number of clusters
increases in each treatment.
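To illustrate the EM algorithm described in the abstract above, here is a minimal sketch for a one-dimensional Gaussian mixture, where the latent variable is the cluster each point belongs to. The data, the initial means and the function name `em_gmm_1d` are assumptions; the paper's model of navigation sessions is not reproduced.

```python
import math


def em_gmm_1d(data, initial_means, n_iter=50):
    """EM for a 1-D Gaussian mixture.

    Returns (weights, means, variances, log_likelihood), where the
    log-likelihood is the one evaluated in the final E-step.
    """
    k = len(initial_means)
    means = list(initial_means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    log_likelihood = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        log_likelihood = 0.0
        for x in data:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            log_likelihood += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, 1e-6)  # floor to avoid a zero variance
    return weights, means, variances, log_likelihood


# Hypothetical data drawn around two centres.
samples = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
mix_weights, mix_means, mix_vars, mix_ll = em_gmm_1d(samples, [0.0, 6.0])
```

The mixture weights play the role of cluster probabilities, so the weight of the largest cluster can be read directly from the returned values.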
Abstract: This paper describes an approach to predicting the
incoming and outgoing data rate in a network system by using
association rule discovery, which is one of the data mining
techniques. The incoming and outgoing data at each
time and the network bandwidth are network performance
parameters that are needed to solve the traffic problem, since
congestion and data loss are important network problems. The result
of this technique can predict future network traffic. In addition,
this research is useful for network routing selection and network
performance improvement.
Abstract: In the recent past Learning Classifier Systems have
been successfully used for data mining. Learning Classifier System
(LCS) is basically a machine learning technique which combines
evolutionary computing, reinforcement learning, supervised or
unsupervised learning and heuristics to produce adaptive systems. An
LCS learns by interacting with an environment from which it
receives feedback in the form of numerical reward. Learning is
achieved by trying to maximize the amount of reward received. All
LCS models comprise, more or less, four main components: a finite
population of condition–action rules, called classifiers; the
performance component, which governs the interaction with the
environment; the credit assignment component, which distributes the
reward received from the environment to the classifiers accountable
for the rewards obtained; the discovery component, which is
responsible for discovering better rules and improving existing ones
through a genetic algorithm. The concatenation of the production rules
in the LCS forms the genotype, and therefore the GA should operate
on a population of classifier systems. This approach is known as the
'Pittsburgh' Classifier System. Other LCSs that perform their GA at
the rule level within a population are known as 'Michigan' Classifier
Systems. The predominant representation of the discovered
knowledge is the standard production rule (PR) of the form IF P
THEN D. PRs, however, are unable to handle exceptions and do
not exhibit variable precision. Censored Production Rules
(CPRs), an extension of PRs proposed by Michalski and
Winston, exhibit variable precision and support an efficient
mechanism for handling exceptions. A CPR is an augmented
production rule of the form IF P THEN D UNLESS C, where the
censor C is an exception to the rule. Such rules are employed in
situations in which the conditional statement IF P THEN D holds
frequently and the assertion C holds rarely. By using a rule of this
type we are free to ignore the exception conditions when the
resources needed to establish their presence are tight or there is simply
no information available as to whether they hold or not. Thus, the IF P
THEN D part of a CPR expresses important information, while the
UNLESS C part acts only as a switch and changes the polarity of D
to ~D. In this paper, a Pittsburgh-style LCS approach is used for the
automated discovery of CPRs. An appropriate encoding scheme is
suggested to represent a chromosome consisting of a fixed-size set of
CPRs. Suitable genetic operators are designed for the set of CPRs
and for individual CPRs, and an appropriate fitness function is proposed
that incorporates basic constraints on CPRs. Experimental results are
presented to demonstrate the performance of the proposed learning
classifier system.
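The semantics of a censored production rule, IF P THEN D UNLESS C, as described in the abstract above, can be sketched as follows. The class `CensoredRule`, the bird example and the `check_censor` switch (which models skipping the censor when resources are tight) are hypothetical and do not reflect the paper's chromosome encoding.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Facts = Dict[str, bool]


@dataclass
class CensoredRule:
    premise: Callable[[Facts], bool]            # P
    decision: str                               # D
    censor: Callable[[Facts], Optional[bool]]   # C; None means "unknown"

    def fire(self, facts: Facts, check_censor: bool = True) -> Optional[str]:
        """Return D, ~D, or None when the premise does not hold.

        The censor flips D to ~D only if it is checked and known to hold;
        an unknown or unchecked censor leaves the decision as D.
        """
        if not self.premise(facts):
            return None
        if check_censor and self.censor(facts) is True:
            return "~" + self.decision
        return self.decision


# Hypothetical rule: IF bird THEN flies UNLESS penguin.
bird_rule = CensoredRule(
    premise=lambda f: f.get("bird", False),
    decision="flies",
    censor=lambda f: f.get("penguin"),
)
full_check = bird_rule.fire({"bird": True, "penguin": True})
quick_check = bird_rule.fire({"bird": True, "penguin": True}, check_censor=False)
```

Here `full_check` is "~flies" while `quick_check` is "flies", which shows the variable precision of the rule: the cheaper evaluation is usually right because the censor rarely holds.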
Abstract: The vast amount of information hidden in huge
databases has created tremendous interest in the field of data
mining. This paper examines the possibility of using data clustering
techniques in oral medicine to identify functional relationships
between different attributes and to classify similar patient
examinations. Commonly used data clustering algorithms have been
reviewed, and several interesting results have been
gathered.
Abstract: Clustering is a well-known technique in data mining. One of the most widely used clustering techniques is the K-means algorithm. Solutions obtained from this technique depend on the initialization of cluster centers, and the final solution may converge to a local minimum. To overcome these shortcomings of the K-means algorithm, this paper proposes a hybrid evolutionary algorithm based on the combination of the PSO, SA and K-means algorithms, called PSO-SA-K, which can find better cluster partitions. The performance is evaluated on several benchmark data sets. The simulation results show that the proposed algorithm outperforms previous approaches, such as PSO, SA and K-means, on the partitional clustering problem.
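The sensitivity of K-means to its initial centers, which motivates the hybrid in the abstract above, can be seen in a small sketch. The code below is plain one-dimensional K-means, not PSO-SA-K; the data points and both initializations are made up. Run from a poor start, it stops at a local minimum with a much larger sum of squared errors (SSE).

```python
def kmeans_1d(points, initial_centers, n_iter=100):
    """Plain K-means on 1-D data; returns (centers, sum of squared errors)."""
    centers = list(initial_centers)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse


# Hypothetical data with three obvious groups.
data_points = [0, 1, 2, 10, 11, 12, 20, 21, 22]
good_centers, good_sse = kmeans_1d(data_points, [1, 11, 21])
bad_centers, bad_sse = kmeans_1d(data_points, [0, 1, 2])
```

Global search methods such as PSO or SA are typically brought in to escape exactly this kind of poor converged solution.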
Abstract: Currently, web usage generates huge amounts of data from
many users. In general, a proxy server is a system that supports users' web
usage, and it can be managed by using hit rates. This
research tries to improve hit rates in a proxy system by applying a data
mining technique. The data set was collected from proxy servers in the
university, and relationships were investigated based on several features.
The model is used to predict future website accesses. The association
rule technique is applied to find the relations among Date, Time, Main
Group web, Sub Group web, and Domain name to create the model.
The results showed that this technique can predict web content for the
next day; moreover, the future accesses of websites increased from
38.15% to 85.57%.
This model can predict web page accesses, which tends to increase
the efficiency of proxy servers as a result. In addition, the
performance of internet access will be improved, helping to reduce
traffic in networks.
Abstract: The increasing importance of data streams arising in a
wide range of advanced applications has led to the extensive study of
mining frequent patterns. Mining data streams poses many new
challenges, among which are the one-scan nature, the unbounded
memory requirement and the high arrival rate of data streams. In this
paper, we propose a new approach for mining itemsets on data
streams. Our approach, SFIDS, has been developed based on the FIDS
algorithm. The main aims were to keep some advantages of the
previous approach and resolve some of its drawbacks, and
consequently to improve run time and memory consumption. Our
approach has the following advantages: it uses a data structure similar
to a lattice for keeping frequent itemsets; it separates regions from each
other by deleting common nodes, which decreases the search
space, memory consumption and run time; and finally, considering the
CPU constraint, when an increasing arrival rate of data
overloads the system, SFIDS automatically detects this situation and
discards some of the unprocessed data. We guarantee that the error of the results
is bounded by a user-specified threshold, based on a probabilistic
technique. Final results show that the SFIDS algorithm attains
about a 50% run time improvement over the FIDS approach.
Abstract: The number of features required to represent an image
can be very large. Using all available features to recognize objects
can suffer from the curse of dimensionality. Feature selection and
extraction is the pre-processing step of image mining. The main issues in
analyzing images are the effective identification of features and
their extraction. The mining problem addressed here
is the grouping of features for different shapes. Experiments
have been conducted using the shape outline as the feature. Shape
outline readings are put through a normalization and dimensionality
reduction process using an eigenvector-based method to produce a
new set of readings. After this pre-processing step, the data are
grouped by their shapes. Through statistical analysis of these
readings together with peak measures, a robust classification and
recognition process is achieved. Tests showed that the suggested
methods are able to automatically recognize objects through their
shapes. Finally, experiments also demonstrate the system's invariance
to rotation, translation, scale, reflection and a small degree of
distortion.
Abstract: A data cutting and sorting method (DCSM) is proposed
to optimize the performance of data mining. DCSM reduces the
calculation time by getting rid of redundant data during the data
mining process. In addition, DCSM minimizes the computational units
by splitting the database and by sorting data with support counts. In the
process of searching for the relationship between metabolic syndrome
and lifestyles with the health examination database of an electronics
manufacturing company, DCSM demonstrates higher search
efficiency than the traditional Apriori algorithm in tests with different
support counts.
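The abstract above gives only an outline of DCSM, so the following is one plausible reading rather than the published method: before mining, items whose support count falls below the minimum are cut from every transaction, and the surviving items are sorted by descending support count. The function name `cut_and_sort`, the lifestyle items and the threshold are hypothetical.

```python
from collections import Counter


def cut_and_sort(transactions, min_count):
    """Drop infrequent items, then sort each transaction by support count.

    Returns (pruned_transactions, support_counts). Items are ordered by
    descending support count, with ties broken alphabetically;
    transactions left empty after pruning are removed.
    """
    counts = Counter(item for t in transactions for item in set(t))
    pruned = []
    for t in transactions:
        kept = [item for item in set(t) if counts[item] >= min_count]
        kept.sort(key=lambda item: (-counts[item], item))
        if kept:
            pruned.append(kept)
    return pruned, counts


# Hypothetical health-examination records.
records = [
    ["smoking", "obesity", "low_exercise"],
    ["obesity", "low_exercise"],
    ["obesity", "night_shift"],
    ["low_exercise", "obesity"],
]
pruned_records, item_counts = cut_and_sort(records, min_count=2)
```

Because no frequent itemset can contain an infrequent item, pruning of this kind shrinks the data that an Apriori-style search has to scan without changing its result.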