Abstract: This paper focuses on analyzing medical diagnostic data using classification rules in data mining and context reduction in formal concept analysis. It helps in finding redundancies among the various medical examination tests used in the diagnosis of a disease. Classification rules have been derived from positive and negative association rules using the concept lattice structure of Formal Concept Analysis. The context reduction technique of Formal Concept Analysis, together with the classification rules, has been used to find redundancies among the various medical examination tests. It also determines whether expensive medical tests can be replaced by cheaper ones.
Abstract: The purpose of this paper is to develop models that would enable predicting student success. These models could improve the allocation of students among colleges and optimize the newly introduced model of government subsidies for higher education. To collect data, an anonymous survey was carried out among the final-year undergraduate student population using random sampling. Decision trees were created, of which the two most successful in predicting student success were chosen based on two criteria: Grade Point Average (GPA) and the time a student needs to finish the undergraduate program (time-to-degree). Decision trees have been shown to be a good method for classifying student success, and they could be further improved by increasing the survey sample and developing specialized decision trees for each type of college. These methods have great potential for use in decision support systems.
Abstract: Numerical analysis naturally finds applications in all
fields of engineering and the physical sciences, but in the
21st century, the life sciences and even the arts have adopted
elements of scientific computations. The numerical data analysis
became key process in research and development of all the fields [6].
In this paper we have made an attempt to analyze the specified
numerical patterns with reference to the association rule mining
techniques with minimum confidence and minimum support mining
criteria. The extracted rules and analyzed results are graphically
demonstrated. Association rules are a simple but very useful form of
data mining that describe the probabilistic co-occurrence of certain
events within a database [7]. They were originally designed to
analyze market-basket data, in which the likelihood of items being
purchased together within the same transaction is analyzed.
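The minimum-support and minimum-confidence criteria described above can be sketched in a few lines; the transactions, item names and thresholds below are illustrative, not the paper's data.

```python
from itertools import combinations

# Toy market-basket transactions (illustrative only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules(transactions, min_support=0.5, min_confidence=0.6):
    """Yield (antecedent, consequent, support, confidence) for 1->1 rules."""
    items = set().union(*transactions)
    for a, b in combinations(sorted(items), 2):
        for x, y in ((a, b), (b, a)):
            s = support({x, y}, transactions)
            if s >= min_support:
                conf = s / support({x}, transactions)
                if conf >= min_confidence:
                    yield x, y, s, conf

for x, y, s, c in rules(transactions):
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

Both thresholds prune the search: support removes rare itemsets before confidence is ever computed, which is the core of the classic Apriori strategy.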
Abstract: Data mining uses a variety of techniques each of which
is useful for some particular task. It is important to have a deep
understanding of each technique and be able to perform sophisticated
analysis. In this article we describe a tool built to simulate a variation
of the Kohonen network to perform unsupervised clustering and
support the entire data mining process up to results visualization. A
graphical representation helps the user find a strategy to
optimize classification by adding, moving or deleting a neuron in order
to change the number of classes. The tool is able to automatically
suggest a strategy to optimize the number of classes, and it also
supports both tree classifications and semi-lattice organizations of
the classes, giving users the possibility of passing from one
class to the ones with which it shares some aspects.
Examples of using tree and semi-lattice classifications are given to
illustrate advantages and problems. The tool is applied to classify
macroeconomic data that report the most developed countries' imports
and exports. It is possible to classify the countries based on their
economic behaviour and to use the tool to characterize the commercial
behaviour of a country in a selected class from the analysis of the
positive and negative features that contribute to class formation.
Possible interrelationships between the classes and their meaning are
also discussed.
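A minimal sketch of Kohonen-style unsupervised clustering follows; for brevity it uses a winner-take-all update without a neighborhood function, and the data, seed and parameters are illustrative rather than those of the tool described above.

```python
import math, random

random.seed(0)
# Two well-separated toy clusters in 2-D (illustrative only)
data = [(0.1, 0.2), (0.15, 0.1), (0.9, 0.85), (0.95, 0.9)]
# Two neurons with random initial weight vectors
neurons = [[random.random(), random.random()] for _ in range(2)]

def best_match(x):
    """Index of the neuron closest to input x (the winner)."""
    return min(range(len(neurons)),
               key=lambda i: math.dist(neurons[i], x))

for epoch in range(50):
    lr = 0.5 * (1 - epoch / 50)          # decaying learning rate
    for x in data:
        w = neurons[best_match(x)]
        for d in range(2):               # move the winner toward the input
            w[d] += lr * (x[d] - w[d])

# After training, each neuron should sit near one of the two clusters
labels = [best_match(x) for x in data]
```

Adding or deleting a neuron, as the tool's interface allows, corresponds here simply to changing the length of the `neurons` list and retraining.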
Abstract: The occurrence of missing values in databases is a serious problem for data mining tasks, responsible for degrading data quality and the accuracy of analyses. In this context, the area has shown a lack of standardization in experiments that treat missing values, making evaluation across different studies difficult due to the absence of common parameters. This paper proposes a testbed intended to facilitate the implementation of experiments and provide unbiased parameters, using available datasets and suitable performance metrics, in order to optimize the evaluation and comparison of state-of-the-art missing-value treatments.
Abstract: Water quality is a subject of ongoing concern.
Deterioration of water quality has initiated serious management
efforts in many countries. This study endeavors to automatically
classify water quality. The water quality classes are evaluated using 6
factor indices. These factors are pH value (pH), Dissolved Oxygen
(DO), Biochemical Oxygen Demand (BOD), Nitrate Nitrogen
(NO3N), Ammonia Nitrogen (NH3N) and Total Coliform (TColiform).
The methodology involves applying data mining
techniques using multilayer perceptron (MLP) neural network
models. The data were collected from 11 canal sites in the Dusit
district of Bangkok, Thailand, obtained from the Department of
Drainage and Sewerage, Bangkok Metropolitan Administration,
during 2007-2011. The multilayer perceptron neural network
achieved a high classification accuracy of 96.52% on the water
quality of the Dusit district canals in Bangkok. This encouraging
result could subsequently be applied to water quality planning and
management.
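The forward pass of such an MLP over the six water-quality indices (pH, DO, BOD, NO3N, NH3N, TColiform) can be sketched as follows; the weights, biases and input values are illustrative placeholders, not the trained model from the study.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer of sigmoid units; sigmoid output in (0, 1)."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(wi * hi for wi, hi in zip(w_out, hidden)) + b_out)

# Example index values: pH, DO, BOD, NO3N, NH3N, TColiform (illustrative)
x = [7.2, 5.0, 2.0, 1.0, 0.5, 3.0]
w_hidden = [[0.1, 0.2, -0.3, 0.05, -0.1, 0.02]] * 3   # placeholder weights
b_hidden = [0.0, 0.1, -0.1]
w_out = [0.5, -0.4, 0.3]
b_out = 0.0
score = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
```

In a real classifier the weights would be learned by back-propagation, and the output layer would typically have one unit per water-quality class.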
Abstract: Text data mining is a process of exploratory data
analysis. Classification maps data into predefined groups or classes.
It is often referred to as supervised learning because the classes are
determined before examining the data. This paper describes a
proposed radial basis function classifier that performs comparative
cross-validation against an existing radial basis function classifier.
The feasibility and benefits of the proposed approach are
demonstrated by means of a data mining problem: direct marketing,
which has become an important application field of data mining.
Comparative cross-validation involves estimating accuracy by either
stratified k-fold cross-validation or equivalent repeated random
subsampling. While the proposed method may have high bias, its
performance (accuracy estimation in our case) may be poor due to
high variance. Thus the accuracy with the proposed radial basis
function classifier was lower than with the existing one. However,
there is a small improvement in runtime and a larger improvement in
precision and recall. In the proposed method, classification accuracy
and prediction accuracy are determined, where the prediction
accuracy is comparatively high.
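Stratified k-fold splitting, one of the two estimation schemes named above, can be sketched as index dealing per class; the labels below are illustrative, not the direct-marketing data itself.

```python
from collections import defaultdict

def stratified_kfold(labels, k):
    """Return k folds of row indices, each preserving class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):   # deal each class round-robin
            folds[j % k].append(i)
    return folds

# Imbalanced toy labels: 3 "yes", 6 "no" (illustrative only)
labels = ["yes", "no", "no", "yes", "no", "no", "yes", "no", "no"]
folds = stratified_kfold(labels, 3)
```

Each fold here keeps the 1:2 class ratio, which is what makes stratified k-fold a lower-variance accuracy estimator than plain random subsampling on skewed data such as direct-marketing responses.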
Abstract: In this paper we present a photo mosaic smartphone
application for client-server based large-scale image databases. Photo
mosaic is not a new concept, but there are very few smartphone
applications especially for a huge number of images in the
client-server environment. To support large-scale image databases,
we first propose an overall framework working as a client-server
model. We then present a concept of image-PAA features to efficiently
handle a huge number of images and discuss its lower bounding
property. We also present a best-match algorithm that exploits the
lower bounding property of image-PAA. We finally implement an
efficient Android-based application and demonstrate its feasibility.
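The lower-bounding idea behind PAA can be sketched on a 1-D sequence; treating a signal this way is an illustration of the property, not the paper's exact image-PAA feature.

```python
import math

def paa(seq, m):
    """Average the sequence into m equal-width segments."""
    n = len(seq)
    return [sum(seq[i * n // m:(i + 1) * n // m]) /
            (((i + 1) * n // m) - (i * n // m)) for i in range(m)]

def dist(a, b):
    """True Euclidean distance between two sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def paa_dist(pa, pb, n, m):
    """Lower bound of the true distance (assumes n divisible by m)."""
    return math.sqrt((n / m) * sum((x - y) ** 2 for x, y in zip(pa, pb)))

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [2.0, 2.0, 4.0, 4.0, 6.0, 6.0, 8.0, 8.0]
lb = paa_dist(paa(a, 4), paa(b, 4), 8, 4)
true = dist(a, b)
```

Because `lb` never exceeds `true`, a best-match search can safely prune any candidate whose PAA distance already exceeds the best true distance found so far, which is exactly the kind of pruning the paper's best-match algorithm exploits.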
Abstract: Selecting the data modeling technique for an
information system is determined by the objective of the resultant
data model. Dimensional modeling is the preferred modeling
technique for data destined for data warehouses and data mining,
presenting data models that ease analysis and queries, in contrast
with entity relationship modeling. The establishment of data
warehouses as components of information system landscapes in
many organizations has subsequently led to the development of
dimensional modeling. This has been significantly more developed
and reported for commercial database management systems than for
open-source ones, thereby making it less affordable for
those in resource-constrained settings. This paper presents
dimensional modeling of HIV patient information using open source
modeling tools. It aims to take advantage of the fact that the regions
most affected by HIV (sub-Saharan Africa) are also heavily resource
constrained while holding large quantities of HIV data. Two HIV
data source systems were studied to identify appropriate dimensions
and facts; these were then modeled using two
open source dimensional modeling tools. Use of open source would
reduce the software costs for dimensional modeling and in turn make
data warehousing and data mining more feasible even for those in
resource constrained settings but with data available.
Abstract: The paper investigates the feasibility of constructing a software multi-agent based monitoring and classification system and utilizing it to provide automated and accurate classification of end users developing applications in the spreadsheet domain. The agents function autonomously to provide continuous and periodic monitoring of Excel spreadsheet workbooks, resulting in the development of the Multi-Agent Classification System (MACS), which complies with the specifications of the Foundation for Intelligent Physical Agents (FIPA). Several technologies have been brought together to build MACS. The strength of the system is the integration of agent technology and the FIPA specifications with other technologies, namely Windows Communication Foundation (WCF) services, Service Oriented Architecture (SOA), and Oracle Data Mining (ODM). Microsoft's .NET Windows service based agents were utilized to develop the monitoring agents of MACS, while the .NET WCF services together with the SOA approach allowed distribution and communication between agents over the WWW in order to support monitoring and classification of multiple developers. ODM was used to automate the classification phase of MACS.
Abstract: This article proposes a methodology for computer
numerical control (CNC) machine scoring. The case study company
is a manufacturer of hard disk drive parts in Thailand. In this
company, samples of parts manufactured on CNC machines are
usually taken randomly for quality inspection. These inspection data
are used to decide whether to shut down a machine that shows a
tendency to produce parts that are out of specification. A large
amount of data is produced in this process, and data mining can be a
very useful technique for analyzing it. In this research, data mining
techniques were used to construct a machine scoring model called
'machine priority assessment model (MPAM)'. This model helps to
ensure that the machine with higher risk of producing defective parts
be inspected before those with lower risk. If defect-prone machines
are identified sooner, defective parts and rework can be reduced,
hence improving overall productivity. The results showed that the
proposed method can be successfully implemented and that
approximately 351,000 baht of opportunity cost could have been
saved in the case study company.
Abstract: With the advance of information technology in the
new era, the use of the Internet to access data resources has
steadily increased and huge amounts of data have become accessible
in various forms. Naturally, network providers and agencies seek
to prevent electronic attacks that may be harmful or related to
terrorist activity. This has led the authorities to undertake a
variety of methods to protect special regions from harmful data.
One of the most important approaches is to use a firewall in the
network facilities. The main objective of a firewall is to stop the
transfer of suspicious packets in several ways. However, because of
its blind packet stopping, high processing power requirements and
high price, some providers are reluctant to use a firewall. In this
paper we propose a method to find a discriminant function that
distinguishes between usual packets and harmful ones through
statistical processing of network router logs. By discriminating
these data, an administrator may take appropriate action against the
user. This method is very fast and can be used simply alongside
Internet routers.
Abstract: Names are important in many societies, even in technologically oriented ones which use e.g. ID systems to identify individual people. Names such as surnames are the most important, as they are used in many processes such as the identification of people and genealogical research. On the other hand, variation of names can be a major problem for the identification of and search for people, e.g. in web search or for security reasons. Name matching presumes a priori that a recorded name written in one alphabet reflects the phonetic identity of two samples or some transcription error in copying a previously recorded name; we add to this the premise that the two names imply the same person. This paper describes name variations and gives a basic description of various name matching algorithms developed to overcome name variation and to find reasonable variants of names, which can be used to further reduce mismatches in record linkage and name search. The implementation contains algorithms for computing a range of fuzzy matching based on different types of algorithms, e.g. composite and hybrid methods, and allows us to test and measure the algorithms for accuracy. NYSIIS, LIG2 and Phonex have been shown to perform well and provided sufficient flexibility to be included in the linkage/matching process for optimising name searching.
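Soundex, a classic phonetic algorithm in the same family as the NYSIIS and Phonex methods named above, gives a concrete feel for how phonetic name matching works; this is a simplified sketch (it assumes the name does not start with H or W), not the paper's implementation.

```python
def soundex(name):
    """Simplified American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    # Drop h/w so letters they separate merge; vowels map to "" and
    # act as separators that allow a repeated code to be kept.
    digits = [codes.get(c, "") for c in name if c not in "hw"]
    out = []
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:
            out.append(d)
        prev = d
    return (first + "".join(out) + "000")[:4]
```

Phonetically similar spellings collapse to one key, e.g. `soundex("Smith")` and `soundex("Smyth")` are equal, which is how such codes tolerate name variation during record linkage.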
Abstract: In this study, the Multi-Layer Perceptron (MLP) with the Back-Propagation learning algorithm is used for the effective diagnosis of Parkinson's disease (PD), a challenging problem for the medical community. Typically characterized by tremor, PD occurs due to the loss of dopamine in the brain's thalamic region, which results in involuntary or oscillatory movement in the body. A feature selection algorithm is used along with biomedical test values to diagnose Parkinson's disease. Clinical diagnosis is done mostly by the doctor's expertise and experience, but cases of wrong diagnosis and treatment are still reported. Patients are asked to take a number of tests for diagnosis, and in many cases not all the tests contribute towards effective diagnosis of the disease. Our work is to classify the presence of Parkinson's disease with a reduced number of attributes. Originally, 22 attributes are involved in classification. We use Information Gain to determine the attributes, which reduces the number of attributes that need to be taken from patients. An artificial neural network is used to classify the diagnosis of patients. The twenty-two attributes are reduced to sixteen. The accuracy on the training data set is 82.051% and on the validation data set 83.333%.
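Information gain, the attribute-ranking criterion used above to reduce 22 attributes to 16, can be sketched as an entropy reduction; the toy records and attribute names below are illustrative, not the Parkinson's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction from splitting the rows on one attribute."""
    n = len(rows)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

# Toy records: "tremor" separates the classes perfectly, "age" not at all
rows = [{"tremor": "yes", "age": "old"}, {"tremor": "yes", "age": "young"},
        {"tremor": "no", "age": "old"}, {"tremor": "no", "age": "young"}]
labels = ["pd", "pd", "healthy", "healthy"]
```

Ranking every attribute by `info_gain` and keeping only the top scorers is precisely the kind of filter that lets a study drop uninformative medical tests.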
Abstract: This study proposes a novel recommender system to
provide the advertisements of context-aware services. Our proposed
model is designed to apply a modified collaborative filtering (CF)
algorithm with regard to several dimensions for the personalization
of mobile devices: location, time and the user's needs type. In
particular, we employ a classification rule to understand the user's needs
type using a decision tree algorithm. In addition, we collect primary
data from the mobile phone users and apply them to the proposed
model to validate its effectiveness. Experimental results show that the
proposed system makes more accurate and satisfactory advertisements
than comparative systems.
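A plain user-based collaborative filtering step, the core that the paper modifies, can be sketched with cosine similarity; the ratings and ad names are illustrative, and the extra dimensions the paper adds (location, time, needs type) are omitted for brevity.

```python
import math

# Toy user-item rating matrix (illustrative only)
ratings = {
    "u1": {"coffee_ad": 5, "movie_ad": 3, "shoe_ad": 1},
    "u2": {"coffee_ad": 4, "movie_ad": 3},
    "u3": {"coffee_ad": 1, "movie_ad": 5, "shoe_ad": 4},
}

def cosine(a, b):
    """Cosine similarity; the dot product runs over co-rated items only."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    return num / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    pairs = [(cosine(ratings[user], ratings[v]), r[item])
             for v, r in ratings.items() if v != user and item in r]
    total = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / total if total else None

score = predict("u2", "shoe_ad")
```

The paper's modification would, in effect, restrict or re-weight the neighbor set by the contextual dimensions before this weighted average is taken.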
Abstract: This research presents a system for post processing of
data that takes mined flat rules as input and discovers crisp as well as
fuzzy hierarchical structures using Learning Classifier System
approach. Learning Classifier System (LCS) is basically a machine
learning technique that combines evolutionary computing,
reinforcement learning, supervised or unsupervised learning and
heuristics to produce adaptive systems. A LCS learns by interacting
with an environment from which it receives feedback in the form of
numerical reward. Learning is achieved by trying to maximize the
amount of reward received. Crisp description for a concept usually
cannot represent human knowledge completely and practically. In the
proposed Learning Classifier System, the initial population is
constructed as a random collection of HPR-trees (related production
rules), and crisp/fuzzy hierarchies are evolved. A fuzzy subsumption relation is
suggested for the proposed system and based on Subsumption Matrix
(SM), a suitable fitness function is proposed. Suitable genetic
operators are proposed for the chosen chromosome representation
method. For implementing reinforcement, a suitable reward and
punishment scheme is also proposed. Experimental results are
presented to demonstrate the performance of the proposed system.
Abstract: The main goal of data mining is to extract accurate, comprehensible and interesting knowledge from databases that may be considered large search spaces. In this paper, a new, efficient type of Genetic Algorithm (GA) called the uniform two-level GA is proposed as a search strategy to discover truly interesting, high-level prediction rules, a difficult and relatively little-researched problem, rather than discovering classification knowledge as is usual in the literature. The proposed method uses the advantage of the uniform population method and addresses the task of generalized rule induction, which can be regarded as a generalization of the task of classification. Although generalized rule induction requires a great deal of computation, which standard algorithms usually cannot handle satisfactorily, it was demonstrated that this method increases the performance of GAs and rapidly finds interesting rules.
Abstract: The nature of consumer products makes forecasting future
demand difficult, and the accuracy of the forecasts
significantly affects the overall performance of the supply chain
system. In this study, two data mining methods, artificial neural
network (ANN) and support vector machine (SVM), were utilized to
predict the demand of consumer products. The training data used was
the actual demand of six different products from a consumer product
company in Thailand. The results indicated that SVM had a better
forecast quality (in terms of MAPE) than ANN in every category of
products. Moreover, another important finding was that the
difference in MAPE between the two methods was significantly
larger when the data were highly correlated.
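The MAPE metric used to compare the two methods is a one-liner; the demand and forecast numbers below are illustrative, not the company's data.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent; assumes no zero actuals."""
    return 100 * sum(abs(a - f) / abs(a)
                     for a, f in zip(actual, forecast)) / len(actual)

# Illustrative demand series and two hypothetical forecasts
actual = [100, 120, 80, 90]
svm_forecast = [102, 118, 83, 88]   # close to actual -> low MAPE
ann_forecast = [110, 105, 95, 99]   # further off -> higher MAPE
```

Because each error is scaled by the actual demand, MAPE lets forecast quality be compared across products with very different sales volumes, which is why it suits a multi-product study like this one.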
Abstract: This paper applies a fuzzy clustering algorithm to classifying real estate companies in China according to some general financial indexes, such as income per share, share accumulation fund, net profit margin, weighted net assets yield and shareholders' equity. By constructing and normalizing the initial partition matrix, obtaining the fuzzy similarity matrix with the Minkowski metric and computing its transitive closure, the dynamic fuzzy clustering analysis for real estate companies shows clearly that the clustering results change gradually as the threshold decreases, and that there is a similar relationship with the prices of those companies in the stock market. In this way, it is of great value in comparing the real estate companies' financial condition in order to grasp good investment opportunities.
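The transitive-closure step of this kind of fuzzy clustering can be sketched with max-min composition followed by a threshold cut; the 4x4 similarity values are illustrative, not real financial indexes.

```python
def maxmin(a, b):
    """Max-min composition of two square fuzzy relations."""
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n))
             for j in range(n)] for i in range(n)]

def transitive_closure(r):
    """Square the relation until it stabilizes: r, r^2, r^4, ..."""
    while True:
        r2 = maxmin(r, r)
        if r2 == r:
            return r
        r = r2

def cut(r, threshold):
    """Group items whose closure similarity meets the threshold."""
    n = len(r)
    clusters, seen = [], set()
    for i in range(n):
        if i not in seen:
            group = {j for j in range(n) if r[i][j] >= threshold}
            seen |= group
            clusters.append(sorted(group))
    return clusters

# Toy fuzzy similarity matrix: items 0-1 and 2-3 are mutually similar
r = [[1.0, 0.8, 0.3, 0.2],
     [0.8, 1.0, 0.4, 0.3],
     [0.3, 0.4, 1.0, 0.9],
     [0.2, 0.3, 0.9, 1.0]]
clusters = cut(transitive_closure(r), 0.8)
```

Lowering the threshold merges clusters step by step, which is exactly the "dynamic" behaviour the abstract describes as the threshold is reduced.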
Abstract: Recently, data mining has been applied to scientific bibliographic databases to analyze the pathways of knowledge or the core scientific relevance of a Nobel laureate or a country. This specific case of data mining has been named citation mining, and it is the integration of citation bibliometrics and text mining. In this paper we present an improved web implementation of statistical physics algorithms to perform the text mining component of citation mining. In particular, we use an entropy-like distance between compressed texts as an indicator of the similarity between them. Finally, we have included the recently proposed h index to characterize scientific production. We have used this web implementation to identify users, applications and impact of the Mexican scientific institutions located in the State of Morelos.
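A compression-based similarity in the spirit of the entropy-like distance described above can be sketched with the normalized compression distance (NCD); the compressor choice (zlib) and the strings are illustrative, not the paper's implementation.

```python
import zlib

def ncd(x, y):
    """Normalized compression distance: near 0 for very similar texts."""
    cx = len(zlib.compress(x.encode()))
    cy = len(zlib.compress(y.encode()))
    cxy = len(zlib.compress((x + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Illustrative texts: a and b share most of their vocabulary, c does not
a = "statistical physics of complex networks " * 20
b = "statistical physics of complex systems " * 20
c = "pre-columbian pottery of central mexico " * 20
```

The intuition is that concatenating two similar texts compresses almost as well as one of them alone, so `ncd(a, b)` comes out smaller than `ncd(a, c)`; no domain-specific features are needed, which is what makes compression distances attractive for citation text mining.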