Abstract: The self-organizing map (SOM) is a well-known data reduction technique used in data mining. Data visualization can reveal structure in data sets that is otherwise hard to detect from raw data alone. However, interpretation through visual inspection is prone to errors and can be very tedious. There are several techniques for the automatic detection of clusters of code vectors found by SOMs, but they generally do not take into account the distribution of code vectors; this may lead to unsatisfactory clustering and poor definition of cluster boundaries, particularly where the density of data points is low. In this paper, we propose the use of a generic particle swarm optimization (PSO) algorithm for finding cluster boundaries directly from the code vectors obtained from SOMs. The application of our method to unlabeled call data for a mobile phone operator demonstrates its feasibility. The PSO algorithm uses the U-matrix of the SOM to determine cluster boundaries; the results of this novel automatic method correspond well to boundary detection through visual inspection of the code vectors and to the k-means algorithm.
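The abstract does not define the U-matrix the PSO operates on; as a generic, minimal sketch (assuming a rectangular SOM grid stored row-major, which the abstract does not specify), the U-matrix is the average distance from each code vector to its grid neighbours, with high values marking the cluster boundaries the swarm searches for:

```python
import math

def u_matrix(codebook, rows, cols):
    """Average Euclidean distance from each SOM node to its grid neighbours.

    codebook: list of code vectors in row-major grid order.
    Low U-values mark cluster interiors; ridges of high values mark
    the cluster boundaries.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    u = []
    for r in range(rows):
        for c in range(cols):
            node = codebook[r * cols + c]
            neigh = [codebook[nr * cols + nc]
                     for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= nr < rows and 0 <= nc < cols]
            u.append(sum(dist(node, n) for n in neigh) / len(neigh))
    return u

# Two tight groups of code vectors on a toy 2x2 grid; on a realistic map
# the high-valued ridge between such groups is the boundary to detect.
cb = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(u_matrix(cb, 2, 2))
```

The toy grid only shows the computation; on a real SOM the boundary detection operates on the ridges of this surface.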
Abstract: This paper presents the combination of different precipitation data sets with a distributed hydrological model, in order to examine the reproducibility of flood runoff in scattered observation catchments. The precipitation data sets were obtained from observations using rain gauges, a satellite-based estimate (TRMM), and a numerical weather prediction model (NWP), and were then coupled with the super tank model. The case study was conducted in three basins (small, medium, and large) located in Central Vietnam. Calculated hydrographs based on ground-observed rainfall showed the best fit to the measured stream flow, while those obtained from TRMM and NWP showed high uncertainty in peak discharges. However, calculated hydrographs using the adjusted rain field demonstrated a promising alternative for the application of TRMM and NWP in flood modeling for scattered observation catchments, especially for the extension of forecast lead time.
Abstract: Logic-based methods for learning from structured data
are limited with respect to handling large search spaces, preventing
large substructures from being considered by the resulting classifiers.
A novel approach to learning from structured data is introduced that
employs a structure transformation method, called finger printing, to
address these limitations. The method, which generates features
corresponding to arbitrarily complex substructures, is implemented in
a system called DIFFER. The method is demonstrated to perform
comparably to an existing state-of-the-art method on some benchmark
data sets without requiring restrictions on the search space.
Furthermore, learning from the union of the features generated by
finger printing and by the previous method outperforms learning from
each individual set of features on all benchmark data sets,
demonstrating the benefit of developing complementary, rather than
competing, methods for structure classification.
Abstract: Accurate demand forecasting is one of the key issues
in the inventory management of spare parts. The problem of
modeling future consumption becomes especially difficult for lumpy
patterns, which are characterized by intervals with no demand and
periods with actual demand occurrences showing large variation in
demand levels. Many forecasting methods perform poorly when
demand for an item is lumpy. In this study, based on the
characteristics of the lumpy demand patterns of spare parts, a hybrid
forecasting approach has been developed that uses a multi-layered
perceptron neural network and a traditional recursive method for
forecasting future demands. In the described approach, the
multi-layered perceptron is adapted to forecast the occurrences of
non-zero demands, and a conventional recursive method is then used
to estimate the quantity of the non-zero demands. In order to evaluate
the performance of the proposed approach, its forecasts were
compared to those obtained with the Syntetos-Boylan approximation
and with the multi-layered perceptron, generalized regression, and
Elman recurrent neural networks recently employed in this area. The
models were applied to forecast future demand for spare parts of the
Arak Petrochemical Company in Iran, using 30 types of real data sets.
The results indicate that the forecasts obtained by our proposed
model are superior to those obtained by the other methods.
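The "traditional recursive method" is not spelled out in the abstract; one baseline it cites, the Syntetos-Boylan approximation, can be sketched as a Croston-style recursion (the demand series and smoothing constant below are illustrative, not the paper's data):

```python
def sba_forecast(demand, alpha=0.1):
    """Syntetos-Boylan approximation for intermittent/lumpy demand.

    Croston-style recursion: exponentially smooth the non-zero demand
    size (z) and the inter-demand interval (p) separately, then apply
    the (1 - alpha/2) bias correction. Returns the per-period forecast
    after the series ends.
    """
    z = p = None  # smoothed demand size and interval
    q = 1         # periods elapsed since the last non-zero demand
    for d in demand:
        if d > 0:
            if z is None:            # initialise on the first demand
                z, p = d, q
            else:
                z = z + alpha * (d - z)
                p = p + alpha * (q - p)
            q = 1
        else:
            q += 1
    if z is None:                    # no demand observed at all
        return 0.0
    return (1 - alpha / 2) * z / p

# Lumpy series: zero-demand gaps interleaved with variable sizes.
print(sba_forecast([0, 5, 0, 0, 7, 0, 0, 0, 6], alpha=0.2))
```

Smoothing sizes and intervals separately is what distinguishes this family of methods from plain exponential smoothing, which is biased on lumpy series.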
Abstract: Clustering techniques have received attention in many areas, including engineering, medicine, biology, and data mining. The purpose of clustering is to group together data points which are close to one another. The K-means algorithm is one of the most widely used techniques for clustering. However, K-means has two shortcomings: it depends on the initial state and converges to local optima, and global solutions of large problems cannot be found with a reasonable amount of computational effort. Many studies in clustering have attempted to overcome the local-optima problem. This paper presents an efficient hybrid evolutionary optimization algorithm based on combining Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO), called PSO-ACO, for optimally clustering N objects into K clusters. The new PSO-ACO algorithm is tested on several data sets, and its performance is compared with those of ACO, PSO, and K-means clustering. The simulation results show that the proposed evolutionary optimization algorithm is robust and suitable for handling data clustering.
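As background for the shortcoming described above, a plain Lloyd's K-means in pure Python shows exactly where the dependence on the initial state enters, namely the random choice of starting centroids (the data set and seed below are illustrative, not from the paper):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means. The result depends on the randomly chosen
    initial centroids, which is the sensitivity hybrid approaches such
    as PSO-ACO try to overcome."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # <-- the initial state
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # recompute centroids; keep the old one if a cluster emptied
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```

On well-separated toy data any initialisation converges to the two true means; on harder data different seeds land in different local optima.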
Abstract: An alternative to the use of the Discrete Fourier
Transform (DFT) for Magnetic Resonance Imaging (MRI)
reconstruction is the use of parametric modeling techniques. This
approach is suitable for problems in which the image can be modeled
by explicit known source functions with a few adjustable parameters.
Despite the success reported in the use of modeling as an alternative
MRI reconstruction technique, two important problems challenge the
applicability of this method: model-order estimation and
model-coefficient determination. In this paper, five suggested criteria
for estimating the model order are evaluated: the Final Prediction
Error (FPE), the Akaike Information Criterion (AIC), the Residual
Variance (RV), the Minimum Description Length (MDL), and the
Hannan and Quinn (HNQ) criterion. These criteria were evaluated on
MRI data sets using the Transient Error Reconstruction Algorithm
(TERA). The result for each criterion is compared to the result
obtained with a fixed-order technique, and three measures of
similarity were evaluated. The results show that the use of MDL gives
the highest measure of similarity to the fixed-order technique.
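The abstract lists the criteria without their formulas. In their standard textbook forms, computed from the residual sum of squares of a k-parameter model fitted to n samples, they can be sketched as follows (a generic sketch, not the paper's exact implementation):

```python
import math

def order_criteria(rss, n, k):
    """Model-order selection scores for a k-parameter model fitted to n
    samples with residual sum of squares `rss`; a smaller score favours
    that order. Standard textbook forms (the RV and HNQ entries are one
    common variant each)."""
    s2 = rss / n  # maximum-likelihood estimate of the residual variance
    return {
        "FPE": s2 * (n + k) / (n - k),              # Final Prediction Error
        "AIC": n * math.log(s2) + 2 * k,            # Akaike Information Criterion
        "MDL": n * math.log(s2) + k * math.log(n),  # Minimum Description Length
        "HNQ": n * math.log(s2) + 2 * k * math.log(math.log(n)),  # Hannan-Quinn
        "RV":  rss / (n - k),                       # residual-variance criterion
    }

# MDL penalises extra parameters more heavily than AIC once n >= 8,
# which is why it tends to select more parsimonious model orders.
scores = order_criteria(rss=4.0, n=100, k=3)
print(scores)
```

In practice each criterion is evaluated over a range of candidate orders k and the minimising order is selected.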
Abstract: Methods for organizing web data into groups in order
to analyze web-based hypertext data and facilitate data availability
are very important given the number of documents available online.
The task of clustering web-based document structures therefore has
many applications, e.g., improving information retrieval on the web,
better understanding user navigation behavior, improving the
servicing of web users' requests, and increasing web information
accessibility. In this paper we investigate a new approach for
clustering web-based hypertexts on the basis of their graph structures.
The hypertexts are represented as so-called generalized trees, which
are more general than the usual directed rooted trees, e.g., DOM
trees. As an important preprocessing step we measure the structural
similarity between the generalized trees on the basis of a similarity
measure d. Then we apply agglomerative clustering to the obtained
similarity matrix in order to create clusters of hypertext graph
patterns representing navigation structures. In the present paper we
run our approach on a data set of hypertext structures and obtain
good results in Web Structure Mining. Furthermore, we outline the
application of our approach in Web Usage Mining as future work.
Abstract: Serial Analysis of Gene Expression (SAGE) is a powerful
quantification technique for generating cell or tissue gene expression
data. The gene expression profiles of cells or tissues in several
different states are difficult for biologists to analyze because of the
large number of genes typically involved. However, feature selection
in machine learning can successfully reduce this problem. Feature
selection reduces the number of features (genes) in a specific SAGE
data set and retains only the relevant genes. In this study, we used a
genetic algorithm to implement feature selection, and evaluated the
classification accuracy of the selected features with the K-nearest
neighbor method. To validate the proposed method, we used two
SAGE data sets for testing. The results of this study show that the
number of features of the original SAGE data set can be significantly
reduced and higher classification accuracy can be achieved.
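The fitness a genetic algorithm would maximise here, the K-nearest neighbor classification accuracy restricted to a candidate gene subset, can be sketched as follows (leave-one-out evaluation; the data and labels are toy values, not SAGE data):

```python
from collections import Counter

def knn_accuracy(data, labels, features, k=3):
    """Leave-one-out accuracy of k-NN restricted to a candidate feature
    subset: the kind of fitness a genetic algorithm can maximise when
    searching over gene subsets."""
    def dist(a, b):
        # squared Euclidean distance over the selected features only
        return sum((a[f] - b[f]) ** 2 for f in features)

    hits = 0
    for i, x in enumerate(data):
        neigh = sorted((j for j in range(len(data)) if j != i),
                       key=lambda j: dist(x, data[j]))[:k]
        vote = Counter(labels[j] for j in neigh).most_common(1)[0][0]
        hits += vote == labels[i]
    return hits / len(data)

# Feature 0 separates the two classes; feature 1 is pure noise, so a GA
# rewarding this fitness learns to keep feature 0 and drop feature 1.
data = [(0.0, 9), (0.2, 1), (0.1, 5), (5.0, 9), (5.2, 1), (5.1, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_accuracy(data, labels, features=[0]))  # informative feature
print(knn_accuracy(data, labels, features=[1]))  # noise feature
```

A GA individual is then simply a bit-mask over the genes, decoded into the `features` list before evaluation.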
Abstract: In this paper we compare the accuracy of data mining
methods for classifying students in order to predict students' class
grades. These predictions are useful for identifying weak students
and assisting management in taking remedial measures at an early
stage, so as to produce excellent graduates who finish with at least
second-class upper honours. First, we examine the accuracy of single
classifiers on our data set, choose the best one, and then ensemble it
with a weak classifier using a simple voting method. The results we
present show that combining different classifiers outperforms any
single classifier for predicting student performance.
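The simple voting method described above can be sketched as a majority vote over per-instance predictions; the classifier outputs below are hypothetical, purely to show the mechanism:

```python
from collections import Counter

def vote(predictions):
    """Majority vote over the class labels predicted by several
    classifiers for one instance; ties resolve to the first-listed label."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from a strong classifier ensembled with two
# weaker ones: the ensemble corrects the strong model's third prediction.
strong = ["pass", "fail", "pass", "pass"]
weak1  = ["pass", "fail", "fail", "pass"]
weak2  = ["fail", "fail", "pass", "pass"]
combined = [vote(p) for p in zip(strong, weak1, weak2)]
print(combined)  # -> ['pass', 'fail', 'pass', 'pass']
```

Voting helps when the member classifiers make uncorrelated errors; if the weak classifiers simply mirror the strong one, the ensemble adds nothing.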
Abstract: Expression data analysis is based mostly on statistical
approaches, which are indispensable for the study of biological
systems. Large amounts of multidimensional data resulting from
high-throughput technologies are not completely served by
biostatistical techniques and are usually complemented with visual,
knowledge-discovery, and other computational tools. In many
biological systems we can only speculate on the processes causing
the observed changes, and it is during visual explorative analysis of
the data that a hypothesis is formed. We would like to show the
usability of multidimensional visualization tools and promote their
use in the life sciences. We survey some of the multidimensional
visualization tools used in the process of data exploration, such as
parallel coordinates and RadViz, and we extend them by combining
them with the self-organizing map algorithm. We use a time-course
data set of transitional cell carcinoma of the bladder in our examples.
Analysis of data with these tools has the potential to uncover
additional relationships and non-trivial structures.
Abstract: Various methods based on regression ideas have been created to deal with data sets containing censored observations, e.g. the Buckley-James method, Miller's method, the Cox method, and the Koul-Susarla-Van Ryzin estimators. Even though comparison studies show that the Buckley-James method performs better than some of the other methods, it is still rarely used by researchers, mainly because of the limited diagnostic analysis developed for the Buckley-James method thus far. Therefore, a diagnostic tool for the Buckley-James method is proposed in this paper. It is called the renovated Cook's distance, RD*_i, and has been developed based on Cook's idea. The renovated Cook's distance RD*_i has advantages (depending on the analyst's demand) over (i) the change in the fitted value for a single case, DFIT*_i, as it measures the influence of case i on all n fitted values Ŷ* (not just the fitted value for case i, as DFIT*_i does), and (ii) the change in the estimates of the coefficients when the ith case is deleted, DBETA*_i, since DBETA*_i corresponds to the number of variables p, so it is usually easier to look at a diagnostic measure such as RD*_i, where the information from the p variables can be considered simultaneously. Finally, an example using the Stanford Heart Transplant data is provided to illustrate the proposed diagnostic tool.
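For orientation, the classical (uncensored) Cook's distance that the renovated RD*_i builds on can be computed for simple linear regression as below; the data are illustrative, and the formula is the standard textbook one, not the paper's censored-data version:

```python
def cooks_distance(x, y):
    """Classical Cook's distance for simple linear regression:
    D_i = r_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2,
    combining the size of the residual r_i with the leverage h_ii."""
    n, p = len(x), 2  # parameters: intercept + slope
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * yi for xi, yi in zip(x, y)) / sxx   # slope
    b0 = sum(y) / n - b1 * mx                                # intercept
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in resid) / (n - p)  # residual mean square
    d = []
    for xi, ri in zip(x, resid):
        h = 1 / n + (xi - mx) ** 2 / sxx       # leverage h_ii
        d.append(ri ** 2 / (p * s2) * h / (1 - h) ** 2)
    return d

# The last point is an outlier in y, so its Cook's distance dominates.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 2.1, 2.9, 4.2, 5.0, 12.0]
d = cooks_distance(x, y)
print(d.index(max(d)))  # -> 5
```

The appeal noted in the abstract carries over from the classical case: one scalar per observation summarises influence across all p coefficients at once.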
Abstract: Automated discovery of hierarchical structures in
large data sets has been an active research area in the recent past.
This paper focuses on the issue of mining generalized rules with crisp
hierarchical structure using Genetic Programming (GP) approach to
knowledge discovery. The post-processing scheme presented in this
work uses flat rules as initial individuals of GP and discovers
hierarchical structure. Suitable genetic operators are proposed for the
suggested encoding. Based on the Subsumption Matrix (SM), an
appropriate fitness function is suggested. Finally, Hierarchical
Production Rules (HPRs) are generated from the discovered
hierarchy. Experimental results are presented to demonstrate the
performance of the proposed algorithm.
Abstract: This paper proposes a novel architecture for developing decision support systems. Unlike conventional decision support systems, the proposed architecture endeavors to reveal the decision-making process such that humans' subjectivity can be incorporated into a computerized system and, at the same time, to preserve the capability of the computerized system in processing information objectively. A number of techniques used in developing the decision support system are elaborated to make the decision-making process transparent. These include procedures for high-dimensional data visualization, pattern classification, prediction, and evolutionary computational search. An artificial data set is first employed to compare the proposed approach with other methods. A simulated handwritten data set and a real data set on liver disease diagnosis are then employed to evaluate the efficacy of the proposed approach. The results are analyzed and discussed. The potential of the proposed architecture as a useful decision support system is demonstrated.
Abstract: In this paper, the linear regression model is estimated
by the ordinary least squares method and the partially linear
regression model is estimated by the penalized least squares method
using a smoothing spline. The differences and similarities between
the sums of squares of the linear regression and partially linear
regression models (semi-parametric regression models) are then
investigated. It is shown that the sums of squares in linear regression
reduce to the sums of squares in the partially linear regression model.
Furthermore, we indicate that the various sums of squares in linear
regression correspond to different deviance statements in partially
linear regression. In addition, the coefficient of determination derived
for the linear regression model is easily generalized to the coefficient
of determination of the partially linear regression model. To this end,
two different applications are made: a simulated and a real data set
are considered to support the claims made here. In this way, the study
is supported with a simulation and a real data example.
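The coefficient of determination referred to above follows from the sum-of-squares decomposition SST = SSR + SSE, which is exact for ordinary least squares with an intercept; a minimal sketch with hypothetical fitted values:

```python
def r_squared(y, y_hat):
    """Coefficient of determination R^2 = 1 - SSE/SST, the quantity the
    paper generalises from linear to partially linear regression.
    (SST = SSR + SSE holds exactly for OLS fits with an intercept.)"""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total SS
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # error SS
    return 1 - sse / sst

# Hypothetical observations and fitted values from some regression fit.
y     = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(y, y_hat), 3))  # -> 0.98
```

In the partially linear case the error sum of squares is replaced by the corresponding deviance, which is the correspondence the abstract highlights.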
Abstract: This paper proposes new hybrid approaches for face
recognition. The Gabor wavelet representation of face images is an
effective approach for both facial action recognition and face
identification. Performing dimensionality reduction and linear
discriminant analysis on the down-sampled Gabor wavelet faces can
increase the discriminative ability. Nearest feature space is extended
to various similarity measures. In our experiments, the proposed
Gabor wavelet faces combined with the extended neural net feature
space classifier show very good performance, achieving a maximum
correct recognition rate of 93% on the ORL data set without any
preprocessing step.
Abstract: Although many researchers have studied flow
hydraulics in compound channels, there are still many complicated
problems in determining their flow rating curves. Many different
methods have been presented for these channels, but extending them
to all types of compound channels with different geometrical and
hydraulic conditions is certainly difficult. In this study, flow
discharge in compound channels was estimated with artificial neural
networks (ANNs), with the aid of nearly 400 laboratory and field data
sets of geometry and flow rating curves from 30 different straight
compound sections. Thirteen dimensionless input variables, including
relative depth, relative roughness, relative width, aspect ratio, bed
slope, main channel side slopes, flood plain side slopes, and berm
inclination, and one output variable (flow discharge) were used in the
ANNs. Comparison of the ANN model with a traditional method (the
divided channel method, DCM) shows the high accuracy of the ANN
model results. The results of a sensitivity analysis showed that
relative depth, with a contribution of 47.6 percent, is the most
effective input parameter for flow discharge prediction. Relative
width and relative roughness have 19.3 and 12.2 percent of the
importance, respectively. On the other hand, the shape parameter and
the main channel and flood plain side slopes, with 2.1, 3.8, and 3.8
percent of the contribution, have the least importance.
Abstract: Small interfering RNA (siRNA) alters the regulatory
role of mRNA during gene expression by translational inhibition.
Recent studies show that upregulation of mRNA causes serious
diseases such as cancer, so designing effective siRNAs with good
knockdown effects plays an important role in gene silencing. Various
siRNA design tools have been developed previously. In this work, we
analyze the existing well-scoring second-generation siRNA prediction
tools and optimize the efficiency of siRNA prediction by designing a
computational model using an artificial neural network and the whole
stacking energy (%G), which may help in gene silencing and drug
design for cancer therapy. Our model is trained and tested on a large
data set of siRNA sequences. We validate our results by computing
the correlation coefficient between the experimental and predicted
inhibition efficacies of the siRNAs. We achieved a correlation
coefficient of 0.727 with our previous computational model and could
improve the correlation coefficient up to 0.753 when the threshold of
the whole stacking energy is greater than or equal to -32.5 kcal/mol.
Abstract: To overcome the product overload faced by Internet
shoppers, we introduce a semantic recommendation procedure which
is more efficient when applied to Internet shopping malls. The
suggested procedure recommends semantically related products to
customers and is based on Web usage mining, product classification,
association rule mining, and frequent purchasing. We applied the
procedure to the MovieLens data set for performance evaluation, and
some experimental results are provided. The experimental results
show superior performance in terms of coverage and precision.
Abstract: An emotional speech recognition system for
applications on smart phones was proposed in this study; combined
with 3G mobile communications and social networks, it provides
users and their groups with more interaction and care. This study
developed a mechanism using support vector machines (SVM) to
recognize emotions in speech such as happiness, anger, sadness, and
a normal state.
The mechanism uses a hierarchical classifier to adjust the weights of
acoustic features and divides various parameters into the categories of
energy and frequency for training. In this study, 28 commonly used
acoustic features including pitch and volume were proposed for
training. In addition, a time-frequency parameter obtained by
continuous wavelet transforms was also used to identify the accent and
intonation in a sentence during the recognition process. The Berlin
Database of Emotional Speech was used by dividing the speech into
male and female data sets for training. According to the experimental
results, the accuracies of male and female test sets were increased by
4.6% and 5.2% respectively after using the time-frequency parameter
for classifying happy and angry emotions. For the classification of all
emotions, the average accuracy, including male and female data, was
63.5% for the test set and 90.9% for the whole data set.
Abstract: Many works have been carried out to compare the
efficiency of several goodness-of-fit procedures for identifying
whether or not a particular distribution can adequately explain a
data set. In this paper, a study is conducted to investigate the power
of several goodness-of-fit tests, namely the Kolmogorov-Smirnov
(KS), Anderson-Darling (AD), and Cramér-von Mises (CV) tests and
a proposed modification of the Kolmogorov-Smirnov goodness-of-fit
test which incorporates a variance-stabilizing transformation (FKS).
The performance of these selected tests is studied under simple
random sampling (SRS) and ranked set sampling (RSS). This study
shows that, in general, the Anderson-Darling (AD) test performs
better than the other GOF tests. However, there are some cases where
the proposed test performs as well as the AD test.
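For reference, the two-sided KS statistic underlying both the KS test and the proposed FKS variant is the maximum gap between the empirical CDF of the sample and the hypothesised continuous CDF; a minimal sketch (the sample values are illustrative):

```python
def ks_statistic(sample, cdf):
    """Two-sided Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDF of the sample and a hypothesised
    continuous CDF, checked just before and just after each data point."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# A small sample tested against the Uniform(0,1) CDF F(x) = x.
sample = [0.1, 0.35, 0.5, 0.62, 0.9]
print(round(ks_statistic(sample, lambda x: min(max(x, 0.0), 1.0)), 2))
```

The AD and CV statistics replace this single supremum with weighted integrals of the squared gap, which is why AD is more sensitive in the tails.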