Determining Cluster Boundaries Using Particle Swarm Optimization

The self-organizing map (SOM) is a well-known data reduction technique used in data mining. Data visualization can reveal structure in data sets that is otherwise hard to detect from raw data alone. However, interpretation through visual inspection is prone to errors and can be very tedious. There are several techniques for the automatic detection of clusters of code vectors found by SOMs, but they generally do not take into account the distribution of code vectors; this may lead to unsatisfactory clustering and poor definition of cluster boundaries, particularly where the density of data points is low. In this paper, we propose the use of a generic particle swarm optimization (PSO) algorithm for finding cluster boundaries directly from the code vectors obtained from SOMs. The application of our method to unlabeled call data for a mobile phone operator demonstrates its feasibility. The PSO algorithm uses the U-matrix of the SOM to determine cluster boundaries; the results of this novel automatic method correspond well to boundary detection through visual inspection of code vectors and to the k-means algorithm.
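
As a rough illustration of the U-matrix that the abstract refers to, the sketch below computes, for each SOM unit, the mean distance between its code vector and those of its grid neighbours; large values suggest a cluster boundary. It assumes a rectangular grid with a 4-neighbourhood and Euclidean distance, and the function name `u_matrix` is hypothetical. It is not the authors' implementation and does not include the PSO step.

```python
import math

def u_matrix(codebook):
    """Mean distance from each SOM unit to its grid neighbours.

    codebook[r][c] is the code vector (a sequence of floats) of the unit
    at grid row r, column c. Assumes a rectangular grid, 4-neighbourhood.
    """
    rows, cols = len(codebook), len(codebook[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(codebook[r][c], codebook[nr][nc]))
            # A 1x1 grid has no neighbours; report 0.0 in that case.
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result
```

On a one-row map whose third unit lies far from the first two, the U-matrix value rises towards that unit, which is the kind of ridge a boundary search would look for.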

Examination of Flood Runoff Reproductivity for Different Rainfall Sources in Central Vietnam

This paper combines different precipitation data sets with a distributed hydrological model in order to examine the flood runoff reproductivity of scattered observation catchments. The precipitation data sets were obtained from rain-gauge observations, a satellite-based estimate (TRMM), and a numerical weather prediction (NWP) model, and were then coupled with the super tank model. The case study was conducted in three basins (small, medium, and large) located in Central Vietnam. Hydrographs calculated from ground-observed rainfall showed the best fit to measured stream flow, while those obtained from TRMM and NWP showed high uncertainty in peak discharges. However, hydrographs calculated using the adjusted rainfield suggest that TRMM and NWP are a promising alternative for flood modeling in scattered observation catchments, especially for extending forecast lead time.

DIFFER: A Propositionalization approach for Learning from Structured Data

Logic-based methods for learning from structured data are limited with respect to handling large search spaces, which prevents large substructures from being considered by the resulting classifiers. A novel approach to learning from structured data is introduced that employs a structure transformation method, called finger printing, to address these limitations. The method, which generates features corresponding to arbitrarily complex substructures, is implemented in a system called DIFFER. The method is demonstrated to perform comparably to an existing state-of-the-art method on some benchmark data sets without requiring restrictions on the search space. Furthermore, learning from the union of features generated by finger printing and the previous method outperforms learning from each individual set of features on all benchmark data sets, demonstrating the benefit of developing complementary, rather than competing, methods for structure classification.

A Hybrid Neural Network and Traditional Approach for Forecasting Lumpy Demand

Accurate demand forecasting is one of the key issues in the inventory management of spare parts. Modeling future consumption becomes especially difficult for lumpy patterns, which are characterized by intervals in which there is no demand and periods of actual demand occurrences with large variation in demand levels. Many forecasting methods perform poorly when demand for an item is lumpy. In this study, a hybrid forecasting approach has been developed based on the characteristics of the lumpy demand patterns of spare parts; it uses a multi-layered perceptron neural network and a traditional recursive method to forecast future demand. In the described approach, the multi-layered perceptron is adapted to forecast occurrences of non-zero demand, and a conventional recursive method is then used to estimate the quantity of non-zero demand. To evaluate the performance of the proposed approach, its forecasts were compared to those obtained with the Syntetos & Boylan approximation and with the multi-layered perceptron, generalized regression, and Elman recurrent neural networks recently employed in this area. The models were applied to forecast future demand for spare parts at Arak Petrochemical Company in Iran, using 30 types of real data sets. The results indicate that the forecasts obtained with the proposed model are superior to those obtained with the other methods.
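
For context, the Syntetos & Boylan approximation used as a baseline above can be sketched as follows. This is a minimal sketch of that baseline only, not of the proposed hybrid method: it smooths non-zero demand sizes and inter-demand intervals exponentially (as in Croston's method) and scales their ratio by (1 - alpha/2). The function name `sba_forecast`, the single smoothing constant `alpha`, and the initialization from the first non-zero demand are assumptions made for illustration.

```python
def sba_forecast(demand, alpha=0.1):
    """Syntetos-Boylan Approximation of mean demand per period.

    demand: sequence of per-period demands (zeros allowed).
    alpha:  smoothing constant in (0, 1]; one shared value is assumed here.
    """
    size = None        # smoothed non-zero demand size
    interval = None    # smoothed number of periods between demands
    periods_since = 1  # periods elapsed since the last non-zero demand
    for d in demand:
        if d > 0:
            if size is None:
                # Assumed initialization: first observed size and interval.
                size, interval = float(d), float(periods_since)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    if size is None:
        return 0.0  # no demand observed at all
    return (1 - alpha / 2) * size / interval
```

The (1 - alpha/2) factor is what distinguishes this estimate from plain Croston, whose ratio of smoothed size to smoothed interval is biased upward.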

Application of a New Hybrid Optimization Algorithm on Cluster Analysis

Clustering techniques have received attention in many areas, including engineering, medicine, biology and data mining. The purpose of clustering is to group together data points that are close to one another. The K-means algorithm is one of the most widely used clustering techniques. However, K-means has two shortcomings: it depends on the initial state and tends to converge to local optima, and global solutions of large problems cannot be found with a reasonable amount of computational effort. Many studies in clustering have sought to overcome the local optima problem. This paper presents an efficient hybrid evolutionary optimization algorithm that combines Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO), called PSO-ACO, for optimally clustering N objects into K clusters. The new PSO-ACO algorithm is tested on several data sets, and its performance is compared with that of ACO, PSO and K-means clustering. The simulation results show that the proposed evolutionary optimization algorithm is robust and suitable for handling data clustering.
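
The dependence of K-means on its initial state can be illustrated with a small sketch of the standard Lloyd iteration. This is plain K-means only, not the PSO-ACO hybrid; the function name `kmeans` and the choice to pass initial centroids explicitly are assumptions made for the example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's K-means from given initial centroids.

    Returns (final_centroids, sum_of_squared_errors).
    """
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)),
                    key=lambda j: math.dist(p, centroids[j]))
            clusters[k].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    [sum(xs) / len(members) for xs in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged (possibly to a local optimum)
        centroids = new_centroids
    sse = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, sse
```

With the four points (0,0), (0,1), (4,0), (4,1), starting from centroids (0,0.5) and (4,0.5) yields a sum of squared errors of 1.0, whereas starting from (2,0) and (2,1) converges immediately to a local optimum with a sum of squared errors of 16.0.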

Optimal Model Order Selection for Transient Error Autoregressive Moving Average (TERA) MRI Reconstruction Method

An alternative to the Discrete Fourier Transform (DFT) for Magnetic Resonance Imaging (MRI) reconstruction is parametric modeling. This method is suitable for problems in which the image can be modeled by explicit known source functions with a few adjustable parameters. Despite the success reported for modeling as an alternative MRI reconstruction technique, two important problems challenge the applicability of this method: estimation of the model order and determination of the model coefficients. In this paper, five of the suggested methods for determining the model order are evaluated: the Final Prediction Error (FPE), the Akaike Information Criterion (AIC), the Residual Variance (RV), the Minimum Description Length (MDL) and the Hannan and Quinn (HNQ) criterion. These criteria were evaluated on MRI data sets using the Transient Error Reconstruction Algorithm (TERA). The result for each criterion is compared to the result obtained with a fixed-order technique, and three measures of similarity were evaluated. The results show that MDL gives the highest measure of similarity to the result of the fixed-order technique.
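
To make the order-selection criteria concrete, the sketch below evaluates commonly quoted textbook forms of FPE, AIC, MDL and the Hannan-Quinn criterion from a residual variance, a sample count and a candidate order, and picks the order that minimizes a chosen criterion. The abstract does not give the exact definitions used in the paper (conventions differ, for example in how parameters are counted), so these formulas, the omission of RV, and the names `order_criteria` and `select_order` are assumptions for illustration.

```python
import math

def order_criteria(residual_variance, n_samples, order):
    """Textbook forms of four order-selection criteria (assumed variants)."""
    n, p, var = n_samples, order, residual_variance
    log_term = n * math.log(var)
    return {
        "FPE": var * (n + p) / (n - p),
        "AIC": log_term + 2 * p,
        "MDL": log_term + p * math.log(n),
        "HNQ": log_term + 2 * p * math.log(math.log(n)),
    }

def select_order(variances, n_samples, criterion="MDL"):
    """Return the candidate order with the smallest criterion value.

    variances: dict mapping candidate order -> residual variance of the
               model fitted at that order.
    """
    return min(
        variances,
        key=lambda p: order_criteria(variances[p], n_samples, p)[criterion],
    )
```

For the made-up variances {1: 1.0, 2: 0.5, 3: 0.49, 4: 0.485} with 100 samples, MDL selects order 2 while AIC selects order 3, reflecting the heavier penalty MDL places on additional parameters.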

Towards Clustering of Web-based Document Structures

Methods for organizing web data into groups in order to analyze web-based hypertext data and facilitate data availability are very important given the number of documents available online. The task of clustering web-based document structures therefore has many applications, e.g., improving information retrieval on the web, better understanding user navigation behavior, improving the servicing of web users' requests, and increasing the accessibility of web information. In this paper we investigate a new approach for clustering web-based hypertexts on the basis of their graph structures. The hypertexts are represented as so-called generalized trees, which are more general than the usual directed rooted trees, e.g., DOM trees. As an important preprocessing step, we measure the structural similarity between the generalized trees on the basis of a similarity measure d. Then, we apply agglomerative clustering to the obtained similarity matrix in order to create clusters of hypertext graph patterns representing navigation structures. In the present paper we run our approach on a data set of hypertext structures and obtain good results in Web Structure Mining. Furthermore, we outline the application of our approach in Web Usage Mining as future work.

Reducing SAGE Data Using Genetic Algorithms

Serial Analysis of Gene Expression (SAGE) is a powerful quantification technique for generating cell or tissue gene expression data. The gene expression profile of a cell or tissue in several different states is difficult for biologists to analyze because of the large number of genes typically involved. However, feature selection in machine learning can successfully reduce this problem. The method reduces the features (genes) in specific SAGE data and retains only the relevant genes. In this study, we used a genetic algorithm to implement feature selection and evaluated the classification accuracy of the selected features with the K-nearest neighbor method. In order to validate the proposed method, we used two SAGE data sets for testing. The results of this study conclusively prove that the number of features of the original SAGE data set can be significantly reduced and higher classification accuracy can be achieved.

Improving Academic Performance Prediction using Voting Technique in Data Mining

In this paper we compare the accuracy of data mining methods for classifying students in order to predict a student's class grade. These predictions are useful for identifying weak students and for helping management take remedial measures at an early stage, so that students graduate with at least a second class upper degree. First, we examine the accuracy of single classifiers on our data set and choose the best one; we then combine it with a weak classifier in an ensemble to produce a simple voting method. Our results show that combining different classifiers outperforms the single classifiers for predicting student performance.
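
A simple voting scheme of the kind mentioned above can be sketched as a majority vote over the labels predicted by several classifiers. The abstract does not specify the exact voting rule or the classifiers used, so the function name `majority_vote`, the unweighted vote, and the tie-breaking rule (the label seen first among the tied ones wins, so classifier order matters) are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions by unweighted majority vote.

    predictions: list with one entry per classifier; each entry is a list
                 of predicted labels, one per sample, all the same length.
    Ties go to the tied label that appears first, i.e. the one predicted
    by the earliest classifier in the list.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined
```

Listing the strongest single classifier first makes it the tie-breaker, which is one plausible way to pair a strong classifier with weaker ones.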

Multidimensional Visualization Tools for Analysis of Expression Data

Expression data analysis is based mostly on the statistical approaches that are indispensable for the study of biological systems. Large amounts of multidimensional data resulting from the high-throughput technologies are not completely served by biostatistical techniques and are usually complemented with visual, knowledge discovery and other computational tools. In many cases, in biological systems we only speculate on the processes that are causing the changes, and it is the visual explorative analysis of data during which a hypothesis is formed. We would like to show the usability of multidimensional visualization tools and promote their use in life sciences. We survey and show some of the multidimensional visualization tools in the process of data exploration, such as parallel coordinates and radviz and we extend them by combining them with the self-organizing map algorithm. We use a time course data set of transitional cell carcinoma of the bladder in our examples. Analysis of data with these tools has the potential to uncover additional relationships and non-trivial structures.

A Renovated Cook's Distance Based On The Buckley-James Estimate In Censored Regression

Various regression-based methods have been developed to handle data sets containing censored observations, such as the Buckley-James method, Miller's method, the Cox method, and the Koul-Susarla-Van Ryzin estimators. Even though comparison studies show that the Buckley-James method performs better than some other methods, it is still rarely used by researchers, mainly because of the limited diagnostic analysis developed for it thus far. Therefore, a diagnostic tool for the Buckley-James method is proposed in this paper. It is called the renovated Cook's distance (RD*_i) and has been developed based on Cook's idea. Depending on the analyst's needs, RD*_i has advantages over (i) the change in the fitted value for a single case, DFIT*_i, because it measures the influence of case i on all n fitted values Ŷ* (not just the fitted value for case i, as DFIT*_i does), and (ii) the change in the estimate of the coefficient when the ith case is deleted, DBETA*_i, because DBETA*_i yields one value for each of the p variables, so it is usually easier to look at a single diagnostic measure such as RD*_i, in which information from the p variables is considered simultaneously. Finally, an example using the Stanford Heart Transplant data is provided to illustrate the proposed diagnostic tool.
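
For orientation, the classical Cook's distance for ordinary least squares, on which the renovated measure is said to be based, can be written as below. This is the standard textbook form for uncensored data, not the renovated distance of the paper, whose exact definition is not given in this abstract. Here \hat{y}_j is the fitted value for case j, \hat{y}_{j(i)} is the fitted value for case j when case i is deleted, p is the number of regression coefficients, and s^2 is the residual mean square.

```latex
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
```

Because the sum runs over all n fitted values, a single number summarizes the influence of case i on the whole fit, which is the property the abstract contrasts with case-wise and coefficient-wise diagnostics.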

Genetic Programming Approach to Hierarchical Production Rule Discovery

Automated discovery of hierarchical structures in large data sets has been an active research area in the recent past. This paper focuses on mining generalized rules with a crisp hierarchical structure using a Genetic Programming (GP) approach to knowledge discovery. The post-processing scheme presented in this work uses flat rules as the initial individuals of GP and discovers the hierarchical structure. Suitable genetic operators are proposed for the suggested encoding. An appropriate fitness function based on the Subsumption Matrix (SM) is suggested. Finally, Hierarchical Production Rules (HPRs) are generated from the discovered hierarchy. Experimental results are presented to demonstrate the performance of the proposed algorithm.

An Intelligent Human-Computer Interaction System for Decision Support

This paper proposes a novel architecture for developing decision support systems. Unlike conventional decision support systems, the proposed architecture endeavors to reveal the decision-making process such that human subjectivity can be incorporated into a computerized system while preserving the capability of the computerized system to process information objectively. A number of techniques used in developing the decision support system are elaborated to make the decision-making process transparent. These include procedures for high dimensional data visualization, pattern classification, prediction, and evolutionary computational search. An artificial data set is first employed to compare the proposed approach with other methods. A simulated handwritten data set and a real data set on liver disease diagnosis are then employed to evaluate the efficacy of the proposed approach. The results are analyzed and discussed. The potential of the proposed architecture as a useful decision support system is demonstrated.

A Comparison of the Sum of Squares in Linear and Partial Linear Regression Models

In this paper, the linear regression model is estimated by the ordinary least squares method, and the partially linear regression model is estimated by the penalized least squares method using a smoothing spline. We then investigate the differences and similarities between the sums of squares of the linear regression model and those of the partial linear (semi-parametric) regression model. We show that the sum of squares in linear regression reduces to the sum of squares in partial linear regression models. Furthermore, we show that the various sums of squares in linear regression are analogous to different deviance statements in partial linear regression. In addition, the coefficient of determination derived for the linear regression model is easily generalized to the partial linear regression model. To this end, two applications are presented: a simulated data set and a real data set are used to support these claims.
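
The linear-regression side of this comparison can be illustrated with a short sketch that fits a one-predictor ordinary least squares line and computes the total, regression and error sums of squares together with the coefficient of determination. It covers only the ordinary linear model, not the penalized spline estimator of the partial linear model; the function name `ols_sums_of_squares` and the restriction to a single predictor are assumptions for brevity.

```python
def ols_sums_of_squares(x, y):
    """Fit y = a + b*x by ordinary least squares and decompose variation.

    Returns a dict with SST (total), SSR (regression), SSE (error) and
    R2 = SSR / SST. With an intercept in the model, SST = SSR + SSE.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    fitted = [intercept + slope * xi for xi in x]
    sst = sum((yi - y_mean) ** 2 for yi in y)
    ssr = sum((fi - y_mean) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return {"SST": sst, "SSR": ssr, "SSE": sse, "R2": ssr / sst}
```

The identity SST = SSR + SSE is what makes R2 interpretable as a share of explained variation in the linear case; the abstract's claim is that analogous quantities exist for the partial linear model.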

Face Detection using Gabor Wavelets and Neural Networks

This paper proposes new hybrid approaches for face recognition. The Gabor wavelet representation of face images is an effective approach for both facial action recognition and face identification. Performing dimensionality reduction and linear discriminant analysis on the down-sampled Gabor wavelet faces can increase the discriminative ability. The nearest feature space is extended to various similarity measures. In our experiments, the proposed Gabor wavelet faces combined with the extended neural net feature space classifier show very good performance, achieving a maximum correct recognition rate of 93% on the ORL data set without any preprocessing step.

Flow Discharge Determination in Straight Compound Channels Using ANNs

Although many researchers have studied flow hydraulics in compound channels, many complicated problems remain in determining their flow rating curves. Many different methods have been presented for these channels, but extending them to all types of compound channels with different geometrical and hydraulic conditions is difficult. In this study, flow discharge in compound channels was estimated using artificial neural networks (ANNs) with the aid of nearly 400 laboratory and field data sets of geometry and flow rating curves from 30 different straight compound sections. Thirteen dimensionless input variables, including relative depth, relative roughness, relative width, aspect ratio, bed slope, main channel side slopes, flood plain side slopes and berm inclination, and one output variable (flow discharge) were used in the ANNs. A comparison of the ANN model and the traditional divided channel method (DCM) shows the high accuracy of the ANN model results. The sensitivity analysis showed that relative depth, with a 47.6 percent contribution, is the most influential input parameter for flow discharge prediction. Relative width and relative roughness have 19.3 and 12.2 percent importance, respectively. On the other hand, the shape parameter, main channel side slopes and flood plain side slopes, with contributions of 2.1, 3.8 and 3.8 percent, have the least importance.

Computational Model for Predicting Effective siRNA Sequences Using Whole Stacking Energy (% G) for Gene Silencing

Small interfering RNA (siRNA) alters the regulatory role of mRNA during gene expression by translational inhibition. Recent studies show that upregulation of mRNA causes serious diseases such as cancer, so designing effective siRNA with good knockdown effects plays an important role in gene silencing. Various siRNA design tools have been developed previously. In this work, we analyze the existing well-scoring second-generation siRNA prediction tools and seek to optimize the efficiency of siRNA prediction by designing a computational model using an Artificial Neural Network and whole stacking energy (%G), which may help in gene silencing and drug design in cancer therapy. Our model is trained and tested against a large data set of siRNA sequences. Our results are validated by computing the correlation coefficient of experimental versus observed inhibition efficacy of siRNA. We achieved a correlation coefficient of 0.727 in our previous computational model, and we could improve the correlation coefficient to 0.753 when the threshold of whole stacking energy is greater than or equal to -32.5 kcal/mol.
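
The validation step described above relies on a correlation coefficient between two sets of efficacy values. The abstract does not say which coefficient is used; the sketch below assumes the Pearson product-moment correlation, and the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(sxx * syy)
```

A value near 1 means the two sets of efficacies rise and fall together; the reported values of 0.727 and 0.753 would be outputs of a computation of this kind on the authors' data.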

A Semantic Recommendation Procedure for Electronic Product Catalog

To overcome the product overload faced by Internet shoppers, we introduce a semantic recommendation procedure that is more efficient when applied to Internet shopping malls. The suggested procedure recommends semantic products to customers and is based on Web usage mining, product classification, association rule mining, and purchase frequency. We applied the procedure to the data set of MovieLens Company for performance evaluation, and some experimental results are provided. The experimental results show superior performance in terms of coverage and precision.

Applications of Support Vector Machines on Smart Phone Systems for Emotional Speech Recognition

An emotional speech recognition system for the applications on smart phones was proposed in this study to combine with 3G mobile communications and social networks to provide users and their groups with more interaction and care. This study developed a mechanism using the support vector machines (SVM) to recognize the emotions of speech such as happiness, anger, sadness and normal. The mechanism uses a hierarchical classifier to adjust the weights of acoustic features and divides various parameters into the categories of energy and frequency for training. In this study, 28 commonly used acoustic features including pitch and volume were proposed for training. In addition, a time-frequency parameter obtained by continuous wavelet transforms was also used to identify the accent and intonation in a sentence during the recognition process. The Berlin Database of Emotional Speech was used by dividing the speech into male and female data sets for training. According to the experimental results, the accuracies of male and female test sets were increased by 4.6% and 5.2% respectively after using the time-frequency parameter for classifying happy and angry emotions. For the classification of all emotions, the average accuracy, including male and female data, was 63.5% for the test set and 90.9% for the whole data set.

On the Comparison of Several Goodness of Fit tests under Simple Random Sampling and Ranked Set Sampling

Many studies have compared the efficiency of goodness of fit procedures for identifying whether or not a particular distribution adequately explains a data set. In this paper, a study is conducted to investigate the power of several goodness of fit tests: Kolmogorov-Smirnov (KS), Anderson-Darling (AD), Cramer-von Mises (CV) and a proposed modification of the Kolmogorov-Smirnov goodness of fit test that incorporates a variance stabilizing transformation (FKS). The performance of these tests is studied under simple random sampling (SRS) and ranked set sampling (RSS). This study shows that, in general, the Anderson-Darling (AD) test performs better than the other goodness of fit tests. However, there are some cases where the proposed test performs as well as the AD test.
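
As an illustration of the simplest of the tests compared above, the sketch below computes the one-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the empirical distribution function of a sample and a hypothesized continuous CDF. It covers only the standard KS statistic under simple random sampling, not the proposed FKS modification or ranked set sampling; the names `ks_statistic` and `normal_cdf` are assumptions.

```python
import math

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D_n.

    sample: sequence of observations.
    cdf:    callable giving the hypothesized continuous CDF F(x).
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
```

For example, `ks_statistic(data, normal_cdf)` measures how far `data` departs from a standard normal; a power study such as the one described would repeat this over many simulated samples and compare rejection rates across tests.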