Feature Reduction of Nearest Neighbor Classifiers using Genetic Algorithm

The design of a pattern classifier includes an attempt to select, among a set of possible features, a minimum subset of weakly correlated features that better discriminate the pattern classes. This is usually a difficult task in practice, normally requiring the application of heuristic knowledge about the specific problem domain. The selection and quality of the features representing each pattern have a considerable bearing on the success of subsequent pattern classification. Feature extraction is the process of deriving new features from the original features in order to reduce the cost of feature measurement, increase classifier efficiency, and allow higher classification accuracy. Many current feature extraction techniques involve linear transformations of the original pattern vectors to new vectors of lower dimensionality. While this is useful for data visualization and increasing classification efficiency, it does not necessarily reduce the number of features that must be measured since each new feature may be a linear combination of all of the features in the original pattern vector. In this paper a new approach is presented to feature extraction in which feature selection, feature extraction, and classifier training are performed simultaneously using a genetic algorithm. In this approach each feature value is first normalized by a linear equation, then scaled by the associated weight prior to training, testing, and classification. A knn classifier is used to evaluate each set of feature weights. The genetic algorithm optimizes a vector of feature weights, which are used to scale the individual features in the original pattern vectors in either a linear or a nonlinear fashion. By this approach, the number of features used in classifying can be finely reduced.

Visualization and Indexing of Spectral Databases

On-line (near infrared) spectroscopy is widely used to support the operation of complex process systems. Information extracted from spectral database can be used to estimate unmeasured product properties and monitor the operation of the process. These techniques are based on looking for similar spectra by nearest neighborhood algorithms and distance based searching methods. Search for nearest neighbors in the spectral space is an NP-hard problem, the computational complexity increases by the number of points in the discrete spectrum and the number of samples in the database. To reduce the calculation time some kind of indexing could be used. The main idea presented in this paper is to combine indexing and visualization techniques to reduce the computational requirement of estimation algorithms by providing a two dimensional indexing that can also be used to visualize the structure of the spectral database. This 2D visualization of spectral database does not only support application of distance and similarity based techniques but enables the utilization of advanced clustering and prediction algorithms based on the Delaunay tessellation of the mapped spectral space. This means the prediction has not to use the high dimension space but can be based on the mapped space too. The results illustrate that the proposed method is able to segment (cluster) spectral databases and detect outliers that are not suitable for instance based learning algorithms.

Weighted k-Nearest-Neighbor Techniques for High Throughput Screening Data

The k-nearest neighbors (knn) is a simple but effective method of classification. In this paper we present an extended version of this technique for chemical compounds used in High Throughput Screening, where the distances of the nearest neighbors can be taken into account. Our algorithm uses kernel weight functions as guidance for the process of defining activity in screening data. Proposed kernel weight function aims to combine properties of graphical structure and molecule descriptors of screening compounds. We apply the modified knn method on several experimental data from biological screens. The experimental results confirm the effectiveness of the proposed method.

On Speeding Up Support Vector Machines: Proximity Graphs Versus Random Sampling for Pre-Selection Condensation

Support vector machines (SVMs) are considered to be the best machine learning algorithms for minimizing the predictive probability of misclassification. However, their drawback is that for large data sets the computation of the optimal decision boundary is a time consuming function of the size of the training set. Hence several methods have been proposed to speed up the SVM algorithm. Here three methods used to speed up the computation of the SVM classifiers are compared experimentally using a musical genre classification problem. The simplest method pre-selects a random sample of the data before the application of the SVM algorithm. Two additional methods use proximity graphs to pre-select data that are near the decision boundary. One uses k-Nearest Neighbor graphs and the other Relative Neighborhood Graphs to accomplish the task.

Implementing a Visual Servoing System for Robot Controlling

Nowadays, with the emerging of the new applications like robot control in image processing, artificial vision for visual servoing is a rapidly growing discipline and Human-machine interaction plays a significant role for controlling the robot. This paper presents a new algorithm based on spatio-temporal volumes for visual servoing aims to control robots. In this algorithm, after applying necessary pre-processing on video frames, a spatio-temporal volume is constructed for each gesture and feature vector is extracted. These volumes are then analyzed for matching in two consecutive stages. For hand gesture recognition and classification we tested different classifiers including k-Nearest neighbor, learning vector quantization and back propagation neural networks. We tested the proposed algorithm with the collected data set and results showed the correct gesture recognition rate of 99.58 percent. We also tested the algorithm with noisy images and algorithm showed the correct recognition rate of 97.92 percent in noisy images.

Codebook Generation for Vector Quantization on Orthogonal Polynomials based Transform Coding

In this paper, a new algorithm for generating codebook is proposed for vector quantization (VQ) in image coding. The significant features of the training image vectors are extracted by using the proposed Orthogonal Polynomials based transformation. We propose to generate the codebook by partitioning these feature vectors into a binary tree. Each feature vector at a non-terminal node of the binary tree is directed to one of the two descendants by comparing a single feature associated with that node to a threshold. The binary tree codebook is used for encoding and decoding the feature vectors. In the decoding process the feature vectors are subjected to inverse transformation with the help of basis functions of the proposed Orthogonal Polynomials based transformation to get back the approximated input image training vectors. The results of the proposed coding are compared with the VQ using Discrete Cosine Transform (DCT) and Pairwise Nearest Neighbor (PNN) algorithm. The new algorithm results in a considerable reduction in computation time and provides better reconstructed picture quality.

Identification of Cardiac Arrhythmias using Natural Resonance Complex Frequencies

An electrocardiogram (ECG) feature extraction system based on the calculation of the complex resonance frequency employing Prony-s method is developed. Prony-s method is applied on five different classes of ECG signals- arrhythmia as a finite sum of exponentials depending on the signal-s poles and the resonant complex frequencies. Those poles and resonance frequencies of the ECG signals- arrhythmia are evaluated for a large number of each arrhythmia. The ECG signals of lead II (ML II) were taken from MIT-BIH database for five different types. These are the ventricular couplet (VC), ventricular tachycardia (VT), ventricular bigeminy (VB), and ventricular fibrillation (VF) and the normal (NR). This novel method can be extended to any number of arrhythmias. Different classification techniques were tried using neural networks (NN), K nearest neighbor (KNN), linear discriminant analysis (LDA) and multi-class support vector machine (MC-SVM).

Ranking Genes from DNA Microarray Data of Cervical Cancer by a local Tree Comparison

The major objective of this paper is to introduce a new method to select genes from DNA microarray data. As criterion to select genes we suggest to measure the local changes in the correlation graph of each gene and to select those genes whose local changes are largest. More precisely, we calculate the correlation networks from DNA microarray data of cervical cancer whereas each network represents a tissue of a certain tumor stage and each node in the network represents a gene. From these networks we extract one tree for each gene by a local decomposition of the correlation network. The interpretation of a tree is that it represents the n-nearest neighbor genes on the n-th level of a tree, measured by the Dijkstra distance, and, hence, gives the local embedding of a gene within the correlation network. For the obtained trees we measure the pairwise similarity between trees rooted by the same gene from normal to cancerous tissues. This evaluates the modification of the tree topology due to tumor progression. Finally, we rank the obtained similarity values from all tissue comparisons and select the top ranked genes. For these genes the local neighborhood in the correlation networks changes most between normal and cancerous tissues. As a result we find that the top ranked genes are candidates suspected to be involved in tumor growth. This indicates that our method captures essential information from the underlying DNA microarray data of cervical cancer.

A Comparison between Heuristic and Meta-Heuristic Methods for Solving the Multiple Traveling Salesman Problem

The multiple traveling salesman problem (mTSP) can be used to model many practical problems. The mTSP is more complicated than the traveling salesman problem (TSP) because it requires determining which cities to assign to each salesman, as well as the optimal ordering of the cities within each salesman's tour. Previous studies proposed that Genetic Algorithm (GA), Integer Programming (IP) and several neural network (NN) approaches could be used to solve mTSP. This paper compared the results for mTSP, solved with Genetic Algorithm (GA) and Nearest Neighbor Algorithm (NNA). The number of cities is clustered into a few groups using k-means clustering technique. The number of groups depends on the number of salesman. Then, each group is solved with NNA and GA as an independent TSP. It is found that k-means clustering and NNA are superior to GA in terms of performance (evaluated by fitness function) and computing time.

Analysis of Feature Space for a 2d/3d Vision based Emotion Recognition Method

In modern human computer interaction systems (HCI), emotion recognition is becoming an imperative characteristic. The quest for effective and reliable emotion recognition in HCI has resulted in a need for better face detection, feature extraction and classification. In this paper we present results of feature space analysis after briefly explaining our fully automatic vision based emotion recognition method. We demonstrate the compactness of the feature space and show how the 2d/3d based method achieves superior features for the purpose of emotion classification. Also it is exposed that through feature normalization a widely person independent feature space is created. As a consequence, the classifier architecture has only a minor influence on the classification result. This is particularly elucidated with the help of confusion matrices. For this purpose advanced classification algorithms, such as Support Vector Machines and Artificial Neural Networks are employed, as well as the simple k- Nearest Neighbor classifier.

Multilevel Classifiers in Recognition of Handwritten Kannada Numerals

The recognition of handwritten numeral is an important area of research for its applications in post office, banks and other organizations. This paper presents automatic recognition of handwritten Kannada numerals based on structural features. Five different types of features, namely, profile based 10-segment string, water reservoir; vertical and horizontal strokes, end points and average boundary length from the minimal bounding box are used in the recognition of numeral. The effect of each feature and their combination in the numeral classification is analyzed using nearest neighbor classifiers. It is common to combine multiple categories of features into a single feature vector for the classification. Instead, separate classifiers can be used to classify based on each visual feature individually and the final classification can be obtained based on the combination of separate base classification results. One popular approach is to combine the classifier results into a feature vector and leaving the decision to next level classifier. This method is extended to extract a better information, possibility distribution, from the base classifiers in resolving the conflicts among the classification results. Here, we use fuzzy k Nearest Neighbor (fuzzy k-NN) as base classifier for individual feature sets, the results of which together forms the feature vector for the final k Nearest Neighbor (k-NN) classifier. Testing is done, using different features, individually and in combination, on a database containing 1600 samples of different numerals and the results are compared with the results of different existing methods.

A Learning-Community Recommendation Approach for Web-Based Cooperative Learning

Cooperative learning has been defined as learners working together as a team to solve a problem to complete a task or to accomplish a common goal, which emphasizes the importance of interactions among members to promote the whole learning performance. With the popularity of society networks, cooperative learning is no longer limited to traditional classroom teaching activities. Since society networks facilitate to organize online learners, to establish common shared visions, and to advance learning interaction, the online community and online learning community have triggered the establishment of web-based societies. Numerous research literatures have indicated that the collaborative learning community is a critical issue to enhance learning performance. Hence, this paper proposes a learning community recommendation approach to facilitate that a learner joins the appropriate learning communities, which is based on k-nearest neighbor (kNN) classification. To demonstrate the viability of the proposed approach, the proposed approach is implemented for 117 students to recommend learning communities. The experimental results indicate that the proposed approach can effectively recommend appropriate learning communities for learners.

A Systems Approach to Gene Ranking from DNA Microarray Data of Cervical Cancer

In this paper we present a method for gene ranking from DNA microarray data. More precisely, we calculate the correlation networks, which are unweighted and undirected graphs, from microarray data of cervical cancer whereas each network represents a tissue of a certain tumor stage and each node in the network represents a gene. From these networks we extract one tree for each gene by a local decomposition of the correlation network. The interpretation of a tree is that it represents the n-nearest neighbor genes on the n-th level of a tree, measured by the Dijkstra distance, and, hence, gives the local embedding of a gene within the correlation network. For the obtained trees we measure the pairwise similarity between trees rooted by the same gene from normal to cancerous tissues. This evaluates the modification of the tree topology due to progression of the tumor. Finally, we rank the obtained similarity values from all tissue comparisons and select the top ranked genes. For these genes the local neighborhood in the correlation networks changes most between normal and cancerous tissues. As a result we find that the top ranked genes are candidates suspected to be involved in tumor growth and, hence, indicates that our method captures essential information from the underlying DNA microarray data of cervical cancer.

Evaluation of the Impact of Dataset Characteristics for Classification Problems in Biological Applications

Availability of high dimensional biological datasets such as from gene expression, proteomic, and metabolic experiments can be leveraged for the diagnosis and prognosis of diseases. Many classification methods in this area have been studied to predict disease states and separate between predefined classes such as patients with a special disease versus healthy controls. However, most of the existing research only focuses on a specific dataset. There is a lack of generic comparison between classifiers, which might provide a guideline for biologists or bioinformaticians to select the proper algorithm for new datasets. In this study, we compare the performance of popular classifiers, which are Support Vector Machine (SVM), Logistic Regression, k-Nearest Neighbor (k-NN), Naive Bayes, Decision Tree, and Random Forest based on mock datasets. We mimic common biological scenarios simulating various proportions of real discriminating biomarkers and different effect sizes thereof. The result shows that SVM performs quite stable and reaches a higher AUC compared to other methods. This may be explained due to the ability of SVM to minimize the probability of error. Moreover, Decision Tree with its good applicability for diagnosis and prognosis shows good performance in our experimental setup. Logistic Regression and Random Forest, however, strongly depend on the ratio of discriminators and perform better when having a higher number of discriminators.

Spatial Mapping of Dengue Incidence: A Case Study in Hulu Langat District, Selangor, Malaysia

Dengue is a mosquito-borne infection that has peaked to an alarming rate in recent decades. It can be found in tropical and sub-tropical climate. In Malaysia, dengue has been declared as one of the national health threat to the public. This study aimed to map the spatial distributions of dengue cases in the district of Hulu Langat, Selangor via a combination of Geographic Information System (GIS) and spatial statistic tools. Data related to dengue was gathered from the various government health agencies. The location of dengue cases was geocoded using a handheld GPS Juno SB Trimble. A total of 197 dengue cases occurring in 2003 were used in this study. Those data then was aggregated into sub-district level and then converted into GIS format. The study also used population or demographic data as well as the boundary of Hulu Langat. To assess the spatial distribution of dengue cases three spatial statistics method (Moran-s I, average nearest neighborhood (ANN) and kernel density estimation) were applied together with spatial analysis in the GIS environment. Those three indices were used to analyze the spatial distribution and average distance of dengue incidence and to locate the hot spot of dengue cases. The results indicated that the dengue cases was clustered (p < 0.01) when analyze using Moran-s I with z scores 5.03. The results from ANN analysis showed that the average nearest neighbor ratio is less than 1 which is 0.518755 (p < 0.0001). From this result, we can expect the dengue cases pattern in Hulu Langat district is exhibiting a cluster pattern. The z-score for dengue incidence within the district is -13.0525 (p < 0.0001). It was also found that the significant spatial autocorrelation of dengue incidences occurs at an average distance of 380.81 meters (p < 0.0001). Several locations especially residential area also had been identified as the hot spots of dengue cases in the district.