Abstract: In this paper, an attribute weighting method called fuzzy C-means clustering based attribute weighting (FCMAW) for classification of Diabetes disease dataset has been used. The aims of this study are to reduce the variance within attributes of diabetes dataset and to improve the classification accuracy of classifier algorithm transforming from non-linear separable datasets to linearly separable datasets. Pima Indians Diabetes dataset has two classes including normal subjects (500 instances) and diabetes subjects (268 instances). Fuzzy C-means clustering is an improved version of K-means clustering method and is one of most used clustering methods in data mining and machine learning applications. In this study, as the first stage, fuzzy C-means clustering process has been used for finding the centers of attributes in Pima Indians diabetes dataset and then weighted the dataset according to the ratios of the means of attributes to centers of theirs. Secondly, after weighting process, the classifier algorithms including support vector machine (SVM) and k-NN (k- nearest neighbor) classifiers have been used for classifying weighted Pima Indians diabetes dataset. Experimental results show that the proposed attribute weighting method (FCMAW) has obtained very promising results in the classification of Pima Indians diabetes dataset.
Abstract: In this paper, we present a comparative study of three
methods of 2D face recognition system such as: Iso-Geodesic Curves
(IGC), Geodesic Distance (GD) and Geodesic-Intensity Histogram
(GIH). These approaches are based on computing of geodesic
distance between points of facial surface and between facial curves.
In this study we represented the image at gray level as a 2D surface in
a 3D space, with the third coordinate proportional to the intensity
values of pixels. In the classifying step, we use: Neural Networks
(NN), K-Nearest Neighbor (KNN) and Support Vector Machines
(SVM). The images used in our experiments are from two wellknown
databases of face images ORL and YaleB. ORL data base was
used to evaluate the performance of methods under conditions where
the pose and sample size are varied, and the database YaleB was used
to examine the performance of the systems when the facial
expressions and lighting are varied.
Abstract: This research was conducted in an automotive company in Indonesia to overcome the problem of high logistics cost. The problem causes high of additional truck delivery. From the breakdown of the problem, chosen one route, which has the highest gap value, namely for RE-04. Research methodology will be started from calculating the ideal condition, making simulation, calculating the ideal logistic cost, and proposing an improvement. From the calculation of the ideal condition, box arrangement was done on the truck has efficiency with three trucks delivery per day. Route simulation making uses Tecnomatix Plant Simulation software as a visualization for the company about how the system is occurred on route RE-04 in ideal condition. The last step is proposing improvements on the area of route RE-04. The route arrangement is done by Saving Method and sequence of each supplier with the Nearest Neighbor. The results of the proposed improvements are three new route groups, where was expected to decrease logistics cost and increase the average of the truck efficiency per day.
Abstract: The problems arising from unbalanced data sets
generally appear in real world applications. Due to unequal class
distribution, many researchers have found that the performance of
existing classifiers tends to be biased towards the majority class. The
k-nearest neighbors’ nonparametric discriminant analysis is a method
that was proposed for classifying unbalanced classes with good
performance. In this study, the methods of discriminant analysis are
of interest in investigating misclassification error rates for classimbalanced
data of three diabetes risk groups. The purpose of this
study was to compare the classification performance between
parametric discriminant analysis and nonparametric discriminant
analysis in a three-class classification of class-imbalanced data of
diabetes risk groups. Data from a project maintaining healthy
conditions for 599 employees of a government hospital in Bangkok
were obtained for the classification problem. The employees were
divided into three diabetes risk groups: non-risk (90%), risk (5%),
and diabetic (5%). The original data including the variables of
diabetes risk group, age, gender, blood glucose, and BMI were
analyzed and bootstrapped for 50 and 100 samples, 599 observations
per sample, for additional estimation of the misclassification error
rate. Each data set was explored for the departure of multivariate
normality and the equality of covariance matrices of the three risk
groups. Both the original data and the bootstrap samples showed nonnormality
and unequal covariance matrices. The parametric linear
discriminant function, quadratic discriminant function, and the
nonparametric k-nearest neighbors’ discriminant function were
performed over 50 and 100 bootstrap samples and applied to the
original data. Searching the optimal classification rule, the choices of
prior probabilities were set up for both equal proportions (0.33: 0.33:
0.33) and unequal proportions of (0.90:0.05:0.05), (0.80: 0.10: 0.10)
and (0.70, 0.15, 0.15). The results from 50 and 100 bootstrap samples
indicated that the k-nearest neighbors approach when k=3 or k=4 and
the defined prior probabilities of non-risk: risk: diabetic as 0.90:
0.05:0.05 or 0.80:0.10:0.10 gave the smallest error rate of
misclassification. The k-nearest neighbors approach would be
suggested for classifying a three-class-imbalanced data of diabetes
risk groups.
Abstract: This paper proposes a rotational invariant texture
feature based on the roughness property of the image for psoriasis
image analysis. In this work, we have applied this feature for image
classification and segmentation. The fuzzy concept is employed to
overcome the imprecision of roughness. Since the psoriasis lesion is
modeled by a rough surface, the feature is extended for calculating
the Psoriasis Area Severity Index value. For classification and
segmentation, the Nearest Neighbor algorithm is applied. We have
obtained promising results for identifying affected lesions by using
the roughness index and severity level estimation.
Abstract: A Distributed Denial of Service (DDoS) attack is a
major threat to cyber security. It originates from the network layer or
the application layer of compromised/attacker systems which are
connected to the network. The impact of this attack ranges from the
simple inconvenience to use a particular service to causing major
failures at the targeted server. When there is heavy traffic flow to a
target server, it is necessary to classify the legitimate access and
attacks. In this paper, a novel method is proposed to detect DDoS
attacks from the traces of traffic flow. An access matrix is created
from the traces. As the access matrix is multi dimensional, Principle
Component Analysis (PCA) is used to reduce the attributes used for
detection. Two classifiers Naive Bayes and K-Nearest neighborhood
are used to classify the traffic as normal or abnormal. The
performance of the classifier with PCA selected attributes and actual
attributes of access matrix is compared by the detection rate and
False Positive Rate (FPR).
Abstract: Red blood cells (RBC) are the most common types of
blood cells and are the most intensively studied in cell biology. The
lack of RBCs is a condition in which the amount of hemoglobin level
is lower than normal and is referred to as “anemia”. Abnormalities in
RBCs will affect the exchange of oxygen. This paper presents a
comparative study for various techniques for classifying the RBCs as
normal or abnormal (anemic) using WEKA. WEKA is an open
source consists of different machine learning algorithms for data
mining applications. The algorithms tested are Radial Basis Function
neural network, Support vector machine, and K-Nearest Neighbors
algorithm. Two sets of combined features were utilized for
classification of blood cells images. The first set, exclusively consist
of geometrical features, was used to identify whether the tested blood
cell has a spherical shape or non-spherical cells. While the second
set, consist mainly of textural features was used to recognize the
types of the spherical cells. We have provided an evaluation based on
applying these classification methods to our RBCs image dataset
which were obtained from Serdang Hospital - Malaysia, and
measuring the accuracy of test results. The best achieved
classification rates are 97%, 98%, and 79% for Support vector
machines, Radial Basis Function neural network, and K-Nearest
Neighbors algorithm respectively.
Abstract: Image spam is a kind of email spam where the spam
text is embedded with an image. It is a new spamming technique
being used by spammers to send their messages to bulk of internet
users. Spam email has become a big problem in the lives of internet
users, causing time consumption and economic losses. The main
objective of this paper is to detect the image spam by using histogram
properties of an image. Though there are many techniques to
automatically detect and avoid this problem, spammers employing
new tricks to bypass those techniques, as a result those techniques are
inefficient to detect the spam mails. In this paper we have proposed a
new method to detect the image spam. Here the image features are
extracted by using RGB histogram, HSV histogram and combination
of both RGB and HSV histogram. Based on the optimized image
feature set classification is done by using k- Nearest Neighbor(k-NN)
algorithm. Experimental result shows that our method has achieved
better accuracy. From the result it is known that combination of RGB
and HSV histogram with k-NN algorithm gives the best accuracy in
spam detection.
Abstract: In this paper, the comparison between k-Nearest Neighbor (kNN) algorithms for classifying the 3D EEG model in brain balancing is presented. The EEG signal recording was conducted on 51 healthy subjects. Development of 3D EEG models involves pre-processing of raw EEG signals and construction of spectrogram images. Then, maximum PSD values were extracted as features from the model. There are three indexes for balanced brain; index 3, index 4 and index 5. There are significant different of the EEG signals due to the brain balancing index (BBI). Alpha-α (8–13 Hz) and beta-β (13–30 Hz) were used as input signals for the classification model. The k-NN classification result is 88.46% accuracy. These results proved that k-NN can be used in order to predict the brain balancing application.
Abstract: Cloud outsource storage is one of important services in cloud computing. Cloud users upload data to cloud servers to reduce the cost of managing data and maintaining hardware and software. To ensure data confidentiality, users can encrypt their files before uploading them to a cloud system. However, retrieving the target file from the encrypted files exactly is difficult for cloud server. This study proposes a protocol for performing multikeyword searches for encrypted cloud data by applying k-nearest neighbor technology. The protocol ranks the relevance scores of encrypted files and keywords, and prevents cloud servers from learning search keywords submitted by a cloud user. To reduce the costs of file transfer communication, the cloud server returns encrypted files in order of relevance. Moreover, when a cloud user inputs an incorrect keyword and the number of wrong alphabet does not exceed a given threshold; the user still can retrieve the target files from cloud server. In addition, the proposed scheme satisfies security requirements for outsourced data storage.
Abstract: In general, algorithms to find continuous k-nearest neighbors have been researched on the location based services, monitoring periodically the moving objects such as vehicles and mobile phone. Those researches assume the environment that the number of query points is much less than that of moving objects and the query points are not moved but fixed. In gaming environments, this problem is when computing the next movement considering the neighbors such as flocking, crowd and robot simulations. In this case, every moving object becomes a query point so that the number of query point is same to that of moving objects and the query points are also moving. In this paper, we analyze the performance of the existing algorithms focused on location based services how they operate under gaming environments.
Abstract: Diabetic retinopathy is characterized by the development of retinal microaneurysms. The damage can be prevented if disease is treated in its early stages. In this paper, we are comparing Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers for automatic microaneurysm detection in images acquired through non-dilated pupils. The Nearest Neighbor classifier is used as a baseline for comparison. Detected microaneurysms are validated with expert ophthalmologists’ hand-drawn ground-truths. The sensitivity, specificity, precision and accuracy of each method are also compared.
Abstract: In pattern clustering, nearest neighborhood point computation is a challenging issue for many applications in the area of research such as Remote Sensing, Computer Vision, Pattern Recognition and Statistical Imaging. Nearest neighborhood
computation is an essential computation for providing sufficient classification among the volume of pixels (voxels) in order to localize the active-region-of-interests (AROI). Furthermore, it is needed to compute spatial metric relationships of diverse area of imaging based on the applications of pattern recognition. In this paper, we propose a new methodology for finding the nearest neighbor point, depending on making a virtually grid of a hexagon cells, then locate every point beneath them. An algorithm is suggested for minimizing the computation and increasing the turnaround time of the process. The nearest neighbor query points Φ are fetched by seeking fashion of hexagon holistic. Seeking will be repeated until an AROI Φ is to be expected. If any point Υ is located then searching starts in the nearest hexagons in a circular way. The First hexagon is considered be level 0 (L0) and the surrounded hexagons is level 1 (L1). If Υ is located in L1, then search starts in the next level (L2) to ensure that Υ is the nearest neighbor for Φ. Based on the result and experimental results, we found that the proposed method has an advantage over the traditional methods in terms of minimizing the time complexity required for searching the neighbors, in turn, efficiency of classification will be improved sufficiently.
Abstract: BCI (Brain Computer Interface) is a communication machine that translates brain massages to computer commands. These machines with the help of computer programs can recognize the tasks that are imagined. Feature extraction is an important stage of the process in EEG classification that can effect in accuracy and the computation time of processing the signals. In this study we process the signal in three steps of active segment selection, fractal feature extraction, and classification. One of the great challenges in BCI applications is to improve classification accuracy and computation time together. In this paper, we have used student’s 2D sample t-statistics on continuous wavelet transforms for active segment selection to reduce the computation time. In the next level, the features are extracted from some famous fractal dimension estimation of the signal. These fractal features are Katz and Higuchi. In the classification stage we used ANFIS (Adaptive Neuro-Fuzzy Inference System) classifier, FKNN (Fuzzy K-Nearest Neighbors), LDA (Linear Discriminate Analysis), and SVM (Support Vector Machines). We resulted that active segment selection method would reduce the computation time and Fractal dimension features with ANFIS analysis on selected active segments is the best among investigated methods in EEG classification.
Abstract: Missing values in data are common in real world applications. Since the performance of many data mining algorithms depend critically on it being given a good metric over the input space, we decided in this paper to define a distance function for unlabeled
datasets with missing values. We use the Bhattacharyya distance, which measures the similarity of two probability distributions, to define our new distance function. According to this distance, the distance between two points without missing attributes values is simply the Mahalanobis distance. When on the other hand there is a missing value of one of the coordinates, the distance is computed according to the distribution of the missing coordinate. Our distance is general and can be used as part of any algorithm that computes the distance between data points. Because its performance depends strongly on the chosen distance measure, we opted for the k nearest neighbor classifier to evaluate its ability to accurately reflect object similarity. We experimented on standard numerical datasets from the UCI repository from different fields. On these datasets we simulated missing values and compared the performance of the kNN classifier using our distance to other three basic methods. Our experiments show that kNN using our distance function outperforms the kNN using other methods. Moreover, the runtime performance of our method is only slightly higher than the other methods.
Abstract: Red blood cells (RBCs) are among the most
commonly and intensively studied type of blood cells in cell biology.
Anemia is a lack of RBCs is characterized by its level compared to
the normal hemoglobin level. In this study, a system based image
processing methodology was developed to localize and extract RBCs
from microscopic images. Also, the machine learning approach is
adopted to classify the localized anemic RBCs images. Several
textural and geometrical features are calculated for each extracted
RBCs. The training set of features was analyzed using principal
component analysis (PCA). With the proposed method, RBCs were
isolated in 4.3secondsfrom an image containing 18 to 27 cells. The
reasons behind using PCA are its low computation complexity and
suitability to find the most discriminating features which can lead to
accurate classification decisions. Our classifier algorithm yielded
accuracy rates of 100%, 99.99%, and 96.50% for K-nearest neighbor
(K-NN) algorithm, support vector machine (SVM), and neural
network RBFNN, respectively. Classification was evaluated in highly
sensitivity, specificity, and kappa statistical parameters. In
conclusion, the classification results were obtained within short time
period, and the results became better when PCA was used.
Abstract: The travelling salesman problem (TSP) is a combinatorial optimization problem in which the goal is to find the shortest path between different cities that the salesman takes. In other words, the problem deals with finding a route covering all cities so that total distance and execution time is minimized. This paper adopts the nearest neighbor and minimum spanning tree algorithm to solve the well-known travelling salesman problem. The algorithms were implemented using java programming language. The approach is tested on three graphs that making a TSP tour instance of 5-city, 10 –city, and 229–city. The computation results validate the performance of the proposed algorithm.
Abstract: The exponential increase in the volume of medical image database has imposed new challenges to clinical routine in maintaining patient history, diagnosis, treatment and monitoring. With the advent of data mining and machine learning techniques it is possible to automate and/or assist physicians in clinical diagnosis. In this research a medical image classification framework using data mining techniques is proposed. It involves feature extraction, feature selection, feature discretization and classification. In the classification phase, the performance of the traditional kNN k nearest neighbor classifier is improved using a feature weighting scheme and a distance weighted voting instead of simple majority voting. Feature weights are calculated using the interestingness measures used in association rule mining. Experiments on the retinal fundus images show that the proposed framework improves the classification accuracy of traditional kNN from 78.57 % to 92.85 %.
Abstract: In Content-Based Image Retrieval systems it is
important to use an efficient indexing technique in order to perform
and accelerate the search in huge databases. The used indexing
technique should also support the high dimensions of image features.
In this paper we present the hierarchical index NOHIS-tree (Non
Overlapping Hierarchical Index Structure) when we scale up to very
large databases. We also present a study of the influence of clustering
on search time. The performance test results show that NOHIS-tree
performs better than SR-tree. Tests also show that NOHIS-tree keeps
its performances in high dimensional spaces. We include the
performance test that try to determine the number of clusters in
NOHIS-tree to have the best search time.
Abstract: Learning using labeled and unlabelled data has
received considerable amount of attention in the machine learning
community due its potential in reducing the need for expensive
labeled data. In this work we present a new method for combining
labeled and unlabeled data based on classifier ensembles. The model
we propose assumes each classifier in the ensemble observes the
input using different set of features. Classifiers are initially trained
using some labeled samples. The trained classifiers learn further
through labeling the unknown patterns using a teaching signals that is
generated using the decision of the classifier ensemble, i.e. the
classifiers self-supervise each other. Experiments on a set of object
images are presented. Our experiments investigate different classifier
models, different fusing techniques, different training sizes and
different input features. Experimental results reveal that the proposed
self-supervised ensemble learning approach reduces classification
error over the single classifier and the traditional ensemble classifier
approachs.