Abstract: Data mining and classification of objects is the process of data analysis, using various machine learning techniques, which is used today in various fields of research. This paper presents a concept of hybrid classification model improved with the expert knowledge. The hybrid model in its algorithm has integrated several machine learning techniques (Information Gain, K-means, and Case-Based Reasoning) and the expert’s knowledge into one. The knowledge of experts is used to determine the importance of features. The paper presents the model algorithm and the results of the case study in which the emphasis was put on achieving the maximum classification accuracy without reducing the number of features.
Abstract: Hyperspectral images and remote sensing are important for many applications. A problem in the use of these images is the high volume of data to be processed, stored and transferred. Dimensionality reduction techniques can be used to reduce the volume of data. In this paper, an approach to band selection based on clustering algorithms is presented. This approach allows to reduce the volume of data. The proposed structure is based on Fuzzy C-Means (or K-Means) and NWHFC algorithms. New attributes in relation to other studies in the literature, such as kurtosis and low correlation, are also considered. A comparison of the results of the approach using the Fuzzy C-Means and K-Means with different attributes is performed. The use of both algorithms show similar good results but, particularly when used attributes variance and kurtosis in the clustering process, however applicable in hyperspectral images.
Abstract: In this paper, an attribute weighting method called fuzzy C-means clustering based attribute weighting (FCMAW) for classification of Diabetes disease dataset has been used. The aims of this study are to reduce the variance within attributes of diabetes dataset and to improve the classification accuracy of classifier algorithm transforming from non-linear separable datasets to linearly separable datasets. Pima Indians Diabetes dataset has two classes including normal subjects (500 instances) and diabetes subjects (268 instances). Fuzzy C-means clustering is an improved version of K-means clustering method and is one of most used clustering methods in data mining and machine learning applications. In this study, as the first stage, fuzzy C-means clustering process has been used for finding the centers of attributes in Pima Indians diabetes dataset and then weighted the dataset according to the ratios of the means of attributes to centers of theirs. Secondly, after weighting process, the classifier algorithms including support vector machine (SVM) and k-NN (k- nearest neighbor) classifiers have been used for classifying weighted Pima Indians diabetes dataset. Experimental results show that the proposed attribute weighting method (FCMAW) has obtained very promising results in the classification of Pima Indians diabetes dataset.
Abstract: Rough set theory is used to handle uncertainty and incomplete information by applying two accurate sets, Lower approximation and Upper approximation. In this paper, the rough clustering algorithms are improved by adopting the Similarity, Dissimilarity–Similarity and Entropy based initial centroids selection method on three different clustering algorithms namely Entropy based Rough K-Means (ERKM), Similarity based Rough K-Means (SRKM) and Dissimilarity-Similarity based Rough K-Means (DSRKM) were developed and executed by yeast dataset. The rough clustering algorithms are validated by cluster validity indexes namely Rand and Adjusted Rand indexes. An experimental result shows that the ERKM clustering algorithm perform effectively and delivers better results than other clustering methods. Outlier detection is an important task in data mining and very much different from the rest of the objects in the clusters. Entropy based Rough Outlier Factor (EROF) method is seemly to detect outlier effectively for yeast dataset. In rough K-Means method, by tuning the epsilon (ᶓ) value from 0.8 to 1.08 can detect outliers on boundary region and the RKM algorithm delivers better results, when choosing the value of epsilon (ᶓ) in the specified range. An experimental result shows that the EROF method on clustering algorithm performed very well and suitable for detecting outlier effectively for all datasets. Further, experimental readings show that the ERKM clustering method outperformed the other methods.
Abstract: Texture is an important characteristic in real and
synthetic scenes. Texture analysis plays a critical role in inspecting
surfaces and provides important techniques in a variety of
applications. Although several descriptors have been presented to
extract texture features, the development of object recognition is still a
difficult task due to the complex aspects of texture. Recently, many
robust and scaling-invariant image features such as SIFT, SURF and
ORB have been successfully used in image retrieval and object
recognition. In this paper, we have tried to compare the performance
for texture classification using these feature descriptors with k-means
clustering. Different classifiers including K-NN, Naive Bayes, Back
Propagation Neural Network , Decision Tree and Kstar were applied in
three texture image sets - UIUCTex, KTH-TIPS and Brodatz,
respectively. Experimental results reveal SIFTS as the best average
accuracy rate holder in UIUCTex, KTH-TIPS and SURF is
advantaged in Brodatz texture set. BP neuro network works best in the
test set classification among all used classifiers.
Abstract: Given the increase in the number of e-commerce sites,
the number of competitors has become very important. This means
that companies have to take appropriate decisions in order to meet the
expectations of their customers and satisfy their needs. In this paper,
we present a case study of applying LRFM (length, recency,
frequency and monetary) model and clustering techniques in the
sector of electronic commerce with a view to evaluating customers’
values of the Moroccan e-commerce websites and then developing
effective marketing strategies. To achieve these objectives, we adopt
LRFM model by applying a two-stage clustering method. In the first
stage, the self-organizing maps method is used to determine the best
number of clusters and the initial centroid. In the second stage, kmeans
method is applied to segment 730 customers into nine clusters
according to their L, R, F and M values. The results show that the
cluster 6 is the most important cluster because the average values of
L, R, F and M are higher than the overall average value. In addition,
this study has considered another variable that describes the mode of
payment used by customers to improve and strengthen clusters’
analysis. The clusters’ analysis demonstrates that the payment method is
one of the key indicators of a new index which allows to assess the
level of customers’ confidence in the company's Website.
Abstract: During the post-Civil War era, the city of Nashville,
Tennessee, had the highest mortality rate in the United States. The
elevated death and disease rates among former slaves were
attributable to lack of quality healthcare. To address the paucity of
healthcare services, Meharry Medical College, an institution with the
mission of educating minority professionals and serving the
underserved population, was established in 1876.
Purpose: The social ecological framework and partial least squares
(PLS) path modeling were used to quantify the impact of
socioeconomic status and adverse health outcome on primary care
professionals serving the disadvantaged community. Thus, the study
results could demonstrate the accomplishment of the College’s
mission of training primary care professionals to serve in underserved
areas.
Methods: Various statistical methods were used to analyze alumni
data from 1975 – 2013. K-means cluster analysis was utilized to
identify individual medical and dental graduates in the cluster groups
of the practice communities (Disadvantaged or Non-disadvantaged
Communities). Discriminant analysis was implemented to verify the
classification accuracy of cluster analysis. The independent t-test was
performed to detect the significant mean differences of respective
clustering and criterion variables. Chi-square test was used to test if
the proportions of primary care and non-primary care specialists are
consistent with those of medical and dental graduates practicing in
the designated community clusters. Finally, the PLS path model was
constructed to explore the construct validity of analytic model by
providing the magnitude effects of socioeconomic status and adverse
health outcome on primary care professionals serving the
disadvantaged community.
Results: Approximately 83% (3,192/3,864) of Meharry Medical
College’s medical and dental graduates from 1975 to 2013 were
practicing in disadvantaged communities. Independent t-test confirmed the content validity of the cluster analysis model. Also, the
PLS path modeling demonstrated that alumni served as primary care
professionals in communities with significantly lower socioeconomic
status and higher adverse health outcome (p < .001). The PLS path
modeling exhibited the meaningful interrelation between primary
care professionals practicing communities and surrounding
environments (socioeconomic statues and adverse health outcome),
which yielded model reliability, validity, and applicability.
Conclusion: This study applied social ecological theory and
analytic modeling approaches to assess the attainment of Meharry
Medical College’s mission of training primary care professionals to
serve in underserved areas, particularly in communities with low
socioeconomic status and high rates of adverse health outcomes. In
summary, the majority of medical and dental graduates from Meharry
Medical College provided primary care services to disadvantaged
communities with low socioeconomic status and high adverse health
outcome, which demonstrated that Meharry Medical College has
fulfilled its mission. The high reliability, validity, and applicability of
this model imply that it could be replicated for comparable
universities and colleges elsewhere.
Abstract: The exponential growth of social media arouses much
attention on public opinion information. The online forums, blogs,
micro blogs are proving to be extremely valuable resources and are
having bulk volume of information. However, most of the social
media data is unstructured and semi structured form. So that it is
more difficult to decipher automatically. Therefore, it is very much
essential to understand and analyze those data for making a right
decision. The online forums hotspot detection is a promising research
field in the web mining and it guides to motivate the user to take right
decision in right time. The proposed system consist of a novel
approach to detect a hotspot forum for any given time period. It uses
aging theory to find the hot terms and E-K-means for detecting the
hotspot forum. Experimental results demonstrate that the proposed
approach outperforms k-means for detecting the hotspot forums with
the improved accuracy.
Abstract: Quantification of cardiac function is performed by
calculating blood volume and ejection fraction in routine clinical
practice. However, these works have been performed by manual
contouring, which requires computational costs and varies on the
observer. In this paper, an automatic left ventricle segmentation
algorithm on cardiac magnetic resonance images (MRI) is presented.
Using knowledge on cardiac MRI, a K-mean clustering technique is
applied to segment blood region on a coil-sensitivity corrected image.
Then, a graph searching technique is used to correct segmentation
errors from coil distortion and noises. Finally, blood volume and
ejection fraction are calculated. Using cardiac MRI from 15 subjects,
the presented algorithm is tested and compared with manual
contouring by experts to show outstanding performance.
Abstract: Clustering involves the partitioning of n objects into k
clusters. Many clustering algorithms use hard-partitioning techniques
where each object is assigned to one cluster. In this paper we propose
an overlapping algorithm MCOKE which allows objects to belong to
one or more clusters. The algorithm is different from fuzzy clustering
techniques because objects that overlap are assigned a membership
value of 1 (one) as opposed to a fuzzy membership degree. The
algorithm is also different from other overlapping algorithms that
require a similarity threshold be defined a priori which can be
difficult to determine by novice users.
Abstract: Leukaemia is a blood cancer disease that contributes
to the increment of mortality rate in Malaysia each year. There are
two main categories for leukaemia, which are acute and chronic
leukaemia. The production and development of acute leukaemia cells
occurs rapidly and uncontrollable. Therefore, if the identification of
acute leukaemia cells could be done fast and effectively, proper
treatment and medicine could be delivered. Due to the requirement of
prompt and accurate diagnosis of leukaemia, the current study has
proposed unsupervised pixel segmentation based on clustering
algorithm in order to obtain a fully segmented abnormal white blood
cell (blast) in acute leukaemia image. In order to obtain the
segmented blast, the current study proposed three clustering
algorithms which are k-means, fuzzy c-means and moving k-means
algorithms have been applied on the saturation component image.
Then, median filter and seeded region growing area extraction
algorithms have been applied, to smooth the region of segmented
blast and to remove the large unwanted regions from the image,
respectively. Comparisons among the three clustering algorithms are
made in order to measure the performance of each clustering
algorithm on segmenting the blast area. Based on the good sensitivity
value that has been obtained, the results indicate that moving kmeans
clustering algorithm has successfully produced the fully
segmented blast region in acute leukaemia image. Hence, indicating
that the resultant images could be helpful to haematologists for
further analysis of acute leukaemia.
Abstract: Considering the complexities involved in Cloud computing, there are still plenty of issues that affect the privacy of data in cloud environment. Unless these problems get solved, we think that the problem of preserving privacy in cloud databases is still open. In tokenization and homomorphic cryptography based solutions for privacy preserving cloud database querying, there is possibility that by colluding with service provider adversary may run brute force attacks that will reveal the attribute values.
In this paper we propose a solution by defining the variant of K –means clustering algorithm that effectively detects such brute force attacks and enhances privacy of cloud database querying by preventing this attacks.
Abstract: Microarrays are made it possible to simultaneously monitor the expression profiles of thousands of genes under various experimental conditions. It is used to identify the co-expressed genes in specific cells or tissues that are actively used to make proteins. This method is used to analysis the gene expression, an important task in bioinformatics research. Cluster analysis of gene expression data has proved to be a useful tool for identifying co-expressed genes, biologically relevant groupings of genes and samples. In this work K-Means algorithms has been applied for clustering of Gene Expression Data. Further, rough set based Quick reduct algorithm has been applied for each cluster in order to select the most similar genes having high correlation. Then the ACV measure is used to evaluate the refined clusters and classification is used to evaluate the proposed method. They could identify compact clusters with feature selection method used to genes are selected.
Abstract: This paper presents a preliminary attempt to apply classification of time series using meta-clusters in order to improve the quality of regression models. In this case, clustering was performed as a method to obtain subgroups of time series data with normal distribution from the inflow into wastewater treatment plant data, composed of several groups differing by mean value. Two simple algorithms, K-mean and EM, were chosen as a clustering method. The Rand index was used to measure the similarity. After simple meta-clustering, a regression model was performed for each subgroups. The final model was a sum of the subgroups models. The quality of the obtained model was compared with the regression model made using the same explanatory variables, but with no clustering of data. Results were compared using determination coefficient (R2), measure of prediction accuracy- mean absolute percentage error (MAPE) and comparison on a linear chart. Preliminary results allow us to foresee the potential of the presented technique.
Abstract: Feature selection is a process to select features which are more informative. It is one of the important steps in knowledge discovery. The problem is that all genes are not important in gene expression data. Some of the genes may be redundant, and others may be irrelevant and noisy. Here a novel approach is proposed Hybrid K-Mean-Quick Reduct (KMQR) algorithm for gene selection from gene expression data. In this study, the entire dataset is divided into clusters by applying K-Means algorithm. Each cluster contains similar genes. The high class discriminated genes has been selected based on their degree of dependence by applying Quick Reduct algorithm to all the clusters. Average Correlation Value (ACV) is calculated for the high class discriminated genes. The clusters which have the ACV value as 1 is determined as significant clusters, whose classification accuracy will be equal or high when comparing to the accuracy of the entire dataset. The proposed algorithm is evaluated using WEKA classifiers and compared. The proposed work shows that the high classification accuracy.
Abstract: Breast Cancer is the most common malignancy in women and the second leading cause of death for women all over the world. Earlier the detection of cancer, better the treatment. The diagnosis and treatment of the cancer rely on segmentation of Sonoelastographic images. Texture features has not considered for Sonoelastographic segmentation. Sonoelastographic images of 15 patients containing both benign and malignant tumorsare considered for experimentation.The images are enhanced to remove noise in order to improve contrast and emphasize tumor boundary. It is then decomposed into sub-bands using single level Daubechies wavelets varying from single co-efficient to six coefficients. The Grey Level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP) features are extracted and then selected by ranking it using Sequential Floating Forward Selection (SFFS) technique from each sub-band. The resultant images undergo K-Means clustering and then few post-processing steps to remove the false spots. The tumor boundary is detected from the segmented image. It is proposed that Local Binary Pattern (LBP) from the vertical coefficients of Daubechies wavelet with two coefficients is best suited for segmentation of Sonoelastographic breast images among the wavelet members using one to six coefficients for decomposition. The results are also quantified with the help of an expert radiologist. The proposed work can be used for further diagnostic process to decide if the segmented tumor is benign or malignant.
Abstract: The goal of a network-based intrusion detection
system is to classify activities of network traffics into two major
categories: normal and attack (intrusive) activities. Nowadays, data
mining and machine learning plays an important role in many
sciences; including intrusion detection system (IDS) using both
supervised and unsupervised techniques. However, one of the
essential steps of data mining is feature selection that helps in
improving the efficiency, performance and prediction rate of
proposed approach. This paper applies unsupervised K-means
clustering algorithm with information gain (IG) for feature selection
and reduction to build a network intrusion detection system. For our
experimental analysis, we have used the new NSL-KDD dataset,
which is a modified dataset for KDDCup 1999 intrusion detection
benchmark dataset. With a split of 60.0% for the training set and the
remainder for the testing set, a 2 class classifications have been
implemented (Normal, Attack). Weka framework which is a java
based open source software consists of a collection of machine
learning algorithms for data mining tasks has been used in the testing
process. The experimental results show that the proposed approach is
very accurate with low false positive rate and high true positive rate
and it takes less learning time in comparison with using the full
features of the dataset with the same algorithm.
Abstract: This paper presents an optimal and unsupervised satellite image segmentation approach based on Pearson system and k-Means Clustering Algorithm Initialization. Such method could be considered as original by the fact that it utilised K-Means clustering algorithm for an optimal initialisation of image class number on one hand and it exploited Pearson system for an optimal statistical distributions- affectation of each considered class on the other hand. Satellite image exploitation requires the use of different approaches, especially those founded on the unsupervised statistical segmentation principle. Such approaches necessitate definition of several parameters like image class number, class variables- estimation and generalised mixture distributions. Use of statistical images- attributes assured convincing and promoting results under the condition of having an optimal initialisation step with appropriated statistical distributions- affectation. Pearson system associated with a k-means clustering algorithm and Stochastic Expectation-Maximization 'SEM' algorithm could be adapted to such problem. For each image-s class, Pearson system attributes one distribution type according to different parameters and especially the Skewness 'β1' and the kurtosis 'β2'. The different adapted algorithms, K-Means clustering algorithm, SEM algorithm and Pearson system algorithm, are then applied to satellite image segmentation problem. Efficiency of those combined algorithms was firstly validated with the Mean Quadratic Error 'MQE' evaluation, and secondly with visual inspection along several comparisons of these unsupervised images- segmentation.
Abstract: Due to the constant increase in the volume of information available to applications in fields varying from medical diagnosis to web search engines, accurate support of similarity becomes an important task. This is also the case of spam filtering techniques where the similarities between the known and incoming messages are the fundaments of making the spam/not spam decision. We present a novel approach to filtering based solely on layout, whose goal is not only to correctly identify spam, but also warn about major emerging threats. We propose a mathematical formulation of the email message layout and based on it we elaborate an algorithm to separate different types of emails and find the new, numerically relevant spam types.
Abstract: In this paper a Pattern Recognition algorithm based on
a constrained version of the k-means clustering algorithm will be
presented. The proposed algorithm is a non parametric supervised
statistical pattern recognition algorithm, i.e. it works under very mild
assumptions on the dataset. The performance of the algorithm will
be tested, togheter with a feature extraction technique that captures
the information on the closed two-dimensional contour of an image,
on images of industrial mineral ores.