Abstract: In this paper we present the first Arabic sentence
dataset for on-line handwriting recognition, written on a tablet PC. The
dataset is natural, simple and clear. Texts are sampled from daily
newspapers. To collect naturally written handwriting, forms are
dictated to writers. The current version of our dataset includes 154
paragraphs written by 48 writers. It contains more than 3800 words
and more than 19,400 characters. Handwritten texts are mainly
written by researchers from different research centers. In order to use
this dataset in a recognition system, word extraction is needed. In this
paper a new word extraction technique based on the cursive nature of
Arabic handwriting is also presented. The technique is applied
to this dataset and good results are obtained. The results can be
considered a benchmark against which future research can be compared.
Abstract: Many real-world data sets have a very high dimensional feature space. Most clustering techniques use the distance or similarity between objects as a measure to build clusters. But in high dimensional spaces, distances between points become relatively uniform. In such cases, density based approaches may give better results. Subspace clustering algorithms automatically identify lower dimensional subspaces of the higher dimensional feature space in which clusters exist. In this paper, we propose a new clustering algorithm, ISC (Intelligent Subspace Clustering), which tries to overcome three major limitations of the existing state-of-the-art techniques. ISC determines input parameters such as the ε-distance at various levels of subspace clustering, which helps in finding meaningful clusters. A uniform-parameter approach is not suitable for different kinds of databases. ISC implements dynamic and adaptive determination of meaningful clustering parameters based on a hierarchical filtering approach. The third and most important feature of ISC is its ability to learn incrementally and to include and exclude subspaces dynamically, which leads to better cluster formation.
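The claim that distances become relatively uniform in high dimensional spaces can be checked numerically. The following is a minimal sketch and is not part of ISC: the function name `relative_contrast`, the uniform random data and the choice of dimensions are all assumptions made for illustration. It measures how much farther the farthest point is from a query than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for one random query against random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))   # hypothetical data: uniform in the unit cube
    query = rng.uniform(size=dim)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()

# The contrast shrinks as the dimension grows, so "near" and "far"
# neighbours become hard to tell apart by distance alone.
for dim in (2, 20, 200, 2000):
    print(dim, round(relative_contrast(dim), 3))
```

A small contrast means a single distance threshold separates points poorly, which is one motivation for searching for clusters in lower dimensional subspaces.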
Abstract: This paper is concerned with the production of an Arabic word semantic similarity (WSS) benchmark dataset. It is the first of its kind for Arabic and was developed specifically to assess the accuracy of word semantic similarity measurements. Semantic similarity is an essential component of numerous applications in fields such as natural language processing, artificial intelligence, linguistics, and psychology. Most of the reported work has been done for English. To the best of our knowledge, there is no word similarity measure developed specifically for Arabic. In this paper, an Arabic benchmark dataset of 70 word pairs is presented. New methods and the best available techniques have been used in this study to produce the Arabic dataset. This includes selecting and creating materials, collecting human ratings from a representative sample of participants, and calculating the overall ratings. This dataset will make a substantial contribution to future work in the field of Arabic WSS, and it is hoped that it will serve as a reference basis from which to evaluate and compare different methodologies in the field.
Abstract: Microarrays have become effective, broadly used tools in biological and medical research for addressing a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing and systematizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we describe and compare the performance of several clustering methods under different data preprocessing strategies, including normalization and noise cleaning. We also evaluate each of these clustering methods with validation measures on both simulated data and real gene expression data. The results indicate that clustering methods commonly used in microarray data analysis are affected by normalization and by the degree of noise and cleanness of the datasets.
Abstract: Probability-based identity disclosure risk
measurement may give the same overall risk for different
anonymization strategies applied to the same dataset. Some entities in the
anonymous dataset may have higher identification risks than the
others. Individuals are more concerned about risks higher than the
average and are more interested in knowing whether they may
be under higher risk. The notion of overall risk in the above
measurement method does not indicate whether some of the involved
entities have higher identity disclosure risk than the others. In this
paper, we have introduced an identity disclosure risk measurement
method that not only implies overall risk, but also indicates whether
some of the members have higher risk than the others. The proposed
method quantifies the overall risk based on the individual risk values,
the percentage of the records that have a risk value higher than the
average, and how much larger the higher risk values are compared to the
average. We have analyzed the disclosure risks for different
disclosure control techniques applied to original microdata and
present the results.
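The three ingredients named in this abstract (individual risk values, the share of records above the average, and how much larger those risks are than the average) can be sketched as follows. This is an illustration under stated assumptions, not the authors' measure: the per-record risk model (one over the number of records sharing the same quasi-identifier values) and the name `risk_profile` are hypothetical, and since the abstract does not say how the quantities are combined, they are reported separately.

```python
from collections import Counter

def risk_profile(quasi_identifiers):
    """Summarise per-record re-identification risk for an anonymised table.

    Assumption: a record's risk is 1 / (size of its group of records
    with identical quasi-identifier values).
    """
    sizes = Counter(quasi_identifiers)
    risks = [1.0 / sizes[q] for q in quasi_identifiers]
    mean_risk = sum(risks) / len(risks)
    high = [r for r in risks if r > mean_risk]
    return {
        "mean_risk": mean_risk,
        # fraction of records whose risk exceeds the average
        "share_above_mean": len(high) / len(risks),
        # how many times larger the above-average risks are, on average
        "excess_ratio": (sum(high) / len(high)) / mean_risk if high else 1.0,
    }

# Hypothetical table: four records share one group, one record is unique.
records = [("30-39", "F")] * 4 + [("40-49", "M")]
profile = risk_profile(records)
```

Two anonymization strategies with the same `mean_risk` can still differ in the other two values, which is the distinction the abstract argues an overall figure hides.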
Abstract: Locality Sensitive Hashing (LSH) is one of the most
promising techniques for solving the nearest neighbour search problem in
high dimensional space. Euclidean LSH is the most popular variation
of LSH that has been successfully applied in many multimedia
applications. However, the Euclidean LSH presents limitations that
affect structure and query performance. The main limitation of the
Euclidean LSH is the large memory consumption. In order to achieve
good accuracy, a large number of hash tables is required. In this
paper, we propose a new hashing algorithm to overcome the storage
space problem and improve query time, while keeping accuracy
similar to that achieved by the original Euclidean LSH.
Experimental results on a real large-scale dataset show that the
proposed approach achieves good performance and consumes less
memory than the Euclidean LSH.
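For context, the baseline that this abstract compares against can be sketched as below. This shows the standard Euclidean LSH scheme (random Gaussian projections, a random offset and a bucket width), not the new algorithm proposed in the paper. The class name `EuclideanLSH` and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """Baseline Euclidean LSH: h(x) = floor((a . x + b) / width)."""

    def __init__(self, dim, n_tables=8, n_hashes=4, width=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        # One set of n_hashes random projections per hash table.
        self.a = rng.normal(size=(n_tables, n_hashes, dim))
        self.b = rng.uniform(0.0, width, size=(n_tables, n_hashes))
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = []

    def _keys(self, x):
        codes = np.floor((self.a @ x + self.b) / self.width).astype(int)
        return [tuple(row) for row in codes]   # one bucket key per table

    def add(self, x):
        x = np.asarray(x, dtype=float)
        idx = len(self.points)
        self.points.append(x)
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)             # every table stores every point

    def query(self, q):
        q = np.asarray(q, dtype=float)
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table.get(key, []))
        if not candidates:
            return None
        # Rank only the colliding candidates by exact distance.
        return min(candidates, key=lambda i: np.linalg.norm(self.points[i] - q))

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 16))
index = EuclideanLSH(dim=16)
for row in data:
    index.add(row)
nearest = index.query(data[7])
```

Because each point is indexed once per table, memory grows with the number of tables. This is the storage cost the abstract identifies as the main limitation.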
Abstract: Naive Bayes Nearest Neighbor (NBNN) and its variants, i.e., local NBNN and the NBNN kernels, are local feature-based classifiers that have achieved impressive performance in image classification. By exploiting instance-to-class (I2C) distances (an instance is an image or a video in image or video classification, respectively), they avoid the quantization errors of local image descriptors in the bag of words (BoW) model. However, the performance of NBNN, local NBNN and the NBNN kernels has not been validated on video analysis. In this paper, we introduce these three classifiers into human action recognition and conduct comprehensive experiments on the benchmark KTH and the realistic HMDB datasets. The results show that these I2C based classifiers consistently outperform the SVM classifier with the BoW model.
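The instance-to-class distance used by plain NBNN can be sketched as follows. This is a brute-force illustration, not the authors' implementation: the function name `nbnn_classify`, the dictionary layout and the toy descriptors are assumptions. Each local descriptor of the query is matched to its nearest descriptor in each class pool, and the class with the smallest summed squared distance wins.

```python
import numpy as np

def nbnn_classify(query_descriptors, class_descriptors):
    """Return (predicted label, per-class I2C distances) for one instance.

    class_descriptors maps each label to an array holding all local
    descriptors of that class's training instances.
    """
    q = np.asarray(query_descriptors, dtype=float)
    scores = {}
    for label, pool in class_descriptors.items():
        pool = np.asarray(pool, dtype=float)
        # Squared distances between every query and every pool descriptor.
        d2 = ((q[:, None, :] - pool[None, :, :]) ** 2).sum(axis=2)
        # I2C distance: sum over query descriptors of the nearest-neighbour distance.
        scores[label] = d2.min(axis=1).sum()
    return min(scores, key=scores.get), scores

# Toy descriptors (hypothetical): two classes far apart in feature space.
classes = {"walk": np.zeros((5, 3)), "run": np.full((5, 3), 5.0)}
label, scores = nbnn_classify(np.full((4, 3), 0.1), classes)
```

No descriptor is quantized into a visual word at any point, which is how this family of classifiers avoids the BoW quantization error mentioned above.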
Abstract: In this paper a one-dimensional Self-Organizing Map
(SOM) algorithm to perform feature selection is presented. The
algorithm is based on a first classification of the input dataset on a
similarity space. From this classification, a set of positive and
negative features is computed for each class. This set of features is
selected as the result of the procedure. The procedure is evaluated on an
in-house dataset from a Knowledge Discovery from Text (KDT)
application and on a set of publicly available datasets used in
international feature selection competitions. These datasets come
from KDT applications, drug discovery as well as other applications.
The knowledge of the correct classification available for the training
and validation datasets is used to optimize the parameters for positive
and negative feature extraction. The process becomes feasible for
large and sparse datasets, such as those obtained in KDT applications,
by using both compression techniques to store the similarity matrix
and speed-up techniques for the Kohonen algorithm that take
advantage of the sparsity of the input matrix. These improvements,
combined with the use of the grid, make it feasible to apply the
methodology to massive datasets.
Abstract: Today, money laundering (ML) poses a serious threat
not only to financial institutions but also to the nation. This criminal
activity is becoming more and more sophisticated and seems to have
moved from the cliché of drug trafficking to financing terrorism,
not forgetting personal gain. Most international financial
institutions have been implementing anti-money laundering (AML)
solutions to fight investment fraud. However, traditional investigative
techniques consume numerous man-hours. Recently, data mining
approaches have been developed and are considered well-suited
techniques for detecting ML activities. Within the scope of a
collaboration project for the purpose of developing a new solution for
the AML Units in an international investment bank, we proposed a
data mining-based solution for AML. In this paper, we present a
heuristic approach to improve the performance of this solution. We
also show some preliminary results obtained with this method when
analysing transaction datasets.
Abstract: This paper explores the scalability issues associated
with solving the Named Entity Recognition (NER) problem using
Support Vector Machines (SVM) and high-dimensional features. The
performance results of a set of experiments conducted using binary
and multi-class SVM with increasing training data sizes are
examined. The NER domain chosen for these experiments is the
biomedical publications domain, especially selected due to its
importance and inherent challenges. A simple machine learning
approach is used that eliminates prior language knowledge such as
part-of-speech or noun phrase tagging thereby allowing for its
applicability across languages. No domain-specific knowledge is
included. The accuracy measures achieved are comparable to those
obtained using more complex approaches, which constitutes a
motivation to investigate ways to improve the scalability of multi-class
SVM in order to make the solution more practical and usable.
Improving training time of multi-class SVM would make support
vector machines a more viable and practical machine learning
solution for real-world problems with large datasets. An initial
prototype results in great improvement of the training time at the
expense of memory requirements.
Abstract: In this paper, algorithms for the automatic localisation
of two anatomical soft tissue landmarks of the head, the medial
canthus (inner corner of the eye) and the tragus (a small, pointed,
cartilaginous flap of the ear), in CT images are described. These
landmarks are to be used as a basis for an automated image-to-patient
registration system we are developing. The landmarks are localised
on a surface model extracted from CT images, based on surface
curvature and a rule based system that incorporates prior knowledge
of the landmark characteristics. The approach was tested on a dataset
of near isotropic CT images of 95 patients. The position of the
automatically localised landmarks was compared to the position of
the manually localised landmarks. The average difference was 1.5
mm and 0.8 mm for the medial canthus and tragus, with a maximum
difference of 4.5 mm and 2.6 mm, respectively. The medial canthus
and tragus can be automatically localised in CT images, with
performance comparable to manual localisation.
Abstract: In large datasets, identifying exceptional or rare cases
with respect to a group of similar cases is considered a very significant
problem. The traditional problem (Outlier Mining) is to find
exceptional or rare cases in a dataset irrespective of the class label of
these cases; they are considered rare events with respect to the whole
dataset. In this research, we pose the problem of Class Outlier
Mining and propose a method to find those outliers. The general
definition of this problem is "given a set of observations with class
labels, find those that arouse suspicions, taking into account the
class labels". We introduce a novel definition of outlier, the Class
Outlier, and propose the Class Outlier Factor (COF), which measures
the degree of being a Class Outlier for a data object. Our work
includes a proposal of a new algorithm for mining Class
Outliers, experimental results on real-world datasets from various
domains, and a comparison study with other related methods.
Abstract: In this paper, we propose an adaptation of the Patricia-Tree for sparse datasets to generate non-redundant association rules. Using this adaptation, we can generate frequent closed itemsets, which are more compact than the frequent itemsets used in the Apriori approach. This adaptation has been evaluated on a set of benchmark datasets.
Abstract: The ability to recognize humans and their activities by computer vision is a very important task, with many potential applications. The study of human motion analysis is related to several research areas of computer vision such as motion capture, detection, tracking and segmentation of people. In this paper, we describe a segmentation method for extracting the human body contour in a modified HLS color space. To estimate the background, a modified HLS color space is proposed, and the background features are estimated by using the HLS color components. A large human dataset, collected with DV cameras, is pre-processed. The human body and its contour are successfully extracted from the image sequences.
Abstract: Tumor classification is a key area of research in the
field of bioinformatics. Microarray technology is commonly used in
the study of disease diagnosis using gene expression levels. The
main drawback of gene expression data is that it contains thousands
of genes and very few samples. Feature selection methods are used
to select the informative genes from the microarray. These methods
considerably improve the classification accuracy. In the proposed
method, Genetic Algorithm (GA) is used for effective feature
selection. Informative genes are identified based on the T-Statistics,
Signal-to-Noise Ratio (SNR) and F-Test values. The initial candidate
solutions of GA are obtained from top-m informative genes. The
classification accuracy of k-Nearest Neighbor (kNN) method is used
as the fitness function for GA. In this work, kNN and Support Vector
Machine (SVM) are used as the classifiers. The experimental results
show that the proposed work is suitable for effective feature
selection. With the help of the selected genes, the GA-kNN method
achieves 100% accuracy in 4 and the GA-SVM method
in 5 out of 10 datasets. GA combined with kNN and SVM
is demonstrated to be an accurate method for microarray
based tumor classification.
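The gene-ranking step that seeds the GA can be sketched for one of the three criteria, the signal-to-noise ratio. The abstract does not give the formula, so the definition below (difference of class means divided by the sum of class standard deviations, a common form in microarray studies) is an assumption, as are the function names `snr_scores` and `top_m_genes` and the toy expression matrix.

```python
import numpy as np

def snr_scores(X, y):
    """Per-gene signal-to-noise ratio for a two-class problem.

    X: samples x genes expression matrix, y: labels in {0, 1}.
    Assumed definition: (mean_1 - mean_0) / (std_1 + std_0).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    spread = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (spread + 1e-12)

def top_m_genes(X, y, m):
    """Indices of the m genes with the largest absolute SNR."""
    return np.argsort(np.abs(snr_scores(X, y)))[::-1][:m]

# Toy data (hypothetical): gene 0 separates the two classes, genes 1 and 2 do not.
X = np.array([[5.0, 1.0, 0.2], [5.2, 0.9, 0.9], [4.9, 1.1, 0.1],
              [1.0, 1.0, 0.8], [1.1, 0.9, 0.3], [0.9, 1.1, 0.5]])
y = np.array([1, 1, 1, 0, 0, 0])
selected = top_m_genes(X, y, m=2)
```

In the setting described above, gene subsets such as `selected` would serve as initial candidate solutions, with kNN classification accuracy as the GA fitness function.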
Abstract: It has become crucial over the years for nations to
improve their credit scoring methods and techniques in light of the
increasing volatility of the global economy. Statistical methods or
tools have been the favoured means for this; however, artificial
intelligence or soft computing based techniques are becoming
increasingly preferred due to their proficient and precise nature and
relative simplicity. This work presents a comparison between Support
Vector Machines and Artificial Neural Networks, two popular soft
computing models, when applied to credit scoring. Among the
different criteria that can be used for comparison, accuracy,
computational complexity and processing time are the criteria
selected to evaluate both models. Furthermore, the German credit
scoring dataset, which is a real-world dataset, is used to train and test
both developed models. Experimental results obtained from our study
suggest that although both soft computing models could be used with
a high degree of accuracy, Artificial Neural Networks deliver better
results than Support Vector Machines.
Abstract: Several combinations of preprocessing algorithms,
feature selection techniques and classifiers can be applied to data
classification tasks. This study introduces a new accurate classifier
that consists of four components: Signal-to-Noise
as a feature selection technique, support vector machine,
Bayesian neural network and AdaBoost as an ensemble algorithm.
To verify the effectiveness of the proposed classifier, seven
well-known classifiers are applied to four datasets. The experiments show
that using the suggested classifier enhances the classification rates for
all datasets.
Abstract: This paper presents a supervised clustering algorithm,
namely Grid-Based Supervised Clustering (GBSC), which is able to
identify clusters of any shape and size without presuming any
canonical form for data distribution. The GBSC needs no prespecified
number of clusters, is insensitive to the order of the input
data objects, and is capable of handling outliers. Built on the
combination of grid-based clustering and density-based clustering,
with the assistance of the downward closure property of density
used in bottom-up subspace clustering, the GBSC can notably reduce
its search space to avoid exhausting memory during its
execution. On two-dimensional synthetic datasets, the GBSC can
identify clusters with different shapes and sizes correctly. The GBSC
also outperforms five other supervised clustering algorithms when
the experiments are performed on some UCI datasets.
Abstract: Since dealing with high dimensional data is
computationally complex and sometimes even intractable,
several feature reduction methods have recently been developed to reduce
the dimensionality of the data in order to simplify the calculation
analysis in various applications such as text categorization, signal
processing, image retrieval, gene expression, etc. Among feature
reduction techniques, feature selection is one of the most popular
methods due to the preservation of the original features.
In this paper, we propose a new unsupervised feature selection
method which will remove redundant features from the original
feature space by the use of probability density functions of various
features. To show the effectiveness of the proposed method, popular
feature selection methods have been implemented and compared.
Experimental results on several datasets derived from the UCI
repository illustrate the effectiveness of our proposed
method in comparison with the other methods in terms of
both classification accuracy and the number of selected features.
Abstract: This paper proposes to use ETM+ multispectral data
and panchromatic band as well as texture features derived from the
panchromatic band for land cover classification. Four texture features,
including one 'internal texture' and three GLCM based textures,
namely correlation, entropy, and inverse difference moment, were used
in combination with ETM+ multispectral data. Two datasets
involving combinations of multispectral data, the panchromatic band and its
texture were used and results were compared with those obtained by
using multispectral data alone. A decision tree classifier with and
without boosting was used to classify the different datasets. Results
from this study suggest that the dataset consisting of panchromatic
band, four of its texture features and multispectral data was able to
increase the classification accuracy by about 2%. In comparison, a
boosted decision tree was able to increase the classification accuracy
by about 3% with the same dataset.
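The three GLCM based textures named in this abstract can be computed as in the sketch below. It uses the textbook definitions of correlation, entropy and inverse difference moment on a symmetric, normalised grey-level co-occurrence matrix. The function names, the pixel offset, the number of grey levels and the toy image are assumptions, since the abstract does not state the settings used in the study.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalised co-occurrence matrix for offset (dx, dy) >= 0."""
    img = np.asarray(image, dtype=int)
    rows, cols = img.shape
    counts = np.zeros((levels, levels), dtype=float)
    for r in range(rows - dy):
        for c in range(cols - dx):
            i, j = img[r, c], img[r + dy, c + dx]
            counts[i, j] += 1
            counts[j, i] += 1          # count each pair in both directions
    return counts / counts.sum()

def glcm_features(p):
    """Correlation, entropy and inverse difference moment of a GLCM.

    Correlation is undefined for a constant image (zero variance).
    """
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * p).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * p).sum())
    nz = p[p > 0]                      # skip empty cells to avoid log(0)
    return {
        "correlation": ((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j),
        "entropy": -(nz * np.log2(nz)).sum(),
        "idm": (p / (1.0 + (i - j) ** 2)).sum(),
    }

# Toy 2-level patch (hypothetical stand-in for a quantised panchromatic window).
patch = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
p = glcm(patch, levels=2)
features = glcm_features(p)
```

In a classification pipeline of the kind described above, such features would be computed in a moving window over the panchromatic band and stacked with the multispectral bands as additional input layers.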