Abstract: The backpropagation algorithm generally employs a quadratic error function; in fact, most problems that involve minimization employ the quadratic error function. With alternative error functions the performance of the optimization scheme can be improved: such functions help suppress the ill effects of outliers and have shown good robustness to noise. In this paper we evaluate and compare the relative performance of complex-valued neural networks using different error functions. In the first simulation, on the complex XOR gate, it is observed that error functions such as the absolute error and the Cauchy error function can replace the quadratic error function. In the second simulation it is observed that for some error functions the performance of the complex-valued neural network depends on the architecture of the network, whereas with a few other error functions the convergence speed is independent of the network architecture.
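To make the comparison concrete, here is a minimal sketch (assumed textbook forms of the three losses with a unit scale constant; the paper's exact definitions are not given) showing how a large complex residual dominates the quadratic cost but not the absolute or Cauchy cost:

```python
import numpy as np

def quadratic_error(y, t):
    """Standard quadratic (squared-magnitude) error for complex outputs."""
    return 0.5 * np.abs(y - t) ** 2

def absolute_error(y, t):
    """Absolute error: grows linearly, so large residuals (outliers) weigh less."""
    return np.abs(y - t)

def cauchy_error(y, t, c=1.0):
    """Cauchy error: logarithmic growth strongly suppresses outliers."""
    e = np.abs(y - t)
    return 0.5 * c ** 2 * np.log1p((e / c) ** 2)

# Example: the last residual is an outlier; it dominates the quadratic total
# cost far more than the absolute or Cauchy totals.
y = np.array([1 + 1j, 0.1 + 0.2j, 5 + 5j])   # network outputs
t = np.array([1 + 1j, 0.0 + 0.0j, 0 + 0j])   # targets
print(quadratic_error(y, t).sum(), absolute_error(y, t).sum(), cauchy_error(y, t).sum())
```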
Abstract: This paper presents a supervised clustering algorithm, namely Grid-Based Supervised Clustering (GBSC), which is able to identify clusters of arbitrary shapes and sizes without presuming any canonical form for the data distribution. The GBSC needs no prespecified number of clusters, is insensitive to the order of the input data objects, and is capable of handling outliers. Built on the combination of grid-based and density-based clustering, and aided by the downward closure property of density used in bottom-up subspace clustering, the GBSC can notably reduce its search space and so avoid running out of memory during execution. On two-dimensional synthetic datasets, the GBSC identifies clusters with different shapes and sizes correctly. The GBSC also outperforms five other supervised clustering algorithms in experiments on several UCI datasets.
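To make the grid/density combination concrete, the sketch below is a toy version (not the published GBSC, which additionally exploits class labels and the downward closure property): points are binned into grid cells, sufficiently populated cells are marked dense, and adjacent dense cells are merged by flood fill, leaving points in sparse cells as outliers:

```python
import numpy as np
from collections import deque

def grid_density_clusters(points, cell_size=0.5, min_pts=5):
    """Toy 2-D grid/density clustering: bin points into cells, keep 'dense'
    cells, and merge 8-connected dense cells into clusters via flood fill."""
    cells = {}
    for i, (x, y) in enumerate(points):
        cells.setdefault((int(x // cell_size), int(y // cell_size)), []).append(i)
    dense = {c for c, idx in cells.items() if len(idx) >= min_pts}
    labels, cluster_id = {}, 0
    for cell in dense:
        if cell in labels:
            continue
        queue = deque([cell])
        labels[cell] = cluster_id
        while queue:                      # flood fill over adjacent dense cells
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in labels:
                        labels[nb] = cluster_id
                        queue.append(nb)
        cluster_id += 1
    out = np.full(len(points), -1)        # points in sparse cells stay outliers
    for cell, idx in cells.items():
        if cell in labels:
            out[idx] = labels[cell]
    return out
```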
Abstract: Modern spatial database management systems require a unique Spatial Access Method (SAM) in order to solve complex spatial queries efficiently, and the spatial data structure takes a prominent place in the SAM. An inadequate data structure leads to poor algorithmic choices and a deficient understanding of algorithm behavior on the spatial database. A key step in developing a better semantic spatial object data structure is to quantify the performance effects of semantic and outlier detection that are not reflected in previous tree structures (the R-Tree and its variants). This paper explores a novel SSRO-Tree, a SAM built on the topo-semantic approach, and shows how to identify and handle semantic spatial objects together with outlier objects during page overflow/underflow, using gain/loss metrics. We introduce a new SSRO-Tree algorithm which achieves better performance in practice than the R*-Tree and RO-Tree on selection queries.
Abstract: An algorithm for learning an overcomplete dictionary
using a Cauchy mixture model for sparse decomposition of an underdetermined
mixing system is introduced. The mixture density
function is derived from a ratio sample of the observed mixture
signals, where 1) at least two (but not necessarily more) mixture signals are observed, 2) the source signals are statistically independent, and 3) the sources are sparse. The basis vectors of the
dictionary are learned via the optimization of the location parameters
of the Cauchy mixture components, which is shown to be more
accurate and robust than the conventional data mining methods
usually employed for this task. Using a well known sparse
decomposition algorithm, we extract three speech signals from two
mixtures based on the estimated dictionary. Further tests with
additive Gaussian noise are used to demonstrate the proposed
algorithm's robustness to outliers.
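A minimal sketch of the ratio idea under assumed conditions (sparse Laplace sources, a made-up 2x3 mixing matrix, equal mixture weights; not the paper's exact estimator): with sparse sources, the ratio of the two observed mixtures concentrates around the column ratios of the mixing matrix, and maximizing a Cauchy-mixture likelihood over the location parameters recovers those directions:

```python
import numpy as np
from scipy.optimize import minimize

def cauchy_mixture_nll(params, r, k):
    """Negative log-likelihood of an equal-weight k-component Cauchy mixture;
    params = [loc_1..loc_k, log_scale_1..log_scale_k]."""
    locs, scales = params[:k], np.exp(params[k:])
    dens = np.zeros_like(r)
    for m, s in zip(locs, scales):
        dens += (s / np.pi) / ((r - m) ** 2 + s ** 2) / k
    return -np.sum(np.log(dens + 1e-300))

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0, 1.0],          # made-up mixing matrix; the column
              [0.5, -1.0, 2.0]])        # ratios 0.5, -1.0, 2.0 are the targets
S = rng.laplace(scale=0.1, size=(3, 2000)) * (rng.random((3, 2000)) < 0.1)
X = A @ S                               # two observed mixtures of three sources
mask = np.abs(X[0]) > 1e-3              # avoid dividing by (near) zero
r = X[1, mask] / X[0, mask]             # the ratio sample
init = np.concatenate([np.quantile(r, [0.25, 0.5, 0.75]), np.log([0.1] * 3)])
fit = minimize(cauchy_mixture_nll, init, args=(r, 3), method="Nelder-Mead")
print("estimated locations:", np.sort(fit.x[:3]))   # ideally near -1.0, 0.5, 2.0
```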
Abstract: Data clustering is an important data exploration technique with many applications in data mining. We present an enhanced version of the well-known single-link clustering algorithm, which we refer to as DCBOR. The proposed algorithm alleviates the chain effect by removing the outliers from the given dataset, so it provides outlier detection and data clustering simultaneously. The algorithm does not need to update the distance matrix, since it merges the k nearest objects in one step and a cluster continues to grow as long as possible under a specified condition. The algorithm thus consists of two phases: in the first phase it removes the outliers from the input dataset; in the second phase it performs the clustering process. It discovers clusters of different shapes, sizes and densities, and requires only one input parameter, a threshold for outlier points whose value ranges from 0 to 1; the algorithm supports the user in determining an appropriate value for it. We have tested this algorithm on different datasets containing outliers and clusters connected by chains of dense points, and it discovers the correct clusters. The results of our experiments demonstrate the effectiveness and the efficiency of DCBOR.
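A compact sketch in the spirit of the two phases described above (the published DCBOR differs in details such as how the threshold is applied and how clusters grow):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dcbor_like(X, t=0.9, k=5):
    """Phase 1: drop points whose mean k-NN distance is above the t-quantile
    (t in [0, 1] plays the role of the single outlier threshold).
    Phase 2: cluster the survivors by single-link merging of k nearest objects."""
    D = cdist(X, X)
    knn = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)   # mean k-NN distance
    keep = knn <= np.quantile(knn, t)                    # phase 1: outlier removal
    idx = np.where(keep)[0]
    parent = {i: i for i in idx}                         # union-find structure
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in idx:                                        # phase 2: merge k-NN
        for j in idx[np.argsort(D[i, idx])][1:k + 1]:
            parent[find(i)] = find(j)
    roots = {i: find(i) for i in idx}
    labels = np.full(len(X), -1)                         # -1 marks removed outliers
    for c, r in enumerate(sorted(set(roots.values()))):
        for i in idx:
            if roots[i] == r:
                labels[i] = c
    return labels
```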
Abstract: One of the purposes of robust estimation is to reduce the influence of outliers in the data on the estimates. Outliers arise from gross errors or contamination from distributions with long tails. The trimmed mean is a robust estimate, meaning it is not sensitive to violations of the distributional assumptions of the data. It is called an adaptive estimate when the trimming proportion is determined from the data rather than being fixed a priori.
The main objective of this study is to establish the robustness properties of adaptive trimmed means in terms of efficiency, high breakdown point and influence function. Specifically, it seeks the magnitude of the trimming proportion of the adaptive trimmed mean which yields efficient and robust estimates of the parameter for data following a modified Weibull distribution with parameter λ = 1/2, where the trimming proportion is determined by a ratio of two trimmed means defined as the tail length. Secondly, the asymptotic properties of the tail length and the trimmed means are investigated. Finally, a comparison is made of the efficiency of the adaptive trimmed means, in terms of the standard deviation, for data-driven trimming proportions and for proportions fixed a priori.
The asymptotic tail lengths, defined as the ratio of two trimmed means, and the asymptotic variances were computed using the derived formulas, while the standard deviations of the derived tail lengths for samples of size 40 simulated from a Weibull distribution were computed over 100 iterations using a program written in Pascal.
The findings of the study revealed that the tail lengths of the Weibull distribution increase in magnitude as the trimming proportions increase; that the tail-length measure and the adaptive trimmed mean are asymptotically independent as the number of observations n approaches infinity; that the tail length is asymptotically distributed as the ratio of two independent normal random variables; and that the asymptotic variances decrease as the trimming proportions increase. The simulation study revealed empirically that the standard error of the adaptive trimmed mean based on the ratio of tail lengths is smaller, across different trimming proportions, than that of its counterpart with trimming proportions fixed a priori.
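The abstract does not spell out the tail-length statistic, so the sketch below substitutes Hogg's classical tail-length measure and illustrative cut-offs to show how a trimming proportion can be chosen adaptively from the data:

```python
import numpy as np
from scipy.stats import trim_mean

def hogg_tail_length(x):
    """Hogg's tail-length measure: spread of the extreme 5% averages relative
    to the spread of the upper/lower 50% averages. A stand-in for the paper's
    ratio-of-trimmed-means statistic."""
    x = np.sort(x)
    n = len(x)
    k, h = max(1, int(0.05 * n)), max(1, int(0.5 * n))
    return (x[-k:].mean() - x[:k].mean()) / (x[-h:].mean() - x[:h].mean())

def adaptive_trimmed_mean(x):
    """Heavier tails -> trim more. The cut-offs are illustrative, not the paper's."""
    q = hogg_tail_length(x)
    alpha = 0.05 if q < 2.0 else (0.10 if q < 2.6 else 0.25)
    return trim_mean(x, alpha), alpha

rng = np.random.default_rng(1)
sample = rng.weibull(0.5, size=40)       # Weibull sample, shape lambda = 1/2
est, alpha = adaptive_trimmed_mean(sample)
print(f"trimming proportion {alpha}, adaptive trimmed mean {est:.4f}")
```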
Abstract: The detection of outliers is essential because they are responsible for serious interpretive problems in linear as well as nonlinear regression analysis. Much work has been done on the identification of outliers in linear regression, but not in nonlinear regression. In this article we propose several outlier detection techniques for nonlinear regression. The main idea is to use the linear approximation of a nonlinear model and take the gradient as the design matrix; the detection techniques are then formulated accordingly. Six detection measures are developed and combined with three estimation techniques: least squares, M-, and MM-estimators. The study shows that among the six measures, only the studentized residual and Cook's distance, combined with the MM-estimator, are consistently capable of identifying the correct outliers.
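A minimal sketch of the linearization idea for the least-squares case (the M- and MM-variants would substitute robust residuals and a robust scale): the Jacobian of the model at the fitted parameters serves as the design matrix, from which the hat matrix, studentized residuals and Cook's distance follow:

```python
import numpy as np

def outlier_measures(J, residuals):
    """Studentized residuals and Cook's distance for a nonlinear fit, using the
    gradient (Jacobian J, shape n x p, evaluated at the fitted parameters) as
    the design matrix of the linearized model."""
    n, p = J.shape
    H = J @ np.linalg.solve(J.T @ J, J.T)     # hat matrix of the linearization
    h = np.diag(H)
    s2 = residuals @ residuals / (n - p)       # residual variance estimate
    t = residuals / np.sqrt(s2 * (1 - h))      # internally studentized residuals
    cook = t ** 2 * h / (p * (1 - h))          # Cook's distance per observation
    return t, cook

# Common illustrative cut-offs: |t| > 2.5 or Cook's distance > 4 / n.
```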
Abstract: This paper presents a procedure for estimating a vector autoregression (VAR) using the Sequential Discounting VAR (SDVAR) algorithm for online model learning to detect fraudulent acts in telecommunications call detail records (CDR). The volatility of the VAR is observed allowing for non-linearity, outliers and change points, based on the work of [1]. This paper extends their procedure from univariate to multivariate time series. A simulation and a case study on detecting telecommunications fraud from CDR illustrate the use of the algorithm in the bivariate setting.
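As a hedged illustration of sequential discounting (a generic discounted recursive estimate for a VAR(1) model, not necessarily the paper's exact SDVAR update): old observations are geometrically down-weighted so the coefficient matrix adapts online, and a large one-step prediction error flags a possible outlier or change point:

```python
import numpy as np

class DiscountedVAR1:
    """Sequentially discounted estimation of x_t = A x_{t-1} + e_t.
    The discounting factor r down-weights old data so the model tracks change."""
    def __init__(self, dim, r=0.02, eps=1e-6):
        self.r = r
        self.C0 = np.eye(dim) * eps      # discounted E[x_{t-1} x_{t-1}^T]
        self.C1 = np.zeros((dim, dim))   # discounted E[x_t x_{t-1}^T]
        self.prev = None

    def update(self, x):
        """Feed one observation; return the one-step-ahead squared error
        (a large value suggests an outlier or change point)."""
        x = np.asarray(x, dtype=float)
        score = 0.0
        if self.prev is not None:
            A = self.C1 @ np.linalg.pinv(self.C0)
            score = float(np.sum((x - A @ self.prev) ** 2))
            self.C0 = (1 - self.r) * self.C0 + self.r * np.outer(self.prev, self.prev)
            self.C1 = (1 - self.r) * self.C1 + self.r * np.outer(x, self.prev)
        self.prev = x
        return score
```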
Abstract: This paper addresses multisensor data fusion under non-Gaussian channel noise. Advanced M-estimates are known to be a robust solution, at the cost of some accuracy. In order to improve the estimation accuracy while maintaining equivalent robustness, a two-stage robust fusion algorithm is proposed: preliminary rejection of outliers followed by an optimal linear fusion. Numerical experiments show that the proposed algorithm is equivalent to the M-estimates in the case of uncorrelated local estimates and significantly outperforms the M-estimates when the local estimates are correlated.
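A minimal sketch of the two-stage idea under assumed specifics (MAD-based rejection, uncorrelated local estimates):

```python
import numpy as np

def two_stage_fusion(estimates, variances, k=3.0):
    """Stage 1: reject local estimates deviating from the median by more than
    k robust standard deviations (MAD-based). Stage 2: optimally fuse the
    survivors by inverse-variance weighting. The correlated case would need
    the full cross-covariance of the local estimates."""
    estimates = np.asarray(estimates, float)
    variances = np.asarray(variances, float)
    med = np.median(estimates)
    mad = 1.4826 * np.median(np.abs(estimates - med))      # robust sigma
    keep = np.abs(estimates - med) <= k * max(mad, 1e-12)  # stage 1: rejection
    w = 1.0 / variances[keep]                              # stage 2: linear fusion
    fused = np.sum(w * estimates[keep]) / np.sum(w)
    return fused, 1.0 / np.sum(w)

# The gross outlier 5.0 is rejected before fusing:
print(two_stage_fusion([1.02, 0.98, 1.01, 5.0], [0.01, 0.01, 0.02, 0.01]))
```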
Abstract: The one-class support vector machine “support vector
data description” (SVDD) is an ideal approach for anomaly or outlier
detection. However, for SVDD to be usable in real-world applications, ease of use is crucial. The results of SVDD are largely determined by the choice of the regularisation parameter C and the kernel parameter of the widely used RBF kernel.
two-class SVMs the parameters can be tuned using cross-validation
based on the confusion matrix, for a one-class SVM this is not
possible, because only true positives and false negatives can occur
during training. This paper proposes an approach to find the optimal
set of parameters for SVDD solely based on a training set from
one class and without any user parameterisation. Results on artificial
and real data sets are presented, underpinning the usefulness of the
approach.
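The paper's selection procedure is not reproduced here; for contrast, one simple generic heuristic using only one-class training data is sketched below with scikit-learn's OneClassSVM (whose parameter nu plays a role comparable to SVDD's C): fix a target outlier fraction and pick the RBF width giving the smoothest boundary, i.e. the fewest support vectors:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def select_ocsvm_params(X, nu=0.05, gammas=np.logspace(-3, 2, 20)):
    """Grid search over the RBF width: for a fixed target outlier fraction nu,
    keep the gamma that minimises the fraction of support vectors, i.e. the
    smoothest boundary still enclosing ~(1 - nu) of the training data."""
    best = None
    for g in gammas:
        model = OneClassSVM(kernel="rbf", nu=nu, gamma=g).fit(X)
        sv_frac = len(model.support_) / len(X)
        if best is None or sv_frac < best[0]:
            best = (sv_frac, g, model)
    return best[2], best[1]

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))            # one-class training data only
model, gamma = select_ocsvm_params(X)
print("chosen gamma:", gamma,
      "training rejection rate:", (model.predict(X) == -1).mean())
```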
Abstract: Wrist pulse analysis for identification of health status is found in ancient Indian as well as Chinese literature. Preprocessing of the wrist pulse is necessary to remove outlier pulses and fluctuations prior to the analysis of the pulse pressure signal. This paper discusses the identification of irregular pulses present in the pulse series and the intricacies associated with the extraction of time-domain pulse features. A Dynamic Time Warping (DTW) approach is utilized for the identification of outlier pulses in the wrist pulse series. The ambiguity present in the identification of pulse features is resolved with the help of the first derivative of the ensemble average of the wrist pulse series. An algorithm for detecting the tidal and dicrotic notch in individual wrist pulse segments is proposed.
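A minimal sketch of DTW-based screening (the thresholding rule is illustrative, not the paper's): each pulse segment is compared with the others by DTW distance, and segments unusually far from the rest are flagged:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between two
    1-D pulse segments; tolerant of beat-to-beat time warping."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def flag_outlier_pulses(pulses, k=2.0):
    """Mark pulses whose mean DTW distance to the other pulses in the series is
    more than k standard deviations above average (illustrative rule only)."""
    n = len(pulses)
    d = np.array([np.mean([dtw_distance(pulses[i], pulses[j])
                           for j in range(n) if j != i]) for i in range(n)])
    return d > d.mean() + k * d.std()
```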
Abstract: Corner detection and optical flow are common techniques for feature-based video stabilization. However, these algorithms are computationally expensive and should be performed at a reasonable rate. This paper presents an algorithm for discarding irrelevant feature points and maintaining the rest for future use so as to reduce the computational cost. The algorithm starts by initializing a maintained set. The feature points in the maintained set are examined for their accuracy for modeling; corner detection is required only when the feature points are insufficiently accurate for further modeling. Then, optical flows are computed from the maintained feature points toward the consecutive frame. After that, a motion model is estimated based on the simplified affine motion model and the least-squares method, with outliers belonging to moving objects present. Studentized residuals are used to eliminate such outliers, and the estimation and elimination steps repeat until no more outliers are identified. Finally, the entire algorithm repeats along the video sequence, with the points remaining from the previous iteration used as the maintained set. As a practical application, efficient video stabilization can be achieved by exploiting the computed motion models. Our study shows that the number of times corner detection needs to be performed is greatly reduced, significantly lowering the computational cost. Moreover, optical flow vectors are computed only for the maintained feature points, not for outliers, further reducing the cost. In addition, the feature points remaining after reduction are sufficient for tracking background objects, as demonstrated by a simple video stabilizer based on our proposed algorithm.
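A sketch of the estimate-and-eliminate loop described above, assuming a four-parameter simplified affine model (scale-rotation a, b plus translation tx, ty) and an illustrative cut-off on the studentized residuals:

```python
import numpy as np

def estimate_motion(src, dst, t_crit=2.5, max_iter=10):
    """Fit x' = a*x - b*y + tx, y' = b*x + a*y + ty by least squares on matched
    points (src, dst are (n, 2) arrays), then repeatedly drop points whose
    studentized residuals exceed t_crit (outliers on moving objects) and refit."""
    keep = np.ones(len(src), bool)
    for _ in range(max_iter):
        x, y = src[keep, 0], src[keep, 1]
        one, zero = np.ones_like(x), np.zeros_like(x)
        A = np.block([[x[:, None], -y[:, None], one[:, None], zero[:, None]],
                      [y[:, None],  x[:, None], zero[:, None], one[:, None]]])
        b = np.concatenate([dst[keep, 0], dst[keep, 1]])
        theta, *_ = np.linalg.lstsq(A, b, rcond=None)
        r = b - A @ theta
        n, p = A.shape
        h = np.diag(A @ np.linalg.solve(A.T @ A, A.T))   # leverages
        s = np.sqrt(r @ r / (n - p))
        t = r / (s * np.sqrt(1 - h))                     # studentized residuals
        bad = np.abs(t) > t_crit
        bad_pts = bad[:n // 2] | bad[n // 2:]            # merge x- and y-flags
        if not bad_pts.any():
            break
        keep[np.where(keep)[0][bad_pts]] = False         # eliminate and refit
    return theta, keep
```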
Abstract: On-line (near-infrared) spectroscopy is widely used to support the operation of complex process systems. Information extracted from a spectral database can be used to estimate unmeasured product properties and monitor the operation of the process. These techniques are based on looking for similar spectra by nearest-neighbor algorithms and distance-based search methods. Searching for nearest neighbors in the spectral space is computationally demanding, and the cost increases with the number of points in the discrete spectrum and the number of samples in the database, so some form of indexing is needed to reduce the calculation time. The main idea presented in this paper is to combine indexing and visualization techniques to reduce the computational requirements of estimation algorithms by providing a two-dimensional index that can also be used to visualize the structure of the spectral database. This 2D visualization of the spectral database not only supports the application of distance- and similarity-based techniques but also enables advanced clustering and prediction algorithms based on the Delaunay tessellation of the mapped spectral space. This means the prediction need not operate in the high-dimensional space but can be based on the mapped space instead. The results illustrate that the proposed method is able to segment (cluster) spectral databases and detect outliers that are not suitable for instance-based learning algorithms.
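A small sketch of the Delaunay-based prediction step (the 2D mapping itself is assumed to be given; names and data are made up): SciPy's LinearNDInterpolator builds the Delaunay tessellation of the mapped points and interpolates linearly inside each triangle:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(3)
mapped = rng.random((200, 2))                 # 2-D coordinates of the spectra
y = np.sin(3 * mapped[:, 0]) + mapped[:, 1]   # property measured per sample
predict = LinearNDInterpolator(mapped, y)     # builds the Delaunay tessellation
query = np.array([[0.4, 0.6]])                # a new, already-mapped spectrum
print(predict(query))   # NaN outside the convex hull flags potential outliers
```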
Abstract: Preprocessing of speech signals is considered a crucial step in the development of a robust and efficient speech or speaker recognition system. In this paper, we present some popular statistical outlier-detection based strategies to segregate the silence/unvoiced part of the speech signal from the voiced portion. The proposed methods are based on the 3σ edit rule and the Hampel identifier, which are compared with the conventional techniques: (i) short-time energy (STE) based methods, and (ii) distribution-based methods. The results obtained after applying the proposed strategies to some test voice signals are encouraging.
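One plausible reading of the Hampel-identifier strategy (the frame length, the threshold, and the convention that voiced frames are the high-energy outliers are assumptions):

```python
import numpy as np

def hampel_voiced_mask(signal, frame=256, k=3.0):
    """Compute short-time energy (STE) per frame, then apply the Hampel
    identifier: frames whose STE exceeds median + k * 1.4826 * MAD are treated
    as 'outliers' against the silence/unvoiced background, i.e. voiced."""
    x = np.asarray(signal, dtype=float)
    n = len(x) // frame
    ste = np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    med = np.median(ste)
    mad = 1.4826 * np.median(np.abs(ste - med))
    return ste > med + k * mad       # True marks voiced frame candidates
```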
Abstract: In recent decades, a number of robust fuzzy clustering algorithms have been proposed to partition data sets affected by noise and outliers. Robust fuzzy C-means (robust-FCM) is certainly one of the best known among these algorithms. In robust-FCM, noise is modeled as a separate cluster characterized by a prototype that has a constant distance δ from all data points. The distance δ determines the boundary of the noise cluster and is therefore a critical parameter of the algorithm. Although some approaches have been proposed to automatically determine the most suitable δ for a specific application, to date no efficient and fully satisfactory solution exists. The aim of this paper is to propose a novel method to compute the optimal δ based on the analysis of the distribution of the percentage of objects assigned to the noise cluster in repeated executions of robust-FCM with decreasing values of δ. The extremely encouraging results obtained on some data sets from the literature are shown and discussed.
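For context, the role of δ is visible in the classical noise-clustering membership update sketched below (this is the standard formulation; the paper's δ-selection method is not reproduced):

```python
import numpy as np

def noise_fcm_memberships(D, delta, m=2.0):
    """One membership update of robust/noise FCM: D is the (c, n) matrix of
    distances between the c regular prototypes and n points, and the noise
    cluster sits at constant distance delta from every point. A point far
    (> delta) from all regular prototypes gets most of its membership in the
    noise cluster."""
    c, n = D.shape
    D2 = np.vstack([D, np.full((1, n), delta)]) ** 2     # append noise 'distance'
    inv = (1.0 / np.maximum(D2, 1e-12)) ** (1.0 / (m - 1.0))
    U = inv / inv.sum(axis=0, keepdims=True)             # (c+1, n) memberships
    return U[:c], U[c]                                    # regular vs noise share
```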
Abstract: This paper describes a new supervised fusion (hybrid) electrocardiogram (ECG) classification solution consisting of a new QRS-complex geometrical feature extraction and a new version of the learning vector quantization (LVQ) classification algorithm aimed at overcoming the stability-plasticity dilemma. Toward this objective, after detection and delineation of the major events of the ECG signal via an appropriate algorithm, each QRS region and its corresponding discrete wavelet transform (DWT) are treated as virtual images, and each is divided into eight polar sectors. Then, the curve length of each excerpted segment is calculated and used as an element of the feature space. To increase the robustness of the proposed classification algorithm against noise, artifacts and arrhythmic outliers, a fusion structure consisting of five different classifiers, namely a Support Vector Machine (SVM), a Modified Learning Vector Quantization (MLVQ) classifier, and three Multi-Layer Perceptron-Back-Propagation (MLP-BP) neural networks with different topologies, was designed and implemented. The new proposed algorithm was applied to all 48 MIT-BIH Arrhythmia Database records (within-record analysis), and the discrimination power of the classifier in separating the beat types of each record was assessed, yielding an average accuracy of Acc = 98.51%. The proposed method was also applied to six arrhythmia classes (Normal, LBBB, RBBB, PVC, APB, PB) belonging to 20 records of the same database (between-record analysis), achieving an average Acc = 95.6%. To evaluate the performance of the new hybrid learning machine, the obtained results were compared with similar peer-reviewed studies in this area.
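An illustrative reading of the polar-sector curve-length feature (the choice of origin, scaling and sector layout are assumptions, since the abstract does not fix them):

```python
import numpy as np

def polar_sector_curve_lengths(qrs, n_sectors=8):
    """Treat the QRS segment as a 2-D curve (normalized time vs. amplitude),
    assign each sample to one of n_sectors polar sectors around the curve's
    centroid, and sum the arc length falling inside each sector."""
    t = np.linspace(-1.0, 1.0, len(qrs))
    pts = np.column_stack([t, qrs])
    ctr = pts.mean(axis=0)
    ang = np.arctan2(pts[:, 1] - ctr[1], pts[:, 0] - ctr[0])
    sector = ((ang + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    step = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # per-step arc length
    feats = np.zeros(n_sectors)
    for s in range(n_sectors):
        feats[s] = step[sector[:-1] == s].sum()
    return feats    # eight curve-length features per QRS 'virtual image'
```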
Abstract: This paper discusses the classification process for medical data. We use the data from ACM KDDCup 2008 to demonstrate our classification process based on latent topic discovery. In this data set, the target set and the outliers are quite different in nature: the target set makes up only 0.6% of the data, while the outliers constitute the remaining 99.4%. We use this data set as an example to show how we dealt with an extremely biased data set using latent topic discovery and noise reduction techniques. Our experiment faces two major challenges: (1) extremely distributed outliers, and (2) far fewer positive samples than negative ones. We propose a suitable process flow to deal with these issues and achieve a best AUC of 0.98.
Abstract: Intelligent systems based on machine learning techniques, such as classification and clustering, are gaining widespread popularity in real-world applications. This paper presents work on developing a software system for predicting crop yield, for example oil-palm yield, from climate and plantation data. At the core of our system is a method for unsupervised partitioning of data to find spatio-temporal patterns in climate data using kernel methods, which offer the strength to deal with complex data. This work draws inspiration from the notion that a non-linear transformation of the data into some high-dimensional feature space increases the possibility of linear separability of the patterns in the transformed space and thereby simplifies the exploration of the associated structure in the data. Kernel methods implicitly perform such a non-linear mapping by replacing inner products with an appropriate positive definite function. In this paper we present a robust weighted kernel k-means algorithm incorporating spatial constraints for clustering the data. The proposed algorithm can effectively handle noise, outliers and auto-correlation in the spatial data, enabling effective and efficient data analysis by exploring patterns and structures in the data, and thus can be used for predicting oil-palm yield by analyzing the various factors affecting the yield.
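A compact sketch of weighted kernel k-means (the spatial-constraint and robustness terms of the proposed algorithm are omitted; only the kernelized, weighted assignment step is shown):

```python
import numpy as np

def weighted_kernel_kmeans(K, w, k, n_iter=20, seed=0):
    """K is a precomputed kernel (Gram) matrix and w are per-point weights that
    can down-weight noisy points/outliers. Distances from each point to each
    cluster centroid are computed implicitly in feature space."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            mask = labels == c
            wc = w[mask]
            sw = wc.sum()
            if sw == 0:
                dist[:, c] = np.inf
                continue
            second = K[:, mask] @ wc / sw                       # cross term
            third = wc @ K[np.ix_(mask, mask)] @ wc / sw ** 2   # centroid norm
            dist[:, c] = np.diag(K) - 2 * second + third
        new = dist.argmin(axis=1)
        if (new == labels).all():
            break
        labels = new
    return labels
```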
Abstract: With the enormous growth of the web, users easily get lost in its rich hyper-structure. Developing user-friendly, automated tools that provide relevant information without redundant links is therefore a primary task for website owners. Most existing web mining algorithms have concentrated on finding frequent patterns while neglecting the less frequent ones, which are likely to contain outlying data such as noise and irrelevant or redundant data. This paper proposes a new algorithm for mining web content by detecting redundant links in web documents using set-theoretic operations from classical mathematics, such as subset, union and intersection. The redundant links are then removed from the original web content so that the user obtains the required information.
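A toy illustration of the set-theoretic idea (page names and links are invented; the paper's exact redundancy criterion may differ): subset and intersection tests over the link sets of pages expose links that carry no new information:

```python
pages = {
    "index.html":   {"a.html", "b.html", "c.html"},
    "sitemap.html": {"a.html", "b.html", "c.html"},   # duplicates index.html
    "news.html":    {"a.html", "d.html"},
}

def redundant_links(pages):
    """Find links shared by pages whose link sets stand in a subset relation;
    such links are reachable elsewhere and add only redundancy."""
    redundant = set()
    names = list(pages)
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if pages[p] <= pages[q] or pages[q] <= pages[p]:   # subset test
                redundant |= pages[p] & pages[q]               # intersection
    return redundant

print(redundant_links(pages))   # {'a.html', 'b.html', 'c.html'}
```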
Abstract: Video mosaicing is the stitching of selected frames of a video by estimating the camera motion between frames and thereby registering successive frames to arrive at the mosaic. Different techniques have been proposed in the literature for video mosaicing, yet despite the large number of papers dealing with mosaic generation, only a few authors have investigated the conditions under which these techniques produce good estimates of the motion parameters. In this paper, these techniques are studied on different videos and the reasons for their failures are identified. We propose algorithms that incorporate outlier removal for better estimation of the motion parameters.
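RANSAC is the usual outlier-removal choice in this setting (the paper's exact scheme is not specified here); a minimal sketch with OpenCV on made-up point matches:

```python
import numpy as np
import cv2

# src_pts / dst_pts stand in for matched feature locations in two consecutive
# frames; here the true motion is a pure translation of (5, 3).
src_pts = np.float32([[10, 10], [200, 30], [50, 180], [220, 200], [120, 90]])
dst_pts = src_pts + np.float32([5, 3])

# RANSAC rejects matches on moving objects before the homography is refined.
H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC,
                                    ransacReprojThreshold=3.0)
print("inter-frame homography:\n", H)
print("inliers:", inlier_mask.ravel().astype(bool))
```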