Abstract: This research is aimed to describe the application of robust regression and its advantages over the least square regression method in analyzing financial data. To do this, relationship between earning per share, book value of equity per share and share price as price model and earning per share, annual change of earning per share and return of stock as return model is discussed using both robust and least square regressions, and finally the outcomes are compared. Comparing the results from the robust regression and the least square regression shows that the former can provide the possibility of a better and more realistic analysis owing to eliminating or reducing the contribution of outliers and influential data. Therefore, robust regression is recommended for getting more precise results in financial data analysis.
Abstract: In this paper, a method to detect multiple ellipses is presented. The technique is efficient and robust against incomplete ellipses due to partial occlusion, noise or missing edges and outliers. It is an iterative technique that finds and removes the best ellipse until no reasonable ellipse is found. At each run, the best ellipse is extracted from randomly selected edge patches, its fitness calculated and compared to a fitness threshold. RANSAC algorithm is applied as a sampling process together with the Direct Least Square fitting of ellipses (DLS) as the fitting algorithm. In our experiment, the method performs very well and is robust against noise and spurious edges on both synthetic and real-world image data.
Abstract: This paper describes a new method for affine parameter
estimation between image sequences. Usually, the parameter
estimation techniques can be done by least squares in a quadratic
way. However, this technique can be sensitive to the presence
of outliers. Therefore, parameter estimation techniques for various
image processing applications are robust enough to withstand the
influence of outliers. Progressively, some robust estimation functions
demanding non-quadratic and perhaps non-convex potentials adopted
from statistics literature have been used for solving these. Addressing
the optimization of the error function in a factual framework for
finding a global optimal solution, the minimization can begin with
the convex estimator at the coarser level and gradually introduce nonconvexity
i.e., from soft to hard redescending non-convex estimators
when the iteration reaches finer level of multiresolution pyramid.
Comparison has been made to find the performance of the results
of proposed method with the results found individually using two
different estimators.
Abstract: A clustering is process to identify a homogeneous
groups of object called as cluster. Clustering is one interesting topic
on data mining. A group or class behaves similarly characteristics.
This paper discusses a robust clustering process for data images with
two reduction dimension approaches; i.e. the two dimensional
principal component analysis (2DPCA) and principal component
analysis (PCA). A standard approach to overcome this problem is
dimension reduction, which transforms a high-dimensional data into
a lower-dimensional space with limited loss of information. One of
the most common forms of dimensionality reduction is the principal
components analysis (PCA). The 2DPCA is often called a variant of
principal component (PCA), the image matrices were directly treated
as 2D matrices; they do not need to be transformed into a vector so
that the covariance matrix of image can be constructed directly using
the original image matrices. The decomposed classical covariance
matrix is very sensitive to outlying observations. The objective of
paper is to compare the performance of robust minimizing vector
variance (MVV) in the two dimensional projection PCA (2DPCA)
and the PCA for clustering on an arbitrary data image when outliers
are hiden in the data set. The simulation aspects of robustness and
the illustration of clustering images are discussed in the end of
paper
Abstract: Outlier detection in streaming data is very challenging because streaming data cannot be scanned multiple times and also new concepts may keep evolving. Irrelevant attributes can be termed as noisy attributes and such attributes further magnify the challenge of working with data streams. In this paper, we propose an unsupervised outlier detection scheme for streaming data. This scheme is based on clustering as clustering is an unsupervised data mining task and it does not require labeled data, both density based and partitioning clustering are combined for outlier detection. In this scheme partitioning clustering is also used to assign weights to attributes depending upon their respective relevance and weights are adaptive. Weighted attributes are helpful to reduce or remove the effect of noisy attributes. Keeping in view the challenges of streaming data, the proposed scheme is incremental and adaptive to concept evolution. Experimental results on synthetic and real world data sets show that our proposed approach outperforms other existing approach (CORM) in terms of outlier detection rate, false alarm rate, and increasing percentages of outliers.
Abstract: The emergence of the Internet has brewed the
revolution of information storage and retrieval. As most of the
data in the web is unstructured, and contains a mix of text,
video, audio etc, there is a need to mine information to cater to
the specific needs of the users without loss of important
hidden information. Thus developing user friendly and
automated tools for providing relevant information quickly
becomes a major challenge in web mining research. Most of
the existing web mining algorithms have concentrated on
finding frequent patterns while neglecting the less frequent
ones that are likely to contain outlying data such as noise,
irrelevant and redundant data. This paper mainly focuses on
Signed approach and full word matching on the organized
domain dictionary for mining web content outliers. This
Signed approach gives the relevant web documents as well as
outlying web documents. As the dictionary is organized based
on the number of characters in a word, searching and retrieval
of documents takes less time and less space.
Abstract: The stereophotogrammetry modality is gaining more widespread use in the clinical setting. Registration and visualization of this data, in conjunction with conventional 3D volumetric image modalities, provides virtual human data with textured soft tissue and internal anatomical and structural information. In this investigation computed tomography (CT) and stereophotogrammetry data is acquired from 4 anatomical phantoms and registered using the trimmed iterative closest point (TrICP) algorithm. This paper fully addresses the issue of imaging artifacts around the stereophotogrammetry surface edge using the registered CT data as a reference. Several iterative algorithms are implemented to automatically identify and remove stereophotogrammetry surface edge outliers, improving the overall visualization of the combined stereophotogrammetry and CT data. This paper shows that outliers at the surface edge of stereophotogrammetry data can be successfully removed automatically.
Abstract: The weighting exponent m is called the fuzzifier that
can have influence on the clustering performance of fuzzy c-means
(FCM) and mÎ[1.5,2.5] is suggested by Pal and Bezdek [13]. In this
paper, we will discuss the robust properties of FCM and show that the
parameter m will have influence on the robustness of FCM. According
to our analysis, we find that a large m value will make FCM more
robust to noise and outliers. However, if m is larger than the theoretical
upper bound proposed by Yu et al. [14], the sample mean will become
the unique optimizer. Here, we suggest to implement the FCM
algorithm with mÎ[1.5,4] under the restriction when m is smaller
than the theoretical upper bound.
Abstract: In large datasets, identifying exceptional or rare cases
with respect to a group of similar cases is considered very significant
problem. The traditional problem (Outlier Mining) is to find
exception or rare cases in a dataset irrespective of the class label of
these cases, they are considered rare events with respect to the whole
dataset. In this research, we pose the problem that is Class Outliers
Mining and a method to find out those outliers. The general
definition of this problem is “given a set of observations with class
labels, find those that arouse suspicions, taking into account the
class labels". We introduce a novel definition of Outlier that is Class
Outlier, and propose the Class Outlier Factor (COF) which measures
the degree of being a Class Outlier for a data object. Our work
includes a proposal of a new algorithm towards mining of the Class
Outliers, presenting experimental results applied on various domains
of real world datasets and finally a comparison study with other
related methods is performed.
Abstract: The backpropagation algorithm in general employs quadratic error function. In fact, most of the problems that involve minimization employ the Quadratic error function. With alternative error functions the performance of the optimization scheme can be improved. The new error functions help in suppressing the ill-effects of the outliers and have shown good performance to noise. In this paper we have tried to evaluate and compare the relative performance of complex valued neural network using different error functions. During first simulation for complex XOR gate it is observed that some error functions like Absolute error, Cauchy error function can replace Quadratic error function. In the second simulation it is observed that for some error functions the performance of the complex valued neural network depends on the architecture of the network whereas with few other error functions convergence speed of the network is independent of architecture of the neural network.
Abstract: This paper presents a supervised clustering algorithm,
namely Grid-Based Supervised Clustering (GBSC), which is able to
identify clusters of any shapes and sizes without presuming any
canonical form for data distribution. The GBSC needs no prespecified
number of clusters, is insensitive to the order of the input
data objects, and is capable of handling outliers. Built on the
combination of grid-based clustering and density-based clustering,
under the assistance of the downward closure property of density
used in bottom-up subspace clustering, the GBSC can notably reduce
its search space to avoid the memory confinement situation during its
execution. On two-dimension synthetic datasets, the GBSC can
identify clusters with different shapes and sizes correctly. The GBSC
also outperforms other five supervised clustering algorithms when
the experiments are performed on some UCI datasets.
Abstract: An algorithm for learning an overcomplete dictionary
using a Cauchy mixture model for sparse decomposition of an underdetermined
mixing system is introduced. The mixture density
function is derived from a ratio sample of the observed mixture
signals where 1) there are at least two but not necessarily more
mixture signals observed, 2) the source signals are statistically
independent and 3) the sources are sparse. The basis vectors of the
dictionary are learned via the optimization of the location parameters
of the Cauchy mixture components, which is shown to be more
accurate and robust than the conventional data mining methods
usually employed for this task. Using a well known sparse
decomposition algorithm, we extract three speech signals from two
mixtures based on the estimated dictionary. Further tests with
additive Gaussian noise are used to demonstrate the proposed
algorithm-s robustness to outliers.
Abstract: Data clustering is an important data exploration technique
with many applications in data mining. We present an enhanced
version of the well known single link clustering algorithm. We will
refer to this algorithm as DCBOR. The proposed algorithm alleviates
the chain effect by removing the outliers from the given dataset.
So this algorithm provides outlier detection and data clustering
simultaneously. This algorithm does not need to update the distance
matrix, since the algorithm depends on merging the most k-nearest
objects in one step and the cluster continues grow as long as possible
under specified condition. So the algorithm consists of two phases;
at the first phase, it removes the outliers from the input dataset. At
the second phase, it performs the clustering process. This algorithm
discovers clusters of different shapes, sizes, densities and requires
only one input parameter; this parameter represents a threshold for
outlier points. The value of the input parameter is ranging from 0 to
1. The algorithm supports the user in determining an appropriate
value for it. We have tested this algorithm on different datasets
contain outlier and connecting clusters by chain of density points,
and the algorithm discovers the correct clusters. The results of
our experiments demonstrate the effectiveness and the efficiency of
DCBOR.
Abstract: One of the purposes of the robust method of
estimation is to reduce the influence of outliers in the data, on the
estimates. The outliers arise from gross errors or contamination from
distributions with long tails. The trimmed mean is a robust estimate.
This means that it is not sensitive to violation of distributional
assumptions of the data. It is called an adaptive estimate when the
trimming proportion is determined from the data rather than being
fixed a “priori-.
The main objective of this study is to find out the robustness
properties of the adaptive trimmed means in terms of efficiency, high
breakdown point and influence function. Specifically, it seeks to find
out the magnitude of the trimming proportion of the adaptive
trimmed mean which will yield efficient and robust estimates of the
parameter for data which follow a modified Weibull distribution with
parameter λ = 1/2 , where the trimming proportion is determined by a
ratio of two trimmed means defined as the tail length. Secondly, the
asymptotic properties of the tail length and the trimmed means are
also investigated. Finally, a comparison is made on the efficiency of
the adaptive trimmed means in terms of the standard deviation for the
trimming proportions and when these were fixed a “priori".
The asymptotic tail lengths defined as the ratio of two trimmed
means and the asymptotic variances were computed by using the
formulas derived. While the values of the standard deviations for the
derived tail lengths for data of size 40 simulated from a Weibull
distribution were computed for 100 iterations using a computer
program written in Pascal language.
The findings of the study revealed that the tail lengths of the
Weibull distribution increase in magnitudes as the trimming
proportions increase, the measure of the tail length and the adaptive
trimmed mean are asymptotically independent as the number of
observations n becomes very large or approaching infinity, the tail
length is asymptotically distributed as the ratio of two independent
normal random variables, and the asymptotic variances decrease as
the trimming proportions increase. The simulation study revealed
empirically that the standard error of the adaptive trimmed mean
using the ratio of tail lengths is relatively smaller for different values
of trimming proportions than its counterpart when the trimming
proportions were fixed a 'priori'.
Abstract: The detection of outliers is very essential because of
their responsibility for producing huge interpretative problem in
linear as well as in nonlinear regression analysis. Much work has
been accomplished on the identification of outlier in linear
regression, but not in nonlinear regression. In this article we propose
several outlier detection techniques for nonlinear regression. The
main idea is to use the linear approximation of a nonlinear model and
consider the gradient as the design matrix. Subsequently, the
detection techniques are formulated. Six detection measures are
developed that combined with three estimation techniques such as the
Least-Squares, M and MM-estimators. The study shows that among
the six measures, only the studentized residual and Cook Distance
which combined with the MM estimator, consistently capable of
identifying the correct outliers.
Abstract: This paper presents a procedure for estimating VAR
using Sequential Discounting VAR (SDVAR) algorithm for online
model learning to detect fraudulent acts using the telecommunications
call detailed records (CDR). The volatility of the VAR is observed
allowing for non-linearity, outliers and change points based on the
works of [1]. This paper extends their procedure from univariate
to multivariate time series. A simulation and a case study for
detecting telecommunications fraud using CDR illustrate the use of
the algorithm in the bivariate setting.
Abstract: To solve the problem of multisensor data fusion under
non-Gaussian channel noise. The advanced M-estimates are known
to be robust solution while trading off some accuracy. In order to
improve the estimation accuracy while still maintaining the equivalent
robustness, a two-stage robust fusion algorithm is proposed using
preliminary rejection of outliers then an optimal linear fusion. The
numerical experiments show that the proposed algorithm is equivalent
to the M-estimates in the case of uncorrelated local estimates and
significantly outperforms the M-estimates when local estimates are
correlated.
Abstract: Corner detection and optical flow are common techniques for feature-based video stabilization. However, these algorithms are computationally expensive and should be performed at a reasonable rate. This paper presents an algorithm for discarding irrelevant feature points and maintaining them for future use so as to improve the computational cost. The algorithm starts by initializing a maintained set. The feature points in the maintained set are examined against its accuracy for modeling. Corner detection is required only when the feature points are insufficiently accurate for future modeling. Then, optical flows are computed from the maintained feature points toward the consecutive frame. After that, a motion model is estimated based on the simplified affine motion model and least square method, with outliers belonging to moving objects presented. Studentized residuals are used to eliminate such outliers. The model estimation and elimination processes repeat until no more outliers are identified. Finally, the entire algorithm repeats along the video sequence with the points remaining from the previous iteration used as the maintained set. As a practical application, an efficient video stabilization can be achieved by exploiting the computed motion models. Our study shows that the number of times corner detection needs to perform is greatly reduced, thus significantly improving the computational cost. Moreover, optical flow vectors are computed for only the maintained feature points, not for outliers, thus also reducing the computational cost. In addition, the feature points after reduction can sufficiently be used for background objects tracking as demonstrated in the simple video stabilizer based on our proposed algorithm.
Abstract: On-line (near infrared) spectroscopy is widely used to support the operation of complex process systems. Information extracted from spectral database can be used to estimate unmeasured product properties and monitor the operation of the process. These techniques are based on looking for similar spectra by nearest neighborhood algorithms and distance based searching methods. Search for nearest neighbors in the spectral space is an NP-hard problem, the computational complexity increases by the number of points in the discrete spectrum and the number of samples in the database. To reduce the calculation time some kind of indexing could be used. The main idea presented in this paper is to combine indexing and visualization techniques to reduce the computational requirement of estimation algorithms by providing a two dimensional indexing that can also be used to visualize the structure of the spectral database. This 2D visualization of spectral database does not only support application of distance and similarity based techniques but enables the utilization of advanced clustering and prediction algorithms based on the Delaunay tessellation of the mapped spectral space. This means the prediction has not to use the high dimension space but can be based on the mapped space too. The results illustrate that the proposed method is able to segment (cluster) spectral databases and detect outliers that are not suitable for instance based learning algorithms.
Abstract: In the last decades, a number of robust fuzzy clustering algorithms have been proposed to partition data sets affected by noise and outliers. Robust fuzzy C-means (robust-FCM) is certainly one of the most known among these algorithms. In robust-FCM, noise is modeled as a separate cluster and is characterized by a prototype that has a constant distance δ from all data points. Distance δ determines the boundary of the noise cluster and therefore is a critical parameter of the algorithm. Though some approaches have been proposed to automatically determine the most suitable δ for the specific application, up to today an efficient and fully satisfactory solution does not exist. The aim of this paper is to propose a novel method to compute the optimal δ based on the analysis of the distribution of the percentage of objects assigned to the noise cluster in repeated executions of the robust-FCM with decreasing values of δ . The extremely encouraging results obtained on some data sets found in the literature are shown and discussed.