Abstract: t-SNE is an embedding method widely used by the data science community. It supports two main tasks: displaying results by coloring items according to their class or a feature value, and forensic analysis, where it gives a first overview of the dataset distribution. Two interesting characteristics of t-SNE are its structure preservation property and its answer to the crowding problem, in which all neighbors in the high-dimensional space cannot be represented correctly in the low-dimensional space. t-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where the cluster area is proportional to the number of items it contains, and relationships between clusters are materialized by closeness in the embedding. The algorithm is non-parametric: the transformation from the high- to the low-dimensional space is described but not learned, so two initializations of the algorithm lead to two different embeddings. In a forensic approach, analysts would like to compare two or more datasets using their embeddings. A naive approach would be to embed all datasets together. However, this process is costly because the complexity of t-SNE is quadratic, and it becomes infeasible for too many datasets. Another approach would be to learn a parametric model over an embedding built with a subset of the data. While this approach is highly scalable, points could be mapped to exactly the same position, making them indistinguishable, and such a model would be unable to adapt to new outliers or concept drift. This paper presents a methodology to reuse an embedding to create a new one in which cluster positions are preserved. The optimization process minimizes two costs, one relative to the embedding shape and the second relative to the match with the support embedding. The embedding-with-support process can be repeated more than once, each time using the newly obtained embedding as support. The successive embeddings can be used to study the impact of one variable on the dataset distribution or to monitor changes over time. This method has the same complexity as t-SNE per embedding, and memory requirements are only doubled. For a dataset of n elements sorted and split into k subsets, the total embedding complexity would be reduced from O(n^2) to O(n^2/k), and the memory requirement from n^2 to 2(n/k)^2, which enables computation on recent laptops. The method showed promising results on a real-world dataset, allowing the birth, evolution and death of clusters to be observed. The proposed approach facilitates identifying significant trends and changes, which empowers the monitoring of the dynamics of high-dimensional datasets.
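The complexity and memory figures quoted in this abstract can be checked with simple arithmetic. The sketch below assumes that the dominant cost is the dense pairwise matrix (n squared entries for one joint embedding), that n is divisible by k, and that each step holds two (n/k)-squared matrices at once; the function names are illustrative and do not come from the paper.

```python
def joint_cost(n):
    """Pairwise entries needed to embed all n points in a single t-SNE run."""
    return n * n


def chunked_cost(n, k):
    """Total pairwise work and peak memory for k successive embeddings.

    Assumption: n is divisible by k, and each step keeps two (n/k)^2
    matrices in memory, which is one reading of the statement that
    memory requirements are "only doubled".
    """
    m = n // k                  # points per subset
    total_work = k * m * m      # k runs of (n/k)^2 each, i.e. n^2 / k
    peak_memory = 2 * m * m     # 2 (n/k)^2 entries held at once
    return total_work, peak_memory


n, k = 100_000, 10
work, memory = chunked_cost(n, k)
print(joint_cost(n) // work)    # speed-up factor, equal to k
print(joint_cost(n) / memory)   # memory reduction factor, equal to k^2 / 2
```

With these assumed sizes, splitting into 10 subsets cuts the total pairwise work by a factor of 10 and the peak memory by a factor of 50.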
Abstract: This study presents a multivariate analysis of potato spectroscopic data to detect the presence or absence of brown rot disease. Near-infrared (NIR) spectroscopy (1,350-2,500 nm) combined with multivariate analysis was used as a rapid, non-destructive technique for the detection of brown rot disease in potatoes. Spectral measurements were performed on 565 samples, which were chosen randomly at the infection site in the potato slice. Of these, 254 infected and 311 uninfected (brown rot-free) samples were analyzed using different advanced statistical analysis techniques. The discrimination performance of different multivariate analysis techniques, including classification, pre-processing, and dimension reduction, was compared. Applying a random forest classifier with different pre-processing techniques to the raw spectra gave the best performance, achieving a total classification accuracy of 98.7% in discriminating infected potatoes from controls.
Abstract: Dimensionality reduction and feature extraction are of
crucial importance for achieving high efficiency in manipulating
high-dimensional data. Two-dimensional discriminant locality
preserving projection (2D-DLPP) and two-dimensional discriminant
supervised LPP (2D-DSLPP) are two effective two-dimensional
projection methods for dimensionality reduction and feature
extraction of face image matrices. Since 2D-DLPP and 2D-DSLPP
preserve the local structure information of the original data and
exploit the discriminant information, they usually have good
recognition performance. However, 2D-DLPP and 2D-DSLPP
only employ single-sided projection, and thus the generated
low-dimensional data matrices still have many features. In this paper,
by combining the discriminant supervised LPP with the bidirectional
projection, we propose the bidirectional discriminant supervised LPP
(BDSLPP). The left and right projection matrices for BDSLPP can
be computed iteratively. Experimental results show that the proposed
BDSLPP achieves higher recognition accuracy than 2D-DLPP,
2D-DSLPP, and bidirectional discriminant LPP (BDLPP).
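To illustrate why a bidirectional projection yields fewer features than a single-sided one, the following sketch compares the output shapes of Y = X R and Y = L^T X R on a toy matrix. It is only a shape demonstration: the projection matrices here are placeholder selectors, not matrices learned by BDSLPP, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def project_right(image, right):
    """Single-sided projection Y = X R: row count is kept, columns shrink."""
    return matmul(image, right)


def project_bidirectional(image, left, right):
    """Two-sided projection Y = L^T X R: rows and columns both shrink."""
    return matmul(matmul(transpose(left), image), right)


# Toy 4x4 "image". The 4x2 matrix below simply selects the first two
# coordinates; a real method would learn L and R from labelled training data.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
select = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

one_sided = project_right(image, select)
two_sided = project_bidirectional(image, select, select)
print(len(one_sided), len(one_sided[0]))   # 4 2
print(len(two_sided), len(two_sided[0]))   # 2 2
```

The single-sided result keeps all four rows (8 features), while the bidirectional result keeps only a 2x2 block (4 features).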
Abstract: One of the biggest challenges in nonparametric
regression is the curse of dimensionality. Additive models are known
to overcome this problem by estimating only the individual additive
effects of each covariate. However, if the model is misspecified, the
accuracy of the estimator compared to the fully nonparametric one
is unknown. In this work the efficiency of completely nonparametric
regression estimators such as the Loess is compared to the estimators
that assume additivity in several situations, including additive and
non-additive regression scenarios. The comparison is done by
computing the oracle mean square error of the estimators with regard
to the true nonparametric regression function. Then, a backward
elimination selection procedure based on the Akaike Information
Criterion is proposed, which is computed from either the additive or
the nonparametric model. Simulations show that if the additive model
is misspecified, the percentage of time it fails to select important
variables can be higher than that of the fully nonparametric approach.
A dimension reduction step is included when the nonparametric estimator
cannot be computed due to the curse of dimensionality. Finally, the
Boston housing dataset is analyzed using the proposed backward
elimination procedure and the selected variables are identified.
Abstract: Clustering is the process of identifying homogeneous
groups of objects, called clusters, and it is an active topic in data
mining. The members of a group or class share similar characteristics.
This paper discusses a robust clustering process for image data with
two dimension reduction approaches, i.e. two-dimensional
principal component analysis (2DPCA) and principal component
analysis (PCA). A standard approach to handling high-dimensional data is
dimension reduction, which transforms high-dimensional data into
a lower-dimensional space with limited loss of information. One of
the most common forms of dimensionality reduction is principal
component analysis (PCA). 2DPCA is often described as a variant of
PCA in which the image matrices are treated directly
as 2D matrices; they do not need to be transformed into vectors, so
the image covariance matrix can be constructed directly from
the original image matrices. However, the classical covariance
matrix that is decomposed is very sensitive to outlying observations.
The objective of this paper is to compare the performance of the robust
minimizing vector variance (MVV) method in 2DPCA
and in PCA for clustering arbitrary image data when outliers
are hidden in the data set. The robustness aspects of the simulation and
an illustration of image clustering are discussed at the end of the
paper.
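The construction of the image covariance matrix directly from the image matrices, as described above, can be sketched as follows. This is the classical (non-robust) estimate G = (1/M) * sum of (A_i - mean)^T (A_i - mean), not the MVV estimator studied in the paper, and the function name and toy data are illustrative assumptions.

```python
def image_covariance(images):
    """Classical 2DPCA image covariance G = (1/M) * sum_i (A_i - mean)^T (A_i - mean).

    Each image is an m x n list of rows. G is n x n, so no image has to be
    flattened into a vector of length m * n.
    """
    count = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    # Pixel-wise mean image.
    mean = [[sum(img[r][c] for img in images) / count for c in range(cols)]
            for r in range(rows)]
    cov = [[0.0] * cols for _ in range(cols)]
    for img in images:
        centered = [[img[r][c] - mean[r][c] for c in range(cols)] for r in range(rows)]
        for i in range(cols):
            for j in range(cols):
                cov[i][j] += sum(centered[r][i] * centered[r][j]
                                 for r in range(rows)) / count
    return cov


# Two toy 3x2 "images" that differ only in their first column.
images = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[3.0, 2.0], [5.0, 4.0], [7.0, 6.0]],
]
G = image_covariance(images)
print(len(G), len(G[0]))   # 2 2
```

For these 3x2 images G is 2x2, whereas PCA on the flattened 6-element vectors would need a 6x6 covariance matrix. Because every image contributes equally to the mean and to G, a single outlying image can distort both, which is the sensitivity the abstract refers to.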
Abstract: In this paper, a new face recognition method based on
PCA (Principal Component Analysis), LDA (Linear Discriminant
Analysis) and neural networks is proposed. This method consists of
four steps: i) preprocessing, ii) dimension reduction using PCA, iii)
feature extraction using LDA and iv) classification using a neural
network. The combination of PCA and LDA is used to improve the
capability of LDA when only a few image samples are available, and a
neural classifier is used to reduce the number of misclassifications
caused by classes that are not linearly separable. The proposed method
was tested on the Yale face database. Experimental results on this database
demonstrated the effectiveness of the proposed method for face
recognition, with fewer misclassifications than previous
methods.
Abstract: Analysis and visualization of microarray data is of great assistance to biologists and clinicians in the field of diagnosis and treatment of patients. It allows clinicians to better understand the structure of microarray data and facilitates understanding gene expression in cells. However, a microarray dataset is complex, with thousands of features and a very small number of observations. Such very high-dimensional data often contain noise, non-useful information and only a small number of features relevant to a disease or genotype. This paper proposes a non-linear dimensionality reduction algorithm, Local Principal Component (LPC), which aims to map high-dimensional data to a lower-dimensional space. The reduced data represent the most important variables underlying the original data. Experimental results and comparisons are presented to show the quality of the proposed algorithm. Moreover, experiments also show how this algorithm reduces high-dimensional data whilst preserving, in the low-dimensional space, the neighbourhoods that the points have in the high-dimensional space.
Abstract: Cosmic showers, from their places of origin in space,
after entering Earth generate secondary particles called Extensive Air
Showers (EAS). Detection and analysis of EAS and similar High
Energy Particle Showers involve a plethora of experimental setups
with certain constraints for which soft-computational tools like
Artificial Neural Networks (ANNs) can be adopted. The optimality
of ANN classifiers can be enhanced further by the use of a Multiple
Classifier System (MCS) and certain data dimension reduction
techniques. This work describes the performance of certain data
dimension reduction techniques like Principal Component Analysis
(PCA), Independent Component Analysis (ICA) and Self Organizing
Map (SOM) approximators for application with an MCS formed
using Multi Layer Perceptron (MLP), Recurrent Neural Network
(RNN) and Probabilistic Neural Network (PNN). The data inputs are
obtained from an array of detectors placed in a circular arrangement
resembling a practical detector grid which have a higher dimension
and greater correlation among themselves. The PCA, ICA and SOM
blocks reduce the correlation and generate a form suitable for real
time practical applications for prediction of primary energy and
location of EAS from density values captured using detectors in a
circular grid.
Abstract: In this paper a new approach to face recognition is presented that achieves double dimension reduction, making the system computationally efficient with better recognition results. In pattern recognition techniques, the discriminative information of an image increases with resolution up to a certain extent; consequently, face recognition results improve as face image resolution increases and level off at a certain resolution level. In the proposed model of face recognition, an image decimation algorithm is first applied to the face image for dimension reduction to the resolution level that provides the best recognition results. The Discrete Cosine Transform (DCT) is then applied to the face image because of its computational speed and feature extraction potential. A subset of DCT coefficients from low to mid frequencies that represents the face adequately and provides the best recognition results is retained. A trade-off between the decimation factor, the number of DCT coefficients retained and the recognition rate with minimum computation is obtained. Preprocessing of the image is carried out to increase robustness against variations in pose and illumination level. This new model has been tested on different databases, which include the ORL database, the Yale database and a color database. The proposed technique has performed much better than other techniques. The significance of the model is twofold: (1) dimension reduction down to an effective and suitable face image resolution, and (2) retention of appropriate DCT coefficients to achieve the best recognition results under varying image pose, intensity and illumination level.
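The two reduction stages described in this abstract, decimation followed by retention of low-to-mid-frequency DCT coefficients, can be sketched as below. This is a minimal illustration under stated assumptions: decimation is plain subsampling without an anti-alias filter, coefficients are ordered by u + v as a simple stand-in for a zigzag scan, and the decimation factor and coefficient count are arbitrary rather than the tuned values from the paper.

```python
import math


def decimate(image, factor):
    """Keep every `factor`-th pixel in both directions (no anti-alias filter)."""
    return [row[::factor] for row in image[::factor]]


def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, x in enumerate(signal)))
    return out


def dct_2d(image):
    """Separable 2-D DCT: transform each row, then each column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def low_frequency_features(coeffs, count):
    """Return the `count` coefficients with the smallest u + v (lowest frequencies first)."""
    order = sorted((u + v, u, v) for u in range(len(coeffs))
                   for v in range(len(coeffs[0])))
    return [coeffs[u][v] for _, u, v in order[:count]]


# Stand-in 16x16 "face image" with synthetic pixel values.
face = [[float((r * c) % 7) for c in range(16)] for r in range(16)]
small = decimate(face, 2)                               # 16x16 -> 8x8
features = low_frequency_features(dct_2d(small), 10)    # 64 coefficients -> 10
print(len(small), len(features))                        # 8 10
```

Here a 256-pixel image is reduced first to 64 pixels and then to a 10-element feature vector; in practice both the decimation factor and the number of retained coefficients would be chosen by measuring recognition rate, as the abstract describes.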
Abstract: Early diagnostic decision making in industrial processes is essential for producing high-quality final products. It helps to provide early warning of a special event in a process and to find its assignable cause. This work presents hybrid diagnostic schemes for batch processes. A nonlinear representation of raw process data is combined with classification tree techniques. Nonlinear kernel-based dimension reduction is performed to obtain nonlinear classification decision boundaries for fault classes. In order to enhance diagnosis performance for batch processes, the data are filtered to remove irrelevant information. To assess the diagnosis performance of several representation, filtering, and future observation estimation methods, four diagnostic schemes are evaluated. The performance of the presented diagnosis schemes is demonstrated using batch process data.
Abstract: The self-organizing map (SOM) model is a well-known neural network model with a wide range of applications. The main characteristics of SOM are twofold, namely dimension reduction and topology preservation. Using SOM, a high-dimensional data space is mapped to a low-dimensional space while the topological relations among the data are preserved. Because of these characteristics, SOM has usually been applied to data clustering and visualization tasks. However, the main disadvantage of SOM is the need to know the number and structure of neurons prior to training, which are difficult to determine. Several schemes have been proposed to tackle this deficiency. Examples are growing/expandable SOM, hierarchical SOM, and growing hierarchical SOM. These schemes can dynamically expand the map, and even generate hierarchical maps, during training. Encouraging results were reported. Basically, these schemes adapt the size and structure of the map according to the distribution of the training data. That is, they are data-driven or data-oriented SOM schemes. In this work, a topic-oriented SOM scheme suitable for document clustering and organization is developed. The proposed SOM automatically adapts the number of neurons as well as the structure of the map according to identified topics. Unlike other data-oriented SOMs, our approach expands the map and generates the hierarchies according to both the topics and the characteristics of the neurons. The preliminary experiments give promising results and demonstrate the plausibility of the method.
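For readers unfamiliar with the basic SOM, one training step of a standard fixed-size map can be sketched as follows. This is the classical SOM update with a Gaussian neighbourhood, not the topic-oriented scheme proposed in the paper; the map size, learning rate, radius and function names are illustrative assumptions.

```python
import math


def best_matching_unit(weights, sample):
    """Grid position of the neuron whose weight vector is closest to `sample`."""
    return min(weights, key=lambda pos: sum((w - x) ** 2
                                            for w, x in zip(weights[pos], sample)))


def train_step(weights, sample, rate, radius):
    """Move the winning neuron and its grid neighbours toward `sample`."""
    bmu = best_matching_unit(weights, sample)
    for pos, vector in weights.items():
        grid_dist2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        # Gaussian neighbourhood: neurons near the winner on the grid move more.
        influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
        weights[pos] = [w + rate * influence * (x - w) for w, x in zip(vector, sample)]
    return bmu


# Fixed 2x2 map over 3-dimensional inputs. The map size has to be chosen
# before training, which is the limitation the abstract describes.
weights = {(r, c): [0.1 * (2 * r + c)] * 3 for r in range(2) for c in range(2)}
winner = train_step(weights, [0.3, 0.3, 0.3], rate=0.5, radius=1.0)
print(winner)   # (1, 1)
```

Topology preservation comes from the neighbourhood term: neurons adjacent on the grid are pulled toward the same samples, so nearby grid positions end up representing nearby regions of the input space.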
Abstract: In this paper a new approach to face recognition is
presented that achieves double dimension reduction, making the
system computationally efficient with better recognition results and
outperforming the common DCT technique of face recognition. In pattern
recognition techniques, the discriminative information of an image
increases with resolution up to a certain extent; consequently,
face recognition results change with face image resolution
and are optimal at a certain resolution
level. In the proposed model of face recognition, an image
decimation algorithm is first applied to the face image for dimension
reduction to the resolution level that provides the best
recognition results. The Discrete Cosine Transform (DCT) is
applied to the face image because of its computational speed and feature
extraction potential. A subset of DCT coefficients from low to
mid frequencies that represents the face adequately and provides the best
recognition results is retained. A trade-off between the decimation factor,
the number of DCT coefficients retained and the recognition rate with
minimum computation is obtained. Preprocessing of the image is
carried out to increase robustness against variations in pose and
illumination level. This new model has been tested on different
databases, which include the ORL, Yale and EME color databases.