Abstract: Recently, the issue of machine condition monitoring and fault diagnosis as a part of maintenance systems has attracted global attention due to the potential advantages to be gained from reduced maintenance costs, improved productivity, and increased machine
availability. The aim of this work is to investigate the effectiveness
of a new fault diagnosis method based on power spectral density
(PSD) of vibration signals in combination with decision trees and
fuzzy inference system (FIS). To this end, a series of studies was
conducted on an external gear hydraulic pump. After a test under normal conditions, a number of different machine defect conditions
were introduced for three working levels of pump speed (1000, 1500,
and 2000 rpm), corresponding to (i) Journal-bearing with inner face
wear (BIFW), (ii) Gear with tooth face wear (GTFW), and (iii)
Journal-bearing with inner face wear plus Gear with tooth face wear
(B&GW). The features of PSD values of vibration signal were
extracted using descriptive statistical parameters. The J48 algorithm was used as a feature selection procedure to select pertinent features from the data set. The output of the J48 algorithm was employed to produce the crisp if-then rules and membership function sets. The structure of the FIS
classifier was then defined based on the crisp sets. In order to
evaluate the proposed PSD-J48-FIS model, the data sets obtained
from vibration signals of the pump were used. Results showed that the total classification accuracies for the 1000, 1500, and 2000 rpm conditions were 96.42%, 100%, and 96.42%, respectively. The results
indicate that the combined PSD-J48-FIS model has the potential for
fault diagnosis of hydraulic pumps.
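A minimal sketch of the feature extraction stage described above: the PSD of a vibration signal is estimated and summarized by descriptive statistics. Welch's method, the sampling rate, and the particular statistics are assumptions; the J48 and FIS stages of the paper would consume features of this kind.

```python
import numpy as np
from scipy.signal import welch
from scipy.stats import kurtosis, skew

# Estimate the power spectral density of a vibration signal and extract
# descriptive statistical features from it (statistic choice is assumed).
def psd_features(signal, fs):
    _, psd = welch(signal, fs=fs, nperseg=1024)
    return {"mean": psd.mean(), "std": psd.std(), "max": psd.max(),
            "skewness": skew(psd), "kurtosis": kurtosis(psd)}

fs = 10_000                                        # assumed sampling rate, Hz
signal = np.random.default_rng(0).normal(size=fs)  # stand-in vibration data
print(psd_features(signal, fs))
```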
Abstract: This paper describes an automated event detection and location system for water distribution pipelines which is based upon low-cost sensor technology and signature analysis by an Artificial
Neural Network (ANN). A low-cost failure sensor that measures the opacity, or cloudiness, of the local water flow has been designed, developed, and validated, and an ANN-based system is then described which uses time series data produced by the sensors to construct an empirical model for time series prediction and
classification of events. These two components have been installed,
tested and verified in an experimental site in a UK water distribution
system. Verification of the system has been achieved from a series of
simulated burst trials which have provided real data sets. It is concluded that the system has potential in water distribution network
management.
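A minimal sketch of the ANN component under stated assumptions: a sliding window of past sensor readings is used to predict the next value, and a large prediction error would flag a candidate event. The window length, network size, and stand-in data are assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Fit a small neural network to predict the next reading from a window
# of past readings; prediction residuals can then flag anomalous events.
def fit_predictor(series, window=10):
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000,
                        random_state=0).fit(X, y)

series = np.sin(np.linspace(0, 20, 500))       # stand-in opacity readings
model = fit_predictor(series)
pred = model.predict(series[-10:].reshape(1, -1))  # forecast the next value
```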
Abstract: In this paper we present a technique to speed up
ICA based on the idea of reducing the dimensionality of the data set while preserving the quality of the results. In particular, we refer to
the FastICA algorithm, which uses kurtosis as the statistical property to be maximized. By performing a particular Johnson-Lindenstrauss-like projection of the data set, we find the minimum dimensionality reduction rate ρ, defined as the ratio between the dimension k of the reduced space and the original dimension d, which guarantees a narrow confidence interval for this estimator with a high confidence level. The derived
dimensionality reduction rate depends on a system control parameter
β easily computed a priori on the basis of the observations only.
Extensive simulations have been performed on different sets of real-world signals. They show that the achievable dimensionality reduction is substantial; it preserves the quality of the decomposition and dramatically speeds up FastICA. On the other hand, a set of signals on which the estimated reduction rate is greater than 1 exhibits poor decomposition results when reduced, thus validating the reliability of the parameter β.
We are confident that our method will lead to a better approach to
real time applications.
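A minimal sketch of the speed-up idea, assuming the scikit-learn implementations: the observations are projected to a lower dimension with a Johnson-Lindenstrauss-style Gaussian random projection, and FastICA with a cubic (kurtosis-related) contrast is run in the reduced space. The reduction rate ρ is fixed by hand here, whereas the paper derives it from the control parameter β.

```python
from sklearn.decomposition import FastICA
from sklearn.random_projection import GaussianRandomProjection

# Project the d-dimensional observations down to k = rho * d dimensions,
# then run FastICA there; rho and n_sources are illustrative assumptions.
def reduced_fastica(X, rho=0.25, n_sources=4):
    k = max(n_sources, int(rho * X.shape[1]))
    Xr = GaussianRandomProjection(n_components=k).fit_transform(X)
    return FastICA(n_components=n_sources, fun="cube").fit_transform(Xr)
```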
Abstract: This study used a positivist quantitative approach to examine the mathematical concept acquisition of KS4 (ages 14-16) Special Education Needs (SENs) students within the school sector education in England. The research is based on a pilot study, and the design is holistic in its approach, mixing methodologies. The study combines qualitative and quantitative methods of approach in gathering formative data for the design process. Although the approach could best be described as mixed-method, it is fundamentally grounded in a strong positivist paradigm; hence my earlier understanding of the differentiation of the students, the student-teacher body, and the various elements of the indicators being measured, which will require an attenuated description of individual research subjects. The design process involves four phases with five key stages: literature review and document analysis, the survey, interview, and observation, and finally the analysis of the data set. The research identified the need for triangulation, with Reid's phases of data management providing a scaffold for the study. The study clearly identified the ideological and philosophical aspects of educational research design for the study of mathematics by Special Education Needs (SENs) students in England using the virtual learning environment (VLE) platform.
Abstract: The Tropical Data Hub (TDH) is a virtual research environment that provides researchers with an e-research infrastructure to congregate significant tropical data sets for data reuse, integration, searching, and correlation. However, researchers often require data and metadata synthesis across disciplines for cross-domain analyses and knowledge discovery. A triplestore offers a semantic layer to achieve a more intelligent method of search to support the synthesis requirements by automating latent linkages in the data and metadata. Presently, the benchmarks to aid the decision of which triplestore is best suited for use in an application environment like the TDH are limited to performance. This paper describes a new evaluation tool developed to analyze both features and performance. The tool comprises a weighted decision matrix to evaluate the interoperability, functionality, performance, and support availability of a range of integrated and native triplestores and to rank them according to the requirements of the TDH.
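A minimal sketch of a weighted decision matrix of the kind the tool comprises; the criteria, weights, candidate names, and scores below are illustrative assumptions, not the TDH evaluation data.

```python
import numpy as np

criteria = ["interoperability", "functionality", "performance", "support"]
weights = np.array([0.25, 0.30, 0.30, 0.15])    # assumed weights, sum to 1

candidates = ["StoreA", "StoreB", "StoreC"]     # hypothetical triplestores
scores = np.array([[7, 8, 6, 5],                # rows: candidates
                   [9, 6, 7, 8],                # cols: criterion scores 0-10
                   [6, 7, 9, 6]], dtype=float)

weighted = scores @ weights                     # weighted sum per candidate
for i in np.argsort(weighted)[::-1]:            # rank best first
    print(f"{candidates[i]}: {weighted[i]:.2f}")
```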
Abstract: The density estimates considered in this paper comprise
a base density and an adjustment component consisting of a linear
combination of orthogonal polynomials. It is shown that, in the
context of density approximation, the coefficients of the linear combination can be determined either from a moment-matching technique or from a weighted least-squares approach. A kernel representation of
the corresponding density estimates is obtained. Additionally, two
refinements of the Kronmal-Tarter stopping criterion are proposed
for determining the degree of the polynomial adjustment. By way of
illustration, the density estimation methodology advocated herein is
applied to two data sets.
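A minimal sketch of such an estimate, assuming a standard-normal base density with probabilists' Hermite polynomials (a Gram-Charlier-type expansion) and moment-matched coefficients c_j = E[He_j(X)]/j!; the degree is fixed by hand here, whereas the paper proposes refined Kronmal-Tarter stopping criteria for choosing it.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import factorial

# Base density (standard normal) times a polynomial adjustment whose
# coefficients are matched to the sample moments of the standardized data.
def adjusted_density(sample, degree=4):
    x = (sample - sample.mean()) / sample.std(ddof=1)
    coeffs = [np.mean(hermeval(x, [0] * j + [1])) / factorial(j)
              for j in range(degree + 1)]
    def density(t):
        phi = np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)  # base density
        return phi * hermeval(t, coeffs)                # adjustment
    return density

rng = np.random.default_rng(0)
f = adjusted_density(rng.standard_normal(1000))
print(f(np.array([0.0, 1.0])))
```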
Abstract: Instead of traditional (nominal) classification, we investigate
the subject of ordinal classification or ranking. An enhanced
method based on an ensemble of Support Vector Machines (SVMs)
is proposed. Each binary classifier is trained with specific weights
for each object in the training data set. Experiments on benchmark
datasets and synthetic data indicate that the performance of our
approach is comparable to state-of-the-art kernel methods for
ordinal regression. The ensemble method, which is straightforward
to implement, provides a very good sensitivity-specificity trade-off
for the highest and lowest rank.
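A minimal sketch of an ensemble of binary SVMs for ranking in the Frank-and-Hall style, assuming scikit-learn: one classifier per rank threshold answers "is the rank greater than k?". The paper's specific per-object weighting is not reproduced; sample_weight is where it would enter.

```python
import numpy as np
from sklearn.svm import SVC

# One binary SVM per threshold k; each is trained on the indicator y > k,
# optionally with per-object weights as in the paper's scheme.
def fit_ordinal_svms(X, y, ranks, sample_weight=None):
    return {k: SVC(probability=True).fit(X, (y > k).astype(int),
                                         sample_weight=sample_weight)
            for k in ranks[:-1]}

def predict_rank(models, X, ranks):
    pg = np.column_stack([models[k].predict_proba(X)[:, 1]
                          for k in ranks[:-1]])       # P(y > k) per threshold
    probs = np.column_stack(
        [1 - pg[:, 0],
         *[pg[:, i] - pg[:, i + 1] for i in range(pg.shape[1] - 1)],
         pg[:, -1]])                                  # per-rank probabilities
    return np.array(ranks)[probs.argmax(axis=1)]
```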
Abstract: Data mining incorporates a group of statistical
methods used to analyze a set of information, or a data set. It operates
with models and algorithms, which are powerful tools with great potential. They can help people understand the patterns in a given chunk of information, so data mining tools have a wide area of applications. For example, in theoretical chemistry, data mining tools can be used to predict molecule properties or
improve computer-assisted drug design. Classification analysis is one
of the major data mining methodologies. The aim of this contribution is to create a classification model that is able to deal with a huge data set with high accuracy. For this purpose, logistic regression, Bayesian logistic regression, and random forest models were built using the R software. A Bayesian logistic regression model in the Latent GOLD
software was created as well. These classification methods belong to
supervised learning methods.
It was necessary to reduce the dimension of the data matrix before constructing the models, and thus factor analysis (FA) was used. These models
were applied to predict the biological activity of molecules, potential
new drug candidates.
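A minimal sketch of the reduce-then-classify pipeline, rendered in Python with scikit-learn rather than the R and Latent GOLD software used in the study; the number of factors and the data X, y are placeholders.

```python
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Factor analysis reduces the data matrix before each classifier is fit.
fa_logreg = make_pipeline(FactorAnalysis(n_components=10),
                          LogisticRegression(max_iter=1000))
fa_forest = make_pipeline(FactorAnalysis(n_components=10),
                          RandomForestClassifier(n_estimators=500))

# With molecule descriptors X and activity labels y:
# for model in (fa_logreg, fa_forest):
#     print(cross_val_score(model, X, y, cv=5).mean())
```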
Abstract: Clustering is a very well known technique in data mining. One of the most widely used clustering techniques is the K-means algorithm. Solutions obtained from this technique depend on the initialization of cluster centers, and the final solution may converge to local minima. In order to overcome the shortcomings of the K-means algorithm, this paper proposes a hybrid evolutionary algorithm based on the combination of the PSO, SA, and K-means algorithms, called PSO-SA-K, which can find better cluster partitions. The performance is evaluated on several benchmark data sets. The simulation results show that the proposed algorithm outperforms previous approaches, such as PSO, SA, and K-means, for the partitional clustering problem.
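A minimal sketch of the core idea behind such hybrids: each particle encodes k cluster centers and its fitness is the K-means objective (within-cluster sum of squared distances). This plain PSO loop omits the SA component and the authors' exact PSO-SA-K update rules.

```python
import numpy as np

def sse(centers, X):
    # K-means objective: each point contributes its squared distance
    # to the nearest center.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

def pso_kmeans(X, k, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    pos = X[rng.choice(len(X), (n_particles, k))]   # centers per particle
    vel = np.zeros_like(pos)
    pbest, pbest_f = pos.copy(), np.array([sse(p, X) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([sse(p, X) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest                                    # best cluster centers
```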
Abstract: Software developed for a specific customer under contract
typically undergoes a period of testing by the customer before
acceptance. This is known as user acceptance testing and the process
can reveal both defects in the system and requests for changes to
the product. This paper uses nonhomogeneous Poisson processes to
model a real user acceptance data set from a recently developed
system. In particular, a split Poisson process is shown to provide an
excellent fit to the data. The paper explains how this model can be
used to aid the allocation of resources through the accurate prediction
of occurrences both during the acceptance testing phase and before
this activity begins.
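A minimal sketch of fitting a nonhomogeneous Poisson process to event times by maximum likelihood. The Goel-Okumoto intensity λ(t) = a·b·e^(−bt) used below and the event times are assumptions; the paper's split Poisson process has a different (piecewise) intensity, but the fitting recipe is analogous.

```python
import numpy as np
from scipy.optimize import minimize

# NHPP log-likelihood for event times on [0, T]:
#   sum_i log lambda(t_i) - m(T),  with m(T) = a * (1 - exp(-b * T)).
def nhpp_neg_loglik(params, times, T):
    a, b = np.exp(params)                 # log-parametrize to stay positive
    return -(np.sum(np.log(a * b) - b * times) - a * (1 - np.exp(-b * T)))

times = np.array([2., 5., 9., 14., 20., 27., 35., 44.])  # assumed data
T = 50.0
res = minimize(nhpp_neg_loglik, x0=np.log([10., 0.05]), args=(times, T))
a, b = np.exp(res.x)
print(f"m(t) = {a:.1f} * (1 - exp(-{b:.3f} t))")  # predicted event counts
```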
Abstract: Web usage currently generates a huge amount of data from many users. In general, a proxy server is a system that supports users' web access and whose operation can be managed using hit rates. This research tries to improve hit rates in a proxy system by applying a data mining technique. The data sets were collected from proxy servers at a university, and relationships among several features were investigated.
The model is used to predict future website accesses. The association rule technique is applied to obtain the relations among Date, Time, Main Group web, Sub Group web, and Domain name for the created model. The results showed that this technique can predict web content for the next day; moreover, the future accesses of websites increased from 38.15% to 85.57%.
This model can predict web page accesses, which in turn tends to increase the efficiency of proxy servers. In addition, the performance of internet access will be improved, helping to reduce network traffic.
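A minimal sketch of mining association rules over the log attributes named above, assuming the mlxtend library and toy transactions; the Sub Group attribute is omitted for brevity.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One proxy-log record per row; values below are illustrative stand-ins.
log = pd.DataFrame([
    {"Date": "Mon", "Time": "AM", "MainGroup": "news",  "Domain": "a.com"},
    {"Date": "Mon", "Time": "AM", "MainGroup": "news",  "Domain": "b.com"},
    {"Date": "Tue", "Time": "PM", "MainGroup": "video", "Domain": "c.com"},
])
onehot = pd.get_dummies(log)                  # one-hot items per transaction
freq = apriori(onehot, min_support=0.3, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```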
Abstract: Recurrent event data is a special type of multivariate
survival data. Dynamic and frailty models are two of the approaches that deal with this kind of data. A comparison between these two
models is studied using the empirical standard deviation of the
standardized martingale residual processes as a way of assessing the
fit of the two models based on the Aalen additive regression model.
We found that both approaches take heterogeneity into account and produce residual standard deviations close to each other, both in the simulation study and in the real data set.
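A minimal sketch of fitting the Aalen additive regression model with the lifelines library on a stand-in data set; the martingale-residual comparison of the dynamic and frailty models is not reproduced here.

```python
from lifelines import AalenAdditiveFitter
from lifelines.datasets import load_rossi

df = load_rossi()                        # stand-in survival data set
aaf = AalenAdditiveFitter(coef_penalizer=0.1)
aaf.fit(df, duration_col="week", event_col="arrest")
print(aaf.cumulative_hazards_.head())    # time-varying coefficient estimates
```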
Abstract: In recent years, a number of works proposing the
combination of multiple classifiers to produce a single
classification have been reported in the remote sensing literature. The
resulting classifier, referred to as an ensemble classifier, is
generally found to be more accurate than any of the individual
classifiers making up the ensemble. As accuracy is the primary
concern, much of the research in the field of land cover
classification is focused on improving classification accuracy. This
study compares the performance of four ensemble approaches
(boosting, bagging, DECORATE and random subspace) with a
univariate decision tree as the base classifier. Two training data sets, one without any noise and the other with 20 percent noise, were used to judge the performance of the different ensemble approaches. Results with the noise-free data set suggest an improvement of about 4% in classification accuracy with all ensemble approaches in comparison to the results provided by the univariate decision tree classifier. The highest classification accuracy, 87.43%, was achieved
by the boosted decision tree. A comparison of results with the noisy data set suggests that the bagging, DECORATE, and random subspace approaches work well with these data, whereas the performance of the boosted decision tree degrades to a classification accuracy of 79.7%, which is even lower than that achieved (i.e., 80.02%) by the unboosted decision tree classifier.
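A minimal sketch comparing two of the four ensemble approaches (bagging and boosting) with a decision tree base classifier, assuming scikit-learn; DECORATE is not available there, and the land cover data are replaced by synthetic data.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=0)        # stand-in for land cover
base = DecisionTreeClassifier(random_state=0)
for name, model in [("tree", base),
                    ("bagging", BaggingClassifier(base, n_estimators=50)),
                    ("boosting", AdaBoostClassifier(base, n_estimators=50))]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```

A random subspace ensemble can be approximated in the same framework by giving BaggingClassifier a max_features fraction below 1.0.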
Abstract: The approach of subset selection in polynomial
regression model building assumes that the chosen fixed full set of
predefined basis functions contains a subset sufficient to describe the target relation well. However, in most cases the necessary set of basis functions is not known and needs to be guessed, a potentially non-trivial (and long) trial-and-error process.
In our research we consider a potentially more efficient approach –
Adaptive Basis Function Construction (ABFC). It lets the model
building method itself construct the basis functions necessary for
creating a model of arbitrary complexity with adequate predictive
performance. However, two issues to some extent plague both the subset selection and ABFC methods, especially when working with relatively small data samples: selection bias and selection instability. We try to correct these
issues by model post-evaluation using Cross-Validation and model
ensembling. To evaluate the proposed method, we empirically
compare it to ABFC methods without ensembling, to a widely used
method of subset selection, as well as to some other well-known
regression modeling methods, using publicly available data sets.
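A minimal sketch of the post-evaluation idea with fixed polynomial bases standing in for ABFC's adaptively constructed ones: candidate models are scored by cross-validation and the predictions of the best few are averaged, damping selection instability. Degrees and the top-model count are assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_ensemble(X, y, degrees=(1, 2, 3, 4, 5), top=3):
    models = [make_pipeline(PolynomialFeatures(d), LinearRegression())
              for d in degrees]
    scores = [cross_val_score(m, X, y, cv=5).mean() for m in models]
    best = np.argsort(scores)[-top:]              # keep the top-scoring models
    fitted = [models[i].fit(X, y) for i in best]
    return lambda Xn: np.mean([m.predict(Xn) for m in fitted], axis=0)
```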
Abstract: Cancers could normally be marked by a number of
differentially expressed genes which show enormous potential as
biomarkers for a certain disease. In recent years, cancer classification based on the investigation of gene expression profiles derived from high-throughput microarrays has been widely used. The selection of discriminative genes is, therefore, an essential preprocessing step in
carcinogenesis studies. In this paper, we have proposed a novel gene
selector using information-theoretic measures for biological
discovery. This multivariate filter is a four-stage framework through
the analyses of feature relevance, feature interdependence, feature
redundancy-dependence, and subset rankings, and it has been examined on the colon cancer data set. Our experimental results show that the proposed method outperformed other information-theory-based filters in all aspects of classification error and classification performance.
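A minimal sketch of an information-theoretic gene filter in the relevance-minus-redundancy (mRMR-like) style; the paper's four-stage framework is more elaborate, and the discretization and selection size below are assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr(X, y, n_select=10, bins=5):
    # Discretize expression values so pairwise mutual information between
    # genes is well defined.
    Xd = np.array([np.digitize(c, np.histogram_bin_edges(c, bins)[1:-1])
                   for c in X.T]).T
    relevance = mutual_info_classif(X, y)         # I(gene; class label)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        def score(j):                             # relevance - redundancy
            red = (np.mean([mutual_info_score(Xd[:, j], Xd[:, s])
                            for s in selected]) if selected else 0.0)
            return relevance[j] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```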
Abstract: This article outlines conceptualization and
implementation of an intelligent system capable of extracting
knowledge from databases. Use of hybridized features of both Rough and Fuzzy Set theory gives the developed system flexibility in dealing with discrete as well as continuous data sets. A raw data set provided to the system is initially transformed into a computer-legible format, followed by pruning of the data set. The refined data set is
then processed through various Rough Set operators which enable
discovery of parameter relationships and interdependencies. The
discovered knowledge is automatically transformed into a rule base
expressed in Fuzzy terms. Two exemplary cancer repository datasets
(for Breast and Lung Cancer) have been used to test and implement
the proposed framework.
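A minimal sketch of the basic Rough Set operators behind such systems: the lower and upper approximations of a target concept under the indiscernibility relation induced by a chosen attribute subset. The toy records are illustrative assumptions.

```python
from collections import defaultdict

def approximations(records, attrs, target):
    # Group record indices into indiscernibility classes over attrs.
    classes = defaultdict(set)
    for i, r in enumerate(records):
        classes[tuple(r[a] for a in attrs)].add(i)
    lower = {i for c in classes.values() if c <= target for i in c}
    upper = {i for c in classes.values() if c & target for i in c}
    return lower, upper

records = [{"size": "big", "color": "red"},
           {"size": "big", "color": "red"},
           {"size": "small", "color": "blue"}]
print(approximations(records, ["size"], target={0, 2}))  # ({2}, {0, 1, 2})
```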
Abstract: The C control chart assumes that process nonconformities follow a Poisson distribution. In actuality, however, this Poisson distribution does not always hold. A process control scheme for semiconductor manufacturing based on a Poisson distribution always underestimates the true average number of nonconformities and the process variance. Quality is described more accurately if a compound Poisson process is used for process control in this case. A cumulative sum (CUSUM) control chart is much better than a C control chart when a small shift is to be detected. This study calculates one-sided CUSUM average run lengths (ARLs) using a Markov chain approach to construct a CUSUM control chart with an underlying Poisson-Gamma compound distribution for the failure mechanism. Moreover, an actual data set from a wafer plant is used to demonstrate the operation of the proposed model. The results show that the CUSUM control chart achieves significantly better performance than an EWMA chart.
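A minimal sketch of the Brook-Evans Markov chain computation of a one-sided CUSUM average run length for integer counts, assuming the Poisson-Gamma (gamma-mixed Poisson) counts follow a negative binomial distribution; the reference value k, decision interval h, and distribution parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import nbinom

# CUSUM recursion S_t = max(0, S_{t-1} + X_t - k); states 0..h-1 are
# transient, the chart signals once S_t >= h. The ARL vector solves
# (I - Q) a = 1, and the chart's ARL is the component for S_0 = 0.
def cusum_arl(r, p, k=3, h=10):
    Q = np.zeros((h, h))
    for i in range(h):
        for j in range(h):
            if j == 0:
                Q[i, j] = nbinom.cdf(k - i, r, p)   # any X <= k - i resets
            else:
                Q[i, j] = nbinom.pmf(j - i + k, r, p)
    return np.linalg.solve(np.eye(h) - Q, np.ones(h))[0]

print(cusum_arl(r=3, p=0.6))   # ARL for the assumed in-control parameters
```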
Abstract: Predicting software quality during the development life cycle of a software project helps the development organization make efficient use of available resources to produce a product of the highest quality. A "whether a module is faulty or not" approach can be used to predict the quality of a software module. There are a number of software quality prediction models described in the literature based upon genetic algorithms, artificial neural networks, and other data mining algorithms. One of the promising approaches for quality prediction is based on clustering techniques. Most quality prediction models that are based on clustering techniques make use of the K-means, Mixture-of-Gaussians, Self-Organizing Map, Neural Gas, and fuzzy K-means algorithms for prediction. In all these techniques a predefined structure is required; that is, the number of neurons or clusters should be known before the clustering process starts. In the case of Growing Neural Gas, however, there is no need to predetermine the number of neurons or the topology of the structure to be used; it starts with a minimal neuron structure that is incremented during training until it reaches a user-defined maximum number of clusters. Hence, in this work we have used Growing Neural Gas as the underlying clustering algorithm; it produces an initial set of labeled clusters from the training data set, and this set of clusters is thereafter used to predict the quality of software modules in the test data set. The best testing results show 80% accuracy in evaluating the quality of software modules. Hence, the proposed technique can be used by programmers to evaluate the quality of modules during software development.
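A minimal sketch of the prediction step described above: once a clustering algorithm (GNG in this work) has produced labeled prototypes, a test module inherits the label of its nearest prototype. The GNG training itself is not reproduced, and the prototypes below are assumptions.

```python
import numpy as np

def predict_from_prototypes(prototypes, labels, X_test):
    # Assign each test module the label of its nearest cluster prototype.
    d = np.linalg.norm(X_test[:, None, :] - prototypes[None, :, :], axis=2)
    return np.asarray(labels)[d.argmin(axis=1)]

prototypes = np.array([[0.2, 0.1], [0.9, 0.8]])   # assumed cluster centers
labels = ["not-faulty", "faulty"]                 # labels from training data
print(predict_from_prototypes(prototypes, labels, np.array([[0.85, 0.9]])))
```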
Abstract: In the last few years, three multivariate spectral
analysis techniques, namely Principal Component Analysis (PCA),
Independent Component Analysis (ICA) and Non-negative Matrix
Factorization (NMF) have emerged as effective tools for oscillation
detection and isolation. While the first method is used in determining
the number of oscillatory sources, the latter two methods
are used to identify source signatures by formulating the detection
problem as a source identification problem in the spectral domain.
In this paper, we present a critical drawback of the underlying linear
(mixing) model which strongly limits the ability of the associated
source separation methods to determine the number of sources
and/or identify the physical source signatures. It is shown that the
assumed mixing model is only valid if each unit of the process gives
equal weighting (all-pass filter) to all oscillatory components in its
inputs. This is in contrast to the fact that each unit, in general, acts
as a filter with non-uniform frequency response. Thus, the model
can only facilitate correct identification of a source with a single
frequency component, which is again unrealistic. To overcome
this deficiency, an iterative post-processing algorithm that correctly
identifies the physical source(s) is developed. An additional issue
with the existing methods is that they lack a procedure to pre-screen
non-oscillatory/noisy measurements which obscure the identification
of oscillatory sources. In this regard, a pre-screening procedure
is prescribed based on the notion of sparseness index to eliminate
the noisy and non-oscillatory measurements from the data set used
for analysis.
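A minimal sketch of one plausible sparseness index for pre-screening, using Hoyer's sparseness of a signal's power spectrum (the paper's exact definition may differ): oscillatory signals concentrate power in few frequency bins and score near 1, while noisy measurements score lower.

```python
import numpy as np

def sparseness_index(signal):
    # Hoyer sparseness of the power spectrum, in [0, 1].
    spectrum = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    n = spectrum.size
    l1, l2 = spectrum.sum(), np.sqrt((spectrum ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

t = np.linspace(0, 10, 1000)
print(sparseness_index(np.sin(2 * np.pi * 2 * t)))      # near 1: oscillatory
print(sparseness_index(np.random.default_rng(0).normal(size=1000)))  # lower
```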