Abstract: The purpose of this study is to explore the characteristics of developing a machine learning application using synthetic data. The study is structured around developing the application for the purpose of deploying the computer vision model. The findings discuss the realities of developing a computer vision model for a practical purpose, and detail the processes, tools and techniques that were used to meet accuracy requirements. The research reveals that synthetic data represent another variable that can be adjusted to improve the performance of a computer vision model. Further, a suite of tools and tuning recommendations is provided.
Abstract: With the widespread adoption of Internet-connected devices and the prevalence of Internet of Things (IoT) applications, there is increased interest in machine learning techniques that can provide useful and interesting services in the smart home domain. The areas that machine learning techniques can help advance are varied and ever-evolving; classifying smart home inhabitants' Activities of Daily Living (ADLs) is one prominent example. The ability of a machine learning technique to find meaningful spatio-temporal relations in high-dimensional data is an important requirement as well. This paper presents a comparative evaluation of state-of-the-art machine learning techniques for classifying ADLs in the smart home domain. Forty-two synthetic datasets and two real-world datasets with multiple inhabitants are used to evaluate and compare the performance of the identified techniques: AdaBoost, the Cortical Learning Algorithm (CLA), Decision Trees, Hidden Markov Models (HMM), Multi-layer Perceptrons (MLP), Structured Perceptrons and Support Vector Machines (SVM). Our results show significant performance differences between the evaluated techniques; overall, neural-network-based techniques outperformed the other tested techniques.
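A minimal sketch of such a comparative evaluation, using scikit-learn classifiers on a synthetic stand-in dataset (the paper's forty-four datasets and the CLA, HMM and Structured Perceptron implementations are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for sensor-event features labelled with four ADL classes.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "SVM": SVC(random_state=0),
}
# Fit each technique and measure held-out accuracy.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

The same loop structure extends to any additional estimator that follows the scikit-learn fit/score interface.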
Abstract: Spatial, Temporal, and Spectral Resolution (STSR) are three key characteristics of Earth observation satellite sensors; however, no single satellite sensor can provide Earth observations with high STSR simultaneously because of the hardware limitations of satellite sensors. At the same time, the demand for high STSR has been growing with the development of remote sensing applications. Although image fusion technology provides a feasible means to overcome the limitations of current Earth observation data, existing fusion technologies can neither enhance all three resolutions simultaneously nor provide a sufficiently high level of resolution improvement. This study proposes a Hybrid Spatial-Temporal-Spectral image Fusion Model (HSTSFM) to generate synthetic satellite data with simultaneously high STSR, which blends the high spatial resolution of the panchromatic image of the Landsat-8 Operational Land Imager (OLI), the high temporal resolution of the multi-spectral image of the Moderate Resolution Imaging Spectroradiometer (MODIS), and the high spectral resolution of the hyper-spectral image of Hyperion to produce high-STSR images. The proposed HSTSFM contains three fusion modules: (1) spatial-spectral image fusion; (2) spatial-temporal image fusion; (3) temporal-spectral image fusion. A set of test data with both phenological and land cover type changes in a suburban area of Beijing, China is adopted to demonstrate the performance of the proposed method. The experimental results indicate that HSTSFM can produce fused images with good spatial and spectral fidelity to the reference image, and thus has the potential to generate synthetic data for studies that require high-STSR satellite imagery.
Abstract: Traditionally in sensor networks, and recently in the Internet of Things, numerous heterogeneous sensors are deployed in a distributed manner to monitor a phenomenon that can often be modeled by an underlying stochastic process. The big time-series data collected by the sensors must be analyzed to detect changes in the stochastic process as quickly as possible, with a tolerable false-alarm rate. However, sensors may differ in accuracy and sensitivity range, and their performance decays over time. As a result, the big time-series data collected by the sensors contain uncertainties and are sometimes conflicting. In this study, we present a framework that exploits the capability of Evidence Theory (a.k.a. the Dempster-Shafer and Dezert-Smarandache Theories) to represent and manage uncertainty and conflict, in order to achieve fast change detection and deal effectively with complementary hypotheses. Specifically, the Kullback-Leibler divergence is used as the similarity metric to calculate the distances between the estimated current distribution and the pre- and post-change distributions. Mass functions are then calculated, and the related combination rules are applied to combine the mass values across all sensors. Furthermore, we apply the method to estimate the minimum number of sensors whose evidence needs to be combined, so that computational efficiency can be improved. A cumulative sum (CUSUM) test is then applied to the ratio of pignistic probabilities to detect and declare the change for decision-making purposes. Simulation results using both synthetic data and real data from an experimental setup demonstrate the effectiveness of the presented schemes.
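The per-sensor building blocks can be sketched as follows, assuming known Gaussian pre- and post-change distributions; the evidence-theory fusion across sensors and the pignistic transformation are not reproduced here:

```python
import math
import random

def kl_gauss(mu0, s0, mu1, s1):
    """KL divergence KL(N(mu0, s0^2) || N(mu1, s1^2))."""
    return math.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5

def cusum(samples, mu_pre, mu_post, sigma, h):
    """Return the index of the first sample where the CUSUM statistic exceeds h."""
    g = 0.0
    for t, x in enumerate(samples):
        # Log-likelihood ratio of the post-change vs the pre-change model.
        llr = ((x - mu_pre) ** 2 - (x - mu_post) ** 2) / (2 * sigma ** 2)
        g = max(0.0, g + llr)
        if g > h:
            return t
    return None

random.seed(0)
# 200 pre-change samples from N(0,1), then 200 post-change samples from N(2,1).
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(2, 1) for _ in range(200)]
alarm = cusum(data, mu_pre=0.0, mu_post=2.0, sigma=1.0, h=10.0)
print("KL(pre || post) =", kl_gauss(0, 1, 2, 1))   # 2.0 for these parameters
print("change declared at sample", alarm)          # shortly after the true change at 200
```

The threshold h trades detection delay against the false-alarm rate, mirroring the tolerable false-alarm requirement stated in the abstract.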
Abstract: This study extends the use of the Drainage Area Regionalization (DAR) method to generating synthetic data and calibrating PyTOPKAPI stream yield for an ungauged basin at a daily time scale. The generation of runoff for determining a river's yield depends on various topographic and spatial meteorological variables, which together form the Catchment Characteristics Model (CCM). Many of the conventional CCM models adopted in Africa have been challenged by a paucity of adequate, relevant and accurate data for parameterizing and validating their potential. The purpose of generating synthetic flow is to test a hydrological model under conditions that do not suffer from the impact of very low or very high flows, thus allowing a check of whether the model structure is sound. The employed physically-based, watershed-scale hydrologic model (PyTOPKAPI) was parameterized with GIS pre-processing parameters and remote-sensing hydro-meteorological variables. Validation with the mean annual runoff ratio shows good graphical agreement between the observed and simulated discharge. Nash-Sutcliffe efficiency and coefficient of determination (R²) values of 0.704 and 0.739 indicate strong model efficiency. Given the impact of current climate variability, water planners now have a tool for flow quantification and sustainable planning purposes.
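The Nash-Sutcliffe efficiency cited above has a standard closed form, NSE = 1 − Σ(Qobs − Qsim)² / Σ(Qobs − Q̄obs)², which can be computed directly (the discharge values below are illustrative, not the study's data):

```python
def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

# Illustrative daily discharge values (not the study's data).
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
sim = [1.1, 1.9, 3.2, 3.8, 5.1]
print(round(nash_sutcliffe(obs, sim), 3))  # 0.989; values near 1 indicate strong agreement
```

An NSE of 1 means a perfect match, 0 means the model is no better than predicting the observed mean, and negative values mean it is worse.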
Abstract: For recognizing coins, the engraved release date is important information for precisely identifying a coin's monetary type. However, reading characters on coins faces far more obstacles than traditional character recognition tasks in other fields, such as reading scanned documents or license plates. To address this challenging issue in a numismatic context, we propose a training-free approach dedicated to the detection and recognition of a coin's release date. In the first step, the date zone is detected by comparing histogram features; in the second step, a topology-based algorithm is introduced to recognize coin numerals with various font types, represented by a binary gradient map. Our method obtained a recognition rate of 92% on synthetic data and of 44% on real noisy data.
Abstract: This paper describes the use of artificial neural
networks (ANN) for predicting non-linear layer moduli of flexible
airfield pavements subjected to new generation aircraft (NGA)
loading, based on the deflection profiles obtained from Heavy
Weight Deflectometer (HWD) test data. The HWD test is one of the
most widely used tests for routinely assessing the structural integrity
of airport pavements in a non-destructive manner. The elastic moduli
of the individual pavement layers backcalculated from the HWD
deflection profiles are effective indicators of layer condition and are
used for estimating the pavement remaining life. HWD tests were
periodically conducted at the Federal Aviation Administration's
(FAA's) National Airport Pavement Test Facility (NAPTF) to
monitor the effect of Boeing 777 (B777) and Boeing 747 (B747) test
gear trafficking on the structural condition of flexible pavement
sections. In this study, a multi-layer, feed-forward network which
uses an error-backpropagation algorithm was trained to approximate
the HWD backcalculation function. The synthetic database generated
using an advanced non-linear pavement finite-element program was
used to train the ANN to overcome the limitations associated with
conventional pavement moduli backcalculation. The changes in
ANN-based backcalculated pavement moduli with trafficking were
used to compare the relative severity effects of the aircraft landing
gears on the NAPTF test pavements.
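The idea of training a feed-forward network on a synthetic database to approximate a backcalculation (inverse) function can be illustrated with a toy one-parameter stand-in for the finite-element forward model (the load value, modulus range and forward relation below are hypothetical):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
E = rng.uniform(50.0, 500.0, size=2000)       # layer modulus (hypothetical range)
load = 100.0                                   # hypothetical constant test load
# Toy forward model standing in for the finite-element program: deflection
# falls as the modulus rises, plus a little measurement noise.
deflection = load / E + rng.normal(0.0, 1e-3, E.size)

# Train a feed-forward net to invert the forward model (deflection -> modulus).
# Targets are rescaled so the optimizer works on well-conditioned values.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0)
net.fit(deflection.reshape(-1, 1), E / 100.0)

true_E = 200.0
pred_E = 100.0 * net.predict(np.array([[load / true_E]]))[0]
print(f"true modulus {true_E:.1f}, backcalculated {pred_E:.1f}")
```

In the paper's setting the single deflection value becomes a full HWD deflection profile and the analytical forward model is replaced by the non-linear finite-element program, but the train-on-synthetic-pairs structure is the same.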
Abstract: A Bloom filter is a probabilistic, memory-efficient data structure designed to answer rapidly whether an element is present in a set. It can report with certainty that an element is not in the set, but can report its presence only with a certain probability. The trade-off in using a Bloom filter is a configurable risk of false positives. The odds of a false positive can be made very low if the number of hash functions is sufficiently large. For spam detection, a weight is attached to each set of elements; the spam weight of a word is a measure used to rate the e-mail, and each word is assigned to a Bloom filter based on its weight. The proposed work introduces an enhanced Bloom filter concept called the Bin Bloom Filter (BBF). The performance of the BBF over the conventional Bloom filter is evaluated under various optimization techniques. A real-world data set and synthetic data sets are used for experimental analysis, and results are reported for bin sizes 4, 5, 6 and 7. Analysis of the results shows that a BBF using heuristic techniques performs better than the traditional Bloom filter in spam detection.
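A plain Bloom filter, the baseline that the BBF extends, can be sketched as follows; the assignment of words to bins by spam weight is not shown, and the example words are hypothetical:

```python
import hashlib

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _indexes(self, item):
        # Derive num_hashes independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = True

    def __contains__(self, item):
        # False means "definitely not in the set"; True means "probably in it".
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter(size=1024, num_hashes=5)
for word in ["viagra", "lottery", "winner"]:   # hypothetical spam-weighted words
    bf.add(word)
print("lottery" in bf)   # True: added words are always reported present
```

Increasing num_hashes (up to the usual optimum of about (size/n)·ln 2 for n stored items) lowers the false-positive rate, as the abstract notes.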
Abstract: In this paper we are interested in classification problems with a performance constraint on the error probability. In such problems, if the constraint cannot be satisfied, a rejection option is introduced. For binary classification, a number of SVM-based methods with a rejection option have been proposed over the past few years. All of these methods use two thresholds on the SVM output. However, in previous work we have shown on synthetic data that thresholding the output of the optimal SVM may lead to poor results for classification tasks with a performance constraint. In this paper a new method for supervised classification with a rejection option is proposed. It consists of two different classifiers jointly optimized to minimize the rejection probability subject to a given constraint on the error rate. The method uses a new kernel-based linear learning machine that we have recently presented, characterized by its simplicity and high training speed, which makes the simultaneous optimization of the two classifiers computationally reasonable. The proposed classification method with a rejection option is compared to an SVM-based rejection method from the recent literature, and experiments show the superiority of the proposed method.
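The baseline two-threshold rejection rule that the compared methods apply to the SVM output can be sketched as follows (the thresholds and data below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
clf = SVC(random_state=0).fit(X, y)
svm_scores = clf.decision_function(X)   # signed distance-like SVM output

t_low, t_high = -0.5, 0.5    # illustrative thresholds, tuned to meet the constraint

def decide(s):
    if s <= t_low:
        return 0             # confident class-0 decision
    if s >= t_high:
        return 1             # confident class-1 decision
    return None              # reject: output falls between the two thresholds

decisions = [decide(s) for s in svm_scores]
rejected = sum(d is None for d in decisions)
print(f"rejected {rejected} of {len(decisions)} samples")
```

Widening the band between t_low and t_high lowers the error rate on accepted samples at the cost of rejecting more of them; the paper's point is that tuning this band on a single SVM can be suboptimal compared to jointly optimizing two classifiers.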
Abstract: Camera calibration plays an important role in the analysis of sports video. For soccer video, the cross-points at the center of the soccer field that can be used for calibration are in most cases not sufficient, so this paper introduces a new automatic camera calibration algorithm that solves this problem by using the properties of the images of the center circle, the halfway line and a touch line. After a theoretical analysis, a practicable automatic algorithm is proposed. Although very little information is used, experiments with both synthetic data and real data show that the algorithm is applicable.
Abstract: Most biclustering/projected clustering algorithms are based either on the Euclidean distance or on the correlation coefficient, which capture only linear relationships. However, in many applications, such as gene expression data and word-document data, non-linear relationships may exist between the objects. The mutual information between two variables provides a more general criterion for investigating dependencies amongst variables. In this paper, we improve upon our previous mutual-information-based biclustering algorithm in terms of computation time and the types of clusters identified. The algorithm is able to find biclusters with mixed relationships and is faster than the previous one. To the best of our knowledge, no other existing biclustering algorithm has used mutual information as a similarity measure. We present experimental results on synthetic data as well as on yeast expression data. The biclusters found on the yeast data were biologically and statistically significant according to the GO Tool Box and FuncAssociate.
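Mutual information as a similarity measure can be sketched with a hypothetical minimal implementation over discretized profiles; the example uses a quadratic relationship, for which the Pearson correlation is exactly zero yet the dependence is strong:

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # MI = sum over (a, b) of p(a, b) * log(p(a, b) / (p(a) * p(b))).
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

# A quadratic relationship: Pearson correlation is exactly 0 here,
# yet mutual information captures the dependence.
x = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2]
y = [v * v for v in x]
print(round(mutual_information(x, y), 4))   # about 1.0549 nats
```

This is the property the abstract relies on: unlike distance- or correlation-based measures, mutual information detects non-linear dependencies, at the cost of requiring a discretization of continuous expression values.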
Abstract: This paper presents a supervised clustering algorithm,
namely Grid-Based Supervised Clustering (GBSC), which is able to
identify clusters of any shapes and sizes without presuming any
canonical form for data distribution. The GBSC needs no prespecified
number of clusters, is insensitive to the order of the input
data objects, and is capable of handling outliers. Built on the
combination of grid-based clustering and density-based clustering,
under the assistance of the downward closure property of density
used in bottom-up subspace clustering, the GBSC can notably reduce
its search space to avoid the memory confinement situation during its
execution. On two-dimensional synthetic datasets, the GBSC correctly identifies clusters with different shapes and sizes. The GBSC also outperforms five other supervised clustering algorithms in experiments on several UCI datasets.
Abstract: Using Dynamic Bayesian Networks (DBNs) to model genetic regulatory networks from gene expression data is one of the major paradigms for inferring interactions among genes. Averaging a collection of models for predicting the network is preferable to relying on a single high-scoring model. In this paper, two kinds of model-search approaches are compared: Greedy hill-climbing Search with Restarts (GSR) and Markov Chain Monte Carlo (MCMC) methods. GSR is preferred in many papers, but there has been no comparative study of which is better for DBN models. Different types of experiments have been carried out to benchmark these approaches. Our experimental results demonstrate that, on average, the MCMC methods outperform GSR in the accuracy of the predicted network while having comparable time efficiency. By proposing different variations of MCMC and employing a simulated annealing strategy, the MCMC methods become more efficient and stable. Apart from comparing these approaches, another objective of this study is to investigate the feasibility of using DBN modeling approaches to infer gene networks from a few snapshots of high-dimensional gene profiles. Through experiments on synthetic data as well as systematic data, the results reveal how the performance of these approaches is influenced as the target gene network varies in network size, data size, and system complexity.
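The contrast between greedy hill-climbing with restarts and an MCMC-style search can be illustrated on a toy multimodal score surface (integer states stand in for network structures; no actual DBN scoring is performed):

```python
import math
import random

def score(s):
    # A multimodal stand-in for a network score over integer states 0..99.
    return math.sin(s / 5.0) + 0.02 * s

def greedy_restarts(n_restarts=10, max_steps=100):
    best = None
    for _ in range(n_restarts):
        s = random.randrange(100)
        for _ in range(max_steps):
            nxt = max((max(0, s - 1), min(99, s + 1)), key=score)
            if score(nxt) <= score(s):
                break            # local maximum reached
            s = nxt
        if best is None or score(s) > score(best):
            best = s
    return best

def mcmc(steps=5000, temp=0.3):
    s = random.randrange(100)
    best = s
    for _ in range(steps):
        cand = min(99, max(0, s + random.choice([-1, 1])))
        delta = score(cand) - score(s)
        # Metropolis rule: always accept uphill, sometimes accept downhill,
        # which lets the chain escape local maxima.
        if delta >= 0 or random.random() < math.exp(delta / temp):
            s = cand
        if score(s) > score(best):
            best = s
    return best

random.seed(0)
g_best, m_best = greedy_restarts(), mcmc()
print("greedy best score:", round(score(g_best), 3))
print("mcmc best score:  ", round(score(m_best), 3))
```

Lowering temp over time turns the Metropolis rule into the simulated annealing strategy the abstract mentions; in the paper's setting the states are DBN structures and the moves are edge additions, deletions and reversals.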
Abstract: Instead of traditional (nominal) classification, we investigate the subject of ordinal classification, or ranking. An enhanced method based on an ensemble of Support Vector Machines (SVMs) is proposed, in which each binary classifier is trained with specific weights for each object in the training data set. Experiments on benchmark datasets and synthetic data indicate that the performance of our approach is comparable to state-of-the-art kernel methods for ordinal regression. The ensemble method, which is straightforward to implement, provides a very good sensitivity-specificity trade-off for the highest and lowest ranks.
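One common way to build such an ensemble is the ordinal decomposition into K−1 binary "is the rank greater than k?" problems, sketched below on synthetic one-dimensional data (the abstract's per-object weighting scheme is not reproduced):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic one-dimensional data whose rank grows with the feature value.
X = rng.uniform(0.0, 4.0, size=(400, 1))
y = X[:, 0].astype(int)              # ordinal labels (ranks) 0..3

# One binary SVM per threshold k, answering "is the rank greater than k?".
classifiers = [SVC().fit(X, (y > k).astype(int)) for k in range(3)]

def predict_rank(x):
    # The predicted rank is the number of "greater than k" votes.
    return sum(int(clf.predict(x.reshape(1, -1))[0]) for clf in classifiers)

preds = np.array([predict_rank(x) for x in X])
mae = float(np.mean(np.abs(preds - y)))
print("mean absolute rank error:", round(mae, 3))
```

Unlike one-vs-rest nominal classification, this decomposition preserves the ordering of the labels, which is what makes per-rank weighting schemes like the one in the abstract possible.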