Abstract: The most important stage of the water treatment process is coagulation, which uses alum and poly aluminum chloride (PACL). Determining the dosage of alum and PACL is therefore the most important factor to be prescribed. This research applies an artificial neural network (ANN) trained with the Levenberg–Marquardt algorithm to create a mathematical model (Soft Jar Test) for predicting the doses of the coagulation chemicals alum and PACL, with input data consisting of turbidity, pH, alkalinity, conductivity, and oxygen consumption (OC) of the Bangkhen Water Treatment Plant (BKWTP), under the authority of the Metropolitan Waterworks Authority of Thailand. The data were collected from 1 January 2019 to 31 December 2019 in order to cover the changing seasons of Thailand. The ANN input data are divided into three groups: a training set, a test set, and a validation set. The coefficient of determination and the mean absolute error are 0.73 and 3.18 for the alum model, and 0.59 and 3.21 for the PACL model, respectively.
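The training setup named here can be illustrated with a minimal sketch, assuming nothing about the real BKWTP data: SciPy's `least_squares` with `method='lm'` applies the Levenberg–Marquardt algorithm, fitting a hypothetical one-hidden-neuron network to synthetic one-dimensional data that stands in for the five water-quality inputs and the chemical dose.

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic stand-in for (water-quality input -> chemical dose);
# the real BKWTP data set is not public.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * np.tanh(3.0 * x - 1.0) + 0.5  # target curve a tiny net can represent exactly

def residuals(p):
    # One-hidden-neuron network: w2 * tanh(w1*x + b1) + b2
    w1, b1, w2, b2 = p
    return w2 * np.tanh(w1 * x + b1) + b2 - y

# method='lm' selects the Levenberg-Marquardt algorithm, as in the paper's ANN training.
fit = least_squares(residuals, x0=[1.0, 0.0, 1.0, 0.0], method='lm')
rmse = np.sqrt(np.mean(fit.fun ** 2))
```

The sketch only shows the optimizer in action; the paper's actual network, inputs, and dose targets differ.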
Abstract: In sports, individuals and teams are typically interested
in final rankings. Final results, such as times or distances, dictate
these rankings, also known as places. Places can be further associated
with ordered random variables, commonly referred to as order
statistics. In this work, we introduce a simple yet accurate order-statistical ordinal regression function that predicts relay race places from changeover-times. We call this function the Fenton-Wilkinson
Order Statistics model. This model is built on the following educated
assumption: individual leg-times follow log-normal distributions.
Moreover, our key idea is to utilize Fenton-Wilkinson approximations of changeover-times alongside an estimator for the total number of teams, as in the well-known German tank problem. This original
place regression function is sigmoidal and thus correctly predicts
the existence of a small number of elite teams that significantly
outperform the rest of the teams. Our model also describes how place
increases linearly with changeover-time at the inflection point of the
log-normal distribution function. With real-world data from Jukola
2019, a massive orienteering relay race, the model is shown to be
highly accurate even when the size of the training set is only 5%
of the whole data set. Numerical results also show that our model
exhibits smaller place prediction root-mean-square-errors than linear
regression, mord regression and Gaussian process regression.
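The core approximation can be sketched in a few lines. This is a generic Fenton-Wilkinson moment-matching sketch, not the paper's full place-regression model, and the leg parameters below are invented for illustration.

```python
import math

def fenton_wilkinson(params):
    """Approximate the sum of independent log-normal leg-times
    LN(mu_i, sigma_i^2) by a single log-normal LN(mu_z, sigma_z^2),
    matching the exact mean and variance of the sum."""
    mean = sum(math.exp(mu + 0.5 * s2) for mu, s2 in params)
    var = sum((math.exp(s2) - 1.0) * math.exp(2.0 * mu + s2) for mu, s2 in params)
    sigma_z2 = math.log(1.0 + var / mean ** 2)
    mu_z = math.log(mean) - 0.5 * sigma_z2
    return mu_z, sigma_z2

# Example: approximating a hypothetical 3-leg changeover-time distribution.
legs = [(4.0, 0.04), (4.2, 0.05), (4.1, 0.03)]
mu_z, sigma_z2 = fenton_wilkinson(legs)
```

By construction, the approximating log-normal reproduces the exact mean and variance of the changeover-time sum.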
Abstract: Ancient books are significant culture inheritors and their background textures convey the potential history information. However, multi-style texture recovery of ancient books has received little attention. Restricted by insufficient ancient textures and complex handling process, the generation of ancient textures confronts with new challenges. For instance, training without sufficient data usually brings about overfitting or mode collapse, so some of the outputs are prone to be fake. Recently, image generation and style transfer based on deep learning are widely applied in computer vision. Breakthroughs within the field make it possible to conduct research upon multi-style texture recovery of ancient books. Under the circumstances, we proposed a network of layout analysis and image fusion system. Firstly, we trained models by using Deep Convolution Generative against Networks (DCGAN) to synthesize multi-style ancient textures; then, we analyzed layouts based on the Position Rearrangement (PR) algorithm that we proposed to adjust the layout structure of foreground content; at last, we realized our goal by fusing rearranged foreground texts and generated background. In experiments, diversified samples such as ancient Yi, Jurchen, Seal were selected as our training sets. Then, the performances of different fine-turning models were gradually improved by adjusting DCGAN model in parameters as well as structures. In order to evaluate the results scientifically, cross entropy loss function and Fréchet Inception Distance (FID) are selected to be our assessment criteria. Eventually, we got model M8 with lowest FID score. Compared with DCGAN model proposed by Radford at el., the FID score of M8 improved by 19.26%, enhancing the quality of the synthetic images profoundly.
Abstract: In this work, a training algorithm for probabilistic neural networks (PNNs) is presented. The algorithm addresses one of the major drawbacks of PNNs: the size of the hidden layer in the network. By using a cross-validation training algorithm, the number of hidden neurons is shrunk to a smaller set consisting of the most representative samples of the training set. This is done without affecting the overall architecture of the network. Performance of the network is compared against that of standard PNNs on different databases from the UCI repository. Results show an important gain in network size and performance.
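The drawback the abstract addresses can be seen in a compact PNN sketch: the pattern layer holds one Gaussian-kernel neuron per training sample, so the hidden layer grows with the training set. The 2D points and kernel width below are hypothetical toys, not the paper's data.

```python
import math

def pnn_classify(x, patterns, sigma=0.5):
    """Probabilistic neural network: one pattern-layer neuron per
    training sample (the size drawback the paper addresses); the
    class with the largest kernel-density sum wins."""
    scores = {}
    for (px, py), label in patterns:
        d2 = (x[0] - px) ** 2 + (x[1] - py) ** 2
        scores[label] = scores.get(label, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
    return max(scores, key=scores.get)

# Toy training set: 4 samples -> 4 pattern-layer neurons.
train = [((0.0, 0.0), 'a'), ((0.2, 0.1), 'a'), ((3.0, 3.0), 'b'), ((2.8, 3.1), 'b')]
pred = pnn_classify((0.1, 0.0), train)
```

A pruning algorithm like the paper's would keep only the most representative of these pattern neurons while preserving this decision rule.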
Abstract: In this paper, we demonstrate how regression curves can be used to recognize 2D non-rigid handwritten shapes. Each shape is represented by a set of non-overlapping, uniformly distributed landmarks. The underlying models utilize second-order polynomials to model the shapes within a training set. To estimate the regression models, we need to extract the coefficients that describe the variations for each shape class. Hence, a least squares method is used to estimate such models. We then proceed by training these coefficients using the Expectation-Maximization algorithm. Recognition is carried out by finding the least-error landmark displacement with respect to the model curves. Handwritten isolated Arabic characters are used to evaluate our approach.
Abstract: In this paper, we present our innovative way of determining the driving context for a driving assistance system. We invoke the fusion of all parameters that describe the context of the environment, the vehicle, and the driver to obtain the driving context. We created a training set that stores driving situation patterns and which the system consults to determine the driving situation. A machine learning algorithm predicts the driving situation. The driving situation is an input to the fission process that yields the action to be implemented when the driver needs to be informed or assisted in the given driving situation. The action may be directed towards the driver, the vehicle, or both. This is ongoing work whose goal is to offer an alternative driving assistance system for safe, green, and comfortable driving. Here, ontologies are used for knowledge representation.
Abstract: Instance selection (IS) techniques are used to reduce the data size and thereby improve the performance of data mining methods. Recently, to process very large data sets, several proposed methods divide the training set into disjoint subsets and apply IS algorithms independently to each subset. In this paper, we analyze the limitations of these methods and give our viewpoint on how to divide and conquer in the IS procedure. Then, based on the fast condensed nearest neighbor (FCNN) rule, we propose an instance selection method for large data sets within the MapReduce framework. Besides ensuring prediction accuracy and a high reduction rate, it has two desirable properties: first, it reduces the workload of the aggregation node; second, and most important, it produces the same result as the sequential version, which other parallel methods cannot achieve. We evaluate the performance of FCNN-MR on one small data set and two large data sets. The experimental results show that it is effective and practical.
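The condensation idea underlying FCNN can be illustrated with Hart's classic condensed nearest neighbor rule. This is a sketch of the general principle (keep only samples the current prototype set misclassifies), not the FCNN variant or the MapReduce implementation from the paper; the 2D points are made up.

```python
def nn_label(x, prototypes):
    """1-NN label of x among (point, label) prototypes."""
    return min(prototypes, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)[1]

def condense(train):
    """Hart's condensed nearest neighbor: repeatedly add any sample
    the current prototype set misclassifies, until the set is
    consistent with the whole training set."""
    store = [train[0]]
    changed = True
    while changed:
        changed = False
        for x, y in train[1:]:
            if nn_label(x, store) != y:
                store.append((x, y))
                changed = True
    return store

# Two well-separated toy classes: condensation keeps far fewer prototypes.
train = [((0, 0), 'a'), ((1, 0), 'a'), ((0, 1), 'a'),
         ((5, 5), 'b'), ((6, 5), 'b'), ((5, 6), 'b')]
store = condense(train)
```

The condensed set classifies every training sample correctly with 1-NN while being much smaller, which is the reduction-rate property the abstract refers to.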
Abstract: Recently, numerous documents containing large volumes of unstructured data and text have been created because of the rapid increase in the use of social media and the Internet. Usually, these documents are categorized for the convenience of users. However, the accuracy of manual categorization is not guaranteed, and such categorization requires a large amount of time and incurs huge costs. Many studies on automatic categorization have been conducted to help mitigate the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorize complex documents with multiple topics because they work on the assumption that individual documents can be categorized into single categories only. Therefore, to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, the learning process employed in these studies involves training on a multi-categorized document set. These methods therefore cannot be applied to the multi-categorization of most documents unless multi-categorized training sets built with traditional multi-categorization algorithms are provided. To overcome this limitation, in this study, we review our novel methodology for extending the category of a single-categorized document to multiple categories, and then introduce a survey-based verification scenario for estimating the accuracy of our automatic categorization methodology.
Abstract: The effects of hypertension are often lethal; thus, its early detection and prevention are very important for everybody. In this paper, a neural network (NN) model was developed and trained on a dataset of hypertension causative parameters in order to forecast the likelihood of occurrence of hypertension in patients. Our research goal was to analyze the potential of the presented NN to predict, for a period of time, the risk of hypertension or the risk of developing this disease, for patients whether or not they are currently hypertensive. The results of the analysis for a given patient can support doctors in taking proactive measures for averting the occurrence of hypertension, such as recommendations regarding the patient's behavior in order to lower his or her hypertension risk. Moreover, the paper envisages a set of three example scenarios: determining the age at which the patient becomes hypertensive, i.e., the threshold hypertensive age; analyzing what happens if the threshold hypertensive age is set to a certain age while the patient's weight is varied; and setting the ideal weight for the patient and analyzing what happens to the threshold hypertensive age.
Abstract: Near-infrared (NIR) spectroscopy has always been of great interest in the food and agriculture industries. The development of prediction models has facilitated the estimation process in recent years. In this study, 110 crude palm oil (CPO) samples were used to build a free fatty acid (FFA) prediction model. 60% of the collected data were used for training and the remaining 40% for testing. The visible peaks in the NIR spectrum were at 1725 nm and 1760 nm, indicating the existence of the first overtone of C-H bands. Principal component regression (PCR) was applied to the data in order to build this mathematical prediction model. The optimal number of principal components was 10. The results showed R2 = 0.7147 for the training set and R2 = 0.6404 for the testing set.
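PCR as described here can be sketched with a synthetic low-rank stand-in for the CPO spectra (the real data and wavelengths are not reproduced): project the centered predictors onto the top principal components, then fit ordinary least squares on the scores.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic "spectra": 110 samples x 20 wavelengths with 10 latent factors,
# a hypothetical stand-in for the 110 CPO samples.
latent = rng.normal(size=(110, 10))
loadings = rng.normal(size=(10, 20))
X = latent @ loadings + 0.05 * rng.normal(size=(110, 20))
y = latent @ rng.normal(size=10) + 0.1 * rng.normal(size=110)

n_train = int(0.6 * len(X))            # 60/40 split, as in the study
Xtr, Xte, ytr, yte = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

def pcr_fit(X, y, k):
    """Principal component regression: project onto the top-k
    principal components, then least squares on the scores."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[:k].T                       # loadings for k components
    scores = (X - mu) @ V
    beta, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    return mu, V, beta, y.mean()

def pcr_predict(model, X):
    mu, V, beta, ybar = model
    return (X - mu) @ V @ beta + ybar

model = pcr_fit(Xtr, ytr, k=10)        # 10 components, as in the abstract
pred = pcr_predict(model, Xte)
r2 = 1 - np.sum((yte - pred) ** 2) / np.sum((yte - yte.mean()) ** 2)
```

Because the synthetic data is 10-factor by construction, 10 components recover it well; on real spectra the optimal component count must be tuned, as the study did.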
Abstract: Acidity (citric acid) is one of the chemical contents that reflects the internal quality of tomato and serves as a maturity index. The titratable acidity (%TA) can be predicted by a non-destructive method using transmittance short-wavelength near-infrared (SW-NIR) spectroscopy in the wavelength range between 665 and 955 nm. A set of 167 tomato samples, divided into a training set of 117 samples and a test set of 50 samples, was used to establish a calibration model to predict and measure %TA by the partial least squares regression (PLSR) technique. The spectra were pretreated with multiplicative scatter correction (MSC), which gave the optimal result for the calibration model (R = 0.92, RMSEC = 0.03%), and the model also achieved high accuracy for %TA prediction on the test set (R = 0.81, RMSEP = 0.05%). The prediction results on the test set show that the transmittance SW-NIR spectroscopy technique can be used as a non-destructive method for %TA prediction of tomato.
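The MSC pretreatment mentioned above can be sketched as follows. The toy spectra are synthetic (the tomato spectra are not public), and this is the standard regress-on-the-mean form of MSC.

```python
import numpy as np

def msc(spectra):
    """Multiplicative scatter correction: regress each spectrum on the
    mean spectrum (x ~ a + b*m) and return the corrected (x - a) / b,
    removing per-sample additive offset and multiplicative scatter."""
    m = spectra.mean(axis=0)
    out = np.empty_like(spectra, dtype=float)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(m, x, 1)     # slope b, intercept a
        out[i] = (x - a) / b
    return out

# Toy spectra: one reference shape with per-sample offset and scale.
base = np.sin(np.linspace(0, 3, 30))
raw = np.stack([1.5 * base + 0.2, 0.7 * base - 0.1, base])
corrected = msc(raw)
```

After correction, spectra that differ only in offset and scale collapse onto a common shape, which is why MSC often improves PLSR calibrations on transmittance spectra.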
Abstract: In this paper, we report the quantitative structure-activity relationship of novel bis-triazole derivatives for predicting their activity profile. The full model encompassed a dataset of 46 bis-triazoles. The Tripos Sybyl X 2.0 program was used to conduct CoMSIA QSAR modeling. The Partial Least-Squares (PLS) method was used to conduct the statistical analysis and to derive a QSAR model based on the field values of the CoMSIA descriptors. The compounds were divided into a test and a training set, and they were evaluated with various CoMSIA parameters to find the best QSAR model. An optimum number of components was first determined separately by cross-validated regression for the CoMSIA model and then applied in the final analysis. A series of parameters was used for the study, and the best-fit model was obtained using the donor, partition coefficient, and steric parameters. The CoMSIA models demonstrated good statistical results, with a regression coefficient (r2) and a cross-validated coefficient (q2) of 0.575 and 0.830, respectively. The standard error for the predicted model was 0.16322. In the CoMSIA model, the steric descriptors make a marginally larger contribution than the electrostatic descriptors. The finding that the steric descriptor is the largest contributor to the CoMSIA QSAR models is consistent with the observation that more than half of the binding-site area is occupied by steric regions.
Abstract: Red blood cells (RBCs) are among the most commonly and intensively studied types of blood cells in cell biology. Anemia, a lack of RBCs, is characterized by a hemoglobin level below the normal range. In this study, an image-processing-based methodology was developed to localize and extract RBCs from microscopic images. A machine learning approach was also adopted to classify the localized anemic RBC images. Several textural and geometrical features were calculated for each extracted RBC. The training set of features was analyzed using principal component analysis (PCA). With the proposed method, RBCs were isolated in 4.3 seconds from an image containing 18 to 27 cells. The reasons for using PCA are its low computational complexity and its suitability for finding the most discriminating features, which can lead to accurate classification decisions. Our classifiers yielded accuracy rates of 100%, 99.99%, and 96.50% for the K-nearest neighbor (K-NN) algorithm, the support vector machine (SVM), and the radial basis function neural network (RBFNN), respectively. Classification was also evaluated in terms of sensitivity, specificity, and the kappa statistic. In conclusion, the classification results were obtained within a short time period, and the results improved when PCA was used.
Abstract: Short-Term Load Forecasting (STLF) plays an important role in the economic and secure operation of power systems. In this paper, a Continuous Genetic Algorithm (CGA) is employed to evolve the optimum large neural network structure and connection weights for the one-day-ahead electric load forecasting problem. This study describes the process of developing three-layer feed-forward large neural networks for load forecasting and then presents a heuristic search algorithm for performing an important task of this process, i.e., optimal network structure design. The proposed method is applied to STLF for the local utility. The data are clustered due to the differences in their characteristics. Special days are extracted from the normal training sets and handled separately. In this way, a solution is provided for all load types, including working days, weekends, and special days. We find good performance for the large neural networks. The proposed methodology consistently gives lower percentage errors. Thus, it can be applied to automatically design an optimal load forecaster based on historical data.
Abstract: The goal of a network-based intrusion detection system is to classify network traffic activities into two major categories: normal and attack (intrusive) activities. Nowadays, data mining and machine learning play an important role in many sciences, including intrusion detection systems (IDS), using both supervised and unsupervised techniques. One of the essential steps of data mining is feature selection, which helps improve the efficiency, performance, and prediction rate of the proposed approach. This paper applies the unsupervised K-means clustering algorithm with information gain (IG) for feature selection and reduction to build a network intrusion detection system. For our experimental analysis, we used the new NSL-KDD dataset, a modified version of the KDDCup 1999 intrusion detection benchmark dataset. With a split of 60.0% for the training set and the remainder for the testing set, a two-class classification (Normal, Attack) was implemented. The Weka framework, a Java-based open-source collection of machine learning algorithms for data mining tasks, was used in the testing process. The experimental results show that the proposed approach is very accurate, with a low false positive rate and a high true positive rate, and it takes less learning time than using the full feature set of the dataset with the same algorithm.
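The information-gain ranking used for feature selection can be sketched in pure Python. The toy "traffic records" below are invented and far simpler than NSL-KDD's 41 features; they only show how IG separates an informative feature from an uninformative one.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of a discrete feature w.r.t. the class labels:
    H(labels) - sum_v p(v) * H(labels | feature = v)."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        sub = [l for f, l in zip(feature, labels) if f == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

# Toy records: 'protocol' perfectly separates normal vs. attack,
# while 'flag' carries no information about the class.
labels   = ['normal', 'normal', 'attack', 'attack']
protocol = ['tcp', 'tcp', 'udp', 'udp']
flag     = ['SF', 'S0', 'SF', 'S0']
```

Ranking features by `info_gain` and keeping the top-scoring ones is the reduction step the abstract describes, applied before training the detector.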
Abstract: The quality of short-term load forecasting can improve the efficiency of planning and operation of electric utilities. Artificial Neural Networks (ANNs) are employed for nonlinear short-term load forecasting owing to their powerful nonlinear mapping capabilities. At present, there is no systematic methodology for the optimal design and training of an artificial neural network; one often has to resort to trial and error. This paper describes the process of developing three-layer feed-forward large neural networks for short-term load forecasting and then presents a heuristic search algorithm for performing an important task of this process, i.e., optimal network structure design. Particle Swarm Optimization (PSO) is used to develop the optimum large neural network structure and connection weights for the one-day-ahead electric load forecasting problem. PSO is a stochastic optimization method based on swarm intelligence with a powerful global optimization ability. Employing PSO algorithms in the design and training of ANNs allows the ANN architecture and parameters to be easily optimized. The proposed method is applied to STLF for the local utility. The data are clustered due to the differences in their characteristics. Special days are extracted from the normal training sets and handled separately. In this way, a solution is provided for all load types, including working days, weekends, and special days. The experimental results show that the proposed method optimized by PSO can accelerate the learning of the network and improve the forecasting precision compared with the conventional Back Propagation (BP) method. Moreover, it is not only simple to compute but also practical and effective. It provides a greater degree of accuracy in many cases and consistently gives lower percentage errors for the STLF problem than the BP method. Thus, it can be applied to automatically design an optimal load forecaster based on historical data.
Abstract: A new approach to promoting the generalization ability of neural networks is presented. It is based on the point of view of fuzzy theory. The approach is implemented by shrinking or magnifying the input vector, thereby reducing the difference between the training set and the testing set. It is called the "shrinking-magnifying approach" (SMA). At the same time, a new algorithm, the α-algorithm, is presented to find the appropriate shrinking-magnifying factor (SMF) α and obtain better generalization ability for neural networks. A number of simulation experiments serve to study the effects of SMA and the α-algorithm. The experimental results are discussed in detail, and the working principle of SMA is analyzed theoretically. The results of the experiments and analyses show that the new approach is not only simpler and easier but also very effective for many neural networks and many classification problems. In our experiments, the proportion of cases in which the generalization ability of the neural networks improved reached up to 90%.
Abstract: Applying knowledge discovery techniques to unstructured text is termed knowledge discovery in text (KDT), text data mining, or text mining. The decision tree approach is most useful for classification problems. With this technique, a tree is constructed to model the classification process. There are two basic steps in the technique: building the tree and applying the tree to the database. This paper describes a proposed C5.0 classifier that performs ruleset generation, cross-validation, and boosting on top of the original C5.0 in order to reduce the error rate. The feasibility and the benefits of the proposed approach are demonstrated on a medical data set, hypothyroid. It is shown that the performance of a classifier on the training cases from which it was constructed gives a poor estimate of its accuracy; whether by sampling or by using a separate test file, the classifier should instead be evaluated on cases that were not used to build it. If the cases in hypothyroid.data and hypothyroid.test were shuffled and divided into a new 2772-case training set and a 1000-case test set, C5.0 might construct a different classifier with a lower or higher error rate on the test cases. An important feature of See5 is its ability to generate classifiers called rulesets. The ruleset has an error rate of 0.5% on the test cases. The standard errors of the means provide an estimate of the variability of the results. One way to get a more reliable estimate of predictive accuracy is f-fold cross-validation: the error rate of a classifier produced from all the cases is estimated as the ratio of the total number of errors on the hold-out cases to the total number of cases. The Boost option with x trials instructs See5 to construct up to x classifiers in this manner. Trials over numerous datasets, large and small, show that on average 10-classifier boosting reduces the error rate on test cases by about 25%.
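The f-fold cross-validation error estimate described above can be sketched directly. The stand-in majority-class "classifier" below is hypothetical and exists only so the procedure runs end to end; See5/C5.0 would build a decision tree or ruleset at each fold instead.

```python
from collections import Counter

def f_fold_error(cases, f, train_and_classify):
    """f-fold cross-validation as described for See5/C5.0: split the
    cases into f blocks, hold out each block in turn, and estimate the
    error rate as total errors on hold-out cases over total cases."""
    folds = [cases[i::f] for i in range(f)]
    errors = 0
    for i in range(f):
        held_out = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        predict = train_and_classify(train)
        errors += sum(1 for x, y in held_out if predict(x) != y)
    return errors / len(cases)

def majority(train):
    """Hypothetical stand-in learner: always predict the majority class."""
    top = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: top

# 7 'pos' and 3 'neg' cases: majority always predicts 'pos',
# so all 3 'neg' hold-out cases are errors -> estimated rate 0.3.
cases = [(i, 'pos' if i < 7 else 'neg') for i in range(10)]
err = f_fold_error(cases, 5, majority)
```

Every case is held out exactly once, so the estimate uses each case for evaluation without ever training on it, which is the point the abstract makes about separate test data.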
Abstract: In this paper, a back-propagation artificial neural network (BPANN) is employed to predict the limiting drawing ratio (LDR) of the deep drawing process. To prepare a training set for the BPANN, a number of finite element simulations were carried out. Die and punch radius, die arc radius, friction coefficient, sheet thickness, yield strength of the sheet, and strain hardening exponent were used as the input data, and the LDR was the specified output used in training the neural network. Given these parameters, the trained network is able to estimate the LDR for any new condition. Comparing the FEM and BPANN results, an acceptable correlation was found.
Abstract: Traditional principal component analysis (PCA) techniques for face recognition are based on batch-mode training using a pre-available image set. Real-world applications require that the training set be dynamic and of an evolving nature, where, within the framework of continuous learning, new training images are continuously added to the original set; this would trigger a costly continuous re-computation of the eigenspace representation via repeating an entire batch-based training that includes the old and new images. Incremental PCA methods allow adding new images and updating the PCA representation. In this paper, two incremental PCA approaches, CCIPCA and IPCA, are examined and compared. In addition, different learning and testing strategies are proposed and applied to the two algorithms. The results suggest that batch PCA is inferior to both incremental approaches, and that all CCIPCA variants are practically equivalent.
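The CCIPCA update can be sketched for the first eigenvector. This follows the standard candid covariance-free recursion on synthetic 2D data (no face images), and omits the amnesic parameter often used in practice.

```python
import numpy as np

def ccipca_first_component(X):
    """Candid covariance-free incremental PCA (CCIPCA) update for the
    leading eigenvector: v_n = (n-1)/n * v + 1/n * x * (x . v / ||v||),
    processing one sample at a time with no covariance matrix."""
    v = X[0].astype(float)
    for n, x in enumerate(X[1:], start=2):
        v = (n - 1) / n * v + (x @ v / np.linalg.norm(v)) * x / n
    return v / np.linalg.norm(v)

# Toy data dominated by the (1, 0) direction: the incremental estimate
# should align with it without ever forming a covariance matrix.
rng = np.random.default_rng(0)
X = np.outer(rng.normal(size=500), [3.0, 0.0]) + 0.1 * rng.normal(size=(500, 2))
v = ccipca_first_component(X)
```

Each new sample refines the estimate in place, which is what lets CCIPCA absorb newly added training images without the full eigenspace re-computation the abstract describes.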