Abstract: Problems arising from unbalanced data sets
commonly appear in real-world applications. Due to unequal class
distribution, many researchers have found that the performance of
existing classifiers tends to be biased towards the majority class. The
nonparametric k-nearest neighbors discriminant analysis is a method
that was proposed for classifying unbalanced classes with good
performance. This study uses discriminant analysis methods to
investigate misclassification error rates for class-imbalanced
data on three diabetes risk groups. The purpose of this
study was to compare the classification performance between
parametric discriminant analysis and nonparametric discriminant
analysis in a three-class classification of class-imbalanced data of
diabetes risk groups. Data from a project maintaining healthy
conditions for 599 employees of a government hospital in Bangkok
were obtained for the classification problem. The employees were
divided into three diabetes risk groups: non-risk (90%), risk (5%),
and diabetic (5%). The original data, comprising the variables
diabetes risk group, age, gender, blood glucose, and BMI, were
analyzed and then bootstrapped into 50 and 100 samples of 599
observations each to obtain additional estimates of the
misclassification error rate. Each data set was checked for
departure from multivariate normality and for equality of the
covariance matrices of the three risk groups. Both the original data
and the bootstrap samples showed non-normality and unequal
covariance matrices. The parametric linear
discriminant function, quadratic discriminant function, and the
nonparametric k-nearest neighbors discriminant function were
fitted to the 50 and 100 bootstrap samples and applied to the
original data. In searching for the optimal classification rule, the
prior probabilities were set to equal proportions (0.33:0.33:0.33)
and to unequal proportions (0.90:0.05:0.05), (0.80:0.10:0.10),
and (0.70:0.15:0.15). The results from 50 and 100 bootstrap samples
indicated that the k-nearest neighbors approach with k=3 or k=4 and
prior probabilities for non-risk:risk:diabetic of 0.90:0.05:0.05
or 0.80:0.10:0.10 gave the smallest misclassification error
rate. The k-nearest neighbors approach is therefore suggested for
classifying three-class imbalanced data on diabetes risk groups.
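The k-nearest neighbors rule with prior probabilities and the
bootstrap error estimate described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not the
study's implementation: the class score prior * (neighbors in class /
class size) is one common formulation of a prior-weighted k-NN rule,
the out-of-bag evaluation is an assumed way of scoring each bootstrap
sample, and the data, the function names, and the 180/10/10 class
sizes are synthetic and hypothetical.

```python
import math
import random
from collections import Counter


def knn_prior_classify(train_x, train_y, x, k, priors):
    """Assign x to the class maximizing prior * (neighbors in class / class size)."""
    class_sizes = Counter(train_y)
    # Indices of the k training points closest to x (Euclidean distance).
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    scores = {c: priors[c] * votes[c] / class_sizes[c] for c in class_sizes}
    return max(scores, key=scores.get)


def bootstrap_error(xs, ys, k, priors, n_boot, seed=0):
    """Mean misclassification rate over bootstrap samples.

    Assumption: each bootstrap sample is scored on the observations it
    left out (out-of-bag); the abstract does not say how errors were scored.
    """
    rng = random.Random(seed)
    n = len(xs)
    rates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        wrong = sum(knn_prior_classify(bx, by, xs[i], k, priors) != ys[i]
                    for i in oob)
        rates.append(wrong / len(oob))
    return sum(rates) / len(rates)


def make_toy_data(seed=1):
    """Synthetic two-feature data with a 90/5/5 class split (not the study data)."""
    rng = random.Random(seed)
    centers = {"non-risk": (0.0, 0.0), "risk": (3.0, 3.0), "diabetic": (6.0, 0.0)}
    sizes = {"non-risk": 180, "risk": 10, "diabetic": 10}
    xs, ys = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(sizes[label]):
            xs.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            ys.append(label)
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_toy_data()
    priors = {"non-risk": 0.90, "risk": 0.05, "diabetic": 0.05}
    for k in (3, 4):
        print(k, bootstrap_error(xs, ys, k, priors, n_boot=20))
```

Dividing the neighbor count by the class size keeps a large class
from winning only because it has more points nearby; the prior then
reintroduces whatever class weighting the analyst chooses.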
Abstract: This paper presents a machine learning methodology
for a short-term rain forecasting system. Decision
Tree, Artificial Neural Network (ANN), and Support Vector Machine
(SVM) were applied to develop classification and prediction models
for rainfall forecasts. The goals of this presentation are to
demonstrate (1) how feature selection can be used to identify the
relationships between rainfall occurrences and other weather
conditions and (2) what models can be developed and deployed to
produce accurate rainfall estimates that support decisions to
launch cloud seeding operations in the northeastern part of
Thailand. Datasets were collected during 2004-2006 from the
Chalermprakiat Royal Rain Making Research Center at Hua Hin,
Prachuap Khiri Khan, the Chalermprakiat Royal Rain Making
Research Center at Pimai, Nakhon Ratchasima, and the Thai
Meteorological Department (TMD). A total of 179 records with 57
features were merged and matched by unique date. There are three
main parts in this work. Firstly, a decision tree induction algorithm
(C4.5) was used to classify the rain status as either rain or no-rain.
The classification tree achieved an overall accuracy of 94.41% under
five-fold cross-validation. The C4.5 algorithm was also used to
classify the rain amount into three classes, no-rain (0-0.1 mm),
few-rain (0.1-10 mm), and moderate-rain (>10 mm), with an overall
accuracy of 62.57%. Secondly, an ANN
was applied to predict the rainfall amount, and the root mean square
error (RMSE) was used to measure the training and testing errors of
the ANN. The ANN yielded a lower RMSE (0.171) for daily rainfall
estimates than for next-day and next-2-day
estimates. Thirdly, the ANN and SVM techniques were also used to
classify the rain amount into the same three classes: no-rain,
few-rain, and moderate-rain. For same-day prediction, the ANN and
SVM models achieved overall accuracies of 68.15% and 69.10%,
respectively. These results allow a comparison of the predictive
power of different methods for rainfall estimation.
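The evaluation measures named above (three rain-amount classes,
RMSE, and accuracy under five-fold cross-validation) can be sketched
as follows. This is a minimal illustration, not the paper's code: the
function names are hypothetical, the handling of the shared class
boundaries (exactly 0.1 mm counted as no-rain, exactly 10 mm as
few-rain) is an assumption, accuracy is pooled over all held-out
records, and the majority-class baseline merely stands in for C4.5,
the ANN, or the SVM.

```python
import math
import random
from collections import Counter


def rain_class(amount_mm):
    """Map a rainfall amount in mm to one of the three classes.

    Assumption: boundary values fall into the lower class.
    """
    if amount_mm <= 0.1:
        return "no-rain"
    if amount_mm <= 10.0:
        return "few-rain"
    return "moderate-rain"


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy(xs, ys, fit, predict, k=5, seed=0):
    """Accuracy pooled over the k held-out folds.

    fit(train_x, train_y) returns a model; predict(model, x) returns a label.
    """
    correct = 0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(model, xs[i]) == ys[i] for i in held_out)
    return correct / len(xs)


def fit_majority(train_x, train_y):
    """Placeholder learner: remember the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    """Placeholder predictor: always return the remembered label."""
    return model


if __name__ == "__main__":
    amounts = [0.0, 0.05, 2.5, 12.0, 0.0, 7.1, 30.2, 0.0, 0.0, 1.3]  # made-up values
    labels = [rain_class(a) for a in amounts]
    features = [[a] for a in amounts]  # dummy features, ignored by the baseline
    print(cv_accuracy(features, labels, fit_majority, predict_majority))
```

A majority-class baseline of this kind is a useful reference point
when reading overall accuracy on unevenly sized classes.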