Abstract: The problems arising from unbalanced data sets
generally appear in real world applications. Due to unequal class
distribution, many researchers have found that the performance of
existing classifiers tends to be biased towards the majority class. The
k-nearest neighbors’ nonparametric discriminant analysis is a method
that was proposed for classifying unbalanced classes with good
performance. In this study, the methods of discriminant analysis are
of interest in investigating misclassification error rates for classimbalanced
data of three diabetes risk groups. The purpose of this
study was to compare the classification performance between
parametric discriminant analysis and nonparametric discriminant
analysis in a three-class classification of class-imbalanced data of
diabetes risk groups. Data from a project maintaining healthy
conditions for 599 employees of a government hospital in Bangkok
were obtained for the classification problem. The employees were
divided into three diabetes risk groups: non-risk (90%), risk (5%),
and diabetic (5%). The original data including the variables of
diabetes risk group, age, gender, blood glucose, and BMI were
analyzed and bootstrapped for 50 and 100 samples, 599 observations
per sample, for additional estimation of the misclassification error
rate. Each data set was explored for the departure of multivariate
normality and the equality of covariance matrices of the three risk
groups. Both the original data and the bootstrap samples showed nonnormality
and unequal covariance matrices. The parametric linear
discriminant function, quadratic discriminant function, and the
nonparametric k-nearest neighbors’ discriminant function were
performed over 50 and 100 bootstrap samples and applied to the
original data. Searching the optimal classification rule, the choices of
prior probabilities were set up for both equal proportions (0.33: 0.33:
0.33) and unequal proportions of (0.90:0.05:0.05), (0.80: 0.10: 0.10)
and (0.70, 0.15, 0.15). The results from 50 and 100 bootstrap samples
indicated that the k-nearest neighbors approach when k=3 or k=4 and
the defined prior probabilities of non-risk: risk: diabetic as 0.90:
0.05:0.05 or 0.80:0.10:0.10 gave the smallest error rate of
misclassification. The k-nearest neighbors approach would be
suggested for classifying a three-class-imbalanced data of diabetes
risk groups.
Abstract: The two common approaches to Structural Equation Modeling (SEM) are the Covariance-Based SEM (CB-SEM) and Partial Least Squares SEM (PLS-SEM). There is much debate on the performance of CB-SEM and PLS-SEM for small sample size and when distributions are nonnormal. This study evaluates the performance of CB-SEM and PLS-SEM under normality and nonnormality conditions via a simulation. Monte Carlo Simulation in R programming language was employed to generate data based on the theoretical model with one endogenous and four exogenous variables. Each latent variable has three indicators. For normal distributions, CB-SEM estimates were found to be inaccurate for small sample size while PLS-SEM could produce the path estimates. Meanwhile, for a larger sample size, CB-SEM estimates have lower variability compared to PLS-SEM. Under nonnormality, CB-SEM path estimates were inaccurate for small sample size. However, CB-SEM estimates are more accurate than those of PLS-SEM for sample size of 50 and above. The PLS-SEM estimates are not accurate unless sample size is very large.