Abstract: In this study, a multivariate analysis of potato spectroscopic data was presented to detect the presence of brown rot disease or not. Near-Infrared (NIR) spectroscopy (1,350-2,500 nm) combined with multivariate analysis was used as a rapid, non-destructive technique for the detection of brown rot disease in potatoes. Spectral measurements were performed in 565 samples, which were chosen randomly at the infection place in the potato slice. In this study, 254 infected and 311 uninfected (brown rot-free) samples were analyzed using different advanced statistical analysis techniques. The discrimination performance of different multivariate analysis techniques, including classification, pre-processing, and dimension reduction, were compared. Applying a random forest algorithm classifier with different pre-processing techniques to raw spectra had the best performance as the total classification accuracy of 98.7% was achieved in discriminating infected potatoes from control.
Abstract: A major challenge in medical studies, especially those that are longitudinal, is the problem of missing measurements which hinders the effective application of many machine learning algorithms. Furthermore, recent Alzheimer's Disease studies have focused on the delineation of Early Mild Cognitive Impairment (EMCI) and Late Mild Cognitive Impairment (LMCI) from cognitively normal controls (CN) which is essential for developing effective and early treatment methods. To address the aforementioned challenges, this paper explores the potential of using the eXtreme Gradient Boosting (XGBoost) algorithm in handling missing values in multiclass classification. We seek a generalized classification scheme where all prodromal stages of the disease are considered simultaneously in the classification and decision-making processes. Given the large number of subjects (1631) included in this study and in the presence of almost 28% missing values, we investigated the performance of XGBoost on the classification of the four classes of AD, NC, EMCI, and LMCI. Using 10-fold cross validation technique, XGBoost is shown to outperform other state-of-the-art classification algorithms by 3% in terms of accuracy and F-score. Our model achieved an accuracy of 80.52%, a precision of 80.62% and recall of 80.51%, supporting the more natural and promising multiclass classification.
Abstract: Crop yield prediction is a paramount issue in
agriculture. The main idea of this paper is to find out efficient
way to predict the yield of corn based meteorological records.
The prediction models used in this paper can be classified into
model-driven approaches and data-driven approaches, according to
the different modeling methodologies. The model-driven approaches are based on crop mechanistic
modeling. They describe crop growth in interaction with their
environment as dynamical systems. But the calibration process of
the dynamic system comes up with much difficulty, because it
turns out to be a multidimensional non-convex optimization problem.
An original contribution of this paper is to propose a statistical
methodology, Multi-Scenarios Parameters Estimation (MSPE), for the
parametrization of potentially complex mechanistic models from a
new type of datasets (climatic data, final yield in many situations).
It is tested with CORNFLO, a crop model for maize growth. On the other hand, the data-driven approach for yield prediction
is free of the complex biophysical process. But it has some strict
requirements about the dataset.
A second contribution of the paper is the comparison of these
model-driven methods with classical data-driven methods. For this
purpose, we consider two classes of regression methods, methods
derived from linear regression (Ridge and Lasso Regression, Principal
Components Regression or Partial Least Squares Regression) and
machine learning methods (Random Forest, k-Nearest Neighbor,
Artificial Neural Network and SVM regression).
The dataset consists of 720 records of corn yield at county scale
provided by the United States Department of Agriculture (USDA) and
the associated climatic data. A 5-folds cross-validation process and
two accuracy metrics: root mean square error of prediction(RMSEP),
mean absolute error of prediction(MAEP) were used to evaluate the
crop prediction capacity.
The results show that among the data-driven approaches, Random
Forest is the most robust and generally achieves the best prediction
error (MAEP 4.27%). It also outperforms our model-driven approach
(MAEP 6.11%). However, the method to calibrate the mechanistic
model from dataset easy to access offers several side-perspectives.
The mechanistic model can potentially help to underline the stresses
suffered by the crop or to identify the biological parameters of interest
for breeding purposes. For this reason, an interesting perspective is
to combine these two types of approaches.
Abstract: As smartphones are equipped with various sensors,
there have been many studies focused on using these sensors to create
valuable applications. Human activity recognition is one such
application motivated by various welfare applications, such as the
support for the elderly, measurement of calorie consumption, lifestyle
and exercise patterns analyses, and so on. One of the challenges one
faces when using smartphone sensors for activity recognition is that
the number of sensors should be minimized to save battery power. In
this paper, we show that a fairly accurate classifier can be built that
can distinguish ten different activities by using only a single sensor
data, i.e., the smartphone accelerometer data. The approach that we
adopt to deal with this twelve-class problem uses various methods.
The features used for classifying these activities include not only the
magnitude of acceleration vector at each time point, but also the
maximum, the minimum, and the standard deviation of vector
magnitude within a time window. The experiments compared the
performance of four kinds of basic multi-class classifiers and the
performance of four kinds of ensemble learning methods based on
three kinds of basic multi-class classifiers. The results show that
while the method with the highest accuracy is ECOC based on
Random forest.
Abstract: Pulmonary Function Tests are important non-invasive
diagnostic tests to assess respiratory impairments and provides
quantifiable measures of lung function. Spirometry is the most
frequently used measure of lung function and plays an essential role
in the diagnosis and management of pulmonary diseases. However,
the test requires considerable patient effort and cooperation,
markedly related to the age of patients resulting in incomplete data
sets. This paper presents, a nonlinear model built using Multivariate
adaptive regression splines and Random forest regression model to
predict the missing spirometric features. Random forest based feature
selection is used to enhance both the generalization capability and the
model interpretability. In the present study, flow-volume data are
recorded for N= 198 subjects. The ranked order of feature importance
index calculated by the random forests model shows that the
spirometric features FVC, FEF25, PEF, FEF25-75, FEF50 and the
demographic parameter height are the important descriptors. A
comparison of performance assessment of both models prove that, the
prediction ability of MARS with the `top two ranked features namely
the FVC and FEF25 is higher, yielding a model fit of R2= 0.96 and
R2= 0.99 for normal and abnormal subjects. The Root Mean Square
Error analysis of the RF model and the MARS model also shows that
the latter is capable of predicting the missing values of FEV1 with a
notably lower error value of 0.0191 (normal subjects) and 0.0106
(abnormal subjects) with the aforementioned input features. It is
concluded that combining feature selection with a prediction model
provides a minimum subset of predominant features to train the
model, as well as yielding better prediction performance. This
analysis can assist clinicians with a intelligence support system in the
medical diagnosis and improvement of clinical care.