Abstract: Machine Learning and Data Mining are the two important tools for extracting useful information and knowledge from large datasets. In machine learning, classification is a wildly used technique to predict qualitative variables and is generally preferred over regression from an operational point of view. Due to the enormous increase in air pollution in various countries especially China, Air Quality Classification has become one of the most important topics in air quality research and modelling. This study aims at introducing a hybrid classification model based on information theory and Support Vector Machine (SVM) using the air quality data of four cities in China namely Beijing, Guangzhou, Shanghai and Tianjin from Jan 1, 2014 to April 30, 2016. China's Ministry of Environmental Protection has classified the daily air quality into 6 levels namely Serious Pollution, Severe Pollution, Moderate Pollution, Light Pollution, Good and Excellent based on their respective Air Quality Index (AQI) values. Using the information theory, information gain (IG) is calculated and feature selection is done for both categorical features and continuous numeric features. Then SVM Machine Learning algorithm is implemented on the selected features with cross-validation. The final evaluation reveals that the IG and SVM hybrid model performs better than SVM (alone), Artificial Neural Network (ANN) and K-Nearest Neighbours (KNN) models in terms of accuracy as well as complexity.
Abstract: As DNA microarray data contain relatively small
sample size compared to the number of genes, high dimensional
models are often employed. In high dimensional models, the selection
of tuning parameter (or, penalty parameter) is often one of the crucial
parts of the modeling. Cross-validation is one of the most common
methods for the tuning parameter selection, which selects a parameter
value with the smallest cross-validated score. However, selecting a
single value as an ‘optimal’ value for the parameter can be very
unstable due to the sampling variation since the sample sizes of
microarray data are often small. Our approach is to choose multiple candidates of tuning parameter
first, then average the candidates with different weights depending
on their performance. The additional step of estimating the weights
and averaging the candidates rarely increase the computational cost,
while it can considerably improve the traditional cross-validation. We
show that the selected value from the suggested methods often lead to
stable parameter selection as well as improved detection of significant
genetic variables compared to the tradition cross-validation via real
data and simulated data sets.