Abstract: Today, a substantial share of the world's population lives in urban areas, and this proportion will increase markedly in the coming decades. Understanding the key urbanization trends likely to unfold over the coming years is therefore crucial to implementing sustainable urban strategies. In parallel, the amount of digital data produced daily is expected to grow at an exponential rate. The analysis of varied data sets, and the applications derived from it, has great potential across crucial sectors such as healthcare, housing, transportation, energy, and education. Nevertheless, in city development, architects and urban planners appear to rely mostly on traditional, analog techniques of data collection. This paper investigates the potential of data science as a resource to assist city managers in identifying strategies that enhance the social, economic, and environmental sustainability of urban areas. Collecting new layers of information would enhance planners' capacity to understand in greater depth urban phenomena such as gentrification, land use definition, mobility, and critical infrastructural issues. Specifically, the research correlates economic, commercial, demographic, and housing data in order to define a youth economic discomfort index. This statistical composite index provides insight into the economic disadvantage of citizens aged 18 to 29, and the results show that central urban zones are more disadvantaged than peripheral ones. The city of Rome was selected as the testing ground for the investigation. The methodology applies statistical and spatial analysis to construct a composite index supporting informed, data-driven decisions in urban planning.
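The abstract does not state how the composite index is assembled. As a rough illustration only, one common construction standardizes each indicator across zones (z-scores), flips the sign of indicators where a higher value means less disadvantage, and averages the results with equal weights. The indicator names, zone labels, values, and the equal weighting below are all hypothetical placeholders, not data or choices taken from the paper.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of values to mean 0 and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant indicator carries no information about differences between zones.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_index(indicators, higher_is_worse):
    """Equal-weight average of sign-aligned z-scores, one score per zone.

    indicators: dict mapping indicator name -> list of values (one per zone)
    higher_is_worse: dict mapping indicator name -> bool
    """
    n_zones = len(next(iter(indicators.values())))
    totals = [0.0] * n_zones
    for name, values in indicators.items():
        # Align directions so that a larger score always means more discomfort.
        sign = 1.0 if higher_is_worse[name] else -1.0
        for i, z in enumerate(zscores(values)):
            totals[i] += sign * z
    return [t / len(indicators) for t in totals]


# Placeholder inputs for illustration only (NOT data from the study).
zones = ["zone_a", "zone_b", "zone_c"]
indicators = {
    "rent_to_income": [0.5, 0.3, 0.4],          # hypothetical indicator
    "youth_employment_rate": [0.4, 0.6, 0.5],   # hypothetical indicator
}
higher_is_worse = {"rent_to_income": True, "youth_employment_rate": False}

index = composite_index(indicators, higher_is_worse)
for zone, score in zip(zones, index):
    print(f"{zone}: {score:+.3f}")
```

Equal weighting is the simplest defensible choice for a sketch; a real index would need a justified weighting scheme and a check that the indicators are not redundant.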
Abstract: Predictive data analysis and modeling with machine learning techniques become challenging in the presence of too many explanatory variables, or features. Too many features are known not only to slow algorithms down but also to decrease model prediction accuracy. This study uses a housing dataset with 79 quantitative and qualitative features that describe various aspects people consider when buying a new house. The Boruta algorithm, which supports feature selection using a wrapper approach built around random forest, is used in this study. The feature selection process yields 49 confirmed features, which are then used to develop predictive random forest models. The study also explores five different data partitioning ratios, and their impact on model accuracy is captured using the coefficient of determination (R-squared) and root mean square error (RMSE).
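The evaluation loop described above (several train/test partitioning ratios, each scored with R-squared and RMSE) can be sketched as follows. This is a minimal illustration under stated assumptions: the five ratios are hypothetical because the abstract does not list them, the data are synthetic, and a one-variable least-squares line stands in for the random forest so that the sketch needs only the standard library.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def split(rows, train_fraction, seed=0):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_line(train):
    """Least-squares line; a simple stand-in for the random forest model."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


# Synthetic data for illustration only: y = 3x + 5 plus Gaussian noise.
noise = random.Random(1)
rows = [(float(x), 3.0 * x + 5.0 + noise.gauss(0.0, 10.0)) for x in range(100)]

# Hypothetical partitioning ratios (the abstract does not say which five were used).
ratios = (0.5, 0.6, 0.7, 0.8, 0.9)

results = {}
for ratio in ratios:
    train, test = split(rows, ratio)
    model = fit_line(train)
    y_true = [y for _, y in test]
    y_pred = [model(x) for x, _ in test]
    results[ratio] = (r_squared(y_true, y_pred), rmse(y_true, y_pred))

for ratio, (r2, err) in results.items():
    print(f"train fraction {ratio:.1f}: R^2 = {r2:.3f}, RMSE = {err:.2f}")
```

Both metrics are computed on the held-out part only, so smaller test sets (larger training fractions) give noisier scores; that trade-off is what comparing partitioning ratios exposes.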
Abstract: One of the biggest challenges in nonparametric
regression is the curse of dimensionality. Additive models are known
to overcome this problem by estimating only the individual additive
effects of each covariate. However, if the model is misspecified, the
accuracy of the estimator compared to the fully nonparametric one
is unknown. In this work the efficiency of completely nonparametric
regression estimators such as the Loess is compared to the estimators
that assume additivity in several situations, including additive and
non-additive regression scenarios. The comparison is done by
computing the oracle mean square error of the estimators with respect
to the true nonparametric regression function. Then, a backward
elimination selection procedure based on the Akaike Information
Criterion is proposed, which is computed from either the additive or
the nonparametric model. Simulations show that if the additive model
is misspecified, the percentage of times it fails to select important
variables can be higher than that of the fully nonparametric approach.
A dimension reduction step is included when the nonparametric estimator
cannot be computed due to the curse of dimensionality. Finally, the
Boston housing dataset is analyzed using the proposed backward
elimination procedure and the selected variables are identified.
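
A generic backward elimination loop driven by AIC can be sketched as below. It is only an outline under assumptions, not the procedure from the paper: `aic_of` is a hypothetical callable that fits a model (additive or fully nonparametric) on a given subset of variables and returns its AIC, `gaussian_aic` uses the usual Gaussian form up to an additive constant, and `toy_aic` with its residual sums of squares is invented purely to make the example run. For a smoother such as Loess, the parameter count `k` would typically be replaced by an effective number of parameters.

```python
import math


def gaussian_aic(rss, n, k):
    """AIC (up to an additive constant) for a Gaussian model with k parameters."""
    return n * math.log(rss / n) + 2 * k


def backward_elimination(variables, aic_of):
    """Drop one variable at a time while doing so lowers the AIC.

    variables: iterable of variable names
    aic_of: callable taking a tuple of names and returning the model's AIC
    """
    current = list(variables)
    best_aic = aic_of(tuple(current))
    while len(current) > 1:
        # Score every model obtained by removing exactly one variable.
        candidates = [
            (aic_of(tuple(v for v in current if v != drop)), drop)
            for drop in current
        ]
        cand_aic, drop = min(candidates)
        if cand_aic >= best_aic:
            break  # no single removal improves the criterion
        current.remove(drop)
        best_aic = cand_aic
    return current, best_aic


def toy_aic(subset, n=50):
    """Hypothetical scorer: x1 and x2 matter, x3 is nearly irrelevant."""
    rss = 100.0
    rss += 400.0 if "x1" not in subset else 0.0
    rss += 200.0 if "x2" not in subset else 0.0
    rss += 1.0 if "x3" not in subset else 0.0
    return gaussian_aic(rss, n, k=len(subset) + 1)  # +1 for the intercept


selected, best_aic = backward_elimination(["x1", "x2", "x3"], toy_aic)
print(selected, round(best_aic, 3))
```

Passing the scorer in as a function keeps the elimination loop independent of how the regression is estimated, which mirrors the idea of computing the criterion from either the additive or the nonparametric fit.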