Using Statistical Significance and Prediction to Test Long/Short Term Public Services and Patients Cohorts: A Case Study in Scotland

Health and Social care (HSc) services planning and scheduling are facing unprecedented challenges, due to the pandemic pressure and also suffer from unplanned spending that is negatively impacted by the global financial crisis. Data-driven approaches can help to improve policies, plan and design services provision schedules using algorithms that assist healthcare managers to face unexpected demands using fewer resources. The paper discusses services packing using statistical significance tests and machine learning (ML) to evaluate demands similarity and coupling. This is achieved by predicting the range of the demand (class) using ML methods such as Classification and Regression Trees (CART), Random Forests (RF), and Logistic Regression (LGR). The significance tests Chi-Squared and Student’s test are used on data over a 39 years span for which data exist for services delivered in Scotland. The demands are associated using probabilities and are parts of statistical hypotheses. These hypotheses, as their NULL part, assume that the target demand is statistically dependent on other services’ demands. This linking is checked using the data. In addition, ML methods are used to linearly predict the above target demands from the statistically found associations and extend the linear dependence of the target’s demand to independent demands forming, thus, groups of services. Statistical tests confirmed ML coupling and made the prediction statistically meaningful and proved that a target service can be matched reliably to other services while ML showed that such marked relationships can also be linear ones. Zero padding was used for missing years records and illustrated better such relationships both for limited years and for the entire span offering long-term data visualizations while limited years periods explained how well patients numbers can be related in short periods of time or that they can change over time as opposed to behaviours across more years. The prediction performance of the associations were measured using metrics such as Receiver Operating Characteristic (ROC), Area Under Curve (AUC) and Accuracy (ACC) as well as the statistical tests Chi-Squared and Student. Co-plots and comparison tables for the RF, CART, and LGR methods as well as the p-value from tests and Information Exchange (IE/MIE) measures are provided showing the relative performance of ML methods and of the statistical tests as well as the behaviour using different learning ratios. The impact of k-neighbours classification (k-NN), Cross-Correlation (CC) and C-Means (CM) first groupings was also studied over limited years and for the entire span. It was found that CART was generally behind RF and LGR but in some interesting cases, LGR reached an AUC = 0 falling below CART, while the ACC was as high as 0.912 showing that ML methods can be confused by zero-padding or by data’s irregularities or by the outliers. On average, 3 linear predictors were sufficient, LGR was found competing well RF and CART followed with the same performance at higher learning ratios. Services were packed only when a significance level (p-value) of their association coefficient was more than 0.05. Social factors relationships were observed between home care services and treatment of old people, low birth weights, alcoholism, drug abuse, and emergency admissions. The work found  that different HSc services can be well packed as plans of limited duration, across various services sectors, learning configurations, as confirmed by using statistical hypotheses.

Index t-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings

t-SNE is an embedding method that the data science community has widely used. It helps two main tasks: to display results by coloring items according to the item class or feature value; and for forensic, giving a first overview of the dataset distribution. Two interesting characteristics of t-SNE are the structure preservation property and the answer to the crowding problem, where all neighbors in high dimensional space cannot be represented correctly in low dimensional space. t-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where the cluster area is proportional to its size in number, and relationships between clusters are materialized by closeness on the embedding. This algorithm is non-parametric. The transformation from a high to low dimensional space is described but not learned. Two initializations of the algorithm would lead to two different embedding. In a forensic approach, analysts would like to compare two or more datasets using their embedding. A naive approach would be to embed all datasets together. However, this process is costly as the complexity of t-SNE is quadratic, and would be infeasible for too many datasets. Another approach would be to learn a parametric model over an embedding built with a subset of data. While this approach is highly scalable, points could be mapped at the same exact position, making them indistinguishable. This type of model would be unable to adapt to new outliers nor concept drift. This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved. The optimization process minimizes two costs, one relative to the embedding shape and the second relative to the support embedding’ match. The embedding with the support process can be repeated more than once, with the newly obtained embedding. The successive embedding can be used to study the impact of one variable over the dataset distribution or monitor changes over time. This method has the same complexity as t-SNE per embedding, and memory requirements are only doubled. For a dataset of n elements sorted and split into k subsets, the total embedding complexity would be reduced from O(n2) to O(n2/k), and the memory requirement from n2 to 2(n/k)2 which enables computation on recent laptops. The method showed promising results on a real-world dataset, allowing to observe the birth, evolution and death of clusters. The proposed approach facilitates identifying significant trends and changes, which empowers the monitoring high dimensional datasets’ dynamics.

Thresholding Approach for Automatic Detection of Pseudomonas aeruginosa Biofilms from Fluorescence in situ Hybridization Images

Pseudomonas aeruginosa is an opportunistic pathogen that forms surface-associated microbial communities (biofilms) on artificial implant devices and on human tissue. Biofilm infections are difficult to treat with antibiotics, in part, because the bacteria in biofilms are physiologically heterogeneous. One measure of biological heterogeneity in a population of cells is to quantify the cellular concentrations of ribosomes, which can be probed with fluorescently labeled nucleic acids. The fluorescent signal intensity following fluorescence in situ hybridization (FISH) analysis correlates to the cellular level of ribosomes. The goals here are to provide computationally and statistically robust approaches to automatically quantify cellular heterogeneity in biofilms from a large library of epifluorescent microscopy FISH images. In this work, the initial steps were developed toward these goals by developing an automated biofilm detection approach for use with FISH images. The approach allows rapid identification of biofilm regions from FISH images that are counterstained with fluorescent dyes. This methodology provides advances over other computational methods, allowing subtraction of spurious signals and non-biological fluorescent substrata. This method will be a robust and user-friendly approach which will enable users to semi-automatically detect biofilm boundaries and extract intensity values from fluorescent images for quantitative analysis of biofilm heterogeneity.

Social Semantic Web-Based Analytics Approach to Support Lifelong Learning

The purpose of this paper is to describe how learning analytics approaches based on social semantic web techniques can be applied to enhance the lifelong learning experiences in a connectivist perspective. For this reason, a prototype of a system called SoLearn (Social Learning Environment) that supports this approach. We observed and studied literature related to lifelong learning systems, social semantic web and ontologies, connectivism theory, learning analytics approaches and reviewed implemented systems based on these fields to extract and draw conclusions about necessary features for enhancing the lifelong learning process. The semantic analytics of learning can be used for viewing, studying and analysing the massive data generated by learners, which helps them to understand through recommendations, charts and figures their learning and behaviour, and to detect where they have weaknesses or limitations. This paper emphasises that implementing a learning analytics approach based on social semantic web representations can enhance the learning process. From one hand, the analysis process leverages the meaning expressed by semantics presented in the ontology (relationships between concepts). From the other hand, the analysis process exploits the discovery of new knowledge by means of inferring mechanism of the semantic web.

Localization of Geospatial Events and Hoax Prediction in the UFO Database

Unidentified Flying Objects (UFOs) have been an interesting topic for most enthusiasts and hence people all over the United States report such findings online at the National UFO Report Center (NUFORC). Some of these reports are a hoax and among those that seem legitimate, our task is not to establish that these events confirm that they indeed are events related to flying objects from aliens in outer space. Rather, we intend to identify if the report was a hoax as was identified by the UFO database team with their existing curation criterion. However, the database provides a wealth of information that can be exploited to provide various analyses and insights such as social reporting, identifying real-time spatial events and much more. We perform analysis to localize these time-series geospatial events and correlate with known real-time events. This paper does not confirm any legitimacy of alien activity, but rather attempts to gather information from likely legitimate reports of UFOs by studying the online reports. These events happen in geospatial clusters and also are time-based. We look at cluster density and data visualization to search the space of various cluster realizations to decide best probable clusters that provide us information about the proximity of such activity. A random forest classifier is also presented that is used to identify true events and hoax events, using the best possible features available such as region, week, time-period and duration. Lastly, we show the performance of the scheme on various days and correlate with real-time events where one of the UFO reports strongly correlates to a missile test conducted in the United States.

Simulation Data Summarization Based on Spatial Histograms

In order to analyze large-scale scientific data, research on data exploration and visualization has gained popularity. In this paper, we focus on the exploration and visualization of scientific simulation data, and define a spatial V-Optimal histogram for data summarization. We propose histogram construction algorithms based on a general binary hierarchical partitioning as well as a more specific one, the l-grid partitioning. For effective data summarization and efficient data visualization in scientific data analysis, we propose an optimal algorithm as well as a heuristic algorithm for histogram construction. To verify the effectiveness and efficiency of the proposed methods, we conduct experiments on the massive evacuation simulation data.

A Formal Approach for Instructional Design Integrated with Data Visualization for Learning Analytics

Most Virtual Learning Environments do not provide support mechanisms for the integrated planning, construction and follow-up of Instructional Design supported by Learning Analytic results. The present work aims to present an authoring tool that will be responsible for constructing the structure of an Instructional Design (ID), without the data being altered during the execution of the course. The visual interface aims to present the critical situations present in this ID, serving as a support tool for the course follow-up and possible improvements, which can be made during its execution or in the planning of a new edition of this course. The model for the ID is based on High-Level Petri Nets and the visualization forms are determined by the specific kind of the data generated by an e-course, a population of students generating sequentially dependent data.

Reinforced Concrete Bridge Deck Condition Assessment Methods Using Ground Penetrating Radar and Infrared Thermography

Reinforced concrete bridge deck condition assessments primarily use visual inspection methods, where an inspector looks for and records locations of cracks, potholes, efflorescence and other signs of probable deterioration. Sounding is another technique used to diagnose the condition of a bridge deck, however this method listens for damage within the subsurface as the surface is struck with a hammer or chain. Even though extensive procedures are in place for using these inspection techniques, neither one provides the inspector with a comprehensive understanding of the internal condition of a bridge deck – the location where damage originates from.  In order to make accurate estimates of repair locations and quantities, in addition to allocating the necessary funding, a total understanding of the deck’s deteriorated state is key. The research presented in this paper collected infrared thermography and ground penetrating radar data from reinforced concrete bridge decks without an asphalt overlay. These decks were of various ages and their condition varied from brand new, to in need of replacement. The goals of this work were to first verify that these nondestructive evaluation methods could identify similar areas of healthy and damaged concrete, and then to see if combining the results of both methods would provide a higher confidence than if the condition assessment was completed using only one method. The results from each method were presented as plan view color contour plots. The results from one of the decks assessed as a part of this research, including these plan view plots, are presented in this paper. Furthermore, in order to answer the interest of transportation agencies throughout the United States, this research developed a step-by-step guide which demonstrates how to collect and assess a bridge deck using these nondestructive evaluation methods. This guide addresses setup procedures on the deck during the day of data collection, system setups and settings for different bridge decks, data post-processing for each method, and data visualization and quantification.

Virtual 3D Environments for Image-Based Navigation Algorithms

This paper applies to the creation of virtual 3D environments for the study and development of mobile robot image based navigation algorithms and techniques, which need to operate robustly and efficiently. The test of these algorithms can be performed in a physical way, from conducting experiments on a prototype, or by numerical simulations. Current simulation platforms for robotic applications do not have flexible and updated models for image rendering, being unable to reproduce complex light effects and materials. Thus, it is necessary to create a test platform that integrates sophisticated simulated applications of real environments for navigation, with data and image processing. This work proposes the development of a high-level platform for building 3D model’s environments and the test of image-based navigation algorithms for mobile robots. Techniques were used for applying texture and lighting effects in order to accurately represent the generation of rendered images regarding the real world version. The application will integrate image processing scripts, trajectory control, dynamic modeling and simulation techniques for physics representation and picture rendering with the open source 3D creation suite - Blender.

Visual Text Analytics Technologies for Real-Time Big Data: Chronological Evolution and Issues

New approaches to analyze and visualize data stream in real-time basis is important in making a prompt decision by the decision maker. Financial market trading and surveillance, large-scale emergency response and crowd control are some example scenarios that require real-time analytic and data visualization. This situation has led to the development of techniques and tools that support humans in analyzing the source data. With the emergence of Big Data and social media, new techniques and tools are required in order to process the streaming data. Today, ranges of tools which implement some of these functionalities are available. In this paper, we present chronological evolution evaluation of technologies for supporting of real-time analytic and visualization of the data stream. Based on the past research papers published from 2002 to 2014, we gathered the general information, main techniques, challenges and open issues. The techniques for streaming text visualization are identified based on Text Visualization Browser in chronological order. This paper aims to review the evolution of streaming text visualization techniques and tools, as well as to discuss the problems and challenges for each of identified tools.

The Use of Network Theory in Heritage Cities

This paper aims to demonstrate how the use of Network Theory can be applied to a very interesting and complex urban situation: The parts of a city which may have some patrimonial value, but because of their lack of relevant architectural elements, they are not considered to be historic in a conventional sense. In this paper, we use the suburb of La Villaflora in the city of Quito, Ecuador as our case study. We first propose a system of indicators as a tool to characterize and quantify the historic value of a geographic area. Then, we apply these indicators to the suburb of La Villaflora and use Network Theory to understand and propose actions.

Modeling Bessel Beams and Their Discrete Superpositions from the Generalized Lorenz-Mie Theory to Calculate Optical Forces over Spherical Dielectric Particles

In this work, we propose an algorithm developed under Python language for the modeling of ordinary scalar Bessel beams and their discrete superpositions and subsequent calculation of optical forces exerted over dielectric spherical particles. The mathematical formalism, based on the generalized Lorenz-Mie theory, is implemented in Python for its large number of free mathematical (as SciPy and NumPy), data visualization (Matplotlib and PyJamas) and multiprocessing libraries. We also propose an approach, provided by a synchronized Software as Service (SaaS) in cloud computing, to develop a user interface embedded on a mobile application, thus providing users with the necessary means to easily introduce desired unknowns and parameters and see the graphical outcomes of the simulations right at their mobile devices. Initially proposed as a free Android-based application, such an App enables data post-processing in cloud-based architectures and visualization of results, figures and numerical tables.

Comparative Analysis of Diverse Collection of Big Data Analytics Tools

Over the past era, there have been a lot of efforts and studies are carried out in growing proficient tools for performing various tasks in big data. Recently big data have gotten a lot of publicity for their good reasons. Due to the large and complex collection of datasets it is difficult to process on traditional data processing applications. This concern turns to be further mandatory for producing various tools in big data. Moreover, the main aim of big data analytics is to utilize the advanced analytic techniques besides very huge, different datasets which contain diverse sizes from terabytes to zettabytes and diverse types such as structured or unstructured and batch or streaming. Big data is useful for data sets where their size or type is away from the capability of traditional relational databases for capturing, managing and processing the data with low-latency. Thus the out coming challenges tend to the occurrence of powerful big data tools. In this survey, a various collection of big data tools are illustrated and also compared with the salient features.

Cloud Computing Support for Diagnosing Researches

One of the main biomedical problem lies in detecting dependencies in semi structured data. Solution includes biomedical portal and algorithms (integral rating health criteria, multidimensional data visualization methods). Biomedical portal allows to process diagnostic and research data in parallel mode using Microsoft System Center 2012, Windows HPC Server cloud technologies. Service does not allow user to see internal calculations instead it provides practical interface. When data is sent for processing user may track status of task and will achieve results as soon as computation is completed. Service includes own algorithms and allows diagnosing and predicating medical cases. Approved methods are based on complex system entropy methods, algorithms for determining the energy patterns of development and trajectory models of biological systems and logical–probabilistic approach with the blurring of images.

New Findings on the User’s Preferences about Data Visualization of Online Reviews

The information visualization is still a knowledge field that lacks from a solid theory to support it and there is a myriad of existing methodologies and taxonomies that can be combined and adopted as guidelines. In this context, it is necessary to pre-evaluate as much as possible all the assumptions that are considered for its design and development. We present an exploratory study (n = 123) to detect the graphical preferences of travelers using accommodation portals of Web 2.0 (e.g. tripadvisor.com). We took into account some of the most relevant ground rules applied in the field to map visually data and design end-user interaction. Moreover, the evaluation process was completely data visualization oriented. We found out that people tend to refuse more advanced types of visualization and that a hybrid combination between radial graphs and stacked bars should be explored. In sum, this paper introduces new findings about the visual model and the cognitive response of users of accommodation booking websites.

Assessing and Visualizing the Stability of Feature Selectors: A Case Study with Spectral Data

Feature selection plays an important role in applications with high dimensional data. The assessment of the stability of feature selection/ranking algorithms becomes an important issue when the dataset is small and the aim is to gain insight into the underlying process by analyzing the most relevant features. In this work, we propose a graphical approach that enables to analyze the similarity between feature ranking techniques as well as their individual stability. Moreover, it works with whatever stability metric (Canberra distance, Spearman's rank correlation coefficient, Kuncheva's stability index,...). We illustrate this visualization technique evaluating the stability of several feature selection techniques on a spectral binary dataset. Experimental results with a neural-based classifier show that stability and ranking quality may not be linked together and both issues have to be studied jointly in order to offer answers to the domain experts.

Addressing Data Security in the Cloud

The development of information and communication technology, the increased use of the internet, as well as the effects of the recession within the last years, have lead to the increased use of cloud computing based solutions, also called on-demand solutions. These solutions offer a large number of benefits to organizations as well as challenges and risks, mainly determined by data visualization in different geographic locations on the internet. As far as the specific risks of cloud environment are concerned, data security is still considered a peak barrier in adopting cloud computing. The present study offers an approach upon ensuring the security of cloud data, oriented towards the whole data life cycle. The final part of the study focuses on the assessment of data security in the cloud, this representing the bases in determining the potential losses and the premise for subsequent improvements and continuous learning.

Determining Cluster Boundaries Using Particle Swarm Optimization

Self-organizing map (SOM) is a well known data reduction technique used in data mining. Data visualization can reveal structure in data sets that is otherwise hard to detect from raw data alone. However, interpretation through visual inspection is prone to errors and can be very tedious. There are several techniques for the automatic detection of clusters of code vectors found by SOMs, but they generally do not take into account the distribution of code vectors; this may lead to unsatisfactory clustering and poor definition of cluster boundaries, particularly where the density of data points is low. In this paper, we propose the use of a generic particle swarm optimization (PSO) algorithm for finding cluster boundaries directly from the code vectors obtained from SOMs. The application of our method to unlabeled call data for a mobile phone operator demonstrates its feasibility. PSO algorithm utilizes U-matrix of SOMs to determine cluster boundaries; the results of this novel automatic method correspond well to boundary detection through visual inspection of code vectors and k-means algorithm.

Thailand National Biodiversity Database System with webMathematica and Google Earth

National Biodiversity Database System (NBIDS) has been developed for collecting Thai biodiversity data. The goal of this project is to provide advanced tools for querying, analyzing, modeling, and visualizing patterns of species distribution for researchers and scientists. NBIDS data record two types of datasets: biodiversity data and environmental data. Biodiversity data are specie presence data and species status. The attributes of biodiversity data can be further classified into two groups: universal and projectspecific attributes. Universal attributes are attributes that are common to all of the records, e.g. X/Y coordinates, year, and collector name. Project-specific attributes are attributes that are unique to one or a few projects, e.g., flowering stage. Environmental data include atmospheric data, hydrology data, soil data, and land cover data collecting by using GLOBE protocols. We have developed webbased tools for data entry. Google Earth KML and ArcGIS were used as tools for map visualization. webMathematica was used for simple data visualization and also for advanced data analysis and visualization, e.g., spatial interpolation, and statistical analysis. NBIDS will be used by park rangers at Khao Nan National Park, and researchers.

An Intelligent Human-Computer Interaction System for Decision Support

This paper proposes a novel architecture for developing decision support systems. Unlike conventional decision support systems, the proposed architecture endeavors to reveal the decision-making process such that humans' subjectivity can be incorporated into a computerized system and, at the same time, to preserve the capability of the computerized system in processing information objectively. A number of techniques used in developing the decision support system are elaborated to make the decisionmarking process transparent. These include procedures for high dimensional data visualization, pattern classification, prediction, and evolutionary computational search. An artificial data set is first employed to compare the proposed approach with other methods. A simulated handwritten data set and a real data set on liver disease diagnosis are then employed to evaluate the efficacy of the proposed approach. The results are analyzed and discussed. The potentials of the proposed architecture as a useful decision support system are demonstrated.