Index t-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings

t-SNE is an embedding method that the data science community has widely used. It helps two main tasks: to display results by coloring items according to the item class or feature value; and for forensic, giving a first overview of the dataset distribution. Two interesting characteristics of t-SNE are the structure preservation property and the answer to the crowding problem, where all neighbors in high dimensional space cannot be represented correctly in low dimensional space. t-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where the cluster area is proportional to its size in number, and relationships between clusters are materialized by closeness on the embedding. This algorithm is non-parametric. The transformation from a high to low dimensional space is described but not learned. Two initializations of the algorithm would lead to two different embedding. In a forensic approach, analysts would like to compare two or more datasets using their embedding. A naive approach would be to embed all datasets together. However, this process is costly as the complexity of t-SNE is quadratic, and would be infeasible for too many datasets. Another approach would be to learn a parametric model over an embedding built with a subset of data. While this approach is highly scalable, points could be mapped at the same exact position, making them indistinguishable. This type of model would be unable to adapt to new outliers nor concept drift. This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved. The optimization process minimizes two costs, one relative to the embedding shape and the second relative to the support embedding’ match. The embedding with the support process can be repeated more than once, with the newly obtained embedding. The successive embedding can be used to study the impact of one variable over the dataset distribution or monitor changes over time. This method has the same complexity as t-SNE per embedding, and memory requirements are only doubled. For a dataset of n elements sorted and split into k subsets, the total embedding complexity would be reduced from O(n2) to O(n2/k), and the memory requirement from n2 to 2(n/k)2 which enables computation on recent laptops. The method showed promising results on a real-world dataset, allowing to observe the birth, evolution and death of clusters. The proposed approach facilitates identifying significant trends and changes, which empowers the monitoring high dimensional datasets’ dynamics.

6D Posture Estimation of Road Vehicles from Color Images

Currently, in the field of object posture estimation, there is research on estimating the position and angle of an object by storing a 3D model of the object to be estimated in advance in a computer and matching it with the model. However, in this research, we have succeeded in creating a module that is much simpler, smaller in scale, and faster in operation. Our 6D pose estimation model consists of two different networks – a classification network and a regression network. From a single RGB image, the trained model estimates the class of the object in the image, the coordinates of the object, and its rotation angle in 3D space. In addition, we compared the estimation accuracy of each camera position, i.e., the angle from which the object was captured. The highest accuracy was recorded when the camera position was 75°, the accuracy of the classification was about 87.3%, and that of regression was about 98.9%.

HaskellFL: A Tool for Detecting Logical Errors in Haskell

Understanding and using the functional paradigm is a challenge for many programmers. Looking for logical errors in code may take a lot of a developer’s time when a program grows in size. In order to facilitate both processes, this paper presents HaskellFL, a tool that uses fault localization techniques to locate a logical error in Haskell code. The Haskell subset used in this work is sufficiently expressive for those studying Functional Programming to get immediate help debugging their code and to answer questions about key concepts associated with the functional paradigm. HaskellFL was tested against Functional Programming assignments submitted by students enrolled at the Functional Programming class at the Federal University of Minas Gerais and against exercises from the Exercism Haskell track that are publicly available in GitHub. This work also evaluated the effectiveness of two fault localization techniques, Tarantula and Ochiai, in the Haskell context. Furthermore, the EXAM score was chosen to evaluate the tool’s effectiveness, and results showed that HaskellFL reduced the effort needed to locate an error for all tested scenarios. The results also showed that the Ochiai method was more effective than Tarantula.

A Convolutional Deep Neural Network Approach for Skin Cancer Detection Using Skin Lesion Images

Malignant Melanoma, known simply as Melanoma, is a type of skin cancer that appears as a mole on the skin. It is critical to detect this cancer at an early stage because it can spread across the body and may lead to the patient death. When detected early, Melanoma is curable. In this paper we propose a deep learning model (Convolutional Neural Networks) in order to automatically classify skin lesion images as Malignant or Benign. Images underwent certain pre-processing steps to diminish the effect of the normal skin region on the model. The result of the proposed model showed a significant improvement over previous work, achieving an accuracy of 97%.

Dimensionality Reduction in Modal Analysis for Structural Health Monitoring

Autonomous structural health monitoring (SHM) of many structures and bridges became a topic of paramount importance for maintenance purposes and safety reasons. This paper proposes a set of machine learning (ML) tools to perform automatic feature selection and detection of anomalies in a bridge from vibrational data and compare different feature extraction schemes to increase the accuracy and reduce the amount of data collected. As a case study, the Z-24 bridge is considered because of the extensive database of accelerometric data in both standard and damaged conditions. The proposed framework starts from the first four fundamental frequencies extracted through operational modal analysis (OMA) and clustering, followed by time-domain filtering (tracking). The fundamental frequencies extracted are then fed to a dimensionality reduction block implemented through two different approaches: feature selection (intelligent multiplexer) that tries to estimate the most reliable frequencies based on the evaluation of some statistical features (i.e., entropy, variance, kurtosis), and feature extraction (auto-associative neural network (ANN)) that combine the fundamental frequencies to extract new damage sensitive features in a low dimensional feature space. Finally, one-class classification (OCC) algorithms perform anomaly detection, trained with standard condition points, and tested with normal and anomaly ones. In particular, principal component analysis (PCA), kernel principal component analysis (KPCA), and autoassociative neural network (ANN) are presented and their performance are compared. It is also shown that, by evaluating the correct features, the anomaly can be detected with accuracy and an F1 score greater than 95%.

A Medical Vulnerability Scoring System Incorporating Health and Data Sensitivity Metrics

With the advent of complex software and increased connectivity, security of life-critical medical devices is becoming an increasing concern, particularly with their direct impact to human safety. Security is essential, but it is impossible to develop completely secure and impenetrable systems at design time. Therefore, it is important to assess the potential impact on security and safety of exploiting a vulnerability in such critical medical systems. The common vulnerability scoring system (CVSS) calculates the severity of exploitable vulnerabilities. However, for medical devices, it does not consider the unique challenges of impacts to human health and privacy. Thus, the scoring of a medical device on which a human life depends (e.g., pacemakers, insulin pumps) can score very low, while a system on which a human life does not depend (e.g., hospital archiving systems) might score very high. In this paper, we present a Medical Vulnerability Scoring System (MVSS) that extends CVSS to address the health and privacy concerns of medical devices. We propose incorporating two new parameters, namely health impact and sensitivity impact. Sensitivity refers to the type of information that can be stolen from the device, and health represents the impact to the safety of the patient if the vulnerability is exploited (e.g., potential harm, life threatening). We evaluate 15 different known vulnerabilities in medical devices and compare MVSS against two state-of-the-art medical device-oriented vulnerability scoring system and the foundational CVSS.

The Application of Fuzzy Set Theory to Mobile Internet Advertisement Fraud Detection

This paper presents the application of fuzzy set theory to implement of mobile advertisement anti-fraud systems. Mobile anti-fraud is a method aiming to identify mobile advertisement fraudsters. One of the main problems of mobile anti-fraud is the lack of evidence to prove a user to be a fraudster. In this paper, we implement an application by using fuzzy set theory to demonstrate how to detect cheaters. The advantage of our method is that the hardship in detecting fraudsters in small data samples has been avoided. We achieved this by giving each user a suspicious degree showing how likely the user is cheating and decide whether a group of users (like all users of a certain APP) together to be fraudsters according to the average suspicious degree. This makes the process more accurate as the data of a single user is too small to be predictable.

Reference Architecture for Intelligent Enterprise Solutions

Data in IT systems in enterprises have been growing at phenomenal pace. This has provided opportunities to run analytics to gather intelligence on key business parameters that enable them to provide better products and services to customers. While there are several Artificial Intelligence/Machine Learning (AI/ML) and Business Intelligence (BI) tools and technologies available in marketplace to run analytics, there is a need for an integrated view when developing intelligent solutions in enterprises. This paper progressively elaborates a reference model for enterprise solutions, builds an integrated view of data, information and intelligence components and presents a reference architecture for intelligent enterprise solutions. Finally, it applies the reference architecture to an insurance organization. The reference architecture is the outcome of experience and insights gathered from developing intelligent solutions for several organizations.

1/Sigma Term Weighting Scheme for Sentiment Analysis

Large amounts of data on the web can provide valuable information. For example, product reviews help business owners measure customer satisfaction. Sentiment analysis classifies texts into two polarities: positive and negative. This paper examines movie reviews and tweets using a new term weighting scheme, called one-over-sigma (1/sigma), on benchmark datasets for sentiment classification. The proposed method aims to improve the performance of sentiment classification. The results show that 1/sigma is more accurate than the popular term weighting schemes. In order to verify if the entropy reflects the discriminating power of terms, we report a comparison of entropy values for different term weighting schemes.

Lamb Wave Wireless Communication in Healthy Plates Using Coherent Demodulation

Guided ultrasonic waves are used in Non-Destructive Testing and Structural Health Monitoring for inspection and damage detection. Recently, wireless data transmission using ultrasonic waves in solid metallic channels has gained popularity in some industrial applications such as nuclear, aerospace and smart vehicles. The idea is to find a good substitute for electromagnetic waves since they are highly attenuated near metallic components due to Faraday shielding. The proposed solution is to use ultrasonic guided waves such as Lamb waves as an information carrier due to their capability of propagation for long distances. In addition to this, valuable information about the health of the structure could be extracted simultaneously. In this work, the reliable frequency bandwidth for communication is extracted experimentally from dispersion curves at first. Then, an experimental platform for wireless communication using Lamb waves is described and built. After this, coherent demodulation algorithm used in telecommunications is tested for Amplitude Shift Keying, On-Off Keying and Binary Phase Shift Keying modulation techniques. Signal processing parameters such as threshold choice, number of cycles per bit and Bit Rate are optimized. Experimental results are compared based on the average bit error percentage. Results has shown high sensitivity to threshold selection for Amplitude Shift Keying and On-Off Keying techniques resulting a Bit Rate decrease. Binary Phase Shift Keying technique shows the highest stability and data rate between all tested modulation techniques.

Speedup Breadth-First Search by Graph Ordering

Breadth-First Search (BFS) is a core graph algorithm that is widely used for graph analysis. As it is frequently used in many graph applications, improving the BFS performance is essential. In this paper, we present a graph ordering method that could reorder the graph nodes to achieve better data locality, thus, improving the BFS performance. Our method is based on an observation that the sibling relationships will dominate the cache access pattern during the BFS traversal. Therefore, we propose a frequency-based model to construct the graph order. First, we optimize the graph order according to the nodes’ visit frequency. Nodes with high visit frequency will be processed in priority. Second, we try to maximize the child nodes’ overlap layer by layer. As it is proved to be NP-hard, we propose a heuristic method that could greatly reduce the preprocessing overheads.We conduct extensive experiments on 16 real-world datasets. The result shows that our method could achieve comparable performance with the state-of-the-art methods while the graph ordering overheads are only about 1/15.

Machine Learning for Music Aesthetic Annotation Using MIDI Format: A Harmony-Based Classification Approach

Swimming with the tide of deep learning, the field of music information retrieval (MIR) experiences parallel development and a sheer variety of feature-learning models has been applied to music classification and tagging tasks. Among those learning techniques, the deep convolutional neural networks (CNNs) have been widespreadly used with better performance than the traditional approach especially in music genre classification and prediction. However, regarding the music recommendation, there is a large semantic gap between the corresponding audio genres and the various aspects of a song that influence user preference. In our study, aiming to bridge the gap, we strive to construct an automatic music aesthetic annotation model with MIDI format for better comparison and measurement of the similarity between music pieces in the way of harmonic analysis. We use the matrix of qualification converted from MIDI files as input to train two different classifiers, support vector machine (SVM) and Decision Tree (DT). Experimental results in performance of a tag prediction task have shown that both learning algorithms are capable of extracting high-level properties in an end-to end manner from music information. The proposed model is helpful to learn the audience taste and then the resulting recommendations are likely to appeal to a niche consumer.

Fast and Robust Long-term Tracking with Effective Searching Model

Kernelized Correlation Filter (KCF) based trackers have gained a lot of attention recently because of their accuracy and fast calculation speed. However, this algorithm is not robust in cases where the object is lost by a sudden change of direction, being obscured or going out of view. In order to improve KCF performance in long-term tracking, this paper proposes an anomaly detection method for target loss warning by analyzing the response map of each frame, and a classification algorithm for reliable target re-locating mechanism by using Random fern. Being tested with Visual Tracker Benchmark and Visual Object Tracking datasets, the experimental results indicated that the precision and success rate of the proposed algorithm were 2.92 and 2.61 times higher than that of the original KCF algorithm, respectively. Moreover, the proposed tracker handles occlusion better than many state-of-the-art long-term tracking methods while running at 60 frames per second.

Image Processing Approach for Detection of Three-Dimensional Tree-Rings from X-Ray Computed Tomography

Tree-ring analysis is an important part of the quality assessment and the dating of (archaeological) wood samples. It provides quantitative data about the whole anatomical ring structure, which can be used, for example, to measure the impact of the fluctuating environment on the tree growth, for the dendrochronological analysis of archaeological wooden artefacts and to estimate the wood mechanical properties. Despite advances in computer vision and edge recognition algorithms, detection and counting of annual rings are still limited to 2D datasets and performed in most cases manually, which is a time consuming, tedious task and depends strongly on the operator’s experience. This work presents an image processing approach to detect the whole 3D tree-ring structure directly from X-ray computed tomography imaging data. The approach relies on a modified Canny edge detection algorithm, which captures fully connected tree-ring edges throughout the measured image stack and is validated on X-ray computed tomography data taken from six wood species.

Platform-as-a-Service Sticky Policies for Privacy Classification in the Cloud

In this paper, we present a Platform-as-a-Service (PaaS) model for controlling the privacy enforcement mechanisms applied on user data when stored and processed in Cloud data centers. The proposed architecture consists of establishing user configurable ‘sticky’ policies on the Graphical User Interface (GUI) data-bound components during the application development phase to specify the details of privacy enforcement on the contents of these components. Various privacy classification classes on the data components are formally defined to give the user full control on the degree and scope of privacy enforcement including the type of execution containers to process the data in the Cloud. This not only enhances the privacy-awareness of the developed Cloud services, but also results in major savings in performance and energy efficiency due to the fact that the privacy mechanisms are solely applied on sensitive data units and not on all the user content. The proposed design is implemented in a real PaaS cloud computing environment on the Microsoft Azure platform.

Enhancing the Effectiveness of Air Defense Systems through Simulation Analysis

Air Defense Systems contain high-value assets that are expected to fulfill their mission for several years - in many cases, even decades - while operating in a fast-changing, technology-driven environment. Thus, it is paramount that decision-makers can assess how effective an Air Defense System is in the face of new developing threats, as well as to identify the bottlenecks that could jeopardize the security of the airspace of a country. Given the broad extent of activities and the great variety of assets necessary to achieve the strategic objectives, a systems approach was taken in order to delineate the core requirements and the physical architecture of an Air Defense System. Then, value-focused thinking helped in the definition of the measures of effectiveness. Furthermore, analytical methods were applied to create a formal structure that preliminarily assesses such measures. To validate the proposed methodology, a powerful simulation was also used to determine the measures of effectiveness, now in more complex environments that incorporate both uncertainty and multiple interactions of the entities. The results regarding the validity of this methodology suggest that the approach can support decisions aimed at enhancing the capabilities of Air Defense Systems. In conclusion, this paper sheds some light on how consolidated approaches of Systems Engineering and Operations Research can be used as valid techniques for solving problems regarding a complex and yet vital matter.

A Review and Comparative Analysis on Cluster Ensemble Methods

Clustering is an unsupervised learning technique for aggregating data objects into meaningful classes so that intra cluster similarity is maximized and inter cluster similarity is minimized in data mining. However, no single clustering algorithm proves to be the most effective in producing the best result. As a result, a new challenging technique known as the cluster ensemble approach has blossomed in order to determine the solution to this problem. For the cluster analysis issue, this new technique is a successful approach. The cluster ensemble's main goal is to combine similar clustering solutions in a way that achieves the precision while also improving the quality of individual data clustering. Because of the massive and rapid creation of new approaches in the field of data mining, the ongoing interest in inventing novel algorithms necessitates a thorough examination of current techniques and future innovation. This paper presents a comparative analysis of various cluster ensemble approaches, including their methodologies, formal working process, and standard accuracy and error rates. As a result, the society of clustering practitioners will benefit from this exploratory and clear research, which will aid in determining the most appropriate solution to the problem at hand.

Bit Error Rate Monitoring for Automatic Bias Control of Quadrature Amplitude Modulators

The most common quadrature amplitude modulator (QAM) applies two Mach-Zehnder Modulators (MZM) and one phase shifter to generate high order modulation format. The bias of MZM changes over time due to temperature, vibration, and aging factors. The change in the biasing causes distortion to the generated QAM signal which leads to deterioration of bit error rate (BER) performance. Therefore, it is critical to be able to lock MZM’s Q point to the required operating point for good performance. We propose a technique for automatic bias control (ABC) of QAM transmitter using BER measurements and gradient descent optimization algorithm. The proposed technique is attractive because it uses the pertinent metric, BER, which compensates for bias drifting independently from other system variations such as laser source output power. The proposed scheme performance and its operating principles are simulated using OptiSystem simulation software for 4-QAM and 16-QAM transmitters.

A Real-Time Bayesian Decision-Support System for Predicting Suspect Vehicle’s Intended Target Using a Sparse Camera Network

We present a decision-support tool to assist an operator in the detection and tracking of a suspect vehicle traveling to an unknown target destination. Multiple data sources, such as traffic cameras, traffic information, weather, etc., are integrated and processed in real-time to infer a suspect’s intended destination chosen from a list of pre-determined high-value targets. Previously, we presented our work in the detection and tracking of vehicles using traffic and airborne cameras. Here, we focus on the fusion and processing of that information to predict a suspect’s behavior. The network of cameras is represented by a directional graph, where the edges correspond to direct road connections between the nodes and the edge weights are proportional to the average time it takes to travel from one node to another. For our experiments, we construct our graph based on the greater Los Angeles subset of the Caltrans’s “Performance Measurement System” (PeMS) dataset. We propose a Bayesian approach where a posterior probability for each target is continuously updated based on detections of the suspect in the live video feeds. Additionally, we introduce the concept of ‘soft interventions’, inspired by the field of Causal Inference. Soft interventions are herein defined as interventions that do not immediately interfere with the suspect’s movements; rather, a soft intervention may induce the suspect into making a new decision, ultimately making their intent more transparent. For example, a soft intervention could be temporarily closing a road a few blocks from the suspect’s current location, which may require the suspect to change their current course. The objective of these interventions is to gain the maximum amount of information about the suspect’s intent in the shortest possible time. Our system currently operates in a human-on-the-loop mode where at each step, a set of recommendations are presented to the operator to aid in decision-making. In principle, the system could operate autonomously, only prompting the operator for critical decisions, allowing the system to significantly scale up to larger areas and multiple suspects. Once the intended target is identified with sufficient confidence, the vehicle is reported to the authorities to take further action. Other recommendations include a selection of road closures, i.e., soft interventions, or to continue monitoring. We evaluate the performance of the proposed system using simulated scenarios where the suspect, starting at random locations, takes a noisy shortest path to their intended target. In all scenarios, the suspect’s intended target is unknown to our system. The decision thresholds are selected to maximize the chances of determining the suspect’s intended target in the minimum amount of time and with the smallest number of interventions. We conclude by discussing the limitations of our current approach to motivate a machine learning approach, based on reinforcement learning in order to relax some of the current limiting assumptions.

Methodology for the Multi-Objective Analysis of Data Sets in Freight Delivery

Data flow and the purpose of reporting the data are different and dependent on business needs. Different parameters are reported and transferred regularly during freight delivery. This business practices form the dataset constructed for each time point and contain all required information for freight moving decisions. As a significant amount of these data is used for various purposes, an integrating methodological approach must be developed to respond to the indicated problem. The proposed methodology contains several steps: (1) collecting context data sets and data validation; (2) multi-objective analysis for optimizing freight transfer services. For data validation, the study involves Grubbs outliers analysis, particularly for data cleaning and the identification of statistical significance of data reporting event cases. The Grubbs test is often used as it measures one external value at a time exceeding the boundaries of standard normal distribution. In the study area, the test was not widely applied by authors, except when the Grubbs test for outlier detection was used to identify outsiders in fuel consumption data. In the study, the authors applied the method with a confidence level of 99%. For the multi-objective analysis, the authors would like to select the forms of construction of the genetic algorithms, which have more possibilities to extract the best solution. For freight delivery management, the schemas of genetic algorithms' structure are used as a more effective technique. Due to that, the adaptable genetic algorithm is applied for the description of choosing process of the effective transportation corridor. In this study, the multi-objective genetic algorithm methods are used to optimize the data evaluation and select the appropriate transport corridor. The authors suggest a methodology for the multi-objective analysis, which evaluates collected context data sets and uses this evaluation to determine a delivery corridor for freight transfer service in the multi-modal transportation network. In the multi-objective analysis, authors include safety components, the number of accidents a year, and freight delivery time in the multi-modal transportation network. The proposed methodology has practical value in the management of multi-modal transportation processes.