Optimizing Data Evaluation Metrics for Fraud Detection Using Machine Learning

The use of technology has benefited society in more ways than one ever thought possible. Unfortunately, as society’s knowledge of technology has advanced, so has its knowledge of ways to use technology to manipulate others. This has led to a simultaneous advancement in the world of fraud. Machine learning techniques can offer a possible solution to help decrease these advancements. This research explores how the use of various machine learning techniques can aid in detecting fraudulent activity across two different types of fraudulent datasets, and the accuracy, precision, recall, and F1 were recorded for each method. Each machine learning model was also tested across five different training and testing splits in order to discover which split and technique would lead to the most optimal results.

Index t-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings

t-SNE is an embedding method that the data science community has widely used. It helps two main tasks: to display results by coloring items according to the item class or feature value; and for forensic, giving a first overview of the dataset distribution. Two interesting characteristics of t-SNE are the structure preservation property and the answer to the crowding problem, where all neighbors in high dimensional space cannot be represented correctly in low dimensional space. t-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where the cluster area is proportional to its size in number, and relationships between clusters are materialized by closeness on the embedding. This algorithm is non-parametric. The transformation from a high to low dimensional space is described but not learned. Two initializations of the algorithm would lead to two different embedding. In a forensic approach, analysts would like to compare two or more datasets using their embedding. A naive approach would be to embed all datasets together. However, this process is costly as the complexity of t-SNE is quadratic, and would be infeasible for too many datasets. Another approach would be to learn a parametric model over an embedding built with a subset of data. While this approach is highly scalable, points could be mapped at the same exact position, making them indistinguishable. This type of model would be unable to adapt to new outliers nor concept drift. This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved. The optimization process minimizes two costs, one relative to the embedding shape and the second relative to the support embedding’ match. The embedding with the support process can be repeated more than once, with the newly obtained embedding. The successive embedding can be used to study the impact of one variable over the dataset distribution or monitor changes over time. This method has the same complexity as t-SNE per embedding, and memory requirements are only doubled. For a dataset of n elements sorted and split into k subsets, the total embedding complexity would be reduced from O(n2) to O(n2/k), and the memory requirement from n2 to 2(n/k)2 which enables computation on recent laptops. The method showed promising results on a real-world dataset, allowing to observe the birth, evolution and death of clusters. The proposed approach facilitates identifying significant trends and changes, which empowers the monitoring high dimensional datasets’ dynamics.

Attribute Analysis of Quick Response Code Payment Users Using Discriminant Non-negative Matrix Factorization

Recently, the system of quick response (QR) code is getting popular. Many companies introduce new QR code payment services and the services are competing with each other to increase the number of users. For increasing the number of users, we should grasp the difference of feature of the demographic information, usage information, and value of users between services. In this study, we conduct an analysis of real-world data provided by Nomura Research Institute including the demographic data of users and information of users’ usages of two services; LINE Pay, and PayPay. For analyzing such data and interpret the feature of them, Nonnegative Matrix Factorization (NMF) is widely used; however, in case of the target data, there is a problem of the missing data. EM-algorithm NMF (EMNMF) to complete unknown values for understanding the feature of the given data presented by matrix shape. Moreover, for comparing the result of the NMF analysis of two matrices, there is Discriminant NMF (DNMF) shows the difference of users features between two matrices. In this study, we combine EMNMF and DNMF and also analyze the target data. As the interpretation, we show the difference of the features of users between LINE Pay and Paypay.

Spatial Data Science for Data Driven Urban Planning: The Youth Economic Discomfort Index for Rome

Today, a consistent segment of the world’s population lives in urban areas, and this proportion will vastly increase in the next decades. Therefore, understanding the key trends in urbanization, likely to unfold over the coming years, is crucial to the implementation of sustainable urban strategies. In parallel, the daily amount of digital data produced will be expanding at an exponential rate during the following years. The analysis of various types of data sets and its derived applications have incredible potential across different crucial sectors such as healthcare, housing, transportation, energy, and education. Nevertheless, in city development, architects and urban planners appear to rely mostly on traditional and analogical techniques of data collection. This paper investigates the prospective of the data science field, appearing to be a formidable resource to assist city managers in identifying strategies to enhance the social, economic, and environmental sustainability of our urban areas. The collection of different new layers of information would definitely enhance planners' capabilities to comprehend more in-depth urban phenomena such as gentrification, land use definition, mobility, or critical infrastructural issues. Specifically, the research results correlate economic, commercial, demographic, and housing data with the purpose of defining the youth economic discomfort index. The statistical composite index provides insights regarding the economic disadvantage of citizens aged between 18 years and 29 years, and results clearly display that central urban zones and more disadvantaged than peripheral ones. The experimental set up selected the city of Rome as the testing ground of the whole investigation. The methodology aims at applying statistical and spatial analysis to construct a composite index supporting informed data-driven decisions for urban planning.

AniMoveMineR: Animal Behavior Exploratory Analysis Using Association Rules Mining

Environmental changes and major natural disasters are most prevalent in the world due to the damage that humanity has caused to nature and these damages directly affect the lives of animals. Thus, the study of animal behavior and their interactions with the environment can provide knowledge that guides researchers and public agencies in preservation and conservation actions. Exploratory analysis of animal movement can determine the patterns of animal behavior and with technological advances the ability of animals to be tracked and, consequently, behavioral studies have been expanded. There is a lot of research on animal movement and behavior, but we note that a proposal that combines resources and allows for exploratory analysis of animal movement and provide statistical measures on individual animal behavior and its interaction with the environment is missing. The contribution of this paper is to present the framework AniMoveMineR, a unified solution that aggregates trajectory analysis and data mining techniques to explore animal movement data and provide a first step in responding questions about the animal individual behavior and their interactions with other animals over time and space. We evaluated the framework through the use of monitored jaguar data in the city of Miranda Pantanal, Brazil, in order to verify if the use of AniMoveMineR allows to identify the interaction level between these jaguars. The results were positive and provided indications about the individual behavior of jaguars and about which jaguars have the highest or lowest correlation.

Systematic Mapping Study of Digitization and Analysis of Manufacturing Data

The manufacturing industry is currently undergoing a digital transformation as part of the mega-trend Industry 4.0. As part of this phase of the industrial revolution, traditional manufacturing processes are being combined with digital technologies to achieve smarter and more efficient production. To successfully digitally transform a manufacturing facility, the processes must first be digitized. This is the conversion of information from an analogue format to a digital format. The objective of this study was to explore the research area of digitizing manufacturing data as part of the worldwide paradigm, Industry 4.0. The formal methodology of a systematic mapping study was utilized to capture a representative sample of the research area and assess its current state. Specific research questions were defined to assess the key benefits and limitations associated with the digitization of manufacturing data. Research papers were classified according to the type of research and type of contribution to the research area. Upon analyzing 54 papers identified in this area, it was noted that 23 of the papers originated in Germany. This is an unsurprising finding as Industry 4.0 is originally a German strategy with supporting strong policy instruments being utilized in Germany to support its implementation. It was also found that the Fraunhofer Institute for Mechatronic Systems Design, in collaboration with the University of Paderborn in Germany, was the most frequent contributing Institution of the research papers with three papers published. The literature suggested future research directions and highlighted one specific gap in the area. There exists an unresolved gap between the data science experts and the manufacturing process experts in the industry. The data analytics expertise is not useful unless the manufacturing process information is utilized. A legitimate understanding of the data is crucial to perform accurate analytics and gain true, valuable insights into the manufacturing process. There lies a gap between the manufacturing operations and the information technology/data analytics departments within enterprises, which was borne out by the results of many of the case studies reviewed as part of this work. To test the concept of this gap existing, the researcher initiated an industrial case study in which they embedded themselves between the subject matter expert of the manufacturing process and the data scientist. Of the papers resulting from the systematic mapping study, 12 of the papers contributed a framework, another 12 of the papers were based on a case study, and 11 of the papers focused on theory. However, there were only three papers that contributed a methodology. This provides further evidence for the need for an industry-focused methodology for digitizing and analyzing manufacturing data, which will be developed in future research.

Omni: Data Science Platform for Evaluate Performance of a LoRaWAN Network

Nowadays, physical processes are becoming digitized by the evolution of communication, sensing and storage technologies which promote the development of smart cities. The evolution of this technology has generated multiple challenges related to the generation of big data and the active participation of electronic devices in society. Thus, devices can send information that is captured and processed over large areas, but there is no guarantee that all the obtained data amount will be effectively stored and correctly persisted. Because, depending on the technology which is used, there are parameters that has huge influence on the full delivery of information. This article aims to characterize the project, currently under development, of a platform that based on data science will perform a performance and effectiveness evaluation of an industrial network that implements LoRaWAN technology considering its main parameters configuration relating these parameters to the information loss.

Long Term Examination of the Profitability Estimation Focused on Benefits

Strategic investment decisions are characterized by high innovation potential and long-term effects on the competitiveness of enterprises. Due to the uncertainty and risks involved in this complex decision making process, the need arises for well-structured support activities. A method that considers cost and the long-term added value is the cost-benefit effectiveness estimation. One of those methods is the “profitability estimation focused on benefits – PEFB”-method developed at the Institute of Management Cybernetics at RWTH Aachen University. The method copes with the challenges associated with strategic investment decisions by integrating long-term non-monetary aspects whilst also mapping the chronological sequence of an investment within the organization’s target system. Thus, this method is characterized as a holistic approach for the evaluation of costs and benefits of an investment. This participation-oriented method was applied to business environments in many workshops. The results of the workshops are a library of more than 96 cost aspects, as well as 122 benefit aspects. These aspects are preprocessed and comparatively analyzed with regards to their alignment to a series of risk levels. For the first time, an accumulation and a distribution of cost and benefit aspects regarding their impact and probability of occurrence are given. The results give evidence that the PEFB-method combines precise measures of financial accounting with the incorporation of benefits. Finally, the results constitute the basics for using information technology and data science for decision support when applying within the PEFB-method.