Injury Prediction for Soccer Players Using Machine Learning

Injuries in professional sports occur on a regular basis. Some may be minor while others can cause huge impact on a player’s career and earning potential. In soccer, there is a high risk of players picking up injuries during game time. This research work seeks to help soccer players reduce the risk of getting injured by predicting the likelihood of injury while playing in the near future and then providing recommendations for intervention. The injury prediction tool will use a soccer player’s number of minutes played on the field, number of appearances, distance covered and performance data for the current and previous seasons as variables to conduct statistical analysis and provide injury predictive results using a machine learning linear regression model.

Predicting Shot Making in Basketball Learnt from Adversarial Multiagent Trajectories

In this paper, we predict the likelihood of a player making a shot in basketball from multiagent trajectories. To approach this problem, we present a convolutional neural network (CNN) approach where we initially represent the multiagent behavior as an image. To encode the adversarial nature of basketball, we use a multichannel image which we then feed into a CNN. Additionally, to capture the temporal aspect of the trajectories we use “fading.” We find that this approach is superior to a traditional FFN model. By using gradient ascent, we were able to discover what the CNN filters look for during training. Last, we find that a combined FFN+CNN is the best performing network with an error rate of 39%.

Classification of Extreme Ground-Level Ozone Based on Generalized Extreme Value Model for Air Monitoring Station

Higher ground-level ozone (GLO) concentration adversely affects human health, vegetations as well as activities in the ecosystem. In Malaysia, most of the analysis on GLO concentration are carried out using the average value of GLO concentration, which refers to the centre of distribution to make a prediction or estimation. However, analysis which focuses on the higher value or extreme value in GLO concentration is rarely explored. Hence, the objective of this study is to classify the tail behaviour of GLO using generalized extreme value (GEV) distribution estimation the return level using the corresponding modelling (Gumbel, Weibull, and Frechet) of GEV distribution. The results show that Weibull distribution which is also known as short tail distribution and considered as having less extreme behaviour is the best-fitted distribution for four selected air monitoring stations in Peninsular Malaysia, namely Larkin, Pelabuhan Kelang, Shah Alam, and Tanjung Malim; while Gumbel distribution which is considered as a medium tail distribution is the best-fitted distribution for Nilai station. The return level of GLO concentration in Shah Alam station is comparatively higher than other stations. Overall, return levels increase with increasing return periods but the increment depends on the type of the tail of GEV distribution’s tail. We conduct this study by using maximum likelihood estimation (MLE) method to estimate the parameters at four selected stations in Peninsular Malaysia. Next, the validation for the fitted block maxima series to GEV distribution is performed using probability plot, quantile plot and likelihood ratio test. Profile likelihood confidence interval is tested to verify the type of GEV distribution. These results are important as a guide for early notification on future extreme ozone events.

The Reproducibility and Repeatability of Modified Likelihood Ratio for Forensics Handwriting Examination

The forensic use of handwriting depends on the analysis, comparison, and evaluation decisions made by forensic document examiners. When using biometric technology in forensic applications, it is necessary to compute Likelihood Ratio (LR) for quantifying strength of evidence under two competing hypotheses, namely the prosecution and the defense hypotheses wherein a set of assumptions and methods for a given data set will be made. It is therefore important to know how repeatable and reproducible our estimated LR is. This paper evaluated the accuracy and reproducibility of examiners' decisions. Confidence interval for the estimated LR were presented so as not get an incorrect estimate that will be used to deliver wrong judgment in the court of Law. The estimate of LR is fundamentally a Bayesian concept and we used two LR estimators, namely Logistic Regression (LoR) and Kernel Density Estimator (KDE) for this paper. The repeatability evaluation was carried out by retesting the initial experiment after an interval of six months to observe whether examiners would repeat their decisions for the estimated LR. The experimental results, which are based on handwriting dataset, show that LR has different confidence intervals which therefore implies that LR cannot be estimated with the same certainty everywhere. Though the LoR performed better than the KDE when tested using the same dataset, the two LR estimators investigated showed a consistent region in which LR value can be estimated confidently. These two findings advance our understanding of LR when used in computing the strength of evidence in handwriting using forensics.

Reimagining the Learning Management System as a “Third” Space

This paper focuses on a sense of belonging, isolation, and the use of a learning management system as a “third space” for connection and community. Given student use of learning management systems (LMS) for courses on campuses, moderate to high use of social media and hand-held devices, the author explores the possibilities of LMS as a third space. The COVID-19 pandemic has exacerbated student experiences of isolation, and research indicates that students who experience a sense of belonging have a greater likelihood for academic retention and success. The impacts on students of an LMS designed for student employee orientation and training were examined through a mixed methods approach, including a survey, individual interviews, and focus groups. The sample involved 250-450 undergraduate student employees at a US northwestern university. The goal of the study was to find out the efficiency and effectiveness of the orientation information for a wide range of student employees from multiple student affairs departments. And unexpected finding emerged within the study in 2015 and was noted again as a finding in the 2017 study. Students reported feeling like they individually connected to the department, and further to the university because of the LMS orientation. They stated they could see themselves as part of the university community and like they belonged. The orientation, through the LMS, was designed for and occurred online (asynchronous), prior to students traveling and beginning university life for the academic year. The students indicated connection and belonging resulting from some of the design features. With the onset of COVID-19 and prolonged sheltering in place in North America, as well as other parts of the world, students have been precluded from physically gathering to educate and learn. COVID-19 essentially paused face-to-face education in 2020. Media, governments, and higher education outlets have been reporting on widespread college student stress, isolation, loneliness, and sadness. In this context, the author conducted a current mixed methods study (online survey, online interviews) of students in advanced degree programs, like Ph.D. and Ed.D. specifically investigating isolation and sense of belonging. As a part of the study a prototype of a Canvas site was experienced by student interviewees for their reaction of this Canvas site prototype as a “third” space. Some preliminary findings of this study are presented. Doctoral students in the study affirmed the potential of LMS as a third space for community and social academic connection.

Infrastructure Change Monitoring Using Multitemporal Multispectral Satellite Images

The main objective of this study is to find a suitable approach to monitor the land infrastructure growth over a period of time using multispectral satellite images. Bi-temporal change detection method is unable to indicate the continuous change occurring over a long period of time. To achieve this objective, the approach used here estimates a statistical model from series of multispectral image data over a long period of time, assuming there is no considerable change during that time period and then compare it with the multispectral image data obtained at a later time. The change is estimated pixel-wise. Statistical composite hypothesis technique is used for estimating pixel based change detection in a defined region. The generalized likelihood ratio test (GLRT) is used to detect the changed pixel from probabilistic estimated model of the corresponding pixel. The changed pixel is detected assuming that the images have been co-registered prior to estimation. To minimize error due to co-registration, 8-neighborhood pixels around the pixel under test are also considered. The multispectral images from Sentinel-2 and Landsat-8 from 2015 to 2018 are used for this purpose. There are different challenges in this method. First and foremost challenge is to get quite a large number of datasets for multivariate distribution modelling. A large number of images are always discarded due to cloud coverage. Due to imperfect modelling there will be high probability of false alarm. Overall conclusion that can be drawn from this work is that the probabilistic method described in this paper has given some promising results, which need to be pursued further.

Cultural Effects on the Performance of Non- Profit and For-Profit Microfinance Institutions

Using a large dataset of more than 2,400 individual microfinance institutions (MFIs) from 120 countries from 1999 to 2016, this study finds that nearly half of the international MFIs operate as for-profit institutions. Formal institutions (business regulatory environment, property rights, social protection, and a developed financial sector) impact the likelihood of MFIs being for-profit across countries. Cultural differences across countries (power distance, individualism, masculinity, and indulgence) seem to be a factor in the legal status of the MFI (non-profit or for-profit). MFIs in countries with stronger formal institutions, a greater degree of power distance, and a higher degree of collectivism experience better financial and social performance.

Retail Strategy to Reduce Waste Keeping High Profit Utilizing Taylor's Law in Point-of-Sales Data

Waste reduction is a fundamental problem for sustainability. Methods for waste reduction with point-of-sales (POS) data are proposed, utilizing the knowledge of a recent econophysics study on a statistical property of POS data. Concretely, the non-stationary time series analysis method based on the Particle Filter is developed, which considers abnormal fluctuation scaling known as Taylor's law. This method is extended for handling incomplete sales data because of stock-outs by introducing maximum likelihood estimation for censored data. The way for optimal stock determination with pricing the cost of waste reduction is also proposed. This study focuses on the examination of the methods for large sales numbers where Taylor's law is obvious. Numerical analysis using aggregated POS data shows the effectiveness of the methods to reduce food waste maintaining a high profit for large sales numbers. Moreover, the way of pricing the cost of waste reduction reveals that a small profit loss realizes substantial waste reduction, especially in the case that the proportionality constant  of Taylor’s law is small. Specifically, around 1% profit loss realizes half disposal at =0.12, which is the actual  value of processed food items used in this research. The methods provide practical and effective solutions for waste reduction keeping a high profit, especially with large sales numbers.

The Non-Stationary BINARMA(1,1) Process with Poisson Innovations: An Application on Accident Data

This paper considers the modelling of a non-stationary bivariate integer-valued autoregressive moving average of order one (BINARMA(1,1)) with correlated Poisson innovations. The BINARMA(1,1) model is specified using the binomial thinning operator and by assuming that the cross-correlation between the two series is induced by the innovation terms only. Based on these assumptions, the non-stationary marginal and joint moments of the BINARMA(1,1) are derived iteratively by using some initial stationary moments. As regards to the estimation of parameters of the proposed model, the conditional maximum likelihood (CML) estimation method is derived based on thinning and convolution properties. The forecasting equations of the BINARMA(1,1) model are also derived. A simulation study is also proposed where BINARMA(1,1) count data are generated using a multivariate Poisson R code for the innovation terms. The performance of the BINARMA(1,1) model is then assessed through a simulation experiment and the mean estimates of the model parameters obtained are all efficient, based on their standard errors. The proposed model is then used to analyse a real-life accident data on the motorway in Mauritius, based on some covariates: policemen, daily patrol, speed cameras, traffic lights and roundabouts. The BINARMA(1,1) model is applied on the accident data and the CML estimates clearly indicate a significant impact of the covariates on the number of accidents on the motorway in Mauritius. The forecasting equations also provide reliable one-step ahead forecasts.

An Automated Stock Investment System Using Machine Learning Techniques: An Application in Australia

A key issue in stock investment is how to select representative features for stock selection. The objective of this paper is to firstly determine whether an automated stock investment system, using machine learning techniques, may be used to identify a portfolio of growth stocks that are highly likely to provide returns better than the stock market index. The second objective is to identify the technical features that best characterize whether a stock’s price is likely to go up and to identify the most important factors and their contribution to predicting the likelihood of the stock price going up. Unsupervised machine learning techniques, such as cluster analysis, were applied to the stock data to identify a cluster of stocks that was likely to go up in price – portfolio 1. Next, the principal component analysis technique was used to select stocks that were rated high on component one and component two – portfolio 2. Thirdly, a supervised machine learning technique, the logistic regression method, was used to select stocks with a high probability of their price going up – portfolio 3. The predictive models were validated with metrics such as, sensitivity (recall), specificity and overall accuracy for all models. All accuracy measures were above 70%. All portfolios outperformed the market by more than eight times. The top three stocks were selected for each of the three stock portfolios and traded in the market for one month. After one month the return for each stock portfolio was computed and compared with the stock market index returns. The returns for all three stock portfolios was 23.87% for the principal component analysis stock portfolio, 11.65% for the logistic regression portfolio and 8.88% for the K-means cluster portfolio while the stock market performance was 0.38%. This study confirms that an automated stock investment system using machine learning techniques can identify top performing stock portfolios that outperform the stock market.

Review of the Road Crash Data Availability in Iraq

Iraq is a middle income country where the road safety issue is considered one of the leading causes of deaths. To control the road risk issue, the Iraqi Ministry of Planning, General Statistical Organization started to organise a collection system of traffic accidents data with details related to their causes and severity. These data are published as an annual report. In this paper, a review of the available crash data in Iraq will be presented. The available data represent the rate of accidents in aggregated level and classified according to their types, road users’ details, and crash severity, type of vehicles, causes and number of causalities. The review is according to the types of models used in road safety studies and research, and according to the required road safety data in the road constructions tasks. The available data are also compared with the road safety dataset published in the United Kingdom as an example of developed country. It is concluded that the data in Iraq are suitable for descriptive and exploratory models, aggregated level comparison analysis, and evaluation and monitoring the progress of the overall traffic safety performance. However, important traffic safety studies require disaggregated level of data and details related to the factors of the likelihood of traffic crashes. Some studies require spatial geographic details such as the location of the accidents which is essential in ranking the roads according to their level of safety, and name the most dangerous roads in Iraq which requires tactic plan to control this issue. Global Road safety agencies interested in solve this problem in low and middle-income countries have designed road safety assessment methodologies which are basing on the road attributes data only. Therefore, in this research it is recommended to use one of these methodologies.

Classification of Health Risk Factors to Predict the Risk of Falling in Older Adults

Cognitive decline and frailty is apparent in older adults leading to an increased likelihood of the risk of falling. Currently health care professionals have to make professional decisions regarding such risks, and hence make difficult decisions regarding the future welfare of the ageing population. This study uses health data from The Irish Longitudinal Study on Ageing (TILDA), focusing on adults over the age of 50 years, in order to analyse health risk factors and predict the likelihood of falls. This prediction is based on the use of machine learning algorithms whereby health risk factors are used as inputs to predict the likelihood of falling. Initial results show that health risk factors such as long-term health issues contribute to the number of falls. The identification of such health risk factors has the potential to inform health and social care professionals, older people and their family members in order to mitigate daily living risks.

A Communication Signal Recognition Algorithm Based on Holder Coefficient Characteristics

Communication signal modulation recognition technology is one of the key technologies in the field of modern information warfare. At present, communication signal automatic modulation recognition methods are mainly divided into two major categories. One is the maximum likelihood hypothesis testing method based on decision theory, the other is a statistical pattern recognition method based on feature extraction. Now, the most commonly used is a statistical pattern recognition method, which includes feature extraction and classifier design. With the increasingly complex electromagnetic environment of communications, how to effectively extract the features of various signals at low signal-to-noise ratio (SNR) is a hot topic for scholars in various countries. To solve this problem, this paper proposes a feature extraction algorithm for the communication signal based on the improved Holder cloud feature. And the extreme learning machine (ELM) is used which aims at the problem of the real-time in the modern warfare to classify the extracted features. The algorithm extracts the digital features of the improved cloud model without deterministic information in a low SNR environment, and uses the improved cloud model to obtain more stable Holder cloud features and the performance of the algorithm is improved. This algorithm addresses the problem that a simple feature extraction algorithm based on Holder coefficient feature is difficult to recognize at low SNR, and it also has a better recognition accuracy. The results of simulations show that the approach in this paper still has a good classification result at low SNR, even when the SNR is -15dB, the recognition accuracy still reaches 76%.

Improving Flash Flood Forecasting with a Bayesian Probabilistic Approach: A Case Study on the Posina Basin in Italy

The Flash Flood Guidance (FFG) provides the rainfall amount of a given duration necessary to cause flooding. The approach is based on the development of rainfall-runoff curves, which helps us to find out the rainfall amount that would cause flooding. An alternative approach, mostly experimented with Italian Alpine catchments, is based on determining threshold discharges from past events and on finding whether or not an oncoming flood has its magnitude more than some critical discharge thresholds found beforehand. Both approaches suffer from large uncertainties in forecasting flash floods as, due to the simplistic approach followed, the same rainfall amount may or may not cause flooding. This uncertainty leads to the question whether a probabilistic model is preferable over a deterministic one in forecasting flash floods. We propose the use of a Bayesian probabilistic approach in flash flood forecasting. A prior probability of flooding is derived based on historical data. Additional information, such as antecedent moisture condition (AMC) and rainfall amount over any rainfall thresholds are used in computing the likelihood of observing these conditions given a flash flood has occurred. Finally, the posterior probability of flooding is computed using the prior probability and the likelihood. The variation of the computed posterior probability with rainfall amount and AMC presents the suitability of the approach in decision making in an uncertain environment. The methodology has been applied to the Posina basin in Italy. From the promising results obtained, we can conclude that the Bayesian approach in flash flood forecasting provides more realistic forecasting over the FFG.

Modelling Hydrological Time Series Using Wakeby Distribution

The statistical modelling of precipitation data for a given portion of territory is fundamental for the monitoring of climatic conditions and for Hydrogeological Management Plans (HMP). This modelling is rendered particularly complex by the changes taking place in the frequency and intensity of precipitation, presumably to be attributed to the global climate change. This paper applies the Wakeby distribution (with 5 parameters) as a theoretical reference model. The number and the quality of the parameters indicate that this distribution may be the appropriate choice for the interpolations of the hydrological variables and, moreover, the Wakeby is particularly suitable for describing phenomena producing heavy tails. The proposed estimation methods for determining the value of the Wakeby parameters are the same as those used for density functions with heavy tails. The commonly used procedure is the classic method of moments weighed with probabilities (probability weighted moments, PWM) although this has often shown difficulty of convergence, or rather, convergence to a configuration of inappropriate parameters. In this paper, we analyze the problem of the likelihood estimation of a random variable expressed through its quantile function. The method of maximum likelihood, in this case, is more demanding than in the situations of more usual estimation. The reasons for this lie, in the sampling and asymptotic properties of the estimators of maximum likelihood which improve the estimates obtained with indications of their variability and, therefore, their accuracy and reliability. These features are highly appreciated in contexts where poor decisions, attributable to an inefficient or incomplete information base, can cause serious damages.

The Efficiency of Association Measures in Automatic Extraction of Collocations: Exclusivity and Frequency

This paper deals with automatic extraction of 20 ‘adjective + noun’ collocations using four different association measures: T-score, MI, Log Dice, and Log Likelihood with most emphasis on mainly Log Likelihood and Log Dice scores for which an argument for their suitability in this experiment is to be presented. The nodes of the chosen collocates are 20 adjectival false friends between English and French. The noun candidate to be chosen needs to occur with a threshold of top ten collocates in two lists in which the results are sorted by Log Likelihood and Log Dice. The fulfillment of this criterion will guarantee that the chosen candidates are both exclusive and significant noun collocates and thereby, they make perfect noun candidates for the nodes. The results of the top 10 collocates sorted by Log Dice and Log Likelihood are not to be filtered. Thereby technical terms, function words, and stop words are not to be removed for the purposes of the analysis. Out of 20 adjectives, 15 ‘adjective + noun’ collocations have been extracted by the means of consensus of Log Likelihood and Log Dice scores on the top 10 noun collocates. The generated list of the automatic extracted ‘adjective + noun’ collocations will serve as the bulk of a translation test in which Algerian students of translation are asked to render these collocations into Arabic. The ultimate goal of this test is to test French influence as a Second Language on English as a Foreign Language in the Algerian context.

Systematics of Water Lilies (Genus Nymphaea L.) Using 18S rDNA Sequences

Water lily (Nymphaea L.) is the largest genus of Nymphaeaceae. This family is composed of six genera (Nuphar, Ondinea, Euryale, Victoria, Barclaya, Nymphaea). Its members are nearly worldwide in tropical and temperate regions. The classification of some species in Nymphaea is ambiguous due to high variation in leaf and flower parts such as leaf margin, stamen appendage. Therefore, the phylogenetic relationships based on 18S rDNA were constructed to delimit this genus. DNAs of 52 specimens belonging to water lily family were extracted using modified conventional method containing cetyltrimethyl ammonium bromide (CTAB). The results showed that the amplified fragment is about 1600 base pairs in size. After analysis, the aligned sequences presented 9.36% for variable characters comprising 2.66% of parsimonious informative sites and 6.70% of singleton sites. Moreover, there are 6 regions of 1-2 base(s) for insertion/deletion. The phylogenetic trees based on maximum parsimony and maximum likelihood with high bootstrap support indicated that genus Nymphaea was a paraphyletic group because of Ondinea, Victoria and Euryale disruption. Within genus Nymphaea, subgenus Nymphaea is a basal lineage group which cooperated with Euryale and Victoria. The other four subgenera, namely Lotos, Hydrocallis, Brachyceras and Anecphya were included the same large clade which Ondinea was placed within Anecphya clade due to geographical sharing.

Comparison of Methods of Estimation for Use in Goodness of Fit Tests for Binary Multilevel Models

It can be frequently observed that the data arising in our environment have a hierarchical or a nested structure attached with the data. Multilevel modelling is a modern approach to handle this kind of data. When multilevel modelling is combined with a binary response, the estimation methods get complex in nature and the usual techniques are derived from quasi-likelihood method. The estimation methods which are compared in this study are, marginal quasi-likelihood (order 1 & order 2) (MQL1, MQL2) and penalized quasi-likelihood (order 1 & order 2) (PQL1, PQL2). A statistical model is of no use if it does not reflect the given dataset. Therefore, checking the adequacy of the fitted model through a goodness-of-fit (GOF) test is an essential stage in any modelling procedure. However, prior to usage, it is also equally important to confirm that the GOF test performs well and is suitable for the given model. This study assesses the suitability of the GOF test developed for binary response multilevel models with respect to the method used in model estimation. An extensive set of simulations was conducted using MLwiN (v 2.19) with varying number of clusters, cluster sizes and intra cluster correlations. The test maintained the desirable Type-I error for models estimated using PQL2 and it failed for almost all the combinations of MQL. Power of the test was adequate for most of the combinations in all estimation methods except MQL1. Moreover, models were fitted using the four methods to a real-life dataset and performance of the test was compared for each model.

Dynamic Measurement System Modeling with Machine Learning Algorithms

In this paper, ways of modeling dynamic measurement systems are discussed. Specially, for linear system with single-input single-output, it could be modeled with shallow neural network. Then, gradient based optimization algorithms are used for searching the proper coefficients. Besides, method with normal equation and second order gradient descent are proposed to accelerate the modeling process, and ways of better gradient estimation are discussed. It shows that the mathematical essence of the learning objective is maximum likelihood with noises under Gaussian distribution. For conventional gradient descent, the mini-batch learning and gradient with momentum contribute to faster convergence and enhance model ability. Lastly, experimental results proved the effectiveness of second order gradient descent algorithm, and indicated that optimization with normal equation was the most suitable for linear dynamic models.

On the Efficiency and Robustness of Commingle Wiener and Lévy Driven Processes for Vasciek Model

The driven processes of Wiener and Lévy are known self-standing Gaussian-Markov processes for fitting non-linear dynamical Vasciek model. In this paper, a coincidental Gaussian density stationarity condition and autocorrelation function of the two driven processes were established. This led to the conflation of Wiener and Lévy processes so as to investigate the efficiency of estimates incorporated into the one-dimensional Vasciek model that was estimated via the Maximum Likelihood (ML) technique. The conditional laws of drift, diffusion and stationarity process was ascertained for the individual Wiener and Lévy processes as well as the commingle of the two processes for a fixed effect and Autoregressive like Vasciek model when subjected to financial series; exchange rate of Naira-CFA Franc. In addition, the model performance error of the sub-merged driven process was miniature compared to the self-standing driven process of Wiener and Lévy.