Machine Learning Techniques in Bank Credit Analysis

This paper compares classifier options for credit risk assessment by applying several Machine Learning techniques. The study uses records from a Brazilian financial institution: a database of 5,432 companies that are clients of the bank, of which 2,600 are classified as non-defaulters, 1,551 as defaulters and 1,281 as temporary defaulters, meaning that they are overdue on their payments by up to 180 days. For each case, a total of 15 attributes was considered in a one-against-all assessment using four techniques: Artificial Neural Networks Multilayer Perceptron (ANN-MLP), Artificial Neural Networks Radial Basis Functions (ANN-RBF), Logistic Regression (LR) and Support Vector Machines (SVM). For each method, different parameter settings were analyzed so that the best configuration of each technique could be compared. The data were first encoded using thermometer coding (for numerical attributes) or dummy coding (for nominal attributes). The methods were then evaluated for each parameter setting, and the best result of each technique was compared in terms of accuracy, false positives, false negatives, true positives and true negatives. This comparison showed that the best method in terms of accuracy was ANN-RBF (79.20% for the non-defaulter classification, 97.74% for defaulters and 75.37% for the temporary defaulter classification). However, the best accuracy does not always indicate the best technique. For instance, in the classification of temporary defaulters, ANN-RBF was surpassed in terms of false positives by SVM, which had the lowest false positive rate (0.07%). These details are discussed in light of the results, and the conclusion gives an overview of the study.
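
The two encodings named in this abstract can be illustrated with a minimal sketch. The thresholds, attribute names and category list below are hypothetical and are not taken from the study; the dummy coding shown is the one-column-per-category variant, which is only one common convention.

```python
def thermometer_encode(value, thresholds):
    """Thermometer code: one bit per threshold, set to 1 when value >= threshold.

    With ascending thresholds the ones always form a prefix, like a rising
    column of mercury.
    """
    return [1 if value >= t else 0 for t in thresholds]


def dummy_encode(category, categories):
    """Dummy (one-hot) code: a single 1 at the position of the category."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if category == c else 0 for c in categories]


# Hypothetical attributes, for illustration only.
revenue_thresholds = [100, 500, 1000]
sectors = ["retail", "industry", "services"]

row = thermometer_encode(750, revenue_thresholds) + dummy_encode("industry", sectors)
print(row)  # [1, 1, 0, 0, 1, 0]
```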

Hybrid Anomaly Detection Using Decision Tree and Support Vector Machine

Intrusion detection systems (IDS) are main components of network security. These systems analyze network events to detect intrusions. An IDS is designed by training it on normal or attack traffic data. Machine learning methods are the best way to design an IDS. In the method presented in this article, the pruning algorithm of the C5.0 decision tree is used to reduce the features of the traffic data, and the IDS is trained with the least squares support vector machine (LS-SVM) algorithm. The remaining features are then ranked according to the predictor importance criterion, and the least important features are eliminated in order. The features remaining at the stage that yields the highest accuracy in LS-SVM are selected as the final features. Compared with those of similar articles that have examined feature selection with the least squares support vector machine model, the features obtained give better accuracy, true positive rate, and false positive rate. The results are tested on the UNSW-NB15 dataset.
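
The elimination loop described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: `evaluate` is a hypothetical callback standing in for training and scoring an LS-SVM on a feature subset, and the feature names and accuracies are invented for the example.

```python
def backward_eliminate(ranked_features, evaluate):
    """Drop the least important feature one at a time and keep the best subset.

    ranked_features: feature names ordered from most to least important.
    evaluate: callable taking a list of features and returning an accuracy.
    """
    current = list(ranked_features)
    best_subset, best_score = list(current), evaluate(current)
    while len(current) > 1:
        current = current[:-1]  # remove the least important remaining feature
        score = evaluate(current)
        if score >= best_score:  # on ties, prefer the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Hypothetical accuracies indexed by subset size, for illustration only.
toy_accuracy = {4: 0.90, 3: 0.93, 2: 0.91, 1: 0.80}
subset, score = backward_eliminate(
    ["f1", "f2", "f3", "f4"], lambda feats: toy_accuracy[len(feats)]
)
print(subset, score)  # ['f1', 'f2', 'f3'] 0.93
```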

An Earth Mover’s Distance Algorithm Based DDoS Detection Mechanism in SDN

Software-defined networking (SDN) provides a solution for a scalable network framework with decoupled control and data planes. However, this architecture also induces a particular distributed denial-of-service (DDoS) attack that can affect or even overwhelm the SDN network. The DDoS attack detection problem has to date mostly been studied as an entropy comparison problem. However, this approach does not make use of SDN, and its results are not accurate. In this paper, we propose a DDoS attack detection method that interprets DDoS detection as a signature matching problem and formulates it as an Earth Mover’s Distance (EMD) model. Considering feasibility and accuracy, we further propose to define the cost function of EMD as a generalized Kullback-Leibler divergence. Simulation results show that our proposed method can detect DDoS attacks by comparing EMD values with the ones computed in the case without attacks. Moreover, our method can significantly increase the true positive rate of detection.
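
For reference, the generalized Kullback-Leibler divergence in its standard form, D(p||q) = sum_i [p_i log(p_i/q_i) - p_i + q_i], can be computed as in the sketch below. How the paper plugs this into the EMD cost function is not stated in the abstract, so the code shows only the divergence itself; the input vectors are hypothetical.

```python
import math


def generalized_kl(p, q):
    """Generalized KL divergence between two non-negative vectors.

    Unlike the ordinary KL divergence, the inputs need not sum to 1.
    Uses the convention 0 * log(0) = 0 and returns infinity when
    p_i > 0 but q_i == 0.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi < 0 or qi < 0:
            raise ValueError("entries must be non-negative")
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
        total += qi - pi
    return total


# Hypothetical unnormalized histograms, for illustration only.
print(generalized_kl([3.0, 1.0], [1.0, 1.0]))  # 3*ln(3) - 2, about 1.296
```

When both vectors sum to 1, the correction terms cancel and the value equals the ordinary KL divergence.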

A 3-Year Evaluation Study on Fine Needle Aspiration Cytology and Corresponding Histology

Background and Objectives: The incidence of thyroid carcinoma has been increasing worldwide. In the present study, we evaluated the diagnostic accuracy of fine needle aspiration (FNA) and its efficiency in the early detection of neoplastic lesions of the thyroid gland over a 3-year period. Methods: Data were retrieved from pathology files at King Khalid Hospital. For each patient, age, gender, FNA result, site and size of nodule, and final histopathologic diagnosis were recorded. Results: The study included 490 cases, of which 419 were female and 71 male. The male to female ratio was 1:6. Mean age was 43 years for males and 38 for females. There were 131 cases with confirmed histopathology. In 101/131 (77.1%), concordance was found between FNA and histology. In 30/131 (22.9%), there was a discrepancy in diagnosis. There were 43 malignant cases in total, of which 14 (32.56%) were true positive and 29 (67.44%) were false negative. No false positive cases were found in our series. Conclusion: FNA could diagnose benign nodules in all cases; in malignant cases, however, ultrasound findings have to be taken into consideration to avoid missing a microcarcinoma in the contralateral lobe.
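
The percentages reported in this abstract follow directly from the stated counts. A short check, using only the numbers given above (the variable names are ours):

```python
# Counts as reported in the abstract.
concordant, confirmed = 101, 131
true_pos, false_neg = 14, 29

malignant = true_pos + false_neg             # 43 malignant cases
concordance = concordant / confirmed         # agreement between FNA and histology
sensitivity = true_pos / malignant           # share of malignant cases caught by FNA
false_negative_rate = false_neg / malignant  # share of malignant cases missed

print(f"{concordance:.1%}")          # 77.1%
print(f"{sensitivity:.2%}")          # 32.56%
print(f"{false_negative_rate:.2%}")  # 67.44%
```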

A Static Android Malware Detection Based on Actual Used Permissions Combination and API Calls

The Android operating system has been embraced by most application developers because it is open source and broadly compatible, which has greatly enriched the range of available applications. However, it has become a target for malware attackers because it lacks strict security supervision mechanisms, which has led to rapid growth in malware and brings serious security hazards to users. Therefore, it is critical to detect Android malware effectively. Generally, the permissions declared in AndroidManifest.xml reflect the function and behavior of the application to a large extent. Since the current Android system places no restriction on the number of permissions that an application can request, developers tend to request more permissions than are actually needed in order to ensure that the application runs successfully, which results in the abuse of permissions. However, some traditional detection methods consider only the requested permissions and ignore whether they are actually used, which leads to incorrect identification of some malware. Therefore, this paper puts forward a machine learning detection method based on combinations of actually used permissions and API calls. Several experiments are conducted to evaluate the methodology. The results show that it can detect unknown malware effectively, with a higher true positive rate and accuracy while maintaining a low false positive rate. The AdaboostM1 (J48) classification algorithm with information gain feature selection gives the best detection result, achieving an accuracy of 99.8%, a true positive rate of 99.6% and the lowest false positive rate of 0.

Performance Comparison of ADTree and Naive Bayes Algorithms for Spam Filtering

Classification is an important data mining technique and can be used for data filtering in artificial intelligence. Because it applies to all kinds of data, classification is used in nearly every field of modern life. Classification helps us to group items according to the features judged interesting and useful. In this paper, we compare two classification methods, Naïve Bayes and ADTree, used to detect spam e-mail. This choice is motivated by the fact that the Naïve Bayes algorithm is based on probability calculus while the ADTree algorithm is based on decision trees. The parameter settings of these classifiers are chosen to maximize the true positive rate and minimize the false positive rate. The experimental results present classification accuracy and cost analysis with a view to choosing the optimal classifier for spam detection. The study also identifies the number of attributes that gives a trade-off between attribute count and classification accuracy.

Angiographic Evaluation of ETT (Treadmill) Positive Patients in a Tertiary Care Hospital of Bangladesh

The objective was to evaluate the factors which predetermine coronary artery disease in patients with a positive Exercise Tolerance Test (ETT, i.e. treadmill) result, together with their coronary artery findings. This descriptive study was conducted at the Department of Cardiology, Ibrahim Cardiac Hospital & Research Institute, Dhaka, Bangladesh from 1st January, 2014 to 31st August, 2014. All patients who had undergone ETT (treadmill) for chest pain diagnosis were studied. One hundred and four patients underwent coronary angiogram after a positive treadmill result. Patients were divided into two groups depending upon the angiographic findings, i.e. true positive and false positive. Patients with a positive treadmill test who had coronary artery involvement were called true positive, and those with no involvement were called false positive. The two groups were compared with each other. Out of 104 patients, 81 (77.9%) had true positive ETT and 23 (22.1%) had false positive ETT. The mean age of patients with positive ETT was 53.46±8.06 years; the mean age was 53.63±8.36 years for males and 52.87±7.0 years for females. Sixty nine (85.19%) male patients and twelve (14.81%) female patients had true positive ETT, whereas 15 (65.21%) males and 8 (34.79%) females had false positive ETT; this was statistically significant (p

Feature Based Unsupervised Intrusion Detection

The goal of a network-based intrusion detection system is to classify network traffic activities into two major categories: normal and attack (intrusive) activities. Nowadays, data mining and machine learning play an important role in many sciences, including intrusion detection systems (IDS), using both supervised and unsupervised techniques. One of the essential steps of data mining is feature selection, which helps improve the efficiency, performance and prediction rate of the proposed approach. This paper applies the unsupervised K-means clustering algorithm with information gain (IG) for feature selection and reduction to build a network intrusion detection system. For our experimental analysis, we used the new NSL-KDD dataset, which is a modified version of the KDDCup 1999 intrusion detection benchmark dataset. With a split of 60.0% for the training set and the remainder for the testing set, a two-class classification (Normal, Attack) was implemented. The Weka framework, a Java-based open source software package consisting of a collection of machine learning algorithms for data mining tasks, was used in the testing process. The experimental results show that the proposed approach is very accurate, with a low false positive rate and a high true positive rate, and that it takes less learning time than the same algorithm using the full feature set of the dataset.
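
Information gain, the feature selection criterion named in this abstract, can be computed for a discrete feature as in the sketch below. The feature names and the four toy records are hypothetical; the study itself used Weka, not this code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records, for illustration only.
labels = ["normal", "normal", "attack", "attack"]
protocol = ["tcp", "tcp", "udp", "udp"]  # separates the classes perfectly
flag = ["a", "b", "a", "b"]              # carries no class information

print(information_gain(protocol, labels))  # 1.0
print(information_gain(flag, labels))      # 0.0
```

Features would then be ranked by this score and the lowest-ranked ones dropped before clustering.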

An Algorithm for Detecting Seam Cracks in Steel Plates

In this study, we developed an algorithm for detecting seam cracks in a steel plate. Seam cracks are generated in the edge region of a steel plate. We used the Gabor filter and an adaptive double threshold method to detect them. To reduce the number of pseudo defects, features based on the shape of seam cracks were used. To evaluate the performance of the proposed algorithm, we tested 989 images with seam cracks and 9470 defect-free images. Experimental results show that the proposed algorithm is suitable for detecting seam cracks. However, it should be improved to increase the true positive rate.

Moving towards Positive Security Model for Web Application Firewall

The proliferation of web applications and the pervasiveness of mobile technology make web-based attacks even more attractive and even easier to launch. A Web Application Firewall (WAF) is an intermediate tool between the web server and users that provides comprehensive protection for web applications. WAF is a negative security model, in which the detection and prevention mechanisms are based on predefined or user-defined attack signatures and patterns. However, WAF alone is not adequate to offer the best defense against web vulnerabilities, which are increasing in number and complexity daily. This paper presents a methodology to automatically design a positive security based model which identifies and allows only legitimate web queries. The paper shows that a true positive rate of more than 90% can be achieved.

Bayes Net Classifiers for Prediction of Renal Graft Status and Survival Period

This paper presents the development of a Bayesian belief network classifier for prediction of graft status and survival period in renal transplantation using the patient profile information prior to the transplantation. The objective was to explore the feasibility of developing a decision-making tool for identifying the most suitable recipient among the candidate pool members. The dataset was compiled from the University of Toledo Medical Center Hospital patients as reported to the United Network for Organ Sharing (UNOS), and had 1228 patient records for the period covering 1987 through 2009. The Bayes net classifiers were developed using the Weka machine learning software workbench. Two separate classifiers were induced from the data set, one to predict the status of the graft as either failed or living, and a second classifier to predict the graft survival period. The classifier for graft status prediction performed very well, with a prediction accuracy of 97.8% and true positive values of 0.967 and 0.988 for the living and failed classes, respectively. The second classifier, which predicts the graft survival period, yielded a prediction accuracy of 68.2% and a true positive rate of 0.85 for the class representing those instances with kidneys failing during the first year following transplantation. Simulation results indicated that it is feasible to develop a successful Bayesian belief network classifier for prediction of graft status, but not the graft survival period, using the information in the UNOS database.

An Effective Method of Head Lamp and Tail Lamp Recognition for Night Time Vehicle Detection

This paper presents an effective method for detecting vehicles in front of a camera-assisted car during nighttime driving. The proposed method detects vehicles by detecting their headlights and taillights using image segmentation and clustering techniques. First, to effectively extract the spotlights of interest, a segmentation process based on an automatic multi-level threshold method is applied to the road-scene images. Second, to spatially cluster the detected lamps into vehicles, a grouping process based on light tracking and on locating vehicle lighting patterns is applied. The method was implemented on a Da-vinci 7437 DSP board with a near-infrared mono camera and tested on urban and rural roads. In these tests, the classification performance exceeded a 97% true positive rate in a real-time environment. The method also performs well in clear, foggy and rainy weather.

Learning to Order Terms: Supervised Interestingness Measures in Terminology Extraction

Term Extraction, a key data preparation step in Text Mining, extracts the terms, i.e. relevant collocations of words, attached to specific concepts (e.g. genetic-algorithms and decision-trees are terms associated with the concept “Machine Learning”). In this paper, the task of extracting interesting collocations is achieved through a supervised learning algorithm, exploiting a few collocations manually labelled as interesting/not interesting. From these examples, the ROGER algorithm learns a numerical function, inducing some ranking on the collocations. This ranking is optimized using genetic algorithms, maximizing the trade-off between the false positive and true positive rates (Area Under the ROC curve). This approach uses a particular representation for the word collocations, namely the vector of values corresponding to the standard statistical interestingness measures attached to this collocation. As this representation is general (over corpora and natural languages), generality tests were performed by applying the ranking function learned from an English corpus in Biology to a French corpus of Curriculum Vitae, and vice versa, showing good robustness of the approach compared to the state-of-the-art Support Vector Machine (SVM).
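
The Area Under the ROC curve used as the optimization criterion here equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below computes it by pairwise comparison; it illustrates the criterion only, not the ROGER algorithm, and the scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: ranking scores, higher meaning "more interesting".
    labels: 1 for interesting (positive), 0 for not interesting (negative).
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four collocations, for illustration only.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```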

Autonomously Determining the Parameters for SVDD with RBF Kernel from a One-Class Training Set

The one-class support vector machine “support vector data description” (SVDD) is an ideal approach for anomaly or outlier detection. However, for SVDD to be applicable in real-world applications, ease of use is crucial. The results of SVDD are strongly determined by the choice of the regularisation parameter C and the kernel parameter of the widely used RBF kernel. While for two-class SVMs the parameters can be tuned using cross-validation based on the confusion matrix, for a one-class SVM this is not possible, because only true positives and false negatives can occur during training. This paper proposes an approach to find the optimal set of parameters for SVDD solely based on a training set from one class and without any user parameterisation. Results on artificial and real data sets are presented, underpinning the usefulness of the approach.

Performance of Histogram-Based Skin Colour Segmentation for Arms Detection in Human Motion Analysis Application

Arms detection is one of the fundamental problems in human motion analysis applications. The arms are considered the most challenging body part to detect, since their pose and speed vary across image sequences. Moreover, the arms are usually occluded by other body parts such as the head and torso. In this paper, histogram-based skin colour segmentation is proposed to detect the arms in image sequences. Six different colour spaces, namely RGB, rgb, HSI, TSL, SCT and CIELAB, are evaluated to determine the best colour space for this segmentation procedure. The evaluation is divided into three categories: single colour component, colour without luminance and colour with luminance. The performance is measured using True Positive (TP) and True Negative (TN) on 250 images with manual ground truth. The best colour space is selected based on the highest TN value followed by the highest TP value.
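
The selection rule in the last sentence can be read as a lexicographic comparison: highest TN first, with TP as the tie-breaker. The sketch below follows that reading, which is an assumption, and expresses TP and TN as rates over ground-truth pixels; the per-colour-space numbers are invented for illustration.

```python
def tp_tn_rates(predicted, truth):
    """Pixel-wise TP and TN rates for binary masks (1 = skin, 0 = non-skin)."""
    positives = sum(1 for t in truth if t == 1)
    negatives = len(truth) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ground truth must contain both classes")
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 0)
    return tp / positives, tn / negatives


def best_colour_space(results):
    """Pick the entry with the highest TN, breaking ties by the highest TP.

    results: mapping from colour space name to a (tp, tn) pair.
    """
    return max(results, key=lambda name: (results[name][1], results[name][0]))


# Hypothetical (TP, TN) rates, for illustration only.
results = {"RGB": (0.80, 0.95), "HSI": (0.85, 0.95), "CIELAB": (0.90, 0.92)}
print(best_colour_space(results))  # HSI
```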

Hot-Spot Blob Merging for Real-Time Image Segmentation

One of the major and difficult tasks in automated video surveillance is the segmentation of relevant objects in the scene. Current implementations often yield results that are, on average, inconsistent from frame to frame when trying to differentiate partly occluding objects. This paper presents an efficient block-based segmentation algorithm which is capable of separating partly occluding objects and detecting shadows. It has been shown to perform in real time, with a maximum duration of 47.48 ms per frame (for 8x8 blocks on a 720x576 image) and a true positive rate of 89.2%. The flexible structure of the algorithm enables adaptations and improvements with little effort. Most of the parameters correspond to relative differences between quantities extracted from the image and should therefore not depend on scene and lighting conditions. The result is a performance-oriented segmentation algorithm which is applicable in all critical real-time scenarios.
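
To put the reported figures in perspective, the block count and the worst-case frame rate can be derived from the numbers in the abstract. The variable names are ours.

```python
# Figures as reported in the abstract.
width, height = 720, 576
block = 8
max_ms_per_frame = 47.48

blocks_per_frame = (width // block) * (height // block)  # 90 * 72 blocks
min_fps = 1000.0 / max_ms_per_frame                      # worst-case frame rate

print(blocks_per_frame)   # 6480
print(f"{min_fps:.1f}")   # 21.1
```

In the worst case the algorithm therefore processes 6480 blocks per frame at roughly 21 frames per second.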