Abstract: The aim of this paper is to compare classifier algorithms for credit risk assessment by applying different Machine Learning techniques. Using records from a Brazilian financial institution, this study uses a database of 5,432 companies that are clients of the bank, of which 2,600 are classified as non-defaulters, 1,551 as defaulters and 1,281 as temporary defaulters, meaning that the clients are overdue on their payments for up to 180 days. For each case, a total of 15 attributes were considered for a one-against-all assessment using four different techniques: Artificial Neural Networks Multilayer Perceptron (ANN-MLP), Artificial Neural Networks Radial Basis Functions (ANN-RBF), Logistic Regression (LR) and Support Vector Machines (SVM). For each method, different parameter settings were analyzed so that the best configuration of each technique could be compared. Initially the data were coded in thermometer code (for numerical attributes) or dummy coding (for nominal attributes). The methods were then evaluated for each parameter setting, and the best result of each technique was compared in terms of accuracy, false positives, false negatives, true positives and true negatives. This comparison showed that the best method, in terms of accuracy, was ANN-RBF (79.20% for non-defaulter classification, 97.74% for defaulters and 75.37% for temporary defaulters). However, the best accuracy does not always indicate the best technique. For instance, in the classification of temporary defaulters, ANN-RBF was surpassed in terms of false positives by SVM, which had the lowest rate of false positive classifications (0.07%). These details are discussed in light of the results, and an overview is given in the conclusion of this study.
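The two input codings named in the abstract above can be illustrated with a minimal sketch. The thresholds, attribute names and category list below are hypothetical, chosen only for illustration; the abstract does not state how the 15 attributes were actually binned.

```python
def thermometer_encode(value, thresholds):
    """Thermometer code: bit i is 1 when value exceeds the i-th threshold."""
    return [1 if value > t else 0 for t in sorted(thresholds)]


def dummy_encode(category, categories):
    """Dummy (one-hot) code: a single 1 at the position of the category."""
    return [1 if category == c else 0 for c in categories]


# Hypothetical attributes: a numerical one (revenue) and a nominal one (sector).
revenue_thresholds = [100, 500, 1000]
sectors = ["industry", "retail", "services"]

row = thermometer_encode(750, revenue_thresholds) + dummy_encode("retail", sectors)
print(row)  # [1, 1, 0, 0, 1, 0]
```

Unlike a one-hot code, the thermometer code preserves the ordering of a numerical attribute: larger values switch on more bits.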
Abstract: Intrusion detection systems (IDS) are the main components of network security. These systems analyze network events to detect intrusions. An IDS is designed by training on normal or attack traffic data. Machine learning methods are the best way to design IDSs. In the method presented in this article, the pruning algorithm of the C5.0 decision tree is used to reduce the features of the traffic data, and the IDS is trained with the least squares support vector machine (LS-SVM) algorithm. The remaining features are then ranked according to the predictor importance criterion, and the least important features are eliminated in order. The features remaining at the stage that yields the highest LS-SVM accuracy are selected as the final features. Compared with other articles that have examined selected features in the least squares support vector machine model, the features obtained give better accuracy, true positive rate, and false positive rate. The results are tested on the UNSW-NB15 dataset.
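The ranked elimination step described above can be sketched as follows. This is an illustrative sketch only: `evaluate` is a hypothetical stand-in for training and scoring an LS-SVM on a feature subset, and the feature ranking and accuracies are invented toy values, not results from the article.

```python
def select_by_ranked_elimination(ranked_features, evaluate):
    """Drop the least important feature one at a time and keep the best subset.

    ranked_features: feature names, most important first.
    evaluate: callable mapping a feature list to an accuracy (hypothetical).
    """
    current = list(ranked_features)
    best_features, best_score = list(current), evaluate(current)
    while len(current) > 1:
        current = current[:-1]  # remove the least important remaining feature
        score = evaluate(current)
        if score >= best_score:  # ties favour the smaller feature set
            best_features, best_score = list(current), score
    return best_features, best_score


# Toy stand-in for "train LS-SVM and measure accuracy", keyed by subset size.
toy_accuracy = {4: 0.90, 3: 0.93, 2: 0.95, 1: 0.80}
ranking = ["sbytes", "sttl", "dur", "proto"]  # hypothetical importance order

result = select_by_ranked_elimination(ranking, lambda fs: toy_accuracy[len(fs)])
print(result)  # (['sbytes', 'sttl'], 0.95)
```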
Abstract: Software-defined networking (SDN) provides a solution
for a scalable network framework with decoupled control and data
planes. However, this architecture also induces a particular distributed
denial-of-service (DDoS) attack that can affect or even overwhelm
the SDN network. The DDoS attack detection problem has to date been
mostly researched as an entropy comparison problem. However, this
approach does not exploit SDN, and the results are not accurate.
In this paper, we propose a DDoS attack detection method, which
interprets DDoS detection as a signature matching problem and is
formulated as an Earth Mover’s Distance (EMD) model. Considering
feasibility and accuracy, we further propose to define the cost
function of EMD to be a generalized Kullback-Leibler divergence.
Simulation results show that our proposed method can detect DDoS
attacks by comparing EMD values with the ones computed in the case
without attacks. Moreover, our method can significantly increase the
true positive rate of detection.
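For reference, the generalized Kullback-Leibler divergence between two non-negative vectors p and q is commonly defined as the sum over i of p_i log(p_i / q_i) - p_i + q_i. The sketch below computes only this divergence; how it enters the EMD cost function is not specified in the abstract, and the traffic histograms are hypothetical.

```python
import math


def generalized_kl(p, q, eps=1e-12):
    """Generalized KL divergence: sum(p*log(p/q) - p + q) for non-negative vectors.

    Entries of q are clamped to eps to avoid division by zero (an assumption
    made for this sketch); terms with p_i == 0 contribute only q_i.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        qi = max(qi, eps)
        if pi > 0:
            total += pi * math.log(pi / qi)
        total += qi - pi
    return total


# Hypothetical per-destination traffic shares: baseline vs. observed window.
baseline = [0.5, 0.5]
observed = [0.9, 0.1]
print(generalized_kl(observed, baseline))
print(generalized_kl(baseline, baseline))  # identical inputs give zero
```

When both vectors sum to the same total, the extra terms cancel and the value coincides with the ordinary KL divergence.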
Abstract: Background and Objectives: The incidence of thyroid carcinoma has been increasing worldwide. In the present study, we evaluated the diagnostic accuracy of fine needle aspiration (FNA) and its efficiency in the early detection of neoplastic lesions of the thyroid gland over a 3-year period. Methods: Data were retrieved from pathology files in King Khalid Hospital. For each patient, age, gender, FNA result, site and size of nodule and final histopathologic diagnosis were recorded. Results: The study included 490 cases, of which 419 were female and 71 male. The male to female ratio was 1:6. Mean age was 43 years for males and 38 for females. There were 131 cases with confirmed histopathology. In 101/131 (77.1%), concordance was found between FNA and histology. In 30/131 (22.9%), there was a discrepancy in diagnosis. There were 43 malignant cases in total, of which 14 (32.56%) were true positive and 29 (67.44%) were false negative. No false positive cases were found in our series. Conclusion: FNA could diagnose benign nodules in all cases; however, in malignant cases, ultrasound findings have to be taken into consideration to avoid missing a microcarcinoma in the contralateral lobe.
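The rates quoted in the Results can be checked directly from the reported counts. The short calculation below uses only numbers stated in the abstract; the variable names are illustrative.

```python
# Counts reported in the abstract.
true_positive = 14      # malignant on histology, detected by FNA
false_negative = 29     # malignant on histology, missed by FNA
concordant = 101        # FNA agreed with histology
confirmed = 131         # cases with confirmed histopathology

malignant_total = true_positive + false_negative       # 43
sensitivity = true_positive / malignant_total           # true positive rate
false_negative_rate = false_negative / malignant_total
concordance = concordant / confirmed

print(f"sensitivity         {sensitivity:.2%}")          # 32.56%
print(f"false negative rate {false_negative_rate:.2%}")  # 67.44%
print(f"concordance         {concordance:.1%}")          # 77.1%
```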
Abstract: The Android operating system has been embraced by most application developers because of its open-source nature and good compatibility, which has greatly enriched the range of available applications. However, it has become a target of malware attackers due to the lack of strict security supervision mechanisms, which has led to the rapid growth of malware and brings serious safety hazards to users. Therefore, it is critical to detect Android malware effectively. Generally, the permissions declared in AndroidManifest.xml reflect the function and behavior of an application to a large extent. Since the current Android system places no restriction on the number of permissions that an application can request, developers tend to request more permissions than actually needed in order to ensure that the application runs successfully, which results in the abuse of permissions. However, some traditional detection methods only consider the requested permissions and ignore whether they are actually used, which leads to incorrect identification of some malware. Therefore, a machine learning detection method based on combinations of the actually used permissions and API calls is put forward in this paper. Several experiments are conducted to evaluate our methodology. The results show that it can detect unknown malware effectively, with a higher true positive rate and accuracy while maintaining a low false positive rate. In particular, the AdaboostM1 (J48) classification algorithm based on the information gain feature selection algorithm has the best detection result, achieving an accuracy of 99.8%, a true positive rate of 99.6% and a false positive rate of 0.
Abstract: Classification is an important data mining technique
and can be used for data filtering in artificial intelligence. Its
broad applicability to all kinds of data has led to its use in
nearly every field of modern life. Classification helps us
to group different items according to the features judged
interesting and useful. In this paper, we compare two
classification methods, Naïve Bayes and ADTree, used to detect spam
e-mail. This choice is motivated by the fact that the Naïve Bayes
algorithm is based on probability calculus while the ADTree algorithm is
based on decision trees. The parameter settings of the above
classifiers are chosen to maximize the true positive rate and
minimize the false positive rate. The experimental results present
classification accuracy and cost analysis with a view to choosing the
optimal classifier for spam detection. The number of attributes needed to
obtain a trade-off between attribute count and classification
accuracy is also pointed out.
Abstract: This study evaluates the factors that predict
coronary artery disease in patients with a positive Exercise Tolerance
Test (ETT, i.e. treadmill) result, based on their coronary artery findings. This
descriptive study was conducted at the Department of Cardiology,
Ibrahim Cardiac Hospital & Research Institute, Dhaka, Bangladesh
from 1st January, 2014 to 31st August, 2014. All patients who had
undergone ETT (treadmill) for chest pain diagnosis were studied. One
hundred and four patients underwent coronary angiogram after a
positive treadmill result. Patients were divided into two groups
depending upon the angiographic findings, i.e. true positive and false
positive. Patients with a positive treadmill test who had coronary artery
involvement were classed as true positive, and those with no
involvement were classed as false positive. Both groups were
compared with each other. Out of 104 patients, 81 (77.9%) patients
had true positive ETT and 23 (22.1%) patients had false positive
ETT. The mean age of patients with positive ETT was 53.46 ± 8.06
years; the mean age was 53.63 ± 8.36 years for males and
52.87 ± 7.0 years for females. Sixty-nine (85.19%) male patients and twelve
(14.81%) female patients had true positive ETT, whereas 15
(65.21%) males and 8 (34.79%) females had false positive ETT; this
was statistically significant (p
Abstract: The goal of a network-based intrusion detection
system is to classify network traffic activities into two major
categories: normal and attack (intrusive) activities. Nowadays, data
mining and machine learning play an important role in many
sciences, including intrusion detection systems (IDS), using both
supervised and unsupervised techniques. One of the
essential steps of data mining is feature selection, which helps
improve the efficiency, performance and prediction rate of the
proposed approach. This paper applies the unsupervised K-means
clustering algorithm with information gain (IG) for feature selection
and reduction to build a network intrusion detection system. For our
experimental analysis, we have used the new NSL-KDD dataset,
which is a modified dataset for KDDCup 1999 intrusion detection
benchmark dataset. With a split of 60.0% for the training set and the
remainder for the testing set, a two-class classification (Normal, Attack)
has been implemented. The Weka framework, a Java-based
open-source collection of machine
learning algorithms for data mining tasks, has been used in the testing
process. The experimental results show that the proposed approach is
very accurate with low false positive rate and high true positive rate
and it takes less learning time in comparison with using the full
features of the dataset with the same algorithm.
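Information gain, the feature selection criterion named above, measures how much knowing a feature reduces the entropy of the class label. The sketch below assumes discrete feature values and uses hypothetical toy records; it is not the authors' Weka configuration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records with two discrete features.
labels = ["normal", "normal", "attack", "attack"]
protocol = ["tcp", "tcp", "udp", "udp"]   # separates the classes perfectly
flag = ["SF", "S0", "SF", "S0"]           # carries no class information

print(information_gain(protocol, labels))  # 1.0
print(information_gain(flag, labels))      # 0.0
```

Features would then be ranked by this score, and only the highest-ranked ones passed to the clustering step.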
Abstract: In this study, we developed an algorithm for detecting
seam cracks in a steel plate. Seam cracks are generated in the edge
region of a steel plate. We used the Gabor filter and an adaptive double
threshold method to detect them. To reduce the number of pseudo
defects, features based on the shape of seam cracks were used. To
evaluate the performance of the proposed algorithm, we tested 989
images with seam cracks and 9470 defect-free images. Experimental
results show that the proposed algorithm is suitable for detecting seam
cracks. However, it should be improved to increase the true positive
rate.
Abstract: The proliferation of web applications and the pervasiveness of mobile technology make web-based attacks even more attractive and even easier to launch. A Web Application Firewall (WAF) is an intermediate tool between the web server and users that provides comprehensive protection for web applications. A WAF is a negative security model in which the detection and prevention mechanisms are based on predefined or user-defined attack signatures and patterns. However, a WAF alone is not adequate to offer the best defence against web vulnerabilities, which are increasing in number and complexity daily. This paper presents a methodology to automatically design a positive security model which identifies and allows only legitimate web queries. The paper shows that a true positive rate of more than 90% can be achieved.
Abstract: This paper presents the development of a Bayesian
belief network classifier for prediction of graft status and survival
period in renal transplantation using the patient profile information
prior to the transplantation. The objective was to explore feasibility
of developing a decision making tool for identifying the most suitable
recipient among the candidate pool members. The dataset was
compiled from the University of Toledo Medical Center Hospital
patients as reported to the United Network for Organ Sharing (UNOS), and had
1228 patient records for the period covering 1987 through 2009. The
Bayes net classifiers were developed using the Weka machine
learning software workbench. Two separate classifiers were induced
from the data set, one to predict the status of the graft as either failed
or living, and a second classifier to predict the graft survival period.
The classifier for graft status prediction performed very well with a
prediction accuracy of 97.8% and true positive values of 0.967 and
0.988 for the living and failed classes, respectively. The second
classifier to predict the graft survival period yielded a prediction
accuracy of 68.2% and a true positive rate of 0.85 for the class
representing those instances with kidneys failing during the first year
following transplantation. Simulation results indicated that it is
feasible to develop a successful Bayesian belief network classifier for
prediction of graft status, but not the graft survival period, using the
information in UNOS database.
Abstract: This paper presents an effective method for detecting vehicles in front of a camera-assisted car during nighttime driving. The proposed method detects vehicles by detecting their headlights and taillights using image segmentation and clustering techniques. First, to effectively extract the spotlights of interest, a segmentation process based on an automatic multi-level threshold method is applied to the road-scene images. Second, to spatially cluster the detected lamps into vehicles, a grouping process based on light tracking and on locating vehicle lighting patterns is applied. For evaluation, we implemented the method on a Da-vinci 7437 DSP board with a near-infrared mono camera and tested it on urban and rural roads. In these tests, the classification performance was above a 97% true positive rate, evaluated in a real-time environment. Our method also performs well in clear, foggy and rainy weather.
Abstract: Term Extraction, a key data preparation step in Text
Mining, extracts the terms, i.e. relevant collocations of words,
attached to specific concepts (e.g. genetic-algorithms and decision-trees
are terms associated with the concept “Machine Learning”). In
this paper, the task of extracting interesting collocations is achieved
through a supervised learning algorithm, exploiting a few
collocations manually labelled as interesting/not interesting. From
these examples, the ROGER algorithm learns a numerical function,
inducing some ranking on the collocations. This ranking is optimized
using genetic algorithms, maximizing the trade-off between the false
positive and true positive rates (Area Under the ROC curve). This
approach uses a particular representation for the word collocations,
namely the vector of values corresponding to the standard statistical
interestingness measures attached to this collocation. As this
representation is general (over corpora and natural languages),
generality tests were performed by applying the ranking
function learned from an English corpus in Biology to a French
corpus of Curriculum Vitae, and vice versa, showing good
robustness of the approach compared to the state-of-the-art Support
Vector Machine (SVM).
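The ranking criterion mentioned above, the Area Under the ROC Curve, equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below computes it by pairwise comparison; it illustrates the criterion only and is not the ROGER algorithm, and the scores and labels are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels use 1 for interesting and 0 for not interesting; ties count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical ranking scores for four collocations.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(auc(scores, labels))  # 0.75
```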
Abstract: The one-class support vector machine “support vector
data description” (SVDD) is an ideal approach for anomaly or outlier
detection. However, for the applicability of SVDD in real-world
applications, the ease of use is crucial. The results of SVDD are
strongly determined by the choice of the regularisation parameter C
and the kernel parameter of the widely used RBF kernel. While for
two-class SVMs the parameters can be tuned using cross-validation
based on the confusion matrix, for a one-class SVM this is not
possible, because only true positives and false negatives can occur
during training. This paper proposes an approach to find the optimal
set of parameters for SVDD solely based on a training set from
one class and without any user parameterisation. Results on artificial
and real data sets are presented, underpinning the usefulness of the
approach.
Abstract: Arms detection is one of the fundamental problems in
human motion analysis applications. The arms are considered the
most challenging body part to detect since their pose and speed
vary in image sequences. Moreover, the arms are usually occluded
by other body parts such as the head and torso. In this paper,
histogram-based skin colour segmentation is proposed to detect the
arms in image sequences. Six different colour spaces namely RGB,
rgb, HSI, TSL, SCT and CIELAB are evaluated to determine the best
colour space for this segmentation procedure. The evaluation is
divided into three categories, which are single colour component,
colour without luminance and colour with luminance. The
performance is measured using True Positive (TP) and True Negative
(TN) on 250 images with manual ground truth. The best colour space is
selected based on the highest TN value followed by the highest TP
value.
Abstract: One of the major, difficult tasks in automated video
surveillance is the segmentation of relevant objects in the scene.
Current implementations often yield inconsistent results on average
from frame to frame when trying to differentiate partly occluding
objects. This paper presents an efficient block-based segmentation
algorithm which is capable of separating partly occluding objects and
detecting shadows. It has been proven to perform in real time with a
maximum duration of 47.48 ms per frame (for 8x8 blocks on a
720x576 image) with a true positive rate of 89.2%. The flexible
structure of the algorithm enables adaptations and improvements with
little effort. Most of the parameters correspond to relative differences
between quantities extracted from the image and should therefore not
depend on scene and lighting conditions. The result is a
performance-oriented segmentation algorithm that is applicable in
all critical real-time scenarios.
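The timing figure above can be put in context with a short calculation using only the numbers quoted in the abstract (8x8 blocks, a 720x576 image, 47.48 ms maximum per frame); the variable names are illustrative.

```python
image_width, image_height = 720, 576
block_size = 8
max_frame_time_ms = 47.48

blocks_per_row = image_width // block_size        # 90
blocks_per_column = image_height // block_size    # 72
blocks_per_frame = blocks_per_row * blocks_per_column

worst_case_fps = 1000.0 / max_frame_time_ms
time_per_block_us = max_frame_time_ms * 1000.0 / blocks_per_frame

print(blocks_per_frame)               # 6480
print(f"{worst_case_fps:.1f} fps")    # 21.1 fps
print(f"{time_per_block_us:.2f} us per block")
```

A worst-case rate of about 21 frames per second corresponds to a budget of a few microseconds per block.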