Analysis of Relation between Unlabeled and Labeled Data to Self-Taught Learning Performance

Obtaining labeled data in supervised learning is often difficult and expensive, and thus the trained learning algorithm tends to be overfitting due to small number of training data. As a result, some researchers have focused on using unlabeled data which may not necessary to follow the same generative distribution as the labeled data to construct a high-level feature for improving performance on supervised learning tasks. In this paper, we investigate the impact of the relationship between unlabeled and labeled data for classification performance. Specifically, we will apply difference unlabeled data which have different degrees of relation to the labeled data for handwritten digit classification task based on MNIST dataset. Our experimental results show that the higher the degree of relation between unlabeled and labeled data, the better the classification performance. Although the unlabeled data that is completely from different generative distribution to the labeled data provides the lowest classification performance, we still achieve high classification performance. This leads to expanding the applicability of the supervised learning algorithms using unsupervised learning.

Learning to Order Terms: Supervised Interestingness Measures in Terminology Extraction

Term Extraction, a key data preparation step in Text Mining, extracts the terms, i.e. relevant collocation of words, attached to specific concepts (e.g. genetic-algorithms and decisiontrees are terms associated to the concept “Machine Learning" ). In this paper, the task of extracting interesting collocations is achieved through a supervised learning algorithm, exploiting a few collocations manually labelled as interesting/not interesting. From these examples, the ROGER algorithm learns a numerical function, inducing some ranking on the collocations. This ranking is optimized using genetic algorithms, maximizing the trade-off between the false positive and true positive rates (Area Under the ROC curve). This approach uses a particular representation for the word collocations, namely the vector of values corresponding to the standard statistical interestingness measures attached to this collocation. As this representation is general (over corpora and natural languages), generality tests were performed by experimenting the ranking function learned from an English corpus in Biology, onto a French corpus of Curriculum Vitae, and vice versa, showing a good robustness of the approaches compared to the state-of-the-art Support Vector Machine (SVM).

User Pattern Learning Algorithm based MDSS(Medical Decision Support System) Framework under Ubiquitous

In this paper, we present user pattern learning algorithm based MDSS (Medical Decision support system) under ubiquitous. Most of researches are focus on hardware system, hospital management and whole concept of ubiquitous environment even though it is hard to implement. Our objective of this paper is to design a MDSS framework. It helps to patient for medical treatment and prevention of the high risk patient (COPD, heart disease, Diabetes). This framework consist database, CAD (Computer Aided diagnosis support system) and CAP (computer aided user vital sign prediction system). It can be applied to develop user pattern learning algorithm based MDSS for homecare and silver town service. Especially this CAD has wise decision making competency. It compares current vital sign with user-s normal condition pattern data. In addition, the CAP computes user vital sign prediction using past data of the patient. The novel approach is using neural network method, wireless vital sign acquisition devices and personal computer DB system. An intelligent agent based MDSS will help elder people and high risk patients to prevent sudden death and disease, the physician to get the online access to patients- data, the plan of medication service priority (e.g. emergency case).

Forecasting Fraudulent Financial Statements using Data Mining

This paper explores the effectiveness of machine learning techniques in detecting firms that issue fraudulent financial statements (FFS) and deals with the identification of factors associated to FFS. To this end, a number of experiments have been conducted using representative learning algorithms, which were trained using a data set of 164 fraud and non-fraud Greek firms in the recent period 2001-2002. The decision of which particular method to choose is a complicated problem. A good alternative to choosing only one method is to create a hybrid forecasting system incorporating a number of possible solution methods as components (an ensemble of classifiers). For this purpose, we have implemented a hybrid decision support system that combines the representative algorithms using a stacking variant methodology and achieves better performance than any examined simple and ensemble method. To sum up, this study indicates that the investigation of financial information can be used in the identification of FFS and underline the importance of financial ratios.

The Influence of Preprocessing Parameters on Text Categorization

Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning nontrivial internal parameters. Although partial studies are available, many authors fail to report values of the parameters they use in their experiments, or reasons why these values were used instead of others. The goal of this work then is to create a more thorough comparison of preprocessing parameters and their mutual influence, and report interesting observations and results.

Weka Based Desktop Data Mining as Web Service

Data mining is the process of sifting through large volumes of data, analyzing data from different perspectives and summarizing it into useful information. One of the widely used desktop applications for data mining is the Weka tool which is nothing but a collection of machine learning algorithms implemented in Java and open sourced under the General Public License (GPL). A web service is a software system designed to support interoperable machine to machine interaction over a network using SOAP messages. Unlike a desktop application, a web service is easy to upgrade, deliver and access and does not occupy any memory on the system. Keeping in mind the advantages of a web service over a desktop application, in this paper we are demonstrating how this Java based desktop data mining application can be implemented as a web service to support data mining across the internet.

Understanding and Designing Situation-Aware Mobile and Ubiquitous Computing Systems

Using spatial models as a shared common basis of information about the environment for different kinds of contextaware systems has been a heavily researched topic in the last years. Thereby the research focused on how to create, to update, and to merge spatial models so as to enable highly dynamic, consistent and coherent spatial models at large scale. In this paper however, we want to concentrate on how context-aware applications could use this information so as to adapt their behavior according to the situation they are in. The main idea is to provide the spatial model infrastructure with a situation recognition component based on generic situation templates. A situation template is – as part of a much larger situation template library – an abstract, machinereadable description of a certain basic situation type, which could be used by different applications to evaluate their situation. In this paper, different theoretical and practical issues – technical, ethical and philosophical ones – are discussed important for understanding and developing situation dependent systems based on situation templates. A basic system design is presented which allows for the reasoning with uncertain data using an improved version of a learning algorithm for the automatic adaption of situation templates. Finally, for supporting the development of adaptive applications, we present a new situation-aware adaptation concept based on workflows.

Combining Bagging and Additive Regression

Bagging and boosting are among the most popular re-sampling ensemble methods that generate and combine a diversity of regression models using the same learning algorithm as base-learner. Boosting algorithms are considered stronger than bagging on noise-free data. However, there are strong empirical indications that bagging is much more robust than boosting in noisy settings. For this reason, in this work we built an ensemble using an averaging methodology of bagging and boosting ensembles with 10 sub-learners in each one. We performed a comparison with simple bagging and boosting ensembles with 25 sub-learners on standard benchmark datasets and the proposed ensemble gave better accuracy.

A Cognitive Robot Collaborative Reinforcement Learning Algorithm

A cognitive collaborative reinforcement learning algorithm (CCRL) that incorporates an advisor into the learning process is developed to improve supervised learning. An autonomous learner is enabled with a self awareness cognitive skill to decide when to solicit instructions from the advisor. The learner can also assess the value of advice, and accept or reject it. The method is evaluated for robotic motion planning using simulation. Tests are conducted for advisors with skill levels from expert to novice. The CCRL algorithm and a combined method integrating its logic with Clouse-s Introspection Approach, outperformed a base-line fully autonomous learner, and demonstrated robust performance when dealing with various advisor skill levels, learning to accept advice received from an expert, while rejecting that of less skilled collaborators. Although the CCRL algorithm is based on RL, it fits other machine learning methods, since advisor-s actions are only added to the outer layer.

Visualization and Indexing of Spectral Databases

On-line (near infrared) spectroscopy is widely used to support the operation of complex process systems. Information extracted from spectral database can be used to estimate unmeasured product properties and monitor the operation of the process. These techniques are based on looking for similar spectra by nearest neighborhood algorithms and distance based searching methods. Search for nearest neighbors in the spectral space is an NP-hard problem, the computational complexity increases by the number of points in the discrete spectrum and the number of samples in the database. To reduce the calculation time some kind of indexing could be used. The main idea presented in this paper is to combine indexing and visualization techniques to reduce the computational requirement of estimation algorithms by providing a two dimensional indexing that can also be used to visualize the structure of the spectral database. This 2D visualization of spectral database does not only support application of distance and similarity based techniques but enables the utilization of advanced clustering and prediction algorithms based on the Delaunay tessellation of the mapped spectral space. This means the prediction has not to use the high dimension space but can be based on the mapped space too. The results illustrate that the proposed method is able to segment (cluster) spectral databases and detect outliers that are not suitable for instance based learning algorithms.

Distributed Relay Selection and Channel Choice in Cognitive Radio Network

In this paper, we study the cooperative communications where multiple cognitive radio (CR) transmit-receive pairs competitive maximize their own throughputs. In CR networks, the influences of primary users and the spectrum availability are usually different among CR users. Due to the existence of multiple relay nodes and the different spectrum availability, each CR transmit-receive pair should not only select the relay node but also choose the appropriate channel. For this distributed problem, we propose a game theoretic framework to formulate this problem and we apply a regret-matching learning algorithm which is leading to correlated equilibrium. We further formulate a modified regret-matching learning algorithm which is fully distributed and only use the local information of each CR transmit-receive pair. This modified algorithm is more practical and suitable for the cooperative communications in CR network. Simulation results show the algorithm convergence and the modified learning algorithm can achieve comparable performance to the original regretmatching learning algorithm.

On Speeding Up Support Vector Machines: Proximity Graphs Versus Random Sampling for Pre-Selection Condensation

Support vector machines (SVMs) are considered to be the best machine learning algorithms for minimizing the predictive probability of misclassification. However, their drawback is that for large data sets the computation of the optimal decision boundary is a time consuming function of the size of the training set. Hence several methods have been proposed to speed up the SVM algorithm. Here three methods used to speed up the computation of the SVM classifiers are compared experimentally using a musical genre classification problem. The simplest method pre-selects a random sample of the data before the application of the SVM algorithm. Two additional methods use proximity graphs to pre-select data that are near the decision boundary. One uses k-Nearest Neighbor graphs and the other Relative Neighborhood Graphs to accomplish the task.

Combining Fuzzy Logic and Neural Networks in Modeling Landfill Gas Production

Heterogeneity of solid waste characteristics as well as the complex processes taking place within the landfill ecosystem motivated the implementation of soft computing methodologies such as artificial neural networks (ANN), fuzzy logic (FL), and their combination. The present work uses a hybrid ANN-FL model that employs knowledge-based FL to describe the process qualitatively and implements the learning algorithm of ANN to optimize model parameters. The model was developed to simulate and predict the landfill gas production at a given time based on operational parameters. The experimental data used were compiled from lab-scale experiment that involved various operating scenarios. The developed model was validated and statistically analyzed using F-test, linear regression between actual and predicted data, and mean squared error measures. Overall, the simulated landfill gas production rates demonstrated reasonable agreement with actual data. The discussion focused on the effect of the size of training datasets and number of training epochs.

Attacks Classification in Adaptive Intrusion Detection using Decision Tree

Recently, information security has become a key issue in information technology as the number of computer security breaches are exposed to an increasing number of security threats. A variety of intrusion detection systems (IDS) have been employed for protecting computers and networks from malicious network-based or host-based attacks by using traditional statistical methods to new data mining approaches in last decades. However, today's commercially available intrusion detection systems are signature-based that are not capable of detecting unknown attacks. In this paper, we present a new learning algorithm for anomaly based network intrusion detection system using decision tree algorithm that distinguishes attacks from normal behaviors and identifies different types of intrusions. Experimental results on the KDD99 benchmark network intrusion detection dataset demonstrate that the proposed learning algorithm achieved 98% detection rate (DR) in comparison with other existing methods.

Agent-based Simulation for Blood Glucose Control in Diabetic Patients

This paper employs a new approach to regulate the blood glucose level of type I diabetic patient under an intensive insulin treatment. The closed-loop control scheme incorporates expert knowledge about treatment by using reinforcement learning theory to maintain the normoglycemic average of 80 mg/dl and the normal condition for free plasma insulin concentration in severe initial state. The insulin delivery rate is obtained off-line by using Qlearning algorithm, without requiring an explicit model of the environment dynamics. The implementation of the insulin delivery rate, therefore, requires simple function evaluation and minimal online computations. Controller performance is assessed in terms of its ability to reject the effect of meal disturbance and to overcome the variability in the glucose-insulin dynamics from patient to patient. Computer simulations are used to evaluate the effectiveness of the proposed technique and to show its superiority in controlling hyperglycemia over other existing algorithms

Combining Bagging and Boosting

Bagging and boosting are among the most popular resampling ensemble methods that generate and combine a diversity of classifiers using the same learning algorithm for the base-classifiers. Boosting algorithms are considered stronger than bagging on noisefree data. However, there are strong empirical indications that bagging is much more robust than boosting in noisy settings. For this reason, in this work we built an ensemble using a voting methodology of bagging and boosting ensembles with 10 subclassifiers in each one. We performed a comparison with simple bagging and boosting ensembles with 25 sub-classifiers, as well as other well known combining methods, on standard benchmark datasets and the proposed technique was the most accurate.

Kernel’s Parameter Selection for Support Vector Domain Description

Support Vector Domain Description (SVDD) is one of the best-known one-class support vector learning methods, in which one tries the strategy of using balls defined on the feature space in order to distinguish a set of normal data from all other possible abnormal objects. As all kernel-based learning algorithms its performance depends heavily on the proper choice of the kernel parameter. This paper proposes a new approach to select kernel's parameter based on maximizing the distance between both gravity centers of normal and abnormal classes, and at the same time minimizing the variance within each class. The performance of the proposed algorithm is evaluated on several benchmarks. The experimental results demonstrate the feasibility and the effectiveness of the presented method.

A Multi-layer Artificial Neural Network Architecture Design for Load Forecasting in Power Systems

In this paper, the modelling and design of artificial neural network architecture for load forecasting purposes is investigated. The primary pre-requisite for power system planning is to arrive at realistic estimates of future demand of power, which is known as Load Forecasting. Short Term Load Forecasting (STLF) helps in determining the economic, reliable and secure operating strategies for power system. The dependence of load on several factors makes the load forecasting a very challenging job. An over estimation of the load may cause premature investment and unnecessary blocking of the capital where as under estimation of load may result in shortage of equipment and circuits. It is always better to plan the system for the load slightly higher than expected one so that no exigency may arise. In this paper, a load-forecasting model is proposed using a multilayer neural network with an appropriately modified back propagation learning algorithm. Once the neural network model is designed and trained, it can forecast the load of the power system 24 hours ahead on daily basis and can also forecast the cumulative load on daily basis. The real load data that is used for the Artificial Neural Network training was taken from LDC, Gujarat Electricity Board, Jambuva, Gujarat, India. The results show that the load forecasting of the ANN model follows the actual load pattern more accurately throughout the forecasted period.

Dataset Analysis Using Membership-Deviation Graph

Classification is one of the primary themes in computational biology. The accuracy of classification strongly depends on quality of a dataset, and we need some method to evaluate this quality. In this paper, we propose a new graphical analysis method using 'Membership-Deviation Graph (MDG)' for analyzing quality of a dataset. MDG represents degree of membership and deviations for instances of a class in the dataset. The result of MDG analysis is used for understanding specific feature and for selecting best feature for classification.

Sprayer Boom Active Suspension Using Intelligent Active Force Control

The control of sprayer boom undesired vibrations pose a great challenge to investigators due to various disturbances and conditions. Sprayer boom movements lead to reduce of spread efficiency and crop yield. This paper describes the design of a novel control method for an active suspension system applying proportional-integral-derivative (PID) controller with an active force control (AFC) scheme integration of an iterative learning algorithm employed to a sprayer boom. The iterative learning as an intelligent method is principally used as a method to calculate the best value of the estimated inertia of the sprayer boom needed for the AFC loop. Results show that the proposed AFC-based scheme performs much better than the standard PID control technique. Also, this shows that the system is more robust and accurate.