A Probabilistic View of the Spatial Pooler in Hierarchical Temporal Memory

In the Hierarchical Temporal Memory (HTM) paradigm the effect of overlap between inputs on the activation of columns in the spatial pooler is studied. Numerical results suggest that similar inputs are represented by similar sets of columns and dissimilar inputs are represented by dissimilar sets of columns. It is shown that the spatial pooler produces these results under certain conditions for the connectivity and proximal thresholds. Following the discussion of the initialization of parameters for the thresholds, corresponding qualitative arguments about the learning dynamics of the spatial pooler are discussed.

A Comparative Study of Malware Detection Techniques Using Machine Learning Methods

In the past few years, the amount of malicious software increased exponentially and, therefore, machine learning algorithms became instrumental in identifying clean and malware files through (semi)-automated classification. When working with very large datasets, the major challenge is to reach both a very high malware detection rate and a very low false positive rate. Another challenge is to minimize the time needed for the machine learning algorithm to do so. This paper presents a comparative study between different machine learning techniques such as linear classifiers, ensembles, decision trees or various hybrids thereof. The training dataset consists of approximately 2 million clean files and 200.000 infected files, which is a realistic quantitative mixture. The paper investigates the above mentioned methods with respect to both their performance (detection rate and false positive rate) and their practicability.

Educase – Intelligent System for Pedagogical Advising Using Case-Based Reasoning

This paper introduces a proposal scheme for an Intelligent System applied to Pedagogical Advising using Case-Based Reasoning, to find consolidated solutions before used for the new problems, making easier the task of advising students to the pedagogical staff. We do intend, through this work, introduce the motivation behind the choices for this system structure, justifying the development of an incremental and smart web system who learns bests solutions for new cases when it’s used, showing technics and technology.

Customer Churn Prediction: A Cognitive Approach

Customer churn prediction is one of the most useful areas of study in customer analytics. Due to the enormous amount of data available for such predictions, machine learning and data mining have been heavily used in this domain. There exist many machine learning algorithms directly applicable for the problem of customer churn prediction, and here, we attempt to experiment on a novel approach by using a cognitive learning based technique in an attempt to improve the results obtained by using a combination of supervised learning methods, with cognitive unsupervised learning methods.

Feature-Based Summarizing and Ranking from Customer Reviews

Due to the rapid increase of Internet, web opinion sources dynamically emerge which is useful for both potential customers and product manufacturers for prediction and decision purposes. These are the user generated contents written in natural languages and are unstructured-free-texts scheme. Therefore, opinion mining techniques become popular to automatically process customer reviews for extracting product features and user opinions expressed over them. Since customer reviews may contain both opinionated and factual sentences, a supervised machine learning technique applies for subjectivity classification to improve the mining performance. In this paper, we dedicate our work is the task of opinion summarization. Therefore, product feature and opinion extraction is critical to opinion summarization, because its effectiveness significantly affects the identification of semantic relationships. The polarity and numeric score of all the features are determined by Senti-WordNet Lexicon. The problem of opinion summarization refers how to relate the opinion words with respect to a certain feature. Probabilistic based model of supervised learning will improve the result that is more flexible and effective.

Application of KL Divergence for Estimation of Each Metabolic Pathway Genes

Development of a method to estimate gene functions is an important task in bioinformatics. One of the approaches for the annotation is the identification of the metabolic pathway that genes are involved in. Since gene expression data reflect various intracellular phenomena, those data are considered to be related with genes’ functions. However, it has been difficult to estimate the gene function with high accuracy. It is considered that the low accuracy of the estimation is caused by the difficulty of accurately measuring a gene expression. Even though they are measured under the same condition, the gene expressions will vary usually. In this study, we proposed a feature extraction method focusing on the variability of gene expressions to estimate the genes' metabolic pathway accurately. First, we estimated the distribution of each gene expression from replicate data. Next, we calculated the similarity between all gene pairs by KL divergence, which is a method for calculating the similarity between distributions. Finally, we utilized the similarity vectors as feature vectors and trained the multiclass SVM for identifying the genes' metabolic pathway. To evaluate our developed method, we applied the method to budding yeast and trained the multiclass SVM for identifying the seven metabolic pathways. As a result, the accuracy that calculated by our developed method was higher than the one that calculated from the raw gene expression data. Thus, our developed method combined with KL divergence is useful for identifying the genes' metabolic pathway.

New Approach for Load Modeling

Load modeling is one of the central functions in power systems operations. Electricity cannot be stored, which means that for electric utility, the estimate of the future demand is necessary in managing the production and purchasing in an economically reasonable way. A majority of the recently reported approaches are based on neural network. The attraction of the methods lies in the assumption that neural networks are able to learn properties of the load. However, the development of the methods is not finished, and the lack of comparative results on different model variations is a problem. This paper presents a new approach in order to predict the Tunisia daily peak load. The proposed method employs a computational intelligence scheme based on the Fuzzy neural network (FNN) and support vector regression (SVR). Experimental results obtained indicate that our proposed FNN-SVR technique gives significantly good prediction accuracy compared to some classical techniques.

Imputation Technique for Feature Selection in Microarray Data Set

Analyzing DNA microarray data sets is a great challenge, which faces the bioinformaticians due to the complication of using statistical and machine learning techniques. The challenge will be doubled if the microarray data sets contain missing data, which happens regularly because these techniques cannot deal with missing data. One of the most important data analysis process on the microarray data set is feature selection. This process finds the most important genes that affect certain disease. In this paper, we introduce a technique for imputing the missing data in microarray data sets while performing feature selection.

Predicting Application Layer DDoS Attacks Using Machine Learning Algorithms

A Distributed Denial of Service (DDoS) attack is a major threat to cyber security. It originates from the network layer or the application layer of compromised/attacker systems which are connected to the network. The impact of this attack ranges from the simple inconvenience to use a particular service to causing major failures at the targeted server. When there is heavy traffic flow to a target server, it is necessary to classify the legitimate access and attacks. In this paper, a novel method is proposed to detect DDoS attacks from the traces of traffic flow. An access matrix is created from the traces. As the access matrix is multi dimensional, Principle Component Analysis (PCA) is used to reduce the attributes used for detection. Two classifiers Naive Bayes and K-Nearest neighborhood are used to classify the traffic as normal or abnormal. The performance of the classifier with PCA selected attributes and actual attributes of access matrix is compared by the detection rate and False Positive Rate (FPR).

Comparative Study Using Weka for Red Blood Cells Classification

Red blood cells (RBC) are the most common types of blood cells and are the most intensively studied in cell biology. The lack of RBCs is a condition in which the amount of hemoglobin level is lower than normal and is referred to as “anemia”. Abnormalities in RBCs will affect the exchange of oxygen. This paper presents a comparative study for various techniques for classifying the RBCs as normal or abnormal (anemic) using WEKA. WEKA is an open source consists of different machine learning algorithms for data mining applications. The algorithms tested are Radial Basis Function neural network, Support vector machine, and K-Nearest Neighbors algorithm. Two sets of combined features were utilized for classification of blood cells images. The first set, exclusively consist of geometrical features, was used to identify whether the tested blood cell has a spherical shape or non-spherical cells. While the second set, consist mainly of textural features was used to recognize the types of the spherical cells. We have provided an evaluation based on applying these classification methods to our RBCs image dataset which were obtained from Serdang Hospital - Malaysia, and measuring the accuracy of test results. The best achieved classification rates are 97%, 98%, and 79% for Support vector machines, Radial Basis Function neural network, and K-Nearest Neighbors algorithm respectively.

What the Future Holds for Social Media Data Analysis

The dramatic rise in the use of Social Media (SM) platforms such as Facebook and Twitter provide access to an unprecedented amount of user data. Users may post reviews on products and services they bought, write about their interests, share ideas or give their opinions and views on political issues. There is a growing interest in the analysis of SM data from organisations for detecting new trends, obtaining user opinions on their products and services or finding out about their online reputations. A recent research trend in SM analysis is making predictions based on sentiment analysis of SM. Often indicators of historic SM data are represented as time series and correlated with a variety of real world phenomena like the outcome of elections, the development of financial indicators, box office revenue and disease outbreaks. This paper examines the current state of research in the area of SM mining and predictive analysis and gives an overview of the analysis methods using opinion mining and machine learning techniques.

A Review: Comparative Study of Diverse Collection of Data Mining Tools

There have been a lot of efforts and researches undertaken in developing efficient tools for performing several tasks in data mining. Due to the massive amount of information embedded in huge data warehouses maintained in several domains, the extraction of meaningful pattern is no longer feasible. This issue turns to be more obligatory for developing several tools in data mining. Furthermore the major aspire of data mining software is to build a resourceful predictive or descriptive model for handling large amount of information more efficiently and user friendly. Data mining mainly contracts with excessive collection of data that inflicts huge rigorous computational constraints. These out coming challenges lead to the emergence of powerful data mining technologies. In this survey a diverse collection of data mining tools are exemplified and also contrasted with the salient features and performance behavior of each tool.

Methods for Distinction of Cattle Using Supervised Learning

Machine learning represents a set of topics dealing with the creation and evaluation of algorithms that facilitate pattern recognition, classification, and prediction, based on models derived from existing data. The data can present identification patterns which are used to classify into groups. The result of the analysis is the pattern which can be used for identification of data set without the need to obtain input data used for creation of this pattern. An important requirement in this process is careful data preparation validation of model used and its suitable interpretation. For breeders, it is important to know the origin of animals from the point of the genetic diversity. In case of missing pedigree information, other methods can be used for traceability of animal´s origin. Genetic diversity written in genetic data is holding relatively useful information to identify animals originated from individual countries. We can conclude that the application of data mining for molecular genetic data using supervised learning is an appropriate tool for hypothesis testing and identifying an individual.

Resident-Aware Green Home

The amount of energy the world uses doubles every 20 years. Green homes play an important role in reducing the residential energy demand. This paper presents a platform that is intended to learn the behavior of home residents and build a profile about their habits and actions. The proposed resident aware home controller intervenes in the operation of home appliances in order to save energy without compromising the convenience of the residents. The presented platform can be used to simulate the actions and movements happening inside a home. The paper includes several optimization techniques that are meant to save energy in the home. In addition, several test scenarios are presented that show how the controller works. Moreover, this paper shows the computed actual savings when each of the presented techniques is implemented in a typical home. The test scenarios have validated that the techniques developed are capable of effectively saving energy at homes.

A TIPSO-SVM Expert System for Efficient Classification of TSTO Surrogates

Fully reusable spaceplanes do not exist as yet. This implies that design-qualification for optimized highly-integrated forebody-inlet configuration of booster-stage vehicle cannot be based on archival data of other spaceplanes. Therefore, this paper proposes a novel TIPSO-SVM expert system methodology. A non-trivial problem related to optimization and classification of hypersonic forebody-inlet configuration in conjunction with mass-model of the two-stage-to-orbit (TSTO) vehicle is solved. The hybrid-heuristic machine learning methodology is based on two-step improved particle swarm optimizer (TIPSO) algorithm and two-step support vector machine (SVM) data classification method. The efficacy of method is tested by first evolving an optimal configuration for hypersonic compression system using TIPSO algorithm; thereafter, classifying the results using two-step SVM method. In the first step extensive but non-classified mass-model training data for multiple optimized configurations is segregated and pre-classified for learning of SVM algorithm. In second step the TIPSO optimized mass-model data is classified using the SVM classification. Results showed remarkable improvement in configuration and mass-model along with sizing parameters.

Enhance the Power of Sentiment Analysis

Since big data has become substantially more accessible and manageable due to the development of powerful tools for dealing with unstructured data, people are eager to mine information from social media resources that could not be handled in the past. Sentiment analysis, as a novel branch of text mining, has in the last decade become increasingly important in marketing analysis, customer risk prediction and other fields. Scientists and researchers have undertaken significant work in creating and improving their sentiment models. In this paper, we present a concept of selecting appropriate classifiers based on the features and qualities of data sources by comparing the performances of five classifiers with three popular social media data sources: Twitter, Amazon Customer Reviews, and Movie Reviews. We introduced a couple of innovative models that outperform traditional sentiment classifiers for these data sources, and provide insights on how to further improve the predictive power of sentiment analysis. The modeling and testing work was done in R and Greenplum in-database analytic tools.

Incorporating Multiple Supervised Learning Algorithms for Effective Intrusion Detection

As internet continues to expand its usage with an  enormous number of applications, cyber-threats have significantly  increased accordingly. Thus, accurate detection of malicious traffic in  a timely manner is a critical concern in today’s Internet for security.  One approach for intrusion detection is to use Machine Learning (ML)  techniques. Several methods based on ML algorithms have been  introduced over the past years, but they are largely limited in terms of  detection accuracy and/or time and space complexity to run. In this  work, we present a novel method for intrusion detection that  incorporates a set of supervised learning algorithms. The proposed  technique provides high accuracy and outperforms existing techniques  that simply utilizes a single learning method. In addition, our  technique relies on partial flow information (rather than full  information) for detection, and thus, it is light-weight and desirable for  online operations with the property of early identification. With the  mid-Atlantic CCDC intrusion dataset publicly available, we show that  our proposed technique yields a high degree of detection rate over 99%  with a very low false alarm rate (0.4%).   

Prediction of Research Topics Using Ensemble of Best Predictors from Similar Dataset

Prediction of future research topics by using time series analysis either statistical or machine learning has been conducted previously by several researchers. Several methods have been proposed to combine the forecasting results into single forecast. These methods use fixed combination of individual forecast to get the final forecast result. In this paper, quite different approach is employed to select the forecasting methods, in which every point to forecast is calculated by using the best methods used by similar validation dataset. The dataset used in the experiment is time series derived from research report in Garuda, which is an online sites belongs to the Ministry of Education in Indonesia, over the past 20 years. The experimental result demonstrates that the proposed method may perform better compared to the fix combination of predictors. In addition, based on the prediction result, we can forecast emerging research topics for the next few years.

Evaluating Performance of an Anomaly Detection Module with Artificial Neural Network Implementation

Anomaly detection techniques have been focused on two main components: data extraction and selection and the second one is the analysis performed over the obtained data. The goal of this paper is to analyze the influence that each of these components has over the system performance by evaluating detection over network scenarios with different setups. The independent variables are as follows: the number of system inputs, the way the inputs are codified and the complexity of the analysis techniques. For the analysis, some approaches of artificial neural networks are implemented with different number of layers. The obtained results show the influence that each of these variables has in the system performance.

Empirical Process Monitoring Via Chemometric Analysis of Partially Unbalanced Data

Real-time or in-line process monitoring frameworks are designed to give early warnings for a fault along with meaningful identification of its assignable causes. In artificial intelligence and machine learning fields of pattern recognition various promising approaches have been proposed such as kernel-based nonlinear machine learning techniques. This work presents a kernel-based empirical monitoring scheme for batch type production processes with small sample size problem of partially unbalanced data. Measurement data of normal operations are easy to collect whilst special events or faults data are difficult to collect. In such situations, noise filtering techniques can be helpful in enhancing process monitoring performance. Furthermore, preprocessing of raw process data is used to get rid of unwanted variation of data. The performance of the monitoring scheme was demonstrated using three-dimensional batch data. The results showed that the monitoring performance was improved significantly in terms of detection success rate of process fault.