Abstract: The formation of amyloid fibrils and plaques in proteins is the main cause of several neurodegenerative diseases, such as Alzheimer's disease, Parkinson's disease, and spongiform encephalopathies. We have analyzed different sets of proteins and peptides to understand the influence of sequence-based features on the protein aggregation process. A comparison of 373 pairs of homologous mesophilic and thermophilic proteins showed that aggregation-prone regions (APRs) are present in both. However, the thermophilic protein monomers show a greater ability to ‘stow away’ the APRs in their hydrophobic cores and protect them from solvent exposure. A comparison of amyloid-forming and amorphous β-aggregating hexapeptides suggested distinct preferences for specific residues at the six positions, as well as for all possible combinations of nine residue pairs. The compositions of residues at different positions and of residue pairs were converted into energy potentials and used to distinguish between amyloid-forming and amorphous β-aggregating peptides. Our method correctly identified the amyloid-forming peptides with an accuracy of 95-100% across different peptide datasets.
Abstract: In this paper, we present the use of discriminant analysis to select the evolutionary algorithm that best solves instances of the vehicle routing problem with time windows. We use indicators as independent variables to obtain the classification criteria, and the best algorithm among the generic genetic algorithm (GA), random search (RS), steady-state genetic algorithm (SSGA), and sexual genetic algorithm (SXGA) as the dependent variable for the classification. The discriminant classifier was trained with classic instances of the vehicle routing problem with time windows taken from the Solomon benchmark. The discriminant analysis achieved a classification rate of 66.7%.
Abstract: Today, a large number of political transcripts are
available on the Web to be mined and used for statistical analysis
and product recommendations. As online political resources are
used for various purposes, automatically determining the political
orientation of these transcripts becomes crucial. The methodologies
used by machine learning algorithms for automatic classification
are based on different features, grouped into categories
such as Linguistic and Personality, among others. Considering the ideological
differences between Liberals and Conservatives, this paper studies the
effect of Personality traits on political orientation classification.
The experiments in this study were based on the correlation
between LIWC features and the Big Five personality traits. Several
experiments were conducted using the Convote U.S. Congressional-Speech
dataset with seven benchmark classification algorithms. The
different methodologies were applied to several LIWC feature sets
consisting of 8 to 64 features that are
correlated with the five personality traits. The experiments showed
Neuroticism to be the most differentiating
personality trait for the classification of political orientation. At the same
time, the personality-trait-based classification
methodology was observed to give results better than or comparable to those of related
work.
Abstract: This paper proposes a scheme for an
Intelligent System applied to Pedagogical Advising using Case-Based
Reasoning. The system retrieves consolidated solutions that were used before and applies them to new
problems, making the task of advising students easier for the
pedagogical staff. Through this work, we present the
motivation behind the choices made for the system structure, justifying the
development of an incremental, smart web system that learns
the best solutions for new cases as it is used, and we describe the techniques and
technology involved.
Abstract: Development of a method to estimate gene functions is
an important task in bioinformatics. One of the approaches for the
annotation is the identification of the metabolic pathway that genes are
involved in. Since gene expression data reflect various intracellular
phenomena, those data are considered to be related to genes’
functions. However, it has been difficult to estimate gene function
with high accuracy. The low accuracy of the
estimation is thought to be caused by the difficulty of accurately measuring gene
expression: even when measured under the same conditions,
gene expression values usually vary. In this study, we proposed a
feature extraction method focusing on the variability of gene
expressions to estimate the genes' metabolic pathway accurately. First,
we estimated the distribution of each gene expression from replicate
data. Next, we calculated the similarity between all gene pairs using the KL
divergence, which measures how much one distribution differs from
another. Finally, we utilized the similarity vectors as feature
vectors and trained the multiclass SVM for identifying the genes'
metabolic pathway. To evaluate our developed method, we applied the
method to budding yeast and trained the multiclass SVM for
identifying the seven metabolic pathways. As a result, the accuracy
obtained with our method was higher than that
obtained from the raw gene expression data. Thus, our
method combined with KL divergence is useful for identifying the
genes' metabolic pathway.
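The feature extraction steps in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each gene's replicate measurements are modeled as a univariate Gaussian (the abstract does not name the distribution family), the function names `fit_gaussian`, `kl_gaussian` and `kl_feature_vectors` are hypothetical, and the toy expression values are invented. The multiclass SVM training step is omitted.

```python
import math
from statistics import mean, variance


def fit_gaussian(replicates):
    """Estimate (mean, variance) of one gene from its replicate measurements.

    Assumption: a univariate Gaussian per gene; a small floor keeps the
    variance strictly positive.
    """
    return mean(replicates), max(variance(replicates), 1e-9)


def kl_gaussian(p, q):
    """KL(P || Q) for two univariate Gaussians given as (mean, variance)."""
    mu_p, var_p = p
    mu_q, var_q = q
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def kl_feature_vectors(expression):
    """Map each gene to its vector of KL divergences to every gene."""
    genes = sorted(expression)
    fitted = {g: fit_gaussian(expression[g]) for g in genes}
    return {g: [kl_gaussian(fitted[g], fitted[h]) for h in genes]
            for g in genes}


# Toy replicate data, invented purely for illustration.
toy = {
    "geneA": [1.0, 1.2, 0.8],
    "geneB": [1.1, 1.3, 0.9],
    "geneC": [5.0, 5.5, 4.5],
}
features = kl_feature_vectors(toy)
```

Each row of `features` could then serve as the input vector of a multiclass classifier. Note that the KL divergence is asymmetric, so the vector for a gene depends on which distribution is taken as the reference.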
Abstract: The dramatic rise in the use of Social Media (SM)
platforms such as Facebook and Twitter provides access to an
unprecedented amount of user data. Users may post reviews of
products and services they bought, write about their interests, share
ideas or give their opinions and views on political issues. There is
growing interest among organisations in analysing SM data for
detecting new trends, obtaining user opinions on their products and
services or finding out about their online reputations. A recent
research trend in SM analysis is making predictions based on
sentiment analysis of SM. Often indicators of historic SM data are
represented as time series and correlated with a variety of real world
phenomena like the outcome of elections, the development of
financial indicators, box office revenue and disease outbreaks. This
paper examines the current state of research in the area of SM mining
and predictive analysis and gives an overview of the analysis
methods using opinion mining and machine learning techniques.
Abstract: The amount of energy the world uses doubles every 20 years. Green homes play an important role in reducing residential energy demand. This paper presents a platform that is intended to learn the behavior of home residents and build a profile of their habits and actions. The proposed resident-aware home controller intervenes in the operation of home appliances in order to save energy without compromising the convenience of the residents. The presented platform can be used to simulate the actions and movements happening inside a home. The paper includes several optimization techniques that are meant to save energy in the home. In addition, several test scenarios are presented that show how the controller works. Moreover, this paper shows the computed actual savings when each of the presented techniques is implemented in a typical home. The test scenarios validated that the techniques developed are capable of effectively saving energy in homes.
Abstract: Educational data mining is a specific data mining field applied to data originating from educational environments; it relies on different approaches to discover hidden knowledge in the available data. Among these approaches are machine learning techniques, which are used to build systems that learn from previous data. Machine learning can be applied to solve different regression, classification, clustering and optimization problems.
In our research, we propose a “Student Advisory Framework” that utilizes classification and clustering to build an intelligent system. This system can be used to advise first-year university students on the education track in which they are most likely to succeed, with the aim of decreasing the high rate of academic failure among these students. A real case study at the Cairo Higher Institute for Engineering, Computer Science and Management is presented, using a real dataset collected from 2000 to 2012. The dataset has two main components: a pre-higher-education dataset and a first-year course results dataset. The results demonstrate the efficiency of the suggested framework.
Abstract: For a given problem, the design of an efficient algorithm has traditionally been the subject of study. However, an alternative approach, orthogonal to this one, has emerged: reduction. In general, the reduction approach studies how to convert an original problem into subproblems. This paper proposes a formal modeling language to support the reduction approach so that a solver can be built quickly. We show three examples from the broad area of learning problems. The benefit is fast prototyping of algorithms for a new problem. Note that our formal modeling language is not intended to provide an efficient notation for data mining applications, but to assist designers who develop solvers in machine learning.
Abstract: Estimates of temperature values at a specific time of day, from daytime and daily profiles, are needed for a number of environmental, ecological, agricultural and technical applications, ranging from natural hazard assessment and crop growth forecasting to the design of solar energy systems. The scope of this research is to investigate the efficiency of data mining techniques in estimating minimum, maximum and mean temperature values. To this end, a number of experiments were conducted with well-known regression algorithms using temperature data from the city of Patras in Greece. The performance of these algorithms was evaluated using standard statistical indicators, such as the correlation coefficient and the root mean squared error.
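The two indicators named in the abstract above can be computed as in the following minimal sketch. The function names and the temperature values are hypothetical and chosen only for illustration; they are not data from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def correlation_coefficient(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    spread_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (spread_a * spread_p)


# Invented temperatures in degrees Celsius: every prediction is 1 degree high.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 13.0, 15.0, 17.0]

error = rmse(observed, predicted)
corr = correlation_coefficient(observed, predicted)
```

In this toy case the correlation is perfect even though every prediction is off by one degree, which is one reason the two indicators are usually reported together.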
Abstract: Music segmentation is a key issue in music information
retrieval (MIR) as it provides an insight into the
internal structure of a composition. Structural information about
a composition can improve several tasks related to MIR such
as searching and browsing large music collections, visualizing
musical structure, lyric alignment, and music summarization.
The authors of this paper present the MTSSM framework, a two-layer
framework for the multi-track segmentation of symbolic
music. The strength of this framework lies in the combination of
existing methods for local track segmentation and the application
of global structure information spanning multiple tracks.
The first layer of the MTSSM uses various string matching
techniques to detect the best candidate segmentations for each
track of a multi-track composition independently. The second
layer combines all single-track results and determines the best
segmentation for each track with respect to the global structure of
the composition.
Abstract: In recent years, real estate prediction or valuation has
been a topic of discussion in many developed countries. Improper
hype created by investors leads to fluctuating real estate prices,
affecting the ability of many consumers to purchase their own homes. Therefore,
scholars from various countries have conducted research on real estate
valuation and prediction. Using the back-propagation neural network,
which has been popular in recent years, and the orthogonal array of the
Taguchi method, this study aimed to find the optimal parameter
combination at different levels of the orthogonal array after the system
presented different parameter combinations, so that the artificial
neural network obtained the most accurate results. The experimental
results demonstrated that the method presented in the study achieved
better results than traditional machine learning. They also showed
that the proposed model had the best predictive performance
and could significantly reduce the time cost of simulation.
The best predictive results could be found more efficiently, with fewer
experiments. Users could thus predict a real estate
transaction price that is close to current actual prices.
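To illustrate how an orthogonal array reduces the number of experiments, the sketch below uses the standard Taguchi L9 array (four factors, three levels each) to pick 9 parameter combinations instead of the 81 of a full factorial design. The factor names and level values are assumptions made up for this example; the abstract does not say which network parameters or which array the study used.

```python
from itertools import combinations

# Standard L9 orthogonal array: 9 runs, 4 factors, 3 levels (coded 0, 1, 2).
L9 = [
    (0, 0, 0, 0), (0, 1, 1, 1), (0, 2, 2, 2),
    (1, 0, 1, 2), (1, 1, 2, 0), (1, 2, 0, 1),
    (2, 0, 2, 1), (2, 1, 0, 2), (2, 2, 1, 0),
]

# Hypothetical network parameters and levels, for illustration only.
levels = {
    "learning_rate": [0.01, 0.1, 0.3],
    "hidden_units": [4, 8, 16],
    "momentum": [0.5, 0.7, 0.9],
    "epochs": [500, 1000, 2000],
}


def is_orthogonal(array, n_levels=3):
    """Check that every pair of columns contains each level pair equally often.

    With 9 runs and 3 levels, that means all 9 level pairs appear exactly once.
    """
    n_cols = len(array[0])
    for i, j in combinations(range(n_cols), 2):
        pairs = {(row[i], row[j]) for row in array}
        if len(pairs) != n_levels ** 2:
            return False
    return True


def taguchi_trials(array, factor_levels):
    """Translate coded rows of the array into concrete parameter settings."""
    names = list(factor_levels)
    return [{name: factor_levels[name][row[k]] for k, name in enumerate(names)}
            for row in array]


trials = taguchi_trials(L9, levels)
```

Each entry of `trials` would be used to train and evaluate one network. The level that gives the best average result for each factor across the nine runs is then taken as the candidate optimal setting.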
Abstract: In the present study, a support vector machine (SVM) learning approach to character recognition is proposed. Simple
feature detectors, similar to those found in the human visual system, were used in the SVM classifier. Alphabetic characters were rotated
to 8 different angles and, using the proposed cognitive model, all characters were recognized with 100% accuracy and specificity.
These same results were found in psychiatric studies of human character recognition.
Abstract: Combining classifiers is a useful method for solving
complex problems in machine learning. The ECOC (Error Correcting
Output Codes) method has been widely used for designing combining
classifiers with an emphasis on the diversity of classifiers. In this
paper, in contrast to the standard ECOC approach in which individual
classifiers are chosen homogeneously, classifiers are selected
according to the complexity of the corresponding binary problem. We
use the SATIMAGE database (containing 6 classes) for our experiments.
The recognition error rate of our proposed method is 10.37%, which
indicates a considerable improvement over the
conventional ECOC and stacked generalization methods.
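For readers unfamiliar with ECOC, the sketch below shows the standard decoding step only: each class gets a binary codeword, one binary classifier is trained per code bit, and a sample is assigned to the class whose codeword is nearest in Hamming distance to the predicted bits. It assumes an exhaustive code for 6 classes and simulates the classifier outputs by flipping bits. It does not implement the complexity-based classifier selection proposed in the paper, and all function names are hypothetical.

```python
from itertools import combinations, product


def exhaustive_code(n_classes):
    """Build an exhaustive ECOC matrix: one row (codeword) per class.

    Columns are all binary class splits whose first bit is 1, except the
    all-ones column, giving 2**(n_classes - 1) - 1 binary problems.
    """
    columns = [(1,) + rest
               for rest in product((0, 1), repeat=n_classes - 1)
               if not all(rest)]
    return [tuple(col[c] for col in columns) for c in range(n_classes)]


def hamming(a, b):
    """Number of positions in which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(code):
    """Smallest Hamming distance between any two codewords."""
    return min(hamming(a, b) for a, b in combinations(code, 2))


def decode(bits, code):
    """Return the index of the class whose codeword is closest to `bits`."""
    return min(range(len(code)), key=lambda c: hamming(bits, code[c]))


code = exhaustive_code(6)

# Simulate five wrong binary classifiers for a sample of class 3.
true_class = 3
noisy = list(code[true_class])
for i in (0, 4, 9, 17, 30):
    noisy[i] ^= 1
predicted = decode(noisy, code)
```

With a minimum distance of d between codewords, up to (d - 1) // 2 individual classifier errors can be corrected, which is why the five simulated errors above do not change the decoded class.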
Abstract: This paper presents a simple and effective method for approximate indexing of instances for instance-based learning. The method uses an interval tree to determine a good starting search point for the nearest neighbor. The search stops when an early stopping criterion is met. The method proved to be very effective, especially when only the first nearest neighbor is required.
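The general idea of starting the search near the query and stopping early can be sketched as follows. This is a simplified stand-in, not the method of the paper: it replaces the interval tree with a list sorted on a single feature and a binary search, and the stopping criterion (stop once the gap along that feature is at least the best distance found so far, or once an optional `max_checks` budget is used up) is an assumption. All names are hypothetical.

```python
import bisect
import math


def build_index(points, axis=0):
    """Sort the stored instances by one feature to allow binary search."""
    order = sorted(points, key=lambda p: p[axis])
    keys = [p[axis] for p in order]
    return order, keys


def nearest(query, index, axis=0, max_checks=None):
    """Search outward from the query's position in the sorted feature.

    Without `max_checks` the result is the exact nearest neighbor, because
    the gap along one axis is a lower bound on the Euclidean distance.
    With a budget, the result is only approximate.
    """
    order, keys = index
    start = bisect.bisect_left(keys, query[axis])
    lo, hi = start - 1, start
    best, best_d, checks = None, math.inf, 0
    while lo >= 0 or hi < len(order):
        gap_lo = query[axis] - keys[lo] if lo >= 0 else math.inf
        gap_hi = keys[hi] - query[axis] if hi < len(order) else math.inf
        if min(gap_lo, gap_hi) >= best_d:
            break  # no remaining instance can be closer
        if max_checks is not None and checks >= max_checks:
            break  # budget exhausted: accept an approximate answer
        if gap_lo <= gap_hi:
            candidate, lo = order[lo], lo - 1
        else:
            candidate, hi = order[hi], hi + 1
        d = math.dist(query, candidate)
        checks += 1
        if d < best_d:
            best, best_d = candidate, d
    return best, best_d, checks


points = [(0.0, 0.0), (1.0, 5.0), (2.0, 0.5), (9.0, 9.0), (2.2, 4.0)]
index = build_index(points)
query = (2.1, 0.4)
best, best_d, checks = nearest(query, index)
```

In this toy example only two of the five stored instances are compared against the query before the search stops.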
Abstract: This paper presents four unsupervised clustering algorithms, namely sIB, RandomFlatClustering, FarthestFirst, and FilteredClusterer, that previous works have not used for network traffic classification. We describe the methodology, the results, the resulting clusters, and an evaluation of the efficiency of each algorithm in terms of accuracy. We also consider the efficiency of these algorithms in terms of the time each takes to generate the clusters quickly and correctly. Our work studies and tests for the best algorithm by classifying traffic anomalies in network traffic using different attributes that have not been used before. We analyse the algorithm with the best efficiency or the best learning and compare it to the previously used K-Means. Our research will be used to make anomaly detection systems more efficient and better suited to future requirements.
Abstract: Text categorization - the assignment of natural language documents to one or more predefined categories based on their semantic content - is an important component in many information organization and management tasks. The performance of neural network learning is known to be sensitive to the initial weights and architecture. This paper discusses the use of a decision tree classifier to initialize a multilayer neural network in order to improve text categorization accuracy. An adaptation of the algorithm is proposed in which a decision tree, from the root node to a final leaf, is used to initialize the multilayer neural network. The experimental evaluation demonstrates that this approach provides better classification accuracy on the Reuters-21578 corpus, one of the standard benchmarks for text categorization tasks. We present results comparing the accuracy of this approach with that of a multilayer neural network initialized with the traditional random method and with that of decision tree classifiers.
Abstract: In the automotive industry test drives are being conducted
during the development of new vehicle models or as a part of
quality assurance of series-production vehicles. The communication
on the in-vehicle network, data from external sensors, or internal
data from the electronic control units is recorded by automotive
data loggers during the test drives. The recordings are used for fault
analysis. Since the resulting data volume is tremendous, manually
analysing each recording in great detail is not feasible.
This paper proposes to use machine learning to support domain experts
by preventing them from contemplating irrelevant data and
rather pointing them to the relevant parts in the recordings. The
underlying idea is to learn the normal behaviour from available
recordings, i.e. a training set, and then to autonomously detect
unexpected deviations and report them as anomalies.
The one-class support vector machine “support vector data description”
is utilised to calculate distances of feature vectors. SVDDSUBSEQ
is proposed as a novel approach that allows subsequences
in multivariate time series data to be classified. The approach makes it possible to
detect unexpected faults without modelling effort, as is shown with
experimental results on recordings from test drives.
Abstract: Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning nontrivial internal parameters. Although partial studies are available, many authors fail to report values of the parameters they use in their experiments, or reasons why these values were used instead of others. The goal of this work then is to create a more thorough comparison of preprocessing parameters and their mutual influence, and report interesting observations and results.
Abstract: The belief decision tree (BDT) approach is a decision
tree in an uncertain environment where the uncertainty is represented
through the Transferable Belief Model (TBM), one interpretation
of the belief function theory. The uncertainty can appear either in
the actual class of training objects or in the attribute values of objects to
classify. In this paper, we develop a post-pruning method for belief
decision trees in order to reduce their size and improve classification
accuracy on unseen cases. The pruning of decision trees has received
considerable attention in the field of machine learning.