Abstract: The reason for conducting this research is to develop an algorithm that is capable of classifying news articles from the automobile industry, according to the competitive actions that they entail, with the use of Text Mining (TM) methods. It is needed to test how to properly preprocess the data for this research by preparing pipelines which fits each algorithm the best. The pipelines are tested along with nine different classification algorithms in the realm of regression, support vector machines, and neural networks. Preliminary testing for identifying the optimal pipelines and algorithms resulted in the selection of two algorithms with two different pipelines. The two algorithms are Logistic Regression (LR) and Artificial Neural Network (ANN). These algorithms are optimized further, where several parameters of each algorithm are tested. The best result is achieved with the ANN. The final model yields an accuracy of 0.79, a precision of 0.80, a recall of 0.78, and an F1 score of 0.76. By removing three of the classes that created noise, the final algorithm is capable of reaching an accuracy of 94%.
Abstract: The Internet has grown into a powerful medium for information dispersion and social interaction that leads to a rapid growth of social media which allows users to easily post their emotions and perspectives regarding certain topics online. Our research aims at using natural language processing and text mining techniques to explore the public emotions expressed on Twitter by analyzing the sentiment behind tweets. In this paper, we propose a composite kernel method that integrates tree kernel with the linear kernel to simultaneously exploit both the tree representation and the distributed emotion keyword representation to analyze the syntactic and content information in tweets. The experiment results demonstrate that our method can effectively detect public emotion of tweets while outperforming the other compared methods.
Abstract: Datasets or collections are becoming important assets by themselves and now they can be accepted as a primary intellectual output of a research. The quality and usage of the datasets depend mainly on the context under which they have been collected, processed, analyzed, validated, and interpreted. This paper aims to present a collection of program educational objectives mapped to student’s outcomes collected from self-study reports prepared by 32 engineering programs accredited by ABET. The manual mapping (classification) of this data is a notoriously tedious, time consuming process. In addition, it requires experts in the area, which are mostly not available. It has been shown the operational settings under which the collection has been produced. The collection has been cleansed, preprocessed, some features have been selected and preliminary exploratory data analysis has been performed so as to illustrate the properties and usefulness of the collection. At the end, the collection has been benchmarked using nine of the most widely used supervised multiclass classification techniques (Binary Relevance, Label Powerset, Classifier Chains, Pruned Sets, Random k-label sets, Ensemble of Classifier Chains, Ensemble of Pruned Sets, Multi-Label k-Nearest Neighbors and Back-Propagation Multi-Label Learning). The techniques have been compared to each other using five well-known measurements (Accuracy, Hamming Loss, Micro-F, Macro-F, and Macro-F). The Ensemble of Classifier Chains and Ensemble of Pruned Sets have achieved encouraging performance compared to other experimented multi-label classification methods. The Classifier Chains method has shown the worst performance. To recap, the benchmark has achieved promising results by utilizing preliminary exploratory data analysis performed on the collection, proposing new trends for research and providing a baseline for future studies.
Abstract: In this paper, we present an evolving knowledge
extraction system named AKEOS (Automatic Knowledge Extraction
from Online Sources). AKEOS consists of two modules, including
a one-time learning module and an evolving learning module.
The one-time learning module takes in user input query, and
automatically harvests knowledge from online unstructured resources
in an unsupervised way. The output of the one-time learning is a
structured vector representing the harvested knowledge. The evolving
learning module automatically schedules and performs repeated
one-time learning to extract the newest information and track the
development of an event. In addition, the evolving learning module
summarizes the knowledge learned at different time points to produce
a final knowledge vector about the event. With the evolving learning,
we are able to visualize the key information of the event, discover
the trends, and track the development of an event.
Abstract: Due to the rapid increase of Internet, web opinion
sources dynamically emerge which is useful for both potential
customers and product manufacturers for prediction and decision
purposes. These are the user generated contents written in natural
languages and are unstructured-free-texts scheme. Therefore, opinion
mining techniques become popular to automatically process customer
reviews for extracting product features and user opinions expressed
over them. Since customer reviews may contain both opinionated and
factual sentences, a supervised machine learning technique applies
for subjectivity classification to improve the mining performance. In
this paper, we dedicate our work is the task of opinion
summarization. Therefore, product feature and opinion extraction is
critical to opinion summarization, because its effectiveness
significantly affects the identification of semantic relationships. The
polarity and numeric score of all the features are determined by
Senti-WordNet Lexicon. The problem of opinion summarization
refers how to relate the opinion words with respect to a certain
feature. Probabilistic based model of supervised learning will
improve the result that is more flexible and effective.
Abstract: The use of eXtensible Markup Language (XML) in
web, business and scientific databases lead to the development of
methods, techniques and systems to manage and analyze XML data.
Semi-structured documents suffer due to its heterogeneity and
dimensionality. XML structure and content mining represent
convergence for research in semi-structured data and text mining. As
the information available on the internet grows drastically, extracting
knowledge from XML documents becomes a harder task. Certainly,
documents are often so large that the data set returned as answer to a
query may also be very big to convey the required information. To
improve the query answering, a Semantic Tree Based Association
Rule (STAR) mining method is proposed. This method provides
intentional information by considering the structure, content and the
semantics of the content. The method is applied on Reuter’s dataset
and the results show that the proposed method outperforms well.
Abstract: Since big data has become substantially more accessible and manageable due to the development of powerful tools for dealing with unstructured data, people are eager to mine information from social media resources that could not be handled in the past. Sentiment analysis, as a novel branch of text mining, has in the last decade become increasingly important in marketing analysis, customer risk prediction and other fields. Scientists and researchers have undertaken significant work in creating and improving their sentiment models. In this paper, we present a concept of selecting appropriate classifiers based on the features and qualities of data sources by comparing the performances of five classifiers with three popular social media data sources: Twitter, Amazon Customer Reviews, and Movie Reviews. We introduced a couple of innovative models that outperform traditional sentiment classifiers for these data sources, and provide insights on how to further improve the predictive power of sentiment analysis. The modeling and testing work was done in R and Greenplum in-database analytic tools.
Abstract: Text Mining is around applying knowledge discovery
techniques to unstructured text is termed knowledge discovery in text
(KDT), or Text data mining or Text Mining. In decision tree
approach is most useful in classification problem. With this
technique, tree is constructed to model the classification process.
There are two basic steps in the technique: building the tree and
applying the tree to the database. This paper describes a proposed
C5.0 classifier that performs rulesets, cross validation and boosting
for original C5.0 in order to reduce the optimization of error ratio.
The feasibility and the benefits of the proposed approach are
demonstrated by means of medial data set like hypothyroid. It is
shown that, the performance of a classifier on the training cases from
which it was constructed gives a poor estimate by sampling or using a
separate test file, either way, the classifier is evaluated on cases that
were not used to build and evaluate the classifier are both are large. If
the cases in hypothyroid.data and hypothyroid.test were to be
shuffled and divided into a new 2772 case training set and a 1000
case test set, C5.0 might construct a different classifier with a lower
or higher error rate on the test cases. An important feature of see5 is
its ability to classifiers called rulesets. The ruleset has an error rate
0.5 % on the test cases. The standard errors of the means provide an
estimate of the variability of results. One way to get a more reliable
estimate of predictive is by f-fold –cross- validation. The error rate of
a classifier produced from all the cases is estimated as the ratio of the
total number of errors on the hold-out cases to the total number of
cases. The Boost option with x trials instructs See5 to construct up to
x classifiers in this manner. Trials over numerous datasets, large and
small, show that on average 10-classifier boosting reduces the error
rate for test cases by about 25%.
Abstract: Classifying biomedical literature is a difficult and
challenging task, especially when a large number of biomedical
articles should be organized into a hierarchical structure. In this paper,
we present an approach for classifying a collection of biomedical text
abstracts downloaded from Medline database with the help of
ontology alignment. To accomplish our goal, we construct two types
of hierarchies, the OHSUMED disease hierarchy and the Medline
abstract disease hierarchies from the OHSUMED dataset and the
Medline abstracts, respectively. Then, we enrich the OHSUMED
disease hierarchy before adapting it to ontology alignment process for
finding probable concepts or categories. Subsequently, we compute
the cosine similarity between the vector in probable concepts (in the
“enriched" OHSUMED disease hierarchy) and the vector in Medline
abstract disease hierarchies. Finally, we assign category to the new
Medline abstracts based on the similarity score. The results obtained
from the experiments show the performance of our proposed approach
for hierarchical classification is slightly better than the performance of
the multi-class flat classification.
Abstract: Searching similar documents and document
management subjects have important place in text mining. One of the
most important parts of similar document research studies is the
process of classifying or clustering the documents. In this study, a
similar document search approach that includes discussion of out the
case of belonging to multiple categories (multiple categories
problem) has been carried. The proposed method that based on Fuzzy
Similarity Classification (FSC) has been compared with Rocchio
algorithm and naive Bayes method which are widely used in text
mining. Empirical results show that the proposed method is quite
successful and can be applied effectively. For the second stage,
multiple categories vector method based on information of categories
regarding to frequency of being seen together has been used.
Empirical results show that achievement is increased almost two
times, when proposed method is compared with classical approach.
Abstract: Recently the use of data mining to scientific bibliographic data bases has been implemented to analyze the pathways of the knowledge or the core scientific relevances of a laureated novel or a country. This specific case of data mining has been named citation mining, and it is the integration of citation bibliometrics and text mining. In this paper we present an improved WEB implementation of statistical physics algorithms to perform the text mining component of citation mining. In particular we use an entropic like distance between the compression of text as an indicator of the similarity between them. Finally, we have included the recently proposed index h to characterize the scientific production. We have used this web implementation to identify users, applications and impact of the Mexican scientific institutions located in the State of Morelos.
Abstract: Most of the existing text mining approaches are
proposed, keeping in mind, transaction databases model. Thus, the
mined dataset is structured using just one concept: the “transaction",
whereas the whole dataset is modeled using the “set" abstract type. In
such cases, the structure of the whole dataset and the relationships
among the transactions themselves are not modeled and
consequently, not considered in the mining process.
We believe that taking into account structure properties of
hierarchically structured information (e.g. textual document, etc ...)
in the mining process, can leads to best results. For this purpose, an
hierarchical associations rule mining approach for textual documents
is proposed in this paper and the classical set-oriented mining
approach is reconsidered profits to a Direct Acyclic Graph (DAG)
oriented approach. Natural languages processing techniques are used
in order to obtain the DAG structure. Based on this graph model, an
hierarchical bottom up algorithm is proposed. The main idea is that
each node is mined with its parent node.
Abstract: With the extensive inclusion of document, especially
text, in the business systems, data mining does not cover the full
scope of Business Intelligence. Data mining cannot deliver its impact
on extracting useful details from the large collection of unstructured
and semi-structured written materials based on natural languages.
The most pressing issue is to draw the potential business intelligence
from text. In order to gain competitive advantages for the business, it
is necessary to develop the new powerful tool, text mining, to expand
the scope of business intelligence.
In this paper, we will work out the strong points of text mining in
extracting business intelligence from huge amount of textual
information sources within business systems. We will apply text
mining to each stage of Business Intelligence systems to prove that
text mining is the powerful tool to expand the scope of BI. After
reviewing basic definitions and some related technologies, we will
discuss the relationship and the benefits of these to text mining. Some
examples and applications of text mining will also be given. The
motivation behind is to develop new approach to effective and
efficient textual information analysis. Thus we can expand the scope
of Business Intelligence using the powerful tool, text mining.
Abstract: Text Mining is around applying knowledge discovery techniques to unstructured text is termed knowledge discovery in text (KDT), or Text data mining or Text Mining. In Neural Network that address classification problems, training set, testing set, learning rate are considered as key tasks. That is collection of input/output patterns that are used to train the network and used to assess the network performance, set the rate of adjustments. This paper describes a proposed back propagation neural net classifier that performs cross validation for original Neural Network. In order to reduce the optimization of classification accuracy, training time. The feasibility the benefits of the proposed approach are demonstrated by means of five data sets like contact-lenses, cpu, weather symbolic, Weather, labor-nega-data. It is shown that , compared to exiting neural network, the training time is reduced by more than 10 times faster when the dataset is larger than CPU or the network has many hidden units while accuracy ('percent correct') was the same for all datasets but contact-lences, which is the only one with missing attributes. For contact-lences the accuracy with Proposed Neural Network was in average around 0.3 % less than with the original Neural Network. This algorithm is independent of specify data sets so that many ideas and solutions can be transferred to other classifier paradigms.
Abstract: This paper presents a system for discovering
association rules from collections of unstructured documents called
EART (Extract Association Rules from Text). The EART system
treats texts only not images or figures. EART discovers association
rules amongst keywords labeling the collection of textual documents.
The main characteristic of EART is that the system integrates XML
technology (to transform unstructured documents into structured
documents) with Information Retrieval scheme (TF-IDF) and Data
Mining technique for association rules extraction. EART depends on
word feature to extract association rules. It consists of four phases:
structure phase, index phase, text mining phase and visualization
phase. Our work depends on the analysis of the keywords in the
extracted association rules through the co-occurrence of the keywords
in one sentence in the original text and the existing of the keywords
in one sentence without co-occurrence. Experiments applied on a
collection of scientific documents selected from MEDLINE that are
related to the outbreak of H5N1 avian influenza virus.