Abstract: Traditional document representation for classification
follows the Bag of Words (BoW) approach to represent term weights.
The conventional method uses the Vector Space Model (VSM) to
exploit the statistical information of terms in the documents, but it
fails to address the semantic information as well as the order of the
terms present in the documents. The phrase based approach, in turn,
follows the order of the terms present in the documents but not the
semantics behind the words. Therefore, a semantic concept based
approach is used in this paper to enhance the semantics by
incorporating ontology information. In this paper, a novel method
is proposed to forecast the intraday stock market price directional
movement based on the sentiments from Twitter and Moneycontrol
news articles. Stock market forecasting is a very difficult and
highly complicated task because it is affected by many factors such
as economic conditions, political events and investor sentiment.
Stock market series are generally dynamic, nonparametric, noisy
and chaotic by nature. Sentiment analysis along with the wisdom of
crowds can automatically compute the collective intelligence about
future performance in many areas such as the stock market, box office
sales and election outcomes. The proposed method utilizes collective
sentiments about the stock market to predict stock price directional
movements. A Granger causality test shows that the collective
sentiments in the above media have strong predictive power for the
up/down direction of stock price movements.
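The BoW limitation described at the start of this abstract can be illustrated with a minimal sketch. It is not the paper's method: the helper names `bow` and `bigrams` and the two sample sentences are assumptions made only for this example. It shows that two texts with the same words in a different order receive identical BoW representations, while a simple phrase (bigram) view tells them apart.

```python
from collections import Counter


def bow(text):
    """Return a bag-of-words term-frequency mapping for a text."""
    return Counter(text.lower().split())


def bigrams(text):
    """Return the set of adjacent word pairs, which keeps local word order."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


# Hypothetical sample documents: same words, different order.
doc_a = "the bank raised the rate"
doc_b = "the rate raised the bank"

# BoW only counts terms, so the two representations are identical.
same_bow = bow(doc_a) == bow(doc_b)

# The bigram view differs, e.g. ("bank", "raised") vs ("rate", "raised").
same_bigrams = bigrams(doc_a) == bigrams(doc_b)

print(same_bow, same_bigrams)
```

Neither view captures meaning, which is the gap the semantic concept based approach is said to address.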
Abstract: Information retrieval has become an important field of study and research in computer science due to the explosive growth of information available in the form of full text, hypertext, administrative text, directory, numeric or bibliographic text. Research is ongoing on various aspects of information retrieval systems in order to improve their efficiency and reliability. This paper presents a comprehensive study that discusses not only the emergence and evolution of information retrieval but also different information retrieval models and some important aspects such as document representation, similarity measures and query expansion.
Abstract: Text similarity measurement is a fundamental issue in
many textual applications such as document clustering, classification,
summarization and question answering. However, prevailing approaches
based on Vector Space Model (VSM) more or less suffer
from the limitation of Bag of Words (BOW), which ignores the semantic
relationship among words. Enriching document representation
with background knowledge from Wikipedia is proven to be an effective
way to solve this problem, but most existing methods still
cannot avoid similar flaws of BOW in a new vector space. In this
paper, we propose a novel text similarity measurement which goes
beyond VSM and can find semantic affinity between documents.
Specifically, it is a unified graph model that exploits Wikipedia as
background knowledge and synthesizes both document representation
and similarity computation. The experimental results on two different
datasets show that our approach significantly improves on VSM-based
methods in both text clustering and classification.
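The BOW limitation mentioned in this abstract can be made concrete with a small sketch of VSM cosine similarity over raw term frequencies. This is only an illustration of the baseline being criticized, not the proposed graph model; the function names and the sample documents are assumptions for this example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a document as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents: doc_1 and doc_2 are close in meaning
# but share no surface terms; doc_3 shares terms with doc_1.
doc_1 = "physician treats patient"
doc_2 = "doctor cures sick"
doc_3 = "physician treats patient quickly"

sim_synonyms = cosine(tf_vector(doc_1), tf_vector(doc_2))
sim_overlap = cosine(tf_vector(doc_1), tf_vector(doc_3))

print(sim_synonyms, sim_overlap)
```

Because the first two documents have no terms in common, their similarity is zero despite the related meaning, which is the kind of gap that background knowledge is meant to close.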
Abstract: In the text categorization problem, the most widely used
method for document representation is based on word frequency vectors,
called the VSM (Vector Space Model). This representation is based only
on the words in the documents and therefore loses any “word context”
information found in the document. In this article we make a
comparison between the classical method of document representation
and a method called Suffix Tree Document Model (STDM) that is
based on representing documents in the suffix tree format. For the
STDM model we propose a new approach for document
representation and a new formula for computing the similarity
between two documents. Specifically, we propose to build the suffix
tree for only two documents at a time. This approach is faster, has
lower memory consumption and uses the entire document representation
without using methods for disposing of nodes. For this method we also
propose a formula for computing the similarity between documents,
which substantially improves the clustering quality. This
representation method was validated using Hierarchical
Agglomerative Clustering (HAC). In this context we also examine the
influence of stemming in the document preprocessing step and highlight
the difference between similarity and dissimilarity measures for
finding “closer” documents.
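The idea of comparing two documents at a time by their shared word sequences can be sketched as follows. This is a simplified stand-in, not the STDM formula from the paper, which the abstract does not give: instead of building a suffix tree, it enumerates contiguous word sequences up to an assumed maximum length `max_len` and scores the pair with a Jaccard overlap. All names and sample documents are hypothetical.

```python
def word_ngrams(text, max_len=3):
    """Collect all contiguous word sequences of length 1..max_len."""
    tokens = text.lower().split()
    grams = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def phrase_similarity(doc_a, doc_b, max_len=3):
    """Jaccard overlap of the word sequences shared by two documents."""
    a = word_ngrams(doc_a, max_len)
    b = word_ngrams(doc_b, max_len)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical documents, processed strictly pairwise.
d1 = "cat ate cheese"
d2 = "mouse ate cheese too"
d3 = "cheese ate cat"

print(phrase_similarity(d1, d2))  # 0.25: shares "ate", "cheese", "ate cheese"
print(phrase_similarity(d1, d3))  # about 0.33: same words, no shared phrase
```

A plain word frequency comparison would treat `d1` and `d3` as identical, whereas the phrase based score stays well below 1 because no multi-word sequence is shared.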
Abstract: Most existing text mining approaches are
proposed with the transaction database model in mind. Thus, the
mined dataset is structured using just one concept, the “transaction”,
whereas the whole dataset is modeled using the “set” abstract type. In
such cases, the structure of the whole dataset and the relationships
among the transactions themselves are not modeled and,
consequently, not considered in the mining process.
We believe that taking into account the structural properties of
hierarchically structured information (e.g. textual documents)
in the mining process can lead to better results. For this purpose, a
hierarchical association rule mining approach for textual documents
is proposed in this paper, and the classical set-oriented mining
approach is reconsidered in favor of a Directed Acyclic Graph (DAG)
oriented approach. Natural language processing techniques are used
in order to obtain the DAG structure. Based on this graph model, a
hierarchical bottom-up algorithm is proposed. The main idea is that
each node is mined with its parent node.
Abstract: Nowadays, organizing a repository of documents and
resources for learning in a specialized field such as Information
Technology (IT), together with search techniques based on domain
knowledge or document content, is an urgent need in the practice of
teaching, learning and research. There have been several works related
to methods of organization and search by content. However, the results
are still limited and insufficient to meet users' demand for semantic
document retrieval. This paper presents a solution for the
organization of a repository that supports semantic representation and
processing in search. The proposed solution is a model which
integrates components such as an ontology describing domain
knowledge, a database of document repository, semantic
representation for documents and a file system; with problems,
semantic processing techniques and advanced search techniques
based on measuring semantic similarity. The solution is applied to
build an IT learning materials management system of a university with
a semantic search function serving students, teachers and managers as
well. The application has been implemented and tested at the University
of Information Technology, Ho Chi Minh City, Vietnam, and has
achieved good results.