Abstract: Twitter is one of the most popular social media platforms where users share their opinions on different subjects. It is a rich source for text mining because of the high volume of data generated on the platform daily. Many industries, such as telecommunications, can leverage the availability of Twitter data to better understand their markets and make appropriate business decisions. This study performs topic modeling on Twitter data using Latent Dirichlet Allocation (LDA). The results are benchmarked against another topic modeling technique, Latent Semantic Indexing (LSI). The study aims to retrieve topics from a Twitter dataset containing user tweets about South African telcos. Results show that LSI is much faster than LDA. However, LDA yields better results, with topic coherence 8% higher for the best-performing model in this experiment. A higher topic coherence score indicates better model performance.
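The abstract does not say which coherence measure was used. As a minimal sketch, assuming the UMass document co-occurrence measure, the idea that "higher coherence is better" can be illustrated as follows. The function name `umass_coherence` and the toy tokenized tweets are hypothetical and are not taken from the study.

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence: sum over ordered word pairs of
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents.
    top_words is assumed to be ordered from most to least probable."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # word never occurs in the corpus; skip the pair
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score

# Hypothetical tokenized tweets (illustration only).
tweets = [
    ["network", "signal", "down"],
    ["network", "signal", "slow"],
    ["data", "bundle", "price"],
    ["data", "price", "expensive"],
]

coherent_topic = ["network", "signal"]   # words that co-occur
mixed_topic = ["network", "price"]       # words that never co-occur

print(umass_coherence(coherent_topic, tweets))
print(umass_coherence(mixed_topic, tweets))
```

Under this measure, a topic whose top words tend to appear in the same documents scores higher than one whose words do not, which is the sense in which two models such as LDA and LSI can be compared.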
Abstract: Ontologies and various semantic repositories have become a convenient approach for implementing model-driven architectures of distributed systems on the Web. SPARQL is the standard query language for querying such repositories. However, although SPARQL is a well-established standard for querying semantic repositories in RDF and OWL format, and there are commonly used APIs that support it, such as Jena for Java, a parallel option is not incorporated in them. This article presents a complete framework consisting of an object algebra for parallel RDF and an index-based implementation of a parallel query engine capable of dealing with distributed RDF ontologies that share a common vocabulary. It has been implemented in Java and, to validate the algorithms, has been applied to the problem of organizing virtual exhibitions on the Web.
Abstract: The growth in the volume of text data such as books
and articles in libraries over the centuries has made it necessary to
establish effective mechanisms to locate them. Early techniques such as
abstraction, indexing and the use of classification categories have
marked the birth of a new field of research called "Information
Retrieval". Information Retrieval (IR) can be defined as the task of
defining models and systems whose purpose is to facilitate access to
a set of documents in electronic form (corpus) to allow a user to find
those relevant to them, that is to say, the content that matches
the user's information needs. This paper presents a new
approach to the semantic indexing of a document corpus. The indexing
process starts with a term-weighting phase that determines the
importance of the terms in the documents. Then the use of a
thesaurus such as WordNet allows moving to the conceptual level.
Each candidate concept is evaluated by determining its level of
representation of the document, that is to say, the importance of the
concept in relation to other concepts of the document. Finally, the
semantic index is constructed by attaching to each concept of the
ontology the documents of the corpus in which that concept is
found.
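The pipeline described in this abstract (term weighting, mapping terms to concepts, scoring each concept within its document, then attaching documents to concepts) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' method: the `THESAURUS` dictionary is a hypothetical stand-in for WordNet, TF-IDF with a smoothed IDF is assumed as the weighting scheme, and a concept's score is assumed to be its share of the document's total concept weight.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-thesaurus standing in for WordNet: term -> concept.
THESAURUS = {
    "car": "vehicle", "truck": "vehicle", "automobile": "vehicle",
    "dog": "animal", "cat": "animal",
}

def tf_idf(docs):
    """Step 1: weight each term per document (assumed TF-IDF, smoothed IDF)."""
    n_docs = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            t: (count / len(tokens)) * math.log(1 + n_docs / df[t])
            for t, count in tf.items()
        }
    return weights

def concept_scores(term_weights):
    """Steps 2-3: map terms to concepts, then score each concept by its
    share of the total concept weight in the document (an assumption)."""
    raw = defaultdict(float)
    for term, weight in term_weights.items():
        concept = THESAURUS.get(term)
        if concept is not None:
            raw[concept] += weight
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()} if total else {}

def build_semantic_index(docs, min_score=0.0):
    """Step 4: attach to each concept the documents in which it is found."""
    index = defaultdict(dict)
    for doc_id, term_weights in tf_idf(docs).items():
        for concept, score in concept_scores(term_weights).items():
            if score > min_score:
                index[concept][doc_id] = score
    return dict(index)

# Hypothetical tokenized corpus (illustration only).
corpus = {
    "d1": ["car", "truck", "road"],
    "d2": ["dog", "cat", "car"],
    "d3": ["dog", "park"],
}
semantic_index = build_semantic_index(corpus)
for concept, postings in sorted(semantic_index.items()):
    print(concept, sorted(postings))
```

Raising `min_score` would keep only concepts that dominate a document, which is one simple way to act on the "level of representation" the abstract mentions.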
Abstract: With the advance of multimedia and diagnostic
imaging technologies, the number of radiographic images is increasing
constantly. The medical field demands sophisticated systems for
search and retrieval of the multimedia documents produced. This
paper presents ongoing research that focuses on the semantic
content of radiographic image documents to support a semantic-based
radiographic image indexing and retrieval system. The proposed
model would divide a radiographic image document based on its
semantic content and convert it into a logical structure or
a semantic structure. The logical structure represents the overall
organization of information. The semantic structure, which is bound
to the logical structure, is composed of semantic objects with
interrelationships in the various spaces in the radiographic image.
Abstract: Word sense disambiguation is one of the most important open problems in natural language processing applications such as information retrieval and machine translation. Several strategies can be employed to resolve word ambiguity with a reasonable degree of accuracy: knowledge-based, corpus-based, and hybrid. This paper focuses on the corpus-based strategy, which employs an unsupervised learning method for disambiguation. We report our investigation of applying Latent Semantic Indexing (LSI), an unsupervised information retrieval technique, to the task of Thai noun and verb word sense disambiguation. LSI has been shown to be efficient and effective for information retrieval. For the purposes of this research, we report experiments on two Thai polysemous words, /hua4/ and /kep1/, which are used as representatives of Thai nouns and verbs, respectively. The results of these experiments demonstrate the effectiveness and indicate the potential of applying vector-based distributional information measures to semantic disambiguation.
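To make the idea concrete, the following is a minimal sketch of LSI applied to context-based disambiguation. It does not reproduce the paper's Thai data or experimental setup: the contexts are hypothetical English stand-ins for the surroundings of an ambiguous word, NumPy is assumed to be available, and the rank `k = 2` is chosen only because the toy data has two senses.

```python
import numpy as np

# Hypothetical context words around an ambiguous word (illustration only).
contexts = [
    ["river", "water"],                       # sense A
    ["river", "water", "shore"],              # sense A
    ["money", "loan"],                        # sense B
    ["money", "loan", "deposit", "interest"], # sense B
]

vocab = sorted({t for c in contexts for t in c})
term_index = {t: i for i, t in enumerate(vocab)}

def to_vector(tokens):
    """Bag-of-words count vector over the vocabulary."""
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in term_index:
            v[term_index[t]] += 1.0
    return v

# Term-by-context matrix and its rank-k truncated SVD (the LSI step).
A = np.column_stack([to_vector(c) for c in contexts])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]

def project(tokens):
    """Map a context into the k-dimensional latent space."""
    return U_k.T @ to_vector(tokens)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

context_vecs = [project(c) for c in contexts]

def nearest_context(tokens):
    """Return the index of the most similar known context and all similarities."""
    q = project(tokens)
    sims = [cosine(q, c) for c in context_vecs]
    return int(np.argmax(sims)), sims

# A new occurrence whose context shares no exact term set with any known one.
best, sims = nearest_context(["water", "shore"])
print(best, [round(x, 3) for x in sims])
```

In the latent space, contexts that share vocabulary patterns end up close together, so an unseen context can be assigned to the group of occurrences it most resembles without any sense labels.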
Abstract: As a popular rank-reduced vector space approach,
Latent Semantic Indexing (LSI) has been used in information
retrieval and other applications. In this paper, an LSI-based content
vector model for text classification is presented, which constructs
multiple augmented category LSI spaces and classifies texts by their
content. The model integrates the class discriminative information
from the training data and is equipped with several pertinent feature
selection and text classification algorithms. The proposed classifier
has been applied to email classification, and experiments on a
benchmark spam testing corpus (PU1) have shown that the approach
represents a competitive alternative to other email classifiers based
on the well-known SVM and naïve Bayes algorithms.