Abstract: Centroid terms are single words that semantically and
topically characterise text documents and so may serve as their
very compact representation in automatic text processing. In the
present paper, centroids are used to measure the relevance of text
documents with respect to a given search query. Thus, a new graph-based
paradigm for searching texts in large corpora is proposed
and evaluated against keyword-based methods. The first, promising
experimental results demonstrate the usefulness of the centroid-based
search procedure. It is shown that especially the routing of search
queries in interactive and decentralised search systems can be greatly
improved by applying this approach. A detailed discussion on further
fields of its application completes this contribution.
Abstract: The internet keeps growing and has become the most popular platform for people to share their opinions on a wide range of interests. We focus on the education domain and compare several Malaysian universities against each other. The comparison produces a benchmark based on criteria that online users discuss in various online resources, including Twitter, Facebook and web pages. It is carried out with an opinion mining framework that extracts and processes the unstructured text and classifies the result as positive, negative or neutral (polarity). We divide the framework into three main stages: opinion collection (extraction), unstructured text processing and polarity classification. The extraction stage includes web crawling, HTML parsing, sentence segmentation for punctuation classification and Part of Speech (POS) tagging. The second stage processes the unstructured text with stemming and stop-word removal and finally prepares the raw text for classification using Named Entity Recognition (NER). The last stage classifies the polarity and presents the overall result of the comparison among the Malaysian universities. The final result is useful for those interested in studying in Malaysia, as the output declares clear winners based on public opinion across the web.
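The second and third stages can be illustrated with a minimal Python sketch. It is not the framework from the abstract: the stop-word list, the two opinion lexicons and the helper names `preprocess` and `polarity` are all assumptions made for illustration, and a simple lexicon count stands in for whatever classifier the authors actually used.

```python
import re

# Hypothetical word lists, chosen only for this illustration.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "but", "are"}
POSITIVE = {"good", "great", "excellent", "helpful"}
NEGATIVE = {"bad", "poor", "expensive", "crowded"}

def preprocess(sentence):
    """Lowercase, tokenise and drop stop words (stemming is omitted here)."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def polarity(sentence):
    """Label a sentence by counting positive and negative lexicon hits."""
    tokens = preprocess(sentence)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Aggregating such labels per university would give one possible basis for the comparison, although the abstract does not say how the final ranking is computed.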
Abstract: Parallel text alignment is proposed as a way of aligning bahasa Indonesia with words in Javanese. Since a one-to-one word translator cannot translate the pragmatic aspects of Javanese, the parallel text alignment model described here uses a phrase pair combination. The algorithm aligns the parallel text automatically from the beginning to the end of each sentence. Although the phrase pair combination outperforms the previous algorithm, it is still inefficient: recording all possible combinations consumes more database space and is time-consuming. The original algorithm is therefore modified by applying the edit distance coefficient to improve data-storage efficiency. As a result, data-storage consumption is reduced by 90%, and the learning period is reduced as well (42 s).
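The abstract does not define its edit distance coefficient, so the following Python sketch only shows the standard Levenshtein distance it presumably builds on. The `similarity` helper, which normalises the distance to the range 0 to 1, is an assumption added for illustration and is not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Hypothetical coefficient: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

A threshold on such a coefficient could be used to avoid storing near-duplicate phrase pairs, which is one plausible reading of how the storage saving arises.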
Abstract: The internet is one of the major sources of information for
people in almost all fields of life. The major language used to
publish information on the internet is English. This becomes a
problem in a country like Pakistan, where Urdu is the national
language. Only 10% of the population of Pakistan can understand
English. As a result, millions of people are deprived of the valuable
information available on the internet. This paper presents a system for
translation from English to Urdu. A module, LESSA, uses a rule-based
algorithm to read the input text in English, understand it and
translate it into Urdu. The designed approach was further incorporated
to translate a complete website from English to Urdu. An option
appears in the browser to translate the webpage in a new window. The
designed system will help millions of internet users to benefit from
the internet and to access the latest information and knowledge
posted daily on the internet.
Abstract: All text processing systems allow their users to
search for a string pattern in a given text. String matching is
fundamental to database and text processing applications. Every text
editor must contain a mechanism to search the current document for
arbitrary strings. Spelling checkers scan an input text for words in the
dictionary and reject any strings that do not match. We store our
information in databases so that we can retrieve it later,
and this retrieval can be done using various string matching
algorithms. This paper describes a new string matching algorithm
for various applications. The new algorithm has been designed with the
help of the Rabin-Karp matcher to improve the string matching process.
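For orientation, here is a minimal Python sketch of the classic Rabin-Karp matcher that the abstract builds on. It is not the new algorithm, which the abstract does not specify. The function name `rabin_karp` and the default `base` and `modulus` values are assumptions chosen for illustration.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character in a window: base^(m-1) mod modulus.
    high = pow(base, m - 1, modulus)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus
    matches = []
    for s in range(n - m + 1):
        # Compare characters only when the hashes agree, to rule out collisions.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # Roll the window: drop text[s] and append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % modulus
    return matches
```

The rolling hash lets each window be checked in constant time on average, so the expected running time is linear in the combined length of text and pattern, while the explicit string comparison keeps the result correct when two different windows share a hash.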
Abstract: Digital news on a variety of topics is abundant on the
internet. The problem is to classify news into its appropriate
category so that users can find relevant news rapidly. A classifier
engine is used to assign each news item automatically to its respective
category. This research employs a Support Vector Machine (SVM) to
classify Indonesian news. SVM is a robust method for binary
classification. The core of SVM is the construction of an
optimal separating hyperplane between the different classes. For
multiclass problems, a mechanism called one-against-one is used to
combine the binary classification results. Documents were taken
from the Indonesian digital news site www.kompas.com. The
experiment showed a promising result, with an accuracy rate of 85%.
The system is feasible to implement for Indonesian news
classification.
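The one-against-one mechanism can be sketched in Python as majority voting over all pairs of classes. In this sketch the binary deciders are toy keyword counters standing in for trained SVMs, and the category names, the keyword sets and the helper names `make_toy_classifier` and `one_against_one_predict` are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical categories and keywords, used only to build toy deciders.
KEYWORDS = {
    "sport": {"goal", "match"},
    "economy": {"market", "bank"},
    "politics": {"election", "vote"},
}

def make_toy_classifier(a, b):
    """Stand-in for a binary SVM trained to separate class a from class b."""
    def classify(tokens):
        score_a = sum(t in KEYWORDS[a] for t in tokens)
        score_b = sum(t in KEYWORDS[b] for t in tokens)
        return a if score_a >= score_b else b
    return classify

def one_against_one_predict(classes, binary_classifiers, tokens):
    """Let every pairwise classifier vote and return the most voted class."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[binary_classifiers[(a, b)](tokens)] += 1
    # Ties are broken in favour of the class listed first in `classes`.
    return max(classes, key=lambda c: votes[c])

CLASSES = list(KEYWORDS)
CLASSIFIERS = {(a, b): make_toy_classifier(a, b)
               for a, b in combinations(CLASSES, 2)}
```

For k categories this scheme needs k(k-1)/2 binary classifiers, each trained only on the documents of its two classes.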