Abstract: On March 11, 2011, the Great East Japan Earthquake occurred off the coast of Sanriku, Japan. It is important to build a sustainable society through the reconstruction process rather than simply restoring the infrastructure. To compare the goals of the reconstruction plans of quake-stricken municipalities, Japanese morphological analysis was performed using text mining techniques. Frequently used nouns were sorted into four main categories: “life”, “disaster prevention”, “economy”, and “harmony with the environment”. Because Soma City was affected by the nuclear accident, sentences tagged “harmony with the environment” tended to be more frequent there than in the other municipalities. Results from cluster analysis and principal component analysis clearly indicated that the local government is reinforcing efforts to reduce risks from radiation exposure as a top priority.
Abstract: Chinese idioms are a type of traditional Chinese idiomatic expression with specific meanings and stereotyped structures, which were widely used in classical Chinese and are still common in vernacular written and spoken Chinese today. Currently, Chinese idioms are retrieved from glossaries by key character or key word through morphological or pronunciation indexes, which cannot meet the need for semantic search. OCIRS is proposed to find the desired idiom when users know only its meaning, without any key character or key word. The user's request, given as a sentence or phrase, is first analyzed grammatically through word segmentation, key word extraction, and semantic similarity computation, and can thus be mapped onto the idiom domain ontology, which is constructed to provide ample semantic relations and to facilitate description-logics-based reasoning for idiom retrieval. The experimental evaluation shows that OCIRS realizes the function of searching for idioms via their semantics, obtaining preliminary achievements as requested by the users.
Abstract: In this paper, we propose a method of resolving dependency ambiguities of Korean subordinate clauses based on Support Vector Machines (SVMs). Dependency analysis of clauses is well known to be one of the most difficult tasks in parsing sentences, especially in Korean. In order to solve this problem, we assume that the dependency relation of Korean subordinate clauses is the dependency relation among verb phrases, verbs, and endings in the clauses. As a result, the problem is represented as a binary classification task. In order to apply SVMs to this problem, we selected two kinds of features: static and dynamic features. The experimental results on the STEP2000 corpus show that our system achieves an accuracy of 73.5%.
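The binary-classification framing described above can be sketched as follows; the feature names, toy weights, and the hand-set linear scorer standing in for a trained SVM are all illustrative assumptions, not the paper's actual model.

```python
# Sketch: framing subordinate-clause dependency as binary classification.
# Feature names (static: endings/POS; dynamic: distance) are illustrative.

def featurize(sub_clause, candidate_head):
    """Static features (ending, head POS) plus a dynamic distance feature."""
    return {
        "ending=" + sub_clause["ending"]: 1.0,
        "head_pos=" + candidate_head["pos"]: 1.0,
        "distance": float(candidate_head["idx"] - sub_clause["idx"]),
    }

def predict(weights, feats):
    score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else 0  # 1 = attach to this head, 0 = do not

# Toy weights standing in for a trained SVM's linear decision function.
weights = {"ending=-ko": 0.8, "head_pos=VV": 0.5, "distance": -0.1}

sub = {"ending": "-ko", "idx": 1}
head = {"pos": "VV", "idx": 3}
print(predict(weights, featurize(sub, head)))  # prints 1 (attach)
```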
Abstract: Machine translation (MT) between Thai and English has been a challenging research topic in natural language processing. Most research has been done on English-to-Thai machine translation, but not the other way around. This paper presents a Thai-to-English machine translation system that translates a Thai sentence into an interlingua, a Thai LFG tree, using LFG grammar and a bottom-up parser. The Thai LFG tree is then transformed into the corresponding English LFG tree by pattern matching and node transformation. Finally, an equivalent English sentence is created using structural information prescribed by the English LFG tree. Experiments designed to evaluate the performance of the proposed system show that it is effective in providing useful translations from Thai to English.
Abstract: This work proposes an approach to automatic text summarization. The approach is a trainable summarizer that takes into account several features, including sentence position, positive keywords, negative keywords, sentence centrality, sentence resemblance to the title, sentence inclusion of named entities, sentence inclusion of numerical data, sentence relative length, the Bushy path of the sentence, and aggregated similarity, to generate summaries. First, we investigate the effect of each sentence feature on the summarization task. Then we use a score function over all features to train genetic algorithm (GA) and mathematical regression (MR) models to obtain a suitable combination of feature weights. The performance of the proposed approach is measured at several compression rates on a data corpus composed of 100 English religious articles. The results of the proposed approach are promising.
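The feature-weight combination described above can be sketched as a weighted sum per sentence; the feature values and weights below are illustrative stand-ins for the GA/MR-trained values.

```python
# Sketch of combining sentence features with learned weights. Feature
# names follow the abstract; the values and weights are illustrative.

def sentence_score(features, weights):
    """Weighted sum of normalized feature values for one sentence."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"position": 0.3, "title_resemblance": 0.4, "length": 0.1,
           "numerical_data": 0.2}
s1 = {"position": 1.0, "title_resemblance": 0.5, "length": 0.7,
      "numerical_data": 0.0}
s2 = {"position": 0.2, "title_resemblance": 0.1, "length": 0.4,
      "numerical_data": 1.0}

# Rank sentences; a summary keeps the top fraction for a given
# compression rate.
ranked = sorted([("s1", s1), ("s2", s2)],
                key=lambda p: sentence_score(p[1], weights), reverse=True)
print([name for name, _ in ranked])  # s1 scores ~0.57, s2 scores ~0.34
```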
Abstract: In this paper we focus on event extraction from Tamil news articles. The system utilizes a scoring scheme, based on feature scores and condition scores, for extracting and grouping event-specific sentences: events are extracted from each document, and event-specific sentences are then clustered across multiple documents using the same scheme. The proposed system builds an event template based on a user-specified query. The templates are filled with event-specific details such as person, location, and timeline extracted from the formed clusters. The system applies these methodologies to Tamil news articles that have been converted into UNL graphs using a Tamil-to-UNL EnConverter. The main intention of this work is to generate an event-based template.
Abstract: In this paper we present the first Arabic sentence dataset for online handwriting recognition written on a tablet PC. The dataset is natural, simple, and clear. Texts are sampled from daily newspapers, and to collect naturally written handwriting, the forms are dictated to the writers. The current version of our dataset includes 154 paragraphs written by 48 writers, mainly researchers from different research centers; it contains more than 3,800 words and more than 19,400 characters. In order to use this dataset in a recognition system, word extraction is needed, so this paper also presents a new word extraction technique based on the cursive nature of Arabic handwriting. The technique is applied to the dataset and good results are obtained; these results can serve as a benchmark for future research to compare against.
Abstract: XML data has a very flexible tree structure, which makes it difficult to support the storing and retrieving of XML data. The node numbering scheme is one of the most popular approaches to storing XML in relational databases. Together with the node numbering storage scheme, structural joins can be used to efficiently process the hierarchical relationships in XML. However, in order to process a tree-structured XPath query containing several hierarchical relationships and conditions on XML data, many structural joins need to be carried out, which results in a high query execution cost. This paper introduces mechanisms to rewrite XPath queries containing branch nodes into a much more efficient form with fewer structural joins. A two-step approach is proposed: the first step merges duplicate nodes in the tree-structured query, and the second step divides the query into sub-queries, shortens the paths, and then merges the sub-queries back together. The proposed approach contributes substantially to the efficient execution of XML queries. Experimental results show that the proposed scheme can reduce the query execution cost by up to an order of magnitude.
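The node numbering scheme this abstract builds on can be sketched with interval labels: each node stores a (start, end) range, ancestor/descendant tests reduce to range comparisons, and a structural join pairs nodes by that test. The node names and interval values below are illustrative.

```python
# Sketch of the node numbering (interval) scheme for XML: each node is
# labeled (start, end, level), so ancestor/descendant relationships
# become range checks, which structural joins exploit. Values are
# illustrative.

nodes = {
    "book":   (1, 10, 1),
    "title":  (2, 3, 2),
    "author": (4, 9, 2),
    "name":   (5, 6, 3),
}

def is_ancestor(a, d):
    (s1, e1, _), (s2, e2, _) = nodes[a], nodes[d]
    return s1 < s2 and e2 < e1

def structural_join(ancestors, descendants):
    """Pair every ancestor-tagged node with its descendant-tagged nodes."""
    return [(a, d) for a in ancestors for d in descendants
            if is_ancestor(a, d)]

print(structural_join(["book", "author"], ["name"]))
# [('book', 'name'), ('author', 'name')]
```

Each structural join in a query executes one such pass; the rewriting described in the abstract aims to cut the number of passes needed.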
Abstract: Summarizing skills have been introduced into the English syllabus in Malaysian secondary schools to evaluate students' comprehension of a given text; producing a summary requires students to employ several strategies. This paper reports on our effort to develop a computer-based summarization assessment system that detects the strategies students use in producing their summaries. Sentence decomposition of expert-written summaries is used to analyze how experts produce their summary sentences. From this analysis, we identified seven summarizing strategies and transformed them into a set of heuristic rules for determining which strategies were used. We developed an algorithm based on these heuristic rules and performed experiments to evaluate and support the proposed technique.
Abstract: In this article we present a methodology which enables preschool and primary school children without developed speech to remember words, phrases, and texts with the help of graphic signs: letters, syllables, and words. Reading thus becomes a support for the child's speech development. Teaching is based on the principle of moving from simple to complex: a letter, a syllable, a word, a sentence, a text. The availability of multi-level texts allows this methodology to be used with children who have different levels of speech development.
Abstract: The purposes of this study are 1) to study the frequent English writing errors of students registered in the course Reading and Writing English for Academic Purposes II, and 2) to find out the results of writing error correction using coded indirect corrective feedback and writing error treatments. The sample comprises 28 second-year English major students from the Faculty of Education, Suan Sunandha Rajabhat University. The tool for the experimental study is the lesson plan of the course, and the tool for data collection is 4 writing tests of short texts. The research findings disclose that the frequent English writing errors found in this course comprise 7 types of grammatical errors, namely sentence fragments, subject-verb agreement, wrong verb tense forms, singular or plural noun endings, run-on sentences, wrong verb pattern forms, and lack of parallel structure. Moreover, it is found that writing error correction using coded indirect corrective feedback and error treatment leads to an overall reduction of the frequent English writing errors and an increase in students' achievement in the writing of short texts, significant at the .05 level.
Abstract: This paper presents an approach for repairing word order errors in English text by reordering the words in a sentence and choosing the version that maximizes the number of trigram hits according to a language model. One possible way to reorder the words is to generate all permutations; the problem is that for a sentence of N words the number of permutations is N!. The novelty of this method lies in the use of an efficient confusion matrix technique for reordering the words, designed to reduce the search space among permuted sentences. The reduction of the search space is achieved using the statistical inference of N-grams. The results show that the number of permuted sentences can be reduced by 98.16%. For experimental purposes a test set of TOEFL sentences was used, and the results show that more than 95% of the sentences can be repaired using the proposed method.
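The trigram-hit objective can be sketched as follows; the toy trigram set is an illustrative stand-in for a real language model, and the exhaustive permutation search shown is precisely the N! baseline that the paper's confusion matrix technique is designed to prune.

```python
from itertools import permutations

# Sketch: rank candidate word orders by trigram hits against a language
# model. The trigram set is a toy stand-in; a real system would prune
# the permutation space before scoring, as the abstract describes.

TRIGRAMS = {("i", "like", "green"), ("like", "green", "tea")}

def trigram_hits(words):
    return sum(tuple(words[i:i + 3]) in TRIGRAMS
               for i in range(len(words) - 2))

def best_order(words):
    # Exhaustive N! search for illustration only (feasible for tiny N).
    return max(permutations(words), key=trigram_hits)

print(best_order(["green", "i", "tea", "like"]))
# ('i', 'like', 'green', 'tea'), the only order with 2 trigram hits
```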
Abstract: Selecting, from a set of target language words, the word translation that conveys the correct sense of the source word and makes the target language output more fluent is one of the core problems in machine translation. In this paper we compare three methods of estimating word translation probabilities for selecting the translation word in Thai-English machine translation: (1) a method based on the frequency of word translations, (2) a method based on the collocations of word translations, and (3) a method based on the Expectation Maximization (EM) algorithm. For evaluation we used Thai-English parallel sentences generated by NECTEC. The method based on the EM algorithm outperforms the other methods and gives satisfying results.
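The EM-based estimation (method 3) can be sketched in the style of IBM Model 1, which estimates word translation probabilities from sentence-aligned data; the two-sentence toy corpus below is an illustrative stand-in for the NECTEC parallel data, and this is a generic EM sketch rather than the paper's exact formulation.

```python
from collections import defaultdict

# Minimal IBM-Model-1-style EM sketch for estimating word translation
# probabilities t(e|f) from parallel sentences (toy corpus).

corpus = [(["maew"], ["cat"]), (["maew", "dam"], ["black", "cat"])]

t = defaultdict(lambda: 0.5)  # uniform initialization of t(e|f)
for _ in range(10):           # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for f_sent, e_sent in corpus:
        for e in e_sent:
            z = sum(t[(e, f)] for f in f_sent)  # E-step normalizer
            for f in f_sent:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    for (e, f), c in count.items():  # M-step: renormalize per source word
        t[(e, f)] = c / total[f]

print(t[("cat", "maew")])  # converges toward 1.0
```

The co-occurrence of "maew" with "cat" in both sentence pairs pulls probability mass onto that pair, which is the effect EM exploits to disambiguate translations.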
Abstract: Extracting thematic (semantic) roles is one of the major steps in representing text meaning. It refers to finding the semantic relations between a predicate and the syntactic constituents in a sentence. In this paper we present a rule-based approach to extracting semantic roles from Persian sentences. The system exploits a two-phase architecture to (1) identify the arguments and (2) label them for each predicate. For the first phase we developed a rule-based shallow parser to chunk Persian sentences, and for the second phase we developed a knowledge-based system to assign 16 selected thematic roles to the chunks. The experimental results of testing each phase are shown at the end of the paper.
Abstract: In this paper, a new adaptive Fourier decomposition (AFD) based time-frequency speech analysis approach is proposed. Given that the fundamental frequency of speech signals often fluctuates, classical short-time Fourier transform (STFT) based spectrogram analysis suffers from the difficulty of window size selection. AFD is a newly developed signal decomposition theory designed to deal with time-varying, non-stationary signals. Its outstanding characteristic is that it provides an instantaneous frequency for each decomposed component, which makes time-frequency analysis easier. Experiments are conducted on a sample sentence from the TIMIT Acoustic-Phonetic Continuous Speech Corpus. The results show that the AFD-based time-frequency distribution outperforms the STFT-based one.
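The window-size difficulty attributed to the STFT above can be illustrated with a pure-Python DFT on a synthetic tone: the frequency-bin width fs/win shrinks as the analysis window grows, at the cost of time resolution. This illustrates only the STFT trade-off, not AFD itself; the signal and parameters are illustrative.

```python
import cmath
import math

# Pure-Python DFT peak finder on one analysis frame (illustrative).
def dft_peak_bin(frame):
    n = len(frame)
    mags = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]
    return max(range(len(mags)), key=mags.__getitem__)

fs = 800  # sampling rate in Hz (toy value)
tone = [math.sin(2 * math.pi * 100 * t / fs) for t in range(fs)]  # 100 Hz

for win in (32, 128):  # short vs long analysis window
    k = dft_peak_bin(tone[:win])
    # Frequency-bin width fs/win: 25 Hz for win=32, 6.25 Hz for win=128,
    # so the longer window localizes frequency more finely but averages
    # over a longer stretch of time.
    print(win, k * fs / win)
```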
Abstract: Opinion extraction from customer reviews of products is becoming an interesting area of research. Customer reviews are nowadays available from blogs and review sites, and tools are being developed to extract opinions from these reviews to help both users and merchants track the most suitable choice of product. Efficient methods and techniques are therefore needed to extract opinions from reviews and blogs. As product reviews mostly contain discussion of features, functions, and services, efficient techniques are required to extract user comments about the desired features, functions, and services. In this paper we propose a novel idea for finding product features in user reviews in an efficient way. Our focus is on obtaining features and opinion-oriented words about products from text through the auxiliary verbs (AV) {is, was, are, were, has, have, had}. From the results of our experiments we found that 82% of features and 85% of opinion-oriented sentences include AVs. Thus these AVs are good indicators of features and opinion orientation in customer reviews.
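The auxiliary-verb cue can be sketched with a naive pattern that guesses a feature word just before the AV and an opinion word just after it; this is an illustrative simplification, not the paper's full extraction method.

```python
# Sketch of the auxiliary-verb cue: sentences built around one of the
# listed AVs often carry feature/opinion phrases ("The battery life is
# excellent"). The one-word-window extraction here is a naive stand-in.

AUX = {"is", "was", "are", "were", "has", "have", "had"}

def av_candidates(sentence):
    words = sentence.lower().replace(".", "").split()
    pairs = []
    for i, w in enumerate(words):
        if w in AUX and 0 < i < len(words) - 1:
            # naive guess: feature before the AV, opinion word after it
            pairs.append((words[i - 1], words[i + 1]))
    return pairs

print(av_candidates("The battery life is excellent."))
# [('life', 'excellent')]
```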
Abstract: This research uses computational linguistics, an area of study that employs a computer to process natural language, and aims at discerning the patterns that exist in declarative sentences used in technical texts. The approach is mathematical, and the focus is on instructional texts found on web pages. The technique, developed by the author and named the MAYA Semantic Technique, is organized into four stages. In the first stage, the parts of speech in each sentence are identified. In the second stage, the subject of the sentence is determined. In the third stage, MAYA performs a frequency analysis on the remaining words to determine the verb and its object. In the fourth stage, MAYA performs a statistical analysis to determine the content of the web page. The advantage of the MAYA Semantic Technique lies in its use of mathematical principles to represent grammatical operations, which improves processing and accuracy when applied to unambiguous text. The MAYA Semantic Technique is part of a proposed architecture for an entire web-based intelligent tutoring system. On a sample set of sentences, partial semantics derived using the MAYA Semantic Technique were approximately 80% accurate. The system currently processes technical text in one domain, namely Cµ programming, in which all the keywords and programming concepts are known and understood.
Abstract: Processing data by computers and performing reasoning tasks is an important aim in computer science, and the Semantic Web is one step towards it. The use of ontologies to semantically enrich information is the current trend. Huge amounts of domain-specific, unstructured online data need to be expressed in a machine-understandable and semantically searchable format. Currently, users are often forced to search manually through the results returned by keyword-based search services; they also want to use their native languages to express what they are searching for. In this paper, an ontology-based automated question answering system for the domain of software test documents is presented. The system allows users to enter a question about the domain in natural language and returns the exact answer to the question. Converting the natural language question into an ontology-based query is the challenging part of the system. To achieve this, a new algorithm for converting free text into an ontology-based search engine query is proposed. The algorithm is based on identifying the question type and parsing the words of the question sentence.
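The free-text-to-query conversion step might be sketched as follows; the mini ontology, the `st:` prefix, and the question-type rules are hypothetical stand-ins, not the paper's actual algorithm or vocabulary.

```python
# Sketch: map a free-text question to ontology terms plus a query
# variable derived from the question type. All labels/URIs are
# hypothetical stand-ins for a software-testing ontology.

ONTOLOGY = {
    "test case": "st:TestCase",
    "tester": "st:Tester",
    "executes": "st:executes",
}
QUESTION_TYPES = {"who": "?person", "what": "?thing"}

def to_query(question):
    q = question.lower().rstrip("?")
    # Question type picks the variable the answer should bind to.
    var = next((v for w, v in QUESTION_TYPES.items()
                if q.startswith(w)), "?x")
    # Ontology labels found in the question become query terms.
    concepts = [uri for label, uri in ONTOLOGY.items() if label in q]
    return var, concepts

print(to_query("Who executes the test case?"))
# ('?person', ['st:TestCase', 'st:executes'])
```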
Abstract: Information sharing and gathering are important in this era of rapid technological advancement. The existence of the WWW has caused a rapid explosion of information. Readers are overloaded with too many lengthy text documents when they are more interested in shorter versions, and the oil and gas industry cannot escape this predicament. In this paper, we develop an automated text summarization system known as AutoTextSumm to extract the salient points of oil and gas drilling articles by incorporating a statistical approach, keyword identification, synonym words, and sentence position. In this study, we conducted interviews with petroleum engineering experts and English language experts to identify the most commonly used keywords in the oil and gas drilling domain. The performance of AutoTextSumm is evaluated using the formulae for precision, recall, and F-score. Based on the experimental results, AutoTextSumm produces satisfactory performance with an F-score of 0.81.
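The precision/recall/F-score evaluation named in the abstract above can be sketched over sets of selected sentences; the sentence IDs below are toy stand-ins, not data from the study.

```python
# Sketch of summary evaluation: compare system-selected sentence IDs
# against reference-selected ones (toy IDs).

def prf(system, reference):
    system, reference = set(system), set(reference)
    tp = len(system & reference)          # sentences both selected
    p = tp / len(system)                  # precision
    r = tp / len(reference)               # recall
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
    return p, r, f

p, r, f = prf(system={1, 2, 3, 5}, reference={1, 2, 4, 5})
print(round(p, 2), round(r, 2), round(f, 2))  # 0.75 0.75 0.75
```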
Abstract: Machine translation (hereafter referred to as "MT") has faced many complex problems since its inception, and extracting multiword expressions is one of them. Finding multiword expressions while translating a sentence from English into Urdu through existing solutions takes a lot of time and occupies system resources. We have designed a simple relational data approach, in which we set a bit in the dictionary (database) for each multiword entry, to find and handle multiword expressions. This approach handles multiwords efficiently.
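The "bit in the dictionary" idea can be sketched as follows; the lexicon entries and the greedy longest-match lookup are illustrative assumptions about how such a flag might be used, not the paper's actual schema.

```python
# Sketch: a flag on each dictionary entry marks words that may start a
# multiword expression, so lookup only attempts a multi-token match
# when the bit is set. Entries are illustrative.

LEXICON = {
    "kick": {"multiword_bit": 1},           # may start "kick the bucket"
    "kick the bucket": {"multiword_bit": 0},
    "bucket": {"multiword_bit": 0},
}

def find_multiwords(words, max_len=3):
    found, i = [], 0
    while i < len(words):
        if LEXICON.get(words[i], {}).get("multiword_bit"):
            # bit is set: try the longest dictionary match from here
            for n in range(max_len, 1, -1):
                cand = " ".join(words[i:i + n])
                if cand in LEXICON:
                    found.append(cand)
                    i += n
                    break
            else:
                i += 1
        else:
            i += 1
    return found

print(find_multiwords("he will kick the bucket soon".split()))
# ['kick the bucket']
```

Because most tokens fail the single bit test immediately, the expensive multi-token lookups are skipped for them, which is the efficiency gain the approach claims.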