Abstract: In language learning, second language learners as well
as native speakers commit errors in their attempts to achieve
competence in the target language. The realm of collocation has to do
with meaning relation between lexical items. In all human language,
there is a kind of ‘natural order’ in which words are arranged or relate
to one another in sentences so much so that when a word occurs in a
given context, the related or naturally co-occurring word will
automatically come to the mind. It becomes an error, therefore, if
students inappropriately pair or arrange such ‘naturally’ co-occurring
lexical items in a text. It has been observed that most of the second
language learners in this research group commit collocation errors. A
study of this kind is very significant as it gives insight into the kinds
of errors committed by learners. This will help the language teacher
to be able to identify the sources and causes of such errors as well as
correct them thereby guiding, helping and leading the learners
towards achieving some level of competence in the language. The
aim of the study is to understand the nature of these errors as
stumbling blocks to effective essay writing. The objective of the
study is to identify the errors, analyze their structural compositions so
as to determine whether there are similarities between students in this
regard and to find out whether there are patterns to these kinds of
errors which will enable the researcher to understand their sources
and causes. As a descriptive study, the researcher sampled nine
hundred essays collected from three hundred undergraduate learners of
English as a second language at the Federal College of Education,
Kano, North-West Nigeria, i.e. three essays per student. The essays,
written during lecture hours on three different occasions, were of
similar thematic preoccupations (i.e. the same topics) and length
(i.e. the same number of words). The errors were identified
systematically: each error was recorded only once, even if it
occurred several times in a student’s essays. The data were collated
using percentages, with the identified numbers of occurrences
converted accordingly.
The findings from the study indicate that there are similarities, as
well as regular and repeated errors that form a pattern. Based on the
pattern identified, the conclusion is that students’ collocation errors
are attributable to poor teaching and learning, which results in the
wrong generalization of rules.
Abstract: The aim of the study is to compare behavioral and
EEG reactions in Turkic-speaking inhabitants of Siberia (Tuvinians
and Yakuts) and Russians during the recognition of syntax errors in
native and foreign languages. Sixty-three healthy indigenous inhabitants of the
Tyva Republic, 29 inhabitants of the Sakha (Yakutia) Republic, and
55 Russians from Novosibirsk participated in the study. EEGs were
recorded during the execution of an error-recognition task in Russian
and in English (in all participants) and in the native languages (for
the Tuvinian and Yakut Turkic-speaking inhabitants). Reaction time (RT)
and quality of task execution were chosen as behavioral measures.
Amplitude and cortical distribution of P300 and P600 peaks of ERP
were used as a measure of speech-related brain activity. In Tuvinians,
there were no differences in the P300 and P600 amplitudes as well as
in cortical topology for Russian and Tuvinian languages, but there
was a difference for English. In Yakuts, the P300 and P600
amplitudes and topology of ERP for Russian language were the same
as Russians had for their native language. In Yakuts, brain reactions
during Yakut and English language comprehension did not differ, while
Russian language comprehension differed from both Yakut and English.
We found that the Tuvinians recognized both Russian and
Tuvinian as native languages, and English as a foreign language. The
Yakuts recognized both English and Yakut as foreign languages, but
Russian as a native language. According to the questionnaire, both
Tuvinians and Yakuts use their national language as a spoken
language but do not use it for writing. This may well be the reason
that Yakuts perceive written Yakut as a foreign language while
perceiving written Russian as native.
Abstract: With the rapid growth of the Internet, web opinion
sources are dynamically emerging that are useful to both potential
customers and product manufacturers for prediction and decision-making
purposes. These are user-generated contents written in natural
language as unstructured free text. Therefore, opinion
mining techniques become popular to automatically process customer
reviews for extracting product features and user opinions expressed
over them. Since customer reviews may contain both opinionated and
factual sentences, a supervised machine learning technique is applied
for subjectivity classification to improve the mining performance. In
this paper, we dedicate our work to the task of opinion
summarization. Product feature and opinion extraction is critical to
opinion summarization, because its effectiveness significantly affects
the identification of semantic relationships. The polarity and numeric
score of all the features are determined by the SentiWordNet lexicon.
The problem of opinion summarization refers to how to relate the
opinion words to a certain feature. A probabilistic model of
supervised learning improves the result and is more flexible and
effective.
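The lexicon-based scoring step described above can be sketched minimally as follows. The tiny `lexicon` dictionary stands in for SentiWordNet (the scores below are invented for illustration, not real SentiWordNet values), and `summarize` aggregates signed scores per product feature:

```python
# opinion word -> (positive score, negative score), the shape of entry
# SentiWordNet provides; these particular values are invented.
lexicon = {
    "great": (0.75, 0.0),
    "poor": (0.0, 0.625),
    "sharp": (0.5, 0.125),
}

def polarity(opinion_word):
    """Signed numeric score: positive minus negative lexicon score."""
    pos, neg = lexicon.get(opinion_word, (0.0, 0.0))
    return pos - neg

def summarize(feature_opinions):
    """Aggregate signed opinion scores per product feature."""
    summary = {}
    for feature, opinion in feature_opinions:
        summary[feature] = summary.get(feature, 0.0) + polarity(opinion)
    return summary

pairs = [("battery", "great"), ("screen", "sharp"), ("battery", "poor")]
print(summarize(pairs))  # {'battery': 0.125, 'screen': 0.375}
```

A real system would look the scores up in SentiWordNet (e.g. via NLTK's corpus reader) after word-sense disambiguation; the aggregation logic stays the same.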
Abstract: Code-mixing in spontaneous speech has been widely
discussed, but not in virtual situations, especially in the context of
third-language learning students. Thus, this study is an attempt to
explore the linguistic characteristics of the mixing of Japanese,
English and Thai in a mobile Line chat room by students with their
background of English as L2, Japanese as L3 and Thai as mother
tongue. The results show that insertion of Thai content words is a very
common linguistic phenomenon embedded with the other two
languages in the sentences. As chatting tends to be ‘relational’ or
‘interactional’, it shifted the style of lexical choices toward
speech-like, more personal and emotionally related language. A personal pronoun in
Japanese is often mixed into the sentences. The Japanese
sentence-final question particle か “ka” was added to the end of the
sentence based on Thai grammar rules. Some unique characteristics
were created while chatting.
Abstract: In this paper, Fuzzy C-Means clustering with
Expectation Maximization-Gaussian Mixture Model based hybrid
modeling algorithm is proposed for Continuous Tamil Speech
Recognition. Speech sentences from various speakers are used for the
training and testing phases, and objective measures are compared
between the proposed and existing Continuous Speech Recognition algorithms.
From the simulation results, it is observed that the proposed algorithm
improves the recognition accuracy and F-measure by up to 3% compared
to the existing algorithms for speech signals from various speakers.
In addition, it reduces the Word Error Rate, Error Rate and Error by
up to 4% compared to the existing
algorithms. In all aspects, the proposed hybrid modeling for Tamil
speech recognition provides significant improvements for
speech-to-text conversion in various applications.
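The fuzzy clustering half of the proposed hybrid can be illustrated with a minimal one-dimensional Fuzzy C-Means sketch. The EM-GMM component and the acoustic feature extraction are not reproduced here, and the data points are invented; this shows only the classic alternating membership/center updates:

```python
def fuzzy_c_means(xs, c=2, m=2.0, iters=50):
    """Classic 1-D FCM: alternate membership and center updates."""
    centers = [min(xs), max(xs)]  # deterministic initialization for c=2
    u = []
    for _ in range(iters):
        # membership update: u[k][i] is the degree of point k in cluster i
        u = []
        for x in xs:
            dists = [abs(x - ci) or 1e-12 for ci in centers]  # avoid /0
            u.append([
                1.0 / sum((dists[i] / dists[j]) ** (2.0 / (m - 1.0))
                          for j in range(c))
                for i in range(c)
            ])
        # center update: mean of points weighted by membership^m
        centers = [
            sum(u[k][i] ** m * xs[k] for k in range(len(xs)))
            / sum(u[k][i] ** m for k in range(len(xs)))
            for i in range(c)
        ]
    return centers, u

data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centers, memberships = fuzzy_c_means(data)
print(sorted(centers))  # two centers, near 1.0 and 8.1
```

In a speech front end the scalars would be replaced by MFCC-style feature vectors and a vector distance, but the update equations keep this form.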
Abstract: This paper aims to build an Arabic language learning tool using Flash CS4 Professional software with the ActionScript 3.0 programming language, based on Computer Aided Language Learning (CALL) material. An additional intention is to provide a primary tool focused on teaching Arabic as a second language to adults. It contains letters, words and sentences at the first stage. This includes interactive practices, which evaluate learners’ comprehension of the Arabic language. The system was examined and it was found that the language structure was correct and learners were satisfied with the system tools. The learners found the system tools efficient and simple to use. The paper's main conclusion is that CALL can be applied without hesitation to second language learners.
Abstract: On March 11, 2011, the Great East Japan Earthquake occurred off the coast of Sanriku, Japan. It is important to build a sustainable society through the reconstruction process rather than simply restoring the infrastructure. To compare the goals of the reconstruction plans of quake-stricken municipalities, Japanese language morphological analysis was performed using text mining techniques. Frequently used nouns were sorted into four main categories: “life”, “disaster prevention”, “economy”, and “harmony with environment”. Because Soma City was affected by the nuclear accident, sentences tagged “harmony with environment” tended to be more frequent there than in the other municipalities. Results from cluster analysis and principal component analysis clearly indicated that the local government reinforces efforts to reduce risks from radiation exposure as a top priority.
Abstract: In this paper, we propose a method of resolving dependency ambiguities in Korean subordinate clauses based on Support Vector Machines (SVMs). Dependency analysis of clauses is well known to be one of the most difficult tasks in parsing sentences, especially in Korean. In order to solve this problem, we assume that the dependency relation of Korean subordinate clauses is the dependency relation among verb phrases, verbs and endings in the clauses. As a result, this problem is represented as a binary classification task. In order to apply SVMs to this problem, we selected two kinds of features: static and dynamic features. The experimental results on the STEP2000 corpus show that our system achieves an accuracy of 73.5%.
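The binary-classification formulation can be sketched with a toy linear SVM trained by sub-gradient descent on the hinge loss. The indicator features below (invented ending/verb cues) only illustrate casting a clause-dependency decision as binary classification; they are not the paper's static/dynamic feature set, and a production system would use a full SVM solver:

```python
def train_svm(data, dim, epochs=100, lr=0.1, lam=0.01):
    """Linear SVM via sub-gradient descent on the L2-regularized hinge loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:  # labels y are -1 or +1
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(dim):
                # regularization term always; loss term only if margin < 1
                grad = lam * w[i] - (y * x[i] if margin < 1 else 0.0)
                w[i] -= lr * grad
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# toy vectors: [ending-A present, ending-B present, shared-subject cue]
train = [([1, 0, 1], 1), ([1, 0, 0], 1), ([0, 1, 1], -1), ([0, 1, 0], -1)]
w = train_svm(train, dim=3)
print([predict(w, x) for x, _ in train])  # [1, 1, -1, -1]
```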
Abstract: In this paper we focus on event extraction from Tamil
news articles. This system utilizes a scoring scheme for extracting and
grouping event-specific sentences. Using this scoring scheme,
event-specific clustering is performed for multiple documents. Events are
extracted from each document using a scoring scheme based on
feature score and condition score. Similarly, event-specific sentences
are clustered from multiple documents using this scoring scheme.
The proposed system builds the Event Template based on a
user-specified query. The templates are filled with event-specific details
like person, location and timeline extracted from the formed clusters.
The proposed system applies these methodologies for Tamil news
articles that have been enconverted into UNL graphs using a Tamil to
UNL-enconverter. The main intention of this work is to generate an
event based template.
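The combined feature-score/condition-score idea can be sketched with a hypothetical sentence scorer. The keyword sets and the 0.5 weight below are invented for illustration and are not taken from the paper:

```python
EVENT_KEYWORDS = {"earthquake", "flood", "election"}   # assumed feature cues
TIME_CUES = {"yesterday", "today", "tomorrow"}         # assumed condition cues

def score_sentence(sentence):
    """Feature score counts event keywords; condition score counts time cues."""
    words = sentence.lower().split()
    feature_score = sum(w in EVENT_KEYWORDS for w in words)
    condition_score = sum(w in TIME_CUES for w in words)
    return feature_score + 0.5 * condition_score

def extract_event_sentences(sentences, threshold=1.0):
    """Keep sentences whose combined score passes the threshold."""
    return [s for s in sentences if score_sentence(s) >= threshold]

docs = [
    "An earthquake struck the region yesterday",
    "The weather was pleasant",
]
print(extract_event_sentences(docs))  # only the earthquake sentence is kept
```

The retained sentences would then feed the clustering and template-filling stages described above.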
Abstract: XML data has a very flexible tree structure, which
makes it difficult to support the storing and retrieving of XML
data. The node numbering scheme is one of the most popular
approaches to store XML in relational databases. Together with the
node numbering storage scheme, structural joins can be used to
efficiently process the hierarchical relationships in XML. However, in
order to process a tree-structured XPath query containing several
hierarchical relationships and conditional predicates on XML data,
many structural joins need to be carried out, which results in a high
query execution cost. This paper introduces mechanisms to reduce
XPath queries that include branch nodes into a much more efficient form
with fewer structural joins. A two-step approach is proposed.
The first step merges duplicate nodes in the tree-structured query and
the second step divides the query into sub-queries, shortens the paths
and then merges the sub-queries back together. The proposed
approach can highly contribute to the efficient execution of XML
queries. Experimental results show that the proposed scheme can
reduce the query execution cost by up to an order of magnitude of the
original execution cost.
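The node-numbering idea the paper builds on can be sketched as follows: if each element stores a (start, end, level) triple from a preorder traversal, ancestor/descendant tests, and hence structural joins, reduce to interval containment. The tree and numbers below are a hand-made toy, not the paper's exact scheme:

```python
# <book><title/><chapter><section/></chapter></book>, numbered by hand
nodes = {
    "book":    (1, 10, 0),
    "title":   (2, 3, 1),
    "chapter": (4, 9, 1),
    "section": (5, 6, 2),
}

def is_ancestor(a, d):
    """True if node a is an ancestor of node d under region numbering."""
    return nodes[a][0] < nodes[d][0] and nodes[d][1] < nodes[a][1]

def structural_join(ancestors, descendants):
    """Naive structural join: all (a, d) pairs with a an ancestor of d."""
    return [(a, d) for a in ancestors for d in descendants if is_ancestor(a, d)]

print(structural_join(["book", "chapter"], ["section"]))
# [('book', 'section'), ('chapter', 'section')]
```

Each hierarchical step in an XPath query requires one such join over relational tables of (start, end, level) rows, which is why reducing the number of joins reduces the query execution cost.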
Abstract: Summarizing skills have been introduced into the English
syllabus in Malaysian secondary schools to evaluate students’ comprehension of a given text; they require students to employ several strategies to produce a summary. This paper reports on our effort to develop a computer-based summarization assessment system
that detects the strategies used by the students in producing their
summaries. Sentence decomposition of expert-written summaries is
used to analyze how experts produce their summary sentences. From
the analysis, we identified seven summarizing strategies and their
rules which are then transformed into a set of heuristic rules on how
to determine the summarizing strategies. We developed an algorithm
based on the heuristic rules and performed some experiments to
evaluate and support the technique proposed.
Abstract: This paper presents an approach for repairing word order errors in English text by reordering the words in a sentence and choosing the version that maximizes the number of trigram hits according to a language model. A possible way of reordering the words is to use all the permutations; the problem is that for a sentence of N words the number of permutations is N!. The novelty of this method is the use of an efficient confusion matrix technique for reordering the words, designed to reduce the search space among permuted sentences. The limitation of the search space is achieved using the statistical inference of N-grams. The results of this technique are very interesting and prove that the number of permuted sentences can be reduced by 98.16%. For experimental purposes a test set of TOEFL sentences was used, and the results show that more than 95% of the sentences can be repaired using the proposed method.
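The brute-force baseline that the confusion matrix technique improves on can be sketched in a few lines: score every permutation of a short sentence by its trigram hits in a language model and keep the best order. The two-entry trigram set below is a hand-made stand-in for real language-model counts, and the confusion-matrix pruning itself is not reproduced:

```python
from itertools import permutations

TRIGRAMS = {  # toy stand-in for trigram counts from a real language model
    ("the", "cat", "sat"),
    ("cat", "sat", "down"),
}

def trigram_hits(words):
    """Number of word trigrams in the sequence found in the model."""
    return sum(tuple(words[i:i + 3]) in TRIGRAMS for i in range(len(words) - 2))

def repair_order(words):
    """Brute-force reorder: N! candidates, feasible only for short sentences."""
    return max(permutations(words), key=trigram_hits)

print(" ".join(repair_order(["sat", "the", "down", "cat"])))  # the cat sat down
```

The N! blow-up visible here is exactly why pruning the permutation space (by 98.16% in the paper's experiments) matters.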
Abstract: Selecting a word translation from a set of target
language words, one that conveys the correct sense of the source word
and produces more fluent target language output, is one of the core
problems in machine translation. In this paper we compare three
methods of estimating word translation probabilities for selecting the
translation word in Thai-English Machine Translation. The three
methods are (1) a method based on the frequency of word translation,
(2) a method based on the collocation of word translation, and (3) a
method based on the Expectation Maximization (EM) algorithm. For evaluation
we used Thai-English parallel sentences generated by NECTEC. The
method based on the EM algorithm is the best in comparison to the
other methods and gives satisfying results.
Abstract: Extracting thematic (semantic) roles is one of the
major steps in representing text meaning. It refers to finding the
semantic relations between a predicate and syntactic constituents in a
sentence. In this paper we present a rule-based approach to extract
semantic roles from Persian sentences. The system exploits a two-phase
architecture to (1) identify the arguments and (2) label them
for each predicate.
For the first phase we developed a rule-based shallow parser to
chunk Persian sentences and for the second phase we developed a
knowledge-based system to assign 16 selected thematic roles to the
chunks. The experimental results of testing each phase are shown at
the end of the paper.
Abstract: Opinion extraction about products from customer
reviews is becoming an interesting area of research. Customer
reviews about products are nowadays available from blogs and
review sites. Tools are also being developed to extract opinions from
these reviews to help users as well as merchants track the most
suitable choice of product. Therefore, efficient methods and
techniques are needed to extract opinions from reviews and blogs. As
product reviews mostly contain discussion of features, functions and
services, efficient techniques are required to extract user comments
about the desired features, functions and services. In this paper we
have proposed a novel idea to find features
of product from user review in an efficient way. Our focus in this
paper is to get the features and opinion-oriented words about
products from text through auxiliary verbs (AV) {is, was, are, were,
has, have, had}. From the results of our experiments we found that
82% of features and 85% of opinion-oriented sentences include AVs.
Thus these AVs are good indicators of features and opinion
orientation in customer reviews.
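The AV heuristic can be sketched naively: treat sentences containing an auxiliary verb as opinion-bearing and take the words immediately around the AV as a candidate feature/opinion pair. The AV set is the one given above; the one-word window is a simplification invented here, since a real system would use POS tagging and noun-phrase chunking:

```python
import re

AVS = {"is", "was", "are", "were", "has", "have", "had"}

def extract(review):
    """Return naive (feature, opinion) pairs around each auxiliary verb."""
    pairs = []
    for sentence in re.split(r"[.!?]", review):
        words = sentence.lower().split()
        for i, w in enumerate(words):
            if w in AVS and 0 < i < len(words) - 1:
                pairs.append((words[i - 1], words[i + 1]))
    return pairs

print(extract("The battery is excellent. The screen was dim."))
# [('battery', 'excellent'), ('screen', 'dim')]
```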
Abstract: This research uses computational linguistics, an area of study that employs a computer to process natural language, and aims at discerning the patterns that exist in declarative sentences used in technical texts. The approach is mathematical, and the focus is on instructional texts found on web pages. The technique developed by the author and named the MAYA Semantic Technique is used here and organized into four stages. In the first stage, the parts of speech in each sentence are identified. In the second stage, the subject of the sentence is determined. In the third stage, MAYA performs a frequency analysis on the remaining words to determine the verb and its object. In the fourth stage, MAYA performs statistical analysis to determine the content of the web page. The advantage of the MAYA Semantic Technique lies in its use of mathematical principles to represent grammatical operations, which assists processing and improves accuracy when performed on unambiguous text. The MAYA Semantic Technique is part of a proposed architecture for an entire web-based intelligent tutoring system. On a sample set of sentences, partial semantics derived using the MAYA Semantic Technique were approximately 80% accurate. The system currently processes technical text in one domain, namely Cµ programming. In this domain all the keywords and programming concepts are known and understood.
Abstract: We report in this paper the model adopted by our
continuous speech recognition system for the Arabic language, SySRA,
and the results obtained so far. This system uses the Arabdic-10
database, a manually segmented word corpus for the Arabic language.
Phonetic decoding is represented by an expert system whose knowledge
base is translated into the form of production rules. This expert
system transforms a vocal signal into a phonetic lattice. The higher
level of the system takes care of recognizing the lattice thus
obtained by rendering it in the form of written sentences
(orthographic form). This level initially contains the lexical
analyzer, which is none other than the recognition module. We
subjected this analyzer to a set of spectrograms obtained by dictating
a score of sentences in Arabic. The recognition rate for these
sentences is about 70%, which is, to our knowledge, the best result
for the recognition of Arabic. The test set consists of twenty
sentences from four speakers who did not take part in the training.
Abstract: Automatic keyphrase extraction is useful in efficiently
locating specific documents in online databases. While several
techniques have been introduced over the years, improvement on
accuracy rate is minimal. This research examines attribute scores for
author-supplied keyphrases to better understand how the scores affect
the accuracy rate of automatic keyphrase extraction. Five attributes
are chosen for examination: Term Frequency, First Occurrence, Last
Occurrence, Phrase Position in Sentences, and Term Cohesion
Degree. The results show that First Occurrence is the most reliable
attribute. Term Frequency, Last Occurrence and Term Cohesion
Degree display a wide range of variation but are still usable with
suggested tweaks. Only Phrase Position in Sentences shows a totally
unpredictable pattern. The results imply that the commonly used
ranking approach, which directly extracts top-ranked potential phrases
from the candidate keyphrase list as the keyphrases, may not be reliable.
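Three of the five attributes examined above (Term Frequency, First Occurrence, Last Occurrence) can be computed with a small sketch. Normalizing positions by document length is a common convention in keyphrase extraction; the exact normalization used in the study is not specified here:

```python
def phrase_attributes(doc_words, phrase_words):
    """Compute TF and normalized first/last occurrence for a candidate phrase."""
    n, k = len(doc_words), len(phrase_words)
    hits = [i for i in range(n - k + 1) if doc_words[i:i + k] == phrase_words]
    if not hits:
        return None
    return {
        "term_frequency": len(hits),
        "first_occurrence": hits[0] / n,   # earlier in the document -> smaller
        "last_occurrence": hits[-1] / n,
    }

doc = "keyphrase extraction locates keyphrase candidates in text".split()
print(phrase_attributes(doc, ["keyphrase"]))
# {'term_frequency': 2, 'first_occurrence': 0.0, 'last_occurrence': 0.428...}
```

Phrase Position in Sentences and Term Cohesion Degree would additionally require sentence boundaries and co-occurrence statistics, so they are omitted from this sketch.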
Abstract: Hypernetworks are a generalized graph structure
representing higher-order interactions between variables. We present a
method for self-organizing hypernetworks to learn an associative
memory of sentences and to recall the sentences from this memory.
This learning method is inspired by the “mental chemistry” model of
cognition and the “molecular self-assembly” technology in
biochemistry. Simulation experiments are performed on a corpus of
natural-language dialogues of approximately 300K sentences
collected from TV drama captions. We report on the sentence
completion performance as a function of the order of word-interaction
and the size of the learning corpus, and discuss the plausibility of this
architecture as a cognitive model of language learning and memory.
Abstract: In the paper a method of modeling text for Polish is
discussed. The method is aimed at transforming continuous input text
into a text consisting of sentences in so-called canonical form,
characterized by, among other things, a complete structure and the
absence of anaphora and ellipses. The transformation is lossless as to the content
of text being transformed. The modeling method has been worked
out for the needs of the Thetos system, which translates Polish
written texts into the Polish sign language. We believe that the
method can be also used in various applications that deal with the
natural language, e.g. in a text summary generator for Polish.