Abstract: Over the past decade, there have been promising developments in Natural Language Processing (NLP), including several approaches to Recognizing Textual Entailment (RTE). These approaches include models based on lexical similarity, models based on formal reasoning, and, most recently, deep neural models. In this paper, we present a sentence encoding model that exploits sentence-to-sentence relation information for RTE. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) take different approaches to sentence modeling. RNNs are known to be well suited to sequence modeling, whilst CNNs are suited to extracting n-gram features through their filters and can learn ranges of relations via the pooling mechanism. We combine these strengths of RNNs and CNNs in a unified model for the RTE task. Our model combines relation vectors computed from the phrasal representation of each sentence and from the final encoded sentence representations. First, we pass each sentence through a convolutional layer to extract a sequence of higher-level phrase representations, from which the first relation vector is computed. Second, the phrasal representation of each sentence from the convolutional layer is fed into a Bidirectional Long Short-Term Memory (Bi-LSTM) network to obtain the final sentence representations, from which a second relation vector is computed. The relation vectors are combined and then used, in the same fashion as an attention mechanism, over the Bi-LSTM outputs to yield the final sentence representations for classification. Experiments on the Stanford Natural Language Inference (SNLI) corpus suggest that this is a promising technique for RTE.
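The final step above, using a combined relation vector like an attention query over the Bi-LSTM outputs, can be illustrated with a minimal sketch. The abstract does not say how the relation vectors are computed or how the scores are formed, so the dot-product scoring, the function names `softmax` and `attend`, and the toy numbers below are all assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(outputs, relation):
    """Pool per-time-step outputs with weights derived from a relation vector.

    outputs:  list of per-time-step vectors (stand-ins for Bi-LSTM outputs)
    relation: combined relation vector, assumed to have the same dimension
    Scoring by dot product is an assumption, not taken from the abstract.
    """
    scores = [sum(h_i * r_i for h_i, r_i in zip(h, relation)) for h in outputs]
    weights = softmax(scores)
    dim = len(outputs[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, outputs)) for i in range(dim)]
    return weights, pooled

# Toy stand-ins for three Bi-LSTM time steps and one relation vector.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relation = [2.0, 0.0]
weights, sentence_vector = attend(outputs, relation)
```

Time steps whose output aligns with the relation vector receive larger weights, so they contribute more to the pooled sentence vector that a classifier would consume.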
Abstract: This paper proposes a method of learning topics for
broadcasting contents. There are two kinds of texts related to
broadcasting contents. One is a broadcasting script, which is a series of
texts including directions and dialogues. The other is blogposts, which
contain relatively abstract content, stories, and diverse
information about broadcasting contents. Although the two kinds of text
cover similar broadcasting contents, the words used in blogposts and in
broadcasting scripts differ. When unseen words appear, a method is needed
to reflect them in the existing topics. In this paper, we introduce a semantic
vocabulary expansion method to reflect unseen words. We expand
topics of the broadcasting script by incorporating the words in
blogposts. Each word in blogposts is added to the most semantically
correlated topics. We use word2vec to get the semantic correlation
between words in blogposts and topics of scripts. The vocabularies of
topics are updated and then posterior inference is performed to
rearrange the topics. In experiments, we verified that the proposed
method can discover more salient topics for broadcasting contents.
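The expansion step described above (adding each blogpost word to its most semantically correlated topic) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how a topic is represented in the embedding space, so representing a topic by the centroid of its word vectors is an assumption, the tiny hand-written vectors stand in for real word2vec embeddings, and the subsequent posterior inference step is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def topic_centroid(topic_words, vectors):
    """Average the vectors of a topic's words (assumes at least one is known)."""
    vecs = [vectors[w] for w in topic_words if w in vectors]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def expand_topics(topics, blog_words, vectors):
    """Add each unseen blogpost word to the topic with the closest centroid."""
    centroids = {name: topic_centroid(words, vectors) for name, words in topics.items()}
    expanded = {name: list(words) for name, words in topics.items()}
    for word in blog_words:
        already_known = any(word in words for words in topics.values())
        if word not in vectors or already_known:
            continue  # no embedding available, or not an unseen word
        best = max(centroids, key=lambda name: cosine(vectors[word], centroids[name]))
        expanded[best].append(word)
    return expanded

# Hypothetical 2-dimensional embeddings standing in for word2vec output.
vectors = {
    "doctor": [1.0, 0.1], "hospital": [0.9, 0.2], "surgeon": [0.95, 0.15],
    "kiss": [0.1, 1.0], "date": [0.2, 0.9], "romance": [0.15, 0.95],
}
topics = {"medical": ["doctor", "hospital"], "love": ["kiss", "date"]}
expanded = expand_topics(topics, ["surgeon", "romance", "unknown"], vectors)
```

In this toy run, "surgeon" joins the "medical" topic and "romance" joins the "love" topic, while "unknown" is skipped because it has no embedding.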
Abstract: Scripts are one of the basic text resources for understanding
broadcasting contents. Topic modeling is a method for obtaining a
summary of broadcasting contents from their scripts. Generally,
scripts represent contents descriptively with directions and speeches,
and provide scene segments that can be seen as semantic units.
Therefore, a script can be topic modeled by treating a scene segment
as a document. However, because scene segments consist mainly of
speeches, relatively few word co-occurrences are observed within the
scene segments. This inevitably degrades the quality of the topics
learned by statistical methods. To tackle this problem, we
propose a method to improve topic quality with additional word
co-occurrence information obtained using scene similarities. The
main idea is that knowing that
two or more texts are topically related helps in learning
high-quality topics. In turn, more accurate topical representations
yield more accurate judgments of whether two texts are related
or not. In this paper, we regard two scene segments as related
if their topical similarity is high enough. We also regard
words as co-occurring if they appear in topically related scene
segments. By iteratively inferring topics and determining semantically
neighboring scene segments, we obtain a topic space that represents
broadcasting contents well. In the experiments, we show that the
proposed method generates higher-quality topics from Korean
drama scripts than the baselines.
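One round of the idea above (deciding which scene segments are topically related, then counting words from related scenes as co-occurring) can be sketched as follows. This is an illustration under stated assumptions: cosine similarity over per-scene topic distributions, the threshold value, and the function names are hypothetical, and the outer loop that re-infers topics from the augmented counts is omitted.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def related_scene_pairs(scene_topics, threshold):
    """Index pairs of scenes whose topical similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(scene_topics)), 2)
        if cosine(scene_topics[i], scene_topics[j]) >= threshold
    ]

def cooccurrence_counts(scenes, related_pairs):
    """Count word pairs within each scene and across topically related scenes."""
    counts = Counter()
    for words in scenes:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    for i, j in related_pairs:
        for a in set(scenes[i]):
            for b in set(scenes[j]):
                if a != b:
                    counts[tuple(sorted((a, b)))] += 1
    return counts

# Toy scenes with hypothetical topic distributions from a previous inference round.
scenes = [["hospital", "doctor"], ["surgery", "nurse"], ["wedding", "ring"]]
scene_topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
pairs = related_scene_pairs(scene_topics, threshold=0.9)
counts = cooccurrence_counts(scenes, pairs)
```

Only the first two scenes are similar enough to be linked here, so "doctor" and "nurse" gain a co-occurrence even though they never share a scene, while "doctor" and "wedding" do not.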
Abstract: Consumer-to-Consumer (C2C) E-commerce has been
growing rapidly in recent years. Since identical or
nearly identical kinds of products compete with one another through
keyword search in C2C E-commerce, some sellers describe their
products with spam keywords that are popular but are not related to
their products. Though such products get more chances to be retrieved
and selected by consumers than those without spam keywords,
the spam keywords mislead the consumers and waste their time.
This problem has been reported in many commercial services such as
eBay and Taobao, but there has been little research on solving
it. As a solution, this paper proposes a method
to classify whether keywords of a product are spam or not. The
proposed method assumes that a keyword for a given product is
more reliable if the keyword is observed commonly in specifications
of products which are the same or the same kind as the given
product. This is because the hierarchical category of a product
is, in general, determined precisely by the seller of the product, and so
is the specification of the product. Since higher layers of the hierarchical
category represent more general kinds of products, the degree of
reliability is determined differently for each layer. Hence, the reliability
degrees from the different layers of the hierarchical category become
features for keywords, and they are used together with features derived
only from specifications to classify the keywords. Support Vector
Machines are adopted as the base classifier over these features, since
they are powerful and widely used in many classification tasks. In
the experiments, the proposed method is evaluated on a gold-standard
dataset from Yi-han-wang, a Chinese C2C E-commerce site,
and is compared with a baseline method that does not consider
the hierarchical category. The experimental results show that the
proposed method outperforms the baseline in F1-measure, which
indicates that spam keywords can be effectively identified using the
hierarchical category in C2C E-commerce.
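The per-layer reliability features described above can be illustrated with a short sketch. The abstract does not give the exact formula, so defining the reliability degree at a layer as the fraction of product specifications in the same category (down to that layer) that contain the keyword is an assumption, and the catalog, category names, and function name are hypothetical. The resulting list of degrees would serve as part of the feature vector passed to the classifier, which is not shown.

```python
def reliability_by_layer(keyword, category_path, catalog):
    """Return one reliability degree per layer of the category hierarchy.

    category_path: tuple of category names from the top layer down
    catalog:       list of (category_path, set of specification terms)
    The degree at depth d is the share of products agreeing with the given
    product on the first d layers whose specification contains the keyword
    (an assumed definition, used here only for illustration).
    """
    degrees = []
    for depth in range(1, len(category_path) + 1):
        prefix = category_path[:depth]
        peers = [spec for path, spec in catalog if path[:depth] == prefix]
        hits = sum(1 for spec in peers if keyword in spec)
        degrees.append(hits / len(peers) if peers else 0.0)
    return degrees

# Hypothetical catalog of product categories and specification terms.
catalog = [
    (("Electronics", "Phones", "Smartphones"), {"android", "5g", "oled"}),
    (("Electronics", "Phones", "Smartphones"), {"android", "lte"}),
    (("Electronics", "Phones", "Feature phones"), {"lte", "keypad"}),
    (("Electronics", "Laptops", "Ultrabooks"), {"ssd", "oled"}),
]
path = ("Electronics", "Phones", "Smartphones")
related = reliability_by_layer("android", path, catalog)
spam_like = reliability_by_layer("nike", path, catalog)
```

A keyword that is genuinely related to the product tends to score higher at deeper, more specific layers, while a keyword that never appears in any related specification scores zero at every layer.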
Abstract: In this paper, we propose a method for resolving dependency ambiguities of Korean subordinate clauses based on Support Vector Machines (SVMs). Dependency analysis of clauses is well known to be one of the most difficult tasks in parsing sentences, especially in Korean. To solve this problem, we assume that the dependency relation of Korean subordinate clauses is the dependency relation among the verb phrases, verbs, and endings in the clauses. As a result, the problem is represented as a binary classification task. To apply SVMs to this problem, we selected two kinds of features: static and dynamic features. The experimental results on the STEP2000 corpus show that our system achieves an accuracy of 73.5%.
Abstract: It is an important task in Korean-English machine
translation to classify the gender of names correctly. When a sentence
is composed of two or more clauses and only one subject is given as a
proper noun, it is important to find the gender of the proper noun
for correct translation of the sentence. This is because a singular
pronoun has a gender in English while it does not in Korean. Thus,
in Korean-English machine translation, the gender of a proper noun
should be determined. More generally, this task can be expanded into
the gender classification of general Korean names. This paper proposes
a statistical method for this problem. By considering a name as just
a sequence of syllables, it is possible to obtain statistics for each
name from a collection of names. An evaluation of the proposed method
shows an improvement in accuracy over simple look-up in the
collection. While the accuracy of the look-up method is 64.11%, that
of the proposed method is 81.49%. This implies that the proposed
method is more plausible for the gender classification of Korean
names.
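Treating a name as a sequence of syllables makes it possible to score names that never appear in the collection, which a plain look-up cannot do. The sketch below shows one way to do this; the abstract does not specify the statistical model, so the naive Bayes formulation with add-one smoothing is an assumption, and the four training names and their labels are toy data chosen for illustration. Each Hangul character is treated as one syllable.

```python
import math
from collections import Counter, defaultdict

def train(names):
    """Collect per-gender syllable counts from (name, gender) pairs."""
    class_counts = Counter()
    syllable_counts = defaultdict(Counter)
    vocabulary = set()
    for name, gender in names:
        class_counts[gender] += 1
        for syllable in name:  # one Hangul character = one syllable
            syllable_counts[gender][syllable] += 1
            vocabulary.add(syllable)
    return class_counts, syllable_counts, vocabulary

def classify(name, model):
    """Return the gender with the highest smoothed naive Bayes log-score."""
    class_counts, syllable_counts, vocabulary = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for gender, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(syllable_counts[gender].values()) + len(vocabulary)
        for syllable in name:
            score += math.log((syllable_counts[gender][syllable] + 1) / denom)
        if score > best_score:
            best, best_score = gender, score
    return best

# Toy collection of given names with illustrative labels (M = male, F = female).
model = train([("지훈", "M"), ("민준", "M"), ("서연", "F"), ("지은", "F")])
unseen_a = classify("서은", model)
unseen_b = classify("민훈", model)
```

Neither "서은" nor "민훈" occurs in the toy collection, yet both can be classified because their individual syllables were observed with one gender.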
Abstract: Ontologies are widely used in many kinds of applications as a knowledge representation tool for domain knowledge. However, even when an ontology schema is well prepared by domain experts, adding instances to the ontology is tedious and cost-intensive. The most reliable and trustworthy way to add instances to the ontology is to gather them from tables in related Web pages. In automatic instance population, the primary task is to find, for a given table, the most appropriate concept among all possible concepts within the ontology. This paper proposes a novel method for this problem that defines the similarity between a table and a concept using the overlap of their properties. According to a series of experiments, the proposed method achieves an accuracy of 76.98%. This implies that the proposed method is a plausible approach to automatic ontology population from Web tables.
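Choosing a concept for a table by property overlap can be sketched as follows. The abstract does not define the similarity measure, so using the Jaccard coefficient between a table's column headers and a concept's property names is an assumption, and the small ontology and header list are hypothetical.

```python
def overlap_similarity(table_headers, concept_properties):
    """Jaccard overlap between table headers and concept properties.

    Names are compared case-insensitively after trimming whitespace.
    The Jaccard coefficient is an assumed stand-in for the paper's measure.
    """
    headers = {h.strip().lower() for h in table_headers}
    properties = {p.strip().lower() for p in concept_properties}
    if not headers or not properties:
        return 0.0
    return len(headers & properties) / len(headers | properties)

def best_concept(table_headers, ontology):
    """Return the concept whose properties overlap most with the table."""
    return max(ontology, key=lambda c: overlap_similarity(table_headers, ontology[c]))

# Hypothetical ontology: concept name -> list of property names.
ontology = {
    "Movie": ["title", "director", "release year", "genre"],
    "Book": ["title", "author", "publisher", "isbn"],
}
headers = ["Title", "Director", "Genre", "Runtime"]
chosen = best_concept(headers, ontology)
```

Here the table shares three of its four headers with "Movie" and only one with "Book", so "Movie" is selected as the concept to populate.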