Abstract: In order to survive on the market, companies must
constantly develop improved and new products. These products are
designed to serve the needs of their customers in the best possible
way. The creation of new products is also called innovation and is
primarily driven by a company’s internal research and development
department. However, a new approach has been taking place for some
years now, involving external knowledge in the innovation process.
This approach is called open innovation and identifies customer
knowledge as the most important source in the innovation process. This paper presents a concept of using social media posts as an external source to support the open innovation approach in its
initial phase, the Ideation phase. For this purpose, the social media
posts are semantically structured with the help of an ontology and
the authors are evaluated using graph-theoretical metrics such as
density. For the structuring and evaluation of relevant social media
posts, we also use the findings of Natural Language Processing, e.
g. Named Entity Recognition, specific dictionaries, Triple Tagger
and Part-of-Speech-Tagger. The selection and evaluation of the tools
used are discussed in this paper. Using our ontology and metrics
to structure social media posts enables users to semantically search
these posts for new product ideas and thus gain an improved insight
into the external sources such as customer needs.
Abstract: One of the major goals of Spoken Dialog Systems
(SDS) is to understand what the user utters.
In the SDS domain, the Spoken Language Understanding (SLU)
Module classifies user utterances by means of a pre-definite
conceptual knowledge. The SLU module is able to recognize only the
meaning previously included in its knowledge base. Due the vastity
of that knowledge, the information storing is a very expensive
process.
Updating and managing the knowledge base are time-consuming
and error-prone processes because of the rapidly growing number of
entities like proper nouns and domain-specific nouns. This paper
proposes a solution to the problem of Name Entity Recognition
(NER) applied to a SDS domain. The proposed solution attempts to
automatically recognize the meaning associated with an utterance by
using the PANKOW (Pattern based Annotation through Knowledge
On the Web) method at runtime.
The method being proposed extracts information from the Web to
increase the SLU knowledge module and reduces the development
effort. In particular, the Google Search Engine is used to extract
information from the Facebook social network.
Abstract: The internet is growing larger and becoming the most popular platform for the people to share their opinion in different interests. We choose the education domain specifically comparing some Malaysian universities against each other. This comparison produces benchmark based on different criteria shared by the online users in various online resources including Twitter, Facebook and web pages. The comparison is accomplished using opinion mining framework to extract, process the unstructured text and classify the result to positive, negative or neutral (polarity). Hence, we divide our framework to three main stages; opinion collection (extraction), unstructured text processing and polarity classification. The extraction stage includes web crawling, HTML parsing, Sentence segmentation for punctuation classification, Part of Speech (POS) tagging, the second stage processes the unstructured text with stemming and stop words removal and finally prepare the raw text for classification using Named Entity Recognition (NER). Last phase is to classify the polarity and present overall result for the comparison among the Malaysian universities. The final result is useful for those who are interested to study in Malaysia, in which our final output declares clear winners based on the public opinions all over the web.
Abstract: This paper explores the scalability issues associated
with solving the Named Entity Recognition (NER) problem using
Support Vector Machines (SVM) and high-dimensional features. The
performance results of a set of experiments conducted using binary
and multi-class SVM with increasing training data sizes are
examined. The NER domain chosen for these experiments is the
biomedical publications domain, especially selected due to its
importance and inherent challenges. A simple machine learning
approach is used that eliminates prior language knowledge such as
part-of-speech or noun phrase tagging thereby allowing for its
applicability across languages. No domain-specific knowledge is
included. The accuracy measures achieved are comparable to those
obtained using more complex approaches, which constitutes a
motivation to investigate ways to improve the scalability of multiclass
SVM in order to make the solution more practical and useable.
Improving training time of multi-class SVM would make support
vector machines a more viable and practical machine learning
solution for real-world problems with large datasets. An initial
prototype results in great improvement of the training time at the
expense of memory requirements.
Abstract: Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is now-a-days considered to be fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports about the development of a NER system for Bengali and Hindi using Support Vector Machine (SVM). Though this state of the art machine learning technique has been widely applied to NER in several well-studied languages, the use of this technique to Indian languages (ILs) is very new. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the four different named (NE) classes, such as Person name, Location name, Organization name and Miscellaneous name. We have used the annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi tagged with the twelve different NE classes 1, defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL) 2. In addition, we have manually annotated 150K wordforms of the Bengali news corpus, developed from the web-archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm in order to generate the lexical context patterns from a part of the unlabeled Bengali news corpus. Lexical patterns have been used as the features of SVM in order to improve the system performance. The NER system has been tested with the gold standard test sets of 35K, and 60K tokens for Bengali, and Hindi, respectively. Evaluation results have demonstrated the recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. Results show the improvement in the f-score by 5.13% with the use of context patterns. Statistical analysis, ANOVA is also performed to compare the performance of the proposed NER system with that of the existing HMM based system for both the languages.