Abstract: In recent years, the volume of document data has grown
with the spread of the Internet, and many methods have been
studied for extracting topics from large document data. We proposed
Independent Topic Analysis (ITA) to extract mutually independent topics
from large document data such as newspaper data. ITA extracts
independent topics from document data by using Independent Component
Analysis. A topic extracted by ITA is represented by a set of words.
However, this set of words can be quite different from the topic a user
would imagine. For example, the five words with the highest independence
for one topic are as follows:
Topic 1 = {"scor", "game", "lead", "quarter", "rebound"}. Topic 1
can be considered to represent the topic "SPORTS", but this topic name
has to be attached by the user; ITA cannot name topics by itself.
Therefore, in this research, we propose a method that uses a web search
engine to obtain topic names that are easy for people to understand from
the sets of words produced by Independent Topic Analysis. Specifically,
we submit a topic's set of words as a search query and take the title of
the web page returned as the search result as the topic name. We also
apply the proposed method to some data and verify its effectiveness.
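
The naming step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which search engine or API is used, so the search backend is passed in as a hypothetical callable (`SearchFn`), and `fake_search` with its canned titles is invented purely to make the sketch runnable.

```python
from typing import Callable, List, Optional

# Hypothetical search interface (an assumption, not from the paper): it takes
# a query string and returns page titles in ranked order. A real web search
# client would be plugged in here.
SearchFn = Callable[[str], List[str]]


def name_topic(topic_words: List[str], search: SearchFn) -> Optional[str]:
    """Join the topic's words into one query and return the top result's title."""
    query = " ".join(topic_words)
    titles = search(query)
    # No results means no name can be proposed for this topic.
    return titles[0] if titles else None


def fake_search(query: str) -> List[str]:
    """Stand-in for a web search engine; the canned titles are invented."""
    canned = {
        "scor game lead quarter rebound": ["SPORTS", "Basketball scores"],
    }
    return canned.get(query, [])


topic1 = ["scor", "game", "lead", "quarter", "rebound"]
print(name_topic(topic1, fake_search))
```

Passing the search backend as a parameter keeps the sketch independent of any particular search service, which the abstract leaves unspecified.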
Abstract: Recently, numerous documents, including large
volumes of unstructured data and text, have been created because of the
rapid increase in the use of social media and the Internet. Usually,
these documents are categorized for the convenience of users. Because
the accuracy of manual categorization is not guaranteed, and such
categorization requires a large amount of time and incurs huge costs,
many studies on automatic categorization have been conducted to help
mitigate the limitations of manual categorization. Unfortunately, most
of these methods cannot be applied to categorize complex documents
with multiple topics because they work on the assumption that
individual documents can be categorized into single categories only.
Therefore, to overcome this limitation, some studies have attempted to
categorize each document into multiple categories. However, the
learning process employed in these studies involves training using a
multi-categorized document set. Consequently, traditional
multi-categorization algorithms cannot be applied to most documents
unless multi-categorized training sets are provided. To overcome this
limitation, in this study, we
review our novel methodology for extending the category of a
single-categorized document to multiple categories, and then
introduce a survey-based verification scenario for estimating the
accuracy of our automatic categorization methodology.
Abstract: The need to extract R&D keywords from issues and use
them to retrieve R&D information is increasing rapidly. However, it is
difficult to identify related issues or to distinguish between them.
Although the similarity between issues cannot be identified directly, an
R&D lexicon makes it possible to determine which issues share the same
R&D keywords. Specifically, the R&D keywords associated with a
particular issue indicate the key technology elements needed to solve
that issue.
Furthermore, the relationship among issues that share the same
R&D keywords can be shown in a more systematic way by clustering
them according to keywords. Thus, sharing R&D results and reusing
R&D technology can be facilitated. Indirectly, redundant investment
in R&D can be reduced as the relevant R&D information can be shared
among corresponding issues and the reusability of related R&D can be
improved. Therefore, a methodology to cluster issues from the
perspective of common R&D keywords is proposed to satisfy these
demands.
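
The idea of grouping issues by shared R&D keywords can be illustrated with a small sketch. This is not the proposed methodology, whose details the abstract does not give; it assumes a lexicon of single-word keywords and simple whitespace tokenization, and the lexicon and issue texts below are hypothetical examples.

```python
from collections import defaultdict
from typing import Dict, List, Set


def extract_keywords(text: str, lexicon: Set[str]) -> Set[str]:
    """Keep only the tokens of an issue text that appear in the lexicon."""
    tokens = {t.strip(".,;:()").lower() for t in text.split()}
    return tokens & lexicon


def cluster_by_keyword(issues: Dict[str, str], lexicon: Set[str]) -> Dict[str, List[str]]:
    """Map each keyword to the sorted ids of the issues that mention it."""
    clusters = defaultdict(list)
    for issue_id, text in issues.items():
        for keyword in extract_keywords(text, lexicon):
            clusters[keyword].append(issue_id)
    # Keep only keywords shared by at least two issues.
    return {kw: sorted(ids) for kw, ids in clusters.items() if len(ids) >= 2}


# Hypothetical lexicon and issues, invented for illustration only.
lexicon = {"battery", "sensor", "catalyst"}
issues = {
    "A": "Electric vehicle range is limited by battery capacity.",
    "B": "Grid storage needs a cheaper battery and a better sensor.",
    "C": "Air quality monitoring depends on a low-cost sensor.",
}
clusters = cluster_by_keyword(issues, lexicon)
print(clusters)
```

In this sketch an issue can belong to several clusters, one per keyword it shares with another issue, which reflects the point that a shared keyword indicates a technology element common to those issues.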