Abstract: Current research on sentiment analysis focuses heavily on social networks. The DEfi Fouille de Texte (DEFT) (Text Mining Challenge) evaluation campaign addresses opinion mining and sentiment analysis on social networks, especially the Twitter social network. It aims to compare the systems produced by several teams from public and private research laboratories. DEFT offers participants the opportunity to work on regularly renewed themes and has devoted several editions to opinion mining. The purpose of this article is to scrutinize and analyze the work on opinion mining and sentiment analysis in the Twitter social network carried out within DEFT. It examines the tasks proposed by the organizers of the challenge and the methods used by the participants.
Abstract: The purpose of this research is to develop an algorithm capable of classifying news articles from the automobile industry according to the competitive actions they entail, using Text Mining (TM) methods. The data must be properly preprocessed, so pipelines are prepared that fit each algorithm best. The pipelines are tested along with nine different classification algorithms from the realms of regression, support vector machines, and neural networks. Preliminary testing to identify the optimal pipelines and algorithms resulted in the selection of two algorithms, each with its own pipeline: Logistic Regression (LR) and an Artificial Neural Network (ANN). These algorithms are then optimized further by testing several parameters of each. The best result is achieved with the ANN. The final model yields an accuracy of 0.79, a precision of 0.80, a recall of 0.78, and an F1 score of 0.76. By removing three of the classes that created noise, the final algorithm is capable of reaching an accuracy of 94%.
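As a minimal illustration of the kind of preprocessing-and-classification pipeline this abstract describes, the sketch below pairs a TF-IDF vectorizer with Logistic Regression and a small neural network in scikit-learn; the toy articles, labels, and hyperparameters are assumptions for demonstration, not the study's actual data or configuration.

```python
# Minimal sketch of text-classification pipelines (illustrative only).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Toy corpus: each article carries a hypothetical competitive-action label.
articles = [
    "Automaker cuts prices on its compact models",
    "Manufacturer announces new electric SUV line",
    "Brand expands dealership network in Asia",
    "Company lowers lease rates to boost sales",
    "Firm unveils autonomous driving technology",
    "Carmaker opens new assembly plant abroad",
]
labels = ["pricing", "new_product", "expansion", "pricing", "new_product", "expansion"]

pipelines = {
    "LR": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
    "ANN": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42)),
    ]),
}

new_articles = ["Automaker slashes sticker prices across its sedan lineup"]
for name, pipe in pipelines.items():
    pipe.fit(articles, labels)
    print(name, pipe.predict(new_articles))
```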
Abstract: This article offers an approach to the automatic discovery of semantic concepts and links in the domain of Oil Exploration and Production (E&P). Machine learning methods combined with textual pre-processing techniques were used to detect local patterns in texts and, thus, generate new concepts and new semantic links. Even when using highly specific vocabularies within the oil domain, our approach achieved satisfactory results, suggesting that the proposal can be applied to other domains and languages with only minor adjustments.
Abstract: Japan’s semiconductor industries have developed greatly in recent years. Many started as Small and Medium-sized Enterprises (SMEs) founded under favorable circumstances and have grown into prosperous industries worldwide. The sustainable growth factors that support the creation of spirit value inside Japanese companies are strongly embedded in their performance, yet these factors are not clearly defined within each company. A series of literature studies was conducted, using quantitative text mining, to explore the definition of sustainable growth factors. Sustainability criteria were developed from previous research to verify the definition of the factors. A framework was proposed as a systematic approach to developing sustainable growth factors in a specific company. A review of the approach over a certain period shows that the factors influencing sustainable growth are important for the company to achieve its goals.
Abstract: The Internet has grown into a powerful medium for information dissemination and social interaction, leading to the rapid growth of social media, which allows users to easily post their emotions and perspectives on certain topics online. Our research aims to use natural language processing and text mining techniques to explore the public emotions expressed on Twitter by analyzing the sentiment behind tweets. In this paper, we propose a composite kernel method that integrates a tree kernel with a linear kernel to simultaneously exploit both the tree representation and the distributed emotion keyword representation, analyzing the syntactic and content information in tweets. The experimental results demonstrate that our method can effectively detect the public emotion of tweets while outperforming the other compared methods.
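A rough sketch of the composite-kernel idea is given below: two kernel matrices are mixed with a weight and passed to an SVM with a precomputed kernel. The structural component here is only a character n-gram stand-in, not the tree kernel over parse trees or the distributed emotion keyword representation used in the paper, and the data and mixing weight are illustrative assumptions.

```python
# Sketch of combining two kernels for SVM emotion classification
# (placeholder kernels, not the paper's actual tree kernel).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

tweets = [
    "I love this new phone, so happy!",
    "This traffic makes me furious",
    "What a wonderful day with friends",
    "I am so angry about the delayed flight",
]
emotions = ["joy", "anger", "joy", "anger"]

# Content representation (word unigrams) and a crude structural stand-in
# (character n-grams); the real system uses a parse-tree kernel here.
content_vec = TfidfVectorizer(analyzer="word").fit(tweets)
struct_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(tweets)
Xc = content_vec.transform(tweets)
Xs = struct_vec.transform(tweets)

alpha = 0.5  # mixing weight between the two kernels (assumed value)
K_train = alpha * (Xs @ Xs.T).toarray() + (1 - alpha) * (Xc @ Xc.T).toarray()
clf = SVC(kernel="precomputed").fit(K_train, emotions)

# A new tweet is classified via its kernel values against the training tweets.
new = ["Feeling great after the concert"]
K_new = alpha * (struct_vec.transform(new) @ Xs.T).toarray() \
      + (1 - alpha) * (content_vec.transform(new) @ Xc.T).toarray()
print(clf.predict(K_new))
```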
Abstract: The problem of entity relation discovery, a well-covered topic in the literature, consists in searching within unstructured sources (typically, text) in order to find connections among entities. The entities can come from a whole dictionary or from a specific collection of named items. In many cases machine learning and/or text mining techniques are used for this goal. These approaches might be unfeasible in computationally challenging problems, such as processing massive data streams. A faster approach consists in collecting the co-occurrences of any two words (entities) in order to create a graph of relations - a co-occurrence graph. Indeed, each co-occurrence highlights some degree of semantic correlation between the words, because related words are more likely to appear close to each other than at opposite ends of the text. Some authors have used sliding windows for this problem: they count all the co-occurrences within a sliding window running over the whole text. In this paper we generalise this technique into a Weighted-Distance Sliding Window, where each occurrence of two named items within the window is counted with a weight depending on the distance between the items: a smaller distance implies stronger evidence of a relationship. We develop an experiment to support this intuition by applying the technique to a data set consisting of the text of the Bible, split into verses.
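A minimal sketch of the weighted-distance idea follows, assuming an inverse-distance weight 1/d within the window; the actual weighting function and window size used in the paper may differ.

```python
# Sketch of a weighted-distance sliding-window co-occurrence graph.
# The 1/distance weight is an assumed example; any decreasing function works.
from collections import defaultdict

def weighted_cooccurrence_graph(tokens, window_size=5):
    """Accumulate edge weights for every pair of tokens at most
    `window_size` positions apart, weighting closer pairs more."""
    graph = defaultdict(float)
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window_size, len(tokens))):
            w2 = tokens[j]
            if w1 == w2:
                continue
            distance = j - i
            edge = tuple(sorted((w1, w2)))
            graph[edge] += 1.0 / distance  # closer => stronger evidence
    return graph

verse = "in the beginning god created the heaven and the earth".split()
for (a, b), w in sorted(weighted_cooccurrence_graph(verse).items(),
                        key=lambda kv: -kv[1])[:5]:
    print(f"{a} -- {b}: {w:.2f}")
```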
Abstract: Twitter is a microblogging platform where millions of users share their attitudes, views, and opinions daily. Using a probabilistic Latent Dirichlet Allocation (LDA) topic model to discern the most popular topics in Twitter data is an effective way to analyze a large set of tweets and find a set of topics in a computationally efficient manner. Sentiment analysis provides an effective method to reveal the emotions and sentiments found in each tweet and an efficient way to summarize the results in a manner that is clearly understood. The primary goal of this paper is to explore text mining and to extract and analyze useful information from unstructured text using two approaches, LDA topic modelling and sentiment analysis, by examining English-language Twitter plain-text data. These two methods allow people to mine data more effectively and efficiently. LDA topic modelling and sentiment analysis can also be applied to provide insights in business and scientific fields.
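As a small illustration of the two approaches, the sketch below fits an LDA model with scikit-learn and applies a toy sentiment lexicon; the corpus, the number of topics, and the lexicon are assumptions for demonstration only.

```python
# Sketch: LDA topic modelling plus a toy lexicon-based sentiment score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "great game tonight the team played really well",
    "terrible traffic on the highway again so frustrating",
    "loving the new coffee shop downtown amazing espresso",
    "the match was awful the referee ruined everything",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"Topic {k}: {top}")

# Toy sentiment lexicon (illustrative only).
positive = {"great", "loving", "amazing", "well"}
negative = {"terrible", "frustrating", "awful", "ruined"}
for t in tweets:
    words = set(t.split())
    print(len(words & positive) - len(words & negative), t)
```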
Abstract: Datasets or collections are becoming important assets in their own right, and they can now be accepted as a primary intellectual output of research. The quality and usage of datasets depend mainly on the context under which they were collected, processed, analyzed, validated, and interpreted. This paper presents a collection of program educational objectives mapped to student outcomes, collected from self-study reports prepared by 32 engineering programs accredited by ABET. The manual mapping (classification) of these data is a notoriously tedious, time-consuming process and, in addition, requires experts in the area, who are mostly not available. The operational settings under which the collection was produced are described. The collection has been cleansed and preprocessed, some features have been selected, and preliminary exploratory data analysis has been performed to illustrate the properties and usefulness of the collection. Finally, the collection has been benchmarked using nine of the most widely used supervised multi-label classification techniques (Binary Relevance, Label Powerset, Classifier Chains, Pruned Sets, Random k-label sets, Ensemble of Classifier Chains, Ensemble of Pruned Sets, Multi-Label k-Nearest Neighbors, and Back-Propagation Multi-Label Learning). The techniques have been compared using five well-known measures (Accuracy, Hamming Loss, Micro-F, Macro-F, and Macro-F). The Ensemble of Classifier Chains and Ensemble of Pruned Sets achieved encouraging performance compared with the other multi-label classification methods tested, while the Classifier Chains method showed the worst performance. In summary, the benchmark achieved promising results, and the preliminary exploratory data analysis performed on the collection suggests new directions for research and provides a baseline for future studies.
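The collection itself is not reproduced here, but the sketch below shows how two of the listed techniques (Binary Relevance via independent per-label classifiers and Classifier Chains) can be benchmarked with some of the listed measures on synthetic multi-label data; the base learner and the data are illustrative assumptions.

```python
# Sketch of benchmarking two multi-label techniques with common measures.
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, hamming_loss, f1_score

X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

models = {
    "Binary Relevance": MultiOutputClassifier(LogisticRegression(max_iter=1000)),
    "Classifier Chain": ClassifierChain(LogisticRegression(max_iter=1000), random_state=0),
}

for name, model in models.items():
    pred = model.fit(X_tr, Y_tr).predict(X_te)
    print(name,
          "subset acc:", round(accuracy_score(Y_te, pred), 3),
          "hamming loss:", round(hamming_loss(Y_te, pred), 3),
          "micro-F1:", round(f1_score(Y_te, pred, average="micro"), 3),
          "macro-F1:", round(f1_score(Y_te, pred, average="macro"), 3))
```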
Abstract: Nowadays, the Internet enables its users to share information online and to interact with others. Faced with a vast amount of information, Internet users become confused and begin to rely on opinion leaders’ recommendations. Online opinion leaders are individuals who have professional knowledge, who utilize online channels to spread word-of-mouth information, and who can affect the attitudes or even the behavior of their followers to some degree. Because engaging online opinion leaders is seen as an important approach to influencing potential consumers, how to identify them has become one of the hottest topics in the field. Hence, in this article, the concepts and characteristics of opinion leaders are introduced, and the research related to identifying opinion leaders is collected and divided into three categories. Finally, implications for future studies are provided.
Abstract: In this paper, we present an evolving knowledge
extraction system named AKEOS (Automatic Knowledge Extraction
from Online Sources). AKEOS consists of two modules, including
a one-time learning module and an evolving learning module.
The one-time learning module takes in a user query and
automatically harvests knowledge from online unstructured resources
in an unsupervised way. The output of the one-time learning is a
structured vector representing the harvested knowledge. The evolving
learning module automatically schedules and performs repeated
one-time learning to extract the newest information and track the
development of an event. In addition, the evolving learning module
summarizes the knowledge learned at different time points to produce
a final knowledge vector about the event. With the evolving learning,
we are able to visualize the key information of the event, discover
the trends, and track the development of an event.
Abstract: New approaches to analyzing and visualizing data streams in real time are important for enabling decision makers to make prompt decisions. Financial market trading and surveillance, large-scale emergency response, and crowd control are example scenarios that require real-time analytics and data visualization. This situation has led to the development of techniques and tools that support humans in analyzing the source data. With the emergence of Big Data and social media, new techniques and tools are required to process the streaming data. Today, a range of tools implementing some of these functionalities is available. In this paper, we present a chronological evaluation of the evolution of technologies that support real-time analytics and visualization of data streams. Based on research papers published from 2002 to 2014, we gathered general information, main techniques, challenges, and open issues. The techniques for streaming text visualization are identified, in chronological order, based on the Text Visualization Browser. This paper aims to review the evolution of streaming text visualization techniques and tools, as well as to discuss the problems and challenges for each of the identified tools.
Abstract: In this paper, we present a method for applying Independent Topic Analysis (ITA) to a growing collection of documents. The number of documents has been increasing since the spread of the Internet, and ITA was proposed as one method to analyze such document data. ITA extracts independent topics from document data by using Independent Component Analysis (ICA), a technique from signal processing. However, it is difficult to apply ITA to a growing number of documents, because ITA must use all of the documents, and its temporal and spatial costs are therefore very high. We therefore present Incremental ITA, which extracts independent topics from a growing number of documents by updating the independent topics whenever new documents are added after the topics have been extracted from the previous data. Finally, we show the results of applying Incremental ITA to benchmark datasets.
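The incremental update rule is the paper's contribution and is not reproduced here; the sketch below only illustrates the underlying ITA step, i.e. extracting "independent topics" by running ICA over a document-term matrix, with an assumed TF-IDF preprocessing and a toy corpus.

```python
# Sketch of the base ITA step: independent "topics" via ICA on a
# document-term matrix (the incremental update itself is not shown).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import FastICA

docs = [
    "stock market prices rise as investors buy shares",
    "team wins the championship after a great season",
    "central bank adjusts interest rates to curb inflation",
    "the striker scored twice in the final match",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs).toarray()   # documents x terms
terms = vec.get_feature_names_out()

ica = FastICA(n_components=2, random_state=0)
ica.fit(X)
# components_ has shape (n_components, n_terms); large-magnitude entries
# indicate terms characteristic of each independent component ("topic").
for k, comp in enumerate(ica.components_):
    top = [terms[i] for i in np.argsort(np.abs(comp))[-4:][::-1]]
    print(f"Independent topic {k}: {top}")
```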
Abstract: Urban regeneration projects have been actively promoted in Korea. In particular, Jeonju Hanok Village is regarded as a representative case of utilizing local cultural heritage sites in an urban regeneration project. Recently, however, there has been growing concern in this area due to gentrification caused by excessive commercialization and surging tourist numbers. This trend has changed land and building use and resulted in a loss of regional identity. In this regard, this study analyzed the land use transformation between 2010 and 2016 to identify the commercialization trend in Jeonju Hanok Village. In addition, it conducted SNS big data analysis on Jeonju Hanok Village from February 14th, 2016 to March 31st, 2016 to identify visitors’ awareness of the village. The study results demonstrate that rapid commercialization was underway, contrary to the initial intention, so planners and officials in the city government should reconsider the project direction and rebuild deliberate management strategies. This study is meaningful in that it analyzed land use transformation and SNS big data to identify the current situation in an urban regeneration area. Furthermore, it is expected that the study results will contribute to the vitalization of the regeneration area.
Abstract: Arabic is one of the most ancient and important languages in the world. It has more than 250 million native speakers, and more than twenty countries have Arabic as one of their official languages. In the past decade, we have witnessed a rapid evolution in smart devices, social networks, and the technology sector, which has led to the need for tools and libraries that properly handle the Arabic language in different domains. Stemming is one of the most crucial linguistic fundamentals; it is used in many applications, especially in information extraction and text mining. The motivation behind this work is to enhance the Arabic light stemmer to serve the data mining industry and make it available to the open source community. The presented implementation enhances the Arabic light stemmer by utilizing and improving an algorithm that provides an extension with a new set of rules and patterns accompanied by an adjusted procedure. This study demonstrates a significant enhancement in search accuracy, with an average improvement of 10% compared with previous work.
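For illustration, a much-simplified light stemmer is sketched below; the prefix and suffix lists are generic examples and do not reflect the enhanced rules and patterns proposed in the paper.

```python
# Simplified illustration of Arabic light stemming: strip a few common
# prefixes and suffixes. These rules are illustrative only.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ب", "ك", "ف", "ل"]
SUFFIXES = ["ات", "ون", "ين", "ها", "هم", "ية", "ه", "ة", "ي"]

def light_stem(word, min_stem_len=3):
    for pre in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(pre) and len(word) - len(pre) >= min_stem_len:
            word = word[len(pre):]
            break
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem_len:
            word = word[:-len(suf)]
            break
    return word

for w in ["المكتبات", "والكتاب", "مدرسة"]:
    print(w, "->", light_stem(w))
```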
Abstract: The collaborative filtering (CF) algorithm has been widely used in recommender systems in both academic and practical applications. It basically generates recommendation results using users’ numeric ratings. However, using information beyond user ratings may lead to better CF accuracy. Considering that, with the advent of Web 2.0, many people are likely to share honest opinions on the items they have recently purchased, user reviews can be regarded as a new, informative source for accurately identifying user preferences. Against this background, this study presents a hybrid recommender system that fuses CF and user review mining. Our system adopts conventional memory-based CF, but it is designed to use both users’ numeric ratings and their text reviews on items when calculating similarities between users.
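A minimal sketch of such a fusion is shown below, assuming the rating-based and review-based similarities are each measured with cosine similarity and mixed with a fixed weight; the actual combination used in the system may differ.

```python
# Sketch of a hybrid user-user similarity: mix rating similarity with
# review-text similarity (mixing weight and measures are assumptions).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Ratings matrix: rows = users, columns = items (0 = not rated).
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
])

# One concatenated review text per user.
reviews = [
    "loved the camera battery life is great",
    "great camera but the battery could be better",
    "poor build quality battery died quickly",
]

rating_sim = cosine_similarity(ratings)
review_sim = cosine_similarity(TfidfVectorizer().fit_transform(reviews))

beta = 0.6  # weight on rating similarity (assumed)
hybrid_sim = beta * rating_sim + (1 - beta) * review_sim

# Nearest neighbour of user 0 under the hybrid similarity (skip itself).
neighbour = np.argsort(hybrid_sim[0])[::-1][1]
print("Most similar user to user 0:", neighbour)
```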
Abstract: Online user-generated content (UGC) significantly changes the way customers behave (e.g., shop, travel), and the pressing need to handle the overwhelming amount of diverse UGC is one of the paramount issues for management. However, current approaches (e.g., sentiment analysis) are often ineffective at leveraging textual information to detect the problems or issues that a given organization suffers from. In this paper, we apply Latent Dirichlet Allocation (LDA) text mining to a popular online review site dedicated to complaints from users. We find that LDA efficiently detects customer complaints, and further inspection with a visualization technique is effective for categorizing the problems or issues. As such, management can identify the issues at stake and prioritize them accordingly, in a timely manner, given a limited amount of resources. The findings provide managerial insights into how analytics on social media can help maintain and improve reputation management. Our interdisciplinary approach also highlights several insights gained by applying machine learning techniques in the marketing research domain. On a broader technical note, this paper illustrates the details of how to implement LDA in R from beginning (data collection in R) to end (LDA analysis in R), since such instruction is still largely undocumented. In this regard, it will help lower the barrier for interdisciplinary researchers conducting related research.
Abstract: With the advancement of information technology and the development of group commerce, people’s lifestyles have changed considerably. However, group commerce faces some challenging problems. The products or services provided by vendors do not satisfactorily reflect customers’ opinions, so the sales and revenue of group commerce gradually decline. On the other hand, the process by which a formed customer group reaches a group-purchasing consensus is time-consuming, and the final decision is not the best choice for each group member. In this paper, we design a social decision support mechanism that uses group discussion messages to recommend suitable options for group members and considers social influence and personal preference to generate an option ranking list. The proposed mechanism can make group purchasing decision making more efficient and effective, and vendors can provide group products or services according to the group option ranking list.
Abstract: Traditional early warning systems that alarm against crises are generally based on structured or numerical data; therefore, a system that can make predictions based on unstructured textual data, an uncorrelated data source, is a valuable complement to traditional early warning systems. The Chicago Board Options Exchange (CBOE) Volatility Index (VIX), commonly referred to as the fear index, measures the cost of insurance against a market crash and spikes in the event of crisis. In this study, news data is used to predict whether there will be a market-wide crisis by predicting the movement of the fear index, and historical references to similar events are presented in an unsupervised manner. Topic-modeling-based prediction and representation are performed on daily news data between 1990 and 2015 from The Wall Street Journal against VIX index data from the CBOE.
Abstract: Recently, many users have begun to frequently share
their opinions on diverse issues using various social media. Therefore,
numerous governments have attempted to establish or improve
national policies according to the public opinions captured from
various social media. In this paper, we indicate several limitations of
the traditional approaches to analyze public opinion on science and
technology and provide an alternative methodology to overcome these
limitations. First, we distinguish between the science and technology
analysis phase and the social issue analysis phase to reflect the fact that
public opinion can be formed only when a certain science and
technology is applied to a specific social issue. Next, we successively
apply a start list and a stop list to acquire clarified and interesting
results. Finally, to identify the most appropriate documents that fit
with a given subject, we develop a new logical filter concept that
consists of not only mere keywords but also a logical relationship
among the keywords. This study then analyzes the possibilities for the
practical use of the proposed methodology through its application to
discover core issues and public opinions from 1,700,886 documents
comprising SNS, blogs, news, and discussions.
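As an illustration of the logical filter concept, the sketch below keeps only documents that satisfy a boolean combination of keyword conditions; the keyword sets and the rule are hypothetical, not the study's actual filters.

```python
# Sketch of a logical filter: keep only documents satisfying a boolean
# combination of keyword conditions (hypothetical keywords and rule).
def contains_any(doc, keywords):
    words = doc.lower().split()
    return any(k in words for k in keywords)

def logical_filter(doc):
    # Example rule: (mentions the technology) AND (mentions the social issue)
    # AND NOT (matches the exclusion list).
    technology = {"ai", "robot", "automation"}
    social_issue = {"jobs", "employment", "unemployment"}
    exclusions = {"movie", "game"}
    return (contains_any(doc, technology)
            and contains_any(doc, social_issue)
            and not contains_any(doc, exclusions))

docs = [
    "automation may threaten manufacturing jobs",
    "new robot movie breaks box office records",
    "ai helps doctors read scans faster",
]
print([d for d in docs if logical_filter(d)])
```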
Abstract: Despite the highly touted benefits, emerging
technologies have unleashed pervasive concerns regarding unintended
and unforeseen social impacts. Thus, those wishing to create safe and
socially acceptable products need to identify such side effects and
mitigate them prior to market proliferation. Various methodologies
in the field of technology assessment (TA), namely Delphi, impact
assessment, and scenario planning, have been widely incorporated in
such circumstances. However, the literature faces a major limitation in its sole reliance on participatory workshop activities; it has missed the availability of a massive untapped data source of futuristic information flooding through the Internet. This research thus seeks to gain insights into the utilization of futuristic data, i.e., future-oriented documents from the Internet, as a supplementary method to generate social impact scenarios while capturing the perspectives of experts from a wide variety of disciplines. To this end,
network analysis is conducted based on the social keywords extracted
from the futuristic documents by text mining, which is then used as a
guide to produce a comprehensive set of detailed scenarios. Our
proposed approach facilitates harmonized depictions of possible
hazardous consequences of emerging technologies and thereby makes
decision makers more aware of, and responsive to, broad qualitative
uncertainties.
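For illustration, the sketch below builds a small keyword co-occurrence network with networkx and ranks keywords by degree centrality as candidate scenario seeds; the documents, keywords, and choice of centrality measure are assumptions, not the study's actual data or procedure.

```python
# Sketch: build a keyword co-occurrence network from futuristic documents
# and rank keywords by centrality (illustrative data and keyword lists).
from itertools import combinations
import networkx as nx

documents = [
    ["autonomous vehicles", "job loss", "liability"],
    ["autonomous vehicles", "privacy", "surveillance"],
    ["gene editing", "inequality", "privacy"],
    ["autonomous vehicles", "job loss", "inequality"],
]

G = nx.Graph()
for keywords in documents:
    for a, b in combinations(sorted(set(keywords)), 2):
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)

# Central keywords can serve as seeds for detailed social-impact scenarios.
centrality = nx.degree_centrality(G)
for kw, c in sorted(centrality.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{kw}: {c:.2f}")
```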