Abstract: Online marketplaces such as Rakuten and Alibaba have recently become some of the most popular e-commerce sites in Asia. On these shopping websites, consumers can choose products to purchase from a large number of stores. Consumers must also register their name, age, gender, and other information in advance in order to access their accounts. A method for analyzing consumer preferences from both the store side and the product side is therefore required. This study uses the Doc2Vec method, which has been studied in the field of natural language processing. Doc2Vec has been widely used in document classification to extract semantic relationships between documents and words; here, documents correspond to consumers and words correspond to products. This concept can be applied to represent the relationship between users and items; however, Doc2Vec does not account for one additional factor, namely shops. More precisely, a method for analyzing the relationship among consumers, stores, and products is required. The purpose of our study is to combine the Doc2Vec analyses of users and shops and of users and items in the same feature space. This method enables similar shops and items to be calculated for each user. We apply the method to real data accumulated in an online marketplace and demonstrate the efficiency of the proposal.
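As a rough illustration of what a shared feature space makes possible, the following Python sketch ranks shops and items for one user by cosine similarity. The vectors, the names `user_1`, `shop_a`, `shop_b`, `item_x`, `item_y`, and the helper `most_similar` are hypothetical placeholders, not values or functions from the study. In practice the vectors would come from trained Doc2Vec models, and cosine similarity is an assumed choice of measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(user_vec, candidates, top_n=2):
    """Rank candidate vectors (shops or items) by similarity to one user."""
    scored = [(name, cosine(user_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors standing in for embeddings learned in one shared space
# (assumption: real vectors would be produced by the trained models).
user_vectors = {"user_1": [0.9, 0.1, 0.0]}
shop_vectors = {"shop_a": [0.8, 0.2, 0.1], "shop_b": [0.0, 0.1, 0.9]}
item_vectors = {"item_x": [1.0, 0.0, 0.0], "item_y": [0.1, 0.9, 0.2]}

top_shops = most_similar(user_vectors["user_1"], shop_vectors)
top_items = most_similar(user_vectors["user_1"], item_vectors)
```

Because users, shops, and items live in the same space, a single similarity function serves both rankings.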
Abstract: Recently, numerous documents including large
volumes of unstructured data and text have been created because of the
rapid increase in the use of social media and the Internet. Usually,
these documents are categorized for the convenience of users. Because
the accuracy of manual categorization is not guaranteed, and such
categorization requires a large amount of time and incurs huge costs,
many studies on automatic categorization have been conducted to help
mitigate the limitations of manual categorization. Unfortunately, most
of these methods cannot be applied to categorize complex documents
with multiple topics because they work on the assumption that
individual documents can be categorized into single categories only.
Therefore, to overcome this limitation, some studies have attempted to
categorize each document into multiple categories. However, the
learning process employed in these studies involves training using a
multi-categorized document set. These methods therefore cannot be
applied to the multi-categorization of most documents unless the
multi-categorized training sets required by traditional multi-categorization
algorithms are provided. To overcome this limitation, in this study, we
review our novel methodology for extending the category of a
single-categorized document to multiple categories, and then
introduce a survey-based verification scenario for estimating the
accuracy of our automatic categorization methodology.
Abstract: Field Association (FA) terms are a limited set of discriminating terms that provide the knowledge needed to identify document fields, and they are effective in document classification, similar file retrieval, and passage retrieval. However, there is no effective method for automatically extracting relevant Arabic FA terms to build a comprehensive dictionary. Moreover, all previous studies are based on FA terms in English and Japanese, and extending FA terms to other languages such as Arabic could strengthen further research. This paper presents a new method to extract Arabic FA terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. Experimental evaluation is carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news; the method selected an average of 2,825 FA terms (single and compound) per field. In the experimental results, recall and precision are 84% and 79%, respectively. The method therefore selects a large number of relevant Arabic FA terms at high precision and recall.
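The corpora comparison step can be pictured with a small Python sketch. It is not the paper's scoring rule: the function `field_concentration`, the toy counts, and the English placeholder tokens are all assumptions made for illustration. The idea shown is only that a term whose relative frequency is concentrated in one field's corpus is a stronger candidate for association with that field.

```python
from collections import Counter


def field_concentration(term, field_counts):
    """Return (best_field, share) for a term across per-field corpora.

    share is the best field's relative frequency divided by the sum of
    the term's relative frequencies over all fields (hypothetical measure).
    """
    rel = {
        field: counts[term] / max(sum(counts.values()), 1)
        for field, counts in field_counts.items()
    }
    total = sum(rel.values())
    if total == 0:
        return None, 0.0
    best = max(rel, key=rel.get)
    return best, rel[best] / total


# Toy term counts per field; real input would be POS-filtered Arabic terms.
field_counts = {
    "sports": Counter({"goal": 8, "match": 6, "team": 6}),
    "economy": Counter({"market": 9, "bank": 7, "team": 4}),
}

goal_field, goal_share = field_concentration("goal", field_counts)
team_field, team_share = field_concentration("team", field_counts)
```

A term such as `goal`, which occurs in only one toy corpus, gets a share of 1.0, while `team`, which is spread over both, scores lower and would be a weaker candidate under any cutoff one chose to apply.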
Abstract: This paper proposes an auto-classification algorithm
for Web pages using data mining techniques. We consider the
problem of discovering association rules between terms in a set of
Web pages belonging to a category in a search engine database, and
present an auto-classification algorithm for solving this problem that
is fundamentally based on the Apriori algorithm. The proposed
technique has two phases. The first phase is a training phase where
human experts determine the categories of different Web pages, and
the supervised data mining algorithm will combine these categories
with appropriate weighted index terms according to the highest
supported rules among the most frequent words. The second phase is
the categorization phase where a web crawler will crawl through the
World Wide Web to build a database categorized according to the
result of the data mining approach. This database contains URLs and
their categories.
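To make the Apriori step concrete, here is a minimal Python sketch of level-wise frequent term-set mining over the pages of one category. It illustrates the generic Apriori idea only, not the paper's weighting or rule selection; the function name `apriori`, the toy pages, and the support threshold of 0.5 are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(terms): support} for all frequent term sets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of pages that contain every term in the set.
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-sets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Toy "pages" of one category, each reduced to a set of index terms.
pages = [
    {"stock", "market", "price"},
    {"stock", "market"},
    {"stock", "price"},
    {"football", "score"},
]
frequent_sets = apriori(pages, min_support=0.5)
```

Term sets that survive the support threshold, such as `{"stock", "market"}` in this toy data, are the kind of evidence from which category-specific association rules could then be derived.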
Abstract: This paper presents the design and implementation of
the WebGD, a CORBA-based document classification and retrieval
system on the Internet. The WebGD makes use of such techniques as Web,
CORBA, Java, NLP, fuzzy technique, knowledge-based processing
and database technology. A unified classification and retrieval model,
classification and retrieval with a single reasoning engine, and flexible
working mode configuration are some of its main features. The
architecture of WebGD, the unified classification and retrieval model,
the components of the WebGD server and the fuzzy inference engine
are discussed in this paper in detail.