Abstract: Methods for organizing web data into groups in order
to analyze web-based hypertext data and facilitate data availability
are very important in terms of the number of documents available
online. Thereby, the task of clustering web-based document structures
has many applications, e.g., improving information retrieval on the
web, better understanding of user navigation behavior, improving web
users requests servicing, and increasing web information accessibility.
In this paper we investigate a new approach for clustering web-based
hypertexts on the basis of their graph structures. The hypertexts will
be represented as so called generalized trees which are more general
than usual directed rooted trees, e.g., DOM-Trees. As a important
preprocessing step we measure the structural similarity between the
generalized trees on the basis of a similarity measure d. Then,
we apply agglomerative clustering to the obtained similarity matrix
in order to create clusters of hypertext graph patterns representing
navigation structures. In the present paper we will run our approach
on a data set of hypertext structures and obtain good results in
Web Structure Mining. Furthermore we outline the application of
our approach in Web Usage Mining as future work.
Abstract: The inherent flexibilities of XML in both structure
and semantics makes mining from XML data a complex task with
more challenges compared to traditional association rule mining in
relational databases. In this paper, we propose a new model for the
effective extraction of generalized association rules form a XML
document collection. We directly use frequent subtree mining
techniques in the discovery process and do not ignore the tree
structure of data in the final rules. The frequent subtrees based on the
user provided support are split to complement subtrees to form the
rules. We explain our model within multi-steps from data preparation
to rule generation.
Abstract: Typical Intelligent Decision Support System is 4-based, its design composes of Data Warehouse, Online Analytical Processing, Data Mining and Decision Supporting based on models, which is called Decision Support System Based on Data Warehouse (DSSBDW). This way takes ETL,OLAP and DM as its implementing means, and integrates traditional model-driving DSS and data-driving DSS into a whole. For this kind of problem, this paper analyzes the DSSBDW architecture and DW model, and discusses the following key issues: ETL designing and Realization; metadata managing technology using XML; SQL implementing, optimizing performance, data mapping in OLAP; lastly, it illustrates the designing principle and method of DW in DSSBDW.
Abstract: Web usage mining is an interesting application of data
mining which provides insight into customer behaviour on the Internet. An important technique to discover user access and navigation trails is based on sequential patterns mining. One of the
key challenges for web access patterns mining is tackling the problem
of mining richly structured patterns. This paper proposes a novel
model called Web Access Patterns Graph (WAP-Graph) to represent all of the access patterns from web mining graphically. WAP-Graph
also motivates the search for new structural relation patterns, i.e. Concurrent Access Patterns (CAP), to identify and predict more
complex web page requests. Corresponding CAP mining and modelling methods are proposed and shown to be effective in the
search for and representation of concurrency between access patterns
on the web. From experiments conducted on large-scale synthetic
sequence data as well as real web access data, it is demonstrated that
CAP mining provides a powerful method for structural knowledge discovery, which can be visualised through the CAP-Graph model.
Abstract: Opinion extraction about products from customer
reviews is becoming an interesting area of research. Customer
reviews about products are nowadays available from blogs and
review sites. Also tools are being developed for extraction of opinion
from these reviews to help the user as well merchants to track the
most suitable choice of product. Therefore efficient method and
techniques are needed to extract opinions from review and blogs. As
reviews of products mostly contains discussion about the features,
functions and services, therefore, efficient techniques are required to
extract user comments about the desired features, functions and
services. In this paper we have proposed a novel idea to find features
of product from user review in an efficient way. Our focus in this
paper is to get the features and opinion-oriented words about
products from text through auxiliary verbs (AV) {is, was, are, were,
has, have, had}. From the results of our experiments we found that
82% of features and 85% of opinion-oriented sentences include AVs.
Thus these AVs are good indicators of features and opinion
orientation in customer reviews.
Abstract: Data clustering is an important data exploration technique
with many applications in data mining. We present an enhanced
version of the well known single link clustering algorithm. We will
refer to this algorithm as DCBOR. The proposed algorithm alleviates
the chain effect by removing the outliers from the given dataset.
So this algorithm provides outlier detection and data clustering
simultaneously. This algorithm does not need to update the distance
matrix, since the algorithm depends on merging the most k-nearest
objects in one step and the cluster continues grow as long as possible
under specified condition. So the algorithm consists of two phases;
at the first phase, it removes the outliers from the input dataset. At
the second phase, it performs the clustering process. This algorithm
discovers clusters of different shapes, sizes, densities and requires
only one input parameter; this parameter represents a threshold for
outlier points. The value of the input parameter is ranging from 0 to
1. The algorithm supports the user in determining an appropriate
value for it. We have tested this algorithm on different datasets
contain outlier and connecting clusters by chain of density points,
and the algorithm discovers the correct clusters. The results of
our experiments demonstrate the effectiveness and the efficiency of
DCBOR.
Abstract: Sequential pattern mining is a challenging task in data mining area with large applications. One among those applications is mining patterns from weblog. Recent times, weblog is highly dynamic and some of them may become absolute over time. In addition, users may frequently change the threshold value during the data mining process until acquiring required output or mining interesting rules. Some of the recently proposed algorithms for mining weblog, build the tree with two scans and always consume large time and space. In this paper, we build Revised PLWAP with Non-frequent Items (RePLNI-tree) with single scan for all items. While mining sequential patterns, the links related to the nonfrequent items are not considered. Hence, it is not required to delete or maintain the information of nodes while revising the tree for mining updated transactions. The algorithm supports both incremental and interactive mining. It is not required to re-compute the patterns each time, while weblog is updated or minimum support changed. The performance of the proposed tree is better, even the size of incremental database is more than 50% of existing one. For evaluation purpose, we have used the benchmark weblog dataset and found that the performance of proposed tree is encouraging compared to some of the recently proposed approaches.
Abstract: Graph has become increasingly important in modeling
complicated structures and schemaless data such as proteins, chemical
compounds, and XML documents. Given a graph query, it is desirable
to retrieve graphs quickly from a large database via graph-based
indices. Different from the existing methods, our approach, called
VFM (Vertex to Frequent Feature Mapping), makes use of vertices
and decision features as the basic indexing feature. VFM constructs
two mappings between vertices and frequent features to answer graph
queries. The VFM approach not only provides an elegant solution to
the graph indexing problem, but also demonstrates how database
indexing and query processing can benefit from data mining,
especially frequent pattern mining. The results show that the proposed
method not only avoids the enumeration method of getting subgraphs
of query graph, but also effectively reduces the subgraph isomorphism
tests between the query graph and graphs in candidate answer set in
verification stage.
Abstract: In this paper, we present a methodology for finding
authoritative researchers by analyzing academic Web sites. We show
a case study in which we concentrate on a set of Czech computer
science departments- Web sites. We analyze the relations between
them via hyperlinks and find the most important ones using several
common ranking algorithms. We then examine the contents of the
research papers present on these sites and determine the most
authoritative Czech authors.
Abstract: Text data mining is a process of exploratory data
analysis. Classification maps data into predefined groups or classes.
It is often referred to as supervised learning because the classes are
determined before examining the data. This paper describes proposed
radial basis function Classifier that performs comparative crossvalidation
for existing radial basis function Classifier. The feasibility
and the benefits of the proposed approach are demonstrated by means
of data mining problem: direct Marketing. Direct marketing has
become an important application field of data mining. Comparative
Cross-validation involves estimation of accuracy by either stratified
k-fold cross-validation or equivalent repeated random subsampling.
While the proposed method may have high bias; its performance
(accuracy estimation in our case) may be poor due to high variance.
Thus the accuracy with proposed radial basis function Classifier was
less than with the existing radial basis function Classifier. However
there is smaller the improvement in runtime and larger improvement
in precision and recall. In the proposed method Classification
accuracy and prediction accuracy are determined where the
prediction accuracy is comparatively high.
Abstract: This paper focuses on the probabilistic numerical
solution of the problems in biomechanics and mining. Applications of
Simulation-Based Reliability Assessment (SBRA) Method are
presented in the solution of designing of the external fixators applied
in traumatology and orthopaedics (these fixators can be applied for
the treatment of open and unstable fractures etc.) and in the solution
of a hard rock (ore) disintegration process (i.e. the bit moves into the
ore and subsequently disintegrates it, the results are compared with
experiments, new design of excavation tool is proposed.
Abstract: Recently the use of data mining to scientific bibliographic data bases has been implemented to analyze the pathways of the knowledge or the core scientific relevances of a laureated novel or a country. This specific case of data mining has been named citation mining, and it is the integration of citation bibliometrics and text mining. In this paper we present an improved WEB implementation of statistical physics algorithms to perform the text mining component of citation mining. In particular we use an entropic like distance between the compression of text as an indicator of the similarity between them. Finally, we have included the recently proposed index h to characterize the scientific production. We have used this web implementation to identify users, applications and impact of the Mexican scientific institutions located in the State of Morelos.
Abstract: Clustering is a very well known technique in data mining. One of the most widely used clustering techniques is the k-means algorithm. Solutions obtained from this technique are dependent on the initialization of cluster centers. In this article we propose a new algorithm to initialize the clusters. The proposed algorithm is based on finding a set of medians extracted from a dimension with maximum variance. The algorithm has been applied to different data sets and good results are obtained.
Abstract: To investigate the correspondence of theory and
practice, a successfully implemented Knowledge Management
System (KMS) is explored through the lens of Alavi and Leidner-s
proposed KMS framework for the analysis of an information system
in knowledge management (Framework-AISKM). The applied KMS
system was designed to manage curricular knowledge in a distributed
university environment. The motivation for the KMS is discussed
along with the types of knowledge necessary in an academic setting.
Elements of the KMS involved in all phases of capturing and
disseminating knowledge are described. As the KMS matures the
resulting data stores form the precursor to and the potential for
knowledge mining. The findings from this exploratory study indicate
substantial correspondence between the successful KMS and the
theory-based framework providing provisional confirmation for the
framework while suggesting factors that contributed to the system-s
success. Avenues for future work are described.
Abstract: This paper deals with the application of a well-known neural network technique, multilayer back-propagation (BP) neural network, in financial data mining. A modified neural network forecasting model is presented, and an intelligent mining system is developed. The system can forecast the buying and selling signs according to the prediction of future trends to stock market, and provide decision-making for stock investors. The simulation result of seven years to Shanghai Composite Index shows that the return achieved by this mining system is about three times as large as that achieved by the buy and hold strategy, so it is advantageous to apply neural networks to forecast financial time series, the different investors could benefit from it.
Abstract: The running logs of a process hold valuable
information about its executed activity behavior and generated activity
logic structure. Theses informative logs can be extracted, analyzed and
utilized to improve the efficiencies of the process's execution and
conduction. One of the techniques used to accomplish the process
improvement is called as process mining. To mine similar processes is
such an improvement mission in process mining. Rather than directly
mining similar processes using a single comparing coefficient or a
complicate fitness function, this paper presents a simplified heuristic
process mining algorithm with two similarity comparisons that are
able to relatively conform the activity logic sequences (traces) of
mining processes with those of a normalized (regularized) one. The
relative process conformance is to find which of the mining processes
match the required activity sequences and relationships, further for
necessary and sufficient applications of the mined processes to process
improvements. One similarity presented is defined by the relationships
in terms of the number of similar activity sequences existing in
different processes; another similarity expresses the degree of the
similar (identical) activity sequences among the conforming processes.
Since these two similarities are with respect to certain typical behavior
(activity sequences) occurred in an entire process, the common
problems, such as the inappropriateness of an absolute comparison and
the incapability of an intrinsic information elicitation, which are often
appeared in other process conforming techniques, can be solved by the
relative process comparison presented in this paper. To demonstrate
the potentiality of the proposed algorithm, a numerical example is
illustrated.
Abstract: With the explosive growth of information sources available on the World Wide Web, it has become increasingly difficult to identify the relevant pieces of information, since web pages are often cluttered with irrelevant content like advertisements, navigation-panels, copyright notices etc., surrounding the main content of the web page. Hence, tools for the mining of data regions, data records and data items need to be developed in order to provide value-added services. Currently available automatic techniques to mine data regions from web pages are still unsatisfactory because of their poor performance and tag-dependence. In this paper a novel method to extract data items from the web pages automatically is proposed. It comprises of two steps: (1) Identification and Extraction of the data regions based on visual clues information. (2) Identification of data records and extraction of data items from a data region. For step1, a novel and more effective method is proposed based on visual clues, which finds the data regions formed by all types of tags using visual clues. For step2 a more effective method namely, Extraction of Data Items from web Pages (EDIP), is adopted to mine data items. The EDIP technique is a list-based approach in which the list is a linear data structure. The proposed technique is able to mine the non-contiguous data records and can correctly identify data regions, irrespective of the type of tag in which it is bound. Our experimental results show that the proposed technique performs better than the existing techniques.
Abstract: The SOM has several beneficial features which make
it a useful method for data mining. One of the most important
features is the ability to preserve the topology in the projection.
There are several measures that can be used to quantify the goodness
of the map in order to obtain the optimal projection, including the
average quantization error and many topological errors. Many
researches have studied how the topology preservation should be
measured. One option consists of using the topographic error which
considers the ratio of data vectors for which the first and second best
BMUs are not adjacent. In this work we present a study of the
behaviour of the topographic error in different kinds of maps. We
have found that this error devaluates the rectangular maps and we
have studied the reasons why this happens. Finally, we suggest a new
topological error to improve the deficiency of the topographic error.
Abstract: Most of the existing text mining approaches are
proposed, keeping in mind, transaction databases model. Thus, the
mined dataset is structured using just one concept: the “transaction",
whereas the whole dataset is modeled using the “set" abstract type. In
such cases, the structure of the whole dataset and the relationships
among the transactions themselves are not modeled and
consequently, not considered in the mining process.
We believe that taking into account structure properties of
hierarchically structured information (e.g. textual document, etc ...)
in the mining process, can leads to best results. For this purpose, an
hierarchical associations rule mining approach for textual documents
is proposed in this paper and the classical set-oriented mining
approach is reconsidered profits to a Direct Acyclic Graph (DAG)
oriented approach. Natural languages processing techniques are used
in order to obtain the DAG structure. Based on this graph model, an
hierarchical bottom up algorithm is proposed. The main idea is that
each node is mined with its parent node.
Abstract: DNA microarrays allow the measurement of expression levels for a large number of genes, perhaps all genes of an organism, within a number of different experimental samples. It is very much important to extract biologically meaningful information from this huge amount of expression data to know the current state of the cell because most cellular processes are regulated by changes in gene expression. Association rule mining techniques are helpful to find association relationship between genes. Numerous association rule mining algorithms have been developed to analyze and associate this huge amount of gene expression data. This paper focuses on some of the popular association rule mining algorithms developed to analyze gene expression data.