Abstract: The paper evaluates the ongoing reform of VAT in the Czech Republic in terms of its impact on individual households. The method is based on data on household consumption by household quintile; these data are subjected to micro-simulation analysis. Results are discussed in terms of vertical tax justice. The analysis reveals that VAT behaves regressively and that merely consolidating the rates at a higher level only increases the regressivity of this tax in the Czech Republic.
Abstract: Classifying biomedical literature is a difficult and
challenging task, especially when a large number of biomedical
articles should be organized into a hierarchical structure. In this paper,
we present an approach for classifying a collection of biomedical text
abstracts downloaded from the Medline database with the help of
ontology alignment. To accomplish our goal, we construct two types
of hierarchies, the OHSUMED disease hierarchy and the Medline
abstract disease hierarchies from the OHSUMED dataset and the
Medline abstracts, respectively. Then, we enrich the OHSUMED
disease hierarchy before adapting it to the ontology alignment process for
finding probable concepts or categories. Subsequently, we compute
the cosine similarity between the vector in probable concepts (in the
"enriched" OHSUMED disease hierarchy) and the vector in Medline
abstract disease hierarchies. Finally, we assign a category to the new
Medline abstracts based on the similarity score. The results obtained
from the experiments show the performance of our proposed approach
for hierarchical classification is slightly better than the performance of
the multi-class flat classification.
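The similarity-based assignment step described above can be sketched as follows. This is a minimal illustration, assuming sparse term-frequency vectors; the function names `cosine_similarity` and `assign_category` and the toy concept vectors are hypothetical and are not taken from the paper or from OHSUMED.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def assign_category(abstract_tokens, concept_vectors):
    """Return the (category, score) pair with the highest cosine similarity."""
    doc_vec = Counter(abstract_tokens)
    scores = {name: cosine_similarity(doc_vec, vec)
              for name, vec in concept_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy concept vectors, for illustration only.
concepts = {
    "cardiovascular": Counter({"heart": 3, "artery": 2, "blood": 1}),
    "respiratory": Counter({"lung": 3, "asthma": 2, "airway": 1}),
}
category, score = assign_category(["heart", "blood", "pressure"], concepts)
```

Here the new abstract shares terms only with the first concept vector, so it is assigned to that category; in a hierarchy, the same comparison would be repeated against each probable concept.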
Abstract: In an era of knowledge explosion, the volume of data
grows rapidly day by day. Since data storage is a limited resource,
reducing the space that data occupy becomes a challenging issue.
Data compression provides a good solution which can lower the
required space. Data mining has many useful applications in recent
years because it can help users discover interesting knowledge in large
databases. However, existing compression algorithms are not
appropriate for data mining. In [1, 2], two different approaches were
proposed to compress databases and then perform the data mining
process. However, both lack the ability to decompress the data to
their original state and to improve data mining performance. In this
research, a new approach called Mining Merged Transactions with the
Quantification Table (M2TQT) is proposed to solve these problems.
M2TQT uses the relationship of transactions to merge related
transactions and builds a quantification table to prune the candidate
itemsets which are impossible to become frequent in order to improve
the performance of mining association rules. The experiments show
that M2TQT performs better than existing approaches.
Abstract: Nowadays, many organizations use systems that
support business processes as a whole or partially. However, in some
application domains, like software development and health care
processes, a normative Process Aware System (PAS) is not suitable,
because a flexible support is needed to respond rapidly to new
process models. On the other hand, a flexible Process Aware System
may be vulnerable to undesirable and fraudulent executions, which
imposes a tradeoff between flexibility and security. In order to balance
this tradeoff, a genetic-based anomaly detection model for
logs of Process Aware Systems is presented in this paper. The
detection of an anomalous trace is based on discovering an
appropriate process model by using genetic process mining and
detecting traces that do not fit the appropriate model as anomalous
traces; therefore, when used in PAS, this model is an automated
solution that can support coexistence of flexibility and security.
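The final detection step, flagging traces that do not fit a discovered model, can be sketched as below. This is a simplified illustration only: the genetic process mining step is not implemented, and the model is assumed to be given as start activities, end activities and a set of allowed directly-follows pairs. The names `fits_model` and `anomalous_traces` and the example log are hypothetical.

```python
def fits_model(trace, start, end, allowed_follows):
    """Check a trace against a simplified process model.

    The model is assumed to consist of allowed start activities, allowed
    end activities, and a set of allowed (activity, next_activity) pairs.
    """
    if not trace or trace[0] not in start or trace[-1] not in end:
        return False
    return all((a, b) in allowed_follows for a, b in zip(trace, trace[1:]))

def anomalous_traces(log, start, end, allowed_follows):
    """Return the traces in the log that do not fit the model."""
    return [t for t in log if not fits_model(t, start, end, allowed_follows)]

# Hypothetical model and log, for illustration only.
model_follows = {
    ("register", "review"),
    ("review", "approve"),
    ("review", "reject"),
    ("approve", "pay"),
}
log = [
    ["register", "review", "approve", "pay"],
    ["register", "review", "reject"],
    ["register", "approve", "pay"],  # skips the review step
]
flagged = anomalous_traces(log, {"register"}, {"pay", "reject"}, model_follows)
```

Only the third trace is flagged, because the pair ("register", "approve") is not allowed by the assumed model.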
Abstract: There are several approaches to solving the
Quantitative Structure-Activity Relationship (QSAR) problem.
These approaches are based either on statistical methods or on
predictive data mining. Among the statistical methods, one should
consider regression analysis, pattern recognition (such as cluster
analysis, factor analysis and principal components analysis) or partial
least squares. Predictive data mining techniques use either neural
networks, or genetic programming, or neuro-fuzzy knowledge. These
approaches have a low explanatory capability or none at all. This
paper attempts to establish a new approach in solving QSAR
problems using descriptive data mining. This way, the relationship
between the chemical properties and the activity of a substance
would be comprehensibly modeled.
Abstract: Mining frequent tree patterns has many useful
applications in XML mining, bioinformatics, network routing, etc.
Most of the frequent subtree mining algorithms (e.g. FREQT,
TreeMiner and CMTreeMiner) use the anti-monotone property in the
phase of candidate subtree generation. However, none of these
algorithms have verified the correctness of this property in tree
structured data. In this research it is shown that anti-monotonicity
does not generally hold when using weighted support in tree pattern
discovery. As a result, tree mining algorithms that are based on this
property would probably miss some of the valid frequent subtree
patterns in a collection of trees. In this paper, we investigate the
correctness of anti-monotone property for the problem of weighted
frequent subtree mining. In addition we propose W3-Miner, a new
algorithm for full extraction of frequent subtrees. The experimental
results confirm that W3-Miner finds some frequent subtrees that the
previously proposed algorithms are not able to discover.
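The failure of anti-monotonicity under weighted support can be illustrated with a small counterexample. The sketch below makes two loud assumptions: it uses itemsets instead of subtrees to keep the example short, and it defines weighted support as the average item weight multiplied by ordinary support, which is one common definition and not necessarily the one used in the paper. All names and numbers are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def weighted_support(itemset, transactions, weights):
    """Assumed definition: average item weight times ordinary support."""
    avg_weight = sum(weights[i] for i in itemset) / len(itemset)
    return avg_weight * support(itemset, transactions)

# Hypothetical data: item "a" is common but light, item "b" is heavy.
weights = {"a": 0.1, "b": 0.9}
transactions = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a"}]
min_wsup = 0.3

ws_a = weighted_support({"a"}, transactions, weights)
ws_ab = weighted_support({"a", "b"}, transactions, weights)

# The pattern {"a"} falls below the threshold while its superset
# {"a", "b"} exceeds it, so pruning on {"a"} would miss {"a", "b"}.
sub_is_frequent = ws_a >= min_wsup
super_is_frequent = ws_ab >= min_wsup
```

Under these assumptions the smaller pattern is infrequent while the larger one is frequent, which is exactly the situation in which candidate generation based on anti-monotone pruning loses valid patterns.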
Abstract: In this paper, a data mining model for SMEs that detects financial and operational risk indicators is presented. Identifying the risk factors by clarifying the relationships between the variables amounts to discovering knowledge from the financial and operational variables. This automatic, estimation-oriented information discovery process coincides with the definition of data mining. In forming the model, the target is a utilitarian model that is easy to understand, interpret and apply and that does not require a theoretical background; it is obtained by discovering the implicit relationships in the data and identifying the effect level of every factor. This paper is based on a project funded by The Scientific and Technological Research Council of Turkey (TUBITAK).
Abstract: Data clustering is an important data exploration
technique with many applications in data mining. The k-means
algorithm is well known for its efficiency in clustering large data
sets. However, this algorithm is suitable for spherical shaped clusters
of similar sizes and densities. The quality of the resulting clusters
decreases when the data set contains spherical shaped clusters with large
variance in sizes. In this paper, we introduce a competent procedure
to overcome this problem. The proposed method is based on shifting
the center of the large cluster toward the small cluster and recomputing
the membership of small cluster points. The experimental
results reveal that the proposed algorithm produces satisfactory
results.
Abstract: Searching for similar documents and document
management have an important place in text mining. One of the
most important parts of similar document research studies is the
process of classifying or clustering the documents. In this study, a
similar document search approach that includes a discussion of the
case of belonging to multiple categories (the multiple categories
problem) has been carried out. The proposed method, which is based on Fuzzy
Similarity Classification (FSC), has been compared with the Rocchio
algorithm and the naive Bayes method, which are widely used in text
mining. Empirical results show that the proposed method is quite
successful and can be applied effectively. For the second stage, a
multiple categories vector method based on information about how
frequently categories are seen together has been used.
Empirical results show that achievement is increased almost two
times when the proposed method is compared with the classical approach.
Abstract: Data mining is the exploration of
knowledge from the large sets of data generated as a result of
the various data processing activities. Frequent Pattern Mining
is a very important task in data mining. The previous
approaches applied to generate frequent sets generally adopt
candidate generation and pruning techniques for the
satisfaction of the desired objective. This paper shows how
the different approaches achieve the objective of frequent
mining along with the complexities required to perform the
job. This paper also looks at a hardware approach of cache
coherence to improve efficiency of the above process. The
process of data mining is helpful in generation of support
systems that can help in Management, Bioinformatics,
Biotechnology, Medical Science, Statistics, Mathematics,
Banking, Networking and other Computer related
applications. This paper proposes the use of both upward and
downward closure property for the extraction of frequent item
sets which reduces the total number of scans required for the
generation of Candidate Sets.
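The downward closure property mentioned above (every subset of a frequent itemset is itself frequent) is what allows candidate pruning. The following is a minimal Apriori-style sketch of that idea only; it does not implement the upward closure part or the scan reduction that the paper proposes, and the function name `apriori` and the toy data are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # One pass over the data to count the current candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (downward closure): drop any candidate that has an
        # infrequent subset of size k - 1.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical toy transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_count=2)
```

In this example the three-item candidate is pruned without being counted, because its subset {"milk", "butter"} is not frequent.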
Abstract: In large datasets, identifying exceptional or rare cases
with respect to a group of similar cases is considered a very significant
problem. The traditional problem (Outlier Mining) is to find
exceptional or rare cases in a dataset irrespective of the class label of
these cases; they are considered rare events with respect to the whole
dataset. In this research, we pose the problem of Class Outlier
Mining and a method to find those outliers. The general
definition of this problem is "given a set of observations with class
labels, find those that arouse suspicions, taking into account the
class labels". We introduce a novel definition of outlier, the Class
Outlier, and propose the Class Outlier Factor (COF), which measures
the degree of being a Class Outlier for a data object. Our work
includes a proposal of a new algorithm for mining Class
Outliers, experimental results on real-world datasets from various
domains, and finally a comparison study with other
related methods.
Abstract: Human Resource (HR) applications can be used to
provide fair and consistent decisions, and to improve the
effectiveness of decision making processes. Besides that, one of
the challenges for HR professionals is to manage organizational
talent, especially to ensure the right person for the right job at the
right time. For that reason, in this article, we attempt to describe
the potential to implement one of the talent management tasks, i.e.
identifying existing talent by predicting their performance, as an
HR application for talent management. This study suggests a
potential HR system architecture for talent forecasting by using
past experience knowledge, known as Knowledge Discovery in
Databases (KDD) or Data Mining. This article consists of three
main parts; the first part deals with the overview of HR
applications, the prediction techniques and application, the general
view of Data mining and the basic concept of talent management
in HRM. The second part is to understand the use of Data Mining
technique in order to solve one of the talent management tasks, and
the third part is to propose the potential HR system architecture for
talent forecasting.
Abstract: The problem of frequent itemset mining is considered in this paper. A new technique is proposed to generate frequent patterns in large databases without time-consuming candidate generation. The technique focuses on transactions instead of itemsets: the algorithm takes the intersection between one transaction and the other transactions and computes the maximum shared items between transactions, instead of creating itemsets and computing their frequency. Applied to real-life transactions, including some consumption data taken from real life, the technique achieves significant efficiency in generating association rules from databases.
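One possible reading of the transaction-intersection idea is sketched below. This is an interpretation under stated assumptions, not the paper's algorithm: shared item sets are obtained from pairwise intersections of transactions, and each distinct intersection is then counted against the database. The function name `shared_itemsets` and the toy data are hypothetical. Pairwise intersections do not enumerate every frequent itemset, only those that appear as the common part of at least two transactions.

```python
from itertools import combinations

def shared_itemsets(transactions, min_count):
    """Count the item sets shared by pairs of transactions.

    Returns {itemset: count} for every non-empty pairwise intersection
    that is contained in at least min_count transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    # The maximal shared items of each pair form the candidate patterns.
    candidates = {a & b for a, b in combinations(transactions, 2) if a & b}
    result = {}
    for c in candidates:
        count = sum(1 for t in transactions if c <= t)
        if count >= min_count:
            result[c] = count
    return result

# Hypothetical toy transactions.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"d"}]
patterns = shared_itemsets(db, min_count=2)
```

No candidate itemsets are generated level by level here; the patterns come directly from what transactions have in common.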
Abstract: Generator of hypotheses is a new method for data mining. It makes it possible to classify the source data automatically and produces a particular enumeration of patterns. A pattern is an expression (in a certain language) describing facts in a subset of facts. The goal is to describe the source data via patterns and/or IF...THEN rules. The evaluation criteria used are deterministic (not probabilistic). The search results are trees, a form that is easy to comprehend and interpret. Generator of hypotheses uses a very effective algorithm based on the theory of monotone systems (MS) named MONSA (MONotone System Algorithm).
Abstract: In data mining, association rules are used to find
associations between the different items of a transaction
database. As data are collected and stored, rules of value can be found
through association rules, which can be applied to help managers
execute marketing strategies and establish sound market frameworks.
This paper aims to use Fuzzy Frequent Pattern growth (FFP-growth)
to derive fuzzy association rules. At first, we apply fuzzy
partition methods and decide a membership function of quantitative
value for each transaction item. Next, we implement FFP-growth
to deal with the process of data mining. In addition, in order to
understand the impact of the Apriori algorithm and the FFP-growth algorithm
on the execution time and the number of generated association
rules, experiments are performed using different sizes of
databases and thresholds. Lastly, the experimental results show that the FFP-growth
algorithm is more efficient than other existing methods.
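The fuzzy partition step can be sketched with triangular membership functions. This is an illustrative assumption: the paper does not state which membership functions or region boundaries it uses, so the regions in `REGIONS`, the function names and the sample transactions are all hypothetical. FFP-growth itself is not implemented here; the sketch only shows how a quantitative value becomes fuzzy membership degrees and how a fuzzy support could be accumulated from them.

```python
def triangular(x, left, peak, right):
    """Triangular membership degree of x for the region (left, peak, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Assumed fuzzy regions for purchased quantities (hypothetical boundaries).
REGIONS = {
    "low": (-4, 1, 6),
    "middle": (1, 6, 11),
    "high": (6, 11, 16),
}

def fuzzify(quantity):
    """Map a quantity to its membership degree in every fuzzy region."""
    return {name: triangular(quantity, *abc) for name, abc in REGIONS.items()}

def fuzzy_support(transactions, item, region):
    """Sum of membership degrees of (item, region), divided by database size."""
    total = sum(fuzzify(t[item])[region] for t in transactions if item in t)
    return total / len(transactions)

# Hypothetical transactions mapping item -> purchased quantity.
tx = [{"milk": 3}, {"milk": 6}, {"bread": 8}]
milk_middle = fuzzy_support(tx, "milk", "middle")
```

A quantity of 3 belongs partly to "low" and partly to "middle", so it contributes a fractional amount to the support of each region rather than a full count.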
Abstract: This paper investigates the problem of sampling from transactional data streams. We introduce CFISDS, a content-based sampling algorithm that works on a landmark window model of data streams and preserves a more informed sample in the sample space. The algorithm, which is based on closed frequent itemset mining, first initiates a concept lattice using initial data and then updates the lattice structure using an incremental mechanism. The incremental mechanism inserts, updates and deletes nodes in the concept lattice in a batch manner. The algorithm extracts the final samples on user demand. Experimental results show the accuracy of CFISDS on synthetic and real datasets, although CFISDS is not faster than existing sampling algorithms such as Z and DSS.
Abstract: The Generalized Center String (GCS) problem is
generalized from the Common Approximate Substring problem
and the Common Substring problem. GCS is known to be
NP-hard, and the difficulty lies in the explosion of
potential candidates. The task is to find the longest center
strings without knowing in advance which sequences may not
contain any motif in a particular biological gene
process. GCS can be solved by frequent pattern mining techniques
and is known to be fixed-parameter tractable for a
fixed input sequence length and symbol set size. Efficient
methods known as Bpriori algorithms can solve GCS with
reasonable time/space complexities. The Bpriori 2 and Bpriori
3-2 algorithms have been proposed for center strings of any length
and the positions of all their instances in input sequences. In this
paper, we reduce the time/space complexity of the Bpriori
algorithm with the Constrained Based Frequent Pattern mining
(CBFP) technique, which integrates the ideas of Constraint
Based Mining and FP-tree mining. The CBFP mining technique
solves the GCS problem not only for center strings of any
length, but also for the positions of all their mutated copies
in the input sequences. The CBFP mining technique constructs a
trie-like FP-tree to represent the mutated copies of center
strings of any length, along with constraints to restrain the
growth of the consensus tree. The complexity analysis of the
Constrained Based FP mining technique and the Bpriori
algorithm is based on the worst case and the average case.
The algorithm's correctness compared with the
Bpriori algorithm is shown using artificial data.
Abstract: The self-organizing map (SOM) is a well-known data reduction technique used in data mining. Data visualization can reveal structure in data sets that is otherwise hard to detect from raw data alone. However, interpretation through visual inspection is prone to errors and can be very tedious. There are several techniques for the automatic detection of clusters of code vectors found by SOMs, but they generally do not take into account the distribution of code vectors; this may lead to unsatisfactory clustering and poor definition of cluster boundaries, particularly where the density of data points is low. In this paper, we propose the use of a generic particle swarm optimization (PSO) algorithm for finding cluster boundaries directly from the code vectors obtained from SOMs. The application of our method to unlabeled call data for a mobile phone operator demonstrates its feasibility. The PSO algorithm utilizes the U-matrix of SOMs to determine cluster boundaries; the results of this novel automatic method correspond well to boundary detection through visual inspection of code vectors and the k-means algorithm.
Abstract: Clustering techniques have received attention in many areas including engineering, medicine, biology and data mining. The purpose of clustering is to group together data points that are close to one another. The K-means algorithm is one of the most widely used techniques for clustering. However, K-means has two shortcomings: it depends on the initial state and converges to local optima, and global solutions of large problems cannot be found with a reasonable amount of computational effort. Many studies on clustering have been carried out to overcome the local optima problem. This paper presents an efficient hybrid evolutionary optimization algorithm based on combining Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO), called PSO-ACO, for optimally clustering N objects into K clusters. The new PSO-ACO algorithm is tested on several data sets, and its performance is compared with those of ACO, PSO and K-means clustering. The simulation results show that the proposed evolutionary optimization algorithm is robust and suitable for handling data clustering.
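The dependency of K-means on the initial state can be seen in a small example. The sketch below is plain Lloyd-style K-means, not the PSO-ACO hybrid from the paper; the function names and the four data points are hypothetical. Two different initial center choices on the same data converge to different solutions with different sums of squared errors.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, max_iter=100):
    """Plain K-means from given initial centers; returns (centers, sse)."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse

# Four corners of a wide rectangle (hypothetical data).
data = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Initial centers on the left and right sides find the natural split.
_, sse_good = kmeans(data, [(0, 0), (4, 0)])
# Initial centers in the middle get stuck splitting top from bottom.
_, sse_bad = kmeans(data, [(2, 0), (2, 1)])
```

The second run stops at a local optimum with a much larger error, which is the kind of outcome that global search heuristics such as PSO or ACO are meant to avoid.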
Abstract: Application of Information Technology (IT) has
revolutionized the functioning of business all over the world. Its
impact has been felt mostly among the information-dependent
industries. Tourism is one such industry. The conceptual
framework in this study represents an innovative travel
information search system on mobile devices, which is used as a
tool to deliver travel information (such as hotels, restaurants, tourist
attractions and souvenir shops) to each user. Travelers are
segmented with a data mining technique based on tourists'
behavior patterns, which are then matched with tourism products and
services. This system innovation is designed for incremental
knowledge learning. It is a marketing strategy to support businesses in
responding to travelers' demand effectively.