Abstract: Cluster analysis divides data into groups that are
meaningful, useful, or both. Analysis of biological data is creating a
new generation of epidemiologic, prognostic, diagnostic and
treatment modalities. Clustering of protein sequences is one of the
current research topics in the field of computer science. A linear
relation is valuable in rule discovery for a given data set, such as "if value
X goes up 1, value Y will go down 3". The classical linear
regression models the linear relation of two sequences perfectly.
However, if we need to cluster a large repository of protein sequences
into groups where sequences have a strong linear relationship with
each other, it is prohibitively expensive to compare sequences one by
one. In this paper, we propose a new technique named General
Regression Model Technique Clustering Algorithm (GRMTCA) to
benignly handle the problem of linear sequences clustering. GRMT
gives a measure, GR*, to tell the degree of linearity of multiple
sequences without having to compare each pair of them.
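To make the cost argument concrete, the sketch below scores the linearity of every pair of sequences with the ordinary coefficient of determination (R squared) of a least-squares line. This is the pairwise baseline the abstract calls prohibitively expensive, not the GR* measure, whose definition is not given here. The function names are illustrative, and the sequences are assumed to be already encoded as equal-length lists of numbers.

```python
from itertools import combinations


def r_squared(x, y):
    """R^2 of the least-squares line y = a + b*x for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant sequence carries no linear information
    return (sxy * sxy) / (sxx * syy)


def pairwise_linearity(sequences):
    """Score every pair of sequences: n*(n-1)/2 regressions for n sequences."""
    return {
        (i, j): r_squared(sequences[i], sequences[j])
        for i, j in combinations(range(len(sequences)), 2)
    }


# Hypothetical data: y drops by 3 whenever x rises by 1; z is unrelated.
x = [1, 2, 3, 4]
y = [10, 7, 4, 1]
z = [5, 1, 4, 2]
scores = pairwise_linearity([x, y, z])
```

With three sequences there are three comparisons; the count grows quadratically with the number of sequences, which is the cost a multi-sequence measure is meant to avoid.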
Abstract: Utilization of diverse germplasm is needed to enhance
the genetic diversity of cultivars. The objective of this study was to
evaluate the genetic relationships of 98 alfalfa germplasm accessions
using morphological traits and SSR markers. Of the 98 tested
populations, 81 were local populations originating in Europe and 17 were
introduced from the USA, Australia, New Zealand and Canada. Three primers
generated 67 polymorphic bands. The average polymorphic
information content (PIC) was very high (> 0.90) across all three
primer combinations used. Cluster analysis using the Unweighted Pair Group
Method with Arithmetic Means (UPGMA) and Jaccard's coefficient
grouped the accessions into 2 major clusters with 4 sub-clusters, with
no correlation between genetic and morphological diversity. The SSR
analysis clearly indicated that even with three polymorphic primers,
reliable estimation of genetic diversity could be obtained.
Abstract: Today, the preferences and participation of TD groups such as the elderly and the disabled are still lacking in transportation planning decisions, and their reactions to certain types of policies are not well known. Thus, a clear methodology is needed. This study aimed to develop a method for extracting the preferences of the disabled, to be used in the policy-making stage and to guide future estimations. The method combines cluster analysis and data filtering, using data from Arao city (Japan). It proceeds as follows: the TD group is defined with the cluster analysis tool; their travel preferences are tabulated from the household surveys by policy variable-impact pairs, by zones, and by trip purposes; and the final outcome is the preference probabilities of the disabled. The preferences vary by trip purpose. For work trips, they are accessibility and transit system quality policies, with the accompanying impacts of modal shifts towards public mode use, decreasing travel costs, and a trip rate increase. For social trips, they are the same accessibility and transit system policies leading to the same mode shift impact, together with the travel quality policy area leading to a trip rate increase. These results indicate which policies to focus on and can be used for scenario generation in models, or for any other planning purpose, as a decision support tool.
Abstract: Water quality and freshwater fish diversity at nine
waterfalls in Khao Luang National Park, Thailand, were examined.
Streams were shallow and fast flowing, with clear water and rocky and
sandy substrate. The mean water quality values of waterfalls at Khao Luang
National Park were as follows: pH 7.50, air temperature 24.27 °C,
water temperature 26.37 °C, dissolved oxygen 7.88 mg/l, hardness
4.44-21.33 mg/l, alkalinity 3.55-11.88 mg/l (as CaCO3). Twenty fish
species belonging to nine families were found at Khao Luang National
Park. A cluster analysis of water quality at Khao Luang National
Park revealed that the waterfalls were
divided into two groups: A and B. Group A was composed of two
waterfalls (i.e. Aie Kaew and Wangmaipak) that flowed to the Gulf of
Thailand side. Group B was composed of seven waterfalls (i.e. Promlok,
Kalom, Nuafa, Suankun, Soidaw, Suanhai, and Thapae) that flowed to
the Andaman Sea side (Fig. 2). The Cyprinids represented the major
species in all the waterfalls, comprising 45%.
Abstract: A study of tree growth for four species groups of commercial timber in the tropical rainforest of Koh Kong province, Cambodia, is described. The simulation for these four groups was successfully developed at 5-year intervals through year 60. Data were obtained from twenty permanent sample plots over a period of thirteen years. The aim of this study was to develop a stand table simulation system of tree growth by species group. Five steps were involved in developing the tree growth simulation: aggregating the tree species into meaningful groups using cluster analysis; allocating the trees to diameter classes by species group; observing the diameter movement of each species group; calculating the diameter growth rate, mortality rate and recruitment rate using mathematical formulas; and creating the simulation equation by combining those parameters. Results showed differences in diameter growth among species groups.
Abstract: The aim of this work was to detect genetic variability among a set of 40 castor genotypes using 8 RAPD markers. Amplification of genomic DNA of the 40 genotypes, using RAPD analysis, yielded 66 fragments, with an average of 8.25 polymorphic fragments per primer. The number of amplified fragments ranged from 3 to 13, with the size of amplicons ranging from 100 to 1200 bp. Polymorphic information content (PIC) values ranged from 0.556 to 0.895 with an average of 0.784, and diversity index (DI) values ranged from 0.621 to 0.896 with an average of 0.798. A dendrogram based on hierarchical cluster analysis using the UPGMA algorithm was prepared; the analyzed genotypes were grouped into two main clusters, and only two genotypes could not be distinguished. Knowledge of the genetic diversity of castor can be used in future breeding programs for increased oil production for industrial uses.
Abstract: The paper contains a review of the literature in terms of the critical analysis of methodologies of university ranking systems. Furthermore, the initiatives supported by the European Commission (U-Map, U-Multirank) and the CHE Ranking are described. Special attention is paid to the tendencies in the development of ranking systems. According to the author, the ranking organizations should abandon the classic form of ranking, namely a hierarchical ordering of universities from "the best" to "the worst". In the empirical part of this paper, using one of the methods of cluster analysis called k-means clustering, the author presents classifications of the top universities from the Shanghai Jiao Tong University's (SJTU) Academic Ranking of World Universities (ARWU).
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, a density based clustering algorithm (DCBRD) is presented, relying on knowledge acquired from the data by dividing the data space into overlapped regions. The proposed algorithm discovers arbitrary shaped clusters, requires no input parameters and uses the same definitions as the DBSCAN algorithm. We performed an experimental evaluation of its effectiveness and efficiency, and compared the results with those of DBSCAN. The results of our experiments demonstrate that the proposed algorithm is significantly efficient in discovering clusters of arbitrary shape and size.
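For readers unfamiliar with the DBSCAN definitions referred to above, here is a minimal sketch of plain DBSCAN with a brute-force neighbourhood query. It is the baseline algorithm, not DCBRD; the two inputs `eps` and `min_pts` are exactly the kind of user-supplied parameters the abstract says the proposed algorithm avoids. The sample points and parameter values are hypothetical.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The brute-force query makes this quadratic in the number of points; practical implementations use a spatial index for the neighbourhood search.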
Abstract: Cluster analysis is the name given to a diverse collection of techniques that can be used to classify objects (e.g. individuals, quadrats, species, etc.). While Kohonen's Self-Organizing Feature Map (SOFM) or Self-Organizing Map (SOM) networks have been successfully applied as a classification tool to various problem domains, including speech recognition, image data compression, image or character recognition, robot control and medical diagnosis, their potential as a robust substitute for clustering analysis remains relatively unresearched. SOM networks combine competitive learning with dimensionality reduction by smoothing the clusters with respect to an a priori grid and provide a powerful tool for data visualization. In this paper, SOM is used to create a toroidal mapping of a two-dimensional lattice to perform cluster analysis on the results of a chemical analysis of wines produced in the same region in Italy but derived from three different cultivars, referred to as the "wine recognition data" located in the University of California-Irvine database. The results are encouraging and it is believed that SOM would make an appealing and powerful decision-support system tool for clustering tasks and for data visualization.
Abstract: Measurement of competitiveness between countries or regions is an important topic of many economic analyses and scientific papers. In the European Union (EU), there is no mainstream approach to evaluating and measuring competitiveness. There are many opinions and methods of measurement and evaluation of competitiveness between states or regions at national and European level. The methods differ in the structure of the competitiveness indicators used and in the ways they are processed. The aim of the paper is to analyze the main sources of competitive potential of the EU Member States with the help of Factor analysis (FA) and to classify the EU Member States into homogeneous units (clusters) according to the similarity of selected indicators of competitiveness factors by Cluster analysis (CA) in the reference years 2000 and 2011. The theoretical part of the paper is devoted to the fundamental bases of competitiveness and the methodology of the FA and CA methods. The empirical part of the paper deals with the evaluation of competitiveness factors in the EU Member States and the comparison of the evaluated countries by cluster analysis.
Abstract: The prevalence of non-organic constipation differs
from country to country and the reliability of the estimated rates is
uncertain. Moreover, the clinical relevance of subdividing the
heterogeneous functional constipation disorders into pre-defined
subgroups is largely unknown. Aim: to estimate the prevalence of
constipation in a population-based sample and determine whether
clinical subgroups can be identified. An age and gender stratified
sample population from 5 Italian cities was evaluated using a
previously validated questionnaire. Data mining by cluster analysis
was used to determine constipation subgroups. Results: 1,500
complete interviews were obtained from 2,083 contacted households
(72%). Self-reported constipation correlated poorly with symptom-based
constipation, found in 496 subjects (33.1%). Cluster analysis
identified four constipation subgroups which correlated to subgroups
identified according to pre-defined symptom criteria. Significant
differences in socio-demographics and lifestyle were observed
among subgroups.
Abstract: The concentrations of As, Hg, Co, Cr and Cd were
tested for each soil sample, and their spatial patterns were analyzed
by the semivariogram approach of geostatistics and geographical
information system technology. Multivariate statistical approaches
(principal component analysis and cluster analysis) were used to
identify heavy metal sources and their spatial pattern. Principal
component analysis coupled with correlation between heavy metals
showed that primary inputs of As, Hg and Cd were due to
anthropogenic sources, while Co and Cr were associated with pedogenic
factors. Ordinary kriging was carried out to map the spatial patterns of
heavy metals. The high pollution sources evaluated were related to the
use of urban and industrial wastewater. The results of this study
are helpful for risk assessment of environmental pollution and for decision
making on industrial adjustment and soil pollution remediation.
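The semivariogram approach mentioned above can be sketched as follows. The code computes the classical empirical semivariogram, half the mean squared difference between sample values for pairs of locations whose separation falls in the same distance bin. The coordinates, concentrations, bin width and function name are hypothetical; fitting a model to the result and kriging are not shown.

```python
import math


def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Return (mean pair distance, semivariance, pair count) per distance bin."""
    sq_sums = [0.0] * n_lags
    dist_sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)  # index of the distance bin for this pair
            if k < n_lags:
                sq_sums[k] += (values[i] - values[j]) ** 2
                dist_sums[k] += h
                counts[k] += 1
    return [
        (dist_sums[k] / counts[k], sq_sums[k] / (2 * counts[k]), counts[k])
        for k in range(n_lags)
        if counts[k] > 0
    ]


# Hypothetical transect: four sampling sites and a made-up concentration at each.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1.0, 2.0, 4.0, 7.0]
gamma = empirical_semivariogram(coords, values, lag_width=1.5, n_lags=3)
```

A semivariance that rises with distance, as in this toy transect, indicates spatial correlation: nearby samples are more alike than distant ones.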
Abstract: The MATCH project [1] entails the development of an
automatic diagnosis system that aims to support treatment of colon
cancer diseases by discovering mutations that occur in tumour
suppressor genes (TSGs) and contribute to the development of
cancerous tumours. The constitution of the system is based on a)
colon cancer clinical data and b) biological information that will be
derived by data mining techniques from genomic and proteomic
sources. The core mining module will consist of the popular, well-tested
hybrid feature extraction methods, and new combined
algorithms, designed especially for the project. Elements of rough
sets, evolutionary computing, cluster analysis, self-organization maps
and association rules will be used to discover the annotations
between genes, and their influence on tumours [2]-[11].
The methods used to process the data have to address their high
complexity, potential inconsistency and problems of dealing with the
missing values. They must integrate all the useful information
necessary to solve the expert's question. For this purpose, the system
has to learn from data, or let a domain specialist interactively
specify, the part of the knowledge structure it needs to answer a
given query. The program should also take into account the
importance/rank of the particular parts of data it analyses, and adjust
the algorithms used accordingly.
Abstract: This paper makes a contribution to the on-going
debate on conceptualization and lexicalization of cutting and
breaking (C&B) verbs by discussing data from Telugu, a language of
India belonging to the Dravidian family. Five native Telugu speakers'
verbalizations of agentive actions depicted in 43 short video-clips
were analyzed. It was noted that verbalization of C&B events in
Telugu requires formal units such as simple lexical verbs, explicator
compound verbs, and other complex verb forms. The properties of
the objects involved, the kind of instruments used, and the manner of
action had differential influence on the lexicalization patterns.
Further, it was noted that all the complex verb forms encode 'result'
and 'cause' sub-events in that order. Due to the polysemy associated
with some of the verb forms, our data does not support the
straightforward bipartition of this semantic domain.
Abstract: Clustering is an interesting data mining topic
that can be applied in many fields. Recently, the problem of cluster
analysis has been formulated as a problem of nonsmooth, nonconvex optimization,
and an algorithm for solving the cluster analysis problem
based on nonsmooth optimization techniques has been developed. This
optimization problem has a number of characteristics that make it
challenging: it has many local minima, the optimization variables
can be either continuous or categorical, and there are no exact
analytical derivatives. In this study we show how to apply a particular
class of optimization methods known as pattern search methods
to address these challenges. These methods do not explicitly use
derivatives, an important feature that has not been addressed in
previous studies. Results of numerical experiments are presented
which demonstrate the effectiveness of the proposed method.
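The idea of a derivative-free pattern search can be illustrated with compass search, the simplest member of that class, applied to a small nonsmooth clustering objective. This is a sketch under stated assumptions, not the method evaluated in the paper: the one-dimensional data, the two-center objective, the starting point and the function names are all hypothetical.

```python
def compass_search(f, x0, step=1.0, tol=1e-6, max_iter=10_000):
    """Minimize f by polling +/- step along each coordinate; no derivatives used."""
    x = list(x0)
    fx = f(x)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                f_trial = f(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
                    break
            if improved:
                break
        if not improved:
            step /= 2  # no polling direction helped: refine the pattern
    return x, fx


# Hypothetical 1-D data with two obvious groups.
data = [0, 1, 2, 10, 11, 12]


def clustering_objective(centers):
    """Sum of squared distances to the nearest center (nonsmooth because of min)."""
    return sum(min((p - c) ** 2 for c in centers) for p in data)


centers, value = compass_search(clustering_objective, [3.0, 8.0])
```

Starting from centers 3 and 8, the search only compares objective values and settles on the two group means. Like any local method, it can stop at a local minimum for other starting points.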
Abstract: Clustering is a very well known technique in data mining. One of the most widely used clustering techniques is the k-means algorithm. Solutions obtained from this technique depend on the initialization of cluster centers, and the final solution converges to a local minimum. In order to overcome the shortcomings of the k-means algorithm, this paper proposes a hybrid evolutionary algorithm based on the combination of the PSO, SA and k-means algorithms, called PSO-SA-K, which can find a better cluster partition. The performance is evaluated on several benchmark data sets. The simulation results show that the proposed algorithm outperforms previous approaches, such as PSO, SA and k-means, for the partitional clustering problem.
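The dependence on initialization described above is easy to reproduce. The sketch below runs plain k-means (Lloyd's iterations) from two different sets of starting centers on the same four hypothetical points; it illustrates the weakness that motivates hybrid methods and does not implement PSO-SA-K.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm from given initial centers; returns (centers, SSE)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
    return centers, sse


# Hypothetical data: corners of a 4 x 1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

poor_centers, poor_sse = kmeans(points, [(2, 0), (2, 1)])      # splits top/bottom
good_centers, good_sse = kmeans(points, [(0, 0.5), (4, 0.5)])  # splits left/right
```

Both runs converge immediately, but the first is stuck in a local minimum whose sum of squared errors is sixteen times that of the second.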
Abstract: This paper introduces new algorithms (Fuzzy relative
of the CLARANS algorithm FCLARANS and Fuzzy c Medoids
based on randomized search FCMRANS) for fuzzy clustering of
relational data. Unlike the existing fuzzy c-medoids algorithm (FCMdd),
in which the within cluster dissimilarity of each cluster is minimized
in each iteration by recomputing new medoids given current
memberships, FCLARANS minimizes the same objective function
minimized by FCMdd by changing current medoids in such a way
that the sum of the within cluster dissimilarities is minimized.
Computing new medoids may be affected by noise because outliers
may join the computation of medoids while the choice of medoids in
FCLARANS is dictated by the location of a predominant fraction of
points inside a cluster and, therefore, it is less sensitive to the
presence of outliers. In FCMRANS the step of computing new
medoids in FCMdd is modified to be based on randomized search.
Furthermore, a new initialization procedure is developed that adds
randomness to the initialization procedure used with FCMdd. Both
FCLARANS and FCMRANS are compared with the robust and
linearized version of fuzzy c-medoids (RFCMdd). Experimental
results with different samples of the Reuters-21578, Newsgroups
(20NG) and generated datasets with noise show that FCLARANS is
more robust than both RFCMdd and FCMRANS. Finally, both
FCMRANS and FCLARANS are more efficient and their outputs
are almost the same as those of RFCMdd in terms of classification
rate.
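The baseline step the abstract contrasts against, recomputing medoids from the current memberships, can be sketched as below. This follows one common formulation of fuzzy c-medoids on a dissimilarity matrix and is offered as an assumption about the baseline; it is neither FCLARANS nor FCMRANS. The fuzzifier `m`, the toy data and the function names are illustrative.

```python
def memberships(D, medoids, m=2.0):
    """Fuzzy membership U[j][i] of object j in the cluster of medoid i."""
    U = []
    for j in range(len(D)):
        d = [D[j][v] for v in medoids]
        zeros = [dk == 0 for dk in d]
        if any(zeros):
            # Object coincides with a medoid: full membership in that cluster.
            row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
        else:
            w = [(1.0 / dk) ** (1.0 / (m - 1.0)) for dk in d]
            total = sum(w)
            row = [wk / total for wk in w]
        U.append(row)
    return U


def update_medoids(D, U, n_clusters, m=2.0):
    """For each cluster, pick the object minimizing the weighted dissimilarity sum."""
    n = len(D)
    medoids = []
    for i in range(n_clusters):
        cost = lambda k: sum((U[j][i] ** m) * D[k][j] for j in range(n))
        medoids.append(min(range(n), key=cost))
    return medoids


def fcmdd(D, medoids, m=2.0, max_iter=100):
    """Alternate membership and medoid updates until the medoids stop changing."""
    medoids = list(medoids)
    for _ in range(max_iter):
        U = memberships(D, medoids, m)
        new_medoids = update_medoids(D, U, len(medoids), m)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, memberships(D, medoids, m)


# Hypothetical relational data: pairwise distances between six 1-D objects.
xs = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in xs] for a in xs]
medoids, U = fcmdd(D, [1, 4])
```

Because every object, including any outlier, contributes to the cost in `update_medoids`, this step is where noise can pull a medoid away, which is the sensitivity the abstract attributes to recomputing medoids.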
Abstract: Droughts are complex, natural hazards that, to a
varying degree, affect some parts of the world every year. The range
of drought impacts is related to drought occurring in different stages
of the hydrological cycle, and usually different types of droughts,
such as meteorological, agricultural, hydrological, and socio-economic,
are distinguished. Streamflow drought was analyzed by
the method of truncation level (at 70% level) on daily discharges
measured in 54 hydrometric stations in southwestern Iran. Frequency
analysis was carried out for annual maximum series (AMS) of
drought deficit volume and duration series. Some factors including
physiographic, climatic, geologic, and vegetation cover were studied
as influential factors in the regional analysis. According to the results
of factor analysis, the six most effective factors were identified as area,
rainfall from December to February, the percent of area with
Normalized Difference Vegetation Index (NDVI)
Abstract: This research is a comparative study of complexity, as a multidimensional concept, in the context of streetscape composition in Algeria and Japan. Eighty streetscape visual arrays were collected and then presented to 20 participants with different cultural backgrounds, to be categorized and classified according to their degrees of complexity. Three analysis methods were used in this research: cluster analysis, the ranking method and the Hayashi Quantification method (Method III). The results showed that complexity, disorder, irregularity and disorganization are often conflicting concepts in the urban context. Algerian daytime streetscapes seem to be balanced, ordered and regular, and Japanese daytime streetscapes seem to be unbalanced, regular and vivid. Variety, richness and irregularity with some aspects of order and organization seem to characterize Algerian night streetscapes. Japanese night streetscapes seem to be more related to balance, regularity, order and organization with some aspects of confusion and ambiguity. Complexity mainly characterized Algerian avenues with green infrastructure. Therefore, for Japanese participants, Japanese traditional night streetscapes were complex. And for foreigners, Algerian and Japanese avenue nightscapes were the most complex visual arrays.
Abstract: The paper deals with an application of quantitative analysis, the Data Envelopment Analysis (DEA) method, to performance evaluation of the European Union Member States in the reference years 2000 and 2011. The main aim of the paper is to measure efficiency changes over the reference years, to analyze the level of productivity in individual countries based on the DEA method, and to classify the EU Member States into homogeneous units (clusters) according to efficiency results. The theoretical part is devoted to the fundamental basis of performance theory and the methodology of DEA. The empirical part is aimed at measuring the degree of productivity and the level of efficiency changes of the evaluated countries using the basic DEA model, the CCR CRS model, and a specialized DEA approach, the Malmquist Index, which measures the change of technical efficiency and the movement of the production possibility frontier. Here, the DEA method becomes a suitable tool for setting a competitive/uncompetitive position of each country because not just one factor is evaluated, but a set of different factors that determine the degree of economic development.