Clustering Protein Sequences with Tailored General Regression Model Technique

Cluster analysis divides data into groups that are meaningful, useful, or both. Analysis of biological data is creating a new generation of epidemiologic, prognostic, diagnostic and treatment modalities. Clustering of protein sequences is one of the current research topics in the field of computer science. Linear relation is valuable in rule discovery for a given data, such as if value X goes up 1, value Y will go down 3", etc. The classical linear regression models the linear relation of two sequences perfectly. However, if we need to cluster a large repository of protein sequences into groups where sequences have strong linear relationship with each other, it is prohibitively expensive to compare sequences one by one. In this paper, we propose a new technique named General Regression Model Technique Clustering Algorithm (GRMTCA) to benignly handle the problem of linear sequences clustering. GRMT gives a measure, GR*, to tell the degree of linearity of multiple sequences without having to compare each pair of them.

Using Morphological and Microsatellite (SSR) Markers to Assess the Genetic Diversity in Alfalfa (Medicago sativa L.)

Utilization of diverse germplasm is needed to enhance the genetic diversity of cultivars. The objective of this study was to evaluate the genetic relationships of 98 alfalfa germplasm accessions using morphological traits and SSR markers. From the 98 tested populations, 81 were locals originating in Europe, 17 were introduced from USA, Australia, New Zealand and Canada. Three primers generated 67 polymorphic bands. The average polymorphic information content (PIC) was very high (> 0.90) over all three used primer combinations. Cluster analysis using Unweighted Pair Group Method with Arithmetic Means (UPGMA) and Jaccard´s coefficient grouped the accessions into 2 major clusters with 4 sub-clusters with no correlation between genetic and morphological diversity. The SSR analysis clearly indicated that even with three polymorphic primers, reliable estimation of genetic diversity could be obtained.

Info-participation of the Disabled Using the Mixed Preference Data in Improving Their Travel Quality

Today, the preferences and participation of the TD groups such as the elderly and disabled is still lacking in decision-making of transportation planning, and their reactions to certain type of policies are not well known. Thus, a clear methodology is needed. This study aimed to develop a method to extract the preferences of the disabled to be used in the policy-making stage that can also guide to future estimations. The method utilizes the combination of cluster analysis and data filtering using the data of the Arao city (Japan). The method is a process that follows: defining the TD group by the cluster analysis tool, their travel preferences in tabular form from the household surveys by policy variableimpact pairs, zones, and by trip purposes, and the final outcome is the preference probabilities of the disabled. The preferences vary by trip purpose; for the work trips, accessibility and transit system quality policies with the accompanying impacts of modal shifts towards public mode use as well as the decreasing travel costs, and the trip rate increase; for the social trips, the same accessibility and transit system policies leading to the same mode shift impact, together with the travel quality policy area leading to trip rate increase. These results explain the policies to focus and can be used in scenario generation in models, or any other planning purpose as decision support tool.

Water Quality and Freshwater Fish Diversity at Khao Luang National Park, Thailand

Water quality and freshwater fish diversity from nine waterfalls at Khao Luang National Park, Thailand was examined. Streams were shallow, fast flowing with clear water and rocky and sandy substrate. The mean water quality of waterfalls at Khao Luang National Park were as following pH 7.50, air temperature 24.27 °C, water temperature 26.37 °C, dissolved oxygen 7.88 mg/l, hardness 4.44-21.33 mg/l, alkalinity 3.55-11.88 mg/(as CaCO3). Twenty fish species were found at Khao Luang National Park belonging to nine families. A cluster analysis of water quality at Khao Luang National Park revealed that waterfalls at Khao Luang National Park were divided into two groups: A and B. Group A composed of two waterfalls (i.e. Aie Kaew and Wangmaipak) that flew to the Gulf of Thailand side. Group B composed of seven waterfalls (i.e. Promlok, Kalom, Nuafa, Suankun, Soidaw, Suanhai, and Thapae) that flew to the Andaman Sea side (Fig. 2) .The Cyprinids represented the major species in all the waterfalls comprising of 45%.

Forest Growth Simulation: Tropical Rain Forest Stand Table Projection

The study on the tree growth for four species groups of commercial timber in Koh Kong province, Cambodia-s tropical rainforest is described. The simulation for these four groups had been successfully developed in the 5-year interval through year-60. Data were obtained from twenty permanent sample plots in the duration of thirteen years. The aim for this study was to develop stand table simulation system of tree growth by the species group. There were five steps involved in the development of the tree growth simulation: aggregate the tree species into meaningful groups by using cluster analysis; allocate the trees in the diameter classes by the species group; observe the diameter movement of the species group. The diameter growth rate, mortality rate and recruitment rate were calculated by using some mathematical formula. Simulation equation had been created by combining those parameters. Result showed the dissimilarity of the diameter growth among species groups.

RAPD Analysis of Genetic Diversity of Castor Bean

The aim of this work was to detect genetic variability among the set of 40 castor genotypes using 8 RAPD markers. Amplification of genomic DNA of 40 genotypes, using RAPD analysis, yielded in 66 fragments, with an average of 8.25 polymorphic fragments per primer. Number of amplified fragments ranged from 3 to 13, with the size of amplicons ranging from 100 to 1200 bp. Values of the polymorphic information content (PIC) value ranged from 0.556 to 0.895 with an average of 0.784 and diversity index (DI) value ranged from 0.621 to 0.896 with an average of 0.798. The dendrogram based on hierarchical cluster analysis using UPGMA algorithm was prepared and analyzed genotypes were grouped into two main clusters and only two genotypes could not be distinguished. Knowledge on the genetic diversity of castor can be used for future breeding programs for increased oil production for industrial uses.

University Ranking Systems – From League Table to Homogeneous Groups of Universities

The paper contains a review of the literature in terms of the critical analysis of methodologies of university ranking systems. Furthermore, the initiatives supported by the European Commission (U-Map, U-Multirank) and CHE Ranking are described. Special attention is paid to the tendencies in the development of ranking systems. According to the author, the ranking organizations should abandon the classic form of ranking, namely a hierarchical ordering of universities from “the best" to “the worse". In the empirical part of this paper, using one of the method of cluster analysis called k-means clustering, the author presents university classifications of the top universities from the Shanghai Jiao Tong University-s (SJTU) Academic Ranking of World Universities (ARWU).

Density Clustering Based On Radius of Data (DCBRD)

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, a density based clustering algorithm (DCBRD) is presented, relying on a knowledge acquired from the data by dividing the data space into overlapped regions. The proposed algorithm discovers arbitrary shaped clusters, requires no input parameters and uses the same definitions of DBSCAN algorithm. We performed an experimental evaluation of the effectiveness and efficiency of it, and compared this results with that of DBSCAN. The results of our experiments demonstrate that the proposed algorithm is significantly efficient in discovering clusters of arbitrary shape and size.

Enhanced Clustering Analysis and Visualization Using Kohonen's Self-Organizing Feature Map Networks

Cluster analysis is the name given to a diverse collection of techniques that can be used to classify objects (e.g. individuals, quadrats, species etc). While Kohonen's Self-Organizing Feature Map (SOFM) or Self-Organizing Map (SOM) networks have been successfully applied as a classification tool to various problem domains, including speech recognition, image data compression, image or character recognition, robot control and medical diagnosis, its potential as a robust substitute for clustering analysis remains relatively unresearched. SOM networks combine competitive learning with dimensionality reduction by smoothing the clusters with respect to an a priori grid and provide a powerful tool for data visualization. In this paper, SOM is used for creating a toroidal mapping of two-dimensional lattice to perform cluster analysis on results of a chemical analysis of wines produced in the same region in Italy but derived from three different cultivators, referred to as the “wine recognition data" located in the University of California-Irvine database. The results are encouraging and it is believed that SOM would make an appealing and powerful decision-support system tool for clustering tasks and for data visualization.

Assessment of EU Competitiveness Factors by Multivariate Methods

Measurement of competitiveness between countries or regions is an important topic of many economic analysis and scientific papers. In European Union (EU), there is no mainstream approach of competitiveness evaluation and measuring. There are many opinions and methods of measurement and evaluation of competitiveness between states or regions at national and European level. The methods differ in structure of using the indicators of competitiveness and ways of their processing. The aim of the paper is to analyze main sources of competitive potential of the EU Member States with the help of Factor analysis (FA) and to classify the EU Member States to homogeneous units (clusters) according to the similarity of selected indicators of competitiveness factors by Cluster analysis (CA) in reference years 2000 and 2011. The theoretical part of the paper is devoted to the fundamental bases of competitiveness and the methodology of FA and CA methods. The empirical part of the paper deals with the evaluation of competitiveness factors in the EU Member States and cluster comparison of evaluated countries by cluster analysis. 

Idiopathic Constipation can be Subdivided in Clinical Subtypes: Data Mining by Cluster Analysis on a Population based Study

The prevalence of non organic constipation differs from country to country and the reliability of the estimate rates is uncertain. Moreover, the clinical relevance of subdividing the heterogeneous functional constipation disorders into pre-defined subgroups is largely unknown.. Aim: to estimate the prevalence of constipation in a population-based sample and determine whether clinical subgroups can be identified. An age and gender stratified sample population from 5 Italian cities was evaluated using a previously validated questionnaire. Data mining by cluster analysis was used to determine constipation subgroups. Results: 1,500 complete interviews were obtained from 2,083 contacted households (72%). Self-reported constipation correlated poorly with symptombased constipation found in 496 subjects (33.1%). Cluster analysis identified four constipation subgroups which correlated to subgroups identified according to pre-defined symptom criteria. Significant differences in socio-demographics and lifestyle were observed among subgroups.

Spatial Distribution and Risk Assessment of As, Hg, Co and Cr in Kaveh Industrial City, using Geostatistic and GIS

The concentrations of As, Hg, Co, Cr and Cd were tested for each soil sample, and their spatial patterns were analyzed by the semivariogram approach of geostatistics and geographical information system technology. Multivariate statistic approaches (principal component analysis and cluster analysis) were used to identify heavy metal sources and their spatial pattern. Principal component analysis coupled with correlation between heavy metals showed that primary inputs of As, Hg and Cd were due to anthropogenic while, Co, and Cr were associated with pedogenic factors. Ordinary kriging was carried out to map the spatial patters of heavy metals. The high pollution sources evaluated was related with usage of urban and industrial wastewater. The results of this study helpful for risk assessment of environmental pollution for decision making for industrial adjustment and remedy soil pollution.

Mining Genes Relations in Microarray Data Combined with Ontology in Colon Cancer Automated Diagnosis System

MATCH project [1] entitle the development of an automatic diagnosis system that aims to support treatment of colon cancer diseases by discovering mutations that occurs to tumour suppressor genes (TSGs) and contributes to the development of cancerous tumours. The constitution of the system is based on a) colon cancer clinical data and b) biological information that will be derived by data mining techniques from genomic and proteomic sources The core mining module will consist of the popular, well tested hybrid feature extraction methods, and new combined algorithms, designed especially for the project. Elements of rough sets, evolutionary computing, cluster analysis, self-organization maps and association rules will be used to discover the annotations between genes, and their influence on tumours [2]-[11]. The methods used to process the data have to address their high complexity, potential inconsistency and problems of dealing with the missing values. They must integrate all the useful information necessary to solve the expert's question. For this purpose, the system has to learn from data, or be able to interactively specify by a domain specialist, the part of the knowledge structure it needs to answer a given query. The program should also take into account the importance/rank of the particular parts of data it analyses, and adjusts the used algorithms accordingly.

Cutting and Breaking Events in Telugu

This paper makes a contribution to the on-going debate on conceptualization and lexicalization of cutting and breaking (C&B) verbs by discussing data from Telugu, a language of India belonging to the Dravidian family. Five Telugu native speakers- verbalizations of agentive actions depicted in 43 short video-clips were analyzed. It was noted that verbalization of C&B events in Telugu requires formal units such as simple lexical verbs, explicator compound verbs, and other complex verb forms. The properties of the objects involved, the kind of instruments used, and the manner of action had differential influence on the lexicalization patterns. Further, it was noted that all the complex verb forms encode 'result' and 'cause' sub-events in that order. Due to the polysemy associated with some of the verb forms, our data does not support the straightforward bipartition of this semantic domain.

Using Pattern Search Methods for Minimizing Clustering Problems

Clustering is one of an interesting data mining topics that can be applied in many fields. Recently, the problem of cluster analysis is formulated as a problem of nonsmooth, nonconvex optimization, and an algorithm for solving the cluster analysis problem based on nonsmooth optimization techniques is developed. This optimization problem has a number of characteristics that make it challenging: it has many local minimum, the optimization variables can be either continuous or categorical, and there are no exact analytical derivatives. In this study we show how to apply a particular class of optimization methods known as pattern search methods to address these challenges. These methods do not explicitly use derivatives, an important feature that has not been addressed in previous studies. Results of numerical experiments are presented which demonstrate the effectiveness of the proposed method.

A New Evolutionary Algorithm for Cluster Analysis

Clustering is a very well known technique in data mining. One of the most widely used clustering techniques is the kmeans algorithm. Solutions obtained from this technique depend on the initialization of cluster centers and the final solution converges to local minima. In order to overcome K-means algorithm shortcomings, this paper proposes a hybrid evolutionary algorithm based on the combination of PSO, SA and K-means algorithms, called PSO-SA-K, which can find better cluster partition. The performance is evaluated through several benchmark data sets. The simulation results show that the proposed algorithm outperforms previous approaches, such as PSO, SA and K-means for partitional clustering problem.

Fuzzy Relatives of the CLARANS Algorithm With Application to Text Clustering

This paper introduces new algorithms (Fuzzy relative of the CLARANS algorithm FCLARANS and Fuzzy c Medoids based on randomized search FCMRANS) for fuzzy clustering of relational data. Unlike existing fuzzy c-medoids algorithm (FCMdd) in which the within cluster dissimilarity of each cluster is minimized in each iteration by recomputing new medoids given current memberships, FCLARANS minimizes the same objective function minimized by FCMdd by changing current medoids in such away that that the sum of the within cluster dissimilarities is minimized. Computing new medoids may be effected by noise because outliers may join the computation of medoids while the choice of medoids in FCLARANS is dictated by the location of a predominant fraction of points inside a cluster and, therefore, it is less sensitive to the presence of outliers. In FCMRANS the step of computing new medoids in FCMdd is modified to be based on randomized search. Furthermore, a new initialization procedure is developed that add randomness to the initialization procedure used with FCMdd. Both FCLARANS and FCMRANS are compared with the robust and linearized version of fuzzy c-medoids (RFCMdd). Experimental results with different samples of the Reuter-21578, Newsgroups (20NG) and generated datasets with noise show that FCLARANS is more robust than both RFCMdd and FCMRANS. Finally, both FCMRANS and FCLARANS are more efficient and their outputs are almost the same as that of RFCMdd in terms of classification rate.

Regional Analysis of Streamflow Drought: A Case Study for Southwestern Iran

Droughts are complex, natural hazards that, to a varying degree, affect some parts of the world every year. The range of drought impacts is related to drought occurring in different stages of the hydrological cycle and usually different types of droughts, such as meteorological, agricultural, hydrological, and socioeconomical are distinguished. Streamflow drought was analyzed by the method of truncation level (at 70% level) on daily discharges measured in 54 hydrometric stations in southwestern Iran. Frequency analysis was carried out for annual maximum series (AMS) of drought deficit volume and duration series. Some factors including physiographic, climatic, geologic, and vegetation cover were studied as influential factors in the regional analysis. According to the results of factor analysis, six most effective factors were identified as area, rainfall from December to February, the percent of area with Normalized Difference Vegetation Index (NDVI)

Comparative Study of Complexity in Streetscape Composition

This research is a comparative study of complexity, as a multidimensional concept, in the context of streetscape composition in Algeria and Japan. 80 streetscapes visual arrays have been collected and then presented to 20 participants, with different cultural backgrounds, in order to be categorized and classified according to their degrees of complexity. Three analysis methods have been used in this research: cluster analysis, ranking method and Hayashi Quantification method (Method III). The results showed that complexity, disorder, irregularity and disorganization are often conflicting concepts in the urban context. Algerian daytime streetscapes seem to be balanced, ordered and regular, and Japanese daytime streetscapes seem to be unbalanced, regular and vivid. Variety, richness and irregularity with some aspects of order and organization seem to characterize Algerian night streetscapes. Japanese night streetscapes seem to be more related to balance, regularity, order and organization with some aspects of confusion and ambiguity. Complexity characterized mainly Algerian avenues with green infrastructure. Therefore, for Japanese participants, Japanese traditional night streetscapes were complex. And for foreigners, Algerian and Japanese avenues nightscapes were the most complex visual arrays.

DEA Method for Evaluation of EU Performance

The paper deals with an application of quantitative analysis – the Data Envelopment Analysis (DEA) method to performance evaluation of the European Union Member States, in the reference years 2000 and 2011. The main aim of the paper is to measure efficiency changes over the reference years and to analyze a level of productivity in individual countries based on DEA method and to classify the EU Member States to homogeneous units (clusters) according to efficiency results. The theoretical part is devoted to the fundamental basis of performance theory and the methodology of DEA. The empirical part is aimed at measuring degree of productivity and level of efficiency changes of evaluated countries by basic DEA model – CCR CRS model, and specialized DEA approach – the Malmquist Index measuring the change of technical efficiency and the movement of production possibility frontier. Here, DEA method becomes a suitable tool for setting a competitive/uncompetitive position of each country because there is not only one factor evaluated, but a set of different factors that determine the degree of economic development.