Abstract: Clustering is an unsupervised learning technique in data mining that aggregates data objects into meaningful classes so that intra-cluster similarity is maximized and inter-cluster similarity is minimized. However, no single clustering algorithm proves to be the most effective in producing the best result. As a result, a new and challenging technique known as the cluster ensemble approach has emerged to address this problem, and it has proved a successful approach to cluster analysis. The cluster ensemble's main goal is to combine clustering solutions in a way that achieves accuracy while also improving on the quality of the individual data clusterings. Because new approaches in data mining are being created rapidly and in large numbers, the ongoing interest in inventing novel algorithms calls for a thorough examination of current techniques and future innovation. This paper presents a comparative analysis of various cluster ensemble approaches, including their methodologies, formal working process, and standard accuracy and error rates. The community of clustering practitioners will benefit from this exploratory and clear research, which will aid in determining the most appropriate solution to the problem at hand.
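As an illustration of how base clusterings can be combined, the sketch below uses a co-association matrix, one common consensus strategy that is not necessarily among the methods this paper compares. The function names, the toy label lists, and the 0.5 threshold are all assumptions made for the example.

```python
from itertools import combinations

def co_association(partitions):
    """Fraction of base clusterings that place each pair of objects together."""
    n, m = len(partitions[0]), len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
        for i in range(n)
    ]

def consensus_clusters(partitions, threshold=0.5):
    """Link objects whose co-association exceeds the threshold, then
    return the connected components as consensus cluster labels."""
    n = len(partitions[0])
    co = co_association(partitions)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Three hypothetical base clusterings of five objects (label names are arbitrary).
partitions = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
labels = consensus_clusters(partitions)
print(labels)
```

Objects 0 and 1 are grouped together by every base clustering, and objects 2, 3 and 4 by a majority, so the consensus separates them into two clusters even though the second base clustering disagrees about object 2.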
Abstract: In recent years, a new paradigm known as the Physical Internet has been developed and studied in logistics management. The purpose of this global and open system is to address the grand challenge of logistics by setting up an efficient and sustainable Logistics Web. The purpose of this paper is to review scientific articles dedicated to the Physical Internet and to provide a clustering strategy that makes it possible to classify the literature on the Physical Internet, to follow its evolution, and to critique it. The classification is based on three factors: Logistics Web, organization, and resources. Several papers about the Physical Internet have been classified and analyzed along the Logistics Web, resources and organization views at a strategic, tactical and operational level, respectively. A cluster analysis shows which Physical Internet topics are currently the least covered. Future research directions are outlined for these topics.
Abstract: A key issue in stock investment is how to select representative features for stock selection. The first objective of this paper is to determine whether an automated stock investment system using machine learning techniques can identify a portfolio of growth stocks that is highly likely to provide returns better than the stock market index. The second objective is to identify the technical features that best characterize whether a stock’s price is likely to go up, and to identify the most important factors and their contribution to predicting the likelihood of the stock price going up. First, unsupervised machine learning techniques, such as cluster analysis, were applied to the stock data to identify a cluster of stocks that was likely to go up in price (portfolio 1). Next, principal component analysis was used to select stocks that rated high on component one and component two (portfolio 2). Thirdly, a supervised machine learning technique, logistic regression, was used to select stocks with a high probability of their price going up (portfolio 3). The predictive models were validated with metrics such as sensitivity (recall), specificity and overall accuracy. All accuracy measures were above 70%. All portfolios outperformed the market by more than eight times. The top three stocks were selected for each of the three stock portfolios and traded in the market for one month. After one month, the return for each stock portfolio was computed and compared with the stock market index return. The returns were 23.87% for the principal component analysis portfolio, 11.65% for the logistic regression portfolio and 8.88% for the K-means cluster portfolio, while the stock market return was 0.38%. This study confirms that an automated stock investment system using machine learning techniques can identify top performing stock portfolios that outperform the stock market.
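The validation metrics named in this abstract follow directly from a binary confusion matrix. A minimal sketch, assuming "price went up" is the positive class; the function name and the counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall: share of true risers that were flagged
    specificity = tn / (tn + fp)            # share of non-risers correctly rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# Hypothetical counts for 100 stocks.
sensitivity, specificity, accuracy = classification_metrics(tp=40, fp=15, tn=35, fn=10)
print(sensitivity, specificity, accuracy)  # 0.8 0.7 0.75
```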
Abstract: In the higher education setting, there is a current trend in society toward greater openness and transparency. The economic, social and political changes that have occurred in recent years in public sector universities (particularly the New Public Management, the Bologna Process and the emergence of the “third mission”) call for wider disclosure of the value created by universities in order to support fundraising activities, to ensure accountability in the use of public funds and in the outcomes of research and teaching, and to foster close relationships with industries and territories. The paper has two purposes: 1) to explore intellectual capital (IC) disclosure in Spanish universities through their websites, and 2) to identify university profiles. This study applies content analysis to the institutional websites of Spanish public universities, together with a cluster analysis. The analysis reveals that Spanish universities’ website content usually relates to human capital, while structural and relational capital are less widely disclosed. Our research identifies three behavioral profiles of Spanish universities with regard to the online disclosure of IC: more proactive universities, less proactive universities, and universities that adopt a middle position in this regard. The results can serve as encouragement to university managers to enhance online IC disclosure to meet the information needs of university stakeholders.
Abstract: Data on various aspects of education are collected at the institutional and government level regularly. In Australia, for example, students at various levels of schooling undertake examinations in numeracy and literacy as part of NAPLAN testing, enabling longitudinal assessment of such data as well as comparisons between schools and states within Australia. Another source of educational data collected internationally is via the PISA study which collects data from several countries when students are approximately 15 years of age and enables comparisons in the performance of science, mathematics and English between countries as well as ranking of countries based on performance in these standardised tests. As well as student and school outcomes based on the tests taken as part of the PISA study, there is a wealth of other data collected in the study including parental demographics data and data related to teaching strategies used by educators. Overall, an abundance of educational data is available which has the potential to be used to help improve educational attainment and teaching of content in order to improve learning outcomes. A multivariate assessment of such data enables multiple variables to be considered simultaneously and will be used in the present study to help develop profiles of students based on performance in mathematics using data obtained from the PISA study.
Abstract: In the design cycle, a main design task is to determine the external shape of the product. The external shape of a product is one of the key factors that can affect the customers’ preferences linking to the motivation to buy the product, especially in the case of a consumer electronic product such as a mobile phone. The relationship between the external shape and the customer preferences needs to be studied to enhance the customer’s purchase desire and action. In this research, a design for customer preferences model is developed for investigating the relationships between the external shape and the customer preferences of a product. In the first stage, the names of the geometric features are collected and evaluated from the data of the specified internet web pages using the developed text miner. The key geometric features can be determined if the number of occurrence on the web pages is relatively high. For each key geometric feature, the numerical values are explored using the text miner to collect the internet data from the web pages. In the second stage, a cluster analysis model is developed to evaluate the numerical values of the key geometric features to divide the external shapes into several groups. Several design suggestion cases can be proposed, for example, large model, mid-size model, and mini model, for designing a mobile phone. A customer preference index is developed by evaluating the numerical data of each of the key geometric features of the design suggestion cases. The design suggestion case with the top ranking of the customer preference index can be selected as the final design of the product. In this paper, an example product of a notebook computer is illustrated. It shows that the external shape of a product can be used to drive customer preferences. The presented design for customer preferences model is useful for determining a suitable external shape of the product to increase customer preferences.
Abstract: Delays in the construction industry are a global phenomenon. Many construction projects experience extensive delays exceeding the initially estimated completion time. The main purpose of this study is to identify the typical behaviors of construction projects in order to develop a prognosis and management tool. Knowing a construction project's schedule tendency enables evidence-based decision-making, so that corrective decisions can be made before delays occur. This study presents an innovative approach that uses cluster analysis to support predictions during Earned Value Analyses. A cluster analysis was used to predict the future behavior of scheduling and of the principal Earned Value Management (EVM) and Earned Schedule (ES) indexes in construction projects. The analysis was made using a database of 90 different construction projects. It was validated with additional data extracted from the literature and with another 15 contrasting projects. For all projects, planned and executed schedules were collected and the principal EVM and ES indexes were calculated. A complete linkage classification method was used. Under this method, the distance (or similarity) between two clusters is measured by their most disparate elements, i.e. the distance is given by the maximum span among their components. Finally, through the use of EVM and ES indexes and Tukey and Fisher pairwise comparisons, the statistical dissimilarity was verified and four clusters were obtained. Construction projects show an average delay of 35% of their planned completion time. Furthermore, four typical behaviors were found, and for each of the obtained clusters the interim milestones and the necessary rhythms of construction were identified.
In general, the detected typical behaviors are: (1) projects that complete 5% of the work in the first two tenths and then maintain a constant rhythm until completion (greater than 10% for each remaining tenth), and are able to finish within the initially estimated time; (2) projects that start with an adequate construction rate but suffer minor delays, culminating in a total delay of almost 27% of the planned time; (3) projects that start with a performance below the planned rate and end up with an average delay of 64%; and (4) projects that begin with a poor performance, suffer great delays and end up with an average delay of 120% of the planned completion time. The obtained clusters form a tool for identifying the behavior of new construction projects by comparing their current work performance to the validated database, thus allowing initial estimates to be corrected towards more accurate completion schedules.
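The complete linkage rule described in this abstract (cluster distance equals the distance between the two most disparate members) can be sketched as follows. Euclidean distance and the two-dimensional toy points are assumptions; the study's actual variables are the EVM and ES indexes.

```python
import math

def complete_linkage_distance(cluster_a, cluster_b):
    """Distance between two clusters under complete linkage:
    the largest pairwise distance between their members."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical clusters of two-dimensional observations.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (4.0, 3.0)]
print(complete_linkage_distance(cluster_a, cluster_b))  # 5.0
```

Here the farthest pair is (0, 0) and (4, 3), so the clusters are 5.0 apart under complete linkage, even though their closest members are only 3.0 apart.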
Abstract: The research examines the factors that affect customer churn (CC) in the Jordanian telecom industry. A total of 700 surveys were distributed. Cluster analysis revealed three main clusters. Results showed that CC and customer satisfaction (CS) were the key determinants in forming the three clusters. In two clusters, the center values of CC were high, indicating that the customers were loyal and SC was expensive and time- and energy-consuming. Still, the mobile service provider (MSP) should enhance its communication (COM), and value added services (VASs), as well as customer complaint management systems (CCMS). Finally, for the third cluster the center of the CC indicates a poor level of loyalty, which facilitates customers churn to another MSP. The results of this study provide valuable feedback for MSP decision makers regarding approaches to improving their performance and reducing CC.
Abstract: A number of studies have discussed the benefits of retailer-manufacturer cooperation and coopetition. However, only a few publications focus on the benefits of cooperation and coopetition between retailers and their suppliers of durable consumer goods, especially in the context of the business models of the cooperating partners. This paper aims to provide a clustering approach to segment retailers selling consumer durables according to the benefits they obtain from their cooperation with key manufacturers, and to differentiate these retailers in terms of the business models of the cooperating partners. For the purpose of the study, a survey (using the CATI method) collected data on 603 consumer durables retailers present on the Polish market. Retailers are clustered with both hierarchical and non-hierarchical methods. Five distinctive groups of consumer durables retailers are identified, based on the studied benefits, using the two-stage clustering approach. The clusters are then characterized with a set of exogenous variables, key among which are the business models employed by the retailer and its partnering key manufacturer. The paper finds that a combination of a medium-sized retailer classified as an Integrator with chiefly domestic capital and a manufacturer categorized as a Market Player yields the highest benefits. On the other side of the spectrum is the medium-sized Distributor retailer with solely domestic capital; in this case, the business model of the cooperating manufacturer appears to be irrelevant. This paper is one of the first empirical studies using cluster analysis on primary data to define the types of cooperation between consumer durables retailers and manufacturers, their key suppliers. The analysis integrates the perspectives of both retailers’ and manufacturers’ business models and matches them with individual and joint benefits.
Abstract: Renewable energy is referred to as "clean energy" and common popular support for the use of renewable energy (RE) is to provide electricity with zero carbon dioxide emissions. This study provides useful insight into the European Union (EU) RE, especially, into electricity generation obtained from renewables, and their targets. The objective of this study is to identify groups of European countries, using multivariate statistical analysis and selected indicators. The hierarchical clustering method is used to decide the number of clusters for EU countries. The conducted statistical hierarchical cluster analysis is based on the Ward’s clustering method and squared Euclidean distances. Hierarchical cluster analysis identified eight distinct clusters of European countries. Then, non-hierarchical clustering (k-means) method was applied. Discriminant analysis was used to determine the validity of the results with data normalized by Z score transformation. To explore the relationship between the selected indicators, correlation coefficients were computed. The results of the study reveal the current situation of RE in European Union Member States.
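Two ingredients named in this abstract, Z-score normalization and Ward's criterion on squared Euclidean distances, can be sketched briefly. Ward's method merges the pair of clusters whose union gives the smallest increase in within-cluster sum of squares, which equals n_a * n_b / (n_a + n_b) times the squared distance between the two centroids. The function names, the use of the population standard deviation, and the toy values are assumptions.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize one indicator to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the two clusters are merged."""
    centroid_a = [mean(col) for col in zip(*cluster_a)]
    centroid_b = [mean(col) for col in zip(*cluster_b)]
    squared_distance = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    n_a, n_b = len(cluster_a), len(cluster_b)
    return n_a * n_b / (n_a + n_b) * squared_distance

# Hypothetical indicator values for three countries.
standardized = z_scores([2.0, 4.0, 6.0])

# Hypothetical clusters of already-standardized observations.
cost = ward_merge_cost([(0.0, 0.0), (2.0, 0.0)], [(5.0, 0.0)])
print(standardized, cost)
```

At each step of the hierarchy, the pair of clusters with the lowest such cost would be merged; the resulting number of clusters can then seed a k-means run, as the abstract describes.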
Abstract: As genetic diversity is most important for existing, breeding and production of any fish; this study was undertaken for investigating genetic diversity of freshwater mud eel, Monopterus cuchia at population level where three ecological populations such as flooded area of Sylhet (P1), open water of Moulvibazar (P2) and open water of Sunamganj (P3) districts of Bangladesh were considered. Four arbitrary RAPD primers (OPB-12, C0-4, B-03 and OPB-08) were screened and RAPD banding patterns were analyzed among the populations considering 15 individuals of each population. In total 174, 138 and 149 bands were detected in the populations of P1, P2 and P3 respectively; however, each primer revealed less number of bands in each population. 100% polymorphic loci were recorded in P2 and P3 whereas only one monomorphic locus was observed in P1, recorded 97.5% polymorphism. Different genetic parameters such as inter-individual pairwise similarity, genetic distance, Nei genetic similarity, linkage distances, cluster analysis and allelic information, etc. were considered for measuring genetic diversity. The average inter-individual pairwise similarity was recorded 2.98, 1.47 and 1.35 in P1, P2 and P3 respectively. Considering genetic distance analysis, the highest distance 1 was recorded in P2 and P3 and the lowest genetic distance 0.444 was found in P2. The average Nei genetic similarity was observed 0.19, 0.16 and 0.13 in P1, P2 and P3, respectively; however, the average linkage distance was recorded 24.92, 17.14 and 15.28 in P1, P3 and P2 respectively. Based on linkage distance, genetic clusters were generated in three populations where 6 clades and 7 clusters were found in P1, 3 clades and 5 clusters were observed in P2 and 4 clades and 7 clusters were detected in P3. In addition, allelic information was observed where the frequency of p and q alleles were observed 0.093 and 0.907 in P1, 0.076 and 0.924 in P2, 0.074 and 0.926 in P3 respectively. 
The average gene diversity was highest in P2 (0.132), followed by P3 (0.131) and P1 (0.121).
Abstract: The identification of lipid and soluble sugar components in flour samples of different cultivars belonging to common oat species (Avena sativa L.) was performed: spring oat, winter oat and hulless oat. Fatty acids were extracted from flour samples with n-hexane, and derivatized into volatile methyl esters, using TMSH (trimethylsulfonium hydroxide in methanol). Soluble sugars were then extracted from defatted and dried samples of oat flour with 96% ethanol, and further derivatized into corresponding TMS-oximes, using hydroxylamine hydrochloride solution and BSTFA (N,O-bis-(trimethylsilyl)-trifluoroacetamide). The hexane and ethanol extracts of each oat cultivar were analyzed using GC-MS system. Lipid and simple sugar compositions are very similar in all samples of investigated cultivars. Chemometric tool was applied to numeric values of automatically integrated surface areas of detected lipid and simple sugar components in their corresponding derivatized forms. Hierarchical cluster analysis shows a very high similarity between the investigated flour samples of oat cultivars, according to the fatty acid content (0.9955). Moderate similarity was observed according to the content of soluble sugars (0.50). These preliminary results support the idea of establishing methods for oat flour authentication, and provide the means for distinguishing oat flour samples, regardless of the variety, from flour samples made of other cereal species, just by lipid and simple sugar profile analysis.
Abstract: In the knowledge-based economy, innovation is considered essential for the survival and growth of organizations. Knowledge management, in turn, is currently understood as one of the keys to the innovation process. Both factors are generally accepted as generators of competitive advantage in organizations. Specifically, R&D&I activities and those that generate internal knowledge have a positive influence on innovation results. This paper examines this effect and aims to quantify whether or not it is similar. We focus on the impact that the proportion of knowledge workers, R&D&I investment, the amounts allocated to ICTs, and training for innovation have on the variation of tangible and intangible returns in the high and medium technology sector in Spain. To do this, we performed an empirical analysis of the results of questionnaires about innovation in enterprises in Spain, collected by the National Statistics Institute. First, using cluster methodology, the behavior of these enterprises regarding knowledge management is identified. Then, using SEM methodology, we studied, for each cluster, the cause-effect relationships among constructs defined through variables, establishing their type and magnitude. The cluster analysis results in four groups, in which clusters 1 and 3 present the best innovation performance, with differentiating nuances between them, while clusters 2 and 4 obtained divergent results from a similar innovative effort. However, the results of the SEM analysis for each cluster show that, in all cases, knowledge workers are the factor that most affects innovation performance, regardless of the level of investment, and that there is a strong correlation between knowledge workers and investment in knowledge generation.
The main finding is that Spanish high and medium technology companies improve their innovation performance by investing in internal knowledge generation measures, especially R&D activities, and underinvest in external ones. This, together with the strong correlation between knowledge workers and the set of activities that promote knowledge generation, should be taken into account by company managers when making decisions about their investments in innovation, since these factors are key to improving their opportunities in the global market.
Abstract: River Hindon is an important river catering to the
demand of the highly populated rural and industrial cluster of western
Uttar Pradesh, India. Water quality of river Hindon is deteriorating at
an alarming rate due to various industrial, municipal and agricultural
activities. The present study aimed at identifying the pollution
sources and quantifying the degree to which these sources are
responsible for the deteriorating water quality of the river. Various
water quality parameters, like pH, temperature, electrical
conductivity, total dissolved solids, total hardness, calcium, chloride,
nitrate, sulphate, biological oxygen demand, chemical oxygen
demand, and total alkalinity were assessed. Water quality data
obtained from eight study sites over one year were subjected to
two multivariate techniques, namely, principal component analysis
and cluster analysis. Principal component analysis was applied with
the aim to find out spatial variability and to identify the sources
responsible for the water quality of the river. Three Varifactors were
obtained after varimax rotation of initial principal components using
principal component analysis. Cluster analysis was carried out to
classify sampling stations of certain similarity, which grouped eight
different sites into two clusters. The study reveals that the
anthropogenic influence (municipal, industrial, waste water and
agricultural runoff) was the major source of river water pollution.
Thus, this study illustrates the utility of multivariate statistical
techniques for analysis and elucidation of multifaceted data sets,
recognition of pollution sources/factors and understanding
temporal/spatial variations in water quality for effective river water
quality management.
Abstract: Given the increase in the number of e-commerce sites,
the number of competitors has become very important. This means
that companies have to take appropriate decisions in order to meet the
expectations of their customers and satisfy their needs. In this paper,
we present a case study of applying LRFM (length, recency,
frequency and monetary) model and clustering techniques in the
sector of electronic commerce with a view to evaluating customers’
values of the Moroccan e-commerce websites and then developing
effective marketing strategies. To achieve these objectives, we adopt
LRFM model by applying a two-stage clustering method. In the first
stage, the self-organizing maps method is used to determine the best
number of clusters and the initial centroid. In the second stage, the k-means
method is applied to segment 730 customers into nine clusters
according to their L, R, F and M values. The results show that
cluster 6 is the most important cluster because its average values of
L, R, F and M are higher than the overall average values. In addition,
this study has considered another variable that describes the mode of
payment used by customers to improve and strengthen clusters’
analysis. The clusters’ analysis demonstrates that the payment method is
one of the key indicators of a new index that makes it possible to assess the
level of customers’ confidence in the company's Website.
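Before any clustering, the LRFM variables have to be derived from raw transactions. The sketch below uses one common set of definitions, which the abstract itself does not spell out: L is the number of days between a customer's first and last purchase, R the number of days since the last purchase, F the number of purchases, and M the total amount spent. The function name and the sample transactions are hypothetical.

```python
from datetime import date

def lrfm(transactions, today):
    """Compute L, R, F, M per customer from (customer_id, purchase_date, amount) rows."""
    by_customer = {}
    for customer, day, amount in transactions:
        by_customer.setdefault(customer, []).append((day, amount))

    features = {}
    for customer, rows in by_customer.items():
        days = [d for d, _ in rows]
        features[customer] = {
            "L": (max(days) - min(days)).days,   # length of the relationship
            "R": (today - max(days)).days,       # recency of the last purchase
            "F": len(rows),                      # purchase frequency
            "M": sum(a for _, a in rows),        # monetary value
        }
    return features

# Hypothetical transactions for two customers.
transactions = [
    ("c1", date(2023, 1, 1), 100.0),
    ("c1", date(2023, 3, 2), 50.0),
    ("c2", date(2023, 2, 10), 20.0),
]
features = lrfm(transactions, today=date(2023, 4, 1))
print(features["c1"])  # {'L': 60, 'R': 30, 'F': 2, 'M': 150.0}
```

The resulting feature vectors, typically rescaled first, are what a two-stage method such as self-organizing maps followed by k-means would then segment.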
Abstract: Polycyclic Aromatic Hydrocarbons (PAHs) are
formed mainly because of incomplete combustion of organic
materials during industrial or domestic activities or natural events.
Their toxicity and contamination of terrestrial and aquatic ecosystems
have been established. However, with limited validity index, previous
research has focused on PAHs isomer pair ratios of variable
physicochemical properties in source identification. The objective of
this investigation was to determine the empirical validity of Pearson
Correlation Coefficient (PCC) and Cluster Analysis (CA) in PAHs
source identification across soil samples from different land uses.
Therefore, 16 PAHs, grouped as Endocrine Disruption Substances
(EDSs), were determined seasonally in top and sub soils at 10 sample
stations. PAHs were determined using a Varian 300 gas
chromatograph interfaced with a flame ionization detector. Instruments
and reagents used were of standard and chromatographic grades,
respectively. PCC and CA results showed that the classification of
PAHs into pyrolytic and petrogenic organics used in source
signature concerns the predominance of PAHs in the environmental matrix.
The distribution of PAHs in the studied stations revealed
the presence of trace quantities of the vast majority of the sixteen
PAHs, which may ultimately inhibit the actual source signature
authentication. Therefore, the type and extent of bacterial
metabolism, transformation products/substrates, and environmental
factors such as salinity, pH, oxygen concentration, nutrients, light
intensity, temperature, co-substrates, and environmental medium are
hereby recommended as factors to be considered when evaluating
possible sources of PAHs.
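For reference, the Pearson Correlation Coefficient used in this study is the covariance of two variables divided by the product of their standard deviations. A minimal sketch; the function name and the concentration values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical concentrations of two PAHs measured at four stations.
pah_a = [1.0, 2.0, 3.0, 4.0]
pah_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(pah_a, pah_b)
print(round(r, 6))
```

A coefficient close to 1 between two compounds across stations is usually read as a hint that they share a source, which is the kind of evidence the abstract evaluates alongside cluster analysis.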
Abstract: The aim of this investigation is to elaborate near-infrared
methods for testing and recognition of chemical components
and quality in “Pannon wheat” allied (i.e. true to variety or variety
identified) milling fractions as well as to develop spectroscopic
methods following the milling processes and evaluate the stability of
the milling technology by different types of milling products and
according to sampling times, respectively. These wheat categories
were produced under industrial conditions, where samples were collected
according to sampling time and maximum or minimum yields. The changes
of the main chemical components (such as starch, protein, lipid) and
physical properties of fractions (particle size) were analysed by
dispersive spectrophotometers using visible (VIS) and near-infrared
(NIR) regions of the electromagnetic radiation. Close correlations
were obtained between the data of spectroscopic measurement
techniques processed by various chemometric methods (e.g. principal
component analysis [PCA], cluster analysis [CA]) and the operating
conditions of the milling technology. It is evident that NIR methods are
able to detect the deviation of the yield parameters and differences of
the sampling times by a wide variety of fractions, respectively. NIR
technology can be used in the sensitive monitoring of milling
technology.
Abstract: During the post-Civil War era, the city of Nashville,
Tennessee, had the highest mortality rate in the United States. The
elevated death and disease rates among former slaves were
attributable to lack of quality healthcare. To address the paucity of
healthcare services, Meharry Medical College, an institution with the
mission of educating minority professionals and serving the
underserved population, was established in 1876.
Purpose: The social ecological framework and partial least squares
(PLS) path modeling were used to quantify the impact of
socioeconomic status and adverse health outcome on primary care
professionals serving the disadvantaged community. Thus, the study
results could demonstrate the accomplishment of the College’s
mission of training primary care professionals to serve in underserved
areas.
Methods: Various statistical methods were used to analyze alumni
data from 1975 – 2013. K-means cluster analysis was utilized to
identify individual medical and dental graduates in the cluster groups
of the practice communities (Disadvantaged or Non-disadvantaged
Communities). Discriminant analysis was implemented to verify the
classification accuracy of cluster analysis. The independent t-test was
performed to detect the significant mean differences of respective
clustering and criterion variables. Chi-square test was used to test if
the proportions of primary care and non-primary care specialists are
consistent with those of medical and dental graduates practicing in
the designated community clusters. Finally, the PLS path model was
constructed to explore the construct validity of analytic model by
providing the magnitude effects of socioeconomic status and adverse
health outcome on primary care professionals serving the
disadvantaged community.
Results: Approximately 83% (3,192/3,864) of Meharry Medical
College’s medical and dental graduates from 1975 to 2013 were
practicing in disadvantaged communities. Independent t-test confirmed the content validity of the cluster analysis model. Also, the
PLS path modeling demonstrated that alumni served as primary care
professionals in communities with significantly lower socioeconomic
status and higher adverse health outcome (p < .001). The PLS path
modeling exhibited the meaningful interrelation between primary
care professionals practicing communities and surrounding
environments (socioeconomic status and adverse health outcome),
which yielded model reliability, validity, and applicability.
Conclusion: This study applied social ecological theory and
analytic modeling approaches to assess the attainment of Meharry
Medical College’s mission of training primary care professionals to
serve in underserved areas, particularly in communities with low
socioeconomic status and high rates of adverse health outcomes. In
summary, the majority of medical and dental graduates from Meharry
Medical College provided primary care services to disadvantaged
communities with low socioeconomic status and high adverse health
outcome, which demonstrated that Meharry Medical College has
fulfilled its mission. The high reliability, validity, and applicability of
this model imply that it could be replicated for comparable
universities and colleges elsewhere.
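The K-means step named in the Methods above can be illustrated with a minimal version of Lloyd's algorithm for two clusters. The two-dimensional points stand in for hypothetical community indicators, and the initial centroids are supplied by hand; neither reflects the study's actual variables or software.

```python
import math

def k_means(points, centroids, iterations=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    def nearest(point, centers):
        return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))

    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members (keep it if the cluster is empty).
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical communities described by two standardized indicators.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In a study of this kind, the resulting labels would then be checked with a second method, such as the discriminant analysis mentioned in the Methods.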
Abstract: We have been grouping and developing various kinds
of practical, promising applied sensing systems concerning
agricultural advancement and technical tradition (guidance). These
include advanced devices that capture real-time data on worker
motion, which we analyze with various advanced statistical and
human dynamics methods (e.g. principal component analysis, Ward's method
based cluster analysis, and mapping). In addition, we have been
considering workers' daily health and safety issues. The targeted fields are
mainly common farms, meadows, and gardens. We then
observed and discussed the data as they changed over time, and made
some suggestions. The overall plan makes it possible to improve both
the aforementioned applied systems and the farms.
Abstract: An extensive amount of work has been done in data
clustering research under the unsupervised learning technique in Data
Mining during the past two decades. Moreover, several approaches
and methods have emerged that focus on clustering diverse data
types, features of cluster models and similarity rates of clusters.
However, no single clustering algorithm proves to be the best
at extracting efficient clusters. Consequently, in order to
rectify this issue, a new and challenging technique called the Cluster
Ensemble method has emerged. This new approach tends to be the
alternative method for the cluster analysis problem. The main
objective of the Cluster Ensemble is to aggregate the diverse
clustering solutions in such a way as to attain accuracy and also to
improve on the quality of the individual clustering algorithms. Due to
the massive and rapid development of new methods in the world of
data mining, a critical analysis of existing techniques and of
future innovation is essential. This paper presents a
comparative analysis of different cluster ensemble methods along
with their methodologies and salient features. This
clear analysis will be very useful for the community of clustering
experts and will also help in deciding on the most appropriate method to resolve
the problem at hand.