Abstract: In this paper, we use data mining to extract biomedical knowledge. Complex biomedical data collected in population studies are generally treated by statistical methods; although these methods are robust, they are not sufficient by themselves to harness the potential wealth of the data. To that end, we used two learning algorithms: Decision Trees and the Support Vector Machine (SVM). These supervised classification methods are applied to the diagnosis of thyroid disease. In this context, we propose to promote the study and use of symbolic data mining techniques.
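The two supervised classifiers named above can be sketched as follows; since the paper's thyroid dataset is not reproduced here, a synthetic stand-in with assumed three-class labels is used instead.

```python
# Sketch of the two supervised classifiers from the abstract on a
# synthetic stand-in for the thyroid data (assumed, not the paper's).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic "patients": 500 samples, 10 clinical features, 3 classes
# (e.g. euthyroid / hypothyroid / hyperthyroid -- illustrative labels).
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=6, n_classes=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)

acc = {"tree": accuracy_score(y_te, tree.predict(X_te)),
       "svm": accuracy_score(y_te, svm.predict(X_te))}
```

The held-out accuracies in `acc` are the kind of comparison the abstract reports; the hyperparameters here are defaults, not the paper's tuned values.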
Abstract: In this paper, we study the rainfall using a time series
for weather stations in Nakhon Ratchasima province in Thailand by
various statistical methods to enable us to analyse the behaviour of
rainfall in the study areas. Time-series analysis is an important tool in
modelling and forecasting rainfall. The ARIMA and Holt-Winter
models were built on the basis of exponential smoothing. All the
models proved to be adequate. Therefore it is possible to give
information that can help decision makers establish strategies for the
proper planning of agriculture, drainage systems and other water
resource applications in Nakhon Ratchasima province. We obtained
the best forecasting performance from the ARIMA(1,0,1)(1,0,1)12 model.
Abstract: Curcuma longa L. (Zingiberaceae), commonly known
as turmeric, has a long history of traditional uses for culinary
purposes as a spice and a food colorant. The present study aimed to
document the ethnobotanical knowledge about Curcuma longa, and
to assess the variation in the herbalists’ experience in Northeastern
Algeria. Data were collected using semi-structured questionnaires
and direct interviews with 30 herbalists. Ethnobotanical indices,
including the fidelity level (FL%), the relative frequency of citation (RFC), and the use value (UV), were determined by quantitative methods.
Diversity in the level of knowledge was analyzed using univariate,
non-parametric, and multivariate statistical methods. Three main
categories of uses were recorded for C. longa: for food, for medicine,
and for cosmetic purposes. As a medicine, turmeric was used for the
treatment of gastrointestinal, dermatological, and hepatic diseases.
Medicinal and food uses were correlated with both forms of
preparation (rhizome and powder). The age group did not influence
the use. Multivariate analyses showed a significant variation in
traditional knowledge, associated with the use value, origin, quality,
and efficacy of the drug. The findings suggested that the geographical
origin of C. longa affected the use in Algeria.
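The three quantitative indices named in the abstract follow standard ethnobotanical definitions; a worked example with illustrative counts (not the paper's data) is:

```python
# Worked example of the three ethnobotanical indices from the abstract,
# using made-up counts for illustration (not the study's actual data).
n_informants = 30          # herbalists interviewed (as in the abstract)

fc = 27                    # informants who cited C. longa at all
np_gastro = 18             # of those, cited it for gastrointestinal use
use_reports = 45           # total use reports across all informants

rfc = fc / n_informants                 # relative frequency of citation
fl_gastro = 100 * np_gastro / fc        # fidelity level (FL%) for one use
uv = use_reports / n_informants         # use value
```

With these counts, RFC = 0.9, FL% for the gastrointestinal category is about 66.7%, and UV = 1.5.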
Abstract: The Molucca Collision Zone is located at the junction of the Eurasian, Australian, Pacific, and Philippine plates. Between the Sangihe arc, west of the collision zone, and the Halmahera arc to the east, the collision is active and convex toward the Molucca Sea. This research analyzes the behavior of earthquake occurrence in the Molucca Collision Zone: the distribution of earthquakes in each partitioned region, the type of distribution of earthquake occurrence in each region, the mean occurrence of earthquakes in each region, and the correlation between the regions. We count earthquakes using a partition method and characterize their behavior using conventional statistical methods. In this research, we used data on shallow earthquakes with magnitudes ≥ 4 on the Richter scale (period 1964-2013). From the results, we can classify the partitioned regions, based on the correlation, into two classes: strong and very strong. This classification can be used in an early warning system for disaster management.
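The correlation step can be sketched as below on simulated yearly counts for two partitioned regions; the class cutoffs (0.6 and 0.8) are a common rule of thumb, assumed here since the abstract does not state its thresholds.

```python
# Sketch of correlating earthquake counts between two partitioned
# regions and labeling the correlation strength; the counts and the
# 0.6 / 0.8 cutoffs are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
base = rng.poisson(20, size=50)              # 50 years of counts
region_a = base + rng.poisson(3, size=50)    # two regions sharing
region_b = base + rng.poisson(3, size=50)    # a common driver

r = np.corrcoef(region_a, region_b)[0, 1]    # Pearson correlation

def correlation_class(r):
    if abs(r) >= 0.8:
        return "very strong"
    return "strong" if abs(r) >= 0.6 else "weaker"

label = correlation_class(r)
```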
Abstract: The purpose of this study is to follow up the graduates of the Bachelor of Science in Applied Statistics program at Suan Sunandha Rajabhat University (SSRU) during the 1999-2012 academic years and to provide a fundamental guideline for developing the current curriculum according to the Thai Qualifications Framework for Higher Education (TQF: HEd). The sample was collected from 75 graduates by interview and online questionnaire. The content covered five domains: Ethics and Morals; Knowledge; Cognitive Skills; Interpersonal Skills and Responsibility; and Numerical Analysis, Communication, and Information Technology Skills. Data were analyzed using statistical methods such as percentages, means, standard deviations, t-tests, and F-tests. The findings showed that the respondents were mostly female and under 26 years old. The majority of graduates had incomes in the range of 10,001-20,000 Baht and 2-5 years of work experience. In addition, overall opinions on applying the acquired knowledge to work were at the "agree" level (mean 3.97, standard deviation 0.40). Hypothesis testing indicated that only gender produced a significantly different opinion at the 0.05 level.
Abstract: Statistical analysis of medical data often requires special techniques because of the particularities of these data. Principal components analysis and data clustering are two statistical methods for data mining that are very useful in the medical field, the first as a method to decrease the number of studied parameters and the second as a method to analyze the connections between the diagnosis and the data about the patient's condition. In this paper we investigate the implications of a specific data analysis technique: data clustering preceded by a selection of the most relevant parameters, made using principal components analysis. Our assumption was that applying principal components analysis before data clustering, in order to select and classify only the most relevant parameters, would improve the accuracy of clustering; but the practical results showed the opposite: the clustering accuracy decreases by a percentage approximately equal to the percentage of information loss reported by the principal components analysis.
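The experiment described above can be sketched as follows; the data are synthetic, and the adjusted Rand index stands in for whatever accuracy measure the paper used.

```python
# Sketch of the paper's experiment: cluster with and without a PCA
# preselection and compare clustering accuracy (here the adjusted Rand
# index against known labels, on synthetic data -- both assumptions).
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=300, n_features=12, centers=3,
                  random_state=0)

# Clustering on all parameters.
km_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
ari_full = adjusted_rand_score(y, km_full.labels_)

# Clustering after PCA keeps only the leading components.
pca = PCA(n_components=2).fit(X)
km_red = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit(pca.transform(X))
ari_red = adjusted_rand_score(y, km_red.labels_)

info_kept = pca.explained_variance_ratio_.sum()  # fraction retained
```

Comparing `ari_full` with `ari_red` against `1 - info_kept` mirrors the paper's observation that the accuracy loss tracks the information loss.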
Abstract: Hospital staff and managers are under pressure and concerned about the effective use and management of scarce resources. Hospital admissions require many decisions that have complex and uncertain consequences for hospital resource utilization and patient flow. It is challenging to predict the risk of admission and the length of stay (LOS) of a patient because of their vague nature. There is no method to capture the vague definition of the admission of a patient. Moreover, current methods and tools used to predict patients at risk of admission fail to deal with the uncertainty in unplanned admissions, LOS, and patients' characteristics. The main objective of this paper is to deal with uncertainty in health system variables and to handle the uncertain relationships among those variables. Introducing machine learning techniques, along with statistical methods such as regression, is a proposed solution approach to handle uncertainty in health system variables. A model that adapts fuzzy methods to handle uncertain data and uncertain relationships can be an efficient solution to capture the vague definition of the admission of a patient.
Abstract: Neural networks are well known for their ability to model nonlinear functions, but, as statistical methods usually do, they take a nonparametric approach; consequently, neither a priori nor a posteriori knowledge is easy to take into account. To deal with these problems, an original way to encode knowledge inside the architecture is proposed. This method is applied to the problem of evapotranspiration inside a karstic aquifer, a problem of great utility for water resource management.
Abstract: Research shows that the application of probability-statistical methods is unfounded at the early stages of diagnosing the technical condition of an aviation Gas Turbine Engine (GTE), when the flight information is fuzzy, limited, and uncertain. Hence, the efficiency of applying the new Soft Computing technology at these diagnosing stages, using Fuzzy Logic and Neural Network methods, is considered. For this purpose, fuzzy multiple linear and nonlinear models (fuzzy regression equations), obtained on the basis of statistical fuzzy data, are trained with high accuracy. To build a more adequate model of the GTE technical condition, the dynamics of the changes in the skewness and kurtosis coefficients are analysed. Studies of the changes in the values of the skewness and kurtosis coefficients show that the distributions of GTE work parameters have a fuzzy character; hence, consideration of fuzzy skewness and kurtosis coefficients is expedient. Investigation of the dynamics of the changes of the basic characteristics of GTE work parameters leads to the conclusion that Fuzzy Statistical Analysis is necessary for the preliminary identification of the engine's technical condition. Studies of the changes in the correlation coefficients also show their fuzzy character; therefore, the application of Fuzzy Correlation Analysis results is proposed for model selection. When the information is sufficient, a recurrent algorithm for identifying the aviation GTE technical condition (using Hard Computing technology) is proposed, based on measurements of the input and output parameters of the multiple linear and nonlinear generalised models in the presence of measurement noise (a new recursive Least Squares Method (LSM)). The developed GTE condition monitoring system provides a stage-by-stage estimation of engine technical conditions. As an application of the given technique, the technical condition of a new operating aviation engine was estimated.
Abstract: Ice cover has a significant impact on rivers, as it affects the ice melting capacity, which results in flooding, restricts navigation, and modifies the ecosystem and microclimate. River ice is made up of different ice types with varying thickness, so surveillance of river ice plays an important role. River ice types are captured using an infrared imaging camera, which captures images even at night. In this paper the river-ice infrared texture images are analysed using first-order statistical methods and second-order statistical methods. The second-order statistical methods considered are the spatial gray level dependence method, the gray level run length method, and the gray level difference method. The performance of the feature extraction methods is evaluated using a Probabilistic Neural Network classifier, and it is found that the first-order statistical method and the second-order statistical methods each yield low accuracy on their own. The features extracted from the first-order and second-order statistical methods are therefore combined, and it is observed that these combined features (first-order statistical method + gray level run length method) provide higher accuracy than the features from either the first-order or the second-order statistical method alone.
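The first-order statistical features mentioned above are computed from the gray-level histogram alone; a minimal sketch on a random stand-in "image" (not a river-ice infrared frame) is:

```python
# Sketch of first-order statistical texture features (mean, variance,
# skewness, kurtosis, entropy) from a gray-level histogram; the 8-bit
# image is random stand-in data, not a river-ice infrared frame.
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64))           # fake 8-bit image

hist = np.bincount(img.ravel(), minlength=256).astype(float)
p = hist / hist.sum()                               # gray-level pdf
levels = np.arange(256)

mean = (levels * p).sum()
var = ((levels - mean) ** 2 * p).sum()
skew = ((levels - mean) ** 3 * p).sum() / var ** 1.5
kurt = ((levels - mean) ** 4 * p).sum() / var ** 2
entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()     # bits, at most 8

features = np.array([mean, var, skew, kurt, entropy])
```

A feature vector like this would be concatenated with run-length features before classification, in the spirit of the combination the abstract reports.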
Abstract: Forecasting the values of the indicators that characterize the effectiveness of an organization's performance is of great importance for its successful development. Such forecasting is necessary in order to assess the current state and to foresee future developments, so that measures to improve the organization's activity can be undertaken in time. The article presents an
overview of the applied mathematical and statistical methods for
developing forecasts. Special attention is paid to artificial neural
networks as a forecasting tool. Their strengths and weaknesses are
analyzed and a synopsis is made of the application of artificial neural
networks in the field of forecasting of the values of different
education efficiency indicators. A method of evaluation of the
activity of universities using the Balanced Scorecard is proposed and
Key Performance Indicators for assessment of e-learning are
selected. Resulting indicators for the evaluation of efficiency of the
activity are proposed. An artificial neural network is constructed and
applied in the forecasting of the values of indicators for e-learning
efficiency on the basis of the KPI values.
Abstract: There are several approaches in trying to solve the
Quantitative Structure-Activity Relationship (QSAR) problem.
These approaches are based either on statistical methods or on
predictive data mining. Among the statistical methods, one should
consider regression analysis, pattern recognition (such as cluster
analysis, factor analysis and principal components analysis) or partial
least squares. Predictive data mining techniques use either neural
networks, or genetic programming, or neuro-fuzzy knowledge. These
approaches have low explanatory capability or none at all. This
paper attempts to establish a new approach in solving QSAR
problems using descriptive data mining. This way, the relationship
between the chemical properties and the activity of a substance
would be comprehensibly modeled.
Abstract: Microarrays have become effective, broadly used tools in biological and medical research to address a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing and systematizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we assess and compare the performance of several clustering methods under different data-preprocessing strategies, including normalization and noise removal. We also evaluate each of these clustering methods with validation measures, on both simulated data and real gene expression data. We conclude that the clustering methods commonly used in microarray data analysis are affected by the normalization and by the degree of noise in the datasets.
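The kind of comparison described above can be sketched as follows: the same clustering method run on raw versus z-score-normalized expression values, scored with a validation measure. The simulated expression matrix and the choice of silhouette score are assumptions.

```python
# Sketch of comparing one clustering method on raw vs normalized
# expression data, scored with the silhouette validation measure;
# the expression matrix is simulated, not real microarray data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# 60 samples x 100 genes, two sample groups, gene-specific scales.
group = np.repeat([0, 1], 30)
X = rng.normal(group[:, None] * 2.0, 1.0, size=(60, 100))
X *= rng.uniform(0.5, 5.0, size=100)

def cluster_score(data):
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(data)
    return silhouette_score(data, labels)

score_raw = cluster_score(X)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # per-gene z-score
score_norm = cluster_score(X_norm)
```

Comparing `score_raw` with `score_norm` shows how the preprocessing, rather than the clustering algorithm itself, can change the result.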
Abstract: Computer-based geostatistical methods can offer effective data analysis possibilities for agricultural areas by using vectorial data and their associated information. These methods help to detect spatial changes across different locations of large agricultural lands, which leads to effective fertilization for optimal yield with reduced environmental pollution. In this study, topsoil (0-20 cm) and subsoil (20-40 cm) samples were taken from a sugar beet field on a 20 x 20 m grid. Plant samples were also collected from the same plots. Some physical and chemical analyses of these samples were made by routine methods. According to the derived coefficients of variation, the topsoil organic matter (OM) distribution varied more than the subsoil OM distribution; the highest C.V. value, 17.79%, was found for topsoil OM. The data were analyzed comparatively using kriging methods, which are widely used in geostatistics. Several interpolation methods (ordinary, simple, and universal kriging) and semivariogram models (spherical, exponential, and Gaussian) were tested in order to choose the most suitable ones. The average standard deviations of the values estimated by the simple kriging interpolation method were less than the average standard deviations of the measured values (topsoil OM ± 0.48, N ± 0.37, subsoil OM ± 0.18). The most suitable combination was the simple kriging method with an exponential semivariogram model for topsoil, whereas the best was the simple kriging method with a spherical semivariogram model for subsoil. The results also showed that these computer-based geostatistical methods should be tested and calibrated for different experimental conditions and semivariogram models.
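Simple kriging with an exponential model, the combination found best for topsoil above, can be sketched with plain NumPy; the sample locations, OM values, sill, and range below are illustrative assumptions, not the study's fitted parameters.

```python
# NumPy-only sketch of simple kriging with an exponential covariance
# model; sample points, values, sill and range are made-up values.
import numpy as np

# Known sample locations (m) and topsoil OM values (%) -- illustrative.
pts = np.array([[0, 0], [20, 0], [0, 20], [20, 20], [40, 20]], float)
z = np.array([2.1, 2.6, 1.9, 2.4, 2.8])
mean = z.mean()                  # simple kriging assumes a known mean

sill, rng_ = 0.15, 30.0          # exponential model parameters (assumed)
cov = lambda h: sill * np.exp(-h / rng_)

def simple_krige(x0):
    d = np.linalg.norm(pts - x0, axis=1)            # point-to-target
    D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    K = cov(D)                                      # sample covariances
    w = np.linalg.solve(K, cov(d))                  # kriging weights
    return mean + w @ (z - mean)

est = simple_krige(np.array([10.0, 10.0]))          # grid-cell estimate
```

Evaluating `simple_krige` over a grid of target points yields the interpolated OM surface whose estimation standard deviations the study compares across methods.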
Abstract: The human genome is not only the evolutionary summation of all advantageous events; it also houses lesions of deleterious footprints. A single gene mutation may sometimes express multiple consequences in numerous tissues, and a linear relationship between the genotype and the phenotype may often be obscure. β-Thalassemia minor, a transfusion-independent mild anaemia, coupled with environment among other factors, may articulate into phenotypic pleiotropy with hypocholesterolemia, vitamin D deficiency, tissue hypoxia, hyperparathyroidism, and psychological alterations. The occurrence of pancreatic insufficiency, resultant steatorrhoea, vitamin D (25-OH) deficiency (13.86 ng/ml), and hypocholesterolemia (85 mg/dl) in a 30-year-old male β-thalassemia-minor patient (hemoglobin 11 g/dl with fetal hemoglobin 2.10%, Hb A2 4.60%, Hb Adult 84.80%, and an altered hemogram), with increased parathyroid hormone (62 pg/ml) and moderate serum Ca2+ (9.5 mg/dl), points toward a cascade of phenotypic pleiotropy in which the β-thalassemia mutation, be it in the 5' cap site of the mRNA, differential splicing, etc., in the heterozygous state affects several metabolic pathways. Compensatory extramedullary hematopoiesis may not have coped well with the stressful lifestyle of the young individual, and increased erythropoietic stress, with its high demand for cholesterol for RBC membrane synthesis, may have resulted in hypocholesterolemia. Oxidative stress and tissue hypoxia may have caused the pancreatic insufficiency, leading to vitamin D deficiency. This may in turn have caused secondary hyperparathyroidism to sustain the serum calcium level. The irritability and stress intolerance of the patient were a cumulative effect of this vicious cycle of metabolic compromises. From these findings we propose that the metabolic deficiencies in β-thalassemia mutations may be considered the phenotypic display of pleiotropy, to explain the genetic epidemiology.
According to the recommendations from the NIH Workshop on Gene-Environment Interplay in Common Complex Diseases: Forging an Integrative Model, the design of observational studies should be informed by gene-environment hypotheses, and the results of studies of genetic diseases should be published to inform future hypotheses. A variety of approaches is needed to capture data on all possible aspects, each of which is likely to contribute to the etiology of disease. The speakers also agreed that new statistical methods and measurement tools need to be developed to appraise information that may be missed by conventional methods, where a large sample size is needed to detect a considerable effect. A meta-analytic cohort study in the future may bring significant insight into the issue raised in the title.
Abstract: It has become crucial over the years for nations to improve their credit scoring methods and techniques in light of the increasing volatility of the global economy. Statistical methods and tools have been the favoured means for this; however, artificial intelligence and soft computing based techniques are becoming increasingly preferred due to their proficiency, precision, and relative simplicity. This work presents a comparison between Support Vector Machines and Artificial Neural Networks, two popular soft computing models, when applied to credit scoring. Among the different criteria that can be used for comparison, accuracy, computational complexity, and processing time are the criteria selected to evaluate both models. Furthermore, the German credit scoring dataset, a real-world dataset, is used to train and test both developed models. Experimental results obtained from our study suggest that although both soft computing models can be used with a high degree of accuracy, Artificial Neural Networks deliver better results than Support Vector Machines.
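The comparison described above can be sketched as follows, measuring accuracy and training time for both models; synthetic stand-in data is used in place of the German credit dataset, and the small network architecture is an assumption.

```python
# Sketch of comparing an SVM and a small neural network on a binary
# credit-scoring task; synthetic data stands in for the German credit
# dataset, and the hyperparameters are illustrative, not the paper's.
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, model in [("svm", SVC()),
                    ("ann", MLPClassifier(hidden_layer_sizes=(16,),
                                          max_iter=500,
                                          random_state=0))]:
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    results[name] = {"accuracy": model.score(X_te, y_te),
                     "train_s": time.perf_counter() - t0}
```

The `accuracy` and `train_s` entries correspond to two of the three comparison criteria the abstract names (accuracy and processing time).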
Abstract: Various methods of geofield parameter restoration (algebraic polynomials; filters; rational fractions; interpolation splines; geostatistical methods such as kriging; nearest-point search methods such as inverse distance, minimum curvature, and local polynomial interpolation; neural networks) have been analyzed, and some possible mistakes arising during geofield surface modeling have been presented.
Abstract: The fault-proneness of a software module is the probability that the module contains faults. Different techniques have been proposed to predict the fault-proneness of modules, including statistical methods, machine learning techniques, neural network techniques, and clustering techniques. The aim of the proposed study is to explore whether metrics available in the early lifecycle (i.e. requirement metrics), metrics available in the late lifecycle (i.e. code metrics), and the combination of the two can be used to identify fault-prone modules using a Genetic Algorithm technique. This approach has been tested on real defect datasets, in the C programming language, from NASA software projects. The results show that the fusion of requirement and code metrics is the best prediction model for detecting faults, compared with the commonly used code-based model.
Abstract: The growing volume of information on the Internet creates an increasing need for new (semi-)automatic methods for retrieving documents and ranking them according to their relevance to the user query. In this paper, after a brief review of ranking models, a new ontology-based approach for ranking HTML documents is proposed and evaluated under various circumstances. Our approach is a combination of conceptual, statistical, and linguistic methods. This combination preserves the precision of the ranking without losing speed. Our approach exploits natural language processing techniques to extract phrases and stem words. An ontology-based conceptual method is then used to annotate documents and expand the query. To expand a query, the spread activation algorithm is improved so that the expansion can be done along various aspects. The annotated documents and the expanded query are processed to compute the relevance degree using statistical methods. The outstanding features of our approach are (1) combining the conceptual, statistical, and linguistic features of documents, (2) expanding the query with its related concepts before comparing it to documents, (3) extracting and using both words and phrases to compute the relevance degree, (4) improving the spread activation algorithm to do the expansion based on a weighted combination of different conceptual relationships, and (5) allowing variable document vector dimensions. A ranking system called ORank has been developed to implement and test the proposed model. The test results are included at the end of the paper.
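The spread activation step used for query expansion can be sketched over a tiny concept graph; the graph, edge weights, and decay factor below are illustrative assumptions, not ORank's actual ontology or parameters.

```python
# Minimal sketch of spreading activation over a weighted concept graph
# for query expansion; graph, weights, and decay are illustrative.
decay = 0.5
graph = {                      # concept -> weighted neighbors
    "car":     {"vehicle": 0.9, "engine": 0.7},
    "vehicle": {"transport": 0.8},
    "engine":  {"fuel": 0.6},
}

def spread(seed, steps=2):
    act = {seed: 1.0}          # activation per concept
    frontier = {seed: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, a in frontier.items():
            for nb, w in graph.get(node, {}).items():
                gain = a * w * decay      # attenuated along each edge
                if gain > act.get(nb, 0.0):
                    act[nb] = nxt[nb] = gain
        frontier = nxt
    return act

expanded = spread("car")       # seed concept plus activated neighbors
```

Concepts whose activation exceeds a threshold would be added to the query; weighting edges differently per relationship type gives the "weighted combination of conceptual relationships" improvement the abstract mentions.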
Abstract: Water pipe networks are installed underground, and once installed, it is difficult to recognize the state of the pipes when a leak or burst happens. Accordingly, remedial action is often delayed until after the fault occurs. A systematic fault management system for water pipe networks is therefore required to prevent accidents and minimize losses. In this work, we develop an online fault detection system for a water pipe network using pipe data such as flow rate and pressure. A transient model describing water flow in pipelines is presented and simulated using MATLAB. Fault situations such as a leak or burst can also be simulated, and the flow rate and pressure data at the time of the fault are collected. Faults are detected using the fast Fourier transform and the discrete wavelet transform, and the two methods are compared to find which shows the better fault detection performance.
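The two transforms compared above can be sketched on a simulated pressure signal with a step-like "burst"; the signal is an assumption (the paper's MATLAB transient model is not reproduced), and a one-level Haar DWT is implemented inline so no wavelet library is needed.

```python
# Sketch of the two transforms from the abstract on a simulated
# pressure signal with a sudden drop (a stand-in for a burst).
import numpy as np

n = 256
pressure = np.full(n, 5.0)                    # bar, nominal pressure
pressure[129:] -= 1.2                         # sudden drop: "burst"
pressure += np.random.default_rng(0).normal(0, 0.05, n)  # sensor noise

# FFT: the fault's energy spreads across the detrended spectrum.
spectrum = np.abs(np.fft.rfft(pressure - pressure.mean()))

# One-level Haar DWT: detail coefficients spike at the discontinuity,
# which also localizes the fault in time (the FFT does not).
even, odd = pressure[0::2], pressure[1::2]
detail = (even - odd) / np.sqrt(2)
fault_index = int(np.argmax(np.abs(detail))) * 2   # sample location
```

The time localization given by the wavelet detail coefficients, versus the purely frequency-domain view of the FFT, is the kind of difference the comparison in the abstract would surface.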