Data Mining in Medicine Domain Using Decision Trees and Vector Support Machine

In this paper, we use data mining to extract biomedical knowledge. In general, complex biomedical data collected in population studies are treated by statistical methods; although these are robust, they are not sufficient in themselves to harness the potential wealth of the data. For that reason, in a second step we used two learning algorithms: Decision Trees and Support Vector Machines (SVM). These supervised classification methods are used to diagnose thyroid disease. In this context, we propose to promote the study and use of symbolic data mining techniques.

Forecasting Rainfall in Thailand: A Case Study of Nakhon Ratchasima Province

In this paper, we study rainfall using time series from weather stations in Nakhon Ratchasima province, Thailand, applying various statistical methods to analyse the behaviour of rainfall in the study areas. Time-series analysis is an important tool in modelling and forecasting rainfall. The ARIMA and Holt-Winters models were built on the basis of exponential smoothing. All the models proved to be adequate. It is therefore possible to provide information that can help decision makers establish strategies for the proper planning of agriculture, drainage systems and other water resource applications in Nakhon Ratchasima province. We obtained the best forecasting performance with the ARIMA(1,0,1)(1,0,1)12 model.
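As a rough illustration of the exponential smoothing idea mentioned in this abstract, the sketch below implements simple (single) exponential smoothing, the most basic member of the family to which Holt-Winters belongs. It is not the model fitted in the paper: it has no trend or seasonal component, and the rainfall values, the smoothing constant `alpha`, and the function name are all hypothetical.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels l_t = alpha * y_t + (1 - alpha) * l_{t-1}."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    # Assumption: initialise the level with the first observation.
    level = series[0]
    levels = [level]
    for y in series[1:]:
        level = alpha * y + (1.0 - alpha) * level
        levels.append(level)
    return levels


# Hypothetical monthly rainfall totals in mm (not data from the paper).
rainfall_mm = [10.0, 20.0, 30.0, 40.0]
levels = simple_exponential_smoothing(rainfall_mm, alpha=0.5)
one_step_forecast = levels[-1]  # the last level is the next-period forecast
```

Holt-Winters extends this recursion with separate smoothing equations for trend and for a seasonal index, which is what makes it suitable for monthly rainfall with a 12-month cycle.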

Variation in the Traditional Knowledge of Curcuma longa L. in North-Eastern Algeria

Curcuma longa L. (Zingiberaceae), commonly known as turmeric, has a long history of traditional uses for culinary purposes as a spice and a food colorant. The present study aimed to document the ethnobotanical knowledge about Curcuma longa, and to assess the variation in the herbalists’ experience in Northeastern Algeria. Data were collected using semi-structured questionnaires and direct interviews with 30 herbalists. Ethnobotanical indices, including the fidelity level (FL%), the relative frequency citation (RFC), and use value (UV) were determined by quantitative methods. Diversity in the level of knowledge was analyzed using univariate, non-parametric, and multivariate statistical methods. Three main categories of uses were recorded for C. longa: for food, for medicine, and for cosmetic purposes. As a medicine, turmeric was used for the treatment of gastrointestinal, dermatological, and hepatic diseases. Medicinal and food uses were correlated with both forms of preparation (rhizome and powder). The age group did not influence the use. Multivariate analyses showed a significant variation in traditional knowledge, associated with the use value, origin, quality, and efficacy of the drug. The findings suggested that the geographical origin of C. longa affected the use in Algeria.
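The three ethnobotanical indices named in this abstract can be sketched as follows, assuming their commonly used definitions: RFC is the share of informants who cite the species, UV is the mean number of uses reported per informant, and FL% is the share of citing informants who name one particular use. The abstract does not give the formulas or the counts, so the function names and all numbers below are hypothetical.

```python
def relative_frequency_of_citation(n_citing, n_informants):
    """RFC = FC / N: informants citing the species over all informants."""
    return n_citing / n_informants


def use_value(uses_per_informant):
    """UV = sum(U_i) / N: mean number of uses reported per informant."""
    return sum(uses_per_informant) / len(uses_per_informant)


def fidelity_level(n_citing_for_use, n_citing_any_use):
    """FL% = 100 * Np / N: share of citing informants naming a given use."""
    return 100.0 * n_citing_for_use / n_citing_any_use


# Hypothetical counts, not taken from the study.
rfc = relative_frequency_of_citation(n_citing=24, n_informants=30)
uv = use_value([2, 1, 3, 2])
fl = fidelity_level(n_citing_for_use=18, n_citing_any_use=24)
```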

Earthquake Classification in Molluca Collision Zone Using Conventional Statistical Methods

The Molluca Collision Zone is located at the junction of the Eurasian, Australian, Pacific and Philippine plates. Between the Sangihe arc, west of the collision zone, and the Halmahera arc to the east, there is an active collision that is convex toward the Molluca Sea. This research analyzes the behavior of earthquake occurrence in the Molluca Collision Zone: the distribution of earthquakes in each partitioned region, the type of distribution of earthquake occurrence in each region, the mean occurrence of earthquakes in each region, and the correlation between the partitioned regions. We count the number of earthquakes using a partition method and analyze their behavior using conventional statistical methods. We used data on shallow earthquakes with magnitudes ≥4 SR (period 1964-2013). From the results, we can classify the partitioned regions, based on their correlation, into two classes: strong and very strong. This classification can be used for early warning systems in disaster management.

A Follow–Up Study of Bachelor of Science Graduates in Applied Statistics from Suan Sunandha Rajabhat University during the 1999-2012 Academic Years

The purpose of this study is to follow up the graduates of the Bachelor of Science in Applied Statistics from Suan Sunandha Rajabhat University (SSRU) during the 1999-2012 academic years and to provide a fundamental guideline for developing the current curriculum according to the Thai Qualifications Framework for Higher Education (TQF: HEd). Data were collected from a sample of 75 graduates by interview and online questionnaire. The content covered five domains: Ethics and Morals; Knowledge; Cognitive Skills; Interpersonal Skills and Responsibility; and Numerical Analysis, Communication and Information Technology Skills. Data were analyzed using statistical methods such as percentiles, means, standard deviations, t-tests, and F-tests. The findings showed that the respondents were mostly female and younger than 26 years. The majority of graduates had an income in the range of 10,001-20,000 Baht and 2-5 years of work experience. In addition, overall opinions on applying the knowledge received to their work were at the "agree" level; the mean score was 3.97 and the standard deviation was 0.40. Hypothesis testing indicated that only gender was associated with a difference in opinion at the 0.05 significance level.

Observations about the Principal Components Analysis and Data Clustering Techniques in the Study of Medical Data

The statistical analysis of medical data often requires special techniques because of the particularities of these data. Principal components analysis and data clustering are two statistical methods for data mining that are very useful in the medical field, the first as a method to decrease the number of studied parameters, and the second as a method to analyze the connections between the diagnosis and the data about the patient's condition. In this paper we investigate the implications of a specific data analysis technique: data clustering preceded by a selection of the most relevant parameters, made using principal components analysis. Our assumption was that using principal components analysis before data clustering, in order to select and classify only the most relevant parameters, would improve the accuracy of clustering. The practical results showed the opposite: the clustering accuracy decreases by a percentage approximately equal to the percentage of information loss reported by the principal components analysis.

A Novel Approach to Handle Uncertainty in Health System Variables for Hospital Admissions

Hospital staff and managers are under pressure to use and manage scarce resources effectively. Hospital admissions require many decisions that have complex and uncertain consequences for hospital resource utilization and patient flow. It is challenging to predict the risk of admission and the length of stay (LOS) of a patient because of their vague nature. There is no method to capture the vague definition of a patient's admission. Also, current methods and tools used to predict patients at risk of admission fail to deal with uncertainty in unplanned admissions, LOS, and patients' characteristics. The main objective of this paper is to deal with uncertainty in health system variables and to handle uncertain relationships among variables. Introducing machine learning techniques alongside statistical methods such as regression is proposed as a solution approach for handling uncertainty in health system variables. A model that adapts fuzzy methods to handle uncertain data and uncertain relationships can be an efficient way to capture the vague definition of a patient's admission.

Neural Networks: From Black Box towards Transparent Box Application to Evapotranspiration Modeling

Neural networks are well known for their ability to model nonlinear functions but, as statistical methods usually do, they use a non-parametric approach; thus, a priori knowledge is not easily taken into account, any more than a posteriori knowledge is. To deal with these problems, an original way to encode knowledge inside the architecture is proposed. The method is applied to modeling evapotranspiration in a karstic aquifer, a problem of great practical importance for water resource management.

Multistage Condition Monitoring System of Aircraft Gas Turbine Engine

Research shows that applying probability-statistical methods is unfounded, especially at the early stage of diagnosing the technical condition of an aviation Gas Turbine Engine (GTE), when the flight information is fuzzy, limited and uncertain. Hence the efficiency of applying the new Soft Computing technology at these diagnosing stages, using Fuzzy Logic and Neural Network methods, is considered. For this purpose, fuzzy multiple linear and non-linear models (fuzzy regression equations), obtained on the basis of statistical fuzzy data, are trained with high accuracy. To build a more adequate model of the GTE technical condition, the dynamics of changes in the skewness and kurtosis coefficients are analysed. Studies of the changes in the skewness and kurtosis coefficient values show that the distributions of GTE work parameters have a fuzzy character. Hence consideration of fuzzy skewness and kurtosis coefficients is expedient. Investigation of the dynamics of changes in the basic characteristics of GTE work parameters leads to the conclusion that Fuzzy Statistical Analysis is necessary for the preliminary identification of the engines' technical condition. Studies of the changes in the correlation coefficient values also show their fuzzy character. Therefore, the results of Fuzzy Correlation Analysis are proposed for model selection. When sufficient information is available, a recurrent algorithm for identifying the aviation GTE technical condition (using Hard Computing technology) is proposed, based on measurements of the input and output parameters of the multiple linear and non-linear generalised models in the presence of measurement noise (the new recursive Least Squares Method (LSM)). The developed GTE condition monitoring system provides stage-by-stage estimation of the engine technical condition. As an application of this technique, the technical condition of a new operating aviation engine was estimated.

Texture Feature Extraction of Infrared River Ice Images using Second-Order Spatial Statistics

Ice cover has a significant impact on rivers: it affects the ice melting capacity, which results in flooding, restricts navigation, and modifies the ecosystem and microclimate. River ice is made up of different ice types of varying thickness, so surveillance of river ice plays an important role. River ice types are captured using an infrared imaging camera, which captures images even at night. In this paper, infrared texture images of river ice are analysed using first-order and second-order statistical methods. The second-order statistical methods considered are the spatial gray level dependence method, the gray level run length method and the gray level difference method. The performance of the feature extraction methods is evaluated using a Probabilistic Neural Network classifier, and it is found that the first-order and second-order statistical methods each yield low accuracy when used alone. The features extracted by the first-order and second-order statistical methods are therefore combined, and it is observed that the combined features (first-order statistical method + gray level run length method) provide higher accuracy than the features from either method alone.
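To make the spatial gray level dependence method concrete, the sketch below builds a gray level co-occurrence matrix for one pixel offset and derives two common texture features from it (contrast and energy). It is an illustrative sketch only: the 4 x 4 image, the number of gray levels, the offset and the function names are assumptions, and the abstract does not state which features or offsets the authors used.

```python
def cooccurrence_counts(image, dx, dy, levels):
    """Count pairs (i, j): gray level i at (r, c) and j at (r + dy, c + dx)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def normalise(counts):
    """Turn raw counts into joint probabilities that sum to 1."""
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Sum of (i - j)^2 * p(i, j); large for images with sharp local changes."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))


def energy(p):
    """Sum of p(i, j)^2; large for uniform textures."""
    return sum(v * v for row in p for v in row)


# Hypothetical 4 x 4 image quantised to 4 gray levels (0..3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
counts = cooccurrence_counts(image, dx=1, dy=0, levels=4)  # right-hand neighbour
p = normalise(counts)
contrast_value = contrast(p)
energy_value = energy(p)
```

Features such as these, computed for several offsets, would form the input vector for a classifier like the Probabilistic Neural Network mentioned in the abstract.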

Forecasting e-Learning Efficiency by Using Artificial Neural Networks and a Balanced Score Card

Forecasting the values of the indicators that characterize the effectiveness of an organization's performance is of great importance for its successful development. Such forecasting is necessary in order to assess the current state and to foresee future developments, so that measures to improve the organization's activity can be undertaken in time. The article presents an overview of the applied mathematical and statistical methods for developing forecasts. Special attention is paid to artificial neural networks as a forecasting tool. Their strengths and weaknesses are analyzed, and a synopsis is made of the application of artificial neural networks to forecasting the values of different education efficiency indicators. A method for evaluating the activity of universities using the Balanced Scorecard is proposed, and Key Performance Indicators (KPI) for the assessment of e-learning are selected. Resulting indicators for evaluating the efficiency of the activity are proposed. An artificial neural network is constructed and applied to forecasting the values of indicators of e-learning efficiency on the basis of the KPI values.

Model Discovery and Validation for the Qsar Problem using Association Rule Mining

There are several approaches to solving the Quantitative Structure-Activity Relationship (QSAR) problem. These approaches are based either on statistical methods or on predictive data mining. Among the statistical methods, one should consider regression analysis, pattern recognition (such as cluster analysis, factor analysis and principal components analysis) or partial least squares. Predictive data mining techniques use neural networks, genetic programming, or neuro-fuzzy knowledge. These approaches have a low explanatory capability or none at all. This paper attempts to establish a new approach to solving QSAR problems using descriptive data mining. In this way, the relationship between the chemical properties and the activity of a substance would be comprehensibly modeled.

Evaluation of Clustering Based on Preprocessing in Gene Expression Data

Microarrays have become effective, broadly used tools in biological and medical research to address a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing and systematizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we assess and compare the performance of several clustering methods based on data preprocessing, including normalization and noise-cleaning strategies. We also evaluate each of these clustering methods with validation measures for both simulated data and real gene expression data. We conclude that the clustering methods commonly used in microarray data analysis are affected by normalization and by the degree of noise and cleanness of the datasets.

Simulation of Organic Matter Variability on a Sugarbeet Field Using the Computer Based Geostatistical Methods

Computer based geostatistical methods can offer effective data analysis possibilities for agricultural areas by using vector data and the objective information they carry. These methods help to detect spatial changes across different locations of large agricultural lands, which leads to effective fertilization for optimal yield with reduced environmental pollution. In this study, topsoil (0-20 cm) and subsoil (20-40 cm) samples were taken from a sugar beet field on a 20 x 20 m grid. Plant samples were also collected from the same plots. Some physical and chemical analyses of these samples were made by routine methods. According to the derived coefficients of variation, topsoil organic matter (OM) was more variable than subsoil OM. The highest C.V. value, 17.79%, was found for topsoil OM. The data were analyzed comparatively using kriging methods, which are widely used in geostatistics. Several interpolation methods (Ordinary, Simple and Universal) and semivariogram models (Spherical, Exponential and Gaussian) were tested in order to choose the suitable methods. The average standard deviations of values estimated by the simple kriging interpolation method were less than the average standard deviations of the measured values (topsoil OM ± 0.48, N ± 0.37, subsoil OM ± 0.18). The most suitable combination for topsoil was the simple kriging method with the exponential semivariogram model, whereas the best combination for subsoil was the simple kriging method with the spherical semivariogram model. The results also showed that these computer based geostatistical methods should be tested and calibrated for different experimental conditions and semivariogram models.
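The three semivariogram models compared in this abstract can be written down as short functions. The sketch below assumes one common parameterisation (nugget, partial sill, and range `a`, with the factor 3 in the exponential and Gaussian models so that `a` acts as a practical range); other texts and software omit that factor, and the abstract does not say which convention or which parameter values were used. All numbers are hypothetical.

```python
import math


def spherical(h, nugget, psill, a):
    """Spherical model: reaches the sill (nugget + psill) exactly at h = a."""
    if h == 0:
        return 0.0
    if h >= a:
        return nugget + psill
    r = h / a
    return nugget + psill * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, psill, a):
    """Exponential model: approaches the sill asymptotically (about 95% at h = a)."""
    if h == 0:
        return 0.0
    return nugget + psill * (1.0 - math.exp(-3.0 * h / a))


def gaussian(h, nugget, psill, a):
    """Gaussian model: parabolic near the origin, so very smooth at short lags."""
    if h == 0:
        return 0.0
    return nugget + psill * (1.0 - math.exp(-3.0 * (h / a) ** 2))


# Hypothetical parameters: lags in metres on a 20 m sampling grid.
params = dict(nugget=0.1, psill=0.9, a=60.0)
lags = [20.0, 40.0, 60.0, 80.0]
table = {
    "spherical": [spherical(h, **params) for h in lags],
    "exponential": [exponential(h, **params) for h in lags],
    "gaussian": [gaussian(h, **params) for h in lags],
}
```

In a kriging workflow, each model would be fitted to the empirical semivariogram of the soil samples and the fits compared, for example by cross-validation error.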

Investigation of Genetic Epidemiology of Metabolic Compromises in ß Thalassemia Minor Mutation: Phenotypic Pleiotropy

The human genome is not only the evolutionary summation of all advantageous events, but also carries the footprints of deleterious lesions. A single gene mutation may sometimes express multiple consequences in numerous tissues, and a linear relationship between the genotype and the phenotype may often be obscure. ß Thalassemia minor, a transfusion-independent mild anaemia, coupled with environment among other factors, may manifest as phenotypic pleiotropy with hypocholesterolemia, vitamin D deficiency, tissue hypoxia, hyperparathyroidism and psychological alterations. The occurrence of pancreatic insufficiency, resultant steatorrhoea, vitamin D (25-OH) deficiency (13.86 ng/ml) and hypocholesterolemia (85 mg/dl) in a 30-year-old male ß Thal-minor patient (Hemoglobin 11mg/dl with Fetal Hemoglobin 2.10%, Hb A2 4.60% and Hb Adult 84.80% and altered hemogram), with increased parathyroid hormone (62 pg/ml) and moderate serum Ca2+ (9.5mg/ml), points towards a cascade of phenotypic pleiotropy in which the ß Thalassemia mutation in the heterozygous state, be it in the 5' cap site of the mRNA, differential splicing, etc., is affecting several metabolic pathways. Compensatory extramedullary hematopoiesis may not have coped well with the stressful lifestyle of the young individual, and increased erythropoietic stress with a high demand for cholesterol for RBC membrane synthesis may have resulted in hypocholesterolemia. Oxidative stress and tissue hypoxia may have caused the pancreatic insufficiency, leading to vitamin D deficiency. This may in turn have caused the secondary hyperparathyroidism that sustains the serum calcium level. The irritability and stress intolerance of the patient were a cumulative effect of the vicious cycle of metabolic compromises. From these findings we propose that the metabolic deficiencies in the ß Thalassemia mutations may be considered the phenotypic display of pleiotropy, which explains the genetic epidemiology.
According to the recommendations from the NIH Workshop on Gene-Environment Interplay in Common Complex Diseases: Forging an Integrative Model, the study design of observations should be informed by gene-environment hypotheses, and the results of a study (genetic diseases) should be published to inform future hypotheses. A variety of approaches is needed to capture data on all possible aspects, each of which is likely to contribute to the etiology of disease. Speakers also agreed that there is a need to develop new statistical methods and measurement tools to appraise information that may be missed by conventional methods, where a large sample size is needed to isolate a considerable effect. A future meta-analytic cohort study may bring significant insight into the subject of the title.

A Comparison of Different Soft Computing Models for Credit Scoring

It has become crucial over the years for nations to improve their credit scoring methods and techniques in light of the increasing volatility of the global economy. Statistical methods and tools have been the favoured means for this; however, artificial intelligence or soft computing based techniques are becoming increasingly preferred because of their proficient and precise nature and relative simplicity. This work presents a comparison between Support Vector Machines and Artificial Neural Networks, two popular soft computing models, when applied to credit scoring. Among the different criteria that can be used for comparison, accuracy, computational complexity and processing time are the criteria selected to evaluate both models. Furthermore, the German credit scoring dataset, a real-world dataset, is used to train and test both models. Experimental results from our study suggest that, although both soft computing models can be used with a high degree of accuracy, Artificial Neural Networks deliver better results than Support Vector Machines.

Interpolation of Geofield Parameters

Various methods of restoring geofield parameters (algebraic polynomials; filters; rational fractions; interpolation splines; geostatistical methods, namely kriging; nearest-point search methods, namely inverse distance, minimum curvature and local polynomial interpolation; neural networks) have been analyzed, and some possible mistakes arising during geofield surface modeling are presented.
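Of the methods listed in this abstract, inverse distance interpolation is the simplest to show in a few lines. The sketch below is a minimal version under stated assumptions: it uses every sample point rather than a nearest-neighbour search, a distance exponent of 2, and hypothetical sample data and function name.

```python
import math


def inverse_distance_weighting(samples, x, y, power=2.0):
    """Estimate the field at (x, y) as a weighted mean of sample values,
    with weights proportional to 1 / distance**power."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # query coincides with a sample point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical samples as (x, y, measured value).
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint_estimate = inverse_distance_weighting(samples, 1.0, 0.0)
at_sample_estimate = inverse_distance_weighting(samples, 0.0, 0.0)
```

One known weakness of this scheme, and a typical source of the modeling mistakes the abstract alludes to, is that estimates can never fall outside the range of the sampled values, so peaks and troughs between samples are flattened.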

A Study on Early Prediction of Fault Proneness in Software Modules using Genetic Algorithm

Fault-proneness of a software module is the probability that the module contains faults. To predict the fault-proneness of modules, different techniques have been proposed, including statistical methods, machine learning techniques, neural network techniques and clustering techniques. The aim of the proposed study is to explore whether metrics available in the early lifecycle (i.e. requirement metrics), metrics available in the late lifecycle (i.e. code metrics), and the combination of the two can be used to identify fault-prone modules using the Genetic Algorithm technique. This approach has been tested with real-time defect datasets of NASA software projects written in the C programming language. The results show that the fusion of requirement and code metrics is the best prediction model for detecting faults, compared with the commonly used code-based model.

ORank: An Ontology Based System for Ranking Documents

The growing volume of information on the internet creates an increasing need to develop new (semi)automatic methods for retrieving documents and ranking them according to their relevance to the user query. In this paper, after a brief review of ranking models, a new ontology based approach for ranking HTML documents is proposed and evaluated in various circumstances. Our approach is a combination of conceptual, statistical and linguistic methods. This combination preserves the precision of ranking without losing speed. Our approach exploits natural language processing techniques for extracting phrases and stemming words. An ontology based conceptual method is then used to annotate documents and expand the query. To expand a query, the spread activation algorithm is improved so that the expansion can be done in various aspects. The annotated documents and the expanded query are processed to compute the relevance degree using statistical methods. The outstanding features of our approach are (1) combining conceptual, statistical and linguistic features of documents, (2) expanding the query with its related concepts before comparing it to documents, (3) extracting and using both words and phrases to compute the relevance degree, (4) improving the spread activation algorithm to do the expansion based on a weighted combination of different conceptual relationships and (5) allowing variable document vector dimensions. A ranking system called ORank is developed to implement and test the proposed model. The test results are included at the end of the paper.

Fault Detection of Pipeline in Water Distribution Network System

A water pipe network is installed underground and, once it is in place, it is difficult to recognize the state of the pipes when a leak or burst happens. Accordingly, post-fault management is often delayed after the fault occurs. A systematic fault management system for the water pipe network is therefore required to prevent accidents and minimize losses. In this work, we develop an online fault detection system for a water pipe network using pipe data such as flow rate or pressure. A transient model describing water flow in pipelines is presented and simulated using MATLAB. Fault situations such as a leak or burst can also be simulated, and the flow rate or pressure data when the fault happens are collected. Faults are detected using statistical methods based on the fast Fourier transform and the discrete wavelet transform, and the two are compared to find which method shows the better fault detection performance.
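As a hedged illustration of the frequency-domain side of this abstract, the sketch below computes the magnitude spectrum of a sampled signal and picks the dominant frequency bin. It uses a direct O(n^2) discrete Fourier transform for clarity; a fast Fourier transform produces the same spectrum more efficiently. The synthetic sine wave stands in for a pressure record and, like the function name, is purely hypothetical. How the authors turn spectra into a fault decision is not described in the abstract.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of DFT bins 0..n/2 of a real-valued signal."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coefficient))
    return magnitudes


# Synthetic stand-in for a pressure trace: 3 full cycles over 32 samples.
n_samples = 32
pressure = [math.sin(2 * math.pi * 3 * t / n_samples) for t in range(n_samples)]
magnitudes = dft_magnitudes(pressure)
dominant_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
```

A detector built on this idea might compare the spectrum of incoming data against a spectrum recorded under normal operation and flag large deviations, but that decision rule is an assumption rather than something stated in the source.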