Abstract: Governments are playing an increasingly active role in reducing carbon emissions, and a key strategy has been the introduction of mandatory energy disclosure policies. These policies have generated a significant amount of publicly available data, giving researchers a unique opportunity to develop location-specific energy and carbon emission benchmarks, which can in turn be used to define building archetypes and inform urban energy models. This study presents the development of such a benchmark using public reporting data. Data from Ontario's Ministry of Energy for Post-Secondary Educational Institutions are used to develop a series of building-archetype dynamic building loads and energy benchmarks to fill a gap in the currently available building database. This paper presents the development of a benchmark for college and university residences within ASHRAE climate zone 6 areas in Ontario using the mandatory disclosure energy and greenhouse gas emissions data. The methodology includes data cleaning, statistical analysis, and benchmark development, and lessons learned from this investigation are presented and discussed to inform the development of future energy benchmarks from this larger data set. The key findings from this initial benchmarking study are: (1) careful data screening and outlier identification are essential for developing a valid dataset; (2) the key features used to model the data are building age, size, and occupancy schedules, and these can be used to estimate energy consumption; and (3) policy changes affecting primary energy generation significantly affected greenhouse gas emissions, and consideration of these factors was critical to evaluating the validity of the reported data.
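To illustrate the data-screening step highlighted in finding (1), a minimal sketch of interquartile-range outlier flagging is shown below; the column name `site_eui` and the 1.5*IQR rule are illustrative assumptions, not the study's actual screening criteria.

```python
import pandas as pd

def flag_eui_outliers(df: pd.DataFrame, col: str = "site_eui") -> pd.DataFrame:
    """Flag records whose energy use intensity lies outside 1.5*IQR.

    A common screening rule for reported energy data; the study's
    actual criteria may differ.
    """
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    out = df.copy()
    out["is_outlier"] = (out[col] < lower) | (out[col] > upper)
    return out

# Hypothetical residence records from the disclosure data
records = pd.DataFrame({"site_eui": [180.0, 210.5, 1950.0, 195.2, 8.3]})
screened = flag_eui_outliers(records)
print(screened[~screened["is_outlier"]])  # keep plausible records only
```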
Abstract: Part-of-speech tagging has always been a challenging task in Natural Language Processing. This article presents POS tagging for Nepali text using a Hidden Markov Model (HMM) and the Viterbi algorithm. The annotated Nepali corpus is randomly split into training and testing data sets, and both methods are applied to them. The Viterbi algorithm is found to be computationally faster and more accurate than the plain HMM approach, achieving an accuracy of 95.43%. An error analysis of where the mismatches took place is discussed in detail.
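A minimal sketch of Viterbi decoding over a trained HMM tagger is shown below; the toy tag set, probability tables, and the 1e-12 floor for unseen emissions are illustrative assumptions, not the paper's Nepali model.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most likely tag sequence for `words` under an HMM.

    start_p[t], trans_p[prev][t], emit_p[t][word] are probabilities
    estimated from the annotated corpus (smoothing omitted for brevity).
    """
    # Each layer maps tag -> (best log-probability, back-pointer tag)
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], 1e-12)), None)
          for t in tags}]
    for w in words[1:]:
        V.append({t: max((V[-1][p][0] + math.log(trans_p[p][t])
                          + math.log(emit_p[t].get(w, 1e-12)), p) for p in tags)
                  for t in tags})
    best = max(tags, key=lambda t: V[-1][t][0])
    path = [best]
    for layer in reversed(V[1:]):           # follow back-pointers
        path.append(layer[path[-1]][1])
    return list(reversed(path))

# Toy usage with hypothetical tags and probabilities
tags = ["NN", "VB"]
start_p = {"NN": 0.7, "VB": 0.3}
trans_p = {"NN": {"NN": 0.4, "VB": 0.6}, "VB": {"NN": 0.8, "VB": 0.2}}
emit_p = {"NN": {"ram": 0.5, "ghar": 0.5}, "VB": {"gayo": 1.0}}
print(viterbi(["ram", "ghar", "gayo"], tags, start_p, trans_p, emit_p))  # ['NN', 'NN', 'VB']
```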
Abstract: The traditional k-means algorithm has been widely used as a simple and efficient clustering method. However, the algorithm often converges to local minima because it is sensitive to the initial cluster centers. In this paper, an algorithm for selecting initial cluster centers on the basis of a minimum spanning tree (MST) is presented. Vertices of the MST with the same degree are treated as a group and used to find the skeleton data points. Furthermore, a distance measure between the skeleton data points that takes both vertex degree and Euclidean distance into account is presented. Finally, an MST-based initialization method for the k-means algorithm is presented, and its time complexity is analyzed. The presented algorithm is tested on five data sets from the UCI Machine Learning Repository, and the experimental results illustrate its effectiveness compared to three existing initialization methods.
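The general idea of seeding k-means from an MST can be sketched as follows. This simplified version removes the longest MST edges and uses component means as initial centers; it does not reproduce the paper's degree-based skeleton points or its combined degree/Euclidean distance, so it illustrates the family of methods rather than the proposed algorithm.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def mst_informed_centers(X: np.ndarray, k: int) -> np.ndarray:
    """Choose k initial centers by cutting the k-1 longest MST edges
    and taking the mean of each resulting connected component."""
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).toarray()
    longest = np.argsort(mst, axis=None)[::-1][: k - 1]   # heaviest edges
    mst.flat[longest] = 0.0                               # cut them
    _, labels = connected_components(mst + mst.T, directed=False)
    return np.vstack([X[labels == c].mean(axis=0) for c in range(k)])

X = np.random.rand(200, 2)
centers = mst_informed_centers(X, k=3)
km = KMeans(n_clusters=3, init=centers, n_init=1).fit(X)
```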
Abstract: 3D models obtained with advanced survey techniques such as close-range photogrammetry and laser scanning are nowadays particularly appreciated in the Cultural Heritage and Archaeology fields. In order to produce high-quality models representing archaeological evidence and anthropological artifacts, the appearance of the model (i.e., color), beyond its geometric accuracy, is not a negligible aspect. The integration of close-range photogrammetry with laser scanning is still a topic of study and research. Combining point cloud data sets of the same object generated with both technologies, or with the same technology but registered at different moments and/or under different natural light conditions, can produce a final point cloud with accentuated color dissimilarities. In this paper, a methodology to harmonize the different data sets, improve the chromatic quality, and highlight further details by balancing the point colors is presented.
Abstract: The primary approach for estimating bridge deterioration uses Markov-chain models and regression analysis. Traditional Markov models have problems estimating the required transition probabilities when a small sample size is used. Often, reliable bridge data have not been collected over long periods, so large data sets may not be available. This study presents an important change to the traditional approach by using the Small Data Method to estimate transition probabilities. The results illustrate that the Small Data Method and the traditional approach provide similar estimates; however, the former provides results that are more conservative. That is, the Small Data Method provided slightly lower bridge condition ratings than the traditional approach. Considering that bridges are critical infrastructure, the Small Data Method, which uses more information and provides more conservative estimates, may be more appropriate when the available sample size is small. In addition, regression analysis was used to model bridge deterioration. Condition ratings were determined for bridge groups, and the best regression model was selected for each group. The results obtained were very similar to those obtained using Markov chains; however, it is desirable to use more data for better results.
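For context, the traditional frequency-based transition-probability estimate that the Small Data Method is meant to improve upon can be sketched as below; the condition-rating scale and the sample counts are hypothetical.

```python
import numpy as np

def transition_matrix(ratings_before, ratings_after, n_states=10):
    """Traditional estimate: P[i, j] = count(i -> j) / count(i).

    With small samples many rows are poorly estimated, which motivates
    the Small Data Method discussed in the abstract (not shown here).
    """
    counts = np.zeros((n_states, n_states))
    for i, j in zip(ratings_before, ratings_after):
        counts[i, j] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

# Hypothetical condition ratings (0 = failed, 9 = excellent) over one cycle
before = [9, 9, 8, 8, 8, 7, 7, 6]
after  = [9, 8, 8, 7, 8, 7, 6, 6]
P = transition_matrix(before, after)
```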
Abstract: Within geostatistics research, effective estimation of the variogram points has been examined, particularly in developing robust alternatives. The parametric fit of these variogram points, which eventually defines the kriging weights, however, has not received the same attention from a robust perspective. This paper proposes the use of the non-linear Wilcoxon norm over weighted non-linear least squares as a robust variogram fitting alternative. First, we introduce the concept of variogram estimation and fitting. Then, as an alternative to non-linear weighted least squares, we discuss the non-linear Wilcoxon estimator. Next, the robustness properties of the non-linear Wilcoxon estimator are demonstrated using a contaminated spatial data set. Finally, under simulated conditions, spatial processes with increasing levels of contamination have their variogram points estimated and fit. In the fitting of these variogram points, both non-linear weighted least squares and non-linear Wilcoxon fits are examined for efficiency. At all levels of contamination (including 0%), using a robust estimation and robust fitting procedure, the non-weighted Wilcoxon outperforms weighted least squares.
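For context, a minimal sketch of empirical variogram estimation and a weighted least-squares fit of a spherical model is given below; the robust Wilcoxon-norm fit proposed in the abstract is not reproduced, and the estimator, model, and weights shown are standard textbook choices used only for illustration.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.distance import pdist

def empirical_variogram(coords, values, n_bins=15):
    """Classical (Matheron) semivariogram points; robust point estimators
    such as Cressie-Hawkins would replace the plain mean used here."""
    h = pdist(coords)
    g = 0.5 * pdist(values.reshape(-1, 1), metric="sqeuclidean")
    edges = np.linspace(0, h.max(), n_bins + 1)
    idx = np.digitize(h, edges) - 1
    lags, gamma = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            lags.append(h[mask].mean())
            gamma.append(g[mask].mean())
    return np.array(lags), np.array(gamma)

def spherical(h, nugget, sill, rng):
    s = np.clip(h / rng, 0.0, 1.0)
    return nugget + sill * (1.5 * s - 0.5 * s ** 3)

def fit_wls(lags, gamma, weights):
    """Weighted least-squares fit of the spherical model parameters."""
    resid = lambda p: np.sqrt(weights) * (spherical(lags, *p) - gamma)
    return least_squares(resid, x0=[0.01, gamma.max(), lags.max() / 2]).x
```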
Abstract: Maintenance costs represent a significant percentage of production costs, and analyzing them can help achieve greater productivity and competitiveness. With this in mind, the maintenance of machines and installations is considered an essential part of organizational functions, and applying effective strategies creates significant added value in manufacturing activities. Organizations are trying to achieve performance levels on a global scale, with emphasis on creating competitive advantage, through methods such as RCM (Reliability-Centered Maintenance) and TPM (Total Productive Maintenance). In this study, increasing the capacity of the Concentration Plant of Golgohar Iron Ore Mining & Industrial Company (GEG) was examined using reliability and maintainability analyses. The results showed that, instead of increasing the number of machines to solve the bottleneck problems, improving reliability and maintainability would solve them in the best way. It should be mentioned that the data set of the Concentration Plant of GEG was used and analyzed as a case study.
Abstract: The rapid development of the Confucius Institute is attracting more and more attention from mainstream media around the world. Mainstream media play a large role in public information dissemination and public opinion. This study presents efforts to analyze the correlation and functional relationships between domestic and foreign mainstream media by analyzing the volume of reports on the Confucius Institute. Three correlation measures, the Pearson correlation coefficient (PCC), the Spearman correlation coefficient (SCC), and the Kendall rank correlation coefficient (KCC), were applied to analyze the correlations among mainstream media from three regions: mainland China; Hong Kong and Macao (the two special administrative regions of China, denoted as SARs); and overseas countries excluding China, such as the United States, England, and Canada. Further, the paper measures the functional relationships among the regions using a regression model. The experimental analyses found high correlations among mainstream media from the different regions. Additionally, we found a linear relationship between the mainstream media of overseas countries and those of the SARs by analyzing the volume of reports on the Confucius Institute, based on a data set obtained by crawling the websites of 106 mainstream media outlets from 2004 to 2014.
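The three correlation measures named above are available in SciPy; a minimal sketch with hypothetical monthly report counts (not the crawled data) is shown below.

```python
import numpy as np
from scipy import stats

# Hypothetical monthly report counts for two regions (illustrative only)
mainland = np.array([12, 18, 25, 30, 28, 45, 60, 72])
overseas = np.array([8, 15, 20, 26, 31, 40, 55, 70])

pcc, _ = stats.pearsonr(mainland, overseas)    # linear association
scc, _ = stats.spearmanr(mainland, overseas)   # rank correlation
kcc, _ = stats.kendalltau(mainland, overseas)  # concordance of rank pairs
print(f"PCC={pcc:.3f}  SCC={scc:.3f}  KCC={kcc:.3f}")
```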
Abstract: Entrepreneurship is mostly related to the founding of organizations. In growing business organizations, the concept of entrepreneurship broadens: it reveals itself through new business creation within the active organization; through renewal, change, innovation, and the development of the current organization; and through breaking and changing established rules inside or outside the organization, making it more flexible, adaptive, and competitive and improving the effectiveness of its activity. The topic of entrepreneurship therefore relates the creation of firms to the personal/individual characteristics of the entrepreneurs and their social context. This paper is an empirical study that aims to address these two gaps in the literature. For this endeavor, we use the latest available data from the Global Entrepreneurship Monitor (GEM) project. This data set is widely regarded as a unique source of information about entrepreneurial activity, as well as the aspirations and attitudes of individuals across a wide number of countries and territories worldwide. This paper contributes to filling this gap by exploring the key drivers of innovative entrepreneurship in the tourism sector. Our findings are consistent with the existing literature in terms of the individual characteristics of entrepreneurs, but quite surprisingly we find an inverted U-shaped relationship between human development and innovative entrepreneurship in the tourism sector. It is revealed that tourism entrepreneurs are less likely to have innovative products compared with entrepreneurs in medium-developed countries.
Abstract: Speaker recognition is performed in high Additive White Gaussian Noise (AWGN) environments using principles of Computational Auditory Scene Analysis (CASA). CASA methods often classify sounds from images in the time-frequency (T-F) plane, using spectrograms or cochleagrams as the image. In this paper, atomic decomposition implemented by matching pursuit transforms time-series speech signals into the T-F plane. The atomic decomposition creates a sparsely populated T-F vector in "weight space", where each populated T-F position contains an amplitude weight. The weight-space vector, along with the atomic dictionary, represents a denoised, compressed version of the original signal. The arrangement of the atomic indices in the T-F vector is used for classification. Unsupervised feature learning implemented by a sparse autoencoder learns a single dictionary of basis features from a collection of envelope samples from all speakers. The approach is demonstrated using pairs of speakers from the TIMIT data set. Pairs of speakers are selected randomly from a single district. Each speaker has 10 sentences: two are used for training and eight for testing. Atomic index probabilities are created for each training sentence and for each test sentence, and classification is performed by finding the lowest Euclidean distance between the probabilities of the training sentences and those of the test sentences. Training is done at a 30 dB Signal-to-Noise Ratio (SNR), and testing is performed at SNRs of 0 dB, 5 dB, 10 dB, and 30 dB. The algorithm has a baseline classification accuracy of ~93% averaged over 10 pairs of speakers from the TIMIT data set. The baseline accuracy is attributable to the short sequences of training and test data as well as the overall simplicity of the classification algorithm. The accuracy is not affected by AWGN and remains ~93% at 0 dB SNR.
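The final decision step described above (nearest training sentence by Euclidean distance between atomic-index probability vectors) is simple enough to sketch directly; the matching-pursuit decomposition and sparse-autoencoder dictionary learning are not reproduced, and the probability vectors below are hypothetical.

```python
import numpy as np

def nearest_speaker(test_hist, train_hists):
    """Return the speaker whose training index-probability vector is
    closest in Euclidean distance to the test vector."""
    speakers = list(train_hists)
    dists = [np.linalg.norm(test_hist - train_hists[s]) for s in speakers]
    return speakers[int(np.argmin(dists))]

# Hypothetical normalized atomic-index histograms for two TIMIT speakers
train = {"spk_a": np.array([0.5, 0.3, 0.2]), "spk_b": np.array([0.2, 0.3, 0.5])}
test = np.array([0.45, 0.35, 0.20])
print(nearest_speaker(test, train))  # -> spk_a
```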
Abstract: Vector-borne diseases are transmitted to humans by mosquitoes, sandflies, bugs, ticks, and other vectors. Some are transmitted back to vectors if an infected human has a new contact while infection levels are high. The vector remains infected for life and can transmit infectious diseases not only between humans but also from animals to humans. Some vector-borne diseases are very disabling and account for more than one million deaths worldwide. Mosquitoes from the Culex pipiens sl. complex are the most abundant in Portugal, and we currently have at our disposal a data set from the surveillance program that has been carried out since 2006 across the country. All mosquito species are included, but the large coverage of Culex pipiens sl. and its importance for public health make this vector an interesting candidate for assessing the risk of disease amplification. This work focuses on ports and airports, identified as key areas of high vector density. Since mosquitoes are ectothermic organisms, the main factor for vector survival and pathogen development is temperature. Minimum and maximum local air temperatures for each area of interest are averaged by month from data gathered daily at the national network of meteorological stations and interpolated in a geographic information system (GIS). The temperature ranges ideal for several pathogens are known, and this work shows how to use them together with the meteorological data at each port and airport facility to focus an efficient implementation of countermeasures and simultaneously reduce transmission risk and mitigation costs. The results show an increased alert level with decreasing latitude, which corresponds to higher minimum and maximum temperatures and a narrower daily temperature range.
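The comparison of monthly temperatures against a pathogen-favourable window can be sketched as below; the facility names, temperature values, and the 18-32 °C window are assumptions for illustration, not the study's data or thresholds.

```python
import pandas as pd

# Hypothetical monthly mean minimum/maximum air temperatures (deg C)
monthly = pd.DataFrame({
    "facility": ["airport A", "port B", "airport C"],
    "t_min": [14.2, 16.8, 18.5],
    "t_max": [24.1, 27.3, 29.0],
})
PATHOGEN_RANGE = (18.0, 32.0)  # assumed favourable window for the pathogen

# Raise an alert where the whole daily range sits inside the window
monthly["alert"] = (monthly["t_min"] >= PATHOGEN_RANGE[0]) & \
                   (monthly["t_max"] <= PATHOGEN_RANGE[1])
print(monthly)
```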
Abstract: The development of change prediction models can help software practitioners plan testing and inspection resources in the early phases of software development. However, a major challenge faced during the training of any classification model is the imbalanced nature of software quality data. Data with very few instances of the minority outcome category lead to an inefficient learning process, and a classification model developed from imbalanced data generally does not predict these minority categories correctly. Thus, for a given dataset, a minority of classes may be change-prone whereas a majority of classes may be non-change-prone. This study explores various alternatives for adeptly handling imbalanced software quality data using different sampling methods and effective MetaCost learners. The study also analyzes and justifies the use of different performance metrics when dealing with imbalanced data. In order to empirically validate the different alternatives, the study uses change data from three application packages of the open-source Android data set and evaluates the performance of six different machine learning techniques. The results indicate extensive improvement in the performance of the classification models when using resampling methods and robust performance measures.
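One of the simplest resampling strategies evaluated in this line of work, random oversampling of the change-prone class, can be sketched as below; the column name `change_prone` is hypothetical, and the MetaCost learners mentioned in the abstract are not shown.

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(df: pd.DataFrame, target: str = "change_prone") -> pd.DataFrame:
    """Random oversampling: duplicate minority-class rows until the two
    classes are balanced, then shuffle."""
    majority = df[df[target] == 0]
    minority = df[df[target] == 1]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    return pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
```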
Abstract: The objectives of this research were to study the levels of the influencing factors (leadership, supply chain management, innovation, competitive advantage, and business success) and the factors affecting the business success of mobile phone system service providers in the Bangkok Metropolitan area. The research combined quantitative and qualitative approaches. The quantitative approach used questionnaires to collect data from 331 mobile service shop managers franchised by AIS, Dtac, and TrueMove. The mobile phone system service providers/shop managers were randomly stratified and proportionally allocated into subgroups according to the number of providers in each network. For the qualitative method, in-depth interviews were conducted with six mobile service providers/managers of Telewiz, Dtac, and TrueMove shops to find agreement or disagreement, using the content analysis method. Descriptive statistics, including frequency, percentage, mean, and standard deviation, were employed, and the Structural Equation Model (SEM) was used as a tool for data analysis. The content analysis method was applied to identify key patterns emerging from the interview responses. The two data sets were brought together for comparison and contrast to form the findings, providing triangulation to enrich the interpretation of results. The analysis revealed that the influencing factors of leadership, innovation management, supply chain management, and business competitiveness had an impact at a great level, while the levels of innovation and of the business financial and non-financial success of the mobile phone system service providers in the Bangkok Metropolitan area were at the highest level. Moreover, the factors influencing competitive advantage in the business of mobile system service providers, namely leadership, supply chain management, innovation management, business advantage, and business success, had statistical significance at the .01 level, which corresponded to the data from the interviews.
Abstract: Electrical impedance tomography (EIT) is a non-invasive imaging technique suitable for medical applications. This paper describes an electrical impedance tomography device with the ability to navigate a robotic arm to manipulate a target object. The design of the device includes hardware and software sections to perform the imaging and control the robotic arm. In the hardware section, an image is formed by 16 electrodes located around a container, and this image is used to navigate a 3-DOF robotic arm to the exact location of the target object. The data set to form the impedance image is obtained by repeated current injections and voltage measurements between all electrode pairs. After the necessary calculations to obtain the impedance, the information is transmitted to the computer. This data is then processed in MATLAB, interfaced with EIDORS (Electrical Impedance Tomography Reconstruction Software), to reconstruct the image from the acquired data. In the next step, the coordinates of the center of the target object are calculated with the MATLAB Image Processing Toolbox (IPT). Finally, these coordinates are used to calculate the angles of each joint of the robotic arm, and the arm moves to the desired tissue upon user command.
Abstract: As DNA microarray data contain a relatively small sample size compared to the number of genes, high-dimensional models are often employed. In high-dimensional models, the selection of the tuning parameter (or penalty parameter) is often one of the crucial parts of the modeling. Cross-validation is one of the most common methods for tuning parameter selection; it selects the parameter value with the smallest cross-validated score. However, selecting a single value as the 'optimal' value for the parameter can be very unstable due to sampling variation, since the sample sizes of microarray data are often small. Our approach is to first choose multiple candidates for the tuning parameter and then average the candidates with different weights depending on their performance. The additional step of estimating the weights and averaging the candidates rarely increases the computational cost, while it can considerably improve on traditional cross-validation. We show that the value selected by the suggested methods often leads to more stable parameter selection as well as improved detection of significant genetic variables compared to traditional cross-validation, via real data and simulated data sets.
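The idea of averaging several near-optimal tuning parameters rather than committing to a single cross-validated winner can be sketched as below with a lasso penalty; the candidate count and the score-proportional weighting are illustrative assumptions, not the weighting scheme derived in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

def averaged_penalty(X, y, grid, n_candidates=5):
    """Average the n best penalty values, weighted by their CV score,
    instead of selecting only the single best value."""
    grid = np.asarray(grid, dtype=float)
    scores = np.array([cross_val_score(Lasso(alpha=a), X, y, cv=5).mean()
                       for a in grid])
    top = np.argsort(scores)[-n_candidates:]          # best candidates
    weights = scores[top] - scores[top].min() + 1e-8  # keep weights positive
    return float(np.average(grid[top], weights=weights))
```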
Abstract: Mining big data represents a big challenge nowadays. Many research efforts are concerned with mining massive amounts of data and big data streams. Mining big data faces many challenges, including scalability, speed, heterogeneity, accuracy, provenance, and privacy. In the telecommunication industry, mining big data is like mining for gold: it represents a big opportunity for maximizing revenue streams. This paper discusses the characteristics of big data (volume, variety, velocity, and veracity), data mining techniques and tools for handling very large data sets, mining big data in telecommunication, and the benefits and opportunities gained from them.
Abstract: Sentiment analysis (SA) has received growing attention in Arabic language research. However, few studies have directly applied SA to Arabic due to the lack of publicly available datasets for this language. This paper partially bridges this gap by focusing on one of the Arabic dialects, the Saudi dialect. It presents an annotated data set of 4,700 instances for Saudi dialect sentiment analysis, with an agreement of K = 0.807. Our future work is to extend this corpus and create a large-scale lexicon for the Saudi dialect from it.
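The reported K = 0.807 is presumably an inter-annotator kappa; a statistic of that kind can be computed with scikit-learn, as in the sketch below with hypothetical annotator labels.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical sentiment labels from two annotators on the same texts
annotator_1 = ["pos", "neg", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "neg", "neu", "neu", "pos", "neg"]
print(cohen_kappa_score(annotator_1, annotator_2))
```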
Abstract: Intellectual capital is one of the most valuable and important parts of the intangible assets of enterprises, especially knowledge-based enterprises. With respect to the increasing gap between the market value and the book value of companies, intellectual capital is one of the components that can be placed in this gap. This paper uses the value-added efficiency of three components (capital employed, human capital, and structural capital) to measure the intellectual capital efficiency of Iranian industry groups listed on the Tehran Stock Exchange (TSE), using an eight-year data set from 2005 to 2012. In order to analyze the effect of intellectual capital on the market-to-book value ratio of the companies, the data set was divided into 10 industries (Banking, Pharmaceutical, Metals & Mineral Nonmetallic, Food, Computer, Building, Investments, Chemical, Cement, and Automotive), and the panel data method was applied to estimate a pooled OLS model. The results showed that the value added of capital employed has a significant positive relation with increasing market value in the Banking, Metals & Mineral Nonmetallic, Food, Computer, Chemical, and Cement industries, and that the value-added efficiency of structural capital has a significant positive relation with increasing market value in the Banking, Pharmaceutical, and Computer industry groups. The value added results showed a negative relation for the Banking and Pharmaceutical industry groups and a positive relation for the Computer and Automotive industry groups. Among the studied industries, the computer industry shows the widest gap between market value and book value in terms of its intellectual capital.
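A pooled OLS regression of the market-to-book ratio on the three value-added efficiency components can be sketched with statsmodels as below; the variable names (vaca, vahu, stva) and the panel values are illustrative, not the TSE data.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical firm-year observations: market-to-book ratio and the
# value-added efficiencies of capital employed, human and structural capital
panel = pd.DataFrame({
    "mtb":  [1.8, 2.1, 1.4, 2.6, 1.9, 2.2],
    "vaca": [0.9, 1.1, 0.7, 1.3, 1.0, 1.2],
    "vahu": [2.1, 2.4, 1.8, 2.9, 2.2, 2.5],
    "stva": [0.52, 0.58, 0.44, 0.66, 0.55, 0.60],
})
X = sm.add_constant(panel[["vaca", "vahu", "stva"]])
pooled_ols = sm.OLS(panel["mtb"], X).fit()
print(pooled_ols.summary())
```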
Abstract: Statistical study has become indispensable in various fields of knowledge. Geotechnics is no different: probabilistic and statistical methods have gained ground, given their use in characterizing the uncertainties inherent in soil properties. One situation engineers constantly face is the definition of a probability distribution that significantly represents the sampled data. To be able to discard bad distributions, goodness-of-fit tests are necessary. In this paper, three non-parametric goodness-of-fit tests are applied to a computationally generated data set in order to test its fit to a series of known distributions. It is shown that the use of the normal distribution does not always provide satisfactory results regarding the physical and behavioral representation of the modeled parameters.
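The abstract does not name the three tests used; the sketch below shows two common non-parametric choices (Kolmogorov-Smirnov and Anderson-Darling) applied to a synthetic, lognormally generated soil parameter, purely as an illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=3.0, sigma=0.4, size=200)  # synthetic soil parameter

# Kolmogorov-Smirnov test against a normal fitted to the sample
mu, sd = sample.mean(), sample.std(ddof=1)
ks = stats.kstest(sample, "norm", args=(mu, sd))

# Anderson-Darling test for normality
ad = stats.anderson(sample, dist="norm")
print(ks.statistic, ks.pvalue, ad.statistic)
```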
Abstract: The River Hindon is an important river catering to the demands of the highly populated rural and industrial clusters of western Uttar Pradesh, India. The water quality of the river Hindon is deteriorating at an alarming rate due to various industrial, municipal, and agricultural activities. The present study aimed to identify the pollution sources and quantify the degree to which these sources are responsible for the deteriorating water quality of the river. Various water quality parameters, such as pH, temperature, electrical conductivity, total dissolved solids, total hardness, calcium, chloride, nitrate, sulphate, biological oxygen demand, chemical oxygen demand, and total alkalinity, were assessed. Water quality data obtained from eight study sites over one year were subjected to two multivariate techniques, namely principal component analysis and cluster analysis. Principal component analysis was applied with the aim of characterizing spatial variability and identifying the sources responsible for the water quality of the river; three varifactors were obtained after varimax rotation of the initial principal components. Cluster analysis was carried out to classify sampling stations by similarity, grouping the eight sites into two clusters. The study reveals that anthropogenic influence (municipal, industrial, wastewater, and agricultural runoff) was the major source of river water pollution. Thus, this study illustrates the utility of multivariate statistical techniques for the analysis and elucidation of multifaceted data sets, the recognition of pollution sources/factors, and the understanding of temporal/spatial variations in water quality for effective river water quality management.
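A minimal sketch of the PCA and cluster-analysis workflow is given below with synthetic data in place of the Hindon measurements; the varimax rotation applied in the study is noted but omitted, and the grouping of observations by station is an assumption about the data layout.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

params = ["pH", "temp", "EC", "TDS", "hardness", "Ca", "Cl",
          "NO3", "SO4", "BOD", "COD", "alkalinity"]
# Synthetic stand-in: 8 stations x 12 months of measurements
data = pd.DataFrame(np.random.rand(96, len(params)), columns=params)

Z = StandardScaler().fit_transform(data)      # standardize the parameters
pca = PCA(n_components=3).fit(Z)              # three components ~ "varifactors"
loadings = pd.DataFrame(pca.components_.T, index=params)
# (Varimax rotation of these loadings, as in the study, is omitted here.)

# Hierarchical clustering of stations by their mean standardized profile
station = np.repeat(np.arange(8), 12)
site_means = pd.DataFrame(Z).groupby(station).mean()
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(site_means)
```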