Development of Energy Benchmarks Using Mandatory Energy and Emissions Reporting Data: Ontario Post-Secondary Residences

Governments are playing an increasingly active role in reducing carbon emissions, and a key strategy has been the introduction of mandatory energy disclosure policies. These policies have produced a significant amount of publicly available data, giving researchers a unique opportunity to develop location-specific energy and carbon emission benchmarks, which can then be used to develop building archetypes and to inform urban energy models. This study presents the development of such a benchmark using the public reporting data. The data from Ontario’s Ministry of Energy for Post-Secondary Educational Institutions are being used to develop a series of building archetype dynamic building loads and energy benchmarks to fill a gap in the currently available building database. This paper presents the development of a benchmark for college and university residences within ASHRAE climate zone 6 areas in Ontario using the mandatory disclosure energy and greenhouse gas emissions data. The methodology presented includes data cleaning, statistical analysis, and benchmark development; lessons learned from this investigation are presented and discussed to inform the development of future energy benchmarks from this larger data set. The key findings from this initial benchmarking study are: (1) careful data screening and outlier identification are essential to developing a valid dataset; (2) the key features used to develop a model of the data are building age, size, and occupancy schedules, and these can be used to estimate energy consumption; and (3) policy changes affecting primary energy generation significantly affected greenhouse gas emissions, and consideration of these factors was critical to evaluating the validity of the reported data.

Part of Speech Tagging Using Statistical Approach for Nepali Text

Part of Speech (POS) tagging has always been a challenging task in the era of Natural Language Processing. This article presents POS tagging for Nepali text using a Hidden Markov Model (HMM) and the Viterbi algorithm. Training and testing data sets are randomly separated from an annotated Nepali text corpus, and both methods are applied to these data sets. The Viterbi algorithm is found to be computationally faster and more accurate than HMM, achieving an accuracy of 95.43%. An error analysis of where the mismatches occurred is discussed in detail.
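
As a rough illustration of the tagging step, the following sketch runs Viterbi decoding over a small HMM. The tag set, the words, and all probabilities are toy values invented for this example (they are not taken from the Nepali corpus), and the function name `viterbi` and its parameters are hypothetical.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under an HMM."""
    if not words:
        return []

    def log(p):
        # Floor zero probabilities so log() stays defined (crude smoothing).
        return math.log(max(p, 1e-12))

    # best[i][tag]: log-probability of the best path ending in `tag` at position i
    best = [{t: log(start_p.get(t, 0.0)) + log(emit_p[t].get(words[0], 0.0))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Best previous tag for reaching tag t at position i.
            prev = max(tags, key=lambda p: best[i - 1][p] + log(trans_p[p].get(t, 0.0)))
            best[i][t] = (best[i - 1][prev] + log(trans_p[prev].get(t, 0.0))
                          + log(emit_p[t].get(words[i], 0.0)))
            back[i][t] = prev
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model (assumed values, for illustration only).
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.8, "VERB": 0.2}
TRANS_P = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT_P = {"NOUN": {"ram": 0.5, "rice": 0.4, "eats": 0.1},
          "VERB": {"eats": 0.9, "rice": 0.1}}

tagged = viterbi(["ram", "eats", "rice"], TAGS, START_P, TRANS_P, EMIT_P)
```

In practice the start, transition, and emission probabilities would be estimated from the annotated training split rather than written by hand.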

A Minimum Spanning Tree-Based Method for Initializing the K-Means Clustering Algorithm

The traditional k-means algorithm has been widely used as a simple and efficient clustering method. However, the algorithm often converges to local minima because it is sensitive to the initial cluster centers. In this paper, an algorithm for selecting initial cluster centers on the basis of a minimum spanning tree (MST) is presented. Vertices of the MST with the same degree are treated as a group, which is used to find the skeleton data points. Furthermore, a distance measure between the skeleton data points that takes both degree and Euclidean distance into account is presented. Finally, an MST-based initialization method for the k-means algorithm is presented, and its time complexity is analyzed. The presented algorithm is tested on five data sets from the UCI Machine Learning Repository. The experimental results illustrate the effectiveness of the presented algorithm compared to three existing initialization methods.
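
The first step described above, building an MST and grouping its vertices by degree, can be sketched as follows. This is only a minimal illustration under assumptions: it uses Prim's algorithm on the complete Euclidean graph, the function names are hypothetical, and it does not reproduce the paper's skeleton-point selection or its degree-aware distance measure.

```python
import math
from collections import defaultdict

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges (i, j)."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    dist = [math.inf] * n      # cheapest known connection cost to the tree
    parent = [-1] * n
    dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < dist[v]:
                    dist[v] = d
                    parent[v] = u
    return edges

def group_by_degree(n, edges):
    """Map each MST degree to the list of vertex indices having that degree."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    groups = defaultdict(list)
    for v, d in enumerate(degree):
        groups[d].append(v)
    return dict(groups)

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
edges = mst_edges(points)
groups = group_by_degree(len(points), edges)
```

For these four collinear points the MST is a chain, so the two endpoints form the degree-1 group and the two interior points form the degree-2 group.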

3D Point Cloud Model Color Adjustment by Combining Terrestrial Laser Scanner and Close Range Photogrammetry Datasets

3D models obtained with advanced survey techniques such as close-range photogrammetry and laser scanning are nowadays particularly appreciated in the Cultural Heritage and Archaeology fields. To produce high-quality models of archaeological evidence and anthropological artifacts, the appearance of the model (i.e., its color), beyond its geometric accuracy, is not a negligible aspect. The integration of close-range photogrammetry survey techniques with laser scanning is still a topic of study and research. Combining point cloud data sets of the same object generated with both technologies, or with the same technology but acquired at different moments and/or under different natural light conditions, can produce a final point cloud with accentuated color dissimilarities. This paper presents a methodology to make the different data sets uniform, to improve the chromatic quality, and to highlight further details by balancing the point colors.

Estimating Bridge Deterioration for Small Data Sets Using Regression and Markov Models

The primary approaches for estimating bridge deterioration use Markov-chain models and regression analysis. Traditional Markov models have problems in estimating the required transition probabilities when a small sample size is used. Often, reliable bridge data have not been collected over long periods, so large data sets may not be available. This study presents an important change to the traditional approach by using the Small Data Method to estimate transition probabilities. The results illustrate that the Small Data Method and the traditional approach provide similar estimates; however, the former provides results that are more conservative. That is, the Small Data Method provided slightly lower-than-expected bridge condition ratings compared with the traditional approach. Considering that bridges are critical infrastructure, the Small Data Method, which uses more information and provides more conservative estimates, may be more appropriate when the available sample size is small. In addition, regression analysis was used to calculate bridge deterioration. Condition ratings were determined for bridge groups, and the best regression model was selected for each group. The results obtained were very similar to those obtained when using Markov chains; however, it is desirable to use more data for better results.
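
To make the role of the transition probabilities concrete, the sketch below propagates a condition-state distribution through a Markov chain and reports the expected condition rating per period. The three ratings and the transition matrix are invented for illustration; they are not estimates from the study, and the sketch does not implement the Small Data Method itself.

```python
def step(dist, P):
    """Advance a state distribution by one period: dist_next[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def expected_rating(dist, ratings):
    """Expected condition rating under the current state distribution."""
    return sum(p * r for p, r in zip(dist, ratings))

# Assumed toy values: three condition states and a deterioration-only matrix.
RATINGS = [9, 8, 7]
P = [
    [0.9, 0.1, 0.0],   # stays at 9 with prob. 0.9, drops to 8 with prob. 0.1
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],   # lowest state is absorbing in this toy model
]

dist = [1.0, 0.0, 0.0]          # a new bridge starts in the best state
history = []
for _ in range(2):
    dist = step(dist, P)
    history.append(expected_rating(dist, RATINGS))
```

With these toy numbers the expected rating falls from 9 to 8.9 after one period and to 8.79 after two; different transition-probability estimates shift this curve, which is how one method can yield more conservative ratings than another.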

Variogram Fitting Based on the Wilcoxon Norm

Within geostatistics research, effective estimation of the variogram points has been examined, particularly in developing robust alternatives. The parametric fit of these variogram points, which eventually defines the kriging weights, has not, however, received the same attention from a robust perspective. This paper proposes the use of the non-linear Wilcoxon norm over weighted non-linear least squares as a robust variogram fitting alternative. First, we introduce the concept of variogram estimation and fitting. Then, as an alternative to non-linear weighted least squares, we discuss the non-linear Wilcoxon estimator. Next, the robustness properties of the non-linear Wilcoxon estimator are demonstrated using a contaminated spatial data set. Finally, under simulated conditions, spatial processes with increasing levels of contamination have their variogram points estimated and fit. In the fitting of these variogram points, both non-linear weighted least squares and non-linear Wilcoxon fits are examined for efficiency. At all levels of contamination (including 0%), using a robust estimation and robust fitting procedure, the non-weighted Wilcoxon outperforms weighted least squares.
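
For orientation, the sketch below evaluates the Wilcoxon norm of a residual vector in its usual rank-based form, the sum over i of a(R(e_i)) * e_i with scores a(k) = sqrt(12) * (k / (n + 1) - 1/2), where R(e_i) is the rank of residual e_i. Whether the paper uses exactly this scaling is an assumption, and the function name is hypothetical; a variogram fit would minimize this quantity over the parameters of the variogram model.

```python
import math

def wilcoxon_norm(residuals):
    """Rank-based Wilcoxon norm: sum of score(rank(e_i)) * e_i."""
    n = len(residuals)
    # Indices ordered by residual value; position in this order gives the rank.
    order = sorted(range(n), key=lambda i: residuals[i])
    total = 0.0
    for rank, i in enumerate(order, start=1):
        score = math.sqrt(12) * (rank / (n + 1) - 0.5)
        total += score * residuals[i]
    return total

def sum_of_squares(residuals):
    """Least-squares objective, shown for comparison."""
    return sum(e * e for e in residuals)

clean = [-1.0, 0.0, 1.0]
contaminated = [-1.0, 0.0, 100.0]
ratio_wilcoxon = wilcoxon_norm(contaminated) / wilcoxon_norm(clean)
ratio_squares = sum_of_squares(contaminated) / sum_of_squares(clean)
```

The comparison illustrates the robustness argument: the outlier enters the Wilcoxon norm linearly, whereas it enters the least-squares objective quadratically, so `ratio_squares` is far larger than `ratio_wilcoxon`.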

Increasing the Capacity of Plant Bottlenecks by Improving the Ratio of Mean Time between Failures to Mean Time to Repair

Maintenance costs make up a significant percentage of production costs, and analyzing them can help achieve greater productivity and competitiveness. With this in mind, the maintenance of machines and installations is considered an essential organizational function, and applying effective strategies adds significant value to manufacturing activities. Organizations are trying to achieve performance levels on a global scale, with emphasis on creating competitive advantage through methods such as RCM (Reliability-Centered Maintenance) and TPM (Total Productive Maintenance). In this study, increasing the capacity of the Concentration Plant of Golgohar Iron Ore Mining & Industrial Company (GEG) was examined using reliability and maintainability analyses. The results showed that, instead of increasing the number of machines to solve the bottleneck problems, improving reliability and maintainability would solve them in the best way. It should be mentioned that the data set of the Concentration Plant of GEG was used and analyzed as the case study.
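
The link between the MTBF-to-MTTR ratio and capacity can be illustrated with the standard steady-state availability formula, A = MTBF / (MTBF + MTTR), which equals r / (1 + r) for r = MTBF / MTTR. The sketch below is an illustration under assumptions: the idea that effective capacity is nominal capacity times availability, the function names, and all numbers are hypothetical and are not taken from the GEG case study.

```python
def availability(mtbf, mttr):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf / (mtbf + mttr)

def effective_capacity(nominal_capacity, mtbf, mttr):
    """Assumed simple model: a machine delivers nominal capacity only while available."""
    return nominal_capacity * availability(mtbf, mttr)

# Hypothetical bottleneck machine: 100 t/h nominal capacity.
before = effective_capacity(100.0, mtbf=40.0, mttr=10.0)   # ratio 4  -> availability 0.80
after = effective_capacity(100.0, mtbf=90.0, mttr=10.0)    # ratio 9  -> availability 0.90
gain = after - before
```

Under this simple model, raising the ratio from 4 to 9 adds roughly 10 t/h at the bottleneck without installing another machine.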

Analysis of Attention to the Confucius Institute from Domestic and Foreign Mainstream Media

The rapid development of the Confucius Institute is attracting more and more attention from mainstream media around the world. Mainstream media play a large role in public information dissemination and public opinion. This study analyzes the correlation and functional relationship between domestic and foreign mainstream media based on the number of reports on the Confucius Institute. Three correlation measures, the Pearson correlation coefficient (PCC), the Spearman correlation coefficient (SCC), and the Kendall rank correlation coefficient (KCC), were applied to analyze the correlations among mainstream media from three regions: mainland China; Hong Kong and Macao (the two special administrative regions of China, denoted as SARs); and overseas countries excluding China, such as the United States, England, and Canada. Further, the paper measures the functional relationships among the regions using a regression model. The experimental analyses found high correlations among mainstream media from the different regions. Additionally, we found a linear relationship between the number of reports on the Confucius Institute in the mainstream media of overseas countries and in those of the SARs, based on a data set obtained by crawling the websites of 106 mainstream media outlets for the years 2004 to 2014.
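
The three coefficients named above can be computed as in the following sketch. The yearly report counts are made-up numbers used only to exercise the functions, the Kendall coefficient is the simple tau-a variant (an assumption, since the abstract does not say which variant was used), and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical yearly report counts for two regions.
region_a = [1, 2, 3, 4]
region_b = [1, 4, 9, 16]
pcc = pearson(region_a, region_b)
scc = spearman(region_a, region_b)
kcc = kendall_tau_a(region_a, region_b)
```

Because the second series grows monotonically but not linearly with the first, the two rank-based coefficients equal 1 while the Pearson coefficient is slightly below 1.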

Innovative Entrepreneurship in Tourism Business: An International Comparative Study of Key Drivers

Entrepreneurship is mostly associated with the founding of an organization. In growing business organizations, the concept of entrepreneurship expands. It reveals itself through new business creation within the active organization; through renewal, change, innovation, and the creation and development of the current organization; and through breaking and changing established rules inside or outside the organization, so that the organization becomes more flexible, adaptive, and competitive and its activity more effective. The topic of entrepreneurship therefore relates the creation of firms to the personal characteristics of entrepreneurs and their social context. This paper is an empirical study that aims to address these two gaps in the literature. For this endeavor, we use the latest available data from the Global Entrepreneurship Monitor (GEM) project. This data set is widely regarded as a unique source of information about entrepreneurial activity, as well as the aspirations and attitudes of individuals across a large number of countries and territories worldwide. This paper contributes to filling this gap by exploring the key drivers of innovative entrepreneurship in the tourism sector. Our findings are consistent with the existing literature in terms of the individual characteristics of entrepreneurs, but, quite surprisingly, we find an inverted U-shaped relation between human development and innovative entrepreneurship in the tourism sector. Tourism entrepreneurs are found to be less likely to have innovative products compared with entrepreneurs in medium-developed countries.

Speaker Identification by Atomic Decomposition of Learned Features Using Computational Auditory Scene Analysis Principles in Noisy Environments

Speaker recognition is performed in high Additive White Gaussian Noise (AWGN) environments using principles of Computational Auditory Scene Analysis (CASA). CASA methods often classify sounds from images in the time-frequency (T-F) plane using spectrograms or cochleagrams as the image. In this paper, atomic decomposition implemented by matching pursuit performs a transform from time series speech signals to the T-F plane. The atomic decomposition creates a sparsely populated T-F vector in “weight space” where each populated T-F position contains an amplitude weight. The weight space vector along with the atomic dictionary represents a denoised, compressed version of the original signal. The arrangement of the atomic indices in the T-F vector is used for classification. Unsupervised feature learning implemented by a sparse autoencoder learns a single dictionary of basis features from a collection of envelope samples from all speakers. The approach is demonstrated using pairs of speakers from the TIMIT data set. Pairs of speakers are selected randomly from a single district. Each speaker has 10 sentences; two are used for training and eight for testing. Atomic index probabilities are created for each training sentence and for each test sentence. Classification is performed by finding the lowest Euclidean distance between the probabilities from the training sentences and those from the test sentences. Training is done at a 30 dB Signal-to-Noise Ratio (SNR). Testing is performed at SNRs of 0 dB, 5 dB, 10 dB, and 30 dB. The algorithm has a baseline classification accuracy of ~93% averaged over 10 pairs of speakers from the TIMIT data set. The baseline accuracy is attributable to the short sequences of training and test data as well as the overall simplicity of the classification algorithm. The accuracy is not affected by AWGN: the algorithm produces ~93% accuracy at 0 dB SNR.
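
The atomic decomposition step can be sketched with a basic matching pursuit loop. This is a minimal illustration under assumptions: the dictionary is a tiny hand-written set of unit-norm atoms rather than one learned by a sparse autoencoder, the signal is a three-sample toy vector, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy decomposition of `signal` over unit-norm atoms.

    Returns (weights, residual), where weights maps atom index -> amplitude.
    """
    residual = list(signal)
    weights = {}
    for _ in range(n_atoms):
        # Select the atom most correlated with the current residual.
        k = max(range(len(dictionary)),
                key=lambda i: abs(dot(residual, dictionary[i])))
        c = dot(residual, dictionary[k])
        if c == 0:
            break  # nothing left to explain
        weights[k] = weights.get(k, 0.0) + c
        # Subtract the selected atom's contribution.
        residual = [r - c * a for r, a in zip(residual, dictionary[k])]
    return weights, residual

# Toy dictionary: the standard basis of R^3 (each atom has unit norm).
DICTIONARY = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
weights, residual = matching_pursuit([3.0, 0.0, -1.0], DICTIONARY, n_atoms=2)
```

The returned `weights` play the role of the sparse "weight space" vector: only the selected atom indices are populated, and it is the pattern of these indices that the abstract describes as the basis for classification.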

Ports and Airports: Gateways to Vector-Borne Diseases in Portugal Mainland

Vector-borne diseases are transmitted to humans by mosquitoes, sandflies, bugs, ticks, and other vectors. Some are re-transmitted between vectors if the infected human has a new contact while his or her level of infection is high. The vector is infected for life and can transmit infectious diseases not only between humans but also from animals to humans. Some vector-borne diseases are very disabling, and together they account for more than one million deaths worldwide. The mosquitoes of the Culex pipiens s.l. complex are the most abundant in Portugal, and a data set is currently available from the surveillance program that has been carried out across the country since 2006. All mosquito species are included, but the large coverage of Culex pipiens s.l. and its importance for public health make this vector an interesting candidate for assessing the risk of disease amplification. This work focuses on ports and airports identified as key areas of high vector density. Because mosquitoes are ectothermic organisms, the main factor for vector survival and pathogen development is temperature. Minimum and maximum local air temperatures for each area of interest are averaged by month from data gathered daily at the national network of meteorological stations and interpolated in a geographic information system (GIS). The ranges of temperatures ideal for several pathogens are known, and this work shows how to use them with the meteorological data at each port and airport facility to focus an efficient implementation of countermeasures and simultaneously reduce transmission risk and mitigation costs. The results show an increased alert level with decreasing latitude, which corresponds to higher minimum and maximum temperatures and a narrower daily temperature range.

An Empirical Evaluation of Performance of Machine Learning Techniques on Imbalanced Software Quality Data

The development of change prediction models can help software practitioners plan testing and inspection resources in the early phases of software development. However, a major challenge faced during the training of any classification model is the imbalanced nature of software quality data. Data with very few instances of the minority outcome category lead to an inefficient learning process, and a classification model developed from imbalanced data generally does not predict these minority categories correctly. For a given dataset, a minority of classes may be change prone whereas a majority of classes may be non-change prone. This study explores various alternatives for handling imbalanced software quality data using different sampling methods and effective MetaCost learners. The study also analyzes and justifies the use of different performance metrics when dealing with imbalanced data. To empirically validate the different alternatives, the study uses change data from three application packages of the open-source Android data set and evaluates the performance of six different machine learning techniques. The results indicate extensive improvement in the performance of the classification models when resampling methods and robust performance measures are used.
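
As one concrete example of a sampling method, the sketch below performs random oversampling, which duplicates minority-class rows until every class matches the majority count. The abstract does not name the specific sampling methods used, so this technique is an assumption chosen for illustration; the feature rows and labels are invented, and the function name is hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equally frequent."""
    rng = random.Random(seed)          # fixed seed keeps the example reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data: 8 non-change-prone classes (label 0) and 2 change-prone classes (label 1).
rows = [[i, i % 3] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

Resampling should be applied to the training split only; evaluating on resampled data would distort the performance measures the study is concerned with.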

Facilitating Factors for the Success of Mobile Service Providers in Bangkok Metropolitan

The objectives of this research were to study the level of the influencing factors (leadership, supply chain management, innovation, competitive advantages, and business success) and the factors affecting the business success of the mobile phone system service providers in Bangkok Metropolitan. The research used both quantitative and qualitative approaches. In the quantitative approach, questionnaires were used to collect data from 331 managers of mobile service shops franchised by AIS, Dtac, and TrueMove. The mobile phone system service providers/shop managers were selected by stratified random sampling, with proportional allocation to subgroups according to the number of providers in each network. In the qualitative approach, in-depth interviews were conducted with 6 mobile service providers/managers of Telewiz, Dtac, and TrueMove shops to find agreement or disagreement using the content analysis method. Descriptive statistics, including frequency, percentage, mean, and standard deviation, were employed, and the Structural Equation Model (SEM) was used as a tool for data analysis. The content analysis method was applied to identify key patterns emerging from the interview responses. The two data sets were brought together and compared and contrasted to produce the findings, providing triangulation to enrich the interpretation of the results. The results revealed that the influencing factors (leadership, innovation management, supply chain management, and business competitiveness) were rated at a high level, while innovation and the business success, both financial and non-financial, of the mobile phone system service providers in Bangkok Metropolitan were rated at the highest level. Moreover, the factors influencing the business of mobile system service providers (leadership, supply chain management, innovation management, competitive advantages, and business success) were statistically significant at the .01 level, which corresponded to the data from the interviews.

Using Electrical Impedance Tomography to Control a Robot

Electrical impedance tomography is a non-invasive imaging technique suitable for medical applications. This paper describes an electrical impedance tomography device with the ability to navigate a robotic arm to manipulate a target object. The design of the device includes various hardware and software sections to perform medical imaging and control the robotic arm. In the hardware section, an image is formed using 16 electrodes located around a container. This image is used to navigate a 3DOF robotic arm to the exact location of the target object. The data set used to form the impedance image is obtained through repeated current injections and voltage measurements between all electrode pairs. After the necessary calculations to obtain the impedance are performed, the information is transmitted to the computer. These data are processed in MATLAB, which is interfaced with EIDORS (Electrical Impedance Tomography and Diffuse Optical Tomography Reconstruction Software), to reconstruct the image from the acquired data. In the next step, the coordinates of the center of the target object are calculated using MATLAB's Image Processing Toolbox (IPT). Finally, these coordinates are used to calculate the angle of each joint of the robotic arm. The robotic arm moves to the desired tissue on the user's command.

Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models

As DNA microarray data have a relatively small sample size compared to the number of genes, high dimensional models are often employed. In high dimensional models, the selection of the tuning parameter (or penalty parameter) is often one of the crucial parts of the modeling. Cross-validation is one of the most common methods for tuning parameter selection; it selects the parameter value with the smallest cross-validated score. However, selecting a single value as the ‘optimal’ value for the parameter can be very unstable due to sampling variation, since the sample sizes of microarray data are often small. Our approach is to first choose multiple candidate values of the tuning parameter and then average the candidates with different weights depending on their performance. The additional step of estimating the weights and averaging the candidates barely increases the computational cost, while it can considerably improve on traditional cross-validation. Using real and simulated data sets, we show that the value selected by the suggested methods often leads to more stable parameter selection as well as improved detection of significant genetic variables compared to traditional cross-validation.
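
The idea of averaging several candidate tuning parameters instead of picking a single minimizer can be sketched as follows. The weighting rule here (weights inversely proportional to the cross-validated score, applied to the best `top_k` candidates) is purely an assumption made for illustration; the abstract does not specify how the weights are estimated, and the function names and numbers are hypothetical.

```python
def best_single(candidates, cv_scores):
    """Traditional cross-validation: the candidate with the smallest CV score."""
    return min(zip(cv_scores, candidates))[1]

def weighted_average(candidates, cv_scores, top_k=3):
    """Average the top_k candidates, weighting each by 1 / CV score (assumed rule).

    Requires strictly positive CV scores.
    """
    ranked = sorted(zip(cv_scores, candidates))[:top_k]
    weights = [1.0 / score for score, _ in ranked]
    total = sum(weights)
    return sum(w * lam for w, (_, lam) in zip(weights, ranked)) / total

# Hypothetical penalty values and their cross-validated scores (lower is better).
lambdas = [0.1, 0.2, 0.4]
scores = [2.0, 1.0, 2.0]
single = best_single(lambdas, scores)
averaged = weighted_average(lambdas, scores)
```

The averaging step costs only a few arithmetic operations on top of the cross-validation runs that are needed anyway, which is consistent with the claim that the extra computational cost is small.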

Mining Big Data in Telecommunications Industry: Challenges, Techniques, and Revenue Opportunity

Mining big data represents a big challenge nowadays. Much research is concerned with mining massive amounts of data and big data streams. Mining big data faces many challenges, including scalability, speed, heterogeneity, accuracy, provenance, and privacy. In the telecommunications industry, mining big data is like mining for gold: it represents a big opportunity for maximizing revenue streams. This paper discusses the characteristics of big data (volume, variety, velocity, and veracity), data mining techniques and tools for handling very large data sets, mining big data in telecommunications, and the benefits and opportunities gained from it.

Saudi Twitter Corpus for Sentiment Analysis

Sentiment analysis (SA) has received growing attention in Arabic language research. However, few studies have directly applied SA to Arabic, due to the lack of a publicly available dataset for this language. This paper partially bridges this gap by focusing on one of the Arabic dialects, the Saudi dialect. It presents a 4700-item annotated data set for Saudi dialect sentiment analysis with K = 0.807. Our future work is to extend this corpus and to create a large-scale lexicon for the Saudi dialect from it.
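
Assuming that K denotes Cohen's kappa for agreement between two annotators (the abstract does not say so explicitly), the statistic can be computed as in this sketch. The labels are invented examples and the function name is hypothetical.

```python
from collections import Counter

def cohen_kappa(annotator_a, annotator_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(annotator_a)
    observed = sum(x == y for x, y in zip(annotator_a, annotator_b)) / n
    count_a, count_b = Counter(annotator_a), Counter(annotator_b)
    labels = set(annotator_a) | set(annotator_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy annotations for four items.
a = ["pos", "pos", "neg", "neg"]
b = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(a, b)
```

Here the annotators agree on three of four items, but half of that agreement is expected by chance, so kappa is 0.5; a value near 0.8, as reported for the corpus, indicates much stronger agreement.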

The Influence of the Intellectual Capital on the Firms’ Market Value: A Study of Listed Firms in the Tehran Stock Exchange (TSE)

Intellectual capital is one of the most valuable and important parts of the intangible assets of enterprises, especially knowledge-based enterprises. Given the increasing gap between the market value and the book value of companies, intellectual capital is one of the components that can account for this gap. This paper uses the value added efficiency of three components, capital employed, human capital, and structural capital, to measure the intellectual capital efficiency of Iranian industry groups listed on the Tehran Stock Exchange (TSE), using an eight-year data set from 2005 to 2012. To analyze the effect of intellectual capital on the market-to-book value ratio of the companies, the data set was divided into 10 industries (Banking, Pharmaceutical, Metals & Mineral Nonmetallic, Food, Computer, Building, Investments, Chemical, Cement, and Automotive), and the panel data method was applied with pooled OLS estimation. The results showed that the value added of capital employed has a significant positive relation with increasing market value in the Banking, Metals & Mineral Nonmetallic, Food, Computer, Chemical, and Cement industries, and that the value added efficiency of structural capital has a significant positive relation with increasing market value in the Banking, Pharmaceutical, and Computer industry groups. The results of the value added showed a negative relation in the Banking and Pharmaceutical industry groups and a positive relation in the Computer and Automotive industry groups. Among the studied industries, the Computer industry has the widest gap between market value and book value attributable to its intellectual capital.

A Brief Study about Nonparametric Adherence Tests

Statistical analysis has become indispensable in various fields of knowledge. Geotechnics is no different: the study of probabilistic and statistical methods has gained importance because of its use in characterizing the uncertainties inherent in soil properties. One situation that engineers constantly face is the choice of a probability distribution that adequately represents the sampled data. To be able to discard unsuitable distributions, goodness-of-fit tests are necessary. In this paper, three non-parametric goodness-of-fit tests are applied to a computationally generated data set to test its goodness of fit to a series of known distributions. It is shown that the use of the normal distribution does not always provide satisfactory results regarding the physical and behavioral representation of the modeled parameters.
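
The abstract does not name the three tests, so as an assumed example of a non-parametric goodness-of-fit test, the sketch below computes the one-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical distribution of the sample and a candidate cumulative distribution function. The sample values and function names are hypothetical.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D = sup |F_empirical(x) - cdf(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical soil-property sample tested against two candidate distributions.
sample = [0.25, 0.75]
d_uniform = ks_statistic(sample, lambda x: min(max(x, 0.0), 1.0))
d_normal = ks_statistic(sample, lambda x: normal_cdf(x, 0.5, 1.0))
```

A smaller statistic means the candidate distribution tracks the sample more closely; in a full test the statistic would be compared against a critical value for the chosen significance level and sample size.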

A Multivariate Statistical Approach for Water Quality Assessment of River Hindon, India

River Hindon is an important river catering to the demand of the highly populated rural and industrial cluster of western Uttar Pradesh, India. The water quality of River Hindon is deteriorating at an alarming rate due to various industrial, municipal, and agricultural activities. The present study aimed at identifying the pollution sources and quantifying the degree to which these sources are responsible for the deteriorating water quality of the river. Various water quality parameters, such as pH, temperature, electrical conductivity, total dissolved solids, total hardness, calcium, chloride, nitrate, sulphate, biological oxygen demand, chemical oxygen demand, and total alkalinity, were assessed. Water quality data obtained from eight study sites over one year were subjected to two multivariate techniques, namely principal component analysis and cluster analysis. Principal component analysis was applied to determine spatial variability and to identify the sources responsible for the water quality of the river. Three varifactors were obtained after varimax rotation of the initial principal components. Cluster analysis was carried out to classify the sampling stations by similarity, and it grouped the eight sites into two clusters. The study reveals that anthropogenic influence (municipal, industrial, waste water, and agricultural runoff) was the major source of river water pollution. Thus, this study illustrates the utility of multivariate statistical techniques for the analysis and elucidation of multifaceted data sets, the recognition of pollution sources/factors, and the understanding of temporal/spatial variations in water quality for effective river water quality management.