Measuring Text-Based Semantics Relatedness Using WordNet

Measuring semantic similarity between texts is calculating semantic relatedness between texts using various techniques. Our web application (Measuring Relatedness of Concepts-MRC) allows user to input two text corpuses and get semantic similarity percentage between both using WordNet. Our application goes through five stages for the computation of semantic relatedness. Those stages are: Preprocessing (extracts keywords from content), Feature Extraction (classification of words into Parts-of-Speech), Synonyms Extraction (retrieves synonyms against each keyword), Measuring Similarity (using keywords and synonyms, similarity is measured) and Visualization (graphical representation of similarity measure). Hence the user can measure similarity on basis of features as well. The end result is a percentage score and the word(s) which form the basis of similarity between both texts with use of different tools on same platform. In future work we look forward for a Web as a live corpus application that provides a simpler and user friendly tool to compare documents and extract useful information.

Measuring Text-Based Semantics Relatedness Using WordNet

Measuring semantic similarity between texts is calculating semantic relatedness between texts using various techniques. Our web application (Measuring Relatedness of Concepts-MRC) allows user to input two text corpuses and get semantic similarity percentage between both using WordNet. Our application goes through five stages for the computation of semantic relatedness. Those stages are: Preprocessing (extracts keywords from content), Feature Extraction (classification of words into Parts-of-Speech), Synonyms Extraction (retrieves synonyms against each keyword), Measuring Similarity (using keywords and synonyms, similarity is measured) and Visualization (graphical representation of similarity measure). Hence the user can measure similarity on basis of features as well. The end result is a percentage score and the word(s) which form the basis of similarity between both texts with use of different tools on same platform. In future work we look forward for a Web as a live corpus application that provides a simpler and user friendly tool to compare documents and extract useful information.

Graph Cuts Segmentation Approach Using a Patch-Based Similarity Measure Applied for Interactive CT Lung Image Segmentation

Lung CT image segmentation is a prerequisite in lung CT image analysis. Most of the conventional methods need a post-processing to deal with the abnormal lung CT scans such as lung nodules or other lesions. The simplest similarity measure in the standard Graph Cuts Algorithm consists of directly comparing the pixel values of the two neighboring regions, which is not accurate because this kind of metrics is extremely sensitive to minor transformations such as noise or other artifacts problems. In this work, we propose an improved version of the standard graph cuts algorithm based on the Patch-Based similarity metric. The boundary penalty term in the graph cut algorithm is defined Based on Patch-Based similarity measurement instead of the simple intensity measurement in the standard method. The weights between each pixel and its neighboring pixels are Based on the obtained new term. The graph is then created using theses weights between its nodes. Finally, the segmentation is completed with the minimum cut/Max-Flow algorithm. Experimental results show that the proposed method is very accurate and efficient, and can directly provide explicit lung regions without any post-processing operations compared to the standard method.

Quick Similarity Measurement of Binary Images via Probabilistic Pixel Mapping

In this paper we present a quick technique to measure the similarity between binary images. The technique is based on a probabilistic mapping approach and is fast because only a minute percentage of the image pixels need to be compared to measure the similarity, and not the whole image. We exploit the power of the Probabilistic Matching Model for Binary Images (PMMBI) to arrive at an estimate of the similarity. We show that the estimate is a good approximation of the actual value, and the quality of the estimate can be improved further with increased image mappings. Furthermore, the technique is image size invariant; the similarity between big images can be measured as fast as that for small images. Examples of trials conducted on real images are presented.

Hybrid Reliability-Similarity-Based Approach for Supervised Machine Learning

Data mining has, over recent years, seen big advances because of the spread of internet, which generates everyday a tremendous volume of data, and also the immense advances in technologies which facilitate the analysis of these data. In particular, classification techniques are a subdomain of Data Mining which determines in which group each data instance is related within a given dataset. It is used to classify data into different classes according to desired criteria. Generally, a classification technique is either statistical or machine learning. Each type of these techniques has its own limits. Nowadays, current data are becoming increasingly heterogeneous; consequently, current classification techniques are encountering many difficulties. This paper defines new measure functions to quantify the resemblance between instances and then combines them in a new approach which is different from actual algorithms by its reliability computations. Results of the proposed approach exceeded most common classification techniques with an f-measure exceeding 97% on the IRIS Dataset.

Approximately Similarity Measurement of Web Sites Using Genetic Algorithms and Binary Trees

In this paper, we determine the similarity of two HTML web applications. We are going to use a genetic algorithm in order to determine the most significant web pages of each application (we are not going to use every web page of a site). Using these significant web pages, we will find the similarity value between the two applications. The algorithm is going to be efficient because we are going to use a reduced number of web pages for comparisons but it will return an approximate value of the similarity. The binary trees are used to keep the tags from the significant pages. The algorithm was implemented in Java language.

On-line Recognition of Isolated Gestures of Flight Deck Officers (FDO)

The paper presents an on-line recognition machine (RM) for continuous/isolated, dynamic and static gestures that arise in Flight Deck Officer (FDO) training. RM is based on generic pattern recognition framework. Gestures are represented as templates using summary statistics. The proposed recognition algorithm exploits temporal and spatial characteristics of gestures via dynamic programming and Markovian process. The algorithm predicts corresponding index of incremental input data in the templates in an on-line mode. Accumulated consistency in the sequence of prediction provides a similarity measurement (Score) between input data and the templates. The algorithm provides an intuitive mechanism for automatic detection of start/end frames of continuous gestures. In the present paper, we consider isolated gestures. The performance of RM is evaluated using four datasets - artificial (W TTest), hand motion (Yang) and FDO (tracker, vision-based ). RM achieves comparable results which are in agreement with other on-line and off-line algorithms such as hidden Markov model (HMM) and dynamic time warping (DTW). The proposed algorithm has the additional advantage of providing timely feedback for training purposes.

Graph-Based Text Similarity Measurement by Exploiting Wikipedia as Background Knowledge

Text similarity measurement is a fundamental issue in many textual applications such as document clustering, classification, summarization and question answering. However, prevailing approaches based on Vector Space Model (VSM) more or less suffer from the limitation of Bag of Words (BOW), which ignores the semantic relationship among words. Enriching document representation with background knowledge from Wikipedia is proven to be an effective way to solve this problem, but most existing methods still cannot avoid similar flaws of BOW in a new vector space. In this paper, we propose a novel text similarity measurement which goes beyond VSM and can find semantic affinity between documents. Specifically, it is a unified graph model that exploits Wikipedia as background knowledge and synthesizes both document representation and similarity computation. The experimental results on two different datasets show that our approach significantly improves VSM-based methods in both text clustering and classification.

Arabic Word Semantic Similarity

This paper is concerned with the production of an Arabic word semantic similarity benchmark dataset. It is the first of its kind for Arabic which was particularly developed to assess the accuracy of word semantic similarity measurements. Semantic similarity is an essential component to numerous applications in fields such as natural language processing, artificial intelligence, linguistics, and psychology. Most of the reported work has been done for English. To the best of our knowledge, there is no word similarity measure developed specifically for Arabic. In this paper, an Arabic benchmark dataset of 70 word pairs is presented. New methods and best possible available techniques have been used in this study to produce the Arabic dataset. This includes selecting and creating materials, collecting human ratings from a representative sample of participants, and calculating the overall ratings. This dataset will make a substantial contribution to future work in the field of Arabic WSS and hopefully it will be considered as a reference basis from which to evaluate and compare different methodologies in the field.

New Graph Similarity Measurements based on Isomorphic and Nonisomorphic Data Fusion and their Use in the Prediction of the Pharmacological Behavior of Drugs

New graph similarity methods have been proposed in this work with the aim to refining the chemical information extracted from molecules matching. For this purpose, data fusion of the isomorphic and nonisomorphic subgraphs into a new similarity measure, the Approximate Similarity, was carried out by several approaches. The application of the proposed method to the development of quantitative structure-activity relationships (QSAR) has provided reliable tools for predicting several pharmacological parameters: binding of steroids to the globulin-corticosteroid receptor, the activity of benzodiazepine receptor compounds, and the blood brain barrier permeability. Acceptable results were obtained for the models presented here.

Combining Similarity and Dissimilarity Measurements for the Development of QSAR Models Applied to the Prediction of Antiobesity Activity of Drugs

In this paper we study different similarity based approaches for the development of QSAR model devoted to the prediction of activity of antiobesity drugs. Classical similarity approaches are compared regarding to dissimilarity models based on the consideration of the calculation of Euclidean distances between the nonisomorphic fragments extracted in the matching process. Combining the classical similarity and dissimilarity approaches into a new similarity measure, the Approximate Similarity was also studied, and better results were obtained. The application of the proposed method to the development of quantitative structure-activity relationships (QSAR) has provided reliable tools for predicting of inhibitory activity of drugs. Acceptable results were obtained for the models presented here.

Maximum Common Substructure Extraction in RNA Secondary Structures Using Clique Detection Approach

The similarity comparison of RNA secondary structures is important in studying the functions of RNAs. In recent years, most existing tools represent the secondary structures by tree-based presentation and calculate the similarity by tree alignment distance. Different to previous approaches, we propose a new method based on maximum clique detection algorithm to extract the maximum common structural elements in compared RNA secondary structures. A new graph-based similarity measurement and maximum common subgraph detection procedures for comparing purely RNA secondary structures is introduced. Given two RNA secondary structures, the proposed algorithm consists of a process to determine the score of the structural similarity, followed by comparing vertices labelling, the labelled edges and the exact degree of each vertex. The proposed algorithm also consists of a process to extract the common structural elements between compared secondary structures based on a proposed maximum clique detection of the problem. This graph-based model also can work with NC-IUB code to perform the pattern-based searching. Therefore, it can be used to identify functional RNA motifs from database or to extract common substructures between complex RNA secondary structures. We have proved the performance of this proposed algorithm by experimental results. It provides a new idea of comparing RNA secondary structures. This tool is helpful to those who are interested in structural bioinformatics.

SAF: A Substitution and Alignment Free Similarity Measure for Protein Sequences

The literature reports a large number of approaches for measuring the similarity between protein sequences. Most of these approaches estimate this similarity using alignment-based techniques that do not necessarily yield biologically plausible results, for two reasons. First, for the case of non-alignable (i.e., not yet definitively aligned and biologically approved) sequences such as multi-domain, circular permutation and tandem repeat protein sequences, alignment-based approaches do not succeed in producing biologically plausible results. This is due to the nature of the alignment, which is based on the matching of subsequences in equivalent positions, while non-alignable proteins often have similar and conserved domains in non-equivalent positions. Second, the alignment-based approaches lead to similarity measures that depend heavily on the parameters set by the user for the alignment (e.g., gap penalties and substitution matrices). For easily alignable protein sequences, it's possible to supply a suitable combination of input parameters that allows such an approach to yield biologically plausible results. However, for difficult-to-align protein sequences, supplying different combinations of input parameters yields different results. Such variable results create ambiguities and complicate the similarity measurement task. To overcome these drawbacks, this paper describes a novel and effective approach for measuring the similarity between protein sequences, called SAF for Substitution and Alignment Free. Without resorting either to the alignment of protein sequences or to substitution relations between amino acids, SAF is able to efficiently detect the significant subsequences that best represent the intrinsic properties of protein sequences, those underlying the chronological dependencies of structural features and biochemical activities of protein sequences. Moreover, by using a new efficient subsequence matching scheme, SAF more efficiently handles protein sequences that contain similar structural features with significant meaning in chronologically non-equivalent positions. To show the effectiveness of SAF, extensive experiments were performed on protein datasets from different databases, and the results were compared with those obtained by several mainstream algorithms.

Recognition Machine (RM) for On-line and Isolated Flight Deck Officer (FDO) Gestures

The paper presents an on-line recognition machine (RM) for continuous/isolated, dynamic and static gestures that arise in Flight Deck Officer (FDO) training. RM is based on generic pattern recognition framework. Gestures are represented as templates using summary statistics. The proposed recognition algorithm exploits temporal and spatial characteristics of gestures via dynamic programming and Markovian process. The algorithm predicts corresponding index of incremental input data in the templates in an on-line mode. Accumulated consistency in the sequence of prediction provides a similarity measurement (Score) between input data and the templates. The algorithm provides an intuitive mechanism for automatic detection of start/end frames of continuous gestures. In the present paper, we consider isolated gestures. The performance of RM is evaluated using four datasets - artificial (W TTest), hand motion (Yang) and FDO (tracker, vision-based ). RM achieves comparable results which are in agreement with other on-line and off-line algorithms such as hidden Markov model (HMM) and dynamic time warping (DTW). The proposed algorithm has the additional advantage of providing timely feedback for training purposes.

Shape-Based Image Retrieval Using Shape Matrix

Retrieval image by shape similarity, given a template shape is particularly challenging, owning to the difficulty to derive a similarity measurement that closely conforms to the common perception of similarity by humans. In this paper, a new method for the representation and comparison of shapes is present which is based on the shape matrix and snake model. It is scaling, rotation, translation invariant. And it can retrieve the shape images with some missing or occluded parts. In the method, the deformation spent by the template to match the shape images and the matching degree is used to evaluate the similarity between them.