Abstract: Measuring semantic similarity between texts is calculating semantic relatedness between texts using various techniques. Our web application (Measuring Relatedness of Concepts-MRC) allows user to input two text corpuses and get semantic similarity percentage between both using WordNet. Our application goes through five stages for the computation of semantic relatedness. Those stages are: Preprocessing (extracts keywords from content), Feature Extraction (classification of words into Parts-of-Speech), Synonyms Extraction (retrieves synonyms against each keyword), Measuring Similarity (using keywords and synonyms, similarity is measured) and Visualization (graphical representation of similarity measure). Hence the user can measure similarity on basis of features as well. The end result is a percentage score and the word(s) which form the basis of similarity between both texts with use of different tools on same platform. In future work we look forward for a Web as a live corpus application that provides a simpler and user friendly tool to compare documents and extract useful information.
Abstract: Measuring semantic similarity between texts is calculating semantic relatedness between texts using various techniques. Our web application (Measuring Relatedness of Concepts-MRC) allows user to input two text corpuses and get semantic similarity percentage between both using WordNet. Our application goes through five stages for the computation of semantic relatedness. Those stages are: Preprocessing (extracts keywords from content), Feature Extraction (classification of words into Parts-of-Speech), Synonyms Extraction (retrieves synonyms against each keyword), Measuring Similarity (using keywords and synonyms, similarity is measured) and Visualization (graphical representation of similarity measure). Hence the user can measure similarity on basis of features as well. The end result is a percentage score and the word(s) which form the basis of similarity between both texts with use of different tools on same platform. In future work we look forward for a Web as a live corpus application that provides a simpler and user friendly tool to compare documents and extract useful information.
Abstract: The most important property of the Gene Ontology is
the terms. These control vocabularies are defined to provide
consistent descriptions of gene products that are shareable and
computationally accessible by humans, software agent, or other
machine-readable meta-data. Each term is associated with
information such as definition, synonyms, database references, amino
acid sequences, and relationships to other terms. This information has
made the Gene Ontology broadly applied in microarray and
proteomic analysis. However, the process of searching the terms is
still carried out using traditional approach which is based on keyword
matching. The weaknesses of this approach are: ignoring semantic
relationships between terms, and highly depending on a specialist to
find similar terms. Therefore, this study combines semantic similarity
measure and genetic algorithm to perform a better retrieval process
for searching semantically similar terms. The semantic similarity
measure is used to compute similitude strength between two terms.
Then, the genetic algorithm is employed to perform batch retrievals
and to handle the situation of the large search space of the Gene
Ontology graph. The computational results are presented to show the
effectiveness of the proposed algorithm.
Abstract: Chinese Idioms are a type of traditional Chinese idiomatic
expressions with specific meanings and stereotypes structure
which are widely used in classical Chinese and are still common in
vernacular written and spoken Chinese today. Currently, Chinese
Idioms are retrieved in glossary with key character or key word in
morphology or pronunciation index that can not meet the need of
searching semantically. OCIRS is proposed to search the desired
idiom in the case of users only knowing its meaning without any key
character or key word. The user-s request in a sentence or phrase will
be grammatically analyzed in advance by word segmentation, key
word extraction and semantic similarity computation, thus can be
mapped to the idiom domain ontology which is constructed to provide
ample semantic relations and to facilitate description logics-based
reasoning for idiom retrieval. The experimental evaluation shows that
OCIRS realizes the function of searching idioms via semantics, obtaining
preliminary achievement as requested by the users.
Abstract: This paper is concerned with the production of an Arabic word semantic similarity benchmark dataset. It is the first of its kind for Arabic which was particularly developed to assess the accuracy of word semantic similarity measurements. Semantic similarity is an essential component to numerous applications in fields such as natural language processing, artificial intelligence, linguistics, and psychology. Most of the reported work has been done for English. To the best of our knowledge, there is no word similarity measure developed specifically for Arabic. In this paper, an Arabic benchmark dataset of 70 word pairs is presented. New methods and best possible available techniques have been used in this study to produce the Arabic dataset. This includes selecting and creating materials, collecting human ratings from a representative sample of participants, and calculating the overall ratings. This dataset will make a substantial contribution to future work in the field of Arabic WSS and hopefully it will be considered as a reference basis from which to evaluate and compare different methodologies in the field.
Abstract: This paper describes a novel approach for deriving
modules from protein-protein interaction networks, which combines
functional information with topological properties of the network.
This approach is based on weighted clustering coefficient, which
uses weights representing the functional similarities between the
proteins. These weights are calculated according to the semantic
similarity between the proteins, which is based on their Gene
Ontology terms. We recently proposed an algorithm for identification
of functional modules, called SWEMODE (Semantic WEights for
MODule Elucidation), that identifies dense sub-graphs containing
functionally similar proteins. The rational underlying this approach is
that each module can be reduced to a set of triangles (protein triplets
connected to each other). Here, we propose considering semantic
similarity weights of all triangle-forming edges between proteins. We
also apply varying semantic similarity thresholds between
neighbours of each node that are not neighbours to each other (and
hereby do not form a triangle), to derive new potential triangles to
include in module-defining procedure. The results show an
improvement of pure topological approach, in terms of number of
predicted modules that match known complexes.
Abstract: Annotation of a protein sequence is pivotal for the understanding of its function. Accuracy of manual annotation provided by curators is still questionable by having lesser evidence strength and yet a hard task and time consuming. A number of computational methods including tools have been developed to tackle this challenging task. However, they require high-cost hardware, are difficult to be setup by the bioscientists, or depend on time intensive and blind sequence similarity search like Basic Local Alignment Search Tool. This paper introduces a new method of assigning highly correlated Gene Ontology terms of annotated protein sequences to partially annotated or newly discovered protein sequences. This method is fully based on Gene Ontology data and annotations. Two problems had been identified to achieve this method. The first problem relates to splitting the single monolithic Gene Ontology RDF/XML file into a set of smaller files that can be easy to assess and process. Thus, these files can be enriched with protein sequences and Inferred from Electronic Annotation evidence associations. The second problem involves searching for a set of semantically similar Gene Ontology terms to a given query. The details of macro and micro problems involved and their solutions including objective of this study are described. This paper also describes the protein sequence annotation and the Gene Ontology. The methodology of this study and Gene Ontology based protein sequence annotation tool namely extended UTMGO is presented. Furthermore, its basic version which is a Gene Ontology browser that is based on semantic similarity search is also introduced.
Abstract: Nowadays, organizing a repository of documents and
resources for learning on a special field as Information Technology
(IT), together with search techniques based on domain knowledge or
document-s content is an urgent need in practice of teaching, learning
and researching. There have been several works related to methods of
organization and search by content. However, the results are still
limited and insufficient to meet user-s demand for semantic
document retrieval. This paper presents a solution for the
organization of a repository that supports semantic representation and
processing in search. The proposed solution is a model which
integrates components such as an ontology describing domain
knowledge, a database of document repository, semantic
representation for documents and a file system; with problems,
semantic processing techniques and advanced search techniques
based on measuring semantic similarity. The solution is applied to
build a IT learning materials management system of a university with
semantic search function serving students, teachers, and manager as
well. The application has been implemented, tested at the University
of Information Technology, Ho Chi Minh City, Vietnam and has
achieved good results.