Abstract: Aerospace, mechanical, and civil engineering infrastructures can take advantages from damage detection and identification strategies in terms of maintenance cost reduction and operational life improvements, as well for safety scopes. The challenge is to detect so called “barely visible impact damage” (BVID), due to low/medium energy impacts, that can progressively compromise the structure integrity. The occurrence of any local change in material properties, that can degrade the structure performance, is to be monitored using so called Structural Health Monitoring (SHM) systems, in charge of comparing the structure states before and after damage occurs. SHM seeks for any "anomalous" response collected by means of sensor networks and then analyzed using appropriate algorithms. Independently of the specific analysis approach adopted for structural damage detection and localization, textual reports, tables and graphs describing possible outlier coordinates and damage severity are usually provided as artifacts to be elaborated for information extraction about the current health conditions of the structure under investigation. Visual Analytics can support the processing of monitored measurements offering data navigation and exploration tools leveraging the native human capabilities of understanding images faster than texts and tables. Herein, a SHM system enrichment by integration of a Visual Analytics component is investigated. Analytical dashboards have been created by combining worksheets, so that a useful Visual Analytics tool is provided to structural analysts for exploring the structure health conditions examined by a Principal Component Analysis based algorithm.
Abstract: The study has been carried out on the Saranda in Jharkhand and a part of Odisha state. Geospatial data of Hyperion, a remote sensing satellite, have been used. This study has used a wide variety of patterns related to image processing to enhance and extract the mining class of Fe and Mn ores.Landsat-8, OLI sensor data have also been used to correctly explore related minerals. In this way, various processes have been applied to increase the mineralogy class and comparative evaluation with related frequency done. The Hyperion dataset for hyperspectral remote sensing has been specifically verified as an effective tool for mineral or rock information extraction within the band range of shortwave infrared used. The abundant spatial and spectral information contained in hyperspectral images enables the differentiation of different objects of any object into targeted applications for exploration such as exploration detection, mining.
Abstract: In an access-control situation, the role of a user determines whether a data request is appropriate. This paper combines vetted web mining and logic modeling to build a lightweight system for determining the role of a health care provider based only on their prior authorized requests. The model identifies provider roles with 100% recall from very little data. This shows the value of vetted web mining in AI systems, and suggests the impact of the ICD classification on medical practice.
Abstract: Causal relation identification is a crucial task in information extraction and knowledge discovery. In this work, we present two approaches to causal relation identification. The first is a classification model trained on a set of knowledge-based features. The second is a deep learning based approach training a model using convolutional neural networks to classify causal relations. We experiment with several different convolutional neural networks (CNN) models based on previous work on relation extraction as well as our own research. Our models are able to identify both explicit and implicit causal relations as well as the direction of the causal relation. The results of our experiments show a higher accuracy than previously achieved for causal relation identification tasks.
Abstract: Arabic is one of the most ancient and critical languages in the world. It has over than 250 million Arabic native speakers and more than twenty countries having Arabic as one of its official languages. In the past decade, we have witnessed a rapid evolution in smart devices, social network and technology sector which led to the need to provide tools and libraries that properly tackle the Arabic language in different domains. Stemming is one of the most crucial linguistic fundamentals. It is used in many applications especially in information extraction and text mining fields. The motivation behind this work is to enhance the Arabic light stemmer to serve the data mining industry and leverage it in an open source community. The presented implementation works on enhancing the Arabic light stemmer by utilizing and enhancing an algorithm that provides an extension for a new set of rules and patterns accompanied by adjusted procedure. This study has proven a significant enhancement for better search accuracy with an average 10% improvement in comparison with previous works.
Abstract: Advances in spatial and spectral resolution of satellite
images have led to tremendous growth in large image databases. The
data we acquire through satellites, radars, and sensors consists of
important geographical information that can be used for remote
sensing applications such as region planning, disaster management.
Spatial data classification and object recognition are important tasks
for many applications. However, classifying objects and identifying
them manually from images is a difficult task. Object recognition is
often considered as a classification problem, this task can be
performed using machine-learning techniques. Despite of many
machine-learning algorithms, the classification is done using
supervised classifiers such as Support Vector Machines (SVM) as the
area of interest is known. We proposed a classification method,
which considers neighboring pixels in a region for feature extraction
and it evaluates classifications precisely according to neighboring
classes for semantic interpretation of region of interest (ROI). A
dataset has been created for training and testing purpose; we
generated the attributes by considering pixel intensity values and
mean values of reflectance. We demonstrated the benefits of using
knowledge discovery and data-mining techniques, which can be on
image data for accurate information extraction and classification from
high spatial resolution remote sensing imagery.
Abstract: The critical concern of satellite operations is to ensure
the health and safety of satellites. The worst case in this perspective
is probably the loss of a mission, but the more common interruption
of satellite functionality can result in compromised mission
objectives. All the data acquiring from the spacecraft are known as
Telemetry (TM), which contains the wealth information related to the
health of all its subsystems. Each single item of information is
contained in a telemetry parameter, which represents a time-variant
property (i.e. a status or a measurement) to be checked. As a
consequence, there is a continuous improvement of TM monitoring
systems to reduce the time required to respond to changes in a
satellite's state of health. A fast conception of the current state of the
satellite is thus very important to respond to occurring failures.
Statistical multivariate latent techniques are one of the vital learning
tools that are used to tackle the problem above coherently.
Information extraction from such rich data sources using advanced
statistical methodologies is a challenging task due to the massive
volume of data. To solve this problem, in this paper, we present a
proposed unsupervised learning algorithm based on Principle
Component Analysis (PCA) technique. The algorithm is particularly
applied on an actual remote sensing spacecraft. Data from the
Attitude Determination and Control System (ADCS) was acquired
under two operation conditions: normal and faulty states. The models
were built and tested under these conditions, and the results show that
the algorithm could successfully differentiate between these
operations conditions. Furthermore, the algorithm provides
competent information in prediction as well as adding more insight
and physical interpretation to the ADCS operation.
Abstract: Creating a database scheme is essentially a manual
process. From a requirement specification the information contained
within has to be analyzed and reduced into a set of tables, attributes
and relationships. This is a time consuming process that has to go
through several stages before an acceptable database schema is
achieved. The purpose of this paper is to implement a Natural
Language Processing (NLP) based tool to produce a relational
database from a requirement specification. The Stanford CoreNLP
version 3.3.1 and the Java programming were used to implement the
proposed model. The outcome of this study indicates that a first draft
of a relational database schema can be extracted from a requirement
specification by using NLP tools and techniques with minimum user
intervention. Therefore this method is a step forward in finding a
solution that requires little or no user intervention.
Abstract: Field Association (FA) terms are a limited set of discriminating terms that give us the knowledge to identify document fields which are effective in document classification, similar file retrieval and passage retrieval. But the problem lies in the lack of an effective method to extract automatically relevant Arabic FA Terms to build a comprehensive dictionary. Moreover, all previous studies are based on FA terms in English and Japanese, and the extension of FA terms to other language such Arabic could be definitely strengthen further researches. This paper presents a new method to extract, Arabic FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. Experimental evaluation is carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news selected average of 2,825 FA Terms (single and compound) per field. From the experimental results, recall and precision are 84% and 79% respectively. Therefore, this method selects higher number of relevant Arabic FA Terms at high precision and recall.
Abstract: The myoelectric signal (MES) is one of the Biosignals
utilized in helping humans to control equipments. Recent approaches
in MES classification to control prosthetic devices employing pattern
recognition techniques revealed two problems, first, the classification
performance of the system starts degrading when the number of
motion classes to be classified increases, second, in order to solve the
first problem, additional complicated methods were utilized which
increase the computational cost of a multifunction myoelectric
control system. In an effort to solve these problems and to achieve a
feasible design for real time implementation with high overall
accuracy, this paper presents a new method for feature extraction in
MES recognition systems. The method works by extracting features
using Wavelet Packet Transform (WPT) applied on the MES from
multiple channels, and then employs Fuzzy c-means (FCM)
algorithm to generate a measure that judges on features suitability for
classification. Finally, Principle Component Analysis (PCA) is
utilized to reduce the size of the data before computing the
classification accuracy with a multilayer perceptron neural network.
The proposed system produces powerful classification results (99%
accuracy) by using only a small portion of the original feature set.
Abstract: From the importance of the conference and its
constructive role in the studies discussion, there must be a strong
organization that allows the exploitation of the discussions in opening
new horizons. The vast amount of information scattered across the
web, make it difficult to find experts, who can play a prominent role
in organizing conferences. In this paper we proposed a new approach
of extracting researchers- information from various Web resources
and correlating them in order to confirm their correctness. As a
validator of this approach, we propose a service that will be useful to
set up a conference. Its main objective is to find appropriate experts,
as well as the social events for a conference. For this application we
us Semantic Web technologies like RDF and ontology to represent
the confirmed information, which are linked to another ontology
(skills ontology) that are used to present and compute the expertise.
Abstract: Number of documents being created increases at an
increasing pace while most of them being in already known topics
and little of them introducing new concepts. This fact has started a
new era in information retrieval discipline where the requirements
have their own specialties. That is digging into topics and concepts
and finding out subtopics or relations between topics. Up to now IR
researches were interested in retrieving documents about a general
topic or clustering documents under generic subjects. However these
conventional approaches can-t go deep into content of documents
which makes it difficult for people to reach to right documents they
were searching. So we need new ways of mining document sets
where the critic point is to know much about the contents of the
documents. As a solution we are proposing to enhance LSI, one of
the proven IR techniques by supporting its vector space with n-gram
forms of words. Positive results we have obtained are shown in two
different application area of IR domain; querying a document
database, clustering documents in the document database.
Abstract: The rapid expansion of the web is causing the
constant growth of information, leading to several problems such as
increased difficulty of extracting potentially useful knowledge. Web
content mining confronts this problem gathering explicit information
from different web sites for its access and knowledge discovery.
Query interfaces of web databases share common building blocks.
After extracting information with parsing approach, we use a new
data mining algorithm to match a large number of schemas in
databases at a time. Using this algorithm increases the speed of
information matching. In addition, instead of simple 1:1 matching,
they do complex (m:n) matching between query interfaces. In this
paper we present a novel correlation mining algorithm that matches
correlated attributes with smaller cost. This algorithm uses Jaccard
measure to distinguish positive and negative correlated attributes.
After that, system matches the user query with different query
interfaces in special domain and finally chooses the nearest query
interface with user query to answer to it.
Abstract: In this paper we propose an NLP-based method for
Ontology Population from texts and apply it to semi automatic
instantiate a Generic Knowledge Base (Generic Domain Ontology) in
the risk management domain. The approach is semi-automatic and
uses a domain expert intervention for validation. The proposed
approach relies on a set of Instances Recognition Rules based on
syntactic structures, and on the predicative power of verbs in the
instantiation process. It is not domain dependent since it heavily
relies on linguistic knowledge.
A description of an experiment performed on a part of the
ontology of the PRIMA1 project (supported by the European
community) is given. A first validation of the method is done by
populating this ontology with Chemical Fact Sheets from
Environmental Protection Agency2. The results of this experiment
complete the paper and support the hypothesis that relying on the
predicative power of verbs in the instantiation process improves the
performance.
Abstract: Mammographic images and data analysis to
facilitate modelling or computer aided diagnostic (CAD) software development should best be done using a common database that can handle various mammographic image file
formats and relate these to other patient information.
This would optimize the use of the data as both primary
reporting and enhanced information extraction of research data could be performed from the single dataset. One desired
improvement is the integration of DICOM file header information into the database, as an efficient and reliable source of supplementary patient information intrinsically
available in the images.
The purpose of this paper was to design a suitable database to link and integrate different types of image files and gather common information that can be further used for research
purposes. An interface was developed for accessing, adding,
updating, modifying and extracting data from the common
database, enhancing the future possible application of the data in CAD processing.
Technically, future developments envisaged include the creation of an advanced search function to selects image files
based on descriptor combinations. Results can be further used for specific CAD processing and other research. Design of a
user friendly configuration utility for importing of the required fields from the DICOM files must be done.
Abstract: The internet has become an attractive avenue for
global e-business, e-learning, knowledge sharing, etc. Due to
continuous increase in the volume of web content, it is not practically
possible for a user to extract information by browsing and integrating
data from a huge amount of web sources retrieved by the existing
search engines. The semantic web technology enables advancement
in information extraction by providing a suite of tools to integrate
data from different sources. To take full advantage of semantic web,
it is necessary to annotate existing web pages into semantic web
pages. This research develops a tool, named OWIE (Ontology-based
Web Information Extraction), for semantic web annotation using
domain specific ontologies. The tool automatically extracts
information from html pages with the help of pre-defined ontologies
and gives them semantic representation. Two case studies have been
conducted to analyze the accuracy of OWIE.
Abstract: In the area of Human Resource Management, the trend is towards online exchange of information about human resources. For example, online applications for employment become standard and job offerings are posted in many job portals. However, there are too many job portals to monitor all of them if someone is interested in a new job. We developed a prototype for integrating information of different job portals into one meta-search engine. First, existing job portals were investigated and XML schema documents were derived automated from these portals. Second, translation rules for transforming each schema to a central HR-XML-conform schema were determined. The HR-XML-schema is used to build a form for searching jobs. The data supplied by a user in this form is now translated into queries for the different job portals. Each result obtained by a job portal is sent to the meta-search engine that ranks the result of all received job offers according to user's preferences.
Abstract: Due to the ever growing amount of publications about
protein-protein interactions, information extraction from text is
increasingly recognized as one of crucial technologies in
bioinformatics. This paper presents a Protein Interaction Extraction
System using a Link Grammar Parser from biomedical abstracts
(PIELG). PIELG uses linkage given by the Link Grammar Parser to
start a case based analysis of contents of various syntactic roles as
well as their linguistically significant and meaningful combinations.
The system uses phrasal-prepositional verbs patterns to overcome
preposition combinations problems. The recall and precision are
74.4% and 62.65%, respectively. Experimental evaluations with two
other state-of-the-art extraction systems indicate that PIELG system
achieves better performance. For further evaluation, the system is
augmented with a graphical package (Cytoscape) for extracting
protein interaction information from sequence databases. The result
shows that the performance is remarkably promising.
Abstract: Automatic Extraction of Event information from
social text stream (emails, social network sites, blogs etc) is a vital
requirement for many applications like Event Planning and
Management systems and security applications. The key information
components needed from Event related text are Event title, location,
participants, date and time. Emails have very unique distinctions over
other social text streams from the perspective of layout and format
and conversation style and are the most commonly used
communication channel for broadcasting and planning events.
Therefore we have chosen emails as our dataset. In our work, we
have employed two statistical NLP methods, named as Finite State
Machines (FSM) and Hidden Markov Model (HMM) for the
extraction of event related contextual information. An application
has been developed providing a comparison among the two methods
over the event extraction task. It comprises of two modules, one for
each method, and works for both bulk as well as direct user input.
The results are evaluated using Precision, Recall and F-Score.
Experiments show that both methods produce high performance and
accuracy, however HMM was good enough over Title extraction and
FSM proved to be better for Venue, Date, and time.
Abstract: Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is now-a-days considered to be fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports about the development of a NER system for Bengali and Hindi using Support Vector Machine (SVM). Though this state of the art machine learning technique has been widely applied to NER in several well-studied languages, the use of this technique to Indian languages (ILs) is very new. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the four different named (NE) classes, such as Person name, Location name, Organization name and Miscellaneous name. We have used the annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi tagged with the twelve different NE classes 1, defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL) 2. In addition, we have manually annotated 150K wordforms of the Bengali news corpus, developed from the web-archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm in order to generate the lexical context patterns from a part of the unlabeled Bengali news corpus. Lexical patterns have been used as the features of SVM in order to improve the system performance. The NER system has been tested with the gold standard test sets of 35K, and 60K tokens for Bengali, and Hindi, respectively. Evaluation results have demonstrated the recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. Results show the improvement in the f-score by 5.13% with the use of context patterns. Statistical analysis, ANOVA is also performed to compare the performance of the proposed NER system with that of the existing HMM based system for both the languages.