Abstract: Nowadays, the Web has become one of the most
pervasive platforms for information change and retrieval. It collects
the suitable and perfectly fitting information from websites that one
requires. Data mining is the form of extracting data’s available in the
internet. Web mining is one of the elements of data mining
Technique, which relates to various research communities such as
information recovery, folder managing system and simulated
intellects. In this Paper we have discussed the concepts of Web
mining. We contain generally focused on one of the categories of
Web mining, specifically the Web Content Mining and its various
farm duties. The mining tools are imperative to scanning the many
images, text, and HTML documents and then, the result is used by
the various search engines. We conclude by presenting a comparative
table of these tools based on some pertinent criteria.
Abstract: In this study, a fuzzy similarity approach for Arabic
web pages classification is presented. The approach uses a fuzzy
term-category relation by manipulating membership degree for the
training data and the degree value for a test web page. Six measures
are used and compared in this study. These measures include:
Einstein, Algebraic, Hamacher, MinMax, Special case fuzzy and
Bounded Difference approaches. These measures are applied and
compared using 50 different Arabic web pages. Einstein measure was
gave best performance among the other measures. An analysis of
these measures and concluding remarks are drawn in this study.
Abstract: Increasing growth of information volume in the
internet causes an increasing need to develop new (semi)automatic
methods for retrieval of documents and ranking them according to
their relevance to the user query. In this paper, after a brief review
on ranking models, a new ontology based approach for ranking
HTML documents is proposed and evaluated in various
circumstances. Our approach is a combination of conceptual,
statistical and linguistic methods. This combination reserves the
precision of ranking without loosing the speed. Our approach
exploits natural language processing techniques for extracting
phrases and stemming words. Then an ontology based conceptual
method will be used to annotate documents and expand the query.
To expand a query the spread activation algorithm is improved so
that the expansion can be done in various aspects. The annotated
documents and the expanded query will be processed to compute
the relevance degree exploiting statistical methods. The outstanding
features of our approach are (1) combining conceptual, statistical
and linguistic features of documents, (2) expanding the query with
its related concepts before comparing to documents, (3) extracting
and using both words and phrases to compute relevance degree, (4)
improving the spread activation algorithm to do the expansion based
on weighted combination of different conceptual relationships and
(5) allowing variable document vector dimensions. A ranking
system called ORank is developed to implement and test the
proposed model. The test results will be included at the end of the
paper.
Abstract: Increasing growth of information volume in the
internet causes an increasing need to develop new (semi)automatic
methods for retrieval of documents and ranking them according to
their relevance to the user query. In this paper, after a brief review
on ranking models, a new ontology based approach for ranking
HTML documents is proposed and evaluated in various
circumstances. Our approach is a combination of conceptual,
statistical and linguistic methods. This combination reserves the
precision of ranking without loosing the speed. Our approach
exploits natural language processing techniques to extract phrases
from documents and the query and doing stemming on words. Then
an ontology based conceptual method will be used to annotate
documents and expand the query. To expand a query the spread
activation algorithm is improved so that the expansion can be done
flexible and in various aspects. The annotated documents and the
expanded query will be processed to compute the relevance degree
exploiting statistical methods. The outstanding features of our
approach are (1) combining conceptual, statistical and linguistic
features of documents, (2) expanding the query with its related
concepts before comparing to documents, (3) extracting and using
both words and phrases to compute relevance degree, (4) improving
the spread activation algorithm to do the expansion based on
weighted combination of different conceptual relationships and (5)
allowing variable document vector dimensions. A ranking system
called ORank is developed to implement and test the proposed
model. The test results will be included at the end of the paper.