Abstract: Twitter is one of the most popular social media platforms where users share their opinions on different subjects. Twitter can be considered a great source for text mining due to the high volume of data generated through the platform daily. Many industries, such as telecommunication companies, can leverage the availability of Twitter data to better understand their markets and make appropriate business decisions. This study performs topic modeling on Twitter data using Latent Dirichlet Allocation (LDA). The obtained results are benchmarked against another topic modeling technique, Latent Semantic Indexing (LSI). The study aims to retrieve topics from a Twitter dataset containing user tweets on South African Telcos. Results from this study show that LSI is much faster than LDA. However, LDA yields better results, with topic coherence 8% higher for the best-performing model in this experiment. A higher topic coherence score indicates better performance of the model.
Abstract: The role of controlled vocabularies in information retrieval is broadly recognized. There is also a standing demand that editors and databases introduce controlled vocabularies effectively into their procedures for indexing scientific literature. This is especially important because information retrieval is a key driver of systematic literature reviews. Hence, a first question emerges: are controlled vocabularies currently being used? Moreover, subject searching in catalogs is complex, mainly because of the dichotomy between author-supplied keywords and keywords based on controlled vocabularies. Finally, there is some demand to unify health-related terminology to make medical histories easier to exploit and research. Considering these features, this paper focuses on controlled vocabularies related to the health field and their role in storing, classifying, and retrieving relevant literature. The objective is to determine what role health-related controlled vocabularies play in indexing and retrieving research literature in databases such as Web of Science (WoS) and Scopus. This exploratory research is therefore grounded in two research questions: 1) Which terms are included in specific controlled vocabularies of the health field? and 2) How are papers indexed in relevant databases so that they can be easily retrieved, considering keywords versus health-specific controlled vocabularies? This research takes as its fieldwork the controlled vocabularies related to health and the scientific interest in the 1918 flu pandemic, also known, misleadingly, as the ‘Spanish flu’. This interest has been fostered by the emergence in the early 21st century of epidemics of pneumonic diseases caused by viruses. Searches about and with controlled vocabularies are conducted on the WoS and Scopus databases. The first results of this work in progress are surprising.
There are different controlled vocabularies for the health field, in which the collected and preferred terms related to the ‘1918 pandemic’ are identified. To summarize, ‘Spanish influenza epidemic’ and ‘Spanish flu’ are recorded as non-preferred terms. The preferred terms are ‘influenza’ or ‘influenza pandemic, 1918-1919’. Although the controlled vocabularies are clear in their choice, most of the literature about the ‘1918 pandemic’ is retrievable by either ‘Spanish’ or ‘1918’, and the dominant word for retrieving literature is ‘Spanish’ rather than ‘1918’. This is surprising considering the existence of suitable controlled vocabularies for health topics and the modern World Health Organization guidelines on naming diseases, which point to other preferred terms. A first conclusion is that controlled vocabularies are not being used effectively in a field such as health, and consequently in WoS and Scopus. This research opens further questions about the role that controlled vocabularies play in the instructions that journals give to authors.
Abstract: Multimedia Indexing and Retrieval is generally designed and implemented by employing feature graphs. These graphs typically contain a significant number of nodes and edges to reflect the level of detail in feature detection. A higher level of detail increases the effectiveness of the results but also leads to more complex graph structures. However, graph-traversal-based algorithms for similarity are quite inefficient and computation intensive, especially for large data structures. To deliver fast and effective retrieval, an efficient similarity algorithm, particularly for large graphs, is mandatory. Hence, in this paper, we define a graph-projection into a 2D space (Graph Code) as well as the corresponding algorithms for indexing and retrieval. We show that calculations in this space can be performed more efficiently than graph-traversals due to a simpler processing model and a high level of parallelisation. In consequence, we prove that the effectiveness of retrieval also increases substantially, as Graph Codes facilitate more levels of detail in feature fusion. Thus, Graph Codes provide a significant increase in efficiency and effectiveness (especially for Multimedia indexing and retrieval) and can be applied to images, videos, audio, and text information.
Abstract: The examination of the Public Service Organization’s performance evaluation includes several steps that help public organizations to develop a more efficient system. Public sector organizations have different characteristics than the competitive sector, so it can be stated that other/new elements become more important in their performance processes. The literature in this area is diverse, so highlighting an indicator system can be useful for introducing a system, but it is also worthwhile to measure the specific elements of the organization. In the case of a public service organization, due to the service obligation, it is usually possible to talk about a high number of users, so compliance is more difficult. For the organization, it is an important target to place great emphasis on the increase of service standards and the development of related processes. In this research, the health sector is given a prominent role, as it is a sensitive area where both organizational and individual performance is important for all participants. As a primary step, the content of the strategy is decisive, as this is important for the efficient structure of the process. When designing any system, it is important to review the expectations of the stakeholders, as this is primary when considering the design. The goal of this paper is to build the foundations of a performance management and indexing framework that can help a hospital to provide effective feedback and a direction that is important in assessing and developing a service and can become a management philosophy.
Abstract: An essential task in the field of artificial intelligence is
to allow computers to interact with people through natural language.
Therefore, research topics such as virtual assistants and dialogue
systems have received widespread attention from industry and academia.
Response generation plays a crucial role in dialogue systems, so to
push forward the research on this topic, this paper surveys various
methods for response generation. We sort these methods into
three categories. The first includes finite state machine methods,
framework methods, and instance methods. The second contains
full-text indexing methods, ontology methods, vast knowledge base
methods, and some other methods. The third covers retrieval methods
and generative methods. We also discuss some hybrid methods based
on knowledge and deep learning. We compare their advantages and
disadvantages and point out in which ways these studies can be improved
further. Our discussion covers some studies published in leading
conferences such as IJCAI and AAAI in recent years.
Abstract: Ontologies and various semantic repositories have become a convenient approach for implementing model-driven architectures of distributed systems on the Web. SPARQL is the standard query language for such repositories. However, although SPARQL is a well-established standard for querying semantic repositories in RDF and OWL format, and there are commonly used APIs that support it, such as Jena for Java, a parallel option is not incorporated in them. This article presents a complete framework consisting of an object algebra for parallel RDF and an index-based implementation of a parallel query engine capable of dealing with distributed RDF ontologies that share a common vocabulary. It has been implemented in Java and, to validate the algorithms, has been applied to the problem of organizing virtual exhibitions on the Web.
Abstract: The H-index has been widely used as a performance indicator for researchers around the world, especially in Indonesia. The Government uses Scopus and Google Scholar as indexing references in providing recognition and appreciation. However, those two indexing services yield different H-index values. For that reason, this paper evaluates the difference between the H-index values from those services. Researchers indexed by Webometrics are used as reference data in this paper. Currently, Webometrics only uses the H-index from Google Scholar. This paper observed and compared the corresponding researchers' data from Scopus to get their H-index scores. Subsequently, some researchers with large differences in score are examined in more detail with respect to the publishers of their papers. This paper shows that the H-index of researchers in Google Scholar is approximately 2.45 times their Scopus H-index. Most of the difference is due to uncertified publishers, which are counted in Google Scholar but not in Scopus.
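For readers unfamiliar with the metric, the following minimal sketch shows how an H-index is computed from a list of per-paper citation counts, and why two services that count citations differently can report different values. The citation counts are hypothetical illustration data, not figures from the study, and the function name `h_index` is our own.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for the same researcher in two services.
# A service that indexes more venues will usually see more papers and citations.
scholar_counts = [25, 12, 9, 7, 5, 5, 3, 1]
scopus_counts = [10, 4, 2, 1, 0]

print("Google Scholar style H-index:", h_index(scholar_counts))
print("Scopus style H-index:", h_index(scopus_counts))
```

Because the H-index depends on which citing documents a service indexes, the same publication record can produce different values in each service.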
Abstract: In this paper, we propose a framework to help users search for and retrieve the portions of a lecture video that interest them. This is achieved by temporally segmenting and indexing the lecture video using topic keywords. For this purpose, we use transcribed text from the video and documents relevant to the video topic extracted from the web. The keywords for indexing are found by applying non-negative matrix factorization (NMF) topic modeling techniques to the web documents. Our proposed technique first creates indices on the transcribed documents using the topic keywords, and these are mapped to the video to find the start and end times of the portions of the video for a particular topic. This time information is stored in the index table along with the topic keyword, which is used to retrieve the specific portions of the video for the query provided by the user.
Abstract: A practical and simple self-indexing data structure, Partitioned Elias-Fano (PEF) - Compressed Suffix Arrays (CSA), is built in linear time for the CSA based on PEF indexes. Moreover, the PEF-CSA is compared with two classical compressed indexing methods, the Ferragina and Manzini implementation (FMI) and Sad-CSA, on files of different types and sizes in Pizza & Chili. The PEF-CSA performs better on the existing data in terms of compression ratio, count time, and locate time, except for evenly distributed data such as protein data. The experiments show that the distribution of φ matters more for the compression ratio than the alphabet size does. Unevenly distributed φ yields better compression, and the larger the number of hits, the longer the count and locate times.
Abstract: In this paper, a shot boundary detection method is presented using an octagon square search pattern. The color, edge, motion and texture features of each frame are extracted and used in shot boundary detection. The motion feature is extracted using the octagon square search pattern. Then, the transition detection method detects the shot or non-shot boundaries in the video using the feature weight values. Experimental results are evaluated on a TRECVID video test set containing various types of shot transition with lighting effects and object and camera movement within the shots. Further, this paper compares the experimental results of the proposed method with existing methods. It shows that the proposed method outperforms the state-of-the-art methods for shot boundary detection.
Abstract: The rich Islamic resources related to religious text,
Islamic sciences, and history are widely available in print and in
electronic format online. However, most of these works are only
available in the Arabic language. In this research, an attempt is made
to utilize these resources to create interactive web applications in
Arabic, English and other languages. The system utilizes Pattern
Recognition, Knowledge Management, Data Mining, Information
Retrieval and Management, Indexing, storage and data-analysis
techniques to parse, store, convert and manage the information from
authentic Arabic resources. These interactive web apps provide
smart multi-lingual search, tree-based search, on-demand information
matching and linking. In this paper, we provide details of the application
architecture, design, implementation and technologies employed. We
also present a summary of the web applications already developed
and include some screenshots from the corresponding web
sites. These web applications provide innovative online learning
systems (eLearning and computer-based education).
Abstract: The development of web technologies and mobile devices makes creating, accessing, using and sharing information or communicating with each other simpler every day. However, as the amount of information constantly increases, it is becoming harder to effectively organize and find quality information despite the availability of web search engines and filtering and indexing tools. Although digital technologies have an overall positive impact on students’ lives, frequent use of these technologies and of digital media enriched with dynamic hypertext and hypermedia content, as well as multitasking and distractions caused by notifications, calls or messages, can decrease the attention span and make thinking, memorizing and learning more difficult, which can lead to stress and mental exhaustion. This is referred to as “information overload”, “information glut” or “information anxiety”. The objective of this study is to determine whether students show signs of information overload and to identify possible predictors. The research was conducted using a questionnaire developed for the purpose of this study. The results show that students frequently use technology (computers, gadgets and digital media), while they show a moderate level of information literacy. They have sometimes experienced symptoms of information overload. According to the statistical analysis, a higher frequency of technology use and a lower level of information literacy are correlated with greater information overload. The multiple regression analysis has confirmed that the combination of these two independent variables has statistically significant predictive capacity for information overload. Therefore, information science teachers should pay attention to improving the level of students’ information literacy and educate them about the risks of excessive technology use.
Abstract: In this paper, we propose a new method for three-dimensional
object indexing based on the D.A.M.C-S.H.C descriptor
(Direct and Analytical Method for Calculating the Spherical
Harmonics Coefficients). To this end, we propose a direct
calculation of the coefficients of spherical harmonics with perfect
precision. The aims of the method are to minimize the processing
time on the 3D objects database and the time needed to search for
objects similar to a request object.
We start by defining the new descriptor using a new
division of the 3-D object in a sphere. Then we define a new distance,
which is tested and shown to be efficient in the search for similar
objects in a database containing objects of widely varying
and large sizes.
Abstract: In this paper, we are interested in the problem of
finding similar images in a large database. For this purpose we
propose a new algorithm based on a combination of the 2-D
histogram intersection in the HSV space and statistical moments. The
proposed histogram is based on a 3x3 window and not only on the
intensity of the pixel. This approach overcomes the drawback of the
conventional 1-D histogram, which ignores the spatial distribution
of pixels in the image, while the statistical moments are used to
escape the effects of the discretisation of the color space which is
intrinsic to the use of histograms. We compare the performance of
our new algorithm to various methods of the state of the art and we
show that it has several advantages. It is fast, consumes little memory
and requires no learning. To validate our results, we apply this
algorithm to search for similar images in different image databases.
Abstract: The growth over centuries in the volume of text data such as
books and articles in libraries has made it necessary to establish
effective mechanisms to locate them. Early techniques such as
abstraction, indexing and the use of classification categories have
marked the birth of a new field of research called "Information
Retrieval". Information Retrieval (IR) can be defined as the task of
defining models and systems whose purpose is to facilitate access to
a set of documents in electronic form (corpus) to allow a user to find
the ones relevant to them, that is to say, the contents that match
the information needs of the user. This paper presents a new
semantic indexing approach for a document corpus. The indexing
process starts with a term weighting phase to determine the
importance of the terms in the documents. Then the use of a
thesaurus like WordNet allows moving to the conceptual level.
Each candidate concept is evaluated by determining its level of
representation of the document, that is to say, the importance of the
concept in relation to other concepts of the document. Finally, the
semantic index is constructed by attaching to each concept of the
ontology, the documents of the corpus in which these concepts are
found.
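The pipeline described above (term weighting, mapping terms to concepts, then attaching documents to concepts) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the abstract does not name its weighting scheme, so TF-IDF is assumed here; the `TERM_TO_CONCEPT` dictionary is a hypothetical stand-in for a WordNet lookup; and a concept's score in a document is simply taken as the sum of the weights of its terms.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a thesaurus lookup; a real system would query WordNet.
TERM_TO_CONCEPT = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "dog": "animal", "cat": "animal",
}


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df) (assumed scheme)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights


def semantic_index(docs):
    """Return {concept: {doc_id: score}}, attaching documents to the concepts they mention."""
    index = defaultdict(dict)
    for doc_id, term_weights in tf_idf(docs).items():
        for term, weight in term_weights.items():
            concept = TERM_TO_CONCEPT.get(term)
            if concept is not None:
                index[concept][doc_id] = index[concept].get(doc_id, 0.0) + weight
    return dict(index)


# Tiny hypothetical corpus of already tokenized documents.
docs = {
    "d1": "car automobile road".split(),
    "d2": "dog cat road".split(),
    "d3": "truck dog".split(),
}
index = semantic_index(docs)
for concept, postings in index.items():
    print(concept, postings)
```

Note that "car" and "automobile" both contribute to the same concept entry for `d1`, which is the effect of moving from the term level to the conceptual level.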
Abstract: Nowadays, the huge number of multimedia repositories
makes the browsing, retrieval and delivery of video content very slow
and even difficult tasks. Video summarization has been proposed to
enable faster browsing of large video collections and more efficient
content indexing and access. In this paper, we focus on approaches to
video summarization. Video summaries can be generated in many
different forms. However, the two fundamental ways to generate
summaries are static and dynamic. We present different techniques
for each mode in the literature and describe some features used for
generating video summaries. We conclude with perspectives for
further research.
Abstract: In this paper, we propose a method for three-dimensional
(3-D) model indexing based on defining a new
descriptor that uses spherical harmonics.
The purpose of the method is to minimize the processing time on the
database of object models and the time needed to search for objects
similar to a request object.
We start by defining the new descriptor using a new
division of the 3-D object in a sphere. Then we define a new distance,
which will be used in the search for similar objects in the database.
Abstract: The growth over centuries in the volume of text data such as
books and articles in libraries has made it necessary to establish
effective mechanisms to locate them. Early techniques such as
abstraction, indexing and the use of classification categories have
marked the birth of a new field of research called "Information
Retrieval". Information Retrieval (IR) can be defined as the task of
defining models and systems whose purpose is to facilitate access to
a set of documents in electronic form (corpus) to allow a user to find
the ones relevant to them, that is to say, the contents that match
the information needs of the user.
Most of the models of information retrieval use a specific data
structure to index a corpus which is called "inverted file" or "reverse
index".
This inverted file collects information on all terms over the corpus
documents specifying the identifiers of documents that contain the
term in question, the frequency of each term in the documents of the
corpus, the positions of the occurrences of the word...
In this paper we use an object-oriented database (db4o) instead of
the inverted file; that is to say, instead of searching for a term in
the inverted file, we search for it in the db4o database.
The purpose of this work is to make a comparative study to see whether
object-oriented databases can compete with the inverted file
in terms of access speed and resource consumption on a large
volume of data.
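As a point of reference for the comparison, the inverted file described above can be sketched as a mapping from each term to the documents that contain it, together with the positions of its occurrences (from which the per-document frequency follows). This is a minimal in-memory sketch; the function names and the sample documents are hypothetical, and it does not use db4o or reproduce the authors' implementation.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to {doc_id: [positions]} over a corpus of {doc_id: text}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def lookup(index, term):
    """Return {doc_id: (frequency, positions)} for a term, or {} if the term is absent."""
    postings = index.get(term.lower(), {})
    return {doc_id: (len(positions), positions) for doc_id, positions in postings.items()}


# Hypothetical two-document corpus.
docs = {
    1: "indexing speeds up retrieval",
    2: "retrieval without indexing is slow retrieval",
}
index = build_inverted_file(docs)
print(lookup(index, "retrieval"))
print(lookup(index, "missing"))
```

An object database would store the same term, document identifier, frequency and position information as persistent objects and answer the lookup through a query rather than a dictionary access.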
Abstract: Key frame extraction methods select the most
representative frames of a video, which can be used in different areas
of video processing such as video retrieval, video summary, and video
indexing. In this paper we present a novel approach for extracting key
frames from video sequences. Each frame is characterized uniquely by
its contours, which are represented by the dominant blocks. These
dominant blocks are located on the contours and their nearby textures.
When the video frames change noticeably, their dominant
blocks change, and a key frame can then be extracted. The dominant
blocks of every frame are computed, and then feature vectors are
extracted from the dominant-block image of each frame and arranged
in a feature matrix. Singular Value Decomposition is used to calculate
the ranks of sliding windows over this matrix. Finally, the computed
ranks are traced, which allows us to extract the key frames of a video.
Experimental results show that the proposed approach is robust
against a large range of digital effects used during shot transition.
Abstract: The color histogram is considered the oldest method
used by CBIR systems for indexing images. However, global
histograms do not include spatial information; this is why
later techniques have attempted to counter this
limitation by involving segmentation as a preprocessing step.
Weak segmentation is employed by local histograms, while
other methods such as CCV (Color Coherence Vector) are based on strong
segmentation. Indexing based on local histograms consists of
splitting the image into N overlapping blocks or sub-regions and
then computing the histogram of each block. Consequently, the
dissimilarity between two images reduces to computing the
distances between the N local histograms of both images, which
yields N*N values; generally, the lowest value is used
to rank images, which means that the lowest value
designates which sub-region is used to index the images of the
collection being queried. In this paper, we shed light on the local
histogram indexing method in order to compare its results against
those given by the global histogram. We also address another
noteworthy issue when relying on local histograms, namely which
value among the N*N values to trust when comparing images; in
other words, which of the N*N sub-region pairs should serve as the
basis for indexing images. Based on the results achieved here, it seems
that relying on local histograms, which imposes an extra
overhead on the system by involving another preprocessing step,
namely segmentation, does not necessarily produce better
results. In addition, we propose some ideas for
selecting the local histogram on which to rely to encode the image,
rather than relying on the local histogram with the lowest distance to
the query histograms.
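The baseline comparison described above (N local histograms per image, N*N pairwise distances, lowest value kept) can be illustrated with a minimal sketch. Several simplifying assumptions are made that are not in the abstract: images are small grids of already quantized color indices, blocks are non-overlapping tiles rather than overlapping ones, and the L1 distance is used. All names and the sample images are hypothetical.

```python
from collections import Counter
from itertools import product


def local_histograms(image, block, bins):
    """Split a 2-D grid of quantized color indices into block x block tiles and
    return one normalized histogram (a list of length `bins`) per tile."""
    rows, cols = len(image), len(image[0])
    histograms = []
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            pixels = [image[r][c]
                      for r in range(top, min(top + block, rows))
                      for c in range(left, min(left + block, cols))]
            counts = Counter(pixels)
            histograms.append([counts[b] / len(pixels) for b in range(bins)])
    return histograms


def l1(h1, h2):
    """L1 distance between two histograms (an assumed choice of metric)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def all_pair_distances(hists_a, hists_b):
    """Return the N*N distances keyed by (block index in A, block index in B)."""
    return {(i, j): l1(hists_a[i], hists_b[j])
            for i, j in product(range(len(hists_a)), range(len(hists_b)))}


def min_distance(hists_a, hists_b):
    """Return the block pair with the lowest distance and that distance."""
    distances = all_pair_distances(hists_a, hists_b)
    pair = min(distances, key=distances.get)
    return pair, distances[pair]


# Hypothetical 4x4 images with two color bins, split into N = 4 tiles of 2x2.
query = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 1, 1, 1],
         [1, 0, 1, 1]]
candidate = [[1, 1, 0, 1],
             [1, 1, 1, 0],
             [0, 0, 0, 0],
             [0, 1, 0, 0]]

query_hists = local_histograms(query, block=2, bins=2)
candidate_hists = local_histograms(candidate, block=2, bins=2)
print(len(all_pair_distances(query_hists, candidate_hists)), "pairwise distances")
print(min_distance(query_hists, candidate_hists))
```

In this example several block pairs tie at the lowest distance, which illustrates the question raised in the abstract of which of the N*N values should be trusted when comparing images.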