Abstract: Email has become a fast and cheap means of online
communication. The main threat to email is Unsolicited Bulk Email
(UBE), commonly called spam email. The current work aims to
identify unigrams in more than 2700 UBE messages that advertise
body-enhancement drugs. The identification is based on the
requirement that the unigram is neither present in a dictionary nor a
slang term. The motives of the paper are manifold. This is an
attempt to analyze spamming behaviour and the use of the word-mutation
technique. As a secondary aim, we have
attempted to better understand spam, slang and their interplay.
The problem has been addressed by employing a tokenization
technique and a unigram bag-of-words (BOW) model. We found that non-lexicon
words constitute nearly 66% of the total number of words in the corpus,
whereas non-slang words constitute nearly 2.4% of the non-lexicon
words. Further, non-lexicon non-slang unigrams composed of 2
lexicon words form more than 71% of the total number of such
unigrams. To the best of our knowledge, this is the first attempt to
analyze usage of non-lexicon non-slang unigrams in any kind of
UBE.
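
The pipeline described above (tokenization, a unigram bag-of-words model, and filtering against a lexicon and a slang list) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the letters-only tokenizer, and the use of plain Python sets for the lexicon and slang list are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic unigrams (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def non_lexicon_non_slang(messages, lexicon, slang):
    """Build a unigram bag of words and keep only unigrams found in
    neither the lexicon nor the slang list."""
    bag = Counter()
    for message in messages:
        bag.update(tokenize(message))
    return Counter(
        {word: count for word, count in bag.items()
         if word not in lexicon and word not in slang}
    )


def splits_into_two_words(unigram, lexicon):
    """Return True if the unigram is exactly two lexicon words run together."""
    return any(
        unigram[:i] in lexicon and unigram[i:] in lexicon
        for i in range(1, len(unigram))
    )
```

With a toy lexicon containing "cheap" and "meds", the unigram "cheapmeds" would be kept by `non_lexicon_non_slang` and flagged by `splits_into_two_words`, which corresponds to the two-lexicon-word category reported in the abstract.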
Abstract: At present, the dictionary attack is the basic tool for
recovering key passwords. To avoid dictionary attacks, users
purposely choose other character strings as passwords. According to
statistics, about 14% of users choose keys on a keyboard (Kkey, for
short) as passwords. This paper develops a framework system to attack
passwords chosen from Kkeys and analyzes its efficiency. Within
this system, we build up keyboard rules using the adjacent and parallel
relationship among Kkeys and then use these Kkey rules to generate
password databases by the depth-first search method. According to the
experimental results, the key space of the databases derived from
these Kkey rules can be far smaller than that of the password databases
generated by a brute-force attack, thus effectively narrowing down
the scope of attack research. Taking one general Kkey rule, the
combinations in all printable characters (94 types) with Kkey adjacent
and parallel relationship, as an example, the derived key space is about
240 smaller than those in brute-force attack. In addition, we
demonstrate the method's practicality and value by successfully
cracking the access password to UNIX and PC using the password
databases created.
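
The idea of generating candidate passwords from keyboard adjacency by depth-first search can be illustrated with a small sketch. Everything here is an assumption made for illustration: the adjacency map covers only six keys of a QWERTY layout, and the single rule used (each key must be adjacent to the previous one) stands in for the paper's adjacent and parallel Kkey rules, which the abstract does not spell out.

```python
# Hypothetical adjacency map for a small block of a QWERTY keyboard.
ADJACENT = {
    "q": "wa",
    "w": "qeas",
    "e": "wsd",
    "a": "qws",
    "s": "awed",
    "d": "se",
}


def kkey_passwords(length, adjacent=ADJACENT):
    """Yield every string of the given length (>= 1) in which each key is
    adjacent to the key before it, using depth-first search."""

    def dfs(prefix):
        if len(prefix) == length:
            yield prefix
            return
        for next_key in adjacent[prefix[-1]]:
            yield from dfs(prefix + next_key)

    for start in adjacent:
        yield from dfs(start)
```

For this toy map, length-2 candidates number 18, compared with 6 × 6 = 36 strings under brute force over the same six keys; the gap widens as the length grows, which is the effect the abstract describes.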
Abstract: In this paper, a fast high-resolution range profile (HRRP) synthesis algorithm called orthogonal matching pursuit with sensing dictionary (OMP-SD) is proposed. It formulates traditional HRRP synthesis as a sparse approximation problem over a redundant dictionary. Because it exploits the prior knowledge that the synthetic range profile (SRP) of a target is sparse, the SRP can be synthesized even when some data are lost. In addition, introducing the sensing dictionary (SD) reduces the computational complexity from O(MNDK) flops for OMP to O(M(N + D)K) flops for OMP-SD. Simulation experiments illustrate its advantages in both the additive white Gaussian noise (AWGN) and noiseless cases.
Abstract: In 2011, Debiao et al. pointed out that the S-3PAKE protocol proposed by Lu and Cao for password-authenticated key exchange in the three-party setting is vulnerable to an off-line dictionary attack. They then proposed some countermeasures to eliminate this security vulnerability of S-3PAKE. Nevertheless, this paper points out that, contrary to their claim, their enhanced S-3PAKE protocol is still vulnerable to undetectable on-line dictionary attacks.
Abstract: The emergence of the Internet has brought about a
revolution in information storage and retrieval. As most of the
data on the web is unstructured and contains a mix of text,
video, audio, etc., there is a need to mine information to cater to
the specific needs of the users without loss of important
hidden information. Thus developing user friendly and
automated tools for providing relevant information quickly
becomes a major challenge in web mining research. Most of
the existing web mining algorithms have concentrated on
finding frequent patterns while neglecting the less frequent
ones that are likely to contain outlying data such as noise,
irrelevant and redundant data. This paper mainly focuses on
the Signed approach and full-word matching against an organized
domain dictionary for mining web content outliers. The
Signed approach yields the relevant web documents as well as
the outlying web documents. As the dictionary is organized based
on the number of characters in a word, searching and retrieval
of documents takes less time and less space.
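
A dictionary organized by word length, combined with full-word matching, might look like the following sketch. It only illustrates the length-bucketing idea; the function names are hypothetical, whitespace tokenization is assumed, and the paper's Signed approach for scoring documents is not reproduced here.

```python
from collections import defaultdict


def build_length_index(words):
    """Group domain dictionary words into buckets keyed by word length."""
    index = defaultdict(set)
    for word in words:
        index[len(word)].add(word.lower())
    return index


def domain_hits(document, index):
    """Count document tokens that fully match a dictionary word.
    Only the bucket with the same length as the token is consulted."""
    hits = 0
    for token in document.lower().split():
        if token in index.get(len(token), ()):
            hits += 1
    return hits
```

Under this sketch, a document with few or no hits against the domain dictionary would be a candidate outlier; the threshold for that decision is left open because the abstract does not state one.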
Abstract: Field Association (FA) terms are a limited set of discriminating terms that provide the knowledge needed to identify document fields, and they are effective in document classification, similar file retrieval and passage retrieval. The problem lies in the lack of an effective method for automatically extracting relevant Arabic FA terms to build a comprehensive dictionary. Moreover, all previous studies are based on FA terms in English and Japanese, and extending FA terms to other languages such as Arabic could strengthen further research. This paper presents a new method to extract Arabic FA terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. An experimental evaluation is carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news; an average of 2,825 FA terms (single and compound) is selected per field. In the experimental results, recall and precision are 84% and 79% respectively. The method therefore selects a high number of relevant Arabic FA terms with high precision and recall.
Abstract: Clusters of microcalcifications in mammograms are an
important sign of breast cancer. This paper presents a complete
Computer Aided Detection (CAD) scheme for automatic detection of
clustered microcalcifications in digital mammograms. The proposed
system, MammoScan μCaD, consists of three main steps. First,
all potential microcalcifications are detected using a method for
feature extraction, VarMet, and adaptive thresholding. This also
yields a number of false detections. The goal of the second step,
Classifier level 1, is to remove everything but microcalcifications.
The last step, Classifier level 2, uses learned dictionaries and sparse
representations as a texture classification technique to distinguish
single, benign microcalcifications from clustered microcalcifications,
in addition to removing some remaining false detections. The system
is trained and tested on true digital data from Stavanger University
Hospital, and the results are evaluated by radiologists. The overall
results are promising, with a sensitivity > 90% and a low false
detection rate (approx. 1 unwanted per image, or 0.3 false per image).
Abstract: An algorithm for learning an overcomplete dictionary
using a Cauchy mixture model for sparse decomposition of an underdetermined
mixing system is introduced. The mixture density
function is derived from a ratio sample of the observed mixture
signals where 1) there are at least two but not necessarily more
mixture signals observed, 2) the source signals are statistically
independent and 3) the sources are sparse. The basis vectors of the
dictionary are learned via the optimization of the location parameters
of the Cauchy mixture components, which is shown to be more
accurate and robust than the conventional data mining methods
usually employed for this task. Using a well-known sparse
decomposition algorithm, we extract three speech signals from two
mixtures based on the estimated dictionary. Further tests with
additive Gaussian noise are used to demonstrate the proposed
algorithm's robustness to outliers.
Abstract: All text processing systems allow their users to
search for a string pattern in a given text. String matching is
fundamental to database and text processing applications. Every text
editor must contain a mechanism to search the current document for
arbitrary strings. Spelling checkers scan an input text for words in the
dictionary and reject any strings that do not match. We store our
information in databases so that we can retrieve it later,
and this retrieval can be done using various string matching
algorithms. This paper describes a new string matching algorithm
for various applications. The new algorithm has been designed with the
help of the Rabin-Karp matcher to improve the string matching process.
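
For reference, the classic Rabin-Karp matcher that the abstract builds on can be sketched as below. This is the textbook rolling-hash algorithm, not the paper's new variant, whose details the abstract does not give; the `base` and `modulus` defaults are arbitrary illustrative choices.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_003):
    """Return the start indices of all occurrences of pattern in text,
    using a rolling hash to skip most character-by-character comparisons."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, modulus)

    pattern_hash = window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % modulus
        window_hash = (window_hash * base + ord(text[i])) % modulus

    matches = []
    for start in range(n - m + 1):
        # Equal hashes may be a collision, so confirm with a direct comparison.
        if window_hash == pattern_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start < n - m:
            # Slide the window: drop text[start], append text[start + m].
            window_hash = (
                (window_hash - ord(text[start]) * high) * base
                + ord(text[start + m])
            ) % modulus
    return matches
```

The direct comparison after a hash match is what keeps the result exact; the hash only serves to rule out most non-matching windows cheaply.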
Abstract: Along with advances in medicine, providing medical information to individual patients is becoming more important. In Japan, such information is hardly ever provided in Braille to blind and partially sighted people. We are therefore researching and developing a Web-based automatic translation program, "eBraille", to translate Japanese text into Japanese Braille. First, we analyzed the Japanese transcription rules to implement them in our program. We then added medical words to the program's dictionary to improve its translation accuracy for medical text. Finally, we examined the efficacy of statistical learning models (SLMs) for further increasing word segmentation accuracy in braille translation. As a result, eBraille had the highest translation accuracy in a comparison with other translation programs, improved its accuracy for medical text, and is being used to make hospital brochures in braille for outpatients and inpatients.
Abstract: Machine Translation (hereafter referred to as
"MT") has faced many complex problems since its
inception. Extracting multiword expressions is one of these
complex problems. Finding multiword expressions while
translating a sentence from English into Urdu with existing
solutions takes a lot of time and occupies system resources. We have
designed a simple relational data approach, in which we simply set a
bit in the dictionary (database) for a multiword, to find and handle
multiword expressions. This approach handles multiwords efficiently.
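
One way to read the bit-in-dictionary idea is sketched below with an in-memory SQLite table. The schema, the column names, and the interpretation of the bit (it marks a word that can begin a multiword expression, so longer phrase lookups are attempted only for flagged words) are assumptions for illustration; the abstract does not describe the actual table design, and the `target` column holds placeholder strings rather than real Urdu translations.

```python
import sqlite3


def build_dictionary(entries):
    """Create an in-memory dictionary table. Each entry is
    (english, target, multiword_head), where the last field is the bit."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dictionary ("
        "english TEXT PRIMARY KEY, target TEXT, multiword_head INTEGER)"
    )
    conn.executemany("INSERT INTO dictionary VALUES (?, ?, ?)", entries)
    return conn


def segment(conn, sentence, max_len=4):
    """Split a sentence into translation units, trying multiword lookups
    only when the current word has its multiword bit set."""
    words = sentence.lower().split()
    units, i = [], 0
    while i < len(words):
        row = conn.execute(
            "SELECT multiword_head FROM dictionary WHERE english = ?",
            (words[i],),
        ).fetchone()
        matched = 1
        if row and row[0]:
            # Try the longest candidate phrase first, down to two words.
            for size in range(min(max_len, len(words) - i), 1, -1):
                phrase = " ".join(words[i:i + size])
                found = conn.execute(
                    "SELECT 1 FROM dictionary WHERE english = ?", (phrase,)
                ).fetchone()
                if found:
                    matched = size
                    break
        units.append(" ".join(words[i:i + matched]))
        i += matched
    return units
```

Because unflagged words cost a single lookup, the extra phrase queries are paid only where a multiword expression can actually start, which is one plausible source of the efficiency the abstract claims.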
Abstract: Texture classification is an important image processing
task with a broad application range. Many different techniques for
texture classification have been explored. Using sparse approximation
as a feature extraction method for texture classification is a relatively
new approach, and Skretting et al. recently presented the Frame
Texture Classification Method (FTCM), showing very good results on
classical texture images. As an extension of that work, the FTCM is
here tested on a real-world application: detection of abnormalities
in mammograms. Some extensions to the original FTCM that are
useful in some applications are implemented: two different smoothing
techniques and a vector augmentation technique. Both detection of
microcalcifications (as a primary detection technique and as a last
stage of a detection scheme), and soft tissue lesions in mammograms
are explored. All the results are interesting, and especially the results
using FTCM on regions of interest as the last stage in a detection
scheme for microcalcifications are promising.
Abstract: In this paper, an improvement of the PDLZW implementation
with a new dictionary updating technique is proposed. A
unique dictionary is partitioned into hierarchical variable word-width
dictionaries. This allows us to search through dictionaries in parallel.
Moreover, the barrel shifter is adopted for loading a new input string
into the shift register in order to achieve a faster speed. However,
the original PDLZW uses a simple FIFO update strategy, which is
not efficient. Therefore, a new window-based updating technique
is implemented to better distinguish how often each
particular address in the window is referred to. The freezing policy
is applied to the address most often referred to, which is not
updated until all the other addresses in the window have the same
priority. This guarantees that the more often referred addresses are
not updated until their time comes. This updating policy
improves the compression efficiency of the proposed
algorithm while still keeping the architecture low in complexity and easy
to implement.
Abstract: This paper presents an algebraic approach to optimizing
queries in a domain-specific database management system
for protein structure data. The approach involves the introduction of
several protein-structure-specific algebraic operators to query the
complex data stored in an object-oriented database system. The
Protein Algebra provides an extensible set of high-level Genomic
Data Types and Protein Data Types along with a comprehensive
collection of appropriate genomic and protein functions. The paper
also presents a query translator that converts high-level query
specifications in algebra into low-level query specifications in
Protein-QL, a query language designed to query protein structure
data. The query transformation process uses a Protein Ontology that
serves the purpose of a dictionary.
Abstract: This work presents a new phonetic transcription system based on a tree of hierarchical pronunciation rules expressed as context-specific grapheme-phoneme correspondences. The tree is automatically inferred from a phonetic dictionary by incrementally analyzing deeper context levels, eventually representing a minimum set of exhaustive rules that pronounce all the words in the training dictionary without errors and that can be applied to out-of-vocabulary words. The proposed approach improves upon existing rule-tree-based techniques in that it makes use of graphemes, rather than letters, as elementary orthographic units. A new linear algorithm for the segmentation of a word into graphemes is introduced to enable out-of-vocabulary grapheme-based phonetic transcription. Exhaustive rule trees provide a canonical representation of the pronunciation rules of a language that can be used not only to pronounce out-of-vocabulary words, but also to analyze and compare the pronunciation rules inferred from different dictionaries. The proposed approach has been implemented in C and tested on Oxford British English and Basic English. Experimental results show that grapheme-based rule trees represent phonetically sound rules and provide better performance than letter-based rule trees.
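
Segmenting a word into graphemes in time linear in its length can be illustrated with a greedy longest-match scan. This is only one plausible strategy, assumed here for illustration; the abstract does not say how the paper's algorithm works, and the grapheme inventory below is a tiny hypothetical sample rather than one inferred from a phonetic dictionary.

```python
# Hypothetical multi-letter grapheme inventory (illustrative only).
GRAPHEMES = {"sh", "ch", "th", "ph", "ee", "oo", "igh"}
MAX_LEN = max(len(g) for g in GRAPHEMES)


def segment_graphemes(word, graphemes=GRAPHEMES, max_len=MAX_LEN):
    """Split a word into graphemes by taking, at each position, the longest
    known multi-letter grapheme, or a single letter if none matches."""
    units, i = [], 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 1, -1):
            if word[i:i + size] in graphemes:
                break
        else:
            # No multi-letter grapheme starts here: emit one letter.
            size = 1
        units.append(word[i:i + size])
        i += size
    return units
```

Since at most `max_len` candidates are checked at each position and `max_len` is a constant of the inventory, the number of lookups grows linearly with the word length.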
Abstract: Sparse representation has long been studied and several
dictionary learning methods have been proposed. The dictionary
learning methods are widely used because they are adaptive. In this
paper, a new dictionary learning method for audio is proposed. Signals
are first decomposed into different degrees of Intrinsic Mode
Functions (IMFs) using the Empirical Mode Decomposition (EMD)
technique. These IMFs then form a learned dictionary. To reduce the
size of the dictionary, the K-means method is applied to the dictionary
to generate a K-EMD dictionary. Compared to the K-SVD algorithm, the
K-EMD dictionary decomposes audio signals into structured
components; thus the sparsity of the representation is increased by
34.4% and the SNR of the recovered audio signals is increased by
20.9%.
Abstract: Machine Translation (MT) of English text into its Urdu equivalent is a difficult challenge. Many attempts have been made, but only a few limited solutions have been provided so far. We present a direct approach that uses an expert system to translate English text into its equivalent Urdu, using The Unicode Standard, Version 4.0 (ISBN 0-321-18578-1), Range: 0600–06FF. The expert system works with a knowledge base that contains grammatical patterns of English and Urdu, as well as a tense- and gender-aware dictionary of Urdu words (with their English equivalents).
Abstract: Authentication is currently one of the important topics in
information security. There are several alternatives
to text-based authentication, including the Graphical Password
(GP) or Graphical User Authentication (GUA). These methods stem
from the fact that humans recognize and remember images better
than alphanumerical text characters. This paper focuses on the
security aspect of GP algorithms and on what most researchers have
been doing to define these security features and
attributes. The goal of this study is to develop a fuzzy decision model
that allows automatic selection among available GP algorithms by taking
into consideration the subjective judgments of the decision makers,
who are more than 50 postgraduate students of computer science. The
approach that is being proposed is based on the Fuzzy Analytic
Hierarchy Process (FAHP) which determines the criteria weight as a
linear formula.
Abstract: The electronic slide (e-slide) is currently one of the most common formats for educational presentations. Unfortunately, the use of e-slides by the visually impaired is uncommon, since they are unable to see the content of such e-slides, which are usually composed of text, images and animation. This paper proposes a model for presenting e-slides in a multimodal presentation, i.e. using conventional slides concurrently with voicing, in both Malay and English. At the design level, the live multimedia presentation concept is used, while at the implementation level several components are used. The text content of each slide is extracted using a COM component; the Microsoft Speech API voices text in English, and text in Malay is voiced using a dictionary approach. To support accessibility, an auditory user interface is provided as an additional feature. A prototype of the model, named VSlide, has been developed and introduced.