Abstract: Email has become a fast and cheap means of online
communication. The main threat to email is Unsolicited Bulk Email
(UBE), commonly called spam email. The current work aims at
identifying unigrams in more than 2700 UBE that advertise
body-enhancement drugs. The identification is based on the
requirement that the unigram is neither present in a dictionary nor a
slang term. The motives of the paper are manifold. This is an
attempt to analyze spamming behaviour and the use of the word-mutation
technique. Alongside this, we have
attempted to better understand spam, slang and their interplay.
The problem has been addressed by employing the Tokenization
technique and the Unigram BOW model. We found that non-lexicon
words constitute nearly 66% of the total number of lexis in the corpus,
whereas non-slang words constitute nearly 2.4% of the non-lexicon
words. Further, non-lexicon non-slang unigrams composed of 2
lexicon words form more than 71% of the total number of such
unigrams. To the best of our knowledge, this is the first attempt to
analyze usage of non-lexicon non-slang unigrams in any kind of
UBE.
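
The filtering step described above can be illustrated with a minimal sketch. It assumes a simple alphabetic tokenizer and two toy word lists; the names `tokenize`, `non_lexicon_non_slang`, `split_into_two_words`, `lexicon` and `slang` are hypothetical and chosen for illustration only, and the paper's actual dictionary, slang list and tokenizer are not specified in the abstract.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic unigrams."""
    return re.findall(r"[a-z]+", text.lower())


def non_lexicon_non_slang(tokens, lexicon, slang):
    """Return the distinct tokens found in neither word list."""
    return {t for t in tokens if t not in lexicon and t not in slang}


def split_into_two_words(token, lexicon):
    """Return (left, right) if token is two lexicon words joined together, else None."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None


# Toy word lists and message (assumptions, not data from the paper).
lexicon = {"buy", "cheap", "pills", "now", "best", "price", "the", "at"}
slang = {"lol", "gr8"}
message = "Buy cheappills now lol at the bestprice xzqv"

tokens = tokenize(message)
candidates = non_lexicon_non_slang(tokens, lexicon, slang)
compounds = {t: split_into_two_words(t, lexicon) for t in candidates}
```

Here `candidates` holds the unigrams that are neither dictionary words nor slang, and `compounds` records which of them are concatenations of two lexicon words, the kind of unigram the abstract reports as forming the majority.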
Abstract: E-mail has become an important means of electronic
communication but the viability of its usage is marred by Unsolicited
Bulk e-mail (UBE) messages. UBE comes in many types,
such as pornographic, virus-infected and 'cry-for-help' messages as well
as fake and fraudulent offers for jobs, winnings and medicines. UBE
poses technical and socio-economic challenges to the usage of e-mail.
To meet this challenge and combat this menace, we need to
understand UBE. Towards this end, the current paper presents a
content-based textual analysis of more than 2700 UBE advertising
body-enhancement medicines. Technically, this is an application of Text Parsing
and Tokenization to unstructured textual documents, and we
approach it using the Bag Of Words (BOW) and Vector Space Document
Model techniques. We have attempted to identify the most
frequently occurring lexis in the UBE documents that advertise
various products for body enhancement. An analysis of the top
100 such lexis is also presented. We exhibit the relationship between
the occurrence of a word from the identified lexis-set in a given UBE
and the probability that the UBE is one advertising a
fake medicinal product. To the best of our knowledge and survey of
related literature, this is the first formal attempt to identify the
most frequently occurring lexis in such UBE through textual analysis.
Finally, this is a sincere attempt to raise alertness against, and
mitigate the threat of, such alluring but fake UBE.
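
As a rough illustration of the Bag Of Words and Vector Space Document Model approach mentioned above, the following sketch builds per-document word counts, maps them onto a shared vocabulary, and ranks lexis by corpus-wide frequency. The helper names (`bag_of_words`, `to_vectors`, `top_lexis`) and the three toy documents are assumptions made for this example; the abstract does not describe the paper's actual preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Build one word-count bag (Counter) per document."""
    return [Counter(tokenize(doc)) for doc in documents]


def to_vectors(bags):
    """Map each bag onto a shared, sorted vocabulary (vector space model)."""
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, vectors


def top_lexis(bags, k):
    """Return the k most frequent words across all documents with their counts."""
    total = Counter()
    for bag in bags:
        total.update(bag)
    return total.most_common(k)


# Toy corpus (an assumption, not data from the paper).
documents = [
    "Enlarge now, gain size now",
    "Gain size fast",
    "Order pills now",
]

bags = bag_of_words(documents)
vocabulary, vectors = to_vectors(bags)
most_frequent = top_lexis(bags, 1)
```

Calling `top_lexis(bags, 100)` on a real corpus would give a top-100 list of the kind the abstract refers to; stop-word removal and stemming are left out of this sketch.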
Abstract: In recent times, the problem of Unsolicited Bulk
Email (UBE), commonly known as spam email, has grown at a
tremendous rate. We present an analysis of a survey of the
classifications of UBE in various research works. There are many
research instances of classification between spam and non-spam
emails, but very few research instances are available for classification
of spam emails per se. This paper does not assert that any
UBE classification is better than the others, nor does it propose
any new classification; rather, it bemoans the lack of harmony on the number
and definition of categories proposed by different researchers. The
paper also elaborates on factors such as the intent of the spammer, the content of
UBE and the ambiguity in different categories as proposed in related
research works on the classification of UBE.
Abstract: E-mail has become an important means of electronic
communication but the viability of its usage is marred by Unsolicited
Bulk e-mail (UBE) messages. UBE comes in many types,
such as pornographic, virus-infected and 'cry-for-help' messages as well
as fake and fraudulent offers for jobs, winnings and medicines. UBE
poses technical and socio-economic challenges to the usage of e-mail.
To meet this challenge and combat this menace, we need to
understand UBE. Towards this end, the current paper presents a
content-based textual analysis of nearly 3000 winnings-announcing
UBE. Technically, this is an application of Text Parsing and
Tokenization to unstructured textual documents, and we approach
it using the Bag Of Words (BOW) and Vector Space Document Model
techniques. We have attempted to identify the most frequently
occurring lexis in the winnings-announcing UBE documents. An
analysis of the top 100 such lexis is also presented. We exhibit the
relationship between the occurrence of a word from the identified lexis-set
in a given UBE and the probability that the UBE is
one announcing fake winnings. To the best of our knowledge and
survey of related literature, this is the first formal attempt to
identify the most frequently occurring lexis in winnings-announcing
UBE through textual analysis. Finally, this is a sincere
attempt to raise alertness against, and mitigate the threat of,
such alluring but fake UBE.
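
One simple way to picture the relationship between a word's occurrence and the probability that a UBE announces fake winnings is a relative-frequency estimate over labelled messages. The sketch below is an assumption for illustration, not the paper's stated method; the function name `p_target_given_word` and the four labelled toy messages are hypothetical.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def p_target_given_word(labelled_docs, word):
    """Estimate P(target class | word occurs) by relative frequency.

    labelled_docs is a list of (text, is_target) pairs. Returns None when
    no document contains the word, since the estimate is then undefined.
    """
    labels = [is_target for text, is_target in labelled_docs
              if word in tokenize(text)]
    if not labels:
        return None
    return sum(labels) / len(labels)


# Toy labelled messages (assumptions, not data from the paper).
labelled_docs = [
    ("You have won the lottery, claim your prize", True),
    ("Claim your prize today", True),
    ("Lottery ticket sales report", False),
    ("Meeting agenda for Monday", False),
]

p_lottery = p_target_given_word(labelled_docs, "lottery")
p_prize = p_target_given_word(labelled_docs, "prize")
p_unseen = p_target_given_word(labelled_docs, "jackpot")
```

With so few messages the estimates are crude; on a real corpus some form of smoothing would likely be needed for rare words.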