Identification of Most Frequently Occurring Lexis in Body-enhancement Medicinal Unsolicited Bulk e-mails
E-mail has become an important means of electronic
communication, but its usefulness is marred by Unsolicited
Bulk e-mail (UBE) messages. UBE takes many forms, including
pornographic, virus-infected and 'cry-for-help' messages, as well
as fake and fraudulent offers of jobs, winnings and medicines. UBE
poses technical and socio-economic challenges to the use of e-mail.
To meet these challenges, we need to understand UBE. To this end,
this paper presents a content-based textual analysis of more than
2700 body-enhancement medicinal UBE messages. Technically, this is
an application of text parsing and tokenization to unstructured
textual documents, which we approach using the Bag of Words (BOW)
and Vector Space Document Model techniques. We attempt to identify
the most frequently occurring lexis in UBE documents that advertise
various body-enhancement products, and we present an analysis of
the top 100 such lexis. We show the relationship between the
occurrence of a word from the identified lexis set in a given UBE
and the probability that the UBE advertises a fake medicinal
product. To the best of our knowledge, and based on our survey of
the related literature, this is the first formal attempt to identify
the most frequently occurring lexis in such UBE through textual
analysis. Finally, this work is an attempt to raise awareness of,
and mitigate the threat posed by, such alluring but fake UBE.
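
The pipeline described above (tokenization, a bag-of-words count, a vector space representation, and a per-word occurrence rate) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer rule, the function names (tokenize, top_lexis, to_vectors, document_frequency) and the sample messages are assumptions made purely for illustration, and the abstract does not specify how the paper itself estimates its probabilities.

```python
import re
from collections import Counter

# Assumed tokenizer rule: lowercase text, keep runs of ASCII letters only.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Split a message body into lowercase alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def top_lexis(documents, n=100):
    """Bag of words: the n most frequent tokens over the whole corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(n)


def to_vectors(documents, vocabulary):
    """Vector space model: one term-count vector per document."""
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocabulary])
    return vectors


def document_frequency(documents, word):
    """Fraction of documents in which the word occurs at least once."""
    if not documents:
        return 0.0
    hits = sum(1 for doc in documents if word in set(tokenize(doc)))
    return hits / len(documents)


if __name__ == "__main__":
    # Hypothetical sample messages, not taken from the paper's corpus.
    sample = [
        "Buy cheap pills now",
        "Cheap pills, best pills",
        "Meeting at noon",
    ]
    vocabulary = [word for word, _ in top_lexis(sample, n=2)]
    print(vocabulary)
    print(to_vectors(sample, vocabulary))
    print(document_frequency(sample, "pills"))
```

Here document_frequency is only the empirical share of messages containing a word. Relating it to the probability that a message advertises a fake medicinal product would additionally require a labelled corpus containing messages of other types.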
@article{"International Journal of Information, Control and Computer Sciences:58043",
  author = "Jatinderkumar R. Saini and Apurva A. Desai",
  title = "Identification of Most Frequently Occurring Lexis in Body-enhancement Medicinal Unsolicited Bulk e-mails",
  keywords = "Body Enhancement, Lexis, Medicinal, Unsolicited Bulk e-mail (UBE), Vector Space Document Model, Viagra",
  volume = "6",
  number = "4",
  pages = "495-5",
}