Identification of Most Frequently Occurring Lexis in Body-enhancement Medicinal Unsolicited Bulk e-mails

E-mail has become an important means of electronic communication, but its usefulness is marred by Unsolicited Bulk e-mail (UBE) messages. UBE takes many forms, including pornographic, virus-infected and 'cry-for-help' messages, as well as fake and fraudulent offers of jobs, winnings and medicines. UBE poses technical and socio-economic challenges to the use of e-mail, and combating it requires understanding it. To this end, this paper presents a content-based textual analysis of more than 2700 body-enhancement medicinal UBE messages. Technically, this is an application of text parsing and tokenization to unstructured textual documents, which we approach using the Bag of Words (BOW) and Vector Space Document Model techniques. We identify the most frequently occurring lexis in UBE documents that advertise body-enhancement products, and present an analysis of the top 100 such lexis. We also show the relationship between the occurrence of a word from the identified lexis set in a given UBE and the probability that the UBE advertises a fake medicinal product. To the best of our knowledge and our survey of the related literature, this is the first formal attempt to identify the most frequently occurring lexis in such UBE through textual analysis. Finally, this work is an attempt to raise alertness to, and mitigate the threat of, such luring but fake UBE.
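The pipeline described in the abstract (tokenization, a bag-of-words count, a term vector per document, and a word-conditional probability) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer rule, the function names, and the three-message toy corpus are assumptions made purely for illustration, and steps such as stop-word removal or stemming are omitted because the abstract does not specify them.

```python
import re
from collections import Counter

# Assumption: a token is a run of ASCII letters; the paper's exact rule is not given.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def top_lexis(documents, k=100):
    """Return the k most frequent tokens over all documents (bag of words)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(k)


def term_vector(doc, vocabulary):
    """Represent one document as term frequencies over a fixed vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


def p_medicinal_given_word(word, labelled_docs):
    """Estimate P(medicinal UBE | word occurs) from (text, is_medicinal) pairs.

    Returns None when the word occurs in no document.
    """
    labels = [label for text, label in labelled_docs if word in set(tokenize(text))]
    if not labels:
        return None
    return sum(labels) / len(labels)


# Hypothetical toy corpus, invented for this example only.
labelled = [
    ("Enlarge now with our pills", True),
    ("Cheap pills, best pills online", True),
    ("Meeting moved to Monday", False),
]
medicinal_texts = [text for text, label in labelled if label]

most_frequent = top_lexis(medicinal_texts, k=1)
vector = term_vector(labelled[1][0], ["pills", "cheap", "monday"])
p_pills = p_medicinal_given_word("pills", labelled)
```

On a real corpus, `top_lexis` would be called with `k=100` over the collected UBE messages, and the probability estimate would need a labelled set that also contains messages of other kinds; otherwise every word trivially receives probability 1.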



