Identification of Non-Lexicon Non-Slang Unigrams in Body-enhancement Medicinal UBE

Email has become a fast and cheap means of online communication. The main threat to email is Unsolicited Bulk Email (UBE), commonly called spam. The current work aims to identify unigrams in more than 2700 UBE messages that advertise body-enhancement drugs. A unigram qualifies only if it is neither present in the dictionary nor a slang term. The motives of the paper are manifold: it is an attempt to analyze spamming behaviour and the use of the word-mutation technique. Alongside this, we have attempted to better understand spam, slang and their interplay. The problem has been addressed by employing a tokenization technique and the unigram Bag-of-Words (BOW) model. We found that non-lexicon words constitute nearly 66% of the total number of words in the corpus, whereas non-slang words constitute nearly 2.4% of the non-lexicon words. Further, non-lexicon non-slang unigrams composed of two lexicon words account for more than 71% of all such unigrams. To the best of our knowledge, this is the first attempt to analyze the usage of non-lexicon non-slang unigrams in any kind of UBE.
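
The pipeline described above (tokenization, a unigram BOW count, and filtering against a dictionary and a slang list) can be illustrated with a minimal sketch. This is not the authors' implementation: the word lists `LEXICON` and `SLANG`, the sample messages, and all function names are hypothetical and chosen only for illustration. A real study would load a full dictionary and a slang lexicon.

```python
import re
from collections import Counter

# Hypothetical toy word lists (assumptions, not the paper's resources).
LEXICON = {"buy", "cheap", "pills", "now", "get", "at",
           "best", "price", "online", "store"}
SLANG = {"meds"}

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unigram_bow(messages):
    """Build a unigram bag-of-words: token -> frequency over all messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def non_lexicon_non_slang(bow):
    """Keep unigrams that are neither dictionary words nor slang terms."""
    return {w: c for w, c in bow.items()
            if w not in LEXICON and w not in SLANG}

def split_into_two_lexicon_words(word):
    """Return (head, tail) if the word is two lexicon words joined, else None."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in LEXICON and tail in LEXICON:
            return head, tail
    return None

# Hypothetical sample messages.
messages = [
    "Buy cheappills now",
    "Get meds at bestprice onlinestore",
    "p1lls cheappills",
]

bow = unigram_bow(messages)
candidates = non_lexicon_non_slang(bow)
splits = {w: split_into_two_lexicon_words(w) for w in candidates}
```

In this toy run, `candidates` holds the unigrams absent from both lists, and `splits` maps each one either to a pair of lexicon words (for example, a joined form such as "cheappills") or to `None` when no two-word split exists (for example, a character-substituted form such as "p1lls").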



