Identification of Most Frequently Occurring Lexis in Winnings-announcing Unsolicited Bulk e-mails
E-mail has become an important means of electronic
communication, but its usefulness is marred by Unsolicited
Bulk e-mail (UBE) messages. UBE takes many forms, including
pornographic, virus-infected and 'cry-for-help' messages as well
as fake and fraudulent offers of jobs, winnings and medicines. UBE
poses technical and socio-economic challenges to the use of e-mail.
To meet this challenge and combat this menace, we need to
understand UBE. Towards this end, this paper presents a
content-based textual analysis of nearly 3000 winnings-announcing
UBE messages. Technically, this is an application of text parsing and
tokenization to unstructured textual documents, and we approach
it using Bag of Words (BOW) and Vector Space Document Model
techniques. We attempt to identify the most frequently
occurring lexis in winnings-announcing UBE documents and
present an analysis of the top 100 lexis. We show the
relationship between the occurrence of a word from the identified
lexis set in a given UBE and the probability that the UBE is
one announcing fake winnings. To the best of our knowledge and
our survey of the related literature, this is the first formal attempt
to identify the most frequently occurring lexis in winnings-announcing
UBE through textual analysis. Finally, this is a sincere
attempt to raise alertness against, and mitigate the threat of,
such luring but fake UBE.
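
The pipeline outlined in the abstract (tokenization, a bag-of-words count, a term-frequency vector per document, and a word-to-class probability) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tokenization rule, the toy messages and the simple document-ratio estimate of the probability are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a run of ASCII letters, optionally with one apostrophe part.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def top_lexis(documents, n=100):
    """Return the n most frequent tokens across all documents (bag of words)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(n)]


def term_vector(text, vocabulary):
    """Represent one document as term counts over a fixed vocabulary
    (a simple vector space document model)."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def p_winnings_given_word(word, winnings_docs, other_docs):
    """Fraction of documents containing `word` that are winnings-announcing.
    Returns None when the word occurs in no document."""
    in_winnings = sum(word in set(tokenize(d)) for d in winnings_docs)
    in_other = sum(word in set(tokenize(d)) for d in other_docs)
    total = in_winnings + in_other
    return in_winnings / total if total else None


# Hypothetical toy corpus, not data from the paper.
winnings = [
    "You have won the lottery prize",
    "Claim your lottery winnings now",
    "Your email has won a cash prize",
]
others = [
    "Meeting moved to Monday",
    "Claim form for your travel expenses",
]

vocabulary = top_lexis(winnings, n=4)
vector = term_vector(winnings[0], vocabulary)
p_lottery = p_winnings_given_word("lottery", winnings, others)
p_claim = p_winnings_given_word("claim", winnings, others)
```

In this sketch the estimate depends on the mix of winnings and non-winnings messages in the corpus, so it is only meaningful when that mix reflects the mail stream of interest.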
@article{"International Journal of Information, Control and Computer Sciences:51548",
  author = "Jatinderkumar R. Saini and Apurva A. Desai",
  title = "Identification of Most Frequently Occurring Lexis in Winnings-announcing Unsolicited Bulk e-mails",
  abstract = "e-mail has become an important means of electronic
communication but the viability of its usage is marred by Unsolicited
Bulk e-mail (UBE) messages. UBE consists of many types
like pornographic, virus infected and 'cry-for-help' messages as well
as fake and fraudulent offers for jobs, winnings and medicines. UBE
poses technical and socio-economic challenges to usage of e-mails.
To meet this challenge and combat this menace, we need to
understand UBE. Towards this end, the current paper presents a
content-based textual analysis of nearly 3000 winnings-announcing
UBE. Technically, this is an application of Text Parsing and
Tokenization for an un-structured textual document and we approach
it using Bag Of Words (BOW) and Vector Space Document Model
techniques. We have attempted to identify the most frequently
occurring lexis in the winnings-announcing UBE documents. The
analysis of such top 100 lexis is also presented. We exhibit the
relationship between occurrence of a word from the identified lexis set
in the given UBE and the probability that the given UBE will be
the one announcing fake winnings. To the best of our knowledge and
survey of related literature, this is the first formal attempt for
identification of most frequently occurring lexis in winnings-announcing
UBE by its textual analysis. Finally, this is a sincere
attempt to bring about alertness against and mitigate the threat of
such luring but fake UBE.",
  keywords = "Lexis, Unsolicited Bulk e-mail (UBE), Vector Space Document Model, Winnings, Lottery",
  volume = "5",
  number = "3",
  pages = "242-5",
}