
SPAM Detection: Naïve Bayesian Classification and RPN Expression-Based LGP Approaches Compared

  • Conference paper
Software Engineering Perspectives and Application in Intelligent Systems (ICTIS 2017, CSOC 2016)

Abstract

This paper investigates a machine learning algorithm and the Bayesian classifier in the context of spam filtering. It shows the advantage of using Reverse Polish Notation (RPN) expressions with feature extraction over the traditional Naïve Bayesian classifier for spam detection, given the same features. The performance of the two approaches is evaluated on a public corpus and a recent private spam collection. The system based on RPN LGP (Linear Genetic Programming) gave better results than two widely used open-source Bayesian spam filters.
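To make the idea of an RPN expression acting as a classifier concrete, here is a minimal sketch in Python. It is not the authors' system: the operator set, the feature names (`link_count`, `caps_ratio`), the example rule, and the 0.5 decision threshold are all assumptions chosen for illustration. It only shows how a postfix token sequence over extracted features could be evaluated with a stack and turned into a spam/ham decision.

```python
from typing import Dict, List

# Hypothetical operator set; the instruction set used in the paper is not given here.
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    ">": lambda a, b: 1.0 if a > b else 0.0,
}


def eval_rpn(tokens: List[str], features: Dict[str, float]) -> float:
    """Evaluate a postfix (RPN) token list against one message's feature values."""
    stack: List[float] = []
    for tok in tokens:
        if tok in BINARY_OPS:
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} needs two operands")
            b = stack.pop()  # right operand is on top of the stack
            a = stack.pop()
            stack.append(BINARY_OPS[tok](a, b))
        elif tok in features:
            stack.append(features[tok])  # feature name: push its value
        else:
            stack.append(float(tok))  # anything else must be a numeric constant
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


def is_spam(tokens: List[str], features: Dict[str, float], threshold: float = 0.5) -> bool:
    """Classify as spam when the expression value exceeds an assumed threshold."""
    return eval_rpn(tokens, features) > threshold


# Hypothetical feature values for one message and a hand-written example rule:
# (link_count * caps_ratio) > 1.5
msg = {"link_count": 4.0, "caps_ratio": 0.5}
rule = ["link_count", "caps_ratio", "*", "1.5", ">"]
print(is_spam(rule, msg))
```

In a genetic-programming setting, token lists like `rule` would be the individuals that are mutated and recombined. The flat postfix form is convenient for this because any sequence can be checked for validity with a single stack pass.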



Acknowledgement

Thanks go to my Ph.D. supervisor, Dr Vitezlav Nezval, and to Tom Fawcett, who answered my email query about Bayesian classifiers and RPN. This work was supported by the Grant Agency of the Czech Republic (GACR P103/15/06700S), by the research project NPU I No. MSMT-7778/2014 of the Ministry of Education of the Czech Republic, and by the European Regional Development Fund under the Project CEBIA-Tech No. CZ.1.05/2.1.00/03.0089.

Author information

Corresponding author

Correspondence to Clyde Meli.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Meli, C., Oplatkova, Z.K. (2016). SPAM Detection: Naïve Bayesian Classification and RPN Expression-Based LGP Approaches Compared. In: Silhavy, R., Senkerik, R., Oplatkova, Z.K., Silhavy, P., Prokopova, Z. (eds) Software Engineering Perspectives and Application in Intelligent Systems. ICTIS 2017, CSOC 2016. Advances in Intelligent Systems and Computing, vol 465. Springer, Cham. https://doi.org/10.1007/978-3-319-33622-0_36

  • DOI: https://doi.org/10.1007/978-3-319-33622-0_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-33620-6

  • Online ISBN: 978-3-319-33622-0

  • eBook Packages: Engineering, Engineering (R0)
