Spam Detection Using Linear Genetic Programming

Meli, Clyde; Nezval, Vitezslav; Kominkova Oplatkova, Zuzana; Buttigieg, Victor

doi:10.1007/978-3-319-97888-8_7

Clyde Meli ORCID: orcid.org/0000-0003-3551-862X

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 837)

Included in the following conference series:

23rd International Conference on Soft Computing


Abstract

Spam is unsolicited bulk email. Many algorithms have been applied to the spam detection problem, and many programs have been developed. The problem is adversarial: filtering is an ongoing fight against spammers. We prove that reliable spam detection is an NP-complete problem by mapping email spam to metamorphic viruses and applying Spinellis’s [30] NP-completeness proof for metamorphic viruses. Using a number of features extracted from the SpamAssassin data set, we implemented a linear genetic programming (LGP) system called Gagenes LGP (GLGP). The system achieves 99.83% accuracy, higher than the result Awad et al. [3] obtained with the Naïve Bayes algorithm. GLGP’s recall and precision are also higher than those of Awad et al., and its accuracy is higher than the results reported by Lai and Tsai [19].
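The abstract compares systems by accuracy, precision, and recall. As a reminder of how these three figures relate, here is a minimal Python sketch that derives them from a classifier's predictions. It is not the GLGP evaluation code: the names `ConfusionCounts` and `count_outcomes`, the label encoding (1 for spam, 0 for legitimate mail), and the toy labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary spam filter (hypothetical helper type)."""
    tp: int  # spam correctly flagged as spam
    fp: int  # legitimate mail wrongly flagged as spam
    tn: int  # legitimate mail correctly passed
    fn: int  # spam that slipped through


def count_outcomes(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally outcomes, assuming 1 = spam and 0 = legitimate mail."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == 1 and truth == 1:
            tp += 1
        elif guess == 1 and truth == 0:
            fp += 1
        elif guess == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all messages classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Fraction of messages flagged as spam that really are spam."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Fraction of actual spam that the filter caught."""
    actual_spam = c.tp + c.fn
    return c.tp / actual_spam if actual_spam else 0.0


# Toy labels, invented for illustration only (not results from the paper).
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = count_outcomes(toy_true, toy_pred)
print(counts, accuracy(counts), precision(counts), recall(counts))
```

Reporting all three matters for spam filtering because a false positive (lost legitimate mail) and a false negative (delivered spam) have different costs, and accuracy alone does not distinguish them.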


Notes

1. Available at http://csmining.org/index.php/spam-assassin-datasets.html
2. As updated by http://csmining.org.
3. All these features are listed and explained further in my Ph.D. thesis [23].
4. URL features (Table 4) are, in practice, numbered as part of the message body features.

References

1. Almeida, T.A., Yamakami, A.: Advances in spam filtering techniques. In: Computational Intelligence for Privacy and Security, pp. 199–214. Springer, Heidelberg (2012)
2. Androutsopoulos, I., et al.: An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 160–167. ACM, New York (2000)
3. Awad, W.A., ELseuofi, S.M.: Machine learning methods for e-mail classification. Int. J. Comput. Appl. 16(1), 0975–8887 (2011)
4. Blickle, T., Thiele, L.: A comparison of selection schemes used in genetic algorithms. Gloriastrasse 35, CH-8092 Zurich: Swiss Federal Institute of Technology (ETH) Zurich, Computer Engineering and Communications Networks Lab (TIK) (1995)
5. Borodin, Y., et al.: Live and learn from mistakes: a lightweight system for document classification. Inf. Process. Manag. 49(1), 83–98 (2013)
6. Brameier, M.: On linear genetic programming. Fachbereich Informatik, Universität Dortmund (2004)
7. Cid, I., et al.: The impact of noise in spam filtering: a case study. In: Perner, P. (ed.) Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects, pp. 228–241. Springer, Heidelberg (2008)
8. Cormack, G.V., Lynam, T.: TREC 2005 spam track overview. In: The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings (2005)
9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco (1979)
10. Graham, P.: Better Bayesian Filtering. http://www.paulgraham.com/better.html
11. Graham, P.: A Plan for Spam. http://www.paulgraham.com/spam.html
12. Gržinić, T., et al.: CROFlux: passive DNS method for detecting fast-flux domains. In: 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1376–1380 (2014)
13. Harris, E.: The Next Step in the Spam Control War: Greylisting. http://projects.puremagic.com/greylisting/whitepaper.html
14. Holz, T., et al.: Measuring and detecting fast-flux service networks. In: 15th Network and Distributed System Security Symposium (NDSS) (2008)
15. Hunt, R., Carpinter, J.: Current and new developments in spam filtering. In: 2006 14th IEEE International Conference on Networks, pp. 1–6 (2006)
16. Gonçalves, I.: Controlling Overfitting in Genetic Programming. CISUG (2011)
17. Juknius, J., Čenys, A.: Intelligent botnet attacks in modern information warfare. In: 15th International Conference on Information and Software Technology, pp. 37–39 (2009)
18. Kolari, P., et al.: Detecting spam blogs: a machine learning approach. In: Proceedings of the National Conference on Artificial Intelligence, p. 1351. AAAI Press/MIT Press, Menlo Park/Cambridge 1999 (2006)
19. Lai, C.-C., Tsai, M.-C.: An empirical performance comparison of machine learning methods for spam e-mail categorization. In: Fourth International Conference on Hybrid Intelligent Systems, HIS 2004, pp. 44–48. IEEE (2004)
20. Lee, K., et al.: Uncovering social spammers: social honeypots + machine learning. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 435–442. ACM, New York (2010)
21. Sahami, M., et al.: A Bayesian approach to filtering junk e-mail. In: Proceedings of AAAI-98 Workshop on Learning for Text Categorization (1998)
22. Meli, C., Oplatkova, Z.K.: SPAM detection: Naïve Bayesian classification and RPN expression-based LGP approaches compared. In: Software Engineering Perspectives and Application in Intelligent Systems, pp. 399–411. Springer, Heidelberg (2016)
23. Meli, C.: Application and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware. Submitted Ph.D. thesis, University of Malta, Malta (2017)
24. Miranda-García, A., Calle-Martín, J.: Yule’s characteristic K revisited. Lang. Resour. Eval. 39(4), 287–294 (2005)
25. Ntoulas, A., et al.: Detecting spam web pages through content analysis. In: Proceedings of the 15th International Conference on World Wide Web, pp. 83–92. ACM, New York (2006)
26. Oltean, M., Grosan, C.: Evolving evolutionary algorithms using multi expression programming. In: ECAL, pp. 651–658 (2003)
27. Oltean, M., Dumitrescu, D.: Multi expression programming. Babes-Bolyai University (2002)
28. Rao, J.M., Reiley, D.H.: The economics of spam. J. Econ. Perspect. 26(3), 87–110 (2012)
29. Ruan, G., Tan, Y.: A three-layer back-propagation neural network for spam detection using artificial immune concentration. Soft. Comput. 14(2), 139–150 (2009)
30. Spinellis, D.: Reliable identification of bounded-length viruses is NP-complete. IEEE Trans. Inf. Theory 49(1), 280–284 (2003)
31. Stuart, I., et al.: A neural network classifier for junk e-mail. In: Document Analysis Systems VI, pp. 442–450. Springer, Heidelberg (2004)
32. Wang, Z.-Q., et al.: An efficient SVM-based spam filtering algorithm. In: 2006 International Conference on Machine Learning and Cybernetics, pp. 3682–3686. IEEE (2006)
33. Yule, G.U.: On sentence-length as a statistical characteristic of style in prose: with application to two cases of disputed authorship. Biometrika 30(3–4), 363–390 (1939)
34. Zhang, L., et al.: An evaluation of statistical spam filtering techniques. ACM Trans. Asian Lang. Inf. Process. 3(4), 243–269 (2004)
35. Zhang, M., Fogelberg, C.G.: Genetic programming for image recognition: an LGP approach. In: EvoWorkshops 2007, pp. 340–350. Springer, Heidelberg (2007)
36. RPN, An Introduction To Reverse Polish Notation. http://h41111.www4.hp.com/calculators/uk/en/articles/rpn.html
37. Symantec Internet Security Report (2016). https://resource.elq.symantec.com/LP=2899


Acknowledgements

This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic within the National Sustainability Programme, project No. LO1303 (MSMT-7778/2014); by the European Regional Development Fund under the project CEBIA-Tech, No. CZ.1.05/2.1.00/03.0089; and by the Grant Agency of the Czech Republic, GACR P103/15/06700S.

This research has in part been carried out using computational facilities procured through the European Regional Development Fund, Project ERDF-076 ‘Refurbishing the Signal Processing Laboratory within the Department of CCE’, University of Malta.

Author information

Authors and Affiliations

Department of Computer Information Systems, University of Malta, Msida, Malta
Clyde Meli & Vitezslav Nezval
Department of Informatics and Artificial Intelligence, Tomas Bata University, Zlín, Czech Republic
Zuzana Kominkova Oplatkova
Department of Communications and Computer Engineering, University of Malta, Msida, Malta
Victor Buttigieg


Corresponding author

Correspondence to Clyde Meli.

Editor information

Editors and Affiliations

Department of Applied Computer Science, Institute of Automation and Computer Science, Faculty of Mechanical Engineering, Brno University of Technology, Brno, Czech Republic
Radek Matoušek


Cite this paper

Meli, C., Nezval, V., Kominkova Oplatkova, Z., Buttigieg, V. (2019). Spam Detection Using Linear Genetic Programming. In: Matoušek, R. (eds) Recent Advances in Soft Computing. MENDEL 2017. Advances in Intelligent Systems and Computing, vol 837. Springer, Cham. https://doi.org/10.1007/978-3-319-97888-8_7


DOI: https://doi.org/10.1007/978-3-319-97888-8_7
Published: 05 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97887-1
Online ISBN: 978-3-319-97888-8
eBook Packages: Intelligent Technologies and Robotics (R0)
