GP-Fileprints: File Types Detection Using Genetic Programming

  • Conference paper
Genetic Programming (EuroGP 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6021)

Abstract

We propose a novel application of Genetic Programming (GP): the identification of file types via the analysis of raw binary streams (i.e., without the use of meta data). GP evolves programs with multiple components. One component analyses statistical features extracted from the raw byte-series to divide the data into blocks. These blocks are then analysed via another component to obtain a signature for each file in a training set. These signatures are then projected onto a two-dimensional Euclidean space via two further (evolved) program components. K-means clustering is applied to group similar signatures. Each cluster is then labelled according to the dominant label for its members. Once a program that achieves good classification is evolved it can be used on unseen data without requiring any further evolution. Experimental results show that GP compares very well with established file classification algorithms (i.e., Neural Networks, Bayes Networks and J48 Decision Trees).
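
The last two stages described in the abstract (K-means clustering of the projected signatures, then labelling each cluster by its dominant member label) can be sketched as below. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the evolved GP components are not reproduced, the 2-D points are made-up values standing in for their output, the function names are hypothetical, and classifying an unseen point by its nearest centroid is an assumption that the abstract does not spell out.

```python
from collections import Counter
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignment


def label_clusters(assignment, labels, k):
    """Give each cluster the most common file-type label among its members."""
    names = {}
    for c in range(k):
        members = [lab for a, lab in zip(assignment, labels) if a == c]
        names[c] = Counter(members).most_common(1)[0][0] if members else None
    return names


def classify(point, centroids, names):
    """Assumed rule: predict the label of the cluster with the nearest centroid."""
    nearest = min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))
    return names[nearest]


# Hypothetical 2-D projections of file signatures (made-up values, not paper data).
train_points = [(0.1, 0.2), (5.0, 5.1), (0.2, 0.1), (5.2, 4.9), (0.0, 0.3), (4.8, 5.0)]
train_labels = ["txt", "jpg", "txt", "jpg", "jpg", "jpg"]

centroids, assignment = kmeans(train_points, k=2)
names = label_clusters(assignment, train_labels, k=2)
print(classify((0.15, 0.15), centroids, names))
print(classify((5.1, 5.0), centroids, names))
```

The fifth training point carries a "jpg" label but falls in the cluster dominated by "txt" files, so the majority vote still names that cluster "txt". This is the sense in which each cluster takes the dominant label of its members.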

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kattan, A., Galván-López, E., Poli, R., O’Neill, M. (2010). GP-Fileprints: File Types Detection Using Genetic Programming. In: Esparcia-Alcázar, A.I., Ekárt, A., Silva, S., Dignum, S., Uyar, A.Ş. (eds) Genetic Programming. EuroGP 2010. Lecture Notes in Computer Science, vol 6021. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12148-7_12

  • DOI: https://doi.org/10.1007/978-3-642-12148-7_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12147-0

  • Online ISBN: 978-3-642-12148-7

  • eBook Packages: Computer Science, Computer Science (R0)
