Abstract
We propose a novel application of Genetic Programming (GP): the identification of file types via the analysis of raw binary streams (i.e., without the use of metadata). GP evolves programs with multiple components. One component analyses statistical features extracted from the raw byte series to divide the data into blocks. Another component then analyses these blocks to obtain a signature for each file in a training set. The signatures are projected onto a two-dimensional Euclidean space via two further (evolved) program components, and K-means clustering is applied to group similar signatures. Each cluster is then labelled with the dominant label among its members. Once a program that achieves good classification has been evolved, it can be used on unseen data without any further evolution. Experimental results show that GP compares very well with established file classification algorithms (i.e., Neural Networks, Bayes Networks and J48 Decision Trees).
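The final stages of the pipeline described in the abstract (projection onto a plane, K-means clustering, labelling each cluster by its dominant member label, then classifying unseen data) can be illustrated with a minimal sketch. This is not the authors' method: the evolved GP components are replaced here by a hand-written stand-in, `project`, which maps raw bytes to two simple statistics, and the block-splitting step is omitted entirely. All function names, the toy training data, and the choice of statistics are assumptions made only for illustration.

```python
import math
from collections import Counter


def project(data):
    """Hand-written stand-in (assumption) for the evolved components.

    Maps raw bytes to a 2-D point:
    x = mean byte value scaled to [0, 1], y = fraction of printable ASCII bytes.
    """
    if not data:
        return (0.0, 0.0)
    mean = sum(data) / (255 * len(data))
    printable = sum(32 <= b < 127 for b in data) / len(data)
    return (mean, printable)


def kmeans(points, k, max_iterations=50):
    """Plain Lloyd's k-means; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: assignments no longer change
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignment


def label_clusters(assignment, labels, k):
    """Give each non-empty cluster the most common label among its members."""
    names = {}
    for c in range(k):
        votes = Counter(l for l, a in zip(labels, assignment) if a == c)
        if votes:
            names[c] = votes.most_common(1)[0][0]
    return names


def classify(data, centroids, names):
    """Classify unseen bytes by the label of the nearest labelled centroid."""
    point = project(data)
    nearest = min(names, key=lambda c: math.dist(point, centroids[c]))
    return names[nearest]


# Toy training set (assumption): readable text versus high-valued binary bytes.
training = [
    (b"hello world, plain text", "text"),
    (bytes(range(128, 256)), "binary"),
    (b"another line of readable text", "text"),
    (bytes([200, 255, 0, 3, 250, 240] * 10), "binary"),
]
points = [project(data) for data, _ in training]
labels = [label for _, label in training]

centroids, assignment = kmeans(points, k=2)
names = label_clusters(assignment, labels, k=2)

print(classify(b"some unseen readable sentence", centroids, names))
print(classify(bytes([0x89, 0xFF, 0x00, 0xFE] * 8), centroids, names))
```

As in the abstract, no retraining happens at classification time: an unseen input is projected with the same function and assigned the label of the nearest cluster. In the paper the projection is evolved rather than fixed, so the quality of the clusters depends on the GP run rather than on hand-picked statistics.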
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kattan, A., Galván-López, E., Poli, R., O’Neill, M. (2010). GP-Fileprints: File Types Detection Using Genetic Programming. In: Esparcia-Alcázar, A.I., Ekárt, A., Silva, S., Dignum, S., Uyar, A.Ş. (eds) Genetic Programming. EuroGP 2010. Lecture Notes in Computer Science, vol 6021. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12148-7_12
DOI: https://doi.org/10.1007/978-3-642-12148-7_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12147-0
Online ISBN: 978-3-642-12148-7
eBook Packages: Computer Science, Computer Science (R0)