booktitle = "17th International Symposium on Symbolic and Numeric
Algorithms for Scientific Computing (SYNASC)",
title = "Feature Extraction Using Genetic Programming with
Applications in Malware Detection",
year = "2015",
pages = "224--231",
abstract = "This paper extends the authors' previous research on a
malware detection method, focusing on improving the
accuracy of the perceptron-based One Side Class
Perceptron algorithm via the use of Genetic
Programming. We are concerned with finding a proper
balance between the three basic requirements for
malware detection algorithms: (a) that their training
time on large datasets falls below acceptable upper
limits; (b) that their false positive rate
(clean/legitimate files/software wrongly classified as
malware) is as close as possible to 0 and (c) that
their detection rate is as close as possible to 1. When
the first two requirements are set as objectives for
the design of detection algorithms, it often happens
that the third objective is missed: the detection rate
is low. This study focuses on improving the detection
rate while preserving the small training time and the
low rate of false positives. Another concern is to use
the perceptron-based algorithm's good performance on
linearly separable data, by extracting features from
existing ones. In order to keep the overall training
time low, the huge search space of possible extracted
features is efficiently explored in terms of time and
memory footprint using Genetic Programming; better
separability is sought. For the experiments we used a
dataset consisting of 350,000 executable files with an
initial set of 300 Boolean features describing each of
them. The feature-extraction algorithm is implemented
in a parallel manner in order to cope with the size of
the data set. We also tested different ways of
controlling the growth in size of the variable-length
chromosomes. The experimental results show that the
features produced by this method are better than the
best ones obtained through mapping, allowing for an
increase in detection rate.",