Detection of Malicious Websites Using Symbolic Classifier
Created by W.Langdon from
gp-bibliography.bib Revision:1.7929
@Article{Andelic:2022:FI,
author = "Nikola Andelic and Sandi {Baressi Segota} and
Ivan Lorencin and Matko Glucina",
title = "Detection of Malicious Websites Using Symbolic
Classifier",
journal = "Future Internet",
year = "2022",
volume = "14",
number = "12",
pages = "Article no 358",
month = nov,
email = "nandelic@riteh.hr",
keywords = "genetic algorithms, genetic programming, malicious
websites, oversampling methods, symbolic classifier,
undersampling methods",
ISSN = "1999-5903",
publisher = "MDPI",
URL = "https://www.mdpi.com/1999-5903/14/12/358",
DOI = "doi:10.3390/fi14120358",
size = "30 pages",
abstract = "Malicious websites are web locations that attempt to
install malware, which is the general term for anything
that will cause problems in computer operation, gather
confidential information, or gain total control over
the computer. A novel approach is proposed which
consists of the implementation of the genetic
programming symbolic classifier (GPSC) algorithm on a
publicly available dataset to obtain a simple symbolic
expression (mathematical equation) which could detect
malicious websites with high classification accuracy.
Due to a large imbalance of classes in the initial
dataset, several data sampling methods (random
under-sampling/oversampling, ADASYN, SMOTE,
BorderlineSMOTE, and KMeansSMOTE) were used to balance
the dataset classes. For this investigation, the
hyper-parameter search method was developed to find the
combination of GPSC hyperparameters with which high
classification accuracy could be achieved. The first
investigation was conducted using GPSC with a random
hyperparameter search method and each dataset variation
was divided into a train and test dataset in a ratio of
70:30. To evaluate each symbolic expression, the
performance of each symbolic expression was measured on
the train and test dataset and the mean and standard
deviation values of accuracy (ACC), AUC, precision,
recall and f1-score were obtained. The second
investigation was also conducted using GPSC with the
random hyperparameter search method; however,
70 percent, i.e., the train dataset, was used to perform
5-fold cross-validation. If the mean accuracy, AUC,
precision, recall, and f1-score values were above 0.97
then final training and testing (train/test 70:30) were
performed with GPSC with the same randomly chosen
hyperparameters used in a 5-fold cross-validation
process and the final mean and standard deviation
values of the aforementioned evaluation methods were
obtained. In both investigations, the best symbolic
expression was obtained in the case where the dataset
balanced with the KMeansSMOTE method was used for
training and testing. The best symbolic expression
obtained using GPSC with the random hyperparameter
search method and classic train-test procedure (70:30)
on a dataset balanced with the KMeansSMOTE method
achieved values of",
notes = "Faculty of Engineering, University of Rijeka, 51000
Rijeka, Croatia",
}