abstract = "Genetic programming is an Evolutionary Computing
technique, inspired by biological evolution, capable of
discovering complex non-linear patterns in large
datasets. Genetic programming is a general methodology,
the specific implementation of which requires
development of several different specific elements such
as problem representation, fitness, selection and
genetic variation. Despite the potential advantages of
genetic programming over standard statistical methods,
its applications to survival analysis are at best rare,
primarily because of the difficulty in handling
censored data. The aim of this work was to develop a
genetic programming approach for survival analysis and
demonstrate its utility for the automatic development
of clinical prediction models using cardiovascular
disease as a case study. We developed a tree-based
untyped steady-state genetic programming approach for
censored longitudinal data, comparing its performance
to the de facto statistical method (Cox regression) in
the development of clinical prediction models for the
prediction of future cardiovascular events in patients
with symptomatic and asymptomatic cardiovascular
disease, using large observational datasets. We also
used genetic programming to examine the prognostic
significance of different risk factors together with
their non-linear combinations for the prognosis of
health outcomes in cardiovascular disease. These
experiments showed that Cox regression and the
developed steady-state genetic programming approach
produced similar results when evaluated in common
validation datasets. Despite slight relative
differences, both approaches demonstrated an acceptable
level of discrimination and calibration at a range of
time points. Whilst the application of genetic
programming did not provide more accurate
representations of factors that predict the risk of
both symptomatic and asymptomatic cardiovascular
disease when compared with existing methods, genetic
programming did offer comparable performance. Despite
generally comparable performance, albeit in slight
favour of the Cox model, the predictors selected by the
two approaches to model the outcome were
quite different and, on average, the models developed
using genetic programming used considerably fewer
predictors. The results of the genetic programming
confirm the prognostic significance of a small number
of the most highly associated predictors in the Cox
modelling: age, previous atherosclerosis, and albumin
for secondary prevention; age, recorded diagnosis of
other cardiovascular disease, and ethnicity for primary
prevention in patients with type 2 diabetes. When
considered as a whole, genetic programming did not
produce better-performing clinical prediction models;
rather, it used fewer predictors, most of which were the
predictors that Cox regression estimated to be most
strongly associated with the outcome, whilst achieving
comparable performance. This suggests that genetic
programming may better represent the potentially
non-linear relationship of (a smaller subset of) the
strongest predictors. To our knowledge, this work is
the first study to develop a genetic programming
approach for censored longitudinal data and assess its
value for clinical prediction in comparison with the
well-known and widely applied Cox regression technique.
Using empirical data this work has demonstrated that
clinical prediction models developed by steady-state
genetic programming have predictive ability comparable
to those developed using Cox regression. The genetic
programming models were more complex and thus more
difficult to validate by domain experts; however, these
models were developed in an automated fashion, using
fewer input variables, without the need for the
domain-specific knowledge and expertise required to
appropriately perform survival analysis. This work has
demonstrated the strong potential of genetic
programming as a methodology for automated development
of clinical prediction models for diagnostic and
prognostic purposes in the presence of censored data.
This work compared untuned genetic programming models
that were developed in an automated fashion with highly
tuned Cox regression models that were developed in a
very involved manner that required a certain amount of
clinical and statistical expertise. Whilst the highly
tuned Cox regression models performed slightly better
in validation data, the performance of the
automatically generated genetic programming models was
generally comparable. The comparable performance
demonstrates the utility of genetic programming for
clinical prediction modelling and prognostic research,
where the primary goal is accurate prediction. In
aetiological research, where the primary goal is to
examine the relative strength of association between
risk factors and the outcome, Cox regression and
its variants remain the de facto approach.",