abstract = "With the growing amount of data that are collected and
recorded in various application areas, the need to use
these data is also growing. In science, data have
always played an important role; in recent years,
however, the economic potential of data has also become
increasingly important. In combination with methods for
data analysis, data can be used to their full
potential, whether in the commercial sector to optimise
offers, or in the industrial sector to optimise
resources and product quality based on process data.
This work describes a new approach for the analysis of
data which is based on symbolic regression with genetic
programming and aims to generate an overall view of the
interactions of various variables of a system. By this
means, all potentially interesting relationships that
can be detected in a dataset should be identified and
represented as compact and understandable models. In
the first part of this work, this approach of
comprehensive symbolic regression is described in
detail. Important issues that play a role in the
process are the prevention of bloat and over-fitting,
the simplification of models, and the identification of
relevant input variables. In this context, different
methods for bloat control and prevention are presented
and compared. In particular, the influence of offspring
selection on bloat is analysed. In addition, a new way
to detect over-fitting is presented. On the basis of
this, extensions for the reduction of over-fitting are
presented and compared. Pruning of models is featured
prominently, on the one hand to prevent over-fitting
and on the other hand to simplify complex models. An
important aspect is the analysis of the vast number of
different models that result from the proposed
approach. In this context, different methods to
quantify relevant factors are proposed. These methods
can be used to identify interactions of variables of
the analysed system. Visualising such interactions
provides a general overview of the system in question
that could not be obtained by analysing individual
models, each of which concentrates on selected aspects of
the problem. Additionally, the prognosis of
multivariate time series with genetic programming is
described in the first part. The second part of this
work shows how the described approach can be applied to
the analysis of real-world systems, and how the results
of this data analysis process can yield
new knowledge about the investigated system. The
analysed data stem from a blast furnace for the
production of steel and an industrial chemical process.
In addition, the same approach is applied to a
collection of economic data in order to identify
macro-economic interactions.",