abstract = "Recently, the low-cost Microsoft Kinect sensor, which
can capture real-time high-resolution RGB and depth
visual information, has attracted increasing attention
for a wide range of applications in computer vision.
Existing techniques extract hand-tuned features from
the RGB and depth data separately and fuse them
heuristically, which fails to fully exploit the
complementarity of the two data sources. In this paper, we
introduce an adaptive learning methodology that
automatically extracts (holistic) spatio-temporal
features from RGB-D video data, simultaneously fusing
the RGB and depth information, for visual recognition
tasks. We address this as an optimisation
problem using our proposed restricted graph-based
genetic programming (RGGP) approach, in which a group
of primitive 3D operators is first randomly assembled
into graph-based combinations, which are then evolved
generation by generation by evaluating them on a set of
RGB-D video samples. Finally, the best-performing
combination is selected as the (near-)optimal
representation for a pre-defined task.
The proposed method is systematically evaluated on a
new hand-gesture dataset, SKIG, which we collected
ourselves, and on the public MSR Daily Activity 3D
dataset. Extensive experimental results show that
our approach achieves significant improvements over
state-of-the-art hand-crafted and machine-learnt
features.",