abstract = "Substructural analysis (SSA) was one of the very first
machine learning techniques to be applied to
chemoinformatics in the area of virtual screening. Given
a set of compounds, typically defined by their fragment
occurrence data (such as 2D fingerprints), SSA computes
a weight for each fragment that reflects its
contribution to the activity (or inactivity) of
compounds containing that fragment. The overall
probability of activity for a
compound is then computed by summing up or combining
the weights for the fragments present in the compound.
A variety of weighting schemes based on specific
relationship-bound equations are available for this
purpose. This thesis identifies improvements to the
effectiveness of SSA using two evolutionary
computation methods based on genetic principles,
namely the genetic algorithm (GA) and genetic
programming (GP). Building on previous studies, ten
published SSA weighting schemes were analysed and
compared in a simulated virtual screening experiment.
The analysis showed the most effective weighting
scheme to be the R4 equation, which is one of the
document-based weighting schemes. A
second experiment was carried out to investigate the
application of a GA-based weighting scheme for SSA in
comparison with the R4 weighting scheme. The GA is
simple in concept, focusing purely on generating
suitable weights, and effective in operation. The
findings show that the GA-based SSA is
superior to the R4-based SSA, both in terms of active
compound retrieval rate and predictive performance. A
third experiment investigated a GP-based SSA. The
experimental results showed that the GP was superior
to the existing SSA weighting schemes. In general,
however, the
GP-based SSA was found to be less effective than the
GA-based SSA. A final experiment sought to explore the
feasibility of data fusion on both the GA and GP. Data
fusion produces a final ranked list from multiple
ranked lists, based on several fusion rules. The
results indicate that data fusion is an effective way
to boost GA- and GP-based SSA searching. The RKP rule
was considered the most effective fusion rule.",
notes = "I would also like to thank my main sponsors: the
Ministry of Higher Education (MOHE), Malaysia and also
my employer, University Kebangsaan Malaysia (UKM), for
the opportunity given in pursuit of this PhD study.
ISNI: 0000 0004 5991 7219 Supervisors: Peter Willett
and John Holliday",