Relational clustering for knowledge discovery in life sciences
Created by W.Langdon from
gp-bibliography.bib Revision:1.8051
@PhdThesis{Giordani:thesis,
author = "Ilaria Giordani",
title = "Relational clustering for knowledge discovery in life
sciences",
school = "Universita degli Studi di Milano-Bicocca",
year = "2009",
address = "Italy",
month = oct,
keywords = "genetic algorithms, genetic programming, Relational
Clustering, Feature Selection, Knowledge integration,
Mixed data types",
URL = "http://boa.unimib.it/handle/10281/7830",
URL = "http://hdl.handle.net/10281/7830",
URL = "http://boa.unimib.it/bitstream/10281/7830/1/phd_unimib_032791.pdf",
language = "eng",
size = "144 pages",
abstract = "Clustering is one of the most common machine learning
techniques and has been widely applied in genomics,
proteomics and, more generally, in the life sciences. In
particular, clustering is an unsupervised technique
that, based on geometric concepts like distance or
similarity, partitions objects into groups, such that
objects with similar characteristics are clustered
together and dissimilar objects are in different
clusters. In many domains where clustering is applied,
some background knowledge is available in different
forms: labelled data (specifying the category to which
an instance belongs); complementary information about
'true' similarity between pairs of objects or about the
relationship structure present in the input data; user
preferences (for example specifying whether two
instances should be in the same or different clusters). In
particular, in many real-world applications like
biological data processing, social network analysis and
text mining, data do not exist in isolation, but a rich
structure of relationships exists between them. A
simple example can be seen in the biological domain,
where there are many relationships between genes
and proteins based on many experimental conditions.
Another example, perhaps more familiar, is the Web search domain
where there are relations between documents and words
in a text or web pages, search queries and web users.
Our research is focused on how this background
knowledge can be incorporated into traditional
clustering algorithms to optimise the process of
pattern discovery (clustering) between instances.",
abstract = "We provide an overview of traditional clustering methods
with some important distance measures and then we
analyse three particular challenges that we try to
overcome with different proposed methods: 'feature
selection' to reduce high dimensional input space and
remove noise from data; 'mixed data types' to handle
both numeric and categorical values, typical of life
science applications, in the clustering procedure;
finally, 'knowledge integration' in order to improve
the semantic value of clustering by incorporating the
background knowledge. Regarding the first challenge we
propose a novel approach based on genetic
programming, an evolutionary algorithm-based
methodology, in order to automatically perform feature
selection. Different clustering algorithms have been
investigated regarding the second challenge. A modified
version of a particular algorithm is proposed and
applied to clinical data. Particular attention is
given to the final challenge, the most important
objective of this Thesis: the development of a new
relational clustering framework in order to improve the
semantic value of clustering by taking into account, in the
clustering algorithm, relationships learnt from
background knowledge. We investigate and classify
existing clustering methods into two principal
categories: - Structure driven approaches, which are
bound to the data structure. The data clustering problem is
tackled from several dimensions: clustering
concurrently columns and rows of a given dataset, as in
biclustering algorithms or vertical 3-D clustering. -
Knowledge driven approaches, where domain information
is used to drive the clustering process and interpret
its results: semi-supervised clustering, which uses
both labelled and unlabelled data, has attracted
significant attention. This kind of clustering
algorithm represents the first step towards implementing the
proposed general framework, which is classified into
this category. In particular, the thesis focuses on the
development of a general framework for relational
clustering, instantiating it for three different life
science applications. The first aims at
finding groups of genes with similar behaviour with respect
to their expression and regulatory profiles. The second
one is a pharmacogenomics application, in which the
relational clustering framework is applied to a
benchmark dataset (NCI60) to identify a drug treatment
for a given cell line based on both drug activity
pattern and gene expression profile. Finally, the
proposed framework is applied to clinical data: a
particular dataset containing different information
about patients on anticoagulant therapy has been
analyzed to find groups of patients with similar
behaviour and responses to the therapy.",
notes = "NCI60, Saccharomyces Genome Database, Oral
anticoagulation therapy. Also known as
\cite{10281_7830}",
}