Clustering by Compression
Created by W.Langdon from
gp-bibliography.bib Revision:1.7964
- @Article{Cilibrasi:2005:ITIT,
-
author = "Rudi Cilibrasi and Paul M. B. Vitanyi",
-
title = "Clustering by Compression",
-
journal = "IEEE Transactions on Information Theory",
-
year = "2005",
-
volume = "51",
-
number = "4",
-
pages = "1523--1545",
-
month = apr,
-
keywords = "genetic algorithms, genetic programming, complearn,
universal dissimilarity distance, normalised
compression distance, hierarchical unsupervised
clustering, quartet tree method, parameter-free
data-mining, heterogenous data analysis, Kolmogorov
complexity",
-
ISSN = "0018-9448",
-
URL = "http://homepages.cwi.nl/~paulv/papers/cluster.pdf",
-
DOI = "doi:10.1109/TIT.2005.844059",
-
size = "21 pages",
-
abstract = "We present a new method for clustering based on
compression. The method doesn't use subject-specific
features or background knowledge, and works as follows:
First, we determine a parameter-free, universal,
similarity distance, the normalized compression
distance or NCD , computed from the lengths of
compressed data files (singly and in pairwise
concatenation). Second, we apply a hierarchical
clustering method. The NCD is not restricted to a
specific application area, and works across application
area boundaries. A theoretical precursor, the
normalised information distance, co-developed by one of
the authors, is provably optimal. However, the
optimality comes at the price of using the
non-computable notion of Kolmogorov complexity. We
propose axioms to capture the real-world setting, and
show that the NCD approximates optimality. To extract a
hierarchy of clusters from the distance matrix, we
determine a dendrogram (binary tree) by a new quartet
method and a fast heuristic to implement it. The method
is implemented and available as public software, and is
robust under choice of different compressors. To
substantiate our claims of universality and robustness,
we report evidence of successful application in areas
as diverse as genomics, virology, languages,
literature, music, handwritten digits, astronomy, and
combinations of objects from completely different
domains, using statistical, dictionary, and block
sorting compressors. In genomics we presented new
evidence for major questions in Mammalian evolution,
based on whole-mitochondrial genomic analysis: the
Eutherian orders and the Marsupionta hypothesis against
the Theria hypothesis.",
-
notes = "With respect to the version published in the IEEE
Trans. Inform. Th., 51:4(2005), 1523-1545, we have
changed Definition 2.1 of 'admissible distance' making
it more general and Definitions 2.4 and 2.5 of
'normalized admissible distance,' consequently adapted
Lemma 2.6 (II.2) and in its proof (II.3) and the
displayed inequalities. This left Theorem 6.3 unchanged
except for changing 'such that d(x; y) \le e' to 'such
that d(x; y) \le e and C(v) \le C(x).'",
- }
Genetic Programming entries for
Rudi Cilibrasi
Paul M B Vitanyi
Citations