The Google Similarity Distance
Created by W. Langdon from gp-bibliography.bib Revision:1.7975
@Article{Cilibrasi:2007:ieeeTKDE,
author = "Rudi L. Cilibrasi and Paul M. B. Vitanyi",
title = "The Google Similarity Distance",
journal = "IEEE Transactions on Knowledge and Data Engineering",
year = "2007",
volume = "19",
number = "3",
pages = "370--383",
month = mar,
keywords = "genetic algorithms, genetic programming, Kolmogorov
complexity, WordNet, artificial common sense, Accuracy
comparison with WordNet categories, automatic
classification and clustering, automatic meaning
discovery using Google, automatic relative semantics,
automatic translation, dissimilarity semantic distance,
Google search, Google distribution via page hit counts,
Google code, normalized compression distance (NCD),
normalized information distance (NID), normalized
Google distance (NGD), meaning of words and phrases
extracted from the Web, parameter-free data mining,
universal similarity metric",
ISSN = "1041-4347",
DOI = "10.1109/TKDE.2007.48",
size = "14 pages",
abstract = "Words and phrases acquire meaning from the way they
are used in society, from their relative semantics to
other words and phrases. For computers, the equivalent
of {"}society{"} is {"}database,{"} and the equivalent
of {"}use{"} is {"}a way to search the database{"}. We
present a new theory of similarity between words and
phrases based on information distance and Kolmogorov
complexity. To fix thoughts, we use the World Wide Web
(WWW) as the database, and Google as the search engine.
The method is also applicable to other search engines
and databases. This theory is then applied to construct
a method to automatically extract similarity, the
Google similarity distance, of words and phrases from
the WWW using Google page counts. The WWW is the
largest database on earth, and the context information
entered by millions of independent users averages out
to provide automatic semantics of useful quality. We
give applications in hierarchical clustering,
classification, and language translation. We give
examples to distinguish between colours and numbers,
cluster names of paintings by 17th century Dutch
masters and names of books by English novelists, the
ability to understand emergencies and primes, and we
demonstrate the ability to do a simple automatic
English-Spanish translation. Finally, we use the
WordNet database as an objective baseline against which
to judge the performance of our method. We conduct a
massive randomized trial in binary classification using
support vector machines to learn categories based on
our Google distance, resulting in a mean agreement
of 87 percent with the expert-crafted WordNet
categories.",
notes = "Also known as \cite{4072748}",
}