Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of
books by English novelists, the ability to understand emergencies, and primes,
and we demonstrate the ability to do a simple automatic English-Spanish
translation. Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in a mean agreement
of 87\% with the expert crafted WordNet categories.
%0 Journal Article
%1 citeulike:4487
%A Cilibrasi, Rudi
%A Vitanyi, Paul M. B.
%D 2007
%K automatic-learning, google, linguistics, ontology, semantic
%T The Google Similarity Distance
%U http://arxiv.org/abs/cs.CL/0412098
%X Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of
books by English novelists, the ability to understand emergencies, and primes,
and we demonstrate the ability to do a simple automatic English-Spanish
translation. Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in a mean agreement
of 87\% with the expert crafted WordNet categories.
@article{citeulike:4487,
abstract = {{Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch masters and names of
books by English novelists, the ability to understand emergencies, and primes,
and we demonstrate the ability to do a simple automatic English-Spanish
translation. Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector machines to
learn categories based on our Google distance, resulting in a mean agreement
of 87\% with the expert crafted WordNet categories.}},
added-at = {2010-12-17T18:47:41.000+0100},
archiveprefix = {arXiv},
author = {Cilibrasi, Rudi and Vitanyi, Paul M. B.},
biburl = {https://www.bibsonomy.org/bibtex/24e823daa890d0bafff91045fd4bedb0b/mortimer_m8},
citeulike-article-id = {4487},
citeulike-linkout-0 = {http://arxiv.org/abs/cs.CL/0412098},
citeulike-linkout-1 = {http://arxiv.org/pdf/cs.CL/0412098},
day = 30,
eprint = {cs.CL/0412098},
interhash = {8fc73a93c327ea9a45ef793242ac3508},
intrahash = {4e823daa890d0bafff91045fd4bedb0b},
keywords = {automatic-learning, google, linguistics, ontology, semantic},
month = May,
posted-at = {2004-12-28 20:46:48},
priority = {4},
timestamp = {2010-12-20T11:11:25.000+0100},
title = {{The Google Similarity Distance}},
url = {http://arxiv.org/abs/cs.CL/0412098},
year = 2007
}