Abstract
Words and phrases acquire meaning from the way
they are used in society and from their semantics relative to other
words and phrases. For computers, the equivalent of ‘society’
is ‘database,’ and the equivalent of ‘use’ is ‘way to search the
database.’ We present a new theory of similarity between words
and phrases based on information distance and Kolmogorov complexity.
To fix thoughts, we use the world-wide-web as the database
and Google as the search engine. The method is also applicable
to other search engines and databases. This theory is then
applied to construct a method to automatically extract similarity,
the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web
is the largest database on earth, and the context information
entered by millions of independent users averages out to provide
automatic semantics of useful quality. We give applications in
hierarchical clustering, classification, and language translation.
We give examples that distinguish between colors and numbers,
cluster names of paintings by 17th-century Dutch masters and
names of books by English novelists, show the ability to understand
emergencies and primes, and demonstrate
a simple automatic English-Spanish translation. Finally, we use
the WordNet database as an objective baseline against which to
judge the performance of our method. We conduct a massive
randomized trial in binary classification using support vector
machines to learn categories based on our Google distance,
resulting in a mean agreement of 87% with the expert-crafted
WordNet categories.
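
As an illustration of how a similarity distance can be computed from page counts, the following minimal sketch implements the normalized Google distance in its commonly cited form, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The abstract itself does not state this formula, so treat it as an assumption about the method. The function name, its parameters, and the page counts in the usage lines are hypothetical: f(x) and f(y) stand for the number of pages containing each term, f(x, y) for the number containing both, and N for the total number of indexed pages. No search engine is queried here.

```python
import math


def normalized_google_distance(fx: float, fy: float, fxy: float, n: float) -> float:
    """Sketch of the normalized Google distance from raw page counts.

    fx, fy -- page counts for each term alone (hypothetical inputs)
    fxy    -- page count for pages containing both terms
    n      -- total number of indexed pages; must exceed fx and fy
    """
    # If either term, or the pair, never occurs, treat the distance as infinite.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Made-up counts for illustration only, not real search results.
same_term = normalized_google_distance(1000, 1000, 1000, 1_000_000)
partial_overlap = normalized_google_distance(100, 100, 10, 10_000)
```

Under these assumptions, two terms that always co-occur get distance 0, and the distance grows as the joint count shrinks relative to the individual counts. The choice of logarithm base does not matter, since it cancels in the ratio.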