Abstract
We propose a new method to extract semantic knowledge from the World Wide Web
for both supervised and unsupervised learning using the Google search engine in
an unconventional manner. The approach is novel in its unrestricted problem
domain, simplicity of implementation, and manifestly ontological underpinnings.
We give evidence of elementary learning of the semantics of concepts, in
contrast to most prior approaches. The method works as follows: The
World Wide Web is the largest database on Earth, and it induces a probability
mass function, the Google distribution, via page counts for combinations of
search queries. This distribution allows us to tap the latent semantic
knowledge on the web. Shannon's coding theorem is used to establish a
code-length associated with each search query. Viewing this mapping as a data
compressor, we connect to earlier work on Normalized Compression Distance. We
give applications in (i) unsupervised hierarchical clustering, demonstrating
the ability to distinguish between colors and numbers, and to distinguish
between 17th century Dutch painters; (ii) supervised concept-learning by
example, using Support Vector Machines, demonstrating the ability to understand
electrical terms, religious terms, emergency incidents, and by conducting a
massive experiment in understanding WordNet categories; and (iii) matching of
meaning, in an example of automatic English-Spanish translation.
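The distance underlying these applications is the Normalized Google Distance, obtained by substituting the Shannon-Fano code lengths of the Google distribution into the Normalized Compression Distance formula. The following is a minimal sketch in Python, assuming fixed illustrative page counts in place of live search-engine queries; PAGE_COUNTS, TOTAL_PAGES, hits, and ngd are hypothetical names introduced here for illustration, not part of the thesis.

```python
import math

# Hypothetical page counts standing in for search-engine hit counts; in the
# thesis these values come from Google queries for the terms and their
# conjunction. The figures below are purely illustrative.
PAGE_COUNTS = {
    ("horse",): 46_700_000,
    ("rider",): 12_200_000,
    ("horse", "rider"): 2_630_000,
}
TOTAL_PAGES = 8_058_044_651  # assumed number of indexed pages (N)


def hits(*terms):
    """Look up the (illustrative) number of pages containing all given terms."""
    return PAGE_COUNTS[tuple(sorted(terms))]


def ngd(x, y, n=TOTAL_PAGES):
    """Normalized Google Distance between two search terms.

    NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
                / (log N - min(log f(x), log f(y)))
    where f(.) are page counts and N is the number of indexed pages.
    """
    fx, fy, fxy = math.log(hits(x)), math.log(hits(y)), math.log(hits(x, y))
    return (max(fx, fy) - fxy) / (math.log(n) - min(fx, fy))


print(f"NGD(horse, rider) = {ngd('horse', 'rider'):.3f}")
```

Terms that co-occur on many pages relative to their individual counts get a small distance, which is what the clustering and SVM experiments in the abstract exploit.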
Description
PhD thesis, version 2009-10-23