We propose a new unsupervised learning technique for extracting
information from large text collections. We model documents as if
they were generated by a two-stage stochastic process. Each author is
represented by a probability distribution over topics, and each topic
is represented as a probability distribution over words for that
topic. The words in a multi-author paper are assumed to be the result
of a mixture of each author's topic mixture. The topic-word and
author-topic distributions are learned from data in an unsupervised
manner using a Markov chain Monte Carlo algorithm. We apply the
methodology to a large corpus of 160,000 abstracts and 85,000 authors
from the well-known CiteSeer digital library, and learn a model with
300 topics. We discuss in detail the interpretation of the results
discovered by the system, including specific topic and author models,
ranking of authors by topic and topics by author, significant trends
in the computer science literature between 1990 and 2002, parsing of
abstracts by topics and authors, and detection of unusual papers by
specific authors. An online query interface to the model is also
discussed that allows interactive exploration of author-topic models
for corpora such as CiteSeer.
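
The two-stage generative process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name, the toy authors, topics, vocabulary and probabilities are all made up for illustration, and the uniform choice of an author for each word is an assumption that the abstract itself does not spell out. The sketch only generates data from fixed distributions; it does not perform the Markov chain Monte Carlo learning step.

```python
import random

def generate_document(doc_authors, author_topic, topic_word, n_words, rng):
    """Sample n_words (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        # Assumption: each word is attributed to one of the document's
        # authors, picked uniformly at random.
        author = rng.choice(doc_authors)
        # Stage 1: draw a topic from that author's distribution over topics.
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Stage 2: draw a word from that topic's distribution over words.
        vocab, word_probs = zip(*topic_word[topic].items())
        word = rng.choices(vocab, weights=word_probs, k=1)[0]
        tokens.append((author, topic, word))
    return tokens

# Hypothetical toy parameters, not taken from the paper.
author_topic = {
    "author_a": {"topic_ml": 0.9, "topic_db": 0.1},
    "author_b": {"topic_ml": 0.2, "topic_db": 0.8},
}
topic_word = {
    "topic_ml": {"learning": 0.5, "model": 0.3, "data": 0.2},
    "topic_db": {"query": 0.6, "index": 0.3, "data": 0.1},
}

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = generate_document(["author_a", "author_b"], author_topic, topic_word, 10, rng)
for author, topic, word in sample:
    print(author, topic, word)
```

Because every word first passes through an author, a two-author document ends up as a blend of both authors' topic preferences, which is the "mixture of each author's topic mixture" mentioned above.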
Description
generative document model with latent author-topic vars
%0 Conference Proceedings
%1 steyvers2004pat
%A Steyvers, M.
%A Smyth, P.
%A Rosen-Zvi, M.
%A Griffiths, T.
%D 2004
%I ACM
%C New York, NY, USA
%B Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
%K authortopic imported machinelearning model topic
%P 306--315
%T Probabilistic author-topic models for information discovery
%U http://irlab.cis.nctu.edu.tw/Presentation2005/%E9%99%B3%E4%BB%A5%E7%90%86/4_20/Probabilistic%20Author-Topic%20Models%20for%20Information%20discovery.pdf
%X We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each author's topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
@inproceedings{steyvers2004pat,
abstract = {We propose a new unsupervised learning technique for extracting
information from large text collections. We model documents as if
they were generated by a two-stage stochastic process. Each author is
represented by a probability distribution over topics, and each topic
is represented as a probability distribution over words for that
topic. The words in a multi-author paper are assumed to be the result
of a mixture of each author's topic mixture. The topic-word and
author-topic distributions are learned from data in an unsupervised
manner using a Markov chain Monte Carlo algorithm. We apply the
methodology to a large corpus of 160,000 abstracts and 85,000 authors
from the well-known CiteSeer digital library, and learn a model with
300 topics. We discuss in detail the interpretation of the results
discovered by the system, including specific topic and author models,
ranking of authors by topic and topics by author, significant trends
in the computer science literature between 1990 and 2002, parsing of
abstracts by topics and authors, and detection of unusual papers by
specific authors. An online query interface to the model is also
discussed that allows interactive exploration of author-topic models
for corpora such as CiteSeer.},
added-at = {2008-09-09T07:34:25.000+0200},
author = {Steyvers, M. and Smyth, P. and Rosen-Zvi, M. and Griffiths, T.},
biburl = {https://www.bibsonomy.org/bibtex/2f6764f9867c5f9c06c07c7f53de9e033/tberg},
description = {generative document model with latent author-topic vars},
interhash = {b80d5948a7089aa63ce0f7d349c5ab85},
intrahash = {f6764f9867c5f9c06c07c7f53de9e033},
booktitle = {Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
keywords = {authortopic imported machinelearning model topic},
pages = {306--315},
publisher = {ACM},
address = {New York, NY, USA},
timestamp = {2008-09-09T07:34:25.000+0200},
title = {{Probabilistic author-topic models for information discovery}},
url = {http://irlab.cis.nctu.edu.tw/Presentation2005/%E9%99%B3%E4%BB%A5%E7%90%86/4_20/Probabilistic%20Author-Topic%20Models%20for%20Information%20discovery.pdf},
year = 2004
}