
User generated content: how good is it?

Proceedings of the 3rd Workshop on Information Credibility on the Web, pages 1-2. New York, NY, USA: ACM, 2009.
DOI: 10.1145/1526993.1526995

Abstract

User Generated Content (UGC) is one of the main current trends on the Web. This trend has allowed anyone with Internet access to publish content in different media, such as text (e.g. blogs), photos, or video. This data can be crucial for many applications, in particular for semantic search. It is too early to say what impact UGC will have and to what extent. However, the impact will clearly be related to the quality of this content. Hence, how good is the content that people generate in the so-called Web 2.0? Clearly it is not as good as the editorial content on a publisher's Web site. However, success stories such as Wikipedia show that it can be quite good. In addition, the quality gap is balanced by volume, as user generated content is much larger than, say, editorial content. In fact, Ramakrishnan and Tomkins estimate that UGC produces 8 to 10 GB of content daily, while the professional Web produces only 2 GB in the same time.

How can we estimate the quality of UGC? One possibility is to evaluate the quality directly, but that is not easy, as it depends on the type of content and the availability of human judgments. One example of such an approach is the study of Yahoo! Answers by Agichtein et al. In this work they start from a judged question/answer collection in which good questions usually have good answers. They then predict good questions and good answers, obtaining an AUC (area under the curve of the precision-recall graph) of 0.76 and 0.88, respectively.

A second possibility is to obtain indirect evidence of the quality: for example, use UGC for a given task and then evaluate the quality of the task results. One such example is the extraction of semantic relations by Baeza-Yates and Tiberi. To evaluate the quality of the results they used the Open Directory Project (ODP), showing that the results had a precision of over 60%. For the cases that were not found in the ODP, a manually verified sample showed that the real precision was close to 100%. The reason is that the ODP is not specific enough to contain very specific relations, and the problem gets worse every day as we have more data. This example shows the quality of the ODP as well as the semantics encoded in queries. Notice that we can regard queries as implicit UGC, because each query can be considered an implicit tag on the Web pages that are clicked for that query, and hence we have an implicit folksonomy.

A final alternative is to cross different UGC sources and infer from them the quality of those sources. An example of this is the work by Van Zwol et al., who use collective knowledge (the wisdom of crowds) to extend image tags and show that almost 70% of the tags can be semantically classified using WordNet and Wikipedia. This exposes the quality of both Flickr tags and Wikipedia.

Our main motivation is that by generating semantic resources automatically from the Web (and in particular the Web 2.0), even with noise, and coupling them with open content resources, we can create a virtuous feedback loop. In fact, explicit and implicit folksonomies can be used for supervised machine learning to improve semantic tagging without manual intervention (or at least with drastically less of it). After that, we can feed the results back into the process and repeat it. Under the right conditions, every iteration should improve the output, yielding a virtuous cycle. As a side effect, we can also improve Web search, our main goal.
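
The AUC values cited above for the Yahoo! Answers study (0.76 for questions, 0.88 for answers) are areas under the precision-recall curve of a quality classifier. As a rough illustration only, not the actual setup of Agichtein et al., the following Python sketch (assuming scikit-learn is available, with made-up scores and judgments) shows how such a value is computed:

    from sklearn.metrics import precision_recall_curve, auc

    # Hypothetical human judgments: 1 = item judged "good", 0 = judged "bad".
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    # Hypothetical classifier scores estimating item quality.
    y_scores = [0.9, 0.2, 0.7, 0.8, 0.4, 0.6, 0.3, 0.5, 0.85, 0.1]

    # Build the precision-recall curve and integrate it to obtain the AUC.
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    print("AUC of the precision-recall curve:", round(auc(recall, precision), 2))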
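
The notion of queries as implicit UGC can be made concrete: each query acts as a tag on the pages users click for it. Below is a minimal sketch of that idea, assuming a click log given as (query, clicked URL) pairs; the log format and example entries are illustrative assumptions, not data from the paper:

    from collections import defaultdict

    # Hypothetical query-click log: (query, clicked URL) pairs.
    click_log = [
        ("jaguar speed", "https://en.wikipedia.org/wiki/Jaguar"),
        ("big cats", "https://en.wikipedia.org/wiki/Jaguar"),
        ("jaguar speed", "https://example.com/animal-facts"),
    ]

    # Implicit folksonomy: every query becomes an implicit tag on the
    # pages that were clicked for it.
    implicit_tags = defaultdict(set)
    for query, url in click_log:
        implicit_tags[url].add(query)

    for url, tags in sorted(implicit_tags.items()):
        print(url, "->", sorted(tags))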
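
The closing idea, seeding a tagger with folksonomy-derived labels, feeding its confident outputs back as new training data, and iterating, amounts to a self-training loop. The sketch below is a toy illustration under heavy assumptions (the data, confidence threshold, and model are all invented for the example and are not the authors' method):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical seed labels, e.g. taken from an explicit folksonomy.
    labeled = [("photo of a jaguar in the jungle", "animal"),
               ("new jaguar car model review", "car"),
               ("tiger and jaguar habitats", "animal"),
               ("jaguar dealership prices", "car")]
    unlabeled = ["jaguar cub spotted in the wild",
                 "jaguar engine specifications",
                 "rainforest cats documentary"]

    texts, labels = map(list, zip(*labeled))
    for _ in range(3):  # a few feedback iterations
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(texts)
        model = MultinomialNB().fit(X, labels)
        if not unlabeled:
            break
        probabilities = model.predict_proba(vectorizer.transform(unlabeled))
        remaining = []
        for text, p in zip(unlabeled, probabilities):
            if p.max() >= 0.6:  # feed back only confident predictions
                texts.append(text)
                labels.append(model.classes_[p.argmax()])
            else:
                remaining.append(text)
        unlabeled = remaining

    print(dict(zip(texts, labels)))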

Tags

community
