закладки  36

  •  

    We present a novel method for key term extraction from text documents. In our method, document is modeled as a graph of semantic relationships between terms of that document. We exploit the following remarkable feature of the graph: the terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs or commu- nities, while non-important terms fall into weakly intercon- nected communities, or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms. We introduce a criterion function to select groups that contain key terms discarding groups with unimportant terms. To weight terms and determine semantic relatedness between them we exploit information extracted from Wikipedia. Using such an approach gives us the following two ad- vantages. First, it allows effectively processing multi-theme documents. Second, it is good at filtering out noise infor- mation in the document, such as, for example, navigational bars or headers in web pages. Evaluations of the method show that it outperforms exist- ing methods producing key terms with higher precision and recall. Additional experiments on web pages prove that our method appears to be substantially more effective on noisy and multi-theme documents than existing methods.
    15 лет назад , @dbenz
     
      attendedwww2009
       
       
    •  

      Measuring the similarity between semantic relations that hold among entities is an important and necessary step in various Web related tasks such as relation extraction, information retrieval and analogy detection. For example, consider the case in which a person knows a pair of entities (e.g. Google, YouTube), between which a partic- ular relation holds (e.g. acquisition). The person is interested in retrieving other such pairs with similar relations (e.g. Microsoft, Powerset). Existing keyword-based search engines cannot be ap- plied directly in this case because, in keyword-based search, the goal is to retrieve documents that are relevant to the words used in a query – not necessarily to the relations implied by a pair of words. We propose a relational similarity measure, using a Web search en- gine, to compute the similarity between semantic relations implied by two pairs of words. Our method has three components: repre- senting the various semantic relations that exist between a pair of words using automatically extracted lexical patterns, clustering the extracted lexical patterns to identify the different patterns that ex- press a particular semantic relation, and measuring the similarity between semantic relations using a metric learning approach. We evaluate the proposed method in two tasks: classifying semantic relations between named entities, and solving word-analogy ques- tions. The proposed method outperforms all baselines in a relation classification task with a statistically significant average precision score of 0.74. Moreover, it reduces the time taken by Latent Relational Analysis to process 374 word-analogy questions from 9 days to less than 6 hours, with an SAT score of 51%.
      15 лет назад , @dbenz
       
        attendedwww2009
         
         
      •  

        In this paper we present Triplify – a simplistic but effective approach to publish Linked Data from relational databases. Triplify is based on mapping HTTP-URI requests onto relational database queries. Triplify transforms the resulting relations into RDF statements and publishes the data on the Web in various RDF serializations, in particular as Linked Data. The rationale for developing Triplify is that the largest part of information on the Web is already stored in structured form, often as data contained in relational databases, but usually published by Web applications only as HTML mixing structure, layout and content. In order to reveal the pure structured information behind the current Web, we have implemented Triplify as a light-weight software component, which can be easily integrated into and deployed by the numerous, widely installed Web applications. Our approach includes a method for publishing update logs to enable incremental crawling of linked data sources. Triplify is complemented by a library of configurations for common relational schemata and a REST-enabled data source registry. Triplify configurations containing mappings are provided for many popular Web applications, including osCommerce, WordPress, Drupal, Gallery, and phpBB. We will show that despite its light-weight architecture Triplify is usable to publish very large datasets, such as 160GB of geo data from the OpenStreetMap project.
        15 лет назад , @dbenz
         
          attendedwww2009
           
           
        •  

          This paper presents SOFIE, a system for automated on- tology extension. SOFIE can parse natural language docu- ments, extract ontological facts from them and link the facts into an ontology. SOFIE uses logical reasoning on the exist- ing knowledge and on the new knowledge in order to disam- biguate words to their most probable meaning, to reason on the meaning of text patterns and to take into account world knowledge axioms. This allows SOFIE to check the plau- sibility of hypotheses and to avoid inconsistencies with the ontology. The framework of SOFIE unites the paradigms of pattern matching, word sense disambiguation and ontolog- ical reasoning in one unified model. Our experiments show that SOFIE delivers high-quality output, even from unstruc- tured Internet documents.
          15 лет назад , @dbenz
           
            attendedwww2009
             
             
          •  

            Social bookmarking systems and their emergent information structures, known as folksonomies, are increasingly important data sources for Semantic Web applications. A key question for harvesting semantics from these systems is how to extend and adapt traditional notions of similarity to folksonomies, and which measures are best suited for applications such as navigation support, semantic search, and ontology learning. Here we build an evaluation framework to compare various general folksonomy-based similarity measures derived from established information-theoretic, statistical, and practical measures. Our framework deals generally and symmetrically with users, tags, and resources. For evaluation purposes we focus on similarity among tags and resources, considering different ways to aggregate annotations across users. After comparing how tag similarity measures predict user-created tag relations, we provide an external grounding by user-validated semantic proxies based on WordNet and the Open Directory. We also investigate the issue of scalability. We find that mutual information with distributional micro-aggregation across users yields the highest accuracy, but is not scalable; per-user projection with collaborative aggregation provides the best scalable approach via incremental computations. The results are consistent across resource and tag similarity.
            15 лет назад , @dbenz