Pattern is a web mining module for the Python programming language.
It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, KNN, SVM), and data visualization (graph networks).
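The tf-idf and cosine similarity metrics mentioned above can be illustrated without the module itself. The following is a minimal standard-library sketch, not Pattern's API: the function names `tf_idf_vectors` and `cosine_similarity`, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document (documents are token lists)."""
    n = len(documents)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        # Weight = relative term frequency * log(inverse document frequency).
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, tokenized on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stocks fell on monday".split(),
]
vectors = tf_idf_vectors(docs)
sim_cats = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
```

With these inputs the two cat-related documents score higher than the cat/stocks pair, because they share more weighted terms.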
Atom Interface is a novel interactive visualization of single/multiple tree structures. It is based on the metaphor of electrons, atoms and molecules. For mo...
DB2 Graph Store is an optimized way to store graph triples inside a DB2 database. It offers:
Support for the SPARQL query language
Support for popular RDF Java APIs like JENA
Support for an HTTP SPARQL end-point via JOSEKI
Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.
We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.
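The contrast between batch rebuilding and small, independent mutations can be sketched with a toy inverted index. This is only an illustration of the idea of incremental processing, not Percolator's API or design: the class `IncrementalIndex`, the helper `rebuild_in_batch`, and the example URLs are hypothetical.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated one document at a time."""

    def __init__(self):
        self.documents = {}               # url -> set of terms currently indexed
        self.postings = defaultdict(set)  # term -> set of urls

    def update(self, url, text):
        """Apply one small mutation: (re)index a single document."""
        new_terms = set(text.lower().split())
        old_terms = self.documents.get(url, set())
        # Only touch postings whose membership actually changed.
        for term in old_terms - new_terms:
            self.postings[term].discard(url)
            if not self.postings[term]:
                del self.postings[term]
        for term in new_terms - old_terms:
            self.postings[term].add(url)
        self.documents[url] = new_terms

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def rebuild_in_batch(documents):
    """Batch baseline: rebuild the whole index from the full repository."""
    index = IncrementalIndex()
    for url, text in documents.items():
        index.update(url, text)
    return index


index = IncrementalIndex()
index.update("a.example", "graph processing at scale")
index.update("b.example", "incremental graph updates")
index.update("a.example", "batch processing at scale")  # the page changed

graph_hits = index.search("graph")
batch_hits = index.search("batch")
```

Each call to `update` does work proportional to the changed document, whereas `rebuild_in_batch` revisits the whole repository to reach the same state.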
Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault-tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.
Giraph follows the bulk-synchronous parallel model relative to graphs where vertices can send messages to other vertices during a given superstep. Checkpoints are initiated by the Giraph infrastructure at user-defined intervals and are used for automatic application restarts when any worker in the application fails. Any worker in the application can act as the application coordinator and one will automatically take over if the current application coordinator fails.
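The superstep model described above can be sketched in a few lines of single-process Python. This is not Giraph's Java API and it omits checkpointing, workers, and coordination entirely; the function `run_supersteps`, the maximum-value propagation task, and the sample graph are assumptions chosen to show how messages sent in one superstep are delivered in the next.

```python
def run_supersteps(graph, values, max_supersteps=50):
    """Propagate the maximum value through a graph in synchronous supersteps.

    graph:  vertex -> list of neighbouring vertices (all must be keys of graph)
    values: vertex -> initial numeric value
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])
    supersteps = 1
    # Keep going while any vertex still has messages to process.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: the vertex stays idle
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                # Only vertices whose value changed send new messages.
                for n in graph[vertex]:
                    outbox[n].append(best)
        inbox = outbox  # barrier: messages become visible in the next superstep
        supersteps += 1
    return values, supersteps


# Hypothetical chain graph a - b - c - d with arbitrary starting values.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
initial = {"a": 3, "b": 6, "c": 2, "d": 1}
final_values, steps = run_supersteps(graph, initial)
```

The computation stops once a superstep produces no messages, at which point every vertex in this connected graph holds the largest initial value.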
S. Baluja, D. Ravichandran, and D. Sivakumar. Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR 2009), INSTICC, (6-8 Oct 2009)
M. Grimnes. EKAW 2010 Demo & Poster Abstracts. 17th International Conference on Knowledge Engineering and Knowledge Management (EKAW-10), October 11-15, Lisbon, Portugal, (October 2010). Best Poster.
P. Teufl and G. Lackner. 10th International Conference on Knowledge Management and Knowledge Technologies, 1–3 September 2010, Messe Congress Graz, Austria, p. 18. (2010)
G. Grimnes, P. Edwards, and A. Preece. Proceedings of the 5th European Semantic Web Conference (ESWC 2008), volume 5021 of Lecture Notes in Computer Science, pp. 303-317. Springer, (2008)
J. Lehmann. Machine Learning and Data Mining in Pattern Recognition, 5th International Conference, MLDM 2007, Leipzig, Germany, July 18-20, 2007, Proceedings, volume 4571 of Lecture Notes in Computer Science, pp. 883-898. Springer, (2007)
K. Dellschaft and S. Staab. In Proceedings of the 5th International Semantic Web Conference (ISWC2006), volume 4273 of LNCS, Athens, GA, USA, (November 2006)