The concept of a graph has been around since the dawn of mechanical computing and for many decades prior in the domain of pure mathematics. Due in large part to this golden age of databases, graphs are becoming increasingly popular in software engineering. Graph databases provide a way to persist and process graph data. However,…
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.
Based on the work of Apache CouchDB, PouchDB provides a simple API in which to store and retrieve JSON objects, due to the similiar API, and CouchDB’s HTTP API it is possible to sync data that is stored in your local PouchDB to an online CouchDB as well as syncing data from CouchDB down to PouchDB (you can even sync between 2 PouchDB databases).
Drools 5 introduces the Business Logic integration Platform which provides a unified and integrated platform for Rules, Workflow and Event Processing. It's been designed from the ground up so that each aspect is a first class citizen, with no compromises.
FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.
Pattern is a web mining module for the Python programming language.
It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, KNN, SVM), and data visualization (graph networks).
Atom Interface is a novel interactive visualization of single/multiple tree structures. It is based on the metaphor of electrons, atoms and molecules. For mo...
DB2 Graph Store is an optimized way to store graph triples inside DB2 database. Support for the SPARQL query language
Support for popular RDF Java APIs like JENA
Support for HTTP SPARQL end-point via JOSEKI
Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.
We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.
Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault-tolerance to the coordinator process with the use of ZooKeeper as its centralized coordination service.
Giraph follows the bulk-synchronous parallel model relative to graphs where vertices can send messages to other vertices during a given superstep. Checkpoints are initiated by the Giraph infrastructure at user-defined intervals and are used for automatic application restarts when any worker in the application fails. Any worker in the application can act as the application coordinator and one will automatically take over if the current application coordinator fails.
SparQLed is an interactive SPARQL editor that provides context-aware recommendations, helping users in formulating complex SPARQL queries across multiple heterogeneous data sources.
Tulip is an information visualization framework dedicated to the analysis and visualization of relational data. Tulip aims to provide the developer with a complete library, supporting the design of interactive information visualization applications for relational data that can be tailored to the problems he or she is addressing.
B. Elliott, E. Cheng, C. Thomas-Ogbuji, and Z. Ozsoyoglu. Proceedings of the 2009 International Database Engineering & Applications Symposium, page 31--42. New York, NY, USA, ACM, (2009)
D. Shahaf, C. Guestrin, and E. Horvitz. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, page 1122--1130. New York, NY, USA, ACM, (2012)
G. Forman, and E. Kirshenbaum. CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, page 1221--1230. New York, NY, USA, ACM, (2008)
D. Lin. Proceedings of the 17th international conference on Computational linguistics, page 768--774. Morristown, NJ, USA, Association for Computational Linguistics, (1998)
M. Banko, and E. Brill. ACL '01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, page 26--33. Morristown, NJ, USA, Association for Computational Linguistics, (2001)