I was doing some work and thought, "Wouldn't it be nice to have my own cluster?" I'm guessing not many people have those types of revelations, and probably fewer decide they should go ahead and solve the problem. I wanted a cheap, small, easy-to-pack, light, quiet, low-power cluster that I could set on my desk and not even think about.
Last week I moderated a webinar entitled Optimizing Performance for HPC: Part 2 - Interconnect with InfiniBand. It was a great presentation with a lot of practical information and good questions. If you missed it, it will be available for a few months, so you still have a chance to check it out. As part of the webinar, Vallard Benincosa of IBM mentioned that the speed of light was becoming an issue in network design. In engineering terms, that is referred to as a hard limit.
Traditionally, large scale-up servers used cache-coherent buses for inter-processor communications. These proprietary buses and servers are very costly and power-hungry. Today’s powerful x86 servers replace proprietary scale-up architectures with low-cost machines connected through high-speed, low-latency clustered interconnects. This article will take an in-depth view of their cost and power benefits compared to scale-up architectures, and explain that Ethernet can be tunneled through a PCI Express (PCIe) fabric to provide a very-high-performance, low-cost cluster interconnect suitable for storage IO.
Linux Magazine HPC Editor Douglas Eadline recently had a chance to discuss the current state of HPC clusters with Beowulf pioneer Don Becker, Founder and Chief Technical Officer, Scyld Software (now part of Penguin Computing). For those who may have come to the HPC party late, Don was a co-founder of the original Beowulf project, which is the cornerstone for commodity-based high-performance cluster computing. Don’s work in parallel and distributed computing began in 1983 at MIT’s Real Time Systems group. He is known throughout the international community of operating system developers for his contributions to networking software and as the driving force behind beowulf.org.
In late 2004, Google surprised the world of computing with the release of the paper MapReduce: Simplified Data Processing on Large Clusters. That paper ushered in a new model for data processing across clusters of machines that had the benefit of being simple to understand and incredibly flexible. Once you adopt a MapReduce way of thinking, dozens of previously difficult or long-running tasks suddenly start to seem approachable, provided you have sufficient hardware.
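To make the MapReduce way of thinking concrete, here is a minimal single-process sketch in Python using the classic word-count task. It is not Google's or Hadoop's implementation; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real system would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each map call sees only one document and each reduce call sees only one key, both steps could in principle be spread across machines without changing the functions themselves.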
Distributed Sage is a framework that allows one to do distributed computing from within Sage. It includes a server, client and workers as well as a set of classes that one can subclass from to write distributed computation jobs. It is designed to be used mainly for ‘coarsely’ distributed computations, i.e., computations where jobs do not have to communicate much with each other. This is also sometimes referred to as ‘grid’ computing.
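The idea of a 'coarsely' distributed computation can be sketched in plain Python. The example below does not use the Distributed Sage API; it uses the standard library's ThreadPoolExecutor as a stand-in for the workers, and the job (counting primes in a range) and the names count_primes and run_jobs are assumptions chosen for illustration. The point is that each job receives its input, runs alone, and returns a result without talking to any other job.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True


def count_primes(bounds):
    """One independent job: count primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))


def run_jobs(chunks, workers=4):
    """Hand each chunk to a worker; the jobs never communicate with each other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, chunks))


# Split the range 0..100 into four independent jobs and combine the results.
chunks = [(0, 25), (25, 50), (50, 75), (75, 100)]
total = sum(run_jobs(chunks))
```

Only the final sum needs all of the results, which is what makes this kind of workload a good fit for loosely coupled workers.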
Teiid is a data virtualization system that allows applications to use data from multiple, heterogeneous data stores.
Teiid is comprised of tools, components and services for creating and executing bi-directional data services. Through abstraction and federation, data is accessed and integrated in real-time across distributed data sources without copying or otherwise moving data from its system of record.
Ack. Ppython requires worker threads on each cluster node. I want an ssh private key (no p/w) solution. 1) Start parallel python execution server on all your remote computational nodes:
Rocks is an open-source Linux cluster distribution that enables end users to easily build computational clusters, grid endpoints and visualization tiled-display walls. Hundreds of researchers from around the world have used Rocks to deploy their own cluster (see the Rocks Cluster Register).
HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Try it if your plans for a data store run to big.
HBase: Bigtable-like structured storage for Hadoop HDFS
Just as Google's Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop Core. Data is organized into tables, rows and columns. An Iterator-like interface is available for scanning through a row range (and of course there is the ability to retrieve a column value for a specific key). Any particular column may have multiple versions for the same row key.
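The data model described above (rows, columns, versioned values, and range scans) can be illustrated with a small in-memory toy in Python. This is not the HBase client API; the class ToyTable and its methods put, get, and scan are hypothetical, and keeping row keys in sorted order is an assumption made here so that a row-range scan is well defined.

```python
import bisect


class ToyTable:
    """In-memory toy: row key -> column -> list of (timestamp, value) versions."""

    def __init__(self):
        self._rows = {}   # row_key -> {column: [(timestamp, value), ...]}
        self._keys = []   # row keys kept sorted so range scans are cheap

    def put(self, row_key, column, value, timestamp):
        """Store a new version of a column value for the given row key."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        versions = self._rows[row_key].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row_key, column, max_versions=1):
        """Return up to max_versions values for one column, newest first."""
        versions = self._rows.get(row_key, {}).get(column, [])
        return [value for _, value in versions[:max_versions]]

    def scan(self, start_row, stop_row):
        """Iterate over rows with start_row <= key < stop_row, newest values only."""
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, {col: vers[0][1] for col, vers in self._rows[key].items()}
```

A short usage example: after two put calls on the same row key and column with different timestamps, get returns the newest value by default, and passing max_versions=2 returns both.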
Apache's Hadoop project aims to solve these problems by providing a framework for running large data processing applications on clusters of commodity hardware. Combined with Amazon EC2 for running the application, and Amazon S3 for storing the data, we can run large jobs very economically. This paper describes how to use Amazon Web Services and Hadoop to run an ad hoc analysis on a large collection of web access logs that otherwise would have cost a prohibitive amount in either time or money.
F. Perteneder, M. Bresler, E. Grossauer, J. Leong, C. Rendl, and M. Haller. Proceedings of the 19th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion, pages 81-85. ACM, New York, NY, USA (2016). Event place: San Francisco, California, USA.