9781449327279 - Hadoop Operations - If you've been tasked with maintaining large and complex Hadoop clusters, or are about to be, this book is a must. You'll learn the particulars of Hadoop operations, from planning, installing, and configuring the system to providing ongoing maintenance.
The Cloudera Solutions team shares its insights into getting the most out of your Hadoop deployment. A webinar recording is available at www.cloudera.com/events
From the blog hadoopnew: Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it …
Chipmaker Intel has released its own Hadoop distribution, tuned specifically for its own processors. It is said to analyze data significantly faster, especially …
Spark is a fast, in-memory cluster computing framework with a language-integrated interface in Scala. It shines at iterative MapReduce (e.g. machine learning) and interactive data mining, where keeping data in memory provides substantial speedups.
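A minimal sketch of the caching behind those speedups, written against Spark's Java API (the language-integrated Scala interface mentioned above expresses the same thing more tersely); the master URL, application name, and input path are hypothetical:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class LogMiner {
      public static void main(String[] args) {
        // Hypothetical master URL, app name, and HDFS path.
        JavaSparkContext sc = new JavaSparkContext("local[2]", "LogMiner");
        JavaRDD<String> errors = sc.textFile("hdfs://namenode/logs/app.log")
            .filter(new Function<String, Boolean>() {
              public Boolean call(String line) { return line.contains("ERROR"); }
            })
            .cache(); // keep the filtered set in memory for reuse

        // The first action materializes the RDD (reads from disk, then caches).
        long total = errors.count();
        // Subsequent actions over the same RDD are served from memory.
        long timeouts = errors.filter(new Function<String, Boolean>() {
          public Boolean call(String line) { return line.contains("timeout"); }
        }).count();
        System.out.println(total + " errors, " + timeouts + " timeouts");
        sc.stop();
      }
    }

The second count is where the iterative speedup shows up: it never touches HDFS again.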
Hadoop On Demand (HOD) is a system for provisioning virtual Hadoop clusters over a large physical cluster. It uses the Torque resource manager to do node allocation. On the allocated nodes, it can start Hadoop Map/Reduce and HDFS daemons. It automatically generates the appropriate configuration files (hadoop-site.xml) for the Hadoop daemons and client. HOD also has the capability to distribute Hadoop to the nodes in the virtual cluster that it allocates. In short, HOD makes it easy for administrators and users to quickly set up and use Hadoop. It is also a very useful tool for Hadoop developers and testers who need to share a physical cluster for testing their own Hadoop versions.
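As a hedged illustration of what "automatically generates the appropriate configuration files" means for a client: the sketch below loads a HOD-generated hadoop-site.xml into a Hadoop Configuration to talk to the provisioned cluster. The cluster directory path is hypothetical; the property names are the classic Hadoop keys.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HodClusterClient {
      public static void main(String[] args) throws Exception {
        // HOD drops a generated hadoop-site.xml into the cluster directory it
        // allocates; loading it points this client at the provisioned daemons.
        // The cluster directory below is hypothetical.
        Configuration conf = new Configuration();
        conf.addResource(new Path("/home/alice/hod-clusters/test/hadoop-site.xml"));
        System.out.println("JobTracker: " + conf.get("mapred.job.tracker"));
        System.out.println("NameNode:   " + conf.get("fs.default.name"));
        FileSystem fs = FileSystem.get(conf); // HOD-provisioned HDFS
        System.out.println("Home dir:   " + fs.getHomeDirectory());
      }
    }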
HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase's goal is the hosting of very large tables -- billions of rows × millions of columns -- atop clusters of commodity hardware. Try it if your plans for a data store run big.
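A hedged sketch of what that Bigtable-like access looks like through the HBase Java client (the table, column family, and qualifier names here are made up): every row is addressed by key, and each cell lives at (row key, column family, column qualifier).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WebTableExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable"); // hypothetical table name

        // Write one cell: row key -> (family, qualifier) -> value.
        Put put = new Put(Bytes.toBytes("com.example/index.html"));
        put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Read it back by row key.
        Get get = new Get(Bytes.toBytes("com.example/index.html"));
        Result result = table.get(get);
        byte[] html = result.getValue(Bytes.toBytes("contents"),
                                      Bytes.toBytes("html"));
        System.out.println(Bytes.toString(html));
        table.close();
      }
    }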
HBase: Bigtable-like structured storage for Hadoop HDFS. Just as Google's Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop Core. Data is organized into tables, rows, and columns. An Iterator-like interface is available for scanning through a row range (and, of course, a column value can be retrieved for a specific key). Any particular column may have multiple versions for the same row key.
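The Iterator-like interface for scanning a row range corresponds to the client's Scan/ResultScanner pair. A minimal sketch, assuming the HTable handle named table and the contents family from the previous example; the row-key range is made up:

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Scan the half-open row range [com.example/, com.exampleX) for one column,
    // asking for up to three versions of each cell.
    Scan scan = new Scan(Bytes.toBytes("com.example/"), Bytes.toBytes("com.exampleX"));
    scan.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
    scan.setMaxVersions(3);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) { // rows come back in key order
        System.out.println(Bytes.toString(row.getRow()));
      }
    } finally {
      scanner.close();
    }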
Apache's Hadoop project aims to solve these problems by providing a framework for running large data processing applications on clusters of commodity hardware. Combined with Amazon EC2 for running the application and Amazon S3 for storing the data, we can run large jobs very economically. This paper describes how to use Amazon Web Services and Hadoop to run an ad hoc analysis on a large collection of web access logs that otherwise would have cost a prohibitive amount in either time or money.
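In practice, pointing a Hadoop job at S3-resident logs is mostly a matter of the input path: Hadoop's native S3 filesystem registers the s3n:// scheme, so a job driver can read a bucket directly. A hedged sketch (the bucket and prefixes are hypothetical; credentials are assumed to be supplied via configuration):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class S3LogJob {
      public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJobName("access-log-analysis");
        // Hypothetical bucket; s3n:// is Hadoop's native S3 filesystem scheme.
        FileInputFormat.addInputPath(job, new Path("s3n://my-log-bucket/access-logs/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://my-log-bucket/output/"));
        // AWS credentials go in fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey.
        // With no mapper or reducer set, the identity defaults simply copy records.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }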
Introduction: This document describes how Map and Reduce operations are carried out in Hadoop. If you are not familiar with the Google MapReduce programming model, you should get acquainted with it first.
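For readers new to the model, here is a minimal word-count sketch of the two operations in Hadoop's Java MapReduce API (the class names are made up; the map and reduce signatures are the framework's own):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in the input line.
    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (token.isEmpty()) continue;
          word.set(token);
          context.write(word, ONE);
        }
      }
    }

    // Reduce: the framework groups values by key; sum the counts for each word.
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
      }
    }

Between the two phases the framework sorts and shuffles the mapper output so that all (word, 1) pairs for the same word arrive at a single reduce call.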