Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion.
Apache's Hadoop project aims to solve these problems by providing a framework for running large data processing applications on clusters of commodity hardware. Combined with Amazon EC2 for running the application, and Amazon S3 for storing the data, we can run large jobs very economically. This paper describes how to use Amazon Web Services and Hadoop to run an ad hoc analysis on a large collection of web access logs that otherwise would have cost a prohibitive amount in either time or money.
In late 2004, Google surprised the world of computing with the release of the paper "MapReduce: Simplified Data Processing on Large Clusters". That paper ushered in a new model for data processing across clusters of machines, one that had the benefit of being simple to understand and incredibly flexible. Once you adopt a MapReduce way of thinking, dozens of previously difficult or long-running tasks suddenly start to seem approachable, provided you have sufficient hardware.
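To make the "MapReduce way of thinking" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to web access logs. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, run_job), the sample log lines, and the assumption that each line follows a Common Log Format-like layout are all illustrative.

```python
from collections import defaultdict

# Hypothetical sample input: lines in a Common Log Format-like layout.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jan/2008:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2008:00:00:02 +0000] "GET /about.html HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jan/2008:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 512',
]


def map_phase(line):
    """Map step: emit a (requested_path, 1) pair for one log line.

    Lines that do not contain a quoted request with at least a method
    and a path are skipped (an empty list is returned).
    """
    parts = line.split('"')
    if len(parts) < 2:
        return []
    request = parts[1].split()
    if len(request) < 2:
        return []
    return [(request[1], 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence in a single process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    for path, hits in sorted(run_job(SAMPLE_LOG).items()):
        print(path, hits)
```

In a real cluster framework, the map and reduce functions would run in parallel on different machines and the shuffle would move data between them; here everything runs in one process purely to show the shape of the computation.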
C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pages 281--288. MIT Press, (2006)
H.-C. Yang, A. Dasdan, R. Hsiao, and D. Parker. SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029--1040. New York, NY, USA, ACM, (2007)
D. Hiemstra and C. Hauff. Multilingual and Multimodal Information Access Evaluation, volume 6360 of Lecture Notes in Computer Science, pages 64--69. Berlin, Springer Verlag, (2010)
T. Sandholm and K. Lai. SIGMETRICS '09: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, pages 299--310. New York, NY, USA, ACM, (2009)