Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion.
Apache's Hadoop project aims to solve these problems by providing a framework for running large data processing applications on clusters of commodity hardware. Combined with Amazon EC2 to run the application and Amazon S3 to store the data, Hadoop lets us run large jobs very economically. This paper describes how to use Amazon Web Services and Hadoop to run an ad hoc analysis on a large collection of web access logs, a job that would otherwise have been prohibitively expensive in time or money.
In late 2004, Google surprised the world of computing with the release of the paper "MapReduce: Simplified Data Processing on Large Clusters". That paper ushered in a new model for data processing across clusters of machines, one that is both simple to understand and remarkably flexible. Once you adopt a MapReduce way of thinking, dozens of previously difficult or long-running tasks suddenly start to seem approachable, provided you have sufficient hardware.
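To make the MapReduce way of thinking concrete, here is a minimal single-process sketch in Python that counts requests per URL path in web access logs. It is not Hadoop code and is not taken from the paper: the function names (map_phase, shuffle, reduce_phase, run_job), the sample log lines, and the assumption that the logs are in Common Log Format are all illustrative. On a real cluster, the framework would perform the shuffle step and run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit (path, 1) for one access-log line.

    Assumes Common Log Format, where the request path is the 7th
    whitespace-separated field. Lines that are too short are skipped.
    """
    parts = line.split()
    if len(parts) >= 7:
        yield parts[6], 1


def shuffle(pairs):
    """Group values by key (a framework would do this across machines)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one path."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data, made up for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jan/2009:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2009:00:00:02 +0000] "GET /about.html HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Jan/2009:00:00:03 +0000] "GET /index.html HTTP/1.1" 304 0',
]

if __name__ == "__main__":
    for path, hits in sorted(run_job(SAMPLE_LOG).items()):
        print(path, hits)
```

Because the map step looks at one line at a time and the reduce step looks at one key at a time, both can be split across machines without changing the logic. That independence is what makes the model simple to reason about.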
A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 334-342. (2011)
J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. International Semantic Web Conference, volume 5823 of Lecture Notes in Computer Science, pages 634-649. Springer, (2009)
M. Bayir, I. Toroslu, A. Cosar, and G. Fidan. WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 161-170. New York, NY, USA, ACM, (2009)
M. Becker, H. Mewes, A. Hotho, D. Dimitrov, F. Lemmerich, and M. Strohmaier. International Conference Companion on World Wide Web, pages 17-18. Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee, (2016)
C. Bellettini, M. Camilli, L. Capra, and M. Monga. 2012 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pages 295-302. IEEE Computer Society, (September 2012)
P. Ravindra, V. Deshpande, and K. Anyanwu. MDAC '10: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, pages 1-6. New York, NY, USA, ACM, (2010)
P. Pantel, E. Crestan, A. Borkovsky, A. Popescu, and V. Vyas. EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 938-947. Morristown, NJ, USA, Association for Computational Linguistics, (2009)