Data analytics is becoming increasingly prominent in a variety
of application areas ranging from extracting business intelligence
to processing data from scientific studies. The MapReduce
programming paradigm lends itself well to these data-intensive
analytics jobs, given its ability to scale out and leverage several
machines to process data in parallel. In this work we argue
that such MapReduce-based analytics are particularly synergistic
with the pay-as-you-go model of a cloud platform. However,
a key challenge facing end-users in this environment is
provisioning MapReduce applications so as to minimize
the incurred cost while obtaining the best performance. This
paper first motivates the importance of optimally provisioning a
MapReduce job, and demonstrates that existing approaches can
result in far from optimal provisioning. We then present a preliminary
approach that improves MapReduce provisioning by
analyzing and comparing resource consumption of the application
at hand with a database of similar resource consumption
signatures of other applications.
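
As a rough illustration of the signature-matching idea, the following minimal sketch looks up the stored application whose resource consumption signature is closest to that of a new job and returns the configuration recorded for it. Everything in it is an assumption made for illustration: the signature is reduced to a (CPU, disk I/O, network) utilization triple, similarity is plain Euclidean distance, and the database entries, the names `SIGNATURE_DB`, `distance`, and `closest_match`, and the configuration values are hypothetical placeholders rather than the representation or measurements used in this work.

```python
import math

# Hypothetical signature database (illustrative values only):
# application name -> (signature, configuration recorded for that application).
# A signature is assumed to be a (cpu, disk_io, network) utilization triple in [0, 1].
SIGNATURE_DB = {
    "sort": ((0.30, 0.90, 0.60), {"map_slots": 4, "reduce_slots": 2}),
    "grep": ((0.60, 0.70, 0.10), {"map_slots": 6, "reduce_slots": 1}),
    "pi_estimate": ((0.95, 0.05, 0.05), {"map_slots": 8, "reduce_slots": 1}),
}


def distance(a, b):
    """Euclidean distance between two equal-length signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_match(signature, db=SIGNATURE_DB):
    """Return (name, config) of the stored application nearest to `signature`."""
    name = min(db, key=lambda n: distance(signature, db[n][0]))
    return name, db[name][1]


if __name__ == "__main__":
    # A CPU-heavy job with little I/O should match the CPU-bound entry.
    name, config = closest_match((0.90, 0.10, 0.10))
    print(name, config)
```

A real signature would likely be a time series gathered while the job runs on a sample of its input, and the distance measure would need to account for that; the nearest-neighbour lookup is only the simplest stand-in for the comparison step.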