In the months before leaving Heavy, I led a project to build a hosting platform for our online products on top of Amazon’s Elastic Compute Cloud (EC2), and we eventually launched our newest product at Heavy with EC2 as the primary hosting platform. I’ve been following a lot of what other people have been doing with EC2 for data processing and big encoding or rendering jobs; we set out instead to build a fairly standard LAMP hosting infrastructure to which we could quickly and easily add capacity. In fact, we can add a new server to our production pool in under 20 minutes, from the moment we call the “run instance” API at EC2 to the moment public traffic begins hitting the new server. That window includes machine startup, adding custom server config files and cron jobs, rolling out application code, running smoke tests, and adding the machine to public DNS. What follows is a general outline of how we do this.
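To make the first step concrete, here is a minimal sketch of calling the “run instance” API from Python using the boto library. The post doesn’t say which client we used, so boto is just one option; the AMI ID, key pair name, security group, and credentials below are all placeholders.

```python
# Sketch: launch one EC2 instance from a pre-built machine image
# using boto. All identifiers here are hypothetical placeholders.
from boto.ec2.connection import EC2Connection

conn = EC2Connection('AWS_ACCESS_KEY', 'AWS_SECRET_KEY')

reservation = conn.run_instances(
    'ami-12345678',             # hypothetical AMI ID for our LAMP image
    min_count=1,
    max_count=1,
    key_name='production-key',  # hypothetical SSH key pair
    instance_type='m1.small',
    security_groups=['web'])    # hypothetical security group

instance = reservation.instances[0]
print("Launched %s (%s)" % (instance.id, instance.state))
```

Once the instance reports as running, the remaining steps (config rollout, smoke tests, DNS) happen against its public hostname.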
Disco is an open-source implementation of the Map-Reduce framework for distributed computing. Disco supports parallel computation over large data sets on unreliable clusters of computers. The Disco core is written in Erlang. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks in only tens of lines of code, meaning you can quickly write scripts to process massive amounts of data. Disco was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. So far Disco has been used successfully in, for instance, parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data. Linux is the only supported platform, but you can also run Disco on Amazon’s Elastic Compute Cloud.
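To give a feel for how concise a Disco job can be, here is a sketch of the classic word-count example, modeled on the tutorial in Disco’s documentation. The master address and input URL are placeholders; it assumes a Disco master is already running.

```python
# Sketch of a Disco word-count job (master address and input URL
# are hypothetical placeholders).
from disco.core import Disco, result_iterator

def fun_map(line, params):
    # Emit a (word, 1) pair for every word on an input line.
    return [(word, 1) for word in line.split()]

def fun_reduce(iter, out, params):
    # Sum the counts collected for each word.
    totals = {}
    for word, count in iter:
        totals[word] = totals.get(word, 0) + int(count)
    for word, total in totals.items():
        out.add(word, total)

# Submit the job to the master and block until it finishes.
results = Disco("disco://localhost").new_job(
    name="wordcount",
    input=["http://example.com/bigfile.txt"],
    map=fun_map,
    reduce=fun_reduce).wait()

for word, total in result_iterator(results):
    print("%s: %s" % (word, total))
```

The map and reduce functions are ordinary Python, which is what makes it practical to express a whole data processing task in a few dozen lines.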