We started High Scalability to help you build successful scalable websites. This site tries to bring together all the lore, art, science, practice, and experience of building scalable websites into one place so you can learn how to build your system with confidence.
February 17th, 2008. A 'problem' with the Mongrel/Rails platform: a request comes in to the web server and, if it's dynamic, falls through to a waiting Mongrel process. Mongrel calls Rails, which wraps a big lock around most of the request/response cycle, so the Mongrel that's serving the request has to wait for a response from Rails to unlock before it can serve other requests. This is the Blocked Thread stability anti-pattern that Michael Nygard describes in Release It!. The problem is that Nginx or Apache doesn't know that a Mongrel is blocked and keeps blindly sending it requests. That means even a single blocked Mongrel will produce slow responses for other requests routed to the same Mongrel. Cut the request fat. One way to do that is to follow the "Skinny Request, Fat Backend" principle, which in practical terms is simple: do as little as possible inside the request/response cycle.
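A minimal sketch of the skinny-request idea, written in Python rather than the article's Rails/Mongrel stack: the request handler validates input and enqueues a job, and a separate worker does the slow work outside the request/response cycle. All names here (handle_request, process_order, the in-process queue) are illustrative assumptions, not part of any Rails or Mongrel API.

```python
import queue
import threading

work_queue = queue.Queue()

def handle_request(order_id):
    # Inside the request: do the bare minimum, hand off the "fat" work,
    # and return immediately so the process can serve the next request.
    work_queue.put(order_id)
    return {"status": "accepted", "order": order_id}

def process_order(order_id):
    # Placeholder for the expensive part: image resizing, email, slow
    # third-party calls, report generation, and so on.
    pass

def backend_worker():
    # Outside the request: drain the queue and do the heavy lifting.
    while True:
        order_id = work_queue.get()
        process_order(order_id)
        work_queue.task_done()

threading.Thread(target=backend_worker, daemon=True).start()
```

In production the in-process queue would typically be replaced by a persistent job queue so the backend work survives process restarts, but the division of labor is the same.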
In the months prior to leaving Heavy, I led an exciting project to build a hosting platform for our online products on top of Amazon’s Elastic Compute Cloud (EC2). We eventually launched our newest product at Heavy using EC2 as the primary hosting platform. I’ve been following a lot of what other people have been doing with EC2 for data processing and handling big encoding or rendering jobs. We set out to build a fairly standard LAMP hosting infrastructure where we could easily and quickly add additional capacity. In fact, we can add new servers to our production pool in under 20 minutes, from the time we call the “run instance” API at EC2, to the time when public traffic begins hitting the new server. This includes machine startup time, adding custom server config files and cron jobs, rolling out application code, running smoke tests, and adding the machine to public DNS. What follows is a general outline of how we do this.
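The article predates today's SDKs, but the sequence is easy to sketch with the modern boto3 EC2 client; the AMI ID, key pair, instance type, and helper names below are placeholders for illustration, not Heavy's actual setup.

```python
import boto3

# Sketch of the provisioning flow described above, using boto3 (an assumption;
# the original work used EC2's 2008-era APIs). All identifiers are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-00000000",     # placeholder: custom LAMP machine image
    InstanceType="m1.small",
    MinCount=1,
    MaxCount=1,
    KeyName="prod-key",         # placeholder key pair
)
instance_id = resp["Instances"][0]["InstanceId"]

# Step 1: wait for machine startup.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
host = ec2.describe_instances(InstanceIds=[instance_id]) \
          ["Reservations"][0]["Instances"][0]["PublicDnsName"]

# Steps 2-5, as hypothetical helpers matching the outline above:
# push_config_and_cron(host); deploy_application_code(host)
# run_smoke_tests(host); add_to_public_dns(host)
```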
Disco is an open source implementation of the Map-Reduce framework for distributed computing. Disco supports parallel computations over large data sets on unreliable clusters of computers. The Disco core is written in Erlang. Users of Disco typically write jobs in Python, which makes it possible to express even complex algorithms or data processing tasks in only tens of lines of code. This means you can quickly write scripts to process massive amounts of data. Disco was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. So far Disco has been successfully used, for instance, in parsing and reformatting data, data clustering, probabilistic modelling, data mining, full-text indexing, and log analysis with hundreds of gigabytes of real-world data. Linux is the only supported platform, but you can run Disco in Amazon's Elastic Compute Cloud.
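As a sense of scale for "tens of lines", here is a word-count job modeled on the example in Disco's documentation; the module paths and signatures (disco.core.Job, result_iterator, disco.util.kvgroup) are taken from later Disco releases and may differ from the version described in the post.

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # Sum the counts per word; kvgroup groups a sorted (key, value) stream.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```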
A mainly marketing article by a co-founder. What if you didn't have to do any of this funny business to get scalability and reliability? What if the JVM had access to a service that you could plug into to make its heap durable, arbitrarily large, and shared with every other JVM in your application tier? Enter Terracotta: network-attached, durable virtual heap for the JVM. In the spirit of full disclosure, I'm a co-founder of Terracotta and work there as a software developer. Terracotta is an infrastructure service that is deployed as a stand-alone server plus a library that plugs into your existing JVMs and transparently clusters your JVM's heap. Terracotta makes some of your JVM heap shared via a network connection to the Terracotta server, so that a group of JVMs can all access the shared heap as if it were local heap. You can think of it like a network-attached filesystem, but for your object data.
Relativistic programming (RP) has extremely good performance and scalability properties. Many uses of RP in the Linux kernel have resulted in performance improvements of several orders of magnitude compared to locking and transactional memory. Is it easy to program with? RP is not difficult to program with. Allowing each execution sequence to proceed using its own view of memory, by default, simplifies concurrent programming because it prevents memory from changing spontaneously. Threads are guaranteed to observe coherent memory, i.e., memory will contain values that were actually written at some time in the past. Read paths also exhibit deterministic performance characteristics, since they cannot block or retry. This feature simplifies the programming of time-sensitive systems. Nevertheless, RP is a new programming paradigm with a new interface, and there are several ways to misuse it. Read-Copy-Update (RCU), an early example of RP, has seen extensive use in the Linux kernel, with over 2,000 uses.
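A conceptual sketch of the read-copy-update pattern at the heart of RP, written in Python purely for illustration (real RCU is a C API inside the Linux kernel): readers load a reference to an immutable snapshot and never block, while a writer copies the data, modifies the copy, and atomically publishes it. All names below are illustrative.

```python
import threading

# Conceptual sketch of read-copy-update, not the kernel RCU API.
_snapshot = {"route": "10.0.0.1"}   # published version, treated as immutable
_writer_lock = threading.Lock()     # writers coordinate only with each other

def reader():
    # Read path: a single reference load, no locks, cannot block or retry,
    # and always sees a coherent snapshot that was valid at some point.
    view = _snapshot
    return view["route"]

def writer(key, value):
    # Write path: copy the current version, update the copy, then publish it
    # with an atomic reference swap. Readers already in flight keep their old
    # snapshot; garbage collection stands in for RCU's grace-period reclamation.
    global _snapshot
    with _writer_lock:
        new_version = dict(_snapshot)
        new_version[key] = value
        _snapshot = new_version
```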
Google Maps, Yahoo! Mail, Facebook, MySpace, YouTube, and Amazon are examples of Web sites built to scale. They serve petabytes of data, sending terabits per second to millions of users worldwide. The magnitude is awe-inspiring. Users view these large-scale Web sites from a narrower perspective. The typical user has megabytes of data that are downloaded at a few hundred kilobits per second. Users are not so interested in the massive number of requests per second being served; they care more about their individual requests. As they use these Web applications, they inevitably ask the same question: "Why is this site so slow?" The answer hinges on where development teams focus their performance improvements. Performance for the sake of scalability is rightly focused on the back end. Database tuning, replicating architectures, customized data caching, and so on allow Web servers to handle a greater number of requests.
James Hamilton has published a thorough summary of Facebook's Cassandra, another scalable key-value store for your perusal. It's open source and is described as a "BigTable data model running on a Dynamo-like infrastructure." Cassandra is used at Facebook as an email search system containing 25 TB of data and over 100 million mailboxes.
# Google Code for Cassandra - A Structured Storage System on a P2P Network
# SIGMOD 2008 Presentation
# Video Presentation at Facebook
# Facebook Engineering Blog for Cassandra
# Anti-RDBMS: A list of distributed key-value stores
# Facebook Cassandra Architecture and Design by James Hamilton