Today we are introducing Amazon Elastic MapReduce , our new Hadoop-based processing service. I'll spend a few minutes talking about the generic MapReduce concept and then I'll dive in to the details of this exciting new service.
Over the past 3 or 4 years, scientists, researchers, and commercial developers have recognized and embraced the MapReduce programming model. Originally described in a landmark paper, the MapReduce model is ideal for processing large data sets on a cluster of processors. It is easy to scale up a MapReduce application to jobs of arbitrary size by simply adding more compute power. Here's a very simple overview of the data flow in a typical MapReduce job:
Given that you have enough computing hardware, MapReduce takes care of splitting up the input data into chunks of more or less equal size, spinning up a number of processing instances for the map phase (which must, by definition, be something that can be broken down into independent, parallelizable work units) apportioning the data to each of the mappers, tracking the status of each mapper, routing the map results to the reduce phase, and finally shutting down the mappers and the reducers when the work has been done. It is easy to scale up MapReduce to handle bigger jobs or to produce results in a shorter time by simply running the job on a larger cluster.
Hadoop is an open source implementation of the MapReduce programming model. If you've got the hardware, you can follow the directions in the Hadoop Cluster Setup documentation and, with some luck, be up and running before too long.
Developers the world over seem to think that the MapReduce model is easy to understand and easy to work in to their thought process. After a while they tend to report that they begin to think in terms of the new style, and then see more and more applications for it. Once they start to show that the model has a genuine business model (e.g. better results, faster) demand for hardware resources increases rapidly. Like any true viral success, one team shows great results and before too long everyone in the organization wants to do something similar. For example, Yahoo! uses Hadoop on a very large scale. A little over a year ago they reported that they were able to use the power of over 10,000 processor cores to generate a web map to power Yahoo! Search.
Over the past year or two a number of our customers have told us that they are running large Hadoop jobs on Amazon EC2. There's some good info on how to do this here and also here. AWS Evangelist Jinesh Varia covered the concept in a blog post last year, and also went into considerable detail in his Cloud Architectures white paper.
Given our belief in the power of the MapReduce programming style and the knowledge that many developers are already running Hadoop jobs of impressive size in our cloud, we wanted to find a way to make this important technology accessible to even more people.
Processing in Elastic MapReduce is centered around the concept of a Job Flow. Each Job Flow can contain one or more Steps. Each step inhales a bunch of data from Amazon S3, distributes it to a specified number of EC2 instances running Hadoop (spinning up the instances if necessary), does all of the work, and then writes the results back to S3. Each step must reference application- specific "mapper" and/or "reducer" code (Java JARs or scripting code for use via the Streaming model). We've also included the Aggregate Package with built-in support for a number of common operations such as Sum, Min, Max, Histogram, and Count. You can get a lot done before you even start to write code!
We're providing three distinct access routes to Elastic MapReduce. You have complete control via the Elastic MapReduce API, you can use the Elastic MapReduce command-line tools, or you can go all point-and-click with the Elastic MapReduce tab within the AWS Management Console! Let's take a look at each one.
The Elastic MapReduce API represents the fundamental, low-level entry point into the system. Action begins with the RunJobFlow function. This call is used to create a Job Flow with one or more steps inside. It accepts an EC2 instance type, an EC2 instance count, a description of each step (input bucket, output bucket, mapper, reducer, and so forth) and returns a Job Flow Id. This one call is equivalent to buying, configuring, and booting up a whole rack of hardware. The call itself returns in a second or two and the job is up and running in a matter of minutes. Once you have a Job Flow Id, you can add additional processing steps (while the job is running!) using AddJobFlowSteps. You can see what's running with DescribeJobFlows, and you can shut down one or more jobs using TerminateJobFlows.
The Elastic MapReduce client is a command-line tool written in Ruby. The client can invoke each of the functions I've already described. You can create, augment, describe, and terminate Job Flows from the command line.
Finally, you can use the new Elastic MapReduce tab of the AWS Management Console to create, augment, describe, and terminate job flows from the comfort of your web browser! Here are a few screen shots to whet your appetite:
I'm pretty psyched about the fact that we are giving our users access to such a powerful programming model in a form that's really easy to use. Whether you use the console, the API, or the command-line tools, you'll be able to focus on the job at hand instead of spending your time wandering through dark alleys in the middle of the night searching for more hardware.
What do you think? Is this cool, or what?