Our customers have used Amazon Elastic MapReduce to process very large-scale data sets on arrays of Amazon EC2 instances. One such customer, Seattle-based Razorfish, was able to sidestep a capital investment of over $500K while also speeding up their daily processing cycle (read more in our Razorfish case study).
Our implementation makes it easy for you to create and run a complex multi-stage job flow composed of multiple job steps. Until now, a job flow ran with a fixed number of EC2 instances (known as slave nodes in Hadoop terminology) from start to finish. Going forward, you have more control over the number of instances in the job flow:
- You can add nodes to a running job flow to speed it up. This is similar to throwing more logs on a fire or calling down to the engine room with a request for "more power!" Of course, you can also remove nodes from a running job flow.
- A special "resize" step can be used to change the number of nodes between steps in a flow. This allows you to tune your overall job to make sure that it runs as quickly and as cost-efficiently as possible.
- As a really nice side effect of being able to add nodes to a running job flow, Elastic MapReduce will now automatically provision a new slave node if an existing one fails.
You can initiate these changes using the Elastic MapReduce APIs, the command line tools, or the AWS SDK for Java. You can also monitor the overall size and status of each job from the AWS Management Console.
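To make the resize operation concrete, here is a minimal sketch of the request that a `ModifyInstanceGroups` API call carries when you change the node count of a running job flow. The helper function, the instance group ID `ig-EXAMPLE`, and the target count are illustrative placeholders, not part of any SDK; in practice you would hand this payload to the Elastic MapReduce API via the command line tools or the AWS SDK for Java.

```python
# Sketch: build the payload for an EMR ModifyInstanceGroups request,
# which changes the number of slave nodes in a running job flow.
# The instance group ID below is a placeholder, not a real resource.

def modify_instance_groups_payload(group_id, instance_count):
    """Return the request body asking EMR to resize one instance group."""
    if instance_count < 0:
        raise ValueError("instance count must be non-negative")
    return {
        "InstanceGroups": [
            {"InstanceGroupId": group_id, "InstanceCount": instance_count}
        ]
    }

# Grow a hypothetical instance group to 10 nodes; shrinking works the
# same way with a smaller count.
payload = modify_instance_groups_payload("ig-EXAMPLE", 10)
print(payload)
```

The same payload shape covers both the speed-up case (raise the count mid-flow) and the cost-tuning case (lower it between steps).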
We've got a number of other enhancements to Elastic MapReduce in the works, so stay tuned to this blog.