We've updated Amazon Elastic MapReduce with support for the latest and greatest versions of several components, including Hadoop. The new version of Hadoop includes the following enhancements:
- MultipleInputs class for reading multiple types of data.
- MultipleOutputs class for writing multiple types of data.
- ChainMapper and ChainReducer classes, which allow users to perform M+RM* (one or more mappers, a reducer, then zero or more mappers) within one Hadoop job. Previously you could run only one mapper and one reducer per job.
- Ability to skip bad records that deterministically kill your process. This allows you to complete a job even if a few records cause your process to fail.
- JVM reuse across task boundaries. This should increase performance when processing small files.
- New MapReduce API. This introduces a new Context object that allows the API to evolve without backward-incompatible changes, so customers can write jobs that maintain compatibility beyond Hadoop 1.0.
- Support for bzip2 compression.
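To make the M+RM* idea concrete, here is a minimal sketch in plain Python (not the Hadoop ChainMapper/ChainReducer API) of a job that runs several map stages, a shuffle and reduce, and then further map stages over (key, value) pairs. The stage functions `tokenize`, `sum_counts`, and `uppercase` are hypothetical examples, not part of any Hadoop interface:

```python
from collections import defaultdict

def chain_job(records, pre_mappers, reducer, post_mappers):
    """Conceptual M+RM* pipeline: mappers -> shuffle -> reducer -> more mappers."""
    # Apply each "pre" mapper in sequence; a mapper yields (key, value) pairs.
    pairs = records
    for mapper in pre_mappers:
        pairs = [out for pair in pairs for out in mapper(*pair)]
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce each group, yielding (key, value) pairs.
    reduced = [out for key, values in groups.items() for out in reducer(key, values)]
    # Apply each "post" mapper to the reducer's output.
    for mapper in post_mappers:
        reduced = [out for pair in reduced for out in mapper(*pair)]
    return reduced

# Hypothetical word-count-style stages for illustration.
def tokenize(_, line):
    return [(word, 1) for word in line.split()]

def sum_counts(word, counts):
    return [(word, sum(counts))]

def uppercase(word, count):
    return [(word.upper(), count)]

result = chain_job([(0, "a b a"), (1, "b c")],
                   [tokenize], sum_counts, [uppercase])
# result is the word counts with uppercased keys, e.g. ("A", 2)
```

Before this feature, expressing the extra map stages required scheduling separate Hadoop jobs and materializing intermediate output between them.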
If you are thinking about implementing a large-scale data processing system on AWS, you may find the Razorfish case study of interest. They use Elastic MapReduce to analyze very large click-stream datasets without having to invest in their own infrastructure. The Elastic MapReduce Tutorial and the Elastic MapReduce Developer Guide will teach you what you need to know to do this type of work yourself.