The Amazon Elastic MapReduce (EMR) team has been hard at work on a series of updates and new features. You now have access to Hadoop 2.2 and new versions of Hive, Pig, HBase, and Mahout. Cluster startup time has been reduced, S3DistCp (for data movement) has been augmented, and MapR M7 is now supported.
If you are already using Hadoop to do large-scale parallel processing, the value and appeal of these features should be self-evident! If you are new to this whole world and want to learn more, you might want to start by reading the Hadoop Tutorial. Then you can get started with EMR by following our step-by-step tutorial.
Elastic MapReduce now supports version 2.2 of Hadoop. Among other things, this release supports YARN, which allows Hadoop to behave in Platform as a Service (PaaS) fashion. We have a number of Hadoop Amazon Machine Images (AMIs) to choose from; consult our AMI version table to learn more.
Along with Hadoop 2.2, EMR’s latest Amazon Machine Image (3.0.0) also comes with HBase 0.94.7 and Mahout 0.8. HBase is a popular NoSQL database for Hadoop. EMR provides some special features for HBase, including backup and restore using S3. Mahout is an equally popular machine learning library for Hadoop. The latest EMR AMI runs on Amazon Linux (previous AMIs used Debian) and also includes major upgrades to Perl, PHP, Python, R, and Ruby. You can learn more about AMI 3.0.0 here.
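To make the AMI choice concrete, here is a minimal sketch of a cluster request that pins AMI 3.0.0. It only builds a dictionary using field names from the EMR RunJobFlow API (Name, AmiVersion, Instances); it does not call AWS. The helper name, the cluster name, the instance type, and the instance count are all illustrative assumptions, not recommendations.

```python
import json


def build_cluster_request(name, ami_version="3.0.0",
                          instance_type="m1.large", instance_count=3):
    """Return a dict shaped like an EMR RunJobFlow request.

    Hypothetical helper: the defaults (instance type, count) are
    placeholders chosen for illustration only.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "Name": name,
        # Selecting AMI 3.0.0 is what brings in Hadoop 2.2.
        "AmiVersion": ami_version,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


if __name__ == "__main__":
    # Print the request so it can be inspected before being sent anywhere.
    print(json.dumps(build_cluster_request("hadoop-2-2-test"), indent=2))
```

A dictionary like this could be handed to whichever SDK or tool you use to call RunJobFlow; check the AMI version table first to confirm which versions are valid for your account and region.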
Hive is a popular data warehouse system for Hadoop. It has been upgraded from version 0.8.1.8 to 0.11.0.1, gaining a number of features in the process including support for the Optimized Row Columnar (ORC) file format and additional windowing and analytics functions. Learn how to use Hive with Elastic MapReduce.
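If windowing functions are new to you, the idea is that each row gets a value computed over a "window" of related rows, without collapsing those rows into one as a GROUP BY would. The sketch below is plain Python, not HiveQL, and only imitates the semantics of a row-numbering function applied per partition in descending order; the function name, column names, and sample data are assumptions made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def row_number_over(rows, partition_key, order_key):
    """Number rows within each partition, highest order_key first.

    Illustrative only: mimics the effect of a row-numbering window
    function; it is not how Hive implements windowing.
    """
    # groupby needs its input sorted by the partition key.
    ordered = sorted(rows, key=itemgetter(partition_key))
    result = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        ranked = sorted(group, key=itemgetter(order_key), reverse=True)
        for position, row in enumerate(ranked, start=1):
            # Every input row is kept; it just gains an extra column.
            result.append({**row, "row_number": position})
    return result


if __name__ == "__main__":
    sales = [
        {"region": "east", "sales": 10},
        {"region": "west", "sales": 5},
        {"region": "east", "sales": 30},
    ]
    for row in row_number_over(sales, "region", "sales"):
        print(row)
```

Note that the output has as many rows as the input, which is the property that distinguishes a windowing function from an ordinary aggregate.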
Pig is an analytics platform that is often used for ETL (Extract / Transform / Load) processing. It has been upgraded from version 0.9.2.2 to 0.11.1.1. This version adds new data types, functions, and operators. Learn how to Process Data with Pig.
EMR clusters now start up about 60 seconds faster and the team continues to bring down the average startup time.
S3DistCp uses Hadoop to efficiently transfer data between S3 buckets and HDFS. The new version does a better job of handling S3 metadata such as the S3 storage class during copy operations.
As I discussed earlier this year, MapR M7, a premium offering of Hadoop and HBase, is now supported. MapR M7 lets you run production HBase applications by providing capabilities such as seamless splits, no compactions, instant recovery from failures, point-in-time recovery, full HA, mirroring, and consistent low latency.
As always, these features are available now and you can start using them today!