When you launch an Amazon Elastic MapReduce job flow, the Hadoop job runs on a generic AMI that we supply. Until now, there's been no easy way to customize the image by modifying configuration files or installing additional software.
By popular demand, we now support bootstrap actions for each Elastic MapReduce job flow. The bootstrap actions are scripts stored in Amazon S3. You can write the scripts in any language that's already installed on the instance -- Perl, Python, Ruby, or Bash. Bash is probably your best bet for simple customizations. Here's an example of running a job flow with a bootstrap action that uses the Elastic MapReduce command-line client:
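A sketch of such a command follows; the bucket name, input/output paths, and streaming job arguments are placeholders, and the exact flags may vary with the version of the command-line client:

```shell
./elastic-mapreduce --create --stream \
  --input  s3://my-bucket/input \
  --output s3://my-bucket/output \
  --mapper s3://my-bucket/scripts/mapper.py \
  --reducer aggregate \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "--site-config-file,s3://my-bucket/config/custom-site.xml"
```

The bootstrap action runs on every instance in the job flow before Hadoop starts, so the configuration overrides are in place before any tasks run.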
This command uses a bootstrap action provided by Elastic MapReduce that will override the settings in the Hadoop site config with settings loaded from a file in S3.
Another predefined bootstrap action allows you to modify the amount of memory allocated to various Hadoop daemons:
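A sketch of this invocation is below. The `configure-daemons` action name and its argument syntax are taken from the Elastic MapReduce documentation of the time; treat the exact flag spellings as approximate:

```shell
./elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
  --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
```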
This command sets the NameNode heap size to 2048 MB and adds the Java command-line argument -XX:GCTimeRatio=19, which increases the frequency with which the Java garbage collector runs.
Bootstrap actions run as the user “hadoop”, but this user is allowed to escalate to root using sudo, so if you wanted to install a Debian package you could write a bootstrap action like this:
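A minimal sketch of such a script; "some-package" is a placeholder for whatever Debian package you actually need:

```shell
#!/bin/bash
# Bootstrap actions run as the "hadoop" user, which may escalate via sudo.
sudo apt-get update
sudo apt-get install -y some-package
```

Upload the script to S3 and pass its location with --bootstrap-action when you create the job flow.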
If your bootstrap action fails, your job flow will be shut down, so you'll want to test your bootstrap action script on a running job flow before specifying it as a bootstrap action. To do this, run a development job flow with the --alive option, like this:
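Something along these lines (the job flow name is a placeholder):

```shell
# --alive keeps the instances running after the steps finish,
# so you can log in and experiment interactively.
./elastic-mapreduce --create --alive --name "Bootstrap testing"
```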
Then you can SSH to the master node of your job flow, download your script from S3 with hadoop fs -copyToLocal, and execute it. Once you know that it works, try it as a bootstrap action on a new job flow.
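For example, on the master node (the bucket and script name are placeholders):

```shell
# Fetch the script from S3 to the local filesystem, then run it by hand.
hadoop fs -copyToLocal s3://my-bucket/scripts/setup.sh setup.sh
chmod +x setup.sh
./setup.sh
```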
There's more information on Elastic MapReduce bootstrap actions in the newest version of the documentation.