Taking Massive Distributed Computing to the Common Man - Hadoop on Amazon EC2/S3
Not so long ago, it was both difficult and expensive to perform massive distributed processing using a large cluster of machines. Mainly because:
- It was difficult to get the funding to acquire this 'large cluster of machines'. Once acquired, it was difficult to manage (powering/cooling/maintenance) it and we always had a fear of what-if the experiment failed and how would one recover the losses from the investment already made.
- After it was acquired and managed, there were technical problems. It was difficult to run massively distributed tasks on the machines, storing and accessing large datasets, parallelization was not easy and Job scheduling was error-prone. Moreover, If nodes failed, detecting this was difficult and recovery was very expensive. Tracking jobs and status was often ignored because it quickly became complicated as number of machines in cluster increased.
Hence it was difficult to innovate and/or solve real-world problems like these:
- Web Company : Analyze large-data sets of user behavior and clickstream logs
- Social Networking Company : Analyze social, demographic and market data
- Phone Company : Locate all customers who have called in a given area
- Large Retailer Chain : Wants to know what items a particular customer bought last month or recall a certain product and inform customers who bought that product.
- Surveillance Company : Wants to transcode video accumulated over several years
- Pharma Company : Wants locate people who were prescribed a certain drug
Just a few years ago, it was difficult. But now, it is easy.
The Open Source Hadoop framework has given developers the power to do some pretty extraordinary things.
Hadoop gives developers an opportunity to focus on their idea/implementation and not worry about software-level "muck" associated with distributed processing (#2 above). It handles job scheduling, automatic parallelization, and job/status tracking all by itself while developers focus on the Map and Reduce implementation. It allows processing of large datasets by splitting the dataset into manageable chunks, spreading it across a fleet of machines and managing the overall process by launching jobs, processing the job no matter where the data is physically located and, at the end, aggregating the job output into a final result.
Large companies can afford to acquire 10,000 node clusters and run their experiments on massive distributed processing platforms that process 20000 TB/day.
But if I am a startup, or a university with minimal funding, or a self-employed individual who would like to test distributed processing over a large cluster with 1000+ nodes, can I afford it? OR even If I am a well funded company (think "enterprise") with lot of free cash flow, will management approve the budget for my experiment? Every organization has a person who says "no". Will I be able to fight the battle with those people? Should I even fight the battle (of logistics)? Will I be able to get an environment to experiment with large datasets (think "weather data simulation", oer "genome comparisons")?
Cloud Computing makes this a reality (solving #1 above). Click a button and get a server. Flick a switch and store terabytes of data geographically distributed. Click a button and dispose of temporary resources.
Posts like this and this inspired me to write this post. Amazon Web Services is leveling the playing field for experimentation, innovation and competition. Users are able to iterate on their ideas quickly, if your idea works, bingo! If it does not, shutdown your "droplet" in the cloud and move on to the next idea and start a new "droplet" whenever you are ready.
I would say:
The Open Source Hadoop framework on Amazon EC2/S3 has given every developer the power to do some pretty extraordinary things.
Everyday, I hear new stories about running Hadoop on EC2. For example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 1.1 million finished PDFs in the space of 24 hours at a computation cost of just $240. It not only makes massive distributed processing easy but also makes it headache-free.
Whether it is Startup companies or University Classrooms in UCSB, BYU, Stanford or even enterprise companies, its just amazing to see every new story that is utilizing Hadoop on Amazon EC2/S3 in innovative ways.
That’s what I love about Amazon Web Services - a common man with just a credit card can afford to think about massive distributed computing and compete with the rest and emerge to the top.
--Jinesh
p.s.The real power and potential of hadoop over Amazon EC2 would be when I see Hadoop-on-demand with Condor spawning EC2 instances on-the-fly when I need them (or when situation demands them) automatically and shutting them down when I don’t need them. Has anybody tried that yet ?


Amazon Web Services has completely changed the way I deliver web applications. It has been extremely disruptive in my business operations over the last year, in a positive way.
I am introducing clients to it daily, it takes time but slowly people are understanding the power.
Posted by: Kin Lane | February 27, 2008 at 06:05 PM
The ability to throw instances into a pipeline is especially attractive for implementation of OGC WPS, a web service standard for geospatial processing. www.cadmaps.com/gisblog
WPS provides a standard that would allow chaining of multiple WPS nodes to produce an end result.
Amazon AWS is really changing the way small businesses think about compute resources.
Posted by: Randy George | February 28, 2008 at 09:37 AM
There are Hadoop AMIs available to make running Hadoop on EC2 easy - see http://wiki.apache.org/hadoop/AmazonEC2 for full instructions. Also, check out the article on Amazon Web Services Developer Connection which shows you how to do log analysis on Hadoop and EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873.
Posted by: Tom White | February 29, 2008 at 02:36 PM
--Jinesh
yes i do.. you can change system to manual and it'll not shut down automatically
Posted by: ankara evden eve | March 13, 2008 at 03:34 AM
Great inspiring post. In fact, I am inspired by the same and I am working on a discovery engine on EC2,
http://www.top8.biz/wordpress/?p=21.
Derek of New York Times even added some clarifications on upload: 4 Terabytes took 4 days, which is quite reasonable. And they were throttling for budget reasons.
Posted by: Mark Kerzner | March 13, 2008 at 07:52 AM
Hadoop is fine for Map/Reduce but for more complex interchanges Rio provides a great deal of additional power to services running on EC2. See http://blog.elastic-grid.com/elastic-grid/how-to-start-rio-on-amazon-ec2/ and http://rio.dev.java.net/.
Posted by: Jeff | May 04, 2008 at 11:27 PM