
Taking Massive Distributed Computing to the Common Man - Hadoop on Amazon EC2/S3

Not so long ago, it was both difficult and expensive to perform massive distributed processing using a large cluster of machines, mainly because:

  1. It was difficult to get the funding to acquire this 'large cluster of machines'. Once acquired, the cluster was difficult to manage (powering, cooling, maintenance), and there was always the fear of what would happen if the experiment failed and how one would recover the investment already made.
  2. Even after the cluster was acquired and managed, there were technical problems. It was difficult to run massively distributed tasks on the machines and to store and access large datasets; parallelization was not easy, and job scheduling was error-prone. Moreover, if nodes failed, detecting the failure was difficult and recovery was very expensive. Tracking jobs and status was often ignored because it quickly became complicated as the number of machines in the cluster increased.

Hence it was difficult to innovate and/or solve real-world problems like these:

  • Web Company : Analyze large data sets of user behavior and clickstream logs
  • Social Networking Company : Analyze social, demographic and market data
  • Phone Company : Locate all customers who have called in a given area
  • Large Retailer Chain : Wants to know what items a particular customer bought last month or recall a certain product and inform customers who bought that product.
  • Surveillance Company : Wants to transcode video accumulated over several years
  • Pharma Company : Wants to locate people who were prescribed a certain drug

Just a few years ago, it was difficult. But now, it is easy.

The Open Source Hadoop framework has given developers the power to do some pretty extraordinary things.

Hadoop gives developers an opportunity to focus on their idea/implementation and not worry about software-level "muck" associated with distributed processing (#2 above). It handles job scheduling, automatic parallelization, and job/status tracking all by itself while developers focus on the Map and Reduce implementation. It allows processing of large datasets by splitting the dataset into manageable chunks, spreading it across a fleet of machines and managing the overall process by launching jobs, processing the job no matter where the data is physically located and, at the end, aggregating the job output into a final result.
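To make the Map and Reduce split concrete, here is a minimal sketch of the idea in plain Python. It does not use the Hadoop API; the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are made up for illustration, and everything runs in a single process, whereas Hadoop would spread the chunks across machines and handle the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, the work a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every chunk, group by key, then reduce each group."""
    mapped = chain.from_iterable(map_words(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Each string stands in for one chunk of a much larger dataset.
chunks = ["the cloud is big", "the cluster is in the cloud"]
word_counts = run_job(chunks)
```

The developer writes only the two small functions in the middle; splitting the data, grouping by key, and collecting the output are the parts a framework like Hadoop takes over.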

Large companies can afford to acquire 10,000-node clusters and run their experiments on massive distributed processing platforms that process 20,000 TB/day.


But if I am a startup, or a university with minimal funding, or a self-employed individual who would like to test distributed processing over a large cluster with 1000+ nodes, can I afford it? Or even if I am a well-funded company (think "enterprise") with a lot of free cash flow, will management approve the budget for my experiment? Every organization has a person who says "no". Will I be able to fight the battle with those people? Should I even fight the battle (of logistics)? Will I be able to get an environment to experiment with large datasets (think "weather data simulation" or "genome comparisons")?


Cloud Computing makes this a reality (solving #1 above). Click a button and get a server. Flick a switch and store terabytes of data geographically distributed. Click a button and dispose of temporary resources.

Posts like this and this inspired me to write this post. Amazon Web Services is leveling the playing field for experimentation, innovation and competition. Users are able to iterate on their ideas quickly. If your idea works, bingo! If it does not, shut down your "droplet" in the cloud, move on to the next idea, and start a new "droplet" whenever you are ready.


I would say:

The Open Source Hadoop framework on Amazon EC2/S3 has given every developer the power to do some pretty extraordinary things.

Every day, I hear new stories about running Hadoop on EC2. For example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 1.1 million finished PDFs in the space of 24 hours at a computation cost of just $240. It not only makes massive distributed processing easy but also makes it headache-free.

Whether it is startup companies, university classrooms at UCSB, BYU, and Stanford, or even enterprise companies, it's just amazing to see every new story that uses Hadoop on Amazon EC2/S3 in innovative ways.


That’s what I love about Amazon Web Services - a common man with just a credit card can afford to think about massive distributed computing, compete with the rest, and rise to the top.


--Jinesh


P.S. The real power and potential of Hadoop on Amazon EC2 will come when I see Hadoop-on-demand with Condor automatically spawning EC2 instances on the fly when I need them (or when the situation demands them) and shutting them down when I don’t need them. Has anybody tried that yet?

Orglex: Using AWS to Build Semantic Web Ontologies

Nik Rao of Orglex wrote to tell me about his site and about how they are using Amazon EC2, S3, and the Alexa Web Information Service to build their site.

Orglex is focused on providing information services to industry professionals in vertical markets such as health insurance, clinical trials, and venture capital. Their service has three facets: aggregated content specific to the industry, community and networking hubs for the industry, and a targeted recruiting platform.

Orglex uses EC2 and S3 to build and maintain domain-specific ontologies for each vertical market. In contrast to the usual top-down models hand-built by domain experts, their system extracts clues from the information and uses them in a scalable, bottom-up fashion.

The process of building and refining the algorithms is iterative in nature. The scalable nature of EC2 allows them to tune and re-run their algorithms as needed without the need for a dedicated compute cluster. Nik told me that the ability for them to scale up and down has really driven down the cost of experimentation for them and has allowed them to get to market quickly, and with a high quality product.

-- Jeff;

Vote For Your Favorite AWS-Powered Sites

Voting for the 2008 Webware awards is now underway. Please consider voting for your favorite AWS-powered sites including these. Click Site to visit the site or Vote to begin the voting process.

If your site has been nominated and you are not on the list, leave a comment, include a link to the voting page and I will update this post!

-- Jeff;

AWS Jobs

Jungle Disk is looking for a C/C++ developer to work directly on the product, and for a C#/ASP.Net developer to work on the web site. More info can be found in Jungle Dave's blog post.

Ronald Lewis is offering a 10% discount on EC2-based consulting jobs.

-- Jeff;

PS - I will be happy to post additional positions offered and positions sought as long as they are directly related to an Amazon Web Service.


Increasing Your Amazon EC2 Instance Limit

We have simplified the process of requesting additional EC2 instances. You no longer need to call me at home or send a box of dog biscuits to Rufus.

You can now make a request by simply filling out the Request to Increase the Amazon EC2 Instance Limit form. We'll need to know a little bit about you and about your application and the number of instances that you need, and we'll take care of the rest.

As always, if you are doing something cool with EC2, we really want to hear about it! Write a blog post that we can link to, or simply send us an email at awseditor@amazon.com.

-- Jeff;

New Zealand Trip Report

If there was ever any doubt about the power each of us has, this week proved that one person makes a real difference. I am midway through a two-week trip to New Zealand and Australia, and am writing this post from New Zealand. The person that I’m talking about is Nick Jones. Let me explain how this evangelism trip came about, and along the way I’ll talk a bit about what I found once here.

How the Trip Came About
Amazon’s own Jeff Barr came up with an idea that has changed the course of evangelism—at least here at Amazon Web Services. We have a wiki at evangelists.wetpaint.com that allows community members to request that we come to them, rather than some centralized process where we decide who “should” hear about Amazon Web Services. And so in this case Nick posted a request that Amazon send a Web Services evangelist down under. I replied to Nick to say “sure, but not just for one meeting”. Must have been a challenge—check out the wiki page for this trip and you’ll see just how dense the schedule is. Nick wasn’t responsible for every meeting; however a large percentage of these meetings in both New Zealand and Australia were due to his efforts.

The Result
Lots of opportunity to meet with the academic/research community (Nick works at the University of Auckland), government agencies, startups, and individual developers on this trip. It’s amazing what you learn—especially when others set the agenda. I am going to describe just a few highlights, which will shortchange others who reinforced the same point; but given the number of meetings it’s the only approach possible.

New Zealand is a long way from traditional tech centers, and there is a single undersea cable that serves the country (although a second one is on the way). The result is that Internet access is expensive, with a wholesale cost of $0.03/MB to communicate with North America. So the research community makes use of KAREN, a network that is funded by the NZ government and that eliminates that transit fee—as long as the other end has a peering agreement. None of this seems to affect the local startup scene though, as I'll describe shortly.

Every city seemed to have a take-charge person. In Christchurch there were two, with Robin Harrington taking the lead at the University of Canterbury, and Christopher Sawtell leading the charge for the Linux group. Robin set up a series of sessions with researchers and faculty on campus. It's always exciting to see people think about what these new Web service offerings afford in the way of potential and cost savings. And I was able to learn more about the university and what their needs are. The campus is on a very large piece of land; yet the actual buildings are compact so that there is lots of very lush green space. Kiwis are definitely into "green"--in both the garden and environmental sense.
As mentioned, the other Christchurch leaders were long-time officers of the local Linux user group. They went well out of their way to accommodate my schedule and arrange a meeting on a non-normal night. Then they even invited me out for dinner at a Chinese restaurant. Great place to eat! We met on the university campus; you know it's a comp sci department when the name on the lab door says "Crypt 2".
The Kiwi research community has access to the highest number of supercomputers per capita in the world. These were used for at least part of the rendering of Lord of the Rings, a fact that many techies say “thank you” for.
Wellington has a vibrant Web community, and seems to be a hotbed of tech startups. The original intent was that I'd present to a few local startups. The event kept growing on its own until Catalyst Consulting stepped in and agreed to host it. Then it got bigger yet, presenting venue challenges... Don Christie from Catalyst posted a blog entry about the meeting, where I presented to a group of well over 100 people (believe that it was closer to 150), in a packed incubation center. Wow, what energy! What the folks in the room didn’t realize was that from the balcony outside the meeting I was able to see the neighborhood where I lived briefly many years ago (in the background of this photo). What a distraction! Another blog post by a different attendee is here.
In Hamilton I met with one of New Zealand’s largest Web design firms. They have all sorts of innovation in their reference list; not least of which was setting themselves up as an Internet registrar. Like so many others, they were enthusiastic and excited about the potential of Web-Scale Computing. At this point I also switched to renting a car--was a combination of destinations in suburban areas and a late-night travel schedule to Auckland. The rental vehicle reminded me that New Zealand uses the other side of the road, and that I should too...

Finally, Auckland is a more traditional business community but still full of tech startups. Had an opportunity to meet with some of them as well. In both Wellington and Auckland I realized how hands-on the government is about promoting their software industry as an export. The folks in NZTE (New Zealand Trade and Enterprise) were impressive--unlike a typical government agency, these staff members come from the software industry, and have a very realistic view of the world. There are plenty of success stories in New Zealand's software industry that don't involve government agencies, of course; however being promoted as an export industry definitely provides lift.

I finally met Nick on Thursday.

Who wants to be next? Nick and the rest of the New Zealand community set the bar...

-- Mike

Two New Case Studies: Sonian Networks and Digital Chalk

I want to make sure that you are aware of the Success Stories section of our web site!

Within that section you can learn about how companies large and small are using our Utility Computing Platform (EC2, S3, SimpleDB, and SQS), the Amazon Associates Web Service, the Amazon Mechanical Turk and the Alexa Web Information Service to solve existing problems and to create entirely new types of businesses.

Earlier this month we introduced two new stories. You can read about how Digital Chalk used 3 different services to create a system for creating, editing, and hosting training videos.

You can also read about how Sonian Networks (previously blogged here) used the same services to create a highly scalable system for archiving and indexing corporate email and other internally generated content.

We've got more stories in the works, so please check the Success Stories part of our site from time to time.

-- Jeff;

Bungee Connect Opens up and Adds Amazon SimpleDB Access

My friends at Bungee Labs have rolled out the newest release of Bungee Connect, their browser-based application development and hosting platform.

They have also released a library which makes it really easy to make calls to Amazon SimpleDB. The library wraps all of the SimpleDB SOAP calls and handles all of the authentication as well. Per their recent blog post, all you need to do to get started is to enter your AWS developer credentials. You can read about the library here. Per my earlier blog post, you can also access Amazon FPS from Bungee Connect with ease.

Bungee Connect is the development component of Bungee's Platform-as-a-Service model. Without leaving your desk (or your web browser) you can design, build, and deploy a complex application. The application might involve calling SOAP or REST web services, mashing up data from multiple local and remote sources, and doing some significant local processing as well. There's no charge to develop an application. Once built and deployed, the developer is billed based on actual usage of the application. There's more on this over at ProgrammableWeb.

You may be reading this and thinking that it sounds cool, only to realize that you don't yet have access to Amazon SimpleDB. We are adding new users to the SimpleDB beta just as fast as possible. If you are not yet on the waiting list, go here and click the Sign Up for Web Service button near the top right of the page. Before you do that, make absolutely sure that you have attached a credit card to your AWS account. If you are already using another for-pay service such as Amazon S3 or EC2, you have already done this. About 99.9% of our existing SimpleDB beta testers gained their access in this way.

The other 0.1% were desperate for access and managed to beg their way into the beta using various social engineering tricks. Sample tricks include desperate emails to me, emails with a very predictable pattern:

Paragraph 1 is always something like "Hey Jeff, remember that time we were using a PDP-8 together back in 7th Grade? Man, those were the good old days. I've been meaning to catch up with you for a long time. How's life?"

Paragraph 2 is then "I'm now at a startup, and my life won't be complete without access to SimpleDB. Can you help?"

Believe it or not, I get at least one such email per week. In fact, our limited betas have proven to be very effective at getting reconnected to old friends, which is never a bad thing. Of course, the more clever and more desperate the appeal, the better.

-- Jeff;

S3 Stat - Usage Stats for S3 Files

S3Stat is a log analysis tool for Amazon S3. This very helpful tool uses the log files generated by S3, analyzes them using Webalizer, and generates a variety of insightful and colorful reports. I have been using S3Stat on one of my own buckets for the last couple of months and have been pleased with the results. I use an S3 bucket to store the pictures that I post on my personal blog and now I know a lot more about the popularity of each one.

Take a look at the sample reports to learn more.

There's a one-month free trial and usage after that costs just $2 per month. Take a look at the pricing plan to learn more. While you are on the site you may want to take a look at their handy list of S3 resources as well.

-- Jeff;

Amazon SimpleDB Query Tutorials

In order to allow developers to gain a better understanding of the Amazon SimpleDB Query language, we have just posted a pair of tutorials:

In Query 101: Building Amazon SimpleDB Queries, you will learn about the basic principles of the language, including the comparison and set operators, and how to use them in simple and range queries. With that as a base, you will then learn about multi-valued queries, which (naturally enough) operate on SimpleDB attributes which have multiple values. Finally, you will learn about multi-predicate queries, using the union and intersection operators to ask more complex questions.

In Query 201: Tips & Tricks for Amazon SimpleDB Query, you will learn about lexicographic comparison, querying for numerical data and dates, using negation, tuning your queries using BoxUsage, partitioning your data for best query performance, and efficient retrieval of result sets.
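Lexicographic comparison is the topic most likely to surprise newcomers, so here is a small Python sketch of why it matters for numbers and dates. It assumes values are stored and compared as strings; the helper name `pad_number` and the width of 10 are illustrative choices, not part of SimpleDB, and the sketch covers only non-negative integers.

```python
from datetime import date


def pad_number(value, width=10):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    return str(value).zfill(width)


# Compared as plain strings, "9" sorts after "10" because "9" > "1".
unpadded = sorted(["9", "10", "100"])

# With zero-padding, string order and numeric order agree.
padded = sorted(pad_number(n) for n in [9, 10, 100])

# ISO 8601 dates (YYYY-MM-DD) already sort chronologically as strings.
dates = sorted(d.isoformat() for d in [date(2008, 2, 13), date(2007, 12, 1)])
```

Here `unpadded` comes out as `["10", "100", "9"]`, while `padded` keeps 9, 10, 100 in numeric order. Negative numbers need an additional offset, which this sketch does not cover.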

-- Jeff;

Zmanda Webinar

At 10 AM on Wednesday, February 13th, Dmitri Joukovski from Zmanda will explain how backup to Amazon S3 complements traditional backup to disk and tape. He will also demonstrate how easy it is to configure Amanda Enterprise for backup and recovery to Amazon S3.

Dmitri will talk about the benefits of using S3 for off-site backup and archiving. He will also list the advantages of using Amanda Enterprise with S3 vs online backup services. The webinar will show how easy it is to configure and backup to S3 using Amanda Enterprise.

The webinar is free but pre-registration is a must.

-- Jeff;

New Release of Amazon Simple Queue Service (SQS)

We've just rolled out a new version of the Amazon Simple Queue Service (SQS for short).

After taking a detailed look at the ways that the clever folks in our developer community (now 330,000-strong, by the way) are using SQS, we made a number of changes to increase the cost efficiency of the service, removing a few calls and reducing the maximum message size.

The original version of SQS featured a pricing model based on the number of messages sent. The new and improved model is based on the actual number of web service requests. Instead of paying 10 cents for every 1000 messages, you will now pay 1 cent for every 10,000 requests.
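As a rough way to compare the two models, the arithmetic can be sketched in Python. The two rates come from the paragraph above; the workload figures (one million messages, three requests per message, half a million empty polls) are assumptions picked purely for illustration, and data transfer charges are left out.

```python
OLD_CENTS_PER_1000_MESSAGES = 10
NEW_CENTS_PER_10000_REQUESTS = 1


def old_cost_cents(messages):
    """Old model: 10 cents for every 1000 messages sent."""
    return messages * OLD_CENTS_PER_1000_MESSAGES / 1000


def new_cost_cents(requests):
    """New model: 1 cent for every 10,000 web service requests."""
    return requests * NEW_CENTS_PER_10000_REQUESTS / 10000


# Assumed workload: each message costs three requests (send, receive,
# delete), plus some polls that come back empty.
messages = 1_000_000
empty_polls = 500_000
requests = messages * 3 + empty_polls

old_total = old_cost_cents(messages)   # 10000.0 cents, i.e. $100
new_total = new_cost_cents(requests)   # 350.0 cents, i.e. $3.50
```

Under these assumed numbers the request-based model is far cheaper, but note that empty polls now count toward the bill, which is why polling behavior matters.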

If you poll your SQS queues to check for new messages, you should re-examine your code to see if you can do things more efficiently under this new pricing system. In particular, if you find that there's nothing to process, you should implement a backoff scheme of some sort. If your first request in a series doesn't find anything to process, wait 1 second and try again. If there's still nothing there, wait 2 seconds, and then 4, up to some upper bound based on the specific needs of your application.
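The backoff scheme described above (wait 1 second, then 2, then 4, up to an upper bound) might look something like this in Python. This is a sketch, not SQS client code: `receive` and `handle` are hypothetical callables that you would supply, and `max_wait`, `max_polls`, and the injectable `sleep` are assumptions added to keep the example testable.

```python
import time


def poll_with_backoff(receive, handle, max_wait=60, max_polls=None,
                      sleep=time.sleep):
    """Poll a queue, doubling the wait after each empty response.

    receive   -- hypothetical callable returning a list of messages
    handle    -- hypothetical callable that processes one message
    max_wait  -- upper bound on the wait, in seconds
    max_polls -- stop after this many polls (None means run forever)
    """
    wait = 1
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        messages = receive()
        if messages:
            for message in messages:
                handle(message)
            wait = 1  # work arrived, so go back to polling promptly
        else:
            sleep(wait)
            wait = min(wait * 2, max_wait)  # 1, 2, 4, ... capped at max_wait
```

Resetting the wait to 1 second as soon as a message arrives keeps latency low during busy periods, while the cap keeps an idle consumer from sleeping longer than your application can tolerate.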

We also found that a number of our developers are using SQS as a fundamental part of their Amazon-powered infrastructure. One common use is to dedicate Amazon EC2 instances to host particular components of a multi-stage image or document processing system using SQS queues to buffer the messages between stages. A system built on this architecture can be scaled to take on additional load by simply adding more EC2 instances at any stage. Systems built like this also exhibit a high degree of fault tolerance. You can learn more about this particular architectural model in Jinesh's post on SQS: The Super Queue Service and you can read about this model in action by reading François Beausoleil's post about Using SQS and S3 to decouple image resizing from uploading.
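The shape of such a multi-stage system can be sketched locally in Python. The standard library's `queue.Queue` stands in for an SQS queue here, and the stage names and string-based "work" are invented for illustration; in a real deployment each stage would run on its own EC2 instances and talk to SQS over the network.

```python
import queue


def run_stage(inbox, outbox, work):
    """Drain the inbox, apply work to each message, and pass the result on."""
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            break  # nothing left to process for now
        outbox.put(work(message))


# One queue sits between each pair of stages and buffers their messages.
uploads, resized, done = queue.Queue(), queue.Queue(), queue.Queue()
for name in ["a.jpg", "b.jpg"]:
    uploads.put(name)

# Each stage only knows about its inbox and outbox, so more workers can be
# added at any stage without changing the others.
run_stage(uploads, resized, lambda name: "resized-" + name)
run_stage(resized, done, lambda name: "indexed-" + name)

results = [done.get_nowait() for _ in range(done.qsize())]
```

Because the stages share nothing but the queues, a slow or failed worker at one stage leaves messages waiting in its inbox rather than losing them, which is where the fault tolerance mentioned above comes from.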

Using SQS as a piece of application infrastructure is now more economical than ever because (drum roll) there's no longer any charge to move data back and forth between EC2 and SQS! Note that you can use the AWS Simple Calculator to model your system and to estimate your monthly charges.

There's more information on these exciting changes in the forum post.

-- Jeff;

Second Life Developer Chat Schedule for February 2008

Here's the Second Life chat schedule for February. All of the chats will take place at 10 AM PST on the Amazon Developers Islands in Second Life.

If you are new to Second Life and have trouble locating the island, simply log in 15 minutes or so before the chat and send me ("Jeffronius Batra") an Instant Message (IM). I will help you. Once you have Second Life installed you should be able to get to the right place by clicking here.

  • Thursday, February 7th
  • Thursday, February 14th
  • Thursday, February 21st
  • Thursday, February 28th

We discuss the various Amazon Web Services, application architectures, issues, use cases, and lots more.

Important Update: Members of the AWS Developer Support team will be in attendance on the 7th (and on the other days if we get enough questions). So bring us your S3, EC2, SQS, and SimpleDB questions and we'll do our best to answer them right away.

Hope to see you there!

-- Jeff;

Wowza Media Server Pro On Amazon EC2

The unpredictable nature of the demand for rich media files, coupled with the amount of bandwidth needed to deal with a video that suddenly "goes viral," creates the circumstances for an event that I sometimes refer to as a "success disaster." This happens when you create something cool and hope that it will catch on (bringing with it at least a modicum of fame and fortune), but when it does catch on you are faced with a wall of traffic that you cannot hope to satisfy. It is expensive and impractical to have spare servers standing by on the off chance that you are faced with a sudden demand for one of your media products.

An on-demand, web-scale solution can help you to sidestep this particular problem. The Wowza Media Server Pro can be purchased in two ways. You can buy a traditional, fixed-price license and use it to stream as much media as you'd like for a single server. This is probably the way to go if you have access to a dedicated server and if you have a good handle on the demand for your media files. The server supports the RTMP protocol (and the RTMPT and RTMPS variants) used to stream Flash media files.

Alternatively, you can run the same server on one or more Amazon EC2 instances, paying a very modest monthly fee (currently $5) and then per-hour and bandwidth-based charges a bit above and beyond what's charged for EC2 itself. If you don't have servers of your own, or if you are faced with unpredictable demand, then this is a great way to go. You could keep a single server running at all times and then add additional instances when traffic surges. The product is available in a 32-bit version suitable for Small EC2 instances and in a 64-bit version for Large and Extra Large instances.

You can learn more about how to run the product on EC2 by reading the User Guide (That's a PDF). After a quick reading it looks like this would be pretty easy to set up. The AMIs are generic, with customization information supplied via an external startup package. The package enumerates the set of application packages to be installed, specifies the folder structure for the server, initiates downloads of content (Amazon S3 would be a good place to keep it, of course) and provides a mechanism to call startup or tuning scripts. The AMI also includes a password-protected FTP server and an interface to the Java Management Extensions (JMX).

This is all very cool stuff, and another way that Web-Scale Computing is changing the economics of doing business online.

-- Jeff;
