E-Commerce Service
Amazon E-Commerce Service (ECS) exposes Amazon's product data and e-commerce functionality.

Elastic Compute Cloud
Amazon Elastic Compute Cloud is a web service that provides resizable compute capacity in the cloud.

Historical Pricing
The Amazon Historical Pricing web service gives developers programmatic access to over three years of actual sales data for books, music, videos, and DVDs.

Mechanical Turk
One of the best ways to understand Amazon Mechanical Turk is to complete a HIT and see what the experience is like.

Simple Storage Service
Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers.

Simple Queue Service
Amazon Simple Queue Service offers a reliable, highly scalable hosted queue for storing messages as they travel between computers.

Alexa Thumbnails
All thumbnail images are accessible via web services, using SOAP or REST.

Alexa Top Sites
The Alexa Top Sites web service provides ranked lists of the top sites on the Internet.

Alexa Web Information Service
The Alexa Web Information Service makes Alexa's vast repository of information about the traffic and structure of the web available to developers.

Alexa Web Search
The Alexa Web Search web service offers programmatic access to Alexa's web search engine.

White Paper on 'Cloud Architectures' and Best Practices of Amazon S3, EC2, SimpleDB, SQS

I am very happy to announce my white paper on Cloud Architectures is now ready. This is one incarnation of the Emerging Cloud Service Architectures that Jeff wrote about a few weeks ago.

If you are new to the cloud, the first section of the paper will help you understand the benefits of building applications in-the-cloud. If you are using the cloud already, the second section of the paper will help you to use the cloud more effectively by utilizing some of the best practices.

In this paper, I discuss a new way to design architectures. Cloud Architectures are Services-Oriented Architectures that are designed to use on-demand infrastructure more effectively. Applications built on Cloud Architectures use the underlying computing infrastructure only when it is needed (for example, to process a user request): they draw the necessary resources on demand (like compute servers or storage), perform a specific job, and then relinquish the unneeded resources after the job is done. While in operation, the application scales up or down elastically based on its actual need for resources. Everything is automated and operates without any human intervention.

As an example of a Cloud Architecture, I discuss the GrepTheWeb application. This application runs a regular expression against millions of documents from the web and returns the filtered results that match the query. The architecture is interesting because it runs completely on-demand in an automated fashion. Triggered by a regex request, hundreds of Amazon EC2 instances are launched, a Hadoop cluster is started on them, transient messages are stored in Amazon SQS queues, statuses are stored in Amazon SimpleDB, and all Map/Reduce jobs are run in parallel. Each Map task fetches a file from Amazon S3 and runs the regular expression against it; the results are aggregated in the Reduce/Combine phase, and once the Hadoop job has been processed, all of the infrastructure is released back into the cloud.
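To make the Map and Reduce roles concrete, here is a minimal, single-process sketch in Python. It is not the GrepTheWeb code: the DOCUMENTS dictionary is a hypothetical stand-in for files held in Amazon S3, the function names are made up for illustration, and there is no Hadoop, SQS, or SimpleDB involved. It only shows the shape of "run the regex in each Map task, then aggregate in Reduce".

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for documents stored in Amazon S3: key -> text.
DOCUMENTS: Dict[str, str] = {
    "doc-1": "cloud computing on demand\nno match here",
    "doc-2": "elastic compute cloud\nstorage for the internet",
    "doc-3": "nothing relevant",
}


def map_task(key: str, text: str, pattern: "re.Pattern[str]") -> List[Tuple[str, str]]:
    """Run the regular expression against one document, line by line."""
    return [(key, line) for line in text.splitlines() if pattern.search(line)]


def reduce_task(mapped: Iterable[List[Tuple[str, str]]]) -> Dict[str, List[str]]:
    """Aggregate the matching lines from every map task, grouped by document."""
    results: Dict[str, List[str]] = {}
    for pairs in mapped:
        for key, line in pairs:
            results.setdefault(key, []).append(line)
    return results


def grep_documents(documents: Dict[str, str], regex: str) -> Dict[str, List[str]]:
    """Map over every document, then reduce the partial results."""
    pattern = re.compile(regex)
    mapped = (map_task(key, text, pattern) for key, text in documents.items())
    return reduce_task(mapped)


matches = grep_documents(DOCUMENTS, r"cloud")
for key, lines in matches.items():
    print(key, lines)
```

In a real deployment each map_task call would run on a different machine; the point of the sketch is only that the map step is independent per document, which is what lets the work spread across hundreds of instances.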

GrepTheWeb is one of many applications built by Amazon that uses all our services (Amazon EC2, Amazon SimpleDB, Amazon SQS, Amazon S3) together.

A wide variety of applications can be built using this design approach, from nightly batch processing systems to media processing pipelines.

An excerpt:

Cloud Architectures address key difficulties surrounding large-scale data processing. In traditional data processing it is difficult to get as many machines as an application needs. Second, it is difficult to get the machines when one needs them.  Third, it is difficult to distribute and co-ordinate a large-scale job on different machines, run processes on them, and provision another machine to recover if one machine fails. Fourth, it is difficult to auto-scale up and down based on dynamic workloads.  Fifth, it is difficult to get rid of all those machines when the job is done. Cloud Architectures solve such difficulties.

Applications built on Cloud Architectures run in-the-cloud where the physical location of the infrastructure is determined by the provider. They take advantage of simple APIs of Internet-accessible services that scale on-demand, that are industrial-strength, where the complex reliability and scalability logic of the underlying services remains implemented and hidden inside-the-cloud. The usage of resources in Cloud Architectures is as needed, sometimes ephemeral or seasonal, thereby providing the highest utilization and optimum bang for the buck.

In the first section I discuss the advantages and business benefits of Cloud Architectures and how each service was used. In the second section, I discuss best practices for the various Amazon Web Services.

You can download the PDF version or access it on the AWS Resource Center.

I talked about this briefly at the Hadoop Summit 2008 and QCon 2007. I got some good reviews after the talk, and hence I decided to put all my thoughts in this paper along with some Best Practices for the use of Amazon Web Services (Amazon EC2, Amazon SQS, Amazon S3 and Amazon SimpleDB together). Many developers from our community have been asking for a real-world example of a complex, large-scale application. I will be presenting this paper at the 2008 NSF Data-Intensive Scalable Computing Workshop at UW and the 9th IEEE/NATEA Conference on Cloud Computing later this week.

I believe this new and emerging way of building applications that run in-the-cloud is going to change the way we do business.

-- Jinesh

The Emerging Cloud Service Architecture

I'm going to go out on a limb today and try to paint a picture of where some of this cool and crazy cloud-based infrastructure may be going. While none of what I will write about is idle speculation, it is based on just a few data points, and may be totally off base. However, I do get to talk to plenty of entrepreneurs and developers on a daily basis, and I am starting to see a very interesting pattern emerge.

The existing state of the art in cloud-based architectures takes the shape of an application running in the cloud, calling upon services running within and provided by the operator of the cloud. There are any number of great examples of this type of architecture. Doug Kaye at IT Conversations built and documented his implementation over a year ago. Earlier today, Don MacAskill of SmugMug sent me a link to his new post, SkyNet Lives (aka EC2 @ SmugMug). In that article, Don provides a detailed review of SmugMug's use of Amazon EC2 and S3 to implement a dynamic, highly scalable system which simultaneously minimizes response time and cost by optimizing the number of EC2 instances.

As I said, I am starting to see something which goes beyond this in a subtle yet important way. Developers are now building services in the cloud for other developers, with the understanding that important (and perhaps primary) consumers of the service will also be resident within the same cloud.

I'm going to call this the CSA, or Cloud Service Architecture.

Applications communicating with each other inside of the Amazon cloud enjoy some important benefits. They get high-bandwidth, low-latency communication, at little or no cost. They inherit all of the other attributes of cloud-based applications such as on-demand scalability, fault tolerance, cloud-wide network security, and cost efficiency. Applications running in loosely coupled fashion within the cloud can share data using SQS, S3, or other communication protocols of their choosing.

Right now, I see that forward-looking companies are starting to build components which fit into the CSA. On the database side, we have Vertica for the Cloud and MySQL Enterprise for EC2. On the media side, there's Cruxy's MuxCloud, IntrIdea's MediaPlug, and Wowza Media Server Pro for Amazon EC2. I'm sure that there are others that I don't know about.

So who's calling these services from other EC2 instances within the cloud? Here are my first two data points (that's enough to draw a trend line, right?):

  1. I had breakfast with the CEO of Sonian yesterday. He told me that they are now using the Vertica product to help them store, index, and retrieve massive amounts of data (more info can be found in their case study).
  2. Earlier this year I paid a visit to VisualCV in Reston, Virginia. They use MediaPlug to support uploading and processing of a variety of types of images and videos.

My sense is that this is the start of something big. Web services made it possible to cross organizational boundaries with a simple HTTP request. Now, running within the cloud makes it possible to do this with minimal network latency.

As individual developers learn more about cloud computing, they will naturally look for some very high-level components up and running within the cloud. Over time I am sure that there will be a need for more sophisticated tracking and billing mechanisms, key management, a catalog of services, and other facilities that we can't even envision just yet. As always, we love to get this feedback from you, so let us know what you need.

I'm sure that there are some other CSA-style applications running in the Amazon cloud now. If you've built one, post a comment!

-- Jeff;

Two Good Podcasts

I hardly ever listen to broadcast radio in my car anymore. Instead, I subscribe to a whole bunch of podcasts, some technical, some fun, and others educational. Here are two episodes which should be of interest to anyone who reads this blog:

The Mashable Podcast interviews Michael Crandell, CEO of RightScale. Michael talks about their product and how it helps organizations to use Amazon EC2 in a cost-effective fashion.

The IT Conversations Podcast captures Amazon CTO Werner Vogels as he talks about AWS at last year's ETech conference.

You can listen to either or both of these on the respective sites or you can simply subscribe to their RSS feeds.

-- Jeff;

PS - Congratulations are due to RightScale for the successful completion of their fund raising endeavor.

Taking Massive Distributed Computing to the Common Man - Hadoop on Amazon EC2/S3

Not so long ago, it was both difficult and expensive to perform massive distributed processing using a large cluster of machines, mainly because:

  1. It was difficult to get the funding to acquire this 'large cluster of machines'. Once acquired, it was difficult to manage (powering, cooling, maintenance), and there was always the fear of what would happen if the experiment failed and how one would recover the investment already made.
  2. After it was acquired and managed, there were technical problems. It was difficult to run massively distributed tasks on the machines and to store and access large datasets; parallelization was not easy, and job scheduling was error-prone. Moreover, if nodes failed, detecting this was difficult and recovery was very expensive. Tracking jobs and status was often ignored because it quickly became complicated as the number of machines in the cluster increased.

Hence it was difficult to innovate and/or solve real-world problems like these:

  • Web Company : Analyze large data sets of user behavior and clickstream logs
  • Social Networking Company : Analyze social, demographic and market data
  • Phone Company : Locate all customers who have called in a given area
  • Large Retailer Chain : Wants to know what items a particular customer bought last month or recall a certain product and inform customers who bought that product.
  • Surveillance Company : Wants to transcode video accumulated over several years
  • Pharma Company : Wants to locate people who were prescribed a certain drug

Just a few years ago, it was difficult. But now, it is easy.

The Open Source Hadoop framework has given developers the power to do some pretty extraordinary things.

Hadoop gives developers an opportunity to focus on their idea/implementation and not worry about software-level "muck" associated with distributed processing (#2 above). It handles job scheduling, automatic parallelization, and job/status tracking all by itself while developers focus on the Map and Reduce implementation. It allows processing of large datasets by splitting the dataset into manageable chunks, spreading it across a fleet of machines and managing the overall process by launching jobs, processing the job no matter where the data is physically located and, at the end, aggregating the job output into a final result.
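The split, process, and aggregate flow described above can be sketched in a few lines of plain Python. This is an illustration only and does not use Hadoop's API: the function names, the tiny log_lines dataset, and the chunk size of 2 are all assumptions made for the example, and everything runs in one process rather than across a fleet of machines.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def split_into_chunks(records: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Split the dataset into manageable chunks of at most chunk_size records."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def map_chunk(chunk: List[str]) -> Counter:
    """Map step: count the tokens in one chunk, independently of all others."""
    counts: Counter = Counter()
    for record in chunk:
        counts.update(record.split())
    return counts


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: merge the per-chunk counts into a final result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical clickstream-style records.
log_lines = ["GET /home", "GET /cart", "POST /cart", "GET /home", "GET /checkout"]

chunks = list(split_into_chunks(log_lines, chunk_size=2))
word_counts = reduce_counts(map_chunk(chunk) for chunk in chunks)
print(len(chunks), dict(word_counts))
```

Because map_chunk never looks outside its own chunk, a framework is free to run each call on whichever machine holds that chunk; scheduling, retries, and status tracking are the parts Hadoop takes off the developer's hands.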

Large companies can afford to acquire 10,000-node clusters and run their experiments on massive distributed processing platforms that process 20,000 TB/day.


But if I am a startup, or a university with minimal funding, or a self-employed individual who would like to test distributed processing over a large cluster with 1000+ nodes, can I afford it? Or even if I am a well-funded company (think "enterprise") with a lot of free cash flow, will management approve the budget for my experiment? Every organization has a person who says "no". Will I be able to fight the battle with those people? Should I even fight the battle (of logistics)? Will I be able to get an environment to experiment with large datasets (think "weather data simulation" or "genome comparisons")?


Cloud Computing makes this a reality (solving #1 above). Click a button and get a server. Flick a switch and store terabytes of data geographically distributed. Click a button and dispose of temporary resources.

Posts like this and this inspired me to write this post. Amazon Web Services is leveling the playing field for experimentation, innovation and competition. You can iterate on your ideas quickly. If your idea works, bingo! If it does not, shut down your "droplet" in the cloud, move on to the next idea, and start a new "droplet" whenever you are ready.


I would say:

The Open Source Hadoop framework on Amazon EC2/S3 has given every developer the power to do some pretty extraordinary things.

Every day, I hear new stories about running Hadoop on EC2. For example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 1.1 million finished PDFs in the space of 24 hours at a computation cost of just $240. Hadoop on EC2 not only makes massive distributed processing easy but also makes it headache-free.
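As a quick sanity check on those numbers, the arithmetic below works out the hourly rate that the figures imply. It assumes that all 100 instances ran for the full 24 hours and that the $240 covers compute only; the per-instance-hour rate is derived from the figures in the story, not quoted from a price list.

```python
# Figures taken from the New York Times example above.
instances = 100
hours = 24
total_cost_usd = 240.0

# Assumption: every instance ran for the whole 24 hours.
instance_hours = instances * hours
implied_rate_usd = total_cost_usd / instance_hours

print(f"{instance_hours} instance-hours at ${implied_rate_usd:.2f} per instance-hour")
```

The same two multiplications are all it takes to budget any batch job of this kind: number of instances times hours, times the hourly rate.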

Whether it is startup companies, university classrooms at UCSB, BYU, and Stanford, or even enterprise companies, it's just amazing to see every new story about Hadoop being used on Amazon EC2/S3 in innovative ways.


That’s what I love about Amazon Web Services - a common man with just a credit card can afford to think about massive distributed computing, compete with the rest, and rise to the top.


--Jinesh


P.S. The real power and potential of Hadoop over Amazon EC2 will be realized when I see Hadoop-on-demand with Condor spawning EC2 instances on the fly when I need them (or when the situation demands them) and automatically shutting them down when I don’t need them. Has anybody tried that yet?

Amazon S3 for Science Grids

A team of researchers from the University of South Florida and the University of British Columbia have written a very interesting paper, Amazon S3 for Science Grids: A Viable Solution?

In this paper the authors review the features of Amazon S3 in depth, focusing on the core concepts, the security model, and data access protocols. After characterizing science storage grids in terms of data usage characteristics and storage requirements, they proceed to benchmark S3 with respect to data durability, data availability, access performance, and file download via BitTorrent. With this information as a baseline, they evaluate S3's cost, performance, and security functionality.

They conclude by observing that many science grid applications don't actually need all three of S3's most desirable characteristics -- high durability, high availability, and fast access. They also have some interesting recommendations for additional security functionality and some relaxing of limitations.

I do have one small update to the information presented in the article! Since the article was written, we have announced that S3 is now storing 5 billion objects, not the 800 million mentioned in section II.

-- Jeff;

Search Engine Packed as an AMI?

It never hurts to try to wish a product into existence...

I received an email from an EC2 user asking me about search tools. This user runs a high traffic site on an array of EC2 instances, and is in need of a search solution. He knew that he could buy a search appliance, but this didn't fit with his company's model. As he told me:

"we don't want to do anything that involves us owning and operating a server...since we're big believers in web services."

After thinking about this for a while, I believe that one really cool solution would involve a search engine installed into an EC2 AMI (Amazon Machine Image), perhaps made available for use on a by-the-hour basis. This hypothetical AMI would incorporate all of the usual components: a crawler, data storage, and a query page for access to the actual search engine. There are bonus points for APIs for inserting and retrieving data, of course.

Perhaps the crawler runs once every 24 hours and then generates some indexed data structures which it stores in S3, where they are picked up by the engine and loaded into the instance's RAM for fast processing. Once again, I'll offer bonus points if spinning up multiple instances of the crawler makes the entire crawling and indexing process run faster.

To top it all off, the query page would be customizable and skinnable, so that this could be plugged into an existing site in a seamless fashion.
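To show what the crawl, index, and query cycle might look like in miniature, here is a hedged Python sketch of an inverted index. Everything in it is hypothetical: the crawled_pages dictionary stands in for the crawler's output, the JSON string stands in for the indexed data structures that would be written to S3, and the search function is a bare-bones AND query rather than a real ranking engine.

```python
import json
from collections import defaultdict
from typing import Dict, List, Set


def build_index(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Build an inverted index: each word maps to the sorted URLs containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


def search(index: Dict[str, List[str]], query: str) -> List[str]:
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


# Hypothetical output of one crawler run.
crawled_pages = {
    "/photos": "share photos online",
    "/videos": "share videos online",
    "/about": "about the company",
}

# The crawler would write this serialized index to S3 ...
serialized = json.dumps(build_index(crawled_pages))
# ... and the search instance would load it into RAM at startup.
in_memory_index = json.loads(serialized)

results = search(in_memory_index, "share online")
print(results)
```

Keeping the index as a plain serializable structure is what would let one set of instances crawl and another set answer queries, with S3 as the hand-off point between them.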

If you are doing something like this or have even thought about doing something similar, I'd like to hear from you. If you would pay to use it, same deal. Post some comments and let's see what happens.

-- Jeff;

Avoid the Crunch, go Web-Scale

Over at TechCrunch, Mike Arrington recently shared some of his concerns about linking to up-and-coming new sites. Any site featured on TechCrunch will see a huge inflow of traffic when seemingly everyone in the tech world visits it within the course of a couple of hours to check it out:

There’s a spike, and then most of the people never come back. Hopefully a few stick around, register and tell their friends, but building an application to scale to handle a TechCrunch post is a long term solution to a short term problem.

Before I could even respond, the Crunchback blog proposed the TechCrunch Reference Architecture. In his words:

Build using Amazon EC2 and S3.
Use a load-balanced architecture
Add EC2 nodes when you go live - as many as you can…
Alert TechCrunch
Wait for mention (pray for mention)
2 days later start reducing nodes…

It almost goes without saying that this is in alignment with our own thinking in this area. Instead of scaling in advance for traffic that may or may not materialize, we believe that developers should create a scalable architecture, host it on Amazon EC2, and then simply "turn the knob" (so to speak) when traffic surges. They pay for actual usage while those servers are active, and then simply turn that knob back down when the surge subsides. No fuss, no muss, and no rack full of servers that are sometimes running at capacity and at other times sitting idle.
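The "turn the knob" idea can be reduced to a small calculation. The sketch below is only an illustration under stated assumptions: the traffic figures are invented, each instance is assumed to handle 100 requests per second, and instances_needed is a hypothetical helper, not part of any Amazon EC2 API. It compares paying for instances as traffic requires against provisioning for the peak the whole time.

```python
import math
from typing import List


def instances_needed(requests_per_second: float,
                     capacity_per_instance: float,
                     minimum: int = 1) -> int:
    """Return how many instances are needed to serve the given load."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    return max(minimum, math.ceil(requests_per_second / capacity_per_instance))


# Hypothetical hourly traffic around a big mention (requests per second).
traffic: List[int] = [20, 35, 900, 1500, 400, 30]

# Assumption: one instance comfortably serves 100 requests per second.
plan = [instances_needed(rps, capacity_per_instance=100) for rps in traffic]

elastic_instance_hours = sum(plan)            # pay only for what each hour needs
fixed_instance_hours = max(plan) * len(plan)  # provision for the peak all day

print(plan, elastic_instance_hours, fixed_instance_hours)
```

With these made-up numbers the elastic plan uses roughly a third of the instance-hours of the provisioned-for-peak plan, which is the whole argument in miniature.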

-- Jeff;

Updates:

  1. TechCrunch reference architecture came from CrunchBack, not from Phil Wolff.
  2. Lots of good conversation can be found in the comments.

Commodity Computing with Amazon's S3 and EC2

Researcher Simson Garfinkel has written a detailed review of Amazon's S3 and EC2 services.

His new article, Commodity Computing with Amazon's S3 and EC2, starts out by reviewing the basics of S3 and then digs deeper into performance and security. After a brief review of EC2, he describes his own s3_glue, a C++ client library for S3. You can find s3_glue attached to this discussion board thread.

While the overall review is positive, Simson identifies some areas where there is clearly room for improvement. We find this kind of information to be extremely helpful. The feedback that we get from reviews, discussion board messages, audience Q&A, and private emails is taken very seriously and feeds directly into our product plans.

-- Jeff;

Top Ten Mistakes Startups Make Building Technology

Have been thinking about startups and technology, for the obvious reason that Amazon Web Services seem to serve Startup needs so well. That led to some discussion, followed by more thinking, and finally an inevitable “top ten” list.

But rather than just saying “these are the top ten mistakes that Startups need to avoid”, it seems like the perfect opportunity to ask all of you what YOU think the top pitfalls are. Feel free to share your actual experiences, or those of your "friend." What do you think? Love to hear your comments. Just asking that you restrict your list to technical reasons: we’ll save the business reasons for another list.

There’s a zillion ingredients in startup success. The purpose of this list is to identify ten technical pitfalls to avoid in a Web startup.

  1. Failure to anticipate success, and failure to architect for it. (You can’t anticipate all the bottlenecks in advance, or at least you can’t afford to out-engineer them.)
  2. Failure to plan for failure (a.k.a. over-investment in hardware leads to inability to exit one idea and move on to the next one.)
  3. Bad Location (Internet Alley instead of Internet Highway means that your bottleneck is bandwidth, latency, and second-tier operational environments).
  4. Technology Religion: (The louder someone’s opinion on a particular technology, the smaller the chance that their opinion is well reasoned.)
  5. Late adoption (Contrary to the "bleeding edge" cliche, early adopters are able to use technology as a differentiator that accelerates them out in front of the competition.)
  6. Failure to use technology as a strategic weapon. (Viewing technology as overhead or strictly as an operational expense is the fastest road to making decisions for the wrong reason.)
  7. Failure to plan results in an urgent care center rather than an online business. (You can’t just throw stuff together and expect success.)
  8. Selecting the wrong bank, and the wrong payment gateway. (There are many anecdotes about gateway horror stories)
  9. Staying in the closet too long. (Startups are about success, and they thrive on new business. It’s better to iterate on what works rather than hide behind a beta, because success never finds your plan, it just finds you.)
  10. Adding audio to your home page. (Doh!)

-- Mike

Amazon's Werner Vogels on Scalability

Over the last couple of years I have found that listening to podcasts is an effective way to learn and to keep up with what's happening in the software industry. My car has an iPod interface built in, so I spend a few minutes each morning setting up a playlist for my daily commute.

Last week I listened to Amazon's Werner Vogels talk about Scalability while walking my dog. In this 27 minute IT Conversations podcast, Werner talks about Amazon as a platform. He also discusses some of the ways that we stress-test our environment. It is a very interesting talk, and I think you will find that listening to it was a good use of your time.

-- Jeff;

PS - If you dig deep into the IT Conversations archive, you'll find this ancient interview with me as well.

Developer Help Wanted

I received a call for help via LinkedIn early this morning:

How can I find a developer familiar with Amazon's API for a mashup I'm planning?

We don't have a formal way to make these connections right now, but the answers to this question are certainly interesting and thought-provoking. Among other things, we've thought about adding a "help wanted / help available" section to our Discussion Forums. This would be a natural rendezvous point for those who can provide development services, and those who need them.

If you think that this is a good idea, or if you have a better one, please feel free to leave a comment!

-- Jeff;

 

TalkCrunch interview with Jeff Bezos

If you've got 17 minutes to spare, this TalkCrunch interview with Jeff Bezos is worth a listen. In addition to learning more about our Web-Scale Computing initiative, you can hear Jeff's famous laugh for yourself. There's also a summary of the talk on TechCrunch.

If that's not enough, there's another podcast interview over at Information Week, and here's the transcript.

-- Jeff;

Avoiding a Success Disaster

For a while I have been using the term "success disaster" to characterize what can happen on the web all too easily. What's a success disaster? You put up a piece of content somewhere and you get ready to handle a reasonable number of downloads.

Because you are the creative person that you are, however, links to your content show up on Digg and Slashdot the same day that you are written up on TechCrunch. Suddenly the whole world wants in, and unless you've been Slashdotted before, you have no idea how to respond. Your basic assumptions about server capacity, network traffic, and data transfer limitations have just been blown out the window.

Getting more hardware to address this peak demand is probably not the right thing to do. All that traffic will go away just as quickly as it came, and you don't want to raise your monthly burn rate just to accommodate these infrequent peaks in demand.

You need a shock absorber to help you deal with this transient surge in attention.

Think of it this way. If 200 of your closest friends suddenly showed up on your doorstep and said that they'd be hanging around for a while, you probably wouldn't drop everything and build an extension to your house. Instead, you would use an on-demand resource, in this case the nearest hotel, to handle this (hopefully transient) need for more room.

The folks over at the Spanning Sync Blog found themselves in just this situation a few days ago. After putting a new video online, traffic surged and they were saturating their server's network connection. They quickly moved the video over to Amazon S3 and the downloads proceeded very smoothly. In fact, they served up 11,726 copies of the video in just one day (more info in the comments).

For more information, read their post: 5,000 Video Downloads. Time for Amazon S3.

-- Jeff;

And This Too...

Jon Boutelle, CTO of Slideshare.net (previously featured here) sent me some cool comments that just happen to reinforce the Web-Scale message I've been talking about recently. Here's what he had to say (links are mine):

The dedicated hardware we were initially considering would have cost $1000 to startup, and $800/month in ongoing costs. Most importantly, this would have meant 1 month in time-to-market lost as we configured the hardware and customized and tested our software on it!

Hearing about the success that SmugMug had, we looked at their html, and realized that it would be simple to use S3 for our purposes. It took less than one day to switch our code over to S3 (using the Ruby library provided by Amazon), with no support from anyone at Amazon.

S3 provided a scalable solution from day one. A massive surge in traffic doesn't stress our own system: in fact on the first day our site was hosting embeds on techcrunch and similar sites without any problems. And our costs using S3 have been 16 times lower than they would have been using dedicated hardware, since we only pay for what we use.

What more can I say? Jon's doing my job for me, and I couldn't be happier!

-- Jeff;

This is Web-Scale...

There's a really interesting post over on the Texas Startup Blog. Here are some tidbits:

  • "The Amazon web services products (Amazon Elastic Compute Cloud - EC2 and Amazon S3) are built for the little guy AND the big guy."

    Yes, absolutely. And let's not forget the little guys who want to become big guys -- start small and then grow, or as we say around here sometimes, "From dorm room to board room."
  • "I know of more than ten startups in Dallas that are using Amazon’s services as a way to start without spending any money on servers, bandwidth and colocation.  This is big."

    This really changes the economics for startups. Put your precious capital into building up your proprietary systems, not into depreciating infrastructure. I've personally talked to several startups that have used the Web-Scale model to forego venture capital entirely.

Read the whole post: 1000 Pounds Web Hosting Gorilla: Amazon. There's some good info in the comments too.

-- Jeff;

Amazon EC2, MySQL, Amazon S3

I was on a conference call yesterday and the topic of ways to store persistent data when using Amazon EC2 came up a couple of times. It would be really cool to have a persistent instance of a relational database like MySQL but there's nothing like that around at the moment. An instance can have a copy of MySQL installed and can store as much data as it would like (subject to the 160GB size limit for the virtual disk drive) but there's no way to ensure that the data is backed up in case the instance terminates without warning.

Or is there?

It is fairly easy to configure multiple instances of MySQL in a number of master-slave, master-master, and other topologies. The master instances produce a transaction log each time a change is made to a database record. The slaves or co-masters keep an open connection to the master, reading the changes as they are logged and mimicking the change on the local copy. There can be some replication delay for various reasons, but the slaves have all of the information needed to maintain exact copies of the database tables on the master.

Put another way, the master essentially implements a simple service API for fetching changes as they occur and the slaves slavishly do those same changes.

Hmmm...services...

What if the slave (client) wasn't another instance of MySQL? What if it was a very simple application which pulled down the transaction logs and wrote them into Amazon S3 objects on a frequent and regular basis? If the master were to disappear without warning (I could say crash here, but I won't), the information needed to restore the database to an earlier state would be safely squirreled away in S3.

For recovery, we need another service. This one pretends to be a master, but it simply pulls out that squirreled-away cache of transaction logs from S3 and feeds them to a MySQL instance which it is temporarily slaved to. After a replay of all of the transactions, the slave becomes the master and processing resumes.
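Here is a toy model of that ship-then-replay idea, written in Python. It is a thought experiment in code, not a working MySQL or S3 integration: FakeObjectStore is an in-memory stand-in for S3, the "database" is a plain dictionary, the (operation, key, value) log format is invented for the example, and none of the names correspond to a real replication API.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, str, str]  # (operation, key, value); invented log format


class FakeObjectStore:
    """In-memory stand-in for Amazon S3 (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.objects: Dict[str, List[Change]] = {}

    def put(self, key: str, batch: List[Change]) -> None:
        self.objects[key] = list(batch)

    def keys_in_order(self) -> List[str]:
        return sorted(self.objects)

    def get(self, key: str) -> List[Change]:
        return self.objects[key]


def apply_change(table: Dict[str, str], change: Change) -> None:
    """Apply one logged change to a table, as a replication slave would."""
    operation, key, value = change
    if operation == "set":
        table[key] = value
    elif operation == "delete":
        table.pop(key, None)


def ship_log(store: FakeObjectStore, log: List[Change], batch_size: int) -> None:
    """The simple 'slave': write the log to the store in numbered batches."""
    for number, start in enumerate(range(0, len(log), batch_size)):
        store.put(f"binlog-{number:06d}", log[start:start + batch_size])


def replay(store: FakeObjectStore) -> Dict[str, str]:
    """The recovery service: feed every stored batch, in order, to a fresh table."""
    table: Dict[str, str] = {}
    for key in store.keys_in_order():
        for change in store.get(key):
            apply_change(table, change)
    return table


# The master applies changes and logs each one.
master_table: Dict[str, str] = {}
transaction_log: List[Change] = [
    ("set", "user:1", "ann"),
    ("set", "user:2", "bob"),
    ("set", "user:1", "anne"),
    ("delete", "user:2", ""),
]
for change in transaction_log:
    apply_change(master_table, change)

store = FakeObjectStore()
ship_log(store, transaction_log, batch_size=2)

# The master disappears; rebuild its state from the stored log alone.
restored_table = replay(store)
print(restored_table == master_table)
```

The zero-padded object names matter: replay depends on reading the batches back in the order they were written, so the keys must sort the same way the log was produced.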

Make sense? Could this work? What do you think, Brian?

I'd better run, or I'll be late for my talk!

-- Jeff;

Using Amazon EC2 to Explore Server Configuration Parameters

Earlier this week I was trying (and failing miserably) to explain an idea to a co-worker. I promised to clear up my thoughts and to encapsulate them in a blog post, so here goes.

In a nutshell, if you take a peek behind the scenes at a web site, you will find a highly configurable (and with any luck, very highly tuned) set of services -- application servers, database servers, queues, and so forth. In my experience, tuning even a single service to behave well under a particular load can be very difficult and time consuming. All too often the temptation is simply to add hardware, when in fact this will simply spread the inefficiency to even more locations. On the other hand, finding the proper combination of configuration values can seem like a never-ending pursuit of a mythical sea creature.

Consider a database server such as MySQL and its associated configuration file, my.cnf. This file contains dozens of tunable parameters, many of which affect how much memory is allocated and how that memory is used. Examples of such parameters include the sort_buffer_size, join_buffer_size, and query_cache_size. Given the finite amount of RAM available on the system, it is simply not reasonable to set each of these parameters to overly generous values since there are multiplicative effects which would raise the overall amount of RAM consumed by MySQL to an impractically large value. There are also interaction effects, where raising one value makes one function more efficient while slowing down others.
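To make the multiplicative effect concrete, here is a rough back-of-the-envelope estimate in Python. It assumes that sort_buffer_size and join_buffer_size are allocated per connection while query_cache_size is a single global allocation, and that every connection holds one buffer of each kind at the same time. The connection count (200) and the 64 MB query cache are hypothetical values chosen for illustration, and a real server can allocate more or less than this.

```python
MB = 1024 * 1024


def estimated_buffer_bytes(sort_buffer: int, join_buffer: int,
                           query_cache: int, max_connections: int) -> int:
    """Rough estimate: one global cache plus per-connection buffers."""
    per_connection = sort_buffer + join_buffer
    return query_cache + max_connections * per_connection


# Modest settings versus the top of the ranges explored in this post.
modest = estimated_buffer_bytes(2 * MB, 1 * MB, 64 * MB, max_connections=200)
generous = estimated_buffer_bytes(16 * MB, 32 * MB, 64 * MB, max_connections=200)

print(f"modest:   {modest // MB} MB")
print(f"generous: {generous // MB} MB")
```

Under these assumptions the per-connection buffers, not the global cache, dominate the total, which is why generous values become impractical so quickly.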

Let's call the set of parameters that are of concern the parameter space. Perhaps we want to explore what happens as the sort_buffer_size varies from 1 MB up to 16 MB, while also varying (in all possible combinations) the join_buffer_size from 1 MB up to 32 MB. This is a two-dimensional space, but it could have any number of dimensions from 1 on up.

Although there are a number of excellent guides to MySQL optimization, getting it right is still a big job.

I would like to propose the use of a structured benchmarking system built around Amazon EC2 to help developers measure and optimize complex servers similar to the one I've described above. Let's start with a simple setup using just three instances:

  1. The first instance is a controller or test harness. It requires network access to the other two instances. While iterating over the parameter space, the controller repeatedly sets up the server under test, fires off a simulated load, and collects the results.
  2. The second instance is the test subject server. It would hold the service to be tested and optimized, and would also incorporate a simple web service (called by the controller) with the power to set server parameters (e.g. MySQL's sort_buffer_size) and to start and stop the service. This instance would also contain a copy of the database (if relevant).
  3. The third and final instance is the load generator. Also under the direction of the controller, this instance produces a repeatable, controlled load on the test subject server. If the server in question is a database server, the load generator would fire off and benchmark a series of queries to the database, measure execution time, and report back to the controller when done.
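The division of labor among the three instances can be sketched as follows. This is a hypothetical outline, not a working harness: `Controller`, `configure_server`, and `run_load` are names invented here, and in a real setup the two callables would be network calls to the test subject server and the load generator.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Controller:
    # Stand-in for the web service on the test subject server (instance 2).
    configure_server: Callable[[Dict[str, int]], None]
    # Stand-in for the load generator (instance 3); returns elapsed seconds.
    run_load: Callable[[], float]

    def run_trial(self, params: Dict[str, int]) -> Dict[str, float]:
        """Set up the server, fire off the load, and collect the result."""
        self.configure_server(params)
        elapsed = self.run_load()
        return {**params, "seconds": elapsed}


# Fake collaborators so the sketch runs without any servers.
configurations_seen = []
controller = Controller(configure_server=configurations_seen.append,
                        run_load=lambda: 1.5)

result = controller.run_trial({"sort_buffer_size_mb": 4,
                               "join_buffer_size_mb": 8})
```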

The controller iterates over the parameter space like this:

for (sort_buffer_size = 1; sort_buffer_size <= 16; sort_buffer_size++)
{
  for (join_buffer_size = 1; join_buffer_size <= 32; join_buffer_size++)
  {
    // Configure test server (sort_buffer_size, join_buffer_size)
    // Start load generator and run test
    // Record results
  }
}

The body of the inner loop is executed 512 times (16 x 32), and the parameters and timing information are recorded after each iteration. The net result (if viewed graphically) would be a 2-dimensional grid of parameter values and the resulting execution times (other metrics could also be gathered, of course). Visual inspection of this grid (for local minima and maxima) would provide considerable insight into the server's performance in different configurations.
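As a minimal sketch of the exhaustive sweep, the following Python builds the full grid and picks out the fastest cell. The `simulated_seconds` function is a made-up stand-in for a real benchmark run (its shape and its minimum are arbitrary assumptions); in practice each value would come from the load generator.

```python
from itertools import product


def simulated_seconds(sort_mb: int, join_mb: int) -> float:
    """Hypothetical stand-in for one benchmark run (not real MySQL data)."""
    return 10.0 + 0.05 * (sort_mb - 8) ** 2 + 0.02 * (join_mb - 12) ** 2


# Every combination of sort_buffer_size 1..16 MB and join_buffer_size 1..32 MB.
grid = {
    (sort_mb, join_mb): simulated_seconds(sort_mb, join_mb)
    for sort_mb, join_mb in product(range(1, 17), range(1, 33))
}

# The cell with the lowest execution time.
best = min(grid, key=grid.get)
print(len(grid), "trials; fastest configuration:", best)
```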

This model could be expanded to use multiple load generators and/or test servers, and it could also serve to test the fairness of a load balancer.

It might also be possible to avoid an exhaustive search of the parameter space by using Monte Carlo methods (trying some points at random and then paying more attention to the most promising areas).
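One simple way to read "try some points at random, then pay more attention to the most promising areas" is a two-stage search: sample random points, then test every neighbor of the best one. The sketch below does that. The sample count, the neighborhood radius, and the `simulated_seconds` stand-in are all assumptions made for illustration, and nothing guarantees that this finds the true optimum on a real server.

```python
import random
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]  # (sort_buffer_size MB, join_buffer_size MB)


def simulated_seconds(sort_mb: int, join_mb: int) -> float:
    """Hypothetical stand-in for one benchmark run (not real MySQL data)."""
    return 10.0 + 0.05 * (sort_mb - 8) ** 2 + 0.02 * (join_mb - 12) ** 2


def random_then_refine(measure: Callable[[int, int], float],
                       rng: random.Random,
                       samples: int = 40,
                       radius: int = 2) -> Tuple[Point, float, int]:
    results: Dict[Point, float] = {}

    # Stage 1: measure randomly chosen points in the parameter space.
    for _ in range(samples):
        point = (rng.randint(1, 16), rng.randint(1, 32))
        if point not in results:
            results[point] = measure(*point)

    # Stage 2: measure every point near the best random sample.
    s0, j0 = min(results, key=results.get)
    for s in range(max(1, s0 - radius), min(16, s0 + radius) + 1):
        for j in range(max(1, j0 - radius), min(32, j0 + radius) + 1):
            if (s, j) not in results:
                results[(s, j)] = measure(s, j)

    best = min(results, key=results.get)
    return best, results[best], len(results)


best_point, best_seconds, trials = random_then_refine(
    simulated_seconds, random.Random(2006))
print(f"best {best_point} at {best_seconds:.2f}s after {trials} trials")
```

With these defaults the search runs at most 65 trials instead of 512, at the price of possibly settling on a local minimum.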

At first glance this might not sound like it has a lot to do with EC2, but something I hear a lot when I go out on speaking tours and get to talk one-on-one with developers is that they simply don't have much in the way of infrastructure to test scalability, performance under load, or alternative configurations. Any and all available hardware is currently configured to be part of the production system and there's simply no test hardware to spare in advance of weekly or monthly site updates. Given the sporadic need for such hardware (on average you need almost none, but for a couple of hours a month you need a lot), an on-demand solution like EC2 makes perfect sense. Even if you need to run 3 instances flat-out for 24 hours, that will cost you just $7.20 (at $0.10 per instance-hour). That's a pretty small price to pay to get a highly tuned server.
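The arithmetic behind the $7.20 figure is easy to check. The hourly rate of $0.10 per instance used below is the one implied by the numbers in this post ($7.20 / (3 x 24)); treat it as an assumption, since prices change.

```python
from decimal import Decimal

HOURLY_RATE = Decimal("0.10")  # dollars per instance-hour, implied by the post


def ec2_cost(instances: int, hours: int, rate: Decimal = HOURLY_RATE) -> Decimal:
    """Cost in dollars of running `instances` machines for `hours` each."""
    return instances * hours * rate


full_day = ec2_cost(3, 24)     # three instances flat-out for 24 hours
short_run = ec2_cost(3, 2)     # the same setup for a couple of hours
print(f"24 hours: ${full_day}, 2 hours: ${short_run}")
```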

I would be very interested in hearing (via comments) your thoughts on this quick note.

-- Jeff;

 

Building a Telco For 15 Cents Per Hour

Fixed costs are the enemy of any business. Money that must be invested up front to pay for land, buildings, furniture, machine tools, and computers constitutes a fixed cost. Regardless of the amount of income that's coming in, interest must be paid on the capital expended on fixed costs.

In a post titled "Amazon S3... Building a Telco for only $0.15 per hour", Nuclei Networks CEO Thomas Anglero describes how variable cost, on-demand infrastructure is a game changer for telco (fixed and mobile phone) operators.

Thomas certainly sees that something is afoot at Amazon:

"While focusing on my IP communications start-up, (Nuclei Networks) I have learned a great deal and one company that keeps appearing on my radar again and again when contemplated opportunities for new business models and improvements in infrastructure performance, Amazon.com!"

Thomas describes how Amazon S3 and Amazon EC2 can be used as the heart of a next-generation telephone company in a post peppered with real-world illustrations of how this can be done. For example, computing and issuing bills is a monthly exercise. Why keep a bunch of expensive hardware up and running all month if you only do this once per month for a couple of days? Use some EC2 time when you need it, and eliminate those fixed costs.

This might be my favorite part of his post:

"Amazon's S3 and EC2 web services are the foundation building blocks that can potentially shift the IT prowess of nations. We all joked about start-ups that came from people's garages. Well in some countries they are too poor to even have a garage (or know what it is), but for $0.15 per GB-month they now can provide services to the entire world without any infrastructure, only some code."

Hard to argue with that.

-- Jeff;

I'll have a Lemonade and Some Links, Por Favor...

I simply couldn't wait until next week to post a couple of new items!

In S3 Meets R3 (Reliability, Robustness, and Resilience), the authors benchmark Amazon S3 against the venerable SCP (Secure Copy) protocol. You can read the entire article to see the details, but the conclusion pretty much sums it up:

Amazon's S3 Services provide a standards-based and high performance mechanism for managing content. S3's performance characteristics make it ideal for end-user as well as enterprise applications where cost-effective data back up is desired. During our regression we never had a problem with reading or writing files. The S3 service had enterprise-class availability and reliability.

In addition to declaring that S3 is enterprise-ready, the authors noted that the entire battery of tests cost just 52 cents to perform.

Also in Dr. Dobb's, be sure to read about Synchronizing Files with .Net 2.0, S3, and FTP.

Meanwhile, Martin Kochanski continues his informative series of S3 articles.

Thought-Provoking Series of S3 Posts

I met Martin Kochanski, developer of Cardbox, in London last month. We met at the Athenaeum Club and had a very pleasant working lunch.

As we talked, it was really clear to me that Martin held a number of interesting opinions about all sorts of subjects and we talked about blogging as an information sharing vehicle. Today, I see that Martin has posted S3 in Business: 1 - Introduction. This is the first in a two week series of articles. The first part is interesting and thought-provoking, and I'm looking forward to the rest of the series.

-- Jeff;

Amazon S3 and SmugMug

SmugMug CEO Don MacAskill describes how they use Amazon S3 in his newest blog post, Amazon S3 = The Holy Grail. In the post, Don reveals that SmugMug stores 500 million images which collectively occupy 300 terabytes of storage.

Don goes on to reveal how S3 allowed SmugMug to take their architecture to the next level of safety, security, economy, and speed. This sentence in particular caught my eye:

Perhaps even more importantly, our cash-flow situation is vastly improved. Instead of paying $25,000 for a handful of terabytes of redundant storage up-front, even before they’re used, we now pay $0.15/GB/month as we use it.

An entire business school case study should be written around that one sentence!

By providing infrastructure components on an as-needed, pay-as-you-go basis, I believe that we are changing the economics of building large-scale web applications. Don notes that S3 is a playing-field leveler, lowering barriers to entry. Indeed, the chasm between concept and reality is now significantly narrower. Ventures that once required angel funding to get off the ground can now be built from the comfort of a dorm room or a kitchen table. As long as you have a business model that provides you with a return as soon as you have primed the pump, you can get started without making those up-front $25K investments on a regular basis.

-- Jeff;

Very Interesting Game-Changer using Mechanical Turk

There's a really interesting post over at the BitPorters blog today. Read it, think about it, and let us know (via comments here or on the original post) what you think.

I regularly tell my audiences that our services allow developers to be innovative and creative. I also tell them that innovation and creativity aren't limited to building something that's new and cool, but that they can mean creating an entirely new business model in addition to (or even instead of) building something new and cool.

Today's post is a perfect example of this. Sure there's code involved, but the real innovation is turning site visitors into site supporters by asking them to do a little bit of work on your behalf.

-- Jeff;

The Profit in Altruism

Long-time AWS developer MrRat wrote a really nice article for Revenews.

In The Profit in Altruism, MrRat recounts the story of how his Amazon Products Feed script came to be. At first he was looking to make an immediate profit from his script (I guess that would be a rats to riches story). After a while he decided to give his script away and to derive satisfaction from creating a name for himself, helping others, and creating a customer base for possible future products.

As he says, "You don't have to monetize every action to realize a profit."

Definitely not.

-- Jeff;

Cross-Domain XmlHTTPRequest

Today I had lunch with Peter Nixey of Web Kitchen. We talked about all sorts of interesting topics, including his recent blog posting, Why XHR Should Become Opt-In Cross Domain. This posting was written in a very interesting style and compares cross-domain scripting permissions to ordering beer in a pub. Peter explains the problem and the issues, and proposes some intriguing solutions. He's obviously given it a lot of thought and has shared his thinking in the piece. It is a must-read if you are building mashups or think of yourself as a competent "web 2.0" style developer.

I would be very interested in getting your feedback on this post -- is this a problem that we need to address, do you agree with his solution, would it work for you, and so forth? Reply to his post, and I'll check in to see what you have to say.

-- Jeff;

Jon Udell on SQS

Jon Udell's newest InfoWorld column talks about Amazon's pragmatic approach to metered infrastructure.

The entire article is worth reading, but I like this part the best: "Amazon’s S3/SQS duo is a green field that invites entrepreneurs to think way outside the box." Definitely!

-- Jeff;

Sometimes You Need Just a Little...

On the way to work this morning I stopped by my local gourmet supermarket for some Aztec Trail Mix. I went to the bulk foods aisle, found what I wanted, and used the dispenser to measure out exactly what I needed -- just enough for the next couple of days of random snacking.

I hopped back into my car and started my commute to Seattle. I never listen to the radio anymore. Instead I listen to a number of podcasts -- some technical, some business, some fun, and some that are totally random. This morning the latest edition of the Amazon Wire was at the top of the list, and I enjoyed listening to that. At the very end of the show, Pat Kearney (the host) was kind enough to credit me with helping with some of the "engineering" behind the show.

This actually made me think back to a time a couple of months ago when I was helping to get the first version ready to go. We use a blogging tool to produce the Wire RSS feed, but it needed some custom modifications before being sent along to FeedBurner for final processing. I took the basic feed, did my hand edits, and then needed a place to put the modified version.

I could have stored it on one of my personal servers, or I could have checked it into the official Amazon CMS (content management system). It didn't seem right to use one of my own servers, and I didn't have time to figure out the best way to use our CMS. I was literally holding a little pile of bits in my hand, and I needed a robust URL-addressable place to put them. As always happens with these things, it was a Sunday evening and I had promised to get this done before the official launch on Monday.

Of course, the answer was to upload the feed to Amazon S3, and to point FeedBurner at it. This was simple to do (I used the S3Curl example) and took just a few seconds. We already had an S3 account for our group, so I didn't even have to sign up.

At this point you are probably wondering what Aztec Trail Mix has to do with S3, and I am glad that you asked! Like that dispenser in the bulk foods aisle, S3 let me use just a little bit of disk storage, less than 5000 bytes. I got to choose how much I needed, and I didn't have to round up to the "family size" of Trail Mix, or use an entire dedicated server for data storage. This is the new world of scalable, on-demand web services. Pay for what you need and use, and not a byte more.

Best of all, that 5000 byte block of fast and reliable storage will cost far less than one penny per year to store, and the same to transfer.
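A quick calculation shows just how far below a penny the storage cost is. The $0.15 per GB-month storage rate is the figure quoted elsewhere on this blog; treating a GB as 10^9 bytes is an assumption made here for simplicity, and transfer charges are left out.

```python
from decimal import Decimal

STORAGE_RATE = Decimal("0.15")      # dollars per GB-month, as quoted on this blog
BYTES_PER_GB = Decimal(10 ** 9)     # assumption: decimal gigabytes


def yearly_storage_cost(size_bytes: int) -> Decimal:
    """Dollars per year to keep `size_bytes` stored at the quoted rate."""
    gigabytes = Decimal(size_bytes) / BYTES_PER_GB
    return gigabytes * STORAGE_RATE * 12


feed_cost = yearly_storage_cost(5000)
print(f"${feed_cost:.6f} per year")
```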

-- Jeff;

Making Money With Web Services

Today I thought that it would be worthwhile to write a back-to-basics post on the topic of making money with web services. I've been talking about this subject for a couple of years, but this is the first time that I have pulled together all of the material in written form.

In this post I will take for granted the fact that you want to make money using web services. There are lots of other reasons to use web services that don't involve money -- for research purposes, as a learning vehicle, a creative outlet, or simply to have fun. Those reasons are all totally legitimate, but I won't be addressing them today.

Introduction

Some people start out with the desire to create enough income from their web business to replace their full time job. This is a great goal and I know many people who have been able to do this. At first, you should think of your new enterprise as a side business, with full-time potential. The technical term for this is a "money machine." My favorite book on this subject is Don Lancaster's Incredible Secret Money Machine. Although the book was written in the pre-web era it contains a lot of excellent advice and a number of timeless truths about work and life. The book is apparently out of print, but available directly from Don or through the Amazon Marketplace. You can read the first chapter here.

So what do we need in order to make money using web services? Here is my laundry list:

  1. A great idea.
  2. Technical skills.
  3. Web services.
  4. A business model.
  5. Traffic.
  6. Metrics.

Let's address each of these in turn.

Great Idea

I listed this first, but you don't always have to start here!

A lot of cool things have been built by people who just start experimenting in an area that they find interesting. Many people have an almost intuitive sense of what the world needs, or they start by solving a problem that's of interest to them. You can approach this question a bit more formally by asking yourself what unmet need you can fill with your proposed business. Or you can look around at your friends and family and see if there's a way to use technology and web services to make them more informed or more productive.

Technical Skills

Once you know what you want to build, you will need to actually build it. I'm certainly not going to give you a crash course in client-server Ajax Web 2.0 development in the next paragraph or two.

If you have some computer science training and you want to gain some genuine web services experience to enhance your career and your resume, building your own money machine on the side is a great way to get started. At first your vision may very well exceed your ability to implement it, but as your skills grow this will become less and less of a constraint.

If you don't have the formal background or experience, you may find it advisable to find a partner with complementary skills. In many cities you can find meetups or user group meetings on technical subjects such as PHP programming or web development.

Web Services

You can find all sorts of interesting web services over at the Programmable Web. As I write this there are currently 222 distinct services listed there, and more are added almost every day.

This might seem obvious, but you have to proceed with care as you choose your services. Not all web services are of commercial grade, and not all services are licensed for commercial use. Before you go too far down any particular road, it is advisable to check the license agreement for the services that you intend to use, and make sure that your use is in accord with the license. You can use any of Amazon's web services for commercial purposes, but you should still familiarize yourself with the license.

The simplest applications will call a single service. That's a great way to start. As you progress, you can start thinking about creating a mashup which combines data from two or more services. You can see what's already been built by looking at the Mashup Matrix. While you are there, look at the "white space" (mashups that don't yet exist) and use this as you start the creative process.

Business Model

There are lots of different ways to define a business model. For the purposes of an online or web application, my definition is quite simple: the business model turns traffic (site visitors) into money, in a fashion that grows in proportion to the amount of traffic that's present.

Two of the main web business models are affiliate programs and advertising.

With an affiliate program, you (the site owner) basically act as a sales agent for goods or services offered by other organizations. The Amazon Associates program is a great example of what I'm talking about. To get started, you apply to the program. After they verify that your site is within their guidelines, they will supply you with an Associate Id. This is simply a short string (something like "webservices-20") that identifies you to Amazon. You include your Associate Id in the web service calls that you make to the Amazon E-Commerce Service. The links returned by ECS are then customized to include your Id. When you take the ECS data, put it on a site and use the supplied links, you will earn commissions (currently ranging up to 8.5%) on the sales that happen through your links.

There are other affiliate marketing programs in existence, and there's even a regular Affiliate Summit conference. You should be able to find out a lot more with a bit of searching. Once again, be sure to check the rules before you go off and write a lot of code.

With advertising, you dedicate some portion of your site's real estate to the display of advertisements. Some advertisers will pay you on an exposure basis, paying you based on the number of times that your site displays their ad. Others will pay on a click (also known as a CPC or cost-per-click) basis. Finally, the most sophisticated advertisers are moving toward CPA, or cost-per-action. With this model you get paid when your site visitor takes the action desired by the advertiser (signing up for a list, making a purchase, and so forth). Again, there are many advertising networks out there. You can learn a lot more about advertising models and terminology here and here.

Traffic Generation

At this point you have realized your vision in code, and you have a great business model. You know for a fact that more visitors will mean more revenue, and you are ready to roll. Now you need to attract some visitors!

The web has created an amazing number of publicity vehicles and you should take full advantage of each of them.

Word of mouth, as always, is great. If you show your site to one person and he's so impressed that he in turn shows it to two more people, you've got a hit on your hands. This is often called customer evangelism (of course there's a great blog on the subject). Treat your visitors and customers right and they'll naturally want to tell others.

If you have a blog, write about what you are doing, and write about the industry context in which it operates. Show that you are an expert not just on your own application but about the entire field. A good example of this is the President's Blog, run by Tom Snyder. Tom's blog shows that he is immersed in the field, and that he's fully cognizant of the technology and of his competition.

Get others to blog about you. This is another form of word of mouth advertising, and is very powerful. If you are using the Amazon Web Services, drop me a note and I'll be happy to queue you up for an article or a mention. Use your own blog to link to others in the industry, and make polite, respectful requests that they consider blogging about what you've done. Do something so cool, compelling, and relevant that they'll want to do this before you even have to ask.

You will want to make sure that the relevant search engines are aware of your site, and you may want to learn about two really important acronyms, namely SEO and SEM. SEO is short for Search Engine Optimization and you can read more here. SEO refers to the practice of getting good placement and representation in the search engines. SEM is short for Search Engine Marketing, and you can read more here. SEM and SEO can be a bit difficult to distinguish; my recommendation is to learn a bit about each one and to decide for yourself which avenue you would like to take.

Finally, you can also buy traffic by running advertising of your own; this is a form of Search Engine Marketing. There are many online advertising networks, and they each have their own good and bad points. With most of them you will ultimately end up "buying" keywords, often bidding against other people who would also like to send traffic to their sites. Perhaps your site sells rubber chickens. You could buy search engine phrases and keywords such as "rubber chicken," "rubber duck," and other popular variations. Once you start buying advertising, you will want to make sure that you understand all of the factors that influence each component of your business model. If you spend $1 to attract a visitor to your site, but earn just 50 cents per visitor, you are losing 50 cents on each click and you won't be in business for very long. On the other hand, if you spend 10 cents to attract a visitor and you still earn 50 cents, you are making 40 cents on each click and your model is very simple -- pour more money into advertising, and collect more in revenue (I know there are lots of other factors at the extremes; I have kept this simple to prove my point).
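The two scenarios above reduce to a one-line margin calculation. A minimal sketch, working in whole cents to avoid rounding surprises; the $100 budget in the last line is a hypothetical figure added for illustration, and, like the post, the model ignores everything that happens at the extremes.

```python
def margin_cents(revenue_per_visitor: int, cost_per_visitor: int) -> int:
    """Profit (or loss, if negative) on each purchased visitor, in cents."""
    return revenue_per_visitor - cost_per_visitor


def campaign_profit_cents(budget: int, revenue_per_visitor: int,
                          cost_per_visitor: int) -> int:
    """Total profit from spending `budget` cents on purchased visitors."""
    visitors = budget // cost_per_visitor
    return visitors * margin_cents(revenue_per_visitor, cost_per_visitor)


losing = margin_cents(revenue_per_visitor=50, cost_per_visitor=100)
winning = margin_cents(revenue_per_visitor=50, cost_per_visitor=10)

# Hypothetical $100 budget at 10 cents per visitor.
hundred_dollar_campaign = campaign_profit_cents(10_000, 50, 10)
```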

Metrics

If you are going to create a real business, you need to do a lot of measuring and you need to run your business based on what you see. You need to know how much traffic is coming to your site, how much money you are spending, and how much you are earning. You need to know which search engines are sending you the most traffic, and which ones are sending you the most valuable traffic. If you are paying for advertising, you need to know which of your keywords has the best "conversion" (the fraction of visitors who become purchasers).
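A home-grown dashboard can start very small. The sketch below computes conversion and cost per purchase for each keyword; the `KeywordStats` class and all of the numbers are hypothetical, with the keywords borrowed from the rubber chicken example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeywordStats:
    keyword: str
    visitors: int
    purchasers: int
    ad_spend_cents: int

    @property
    def conversion_rate(self) -> float:
        """Fraction of visitors from this keyword who made a purchase."""
        return self.purchasers / self.visitors if self.visitors else 0.0

    @property
    def cost_per_purchase_cents(self) -> float:
        """Advertising spend divided by the purchases it produced."""
        if not self.purchasers:
            return float("inf")
        return self.ad_spend_cents / self.purchasers


# Made-up numbers for illustration only.
stats: List[KeywordStats] = [
    KeywordStats("rubber chicken", visitors=1000, purchasers=30, ad_spend_cents=10_000),
    KeywordStats("rubber duck", visitors=2000, purchasers=20, ad_spend_cents=20_000),
]

best_keyword = max(stats, key=lambda s: s.conversion_rate).keyword
for s in stats:
    print(f"{s.keyword}: {s.conversion_rate:.1%} conversion, "
          f"{s.cost_per_purchase_cents:.0f} cents per purchase")
```

Note that the keyword sending the most traffic ("rubber duck" in this made-up data) is not the one sending the most valuable traffic.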

You can get some of this from expensive commercial packages, or you may end up building your very own business dashboard.

Conclusion

I hope that I've given you some food for thought here, as well as some guidelines for getting started. I'd love to get your reaction to this, so please feel free to post a comment or two. Let me know what you come up with. Good luck!

-- Jeff;

Podcast Interview: Dominic DaSilva, Author of jSh3ll

The Swampcast has an interview (in podcast form) with Dominic DaSilva, the developer of jSh3ll, a popular shell tool for Amazon S3. According to Dominic, "The talk covers the basics of Amazon S3, accessing it via jSh3ll, and a little code walkthrough of jSh3ll."

-- Jeff;

Amazon Mechanical Turk and Image Processing

Image processing (and its cousin, machine vision) both entail designing algorithms to process pixel-based images and to extract recognizable information from them.

Earlier today I saw the following request on the Ask Metafilter site:

i need an algorithm to extract statistics about the diameters (in pixels) of some roughly-circular objects from an image file.

Having done several presentations on the Amazon Mechanical Turk in the last couple of days, it sometimes feels like I am holding an all-purpose hammer in my hand!

Without knowing how many images the requester needs to process, I could see several interesting ways to get this job done using one or more types of Turk HITs (Human Intelligence Tasks).

First, the work could simply be converted into HITs. The worker would be presented with an enlarged image of the circular objects with a grid superimposed, and then asked to count the pixels, something like this:

[Figure: enlarged image of a roughly circular object with a pixel grid superimposed]

Of course, some Qualifications would be used to ensure high quality results, and the same HIT could be sent to more than one worker, with a plurality vote used to select the answer given by the most workers.
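Selecting the plurality answer from several workers takes only a few lines. This is a generic sketch, not part of the Mechanical Turk API; the `plurality` function and the sample answers are made up, and returning `None` on a tie is one possible policy (the HIT could then be sent to more workers).

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def plurality(answers: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the most common answer, or None if there is no clear winner."""
    if not answers:
        return None
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place
    return ranked[0][0]


# Diameters (in pixels) reported by five hypothetical workers.
agreed = plurality([42, 42, 41, 42, 40])
undecided = plurality([10, 11])
```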

Second, the Mechanical Turk could be used to run a quality check on the results of an actual image processing algorithm.

Third, the Mechanical Turk could be used to handle only certain images, perhaps those that don't clearly separate the image from its background, or those for which the algorithm returns an indeterminate result.

Fourth, the Mechanical Turk could be used to provide the feedback needed to train some type of learning algorithm, perhaps one that uses a neural net or something even more advanced. The machine would, in effect, ask the worker "is there a circle of size N pixels here?"

I'll be the first to admit that this isn't real Computer Science, but it is problem solving, and in the end that's what is really needed.

It is worthwhile to take a look at the economics here as well. How many hours is it going to take to program and test a suitable recognition and measuring algorithm, and how much would it cost to do all of the processing using HITs? How much testing will be needed to see how well the algorithm works, and what if the problem statement changes and we also need to measure squares or octagons?

Let's say that we pay workers 1 cent for each such HIT, and that we send each one to 5 workers to get a high quality result, accepting only those answers that are within 10% of the acceptable size. We end up paying around 4 cents per image.
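One reading of the 4-cent figure is that only accepted answers are paid for, so an image where one of the five answers is rejected costs 4 cents rather than 5. The sketch below works through that reading. Using the median of the answers as the reference size is an assumption made here (the post does not say how the acceptable size is determined), the sample diameters are made up, and any service fee is ignored.

```python
from statistics import median
from typing import List


def accepted_answers(answers: List[float], tolerance: float = 0.10) -> List[float]:
    """Keep the answers within `tolerance` of the median answer."""
    reference = median(answers)
    return [a for a in answers if abs(a - reference) <= tolerance * reference]


def cost_cents(answers: List[float], reward_cents: int = 1,
               tolerance: float = 0.10) -> int:
    """Pay the reward once for each accepted answer."""
    return reward_cents * len(accepted_answers(answers, tolerance))


# Diameters (in pixels) from five hypothetical workers; one is far off.
diameters = [40, 42, 41, 55, 43]
kept = accepted_answers(diameters)
paid = cost_cents(diameters)
```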

Sometimes it makes sense to put a real human being in the loop, and this is quite possibly one of them.

-- Jeff;

Product Idea: Linux Live CD With Integral S3 Access

Last week I did an AWS presentation at Westminster University. The audience was very lively and asked a number of great questions. They also wrote some nice blog posts.

One of the audience members asked if anyone had integrated Amazon S3 into a Linux live CD distribution. I'd thought about this a couple of times, but I had to tell him that no one had done this yet, but that I thought it was a really good idea. I'm stuck in a hotel room in Amsterdam today (it is raining cats and dogs right now), so I thought I'd take a little time to explain the concept in the hope that some enterprising hacker would pick up on this idea and run with it.

First, let's talk about the concept of a live CD. As far as I can tell, this idea was invented by the Linux community as a way to let people try out Linux without actually installing it on their machine. You simply insert the CD into the drive, boot, and you are up and running. The live CD contains a complete, bootable Linux distribution, often with a graphical user interface such as Gnome or KDE, and a full complement of tools, games, utilities and documentation. When booted on modern hardware there's no need to do any special configuration at all. A few months ago I burned a copy of Simply Mepis onto a CD at home, and booted it on a laptop. Without any help from me it configured the display, the touchpad, the network interface, and the entire network configuration.

The live CD was invented as a learning and transition tool, but there are many other uses. If you are heading into a "hostile" computing environment, you can simply carry a live CD with you, boot up, and get started without any risk of viruses, trojan horses, or other contamination. They are very popular in schools and among people who frequent internet cafes.

One of the great things about the live CD is that it doesn't need any permanent local storage. The operating system loads into RAM, and all of the other programs are simply run from the CD. There's often an option to install the files on a local hard disk, but this isn't necessary. It is amazing how much of what we now consider "computing" can be done without local storage -- web surfing, email access, instant messaging, and so forth. However, the lack of local storage is a definite shortcoming, and I think that Amazon S3 can help.

Imagine a live CD with built-in, pre-configured S3 access, custom-burned to access your very own S3 account.

You boot the CD from any machine that you'd like, and you are instantly connected to your data and your files. Wouldn't that be useful, and wouldn't it be sweet? Your files are permanently stored on S3, and you can get to them from anywhere that you would like.

Speaking as an armchair blogger, I don't think that this would be all that hard to do. Start with any of the fine live CD distributions, add in S3 access using JungleDisk or s3DAV, arrange for it to use the S3 partition for storage, and you're almost there. Add an optional customization step that would happen after the download and before the CD burn, prompt for the user's S3 account and bucket information, burn that to the CD, and Bob's your uncle, as they say.

I'm sure I've missed out on a detail or two, but I think that the concept is sound, and I would love to see this happen. If there's anything that I can do to help make this happen, drop me an email.

-- Jeff;

Entrepreneur's Perspective on Amazon S3

I've known Scott Johnson for quite a while now. He did some really great work at Feedster, and he's a diligent podcaster. Scott is never at a loss for words, so I asked him what he liked about Amazon S3, and how it helped him to architect and to build Ookles. Scott responded with a great blog post, "Why Ookles Used S3 — and why your startup should too."

Scott's rationale for why Amazon S3 is helping him to build his startup more quickly and with less cash is a must-read. This part in particular caught my eye:

Now being a startup there is nothing more important than cash on hand. If you don’t have cash then you’re effectively out of business. And since Amazon now offers S3 we can conserve our cash and buy gear only as we need to grow, not proactively in