Many fields in industry and academia are experiencing exponential growth in data production and throughput, from social graph analysis to video transcoding to high energy physics. Constraints are everywhere when working with very large data sets, and provisioning sufficient storage and compute capacity for these fields is challenging.
This is particularly true for the biological sciences after the recent quantum leap in DNA sequencing technology. These advances represented a step change for the field of genomics, which had to learn quickly how to house and process terabytes of data through complex, often experimental workflows.
Processing data at this scale for a single user is challenging, but moving to the cloud meant Michigan State University were able to provide real-world training to whole groups of new scientists using Amazon's EC2 and S3 services.
Titus Brown writes about his experiences of running a next-generation sequencing workshop using Amazon's Web Services in a pair of blog posts:
After the two-week event:
"Students can choose whatever machine specs they need in order to do their analysis. More memory? Easy. Faster CPU needed? No problem.
All of the data analysis takes place off-site. As long as we can provide the data sets somewhere else (I've been using S3, of course) the students don't need to transfer multi-gigabyte files around.
The students can go home, rent EC2 machines, and do their own analyses -- without their labs buying any required infrastructure."
"I have little doubt that this course would have been nearly impossible (and either completely ineffective or much more expensive) without it.
In the end, we spent more on beer than on computational power. That says something important to me."
This is a great example of using EC2 for ad hoc scientific computation, and of reaping the rewards of cloud infrastructure: low cost, reproducibility and scale.