Recently I blogged about The Server Labs, a consultancy that
specializes in high-performance computing – including on Amazon Web Services.
Here’s another story that I found fascinating. Nominally it is about how The Server Labs uses Amazon Web Services as a scale-out solution that also incorporates Oracle databases; however, it’s really about space exploration (or should I say “nebula computing”). It began with an email asking whether there would be a problem running up to 1,000 Amazon EC2 High-CPU Extra-Large instances.
The Server Labs is a software development/consulting group based in Spain and the UK that works closely with the European Space Agency, and they needed to prove the scalability of an application that they helped build for ESA's Gaia project. In addition to the instances, they also requested 2 large and 3 X-Large instances to host Oracle databases that coordinate the work being performed by the high-CPU instances.
Gaia’s goal is to make the largest, most precise three-dimensional map of our Galaxy by surveying an unprecedented number of stars: more than one billion. This, by the way, is less than 1% of all the stars in our Galaxy! The plan is to launch a mission in 2011, collect data until 2017, and then publish a completed catalog no later than 2019.
I had the opportunity to see a PowerPoint deck created and presented by The Server Labs’ founder, Paul Parsons, and their software architect, Alfonso Olias, who is currently assigned to this project.
The deck explained that the expected number of samples in Gaia is 1 billion stars x 80 observations x 10 readouts, which is approximately equal to 1 x 10^12 samples, or as much as 42 GB per day transferred back to Earth. There’s a slide in the deck that says “Put another way, if it took 1 millisecond to process one image, the processing time for just one pass through the data (on a single processor) would take 30 years.”
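The arithmetic behind that slide is easy to check. The following is only a rough sketch: the function names are made up for illustration, and the 1 millisecond per sample figure is the deck's hypothetical, not a measured value.

```python
# Back-of-the-envelope check of the sample count and single-processor time.
# Function names are illustrative; the inputs come from the figures quoted above.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def total_samples(stars, observations_per_star, readouts_per_observation):
    """Number of samples = stars x observations x readouts."""
    return stars * observations_per_star * readouts_per_observation


def single_processor_years(samples, ms_per_sample=1.0):
    """Years needed for one pass over the data on one processor."""
    seconds = samples * ms_per_sample / 1000.0
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    exact = total_samples(1_000_000_000, 80, 10)
    print(f"samples (exact product): {exact:.1e}")
    print(f"years at 1 ms/sample, exact product: {single_processor_years(exact):.1f}")
    print(f"years at 1 ms/sample, rounded to 1e12: {single_processor_years(1e12):.1f}")
```

The exact product is 8 x 10^11, which the deck rounds up to roughly 10^12. At 1 ms per sample, the rounded figure works out to a little under 32 years, in line with the "30 years" on the slide.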
As the spacecraft travels, it will continuously scan the sky in 0.7 degree arcs, sending the data back to Earth. Some involved algorithms will come into play in order to process the data, and the result is a fairly complex computing architecture that is linked to an Oracle database. Scheduling the cluster of computational servers is not quite so complicated; it is based on a scheduler that is focused on keeping each machine as busy as possible.
However, the amount of data to process is not steady; it will increase over time, which means that infrastructure needs will also vary over time. And of course idle computing capacity is deadly to a budget.
Large computational problems like this one are usually solved with grid computing, and this case is no different, except that, as mentioned above, the required size of the grid is not constant. Because Amazon Web Services is on-demand, it’s possible to apply just enough computational resources to the problem at any given time.
In their test, The Server Labs set up an Oracle database using an AWS Large Instance running a pre-defined public AMI. Then they created 5 EBS volumes of 100 GB each and attached them to the instance.
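To give a feel for what "keeping each machine as busy as possible" can mean, here is a minimal sketch of a generic greedy scheduler. It is not the scheduler used in the Gaia processing software, which the deck did not describe in detail; the function name, the job durations, and the longest-job-first policy are all assumptions chosen for illustration.

```python
import heapq


def assign_jobs(job_minutes, n_machines):
    """Greedy sketch: hand each job (longest first) to the machine that
    becomes free earliest, so that no machine sits idle while work remains.

    Returns (assignment, makespan), where assignment maps a machine index
    to the list of job durations it received.
    """
    # Heap of (time at which the machine becomes free, machine index).
    free_at = [(0.0, m) for m in range(n_machines)]
    heapq.heapify(free_at)
    assignment = {m: [] for m in range(n_machines)}

    for job in sorted(job_minutes, reverse=True):
        busy_until, machine = heapq.heappop(free_at)
        assignment[machine].append(job)
        heapq.heappush(free_at, (busy_until + job, machine))

    makespan = max(t for t, _ in free_at)
    return assignment, makespan


if __name__ == "__main__":
    # Hypothetical job durations in minutes.
    plan, finish = assign_jobs([100, 60, 40, 30, 30, 20], n_machines=2)
    print(plan, finish)
```

A real grid scheduler also has to deal with data locality, failures, and instances joining or leaving the grid; this sketch only shows the load-balancing idea.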
Then they created Amazon Machine Images (AMIs) to run the actual analysis software. These images were based on large instances and included Java, Tomcat, the AGIS software and an rc.local script to self-configure an instance when it’s launched.
The requirements break down as follows:
To process 5 years of data for 2 million stars, they will need to run 24 iterations of 100 minutes each, which works out to 40 hours running a grid of 20 Amazon EC2 instances. A secondary update has to be run once and requires 30 minutes per run, or 5 hours running a grid of 20 EC2 instances.
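The primary-run figure can be reproduced with a couple of lines of arithmetic. This is a sketch with made-up function names; it models only the primary iterations, using the numbers quoted above.

```python
# Wall-clock time and total instance-hours for the primary run.
# Inputs are the figures from the test described above.


def wall_clock_hours(iterations, minutes_per_iteration):
    """Elapsed hours when the iterations run one after another on the grid."""
    return iterations * minutes_per_iteration / 60.0


def instance_hours(cluster_hours, instances):
    """Total billable instance-hours for a grid of the given size."""
    return cluster_hours * instances


if __name__ == "__main__":
    hours = wall_clock_hours(iterations=24, minutes_per_iteration=100)
    print(f"wall-clock hours: {hours}")
    print(f"instance-hours on 20 instances: {instance_hours(hours, 20)}")
```

Since EC2 bills per instance, the second number is the one that drives cost: 40 hours on a 20-instance grid is 800 instance-hours.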
For the full 1 billion star project, the numbers extrapolate out more or less as follows: They calculated that they will analyze 100 million primary stars, plus 6 years of data, which will require a total of 16,200 hours of a 20-node EC2 cluster. That’s an estimated total computing cost of 344,000 EUR. By comparison, an in-house solution would cost roughly 720,000 EUR (at today’s prices), which doesn’t include electricity or storage or sys-admin costs. (Storage alone would be an additional 100,000 EUR.)
It’s really exciting to see the Cloud used in this manner, especially when you realize that an entire set of problems had solutions that were beyond economic possibility before the Cloud became a reality.
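To put those totals in per-hour terms, here is a small sketch that uses only the figures quoted above. The function names are illustrative, and it is an assumption that the EC2 estimate and the in-house estimate cover comparable things; the deck did not break either number down.

```python
# Rough comparison of the quoted EC2 and in-house estimates (all in EUR).

CLUSTER_HOURS = 16_200          # hours of a 20-node cluster
NODES = 20
EC2_ESTIMATE_EUR = 344_000
IN_HOUSE_EUR = 720_000          # excludes electricity, storage, sys-admin
IN_HOUSE_STORAGE_EUR = 100_000  # quoted separately


def cost_per_node_hour(total_eur, cluster_hours, nodes):
    """Average cost of one node for one hour implied by a total estimate."""
    return total_eur / (cluster_hours * nodes)


def cost_ratio(in_house_eur, cloud_eur):
    """How many times more expensive the in-house option is."""
    return in_house_eur / cloud_eur


if __name__ == "__main__":
    per_node = cost_per_node_hour(EC2_ESTIMATE_EUR, CLUSTER_HOURS, NODES)
    print(f"implied EC2 cost per node-hour: {per_node:.2f} EUR")
    print(f"in-house vs EC2: {cost_ratio(IN_HOUSE_EUR, EC2_ESTIMATE_EUR):.2f}x")
    with_storage = IN_HOUSE_EUR + IN_HOUSE_STORAGE_EUR
    print(f"in-house incl. storage vs EC2: {cost_ratio(with_storage, EC2_ESTIMATE_EUR):.2f}x")
```

On these numbers the EC2 estimate implies a little over 1 EUR per node-hour, and the in-house option comes out at roughly twice the EC2 figure before electricity and sys-admin costs are added.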
Mike


This is a really good example of how cloud computing can benefit scientific research. As a researcher myself (and one who's looking forward to making use of the Gaia dataset), I find that using EC2 is an extremely efficient way to use my computing budget. My computing needs vary from near zero for long periods of time while new algorithms are being developed, then ramp up to requiring hundreds of instances over short periods to perform calculations rapidly. Maintaining my own cluster would be economically impractical, but using EC2 allows me to buy the computing I need when I need it.
Posted by: Andrew | June 28, 2009 at 05:24 PM
Ioan Raicu has done a lot of work with dynamic resource provisioning for analysis of astronomy data. He uses grid resources (e.g., TeraGrid) for the most part, but the methods would translate directly to EC2. See, for example: http://people.cs.uchicago.edu/~iraicu/presentations/2009_HPDC09_06-13-09.pdf
Posted by: Ian Foster | August 07, 2009 at 12:32 PM