Hi, I am Deepak Singh, a business development manager at Amazon Web Services. One of my areas of focus is scientific computing on AWS, and I am guest blogging today about an exciting new initiative that will bring great benefit to researchers and scientists.
"Science was always about mashing up, taking one result and applying it to your [work] in a different way. The question is ‘Can we make that as effective [for] samples [of] data and analysis as it [is] for a map and set of addresses for a coffee shop?’ That is the vision." -- Cameron Neylon
One way to achieve Cameron Neylon's vision is to have access to public sources of data. This becomes even more powerful if scientists and analysts can use the available data to perform all kinds of computational and analytical tasks. At Amazon Web Services we believe that making it easy for people to get access to data spurs innovation. In line with that thinking, we have launched Public Data Sets on AWS, a new program that significantly lowers the barrier for researchers and data analysts to access and use some of the most commonly used data sets in their communities without the need to manage data within their own AWS accounts. Public Data Sets on AWS provides a convenient way to share, access, and consume publicly available data within your Amazon EC2 environment. Here is how it works:
- Select public data sets will be hosted by Amazon Web Services for free as Amazon EBS snapshots.
- You can access the data by creating your own personal Amazon EBS volume from a publicly shared Amazon EBS snapshot of a public data set.
- You can then access, modify, and perform computations on these data sets directly using an Amazon EC2 instance and just pay for the compute and storage resources that you use.
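The steps above can be sketched in Python. This is a minimal illustration, not an official procedure: it assumes an EC2 client object such as the one returned by `boto3.client("ec2")`, and the function name, the default device name, and every ID shown in the usage note are hypothetical placeholders.

```python
def volume_from_public_snapshot(ec2, snapshot_id, availability_zone,
                                instance_id, device="/dev/sdf"):
    """Create a personal EBS volume from a shared snapshot and attach it.

    `ec2` is assumed to be a boto3 EC2 client; the device name is an
    assumption and may need to change for your instance type.
    """
    # Create your own volume from the publicly shared snapshot. The volume
    # must be in the same Availability Zone as the instance that will use it.
    volume = ec2.create_volume(SnapshotId=snapshot_id,
                               AvailabilityZone=availability_zone)
    volume_id = volume["VolumeId"]

    # Block until the new volume is ready to be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the volume to the running instance as a block device.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id,
                      Device=device)
    return volume_id
```

A call might look like `volume_from_public_snapshot(boto3.client("ec2"), "snap-EXAMPLE", "us-east-1a", "i-EXAMPLE")`, where the snapshot ID, zone, and instance ID are placeholders for your own values. From that point on you pay only for the volume's storage and the instance's compute time.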
Some of the areas we have found people interested in include scientific research, economic data analysis, and market research. An example of a data set that we have seen interest in from the life science community is Ensembl. Ensembl is a joint project of the European Bioinformatics Institute and the Wellcome Trust Sanger Institute that produces and maintains automated annotation on a number of eukaryotic genomes. Ensembl have made their MySQL databases for Ensembl release 51 available via the Public Data Sets on AWS program and will continue to make updated versions of Ensembl available in the future. This data set consists of more than 650 GB of data and over 31,000 files. People who want to use the snapshot will be able to create an EBS volume from the snapshot, mount that volume on an EC2 instance launched from an AMI running MySQL, and configure the MySQL instance to point to the database files. In other words, you can now do bioinformatics in the cloud without having to keep your own copies of the Ensembl databases up to date.
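As a rough sketch of the last two steps (mounting the volume and pointing MySQL at the database files), the helpers below only build the shell commands and the configuration text; they do not run anything. The device name, the mount point, and both function names are assumptions made for illustration, and the layout of the files on the actual Ensembl snapshot may call for a different `datadir`.

```python
def mount_commands(device, mount_point):
    """Return the shell commands that mount an attached EBS volume.

    The device may appear under a different name inside the instance
    (for example /dev/xvdf instead of /dev/sdf), so check before running.
    """
    return [
        "mkdir -p {}".format(mount_point),
        "mount {} {}".format(device, mount_point),
    ]


def mysql_datadir_config(mount_point):
    """Return a my.cnf fragment that points MySQL's data directory
    at the mounted volume."""
    return "[mysqld]\ndatadir={}\n".format(mount_point)


if __name__ == "__main__":
    # Hypothetical device and mount point, shown for illustration only.
    for command in mount_commands("/dev/sdf", "/mnt/ensembl"):
        print(command)
    print(mysql_datadir_config("/mnt/ensembl"))
```

The commands would be run as root on the instance, and MySQL would need a restart after its configuration changes.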
The real power of these data sets comes from developers who can now provide tools and APIs that can be used to analyze the data, or mash them up with other data sources. It will be interesting to see how people make use of the available data sets, what kinds of data sets will be utilized, and the kinds of data types being requested and submitted. With the availability of these initial data sets, and more in the future, we would like to invite developers to provide analysis pipelines, tools, and APIs that can be leveraged by the community and potential customers.
If you are interested in making a data set available as part of the Public Data Sets on AWS program, please submit your request on the form at http://aws.amazon.com/publicdatasets/. We would love to hear from you.


There are a number of existing open data sources sitting behind APIs that could be provided as raw EBS volumes:
http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php
A good one to put up would be Wikipedia data dumps, maybe formatted for Hadoop?
http://www.cloudera.com/hadoophack/datasets/wikipedia
Anyone out there with some initiative can sift through this list to find some other large public datasets to submit:
http://delicious.com/pskomoroch/dataset
Posted by: Peter Skomoroch | December 04, 2008 at 12:00 PM
Congratulations, this is a big deal.
Question:
How will data updates be handled? For example, I assume the Bureau of Labor Statistics and other federal data will be updated at least monthly. Will there be a method or area to determine when these updates happen without cloning an Amazon EBS volume to look for changes?
Posted by: iolaire | December 04, 2008 at 01:47 PM
How about getting all SEC filings in the public db? Is that in the works?
Posted by: Michael Bigger | December 05, 2008 at 11:46 AM
We have wanted to grant guest access to large datasets in an S3 bucket, but the process is awkward because every item in the bucket must have its ACL reset individually.
Any chance there will ever be a recursive ACL operation, or a way to specify that bucket grants override item grants?
(Been hoping for this for a long while... see: http://developer.amazonwebservices.com/connect/message.jspa?messageID=46691#46691 )
- Gordon @ IA
Posted by: Gordon Mohr | December 05, 2008 at 02:25 PM
What about grabbing NASA data? I think it's public record one year after it's recorded.
Posted by: Mike Miller | December 05, 2008 at 03:48 PM
First of all - Great effort and intention.
I don't know how this will succeed. If I have a very large dataset, why would I bother submitting a form and doing the work of giving it to others? What do I get (unless I am a .ORG)?
You need to give at least a $100 credit of AWS usage to the guy giving you a TB of data.
The key to crowdsourcing is leveraging the right person to do the right thing and hence the costs come down automatically.
Posted by: Niraj | December 07, 2008 at 04:37 PM
Secondly, if you get into the mode of policing data (i.e., do I have the right to submit the data), you are not going to get anywhere. If you treat this as a platform (like torrent or YouTube) you will have better success.
That is, just provide a direct interface to load data and tag it. Do not put a gatekeeper on top of it.
Posted by: Niraj | December 07, 2008 at 04:47 PM
Why not post all the public records data from all government agencies? This should help the local and state governments that are on the brink of bankruptcy by eliminating this cost from their balance sheets. It might also enable the private sector to build tools and utilities based on this public data much more effectively for general public use, and leave the government free from having to manage this for its people.
Posted by: Kris | December 12, 2008 at 09:07 AM