Paging Researchers, Analysts, and Developers

Hi, I am Deepak Singh, a business development manager at Amazon Web Services. One of my areas of focus is scientific computing on AWS, and I am guest blogging today about an exciting new initiative that will bring great benefit to researchers and scientists.

"Science was always about mashing up, taking one result and applying it to your [work] in a different way. The question is ‘Can we make that as effective [for] samples [of] data and analysis as it [is] for a map and set of addresses for a coffee shop?’ That is the vision." -- Cameron Neylon

One way to achieve Cameron Neylon's vision is to have access to public sources of data. This becomes even more powerful if scientists and analysts can use the available data to perform all kinds of computational and analytical tasks. At Amazon Web Services we believe that making it easy for people to get access to data spurs innovation. In line with that thinking, we have launched Public Data Sets on AWS, a new program that significantly lowers the barrier for researchers and data analysts to access and use some of the most commonly used data sets in their communities, without the need to manage the data within their own AWS accounts. Public Data Sets on AWS provides a convenient way to share, access, and consume publicly available data within your Amazon EC2 environment. Here is how it works:

  • Select public data sets will be hosted by Amazon Web Services for free as an Amazon EBS snapshot.
  • You can access the data by creating your own personal Amazon EBS volume from a publicly shared Amazon EBS public data set snapshot.
  • You can then access, modify, and perform computations on these data sets directly using an Amazon EC2 instance, and pay only for the compute and storage resources that you use.
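
The steps above can be sketched in a few lines of Python. This is a minimal sketch, not an official recipe: it assumes `ec2` is a boto3 EC2 client (a library that is not mentioned in this post), and the function name, the snapshot, zone, and instance IDs, and the device name are all hypothetical placeholders.

```python
def attach_public_dataset(ec2, snapshot_id, availability_zone,
                          instance_id, device="/dev/sdf"):
    """Create a personal EBS volume from a shared snapshot and attach it.

    `ec2` is assumed to be a boto3 EC2 client; every ID passed in is a
    placeholder that you would replace with your own values.
    """
    # Step 1: create your own volume from the publicly shared snapshot.
    # The volume must live in the same Availability Zone as the instance.
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = volume["VolumeId"]

    # Step 2: wait until the new volume is ready to be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Step 3: attach the volume to a running EC2 instance.
    ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId=instance_id,
        Device=device,
    )
    return volume_id


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   attach_public_dataset(ec2, "snap-EXAMPLE", "us-east-1a", "i-EXAMPLE")
```

From that point on the volume is yours: you pay for its storage and for the instance it is attached to, and any changes you make stay on your copy rather than on the shared snapshot.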

Some of the areas we have found people interested in include scientific research, economic data analysis, and market research. An example of a data set that we have seen interest in from the life science community is Ensembl. Ensembl is a joint project of the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, and produces and maintains automated annotation on a number of eukaryotic genomes. Ensembl have made their MySQL databases for Ensembl release 51 available via the Public Data Sets on AWS program and will continue to make updated versions of Ensembl available in the future. This data set consists of more than 650 GB of data and over 31,000 files. People who want to use the snapshot will be able to create an EBS volume from the snapshot, mount that volume on an AMI running MySQL, and configure the MySQL instance to point to the database files. In other words, you will now have the capability of doing bioinformatics in the cloud without needing to keep your Ensembl databases up to date.
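
As a rough illustration of the last step, pointing MySQL at the database files, the sketch below renders a MySQL option-file fragment that sets `datadir` to the mounted volume. The mount point `/mnt/ensembl`, the function name, and the optional sub-directory are hypothetical; the directory layout inside the Ensembl snapshot is not described here, so check it before relying on this.

```python
import configparser
import io
import posixpath


def render_mysql_datadir_config(mount_point, subdir=""):
    """Return a MySQL option-file fragment whose datadir is on the volume.

    `mount_point` is where you mounted the EBS volume (hypothetical);
    `subdir` is an optional directory inside it that holds the databases.
    """
    datadir = posixpath.join(mount_point, subdir) if subdir else mount_point

    # MySQL option files use an INI-like syntax, so configparser can
    # render the [mysqld] section for us.
    parser = configparser.ConfigParser()
    parser["mysqld"] = {"datadir": datadir}

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the fragment so it can be reviewed before it is merged
    # into the instance's MySQL configuration.
    print(render_mysql_datadir_config("/mnt/ensembl"))
```

The MySQL server would need to be restarted after such a change, and the files on the volume must be readable by the user the server runs as.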

The real power of these data sets comes from developers who can now provide tools and APIs that can be used to analyze the data, or mash them up with other data sources. It will be interesting to see how people make use of the available data sets, what kinds of data sets will be utilized, and the kinds of data types being requested and submitted. With the availability of these initial data sets, and more in the future, we would like to invite developers to provide analysis pipelines, tools, and APIs that can be leveraged by the community and potential customers.

If you are interested in making a data set available as part of the Public Data Sets on AWS program, please submit your request on the form at http://aws.amazon.com/publicdatasets/. We would love to hear from you.


Comments

There are a number of existing open datasources sitting behind APIs that could be provided as raw EBS volumes:
http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php

A good one to put up would be Wikipedia data dumps, maybe formatted for Hadoop?
http://www.cloudera.com/hadoophack/datasets/wikipedia

Anyone out there with some initiative can sift through this list to find some other large public datasets to submit:
http://delicious.com/pskomoroch/dataset

Congratulations, this is a big deal.

Question:
How will data updates be handled? For example, I assume the Bureau of Labor Statistics and other federal data will be updated at least monthly. Will there be a method or area to determine when these updates happen without cloning an Amazon EBS volume to look for changes?

How about getting all SEC filings in the public db? Is that in the works?

We have wanted to grant guest access to large datasets in an S3 bucket, but the process is awkward because every item in the bucket must have its ACL reset individually.

Any chance there will ever be a recursive ACL operation, or a way to specify that bucket grants override item grants?

(Been hoping for this for a long while... see: http://developer.amazonwebservices.com/connect/message.jspa?messageID=46691#46691 )

- Gordon @ IA

What about grabbing NASA data? I think it's public record one year after it's recorded.

First of all - Great effort and intention.

I don't know how this will succeed. If I have a very large dataset, why would I bother submitting a form and doing the work of giving it to others? What do I get (unless I am a .ORG)?

You need to give at least a $100 credit of AWS usage to the guy giving you a TB of data.

The key to crowdsourcing is leveraging the right person to do the right thing and hence the costs come down automatically.

Secondly, if you get into the mode of policing data (i.e., do I have the right to submit the data), you are not going to get anywhere. If you treat this as a platform (like torrent or YouTube) you will have better success.

That is, just provide a direct interface to load data and tag it. Do not put a gatekeeper on top of it.

Why not post all the public records data from all government agencies? This should help the local and state governments that are on the brink of bankruptcy by eliminating this cost from their balance sheets. It might also enable the private sector to build tools and utilities based on this public data much more effectively for general public use, and leave the government free from having to manage this for its people.
