We have just released four additional AWS public data sets, and have updated another one.
In the Economics category, we have added a set of transportation
databases from the US
Bureau of Transportation Statistics.
Data and statistics are provided for aviation, maritime, highway,
transit, rail, pipeline, bike & pedestrian, and other modes of
transportation, all in
CSV format.
I was able to locate employment data for our
hometown airline and found out
that they employed 9,322 full-time and 1,122 part-time employees as of
the end of 2007.
In the Encyclopedic category, we have added access to the DBpedia Knowledge Base, the Freebase Data Dump, and the Wikipedia Extraction, or WEX.
The DBpedia Knowledge Base currently describes more than 2.6 million things including 213,000 people, 328,000 places, 57,000 music albums, 36,000 films, and 20,000 companies. There are 274 million RDF triples in the 67 GB data set.
The 66 GB Freebase Data Dump is an open database of the world's information, covering millions of topics in hundreds of categories.
The Wikipedia Extraction (WEX) is a processed, machine-readable dump of the English-language section of the Wikipedia. At nearly 67 GB, this is a handy and formidable data set. The data is provided in the TSV format as exported by PostgreSQL.
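For readers who want to peek at tab-separated files like these, here is a minimal Python sketch. It assumes the files follow the conventions of PostgreSQL's COPY text format (tab-delimited columns, `\N` for NULL, backslash escapes for tabs and newlines inside a field); the post does not spell out the exact layout, and the function names and sample rows below are made up for illustration.

```python
# Backslash escapes used inside a field by PostgreSQL's COPY text format.
# (Assumption: the data set's TSV files follow this format.)
_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}


def decode_field(raw):
    """Decode one field; a field consisting solely of \\N means NULL."""
    if raw == r"\N":
        return None
    out = []
    chars = iter(raw)
    for ch in chars:
        if ch == "\\":
            # Translate known escapes; keep any other escaped char as-is.
            nxt = next(chars, "")
            out.append(_ESCAPES.get(nxt, nxt))
        else:
            out.append(ch)
    return "".join(out)


def read_rows(lines):
    """Yield each line as a list of decoded fields.

    Literal tabs and newlines inside a field are escaped in this format,
    so splitting on the raw tab character is safe.
    """
    for line in lines:
        yield [decode_field(f) for f in line.rstrip("\n").split("\t")]


# Hypothetical sample rows, not taken from the actual data set.
sample = ["1\tAnarchism\t\\N\n", "2\ta\\tb\tline\\\\x\n"]
rows = list(read_rows(sample))
```

In practice you would pass an open file object to `read_rows` instead of a list, so that a multi-gigabyte file is streamed one line at a time.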
Finally, we have updated the NCBI's
Genbank data.
Weighing in at a hefty quarter of a terabyte, this public data set
contains information on over 85 billion bases and 82 million
sequence records.
Instantiating these data sets is basically trivial. You create a new EBS volume of the appropriate size, basing it on the snapshot id of the data. Next, you attach the volume to a running EC2 instance in the same availability zone. Finally, you create a mount point and mount the EBS volume on the instance. The last step can take a minute or two for a large volume; the other steps are essentially instantaneous. Instead of spending days or weeks downloading these data sets you can be up and running from a standing start in minutes. Once again, cloud computing reduces the friction between "I have a good idea" and "here's the realization of my idea." You don't need loads of bandwidth, processing power, or local disk space in order to do interesting and significant work with these world-scale data sets.
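The create, attach, and mount steps described above can be sketched in Python. This is only an illustration under stated assumptions: it uses the boto3 library's EC2 client (not something this post mentions), and the snapshot id, instance id, availability zone, device name, and mount point are all placeholders you would replace with real values.

```python
def create_and_attach(ec2, snapshot_id, instance_id, zone, device="/dev/sdf"):
    """Create an EBS volume from a snapshot and attach it to an instance.

    `ec2` is expected to behave like a boto3 EC2 client (an assumption).
    The volume must be created in the same availability zone as the
    instance. Returns the id of the new volume.
    """
    # Size is omitted, so the volume takes the size of the snapshot.
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=zone)
    volume_id = volume["VolumeId"]

    # Wait until the volume is ready before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


def mount_commands(device="/dev/sdf", mount_point="/mnt/public-data"):
    """Return the shell commands to run on the instance itself.

    The device name and mount point are placeholders; the device may
    appear under a different name depending on the instance type.
    """
    return [
        f"sudo mkdir -p {mount_point}",
        f"sudo mount {device} {mount_point}",
    ]
```

With boto3 installed and credentials configured, a call might look like `create_and_attach(boto3.client("ec2"), "snap-EXAMPLE", "i-EXAMPLE", "us-east-1a")`, where both ids are hypothetical. The commands returned by `mount_commands()` are then run on the instance.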
-- Jeff;
Quarter of a petabyte? Don't you mean quarter of a terabyte?
Posted by: AaronSw | February 24, 2009 at 04:42 PM
How often are these databases updated by AWS?
Or is this a single-point-in-time snapshot and the user is responsible for updating them from the original sources after the initial download to an EBS volume?
Thanks!
Posted by: Deva Rajan | February 25, 2009 at 01:13 AM
Thank you for offering these datasets. They have a potential to have a huge impact.
I recommend that there be public documentation on how often you plan to update these files. I asked how often they will be updated in the last blog post on this subject and also in the EC2 forum and have not received an answer.
I'll have to assume that you did not update the Bureau of Labor data. Given the decline in employment since the November dataset (a decline of ~600,000 employees for each month of Nov, Dec, and Jan or a 0.8% increase in the unemployment rate) it is a disservice to even offer this data if it will not be updated. I'm not using the data, but if I were I would have to download the files myself since they are not updated. Assume you have a student using this data, will they really get passing grades if their data ignores the most recent huge decline in employment?
Posted by: iolaire | February 25, 2009 at 06:55 AM
Jeff - As a former NCBI employee, it's great to see Genbank in the mix here. I've got lots of questions about how the data is available -- for NCBI, is it ASCII Genbank files? or in ASN.1 (their underlying format)?
One of the most valuable data sets that NCBI maintains is Pubmed abstracts: text abstracts for every article published in the life sciences for the last decade. It's data that isn't available anywhere else, and I hope it might be considered for inclusion.
I look forward to seeing some How-To's and working examples of using these data sets!
MD
Posted by: Michael E Driscoll | February 25, 2009 at 04:35 PM
Wouldn't it be great if all the medical record formats were also available as a public data set? S3 is the perfect place for a medical records hub and supports Obama's plan to automate medical records.
Posted by: Trudy | February 27, 2009 at 05:50 AM