Weighing in at a whopping 500 GB (388 GB of data and 112 GB of free space to allow for some in-place decompression), the Wikipedia XML data is our newest Public Data Set.
This data set contains all of the Wikimedia wikis in the form of wikitext source and metadata embedded in XML. We'll be updating this data set every month and we'll keep the sets for the previous three months around.
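Because the wikitext lives inside very large XML files, it usually makes sense to stream through them rather than load a whole file into memory. Here is a minimal sketch of that idea in Python, using the standard library's `xml.etree.ElementTree.iterparse`. The function names (`iter_pages`, `local_name`), the tiny inline sample, and the namespace URI shown are illustrative assumptions, not something taken from the data set itself; check the element names against the actual dump files before relying on it.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki-style XML export.

    Hypothetical helper: assumes each <page> holds a <title> and at least
    one <revision>/<text>. If a page has several revisions, only the text
    of the last one encountered is yielded.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to limit memory use


# Tiny made-up sample standing in for a real dump file (assumed layout).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <revision>
      <id>1</id>
      <text>'''Example''' is a [[sample]] page.</text>
    </revision>
  </page>
  <page>
    <title>Second</title>
    <revision><id>2</id><text>More wikitext.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    print(page_title, len(wikitext))
```

To try this against a real file, you would pass an open file object (for example one returned by `bz2.open(path, "rb")` for a compressed dump) in place of the `io.BytesIO` sample.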
As you can see from this screenshot of my PuTTY window, there are some pretty beefy files in this data set:
[Screenshot: PuTTY window listing the files in the data set]
As an example of what can be done with this data, take a look at Cloudera's blog post on Grouping Related Trends with Hadoop and Hive. This article shows how to create a trend tracking site using a Cloudera Hadoop cluster running on EC2, using Apache Hive queries to process the data.
-- Jeff;


Curious if there's any hope of getting the "pages-meta-history" data for the enwiki archive. I've had an interest in that data for some time now but can only get a copy from Jan 2008. Wikimedia has been unable to generate more recent versions of that archive for enwiki due to technical problems (at least in part related to disk space).
Perhaps AWS could set up some kind of relationship to help offload the archival process?
Posted by: Kevin Webb | September 29, 2009 at 06:54 AM