Programs have always been good at dealing with highly structured, very uniform data. They sometimes stumble, though, when asked to deal with data that is irregular, unstructured, or otherwise messy. Normalizing data that came from a casual, real-world source where people are allowed to enter free-form text can be tedious and expensive.
A recent story on ReadWriteWeb tells the tale of data scientist Pete Skomoroch and his analysis of real-world data taken from LinkedIn. Pete and his colleagues built a processing pipeline that used the Amazon Mechanical Turk to tap into what he described as the "human brain-power of thousands of Turks." They were, for example, able to figure out that "IBM", "I.B.M.," and "IBM UK" all referred to the same company.
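In the pipeline, that judgment came from human workers; a purely automated pass only gets you part of the way. As a rough illustration of the kind of canonicalization involved, here is a minimal sketch (my own, not Pete's code) that strips punctuation and a hypothetical list of corporate and location qualifiers before comparing names:

```python
import re

# Assumed list of trailing qualifiers to ignore; a real pipeline
# would need a much richer set of rules (or human judgment).
SUFFIXES = {"inc", "corp", "ltd", "llc", "uk", "usa"}

def canonicalize(name: str) -> str:
    # Lowercase and drop punctuation, so "I.B.M." becomes "ibm"
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    # Drop trailing qualifiers, so "ibm uk" becomes "ibm"
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

variants = ["IBM", "I.B.M.", "IBM UK"]
print({v: canonicalize(v) for v in variants})
```

Rules like these break quickly on real data ("IBM UK" is easy; "Big Blue" is not), which is exactly why Pete's team routed the hard cases to human workers.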
Earlier, Pete had used this technique to create a view of the locations of thousands of Twitter users; the code for this project can be found here.
If you are interested in the Amazon Mechanical Turk and other forms of workforce collaboration, the upcoming Net:Work 2010 conference may also be of interest. Sharon Chiarella, VP of the Amazon Mechanical Turk, will be speaking. The conference will be held in San Francisco on December 9th; you can get a $100 discount by clicking here.