Listed below are links to weblogs that reference Amazon Mechanical Turk and Image Processing:



Glenn Fleishman

I've been particularly interested in the Mechanical Turk (which could have been named Maezie for short [obscure joke]) because of a particular problem with book information that persists in some forms at It was something I was working on there back in 1996 and continue to work on in different forms 10 years later. The notion is that you need a combination of automatic processing and human processing to sort book information into the correct categories.

There are items like correcting and normalizing book titles, because book information is somewhere between 90 and 98 percent accurate, depending on the source. That can mostly be done automatically and be predictably improved over time by disregarding old, bad information and using an index of authority that ranks new, good information for hygiene purposes.
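To make the "index of authority" idea concrete, here is a minimal sketch of how a preferred title might be picked from conflicting records. The field names (`source`, `year`, `title`), the source labels, and the numeric authority scores are all made up for illustration; a real catalog would have its own.

```python
def best_title(candidates, authority):
    """Return the title from the most authoritative source.

    `authority` maps a source name to a score (higher is more trusted).
    Unknown sources score 0. Ties are broken in favor of the newer record,
    which is one simple way to disregard old information.
    """
    best = max(
        candidates,
        key=lambda c: (authority.get(c["source"], 0), c["year"]),
    )
    return best["title"]


# Hypothetical records for one ISBN coming from two different feeds.
candidates = [
    {"source": "old_feed", "year": 1996, "title": "Wizard Of Oz, The"},
    {"source": "publisher", "year": 2005, "title": "The Wizard of Oz"},
]
authority = {"old_feed": 1, "publisher": 3}
chosen = best_title(candidates, authority)
```

The scores themselves are the hard part; this only shows where they would plug in.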

More complex is the question: what is a work? Is The Wizard of Oz a work? Great. How is it instantiated? Are the 100 ISBNs that are The Wizard of Oz editions the same work? Yes. What if Wizard of Oz is misspelled? What if Frank Baum's name is attached to a parody? What if there are collections that contain the work The Wizard of Oz?

And are parodies of The Wizard of Oz the same work? No. But they have a relationship that would be useful to someone interested in that work. This is where Monsieur Turk could come into play.

Algorithmically, without heuristics, you can sort likely ISBNs into likely work sets by author, publication year, publisher, title, and other predictable factors. The several percent of ISBNs that are not part of any work, or that appear to be collections of works that can't be teased out into sets of works, would go through M.Turk. The human brain can easily say that "The Blizzard of Snozz" (with a paragraph description) is a [checkbox] parody of The Wizard of Oz and that The Wiz is an alternate retelling (let's see a checkbox for that).
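The first, automatic pass might look roughly like the sketch below. It groups edition records by a normalized author and title only (publication year and publisher are left out to keep it short), and treats any ISBN that ends up alone as a candidate for human review. The record fields, the normalization rules, and the `min_editions` threshold are assumptions for illustration, and the ISBN values are placeholders.

```python
import re
from collections import defaultdict


def normalize(text, drop_article=False):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    text = " ".join(text.split())
    if drop_article:
        # Only used for titles, so an author initial "A." is not stripped.
        text = re.sub(r"^(the|a|an) ", "", text)
    return text


def group_into_works(records):
    """Map each (author, title) key to the list of ISBNs that share it."""
    works = defaultdict(list)
    for rec in records:
        key = (normalize(rec["author"]), normalize(rec["title"], drop_article=True))
        works[key].append(rec["isbn"])
    return dict(works)


def split_for_review(works, min_editions=2):
    """Separate confident work sets from ISBNs that need a human look."""
    confident = {k: v for k, v in works.items() if len(v) >= min_editions}
    needs_review = [
        isbn for v in works.values() if len(v) < min_editions for isbn in v
    ]
    return confident, needs_review


records = [
    {"isbn": "isbn-1", "author": "L. Frank Baum", "title": "The Wizard of Oz"},
    {"isbn": "isbn-2", "author": "L Frank Baum", "title": "Wizard of Oz"},
    {"isbn": "isbn-3", "author": "A. Parodist", "title": "The Blizzard of Snozz"},
]
works = group_into_works(records)
confident, needs_review = split_for_review(works)
```

Here the two Baum editions land in one work set and the parody is left over for M.Turk, where a worker could tick the "parody of" checkbox.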

Where it would be more useful is a book called Collection of Great American Public Domain Novels which contains, say, 10 short novels. A human being could look at the description and enter the putative titles. Have 3 or 4 humans do the same thing and check results (to correct for typos) and then run the results through the first algorithms. If those titles then appear as members of existing works (so the collection is a container of works), you have results that make sense. Otherwise, you may then run the container's works as defined by M.Turk workers into another round of M.Turk.
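Checking 3 or 4 workers against each other could be as simple as counting votes. The sketch below accepts a title when at least `min_agree` workers typed it (ignoring case and spacing) and sets the rest aside as disputed; the threshold and the function name are my own assumptions, not anything M.Turk provides.

```python
from collections import Counter


def consensus_titles(submissions, min_agree=2):
    """Split worker-entered titles into accepted and disputed lists.

    `submissions` holds one list of titles per worker.
    """
    counts = Counter()
    for titles in submissions:
        # A set, so one worker cannot vote twice for the same title.
        counts.update({" ".join(t.lower().split()) for t in titles})
    accepted = sorted(t for t, n in counts.items() if n >= min_agree)
    disputed = sorted(t for t, n in counts.items() if n < min_agree)
    return accepted, disputed


# Three hypothetical workers describing the same collection; one makes a typo.
submissions = [
    ["The Wizard of Oz", "Moby Dick"],
    ["the wizard of oz", "Moby Dick"],
    ["The Wizrd of Oz", "Moby Dick"],
]
accepted, disputed = consensus_titles(submissions)
```

The accepted titles would then go back through the first algorithms, and the disputed ones into another round of M.Turk.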

I just have to have some time to program the process!
