Thursday, December 8, 2011

Autocorrective Crowd Sourcing and Autocatalytic Systems Programming

So, a friend sends me this link about how reCAPTCHA is being used to pawn off digitizing work to unsuspecting users, in an ingenious manner:

Setting aside for the moment any ethical considerations of virtual slavery and digitally indentured servitude, it really is a masterful plan.

The crowd sourcing model can also be used to clean up application data, even when the crowd is relatively small.

Another friend related the case where a US federal government department contacted him to consult on scrubbing business names from a terabyte-scale database. The chief difficulty is that much of the data originated from people typing in whatever they wanted. My friend rightly points out that it is always cleaner to select a coded value from a list than to allow free form typing. That is not possible when the values vary, as human contrivances (company names) frequently do.

Systems may be designed to allow people to type in new names in a free form manner. Free form data input means garbage in, garbage out: there is no way to enumerate in advance the unbounded variation of lexical strings by which a company name is known. Consider the variants of the Johnson & Johnson corporate appellation:

Johnson and Johnson
Johnson & Johnson
Johnson & Johnson, Inc.
Johnson and Johnson, Incorporated
J & J,
J& J,
J. & J. ,
J and J

The letter case is not such a big deal. Neither is the variation of whitespace, or even the substitution of the ampersand for the word "and".  The problem is that the set of these names is finite but unbounded and the set changes over time. There is no easy way of identifying entries that go with the canonical name for a given organization.
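To make the easy part concrete, here is a minimal Python sketch of the kind of normalization I have in mind. The function name `normalize` and the exact folding rules are my own assumptions, not anything from the database in question. It shows that case, whitespace, punctuation and the ampersand collapse readily, while abbreviations and suffixes still leave several distinct keys for the same company.

```python
import re

def normalize(name: str) -> str:
    """Hypothetical helper: fold case, '&' vs 'and', punctuation and whitespace."""
    key = name.casefold()
    key = key.replace("&", " and ")          # treat the ampersand as the word "and"
    key = re.sub(r"[^\w\s]", " ", key)       # drop periods, commas and similar marks
    return re.sub(r"\s+", " ", key).strip()  # collapse runs of whitespace

variants = [
    "Johnson and Johnson",
    "Johnson & Johnson",
    "Johnson & Johnson, Inc.",
    "J & J,",
    "J& J,",
    "J. & J. ,",
    "J and J",
]

# Eight-ish spellings shrink to a handful of keys, but not to one.
keys = sorted({normalize(v) for v in variants})
print(keys)
```

The surviving keys are the hard part: nothing in the strings themselves says that "j and j" and "johnson and johnson inc" name the same organization.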

Or isn't there? Applications capture this data in daily interactions with users. Combo boxes can list previous entries and offer them as coded selections, while still allowing novel inputs. So use string pattern matching to do a lookup of the possibilities based upon the first few characters typed. But don't look into the old records; look into a map of generated and collected synonyms instead.

Allow the user to start entering the name of the company. If after the first few letters, you find an exact match on a canonical name and the user accepts it, you're done: just take the code for the canonical name. If there are no perfect matches, generate a set of weighted regexp patterns based upon the user input; any original names that match and are accepted by the user are recorded as a probable match for future lookups.
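The lookup described above might be sketched in Python roughly as follows. Everything here is an assumption for illustration: the class name `SynonymMap`, the company codes, and especially the fallback, which uses a single "word initials" pattern with an acceptance count as the weight rather than a full set of weighted regexps.

```python
import re
from collections import defaultdict

class SynonymMap:
    """Hypothetical sketch: canonical names plus user-confirmed synonyms."""

    def __init__(self, canonical):
        self.canonical = dict(canonical)  # code -> canonical name
        # synonyms[typed key][code] = how many times users accepted that pairing
        self.synonyms = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def _key(text):
        return re.sub(r"\s+", " ", text.casefold()).strip()

    def suggest(self, typed):
        key = self._key(typed)
        if not key:
            return []
        # 1. The typed text is a prefix of a canonical name.
        hits = [c for c, n in self.canonical.items() if self._key(n).startswith(key)]
        if hits:
            return sorted(hits)
        # 2. Synonyms learned from earlier users, most often accepted first.
        learned = self.synonyms.get(key)
        if learned:
            return sorted(learned, key=lambda c: (-learned[c], c))
        # 3. Fallback pattern: each typed token must begin successive words.
        tokens = [t for t in re.findall(r"\w+", key) if t != "and"]
        if not tokens:
            return []
        pattern = re.compile(
            r"\b" + r"\w*\W+(?:and\W+)?".join(re.escape(t) for t in tokens))
        return sorted(c for c, n in self.canonical.items()
                      if pattern.search(self._key(n)))

    def accept(self, typed, code):
        """Record that a user confirmed `typed` as meaning `code`."""
        self.synonyms[self._key(typed)][code] += 1

names = SynonymMap({"JNJ": "Johnson & Johnson", "JCI": "Johnson Controls"})
guesses = names.suggest("J and J")  # no prefix match, so the pattern is tried
names.accept("J and J", guesses[0]) # the user confirms the suggestion
```

After the `accept` call, the same spelling resolves through the synonym map without touching the pattern at all, which is the feedback loop: each confirmation makes the next lookup cheaper and more certain.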

That is admittedly rather sketchy, but I'm sure someone has implemented a few schemes along these lines to clean up old data. What would some of the advantages be?

  • Cost is not incurred when the effort has no value.
    • Data for companies that are no longer participating is not touched.
  • Effort is expended whenever it is needed.
    • Data for companies that participate often is improved more rapidly.
  • Refinement of the data set is embedded as a design feature of the system.
    • Not one-shot: Cleaning of the data continues over the lifespan of the system.
    • Tracks the moving target: Data set targets are ambiguous symbols representing shifting human contrivances (businesses).  

Can the strategy work?  Biological systems do something analogous in the form of DNA error detection and repair. They are far from perfect, but organisms maintain their integrity for decades, up to hundreds of years... thousands if you consider some of the oldest organisms like trees and fungi.

The key idea here is accepting that determinism in complex systems is a convenient myth that is sometimes more trouble than it is worth.  That realization opens the way for explicitly probabilistic feedback loops -- just one more way for complex computer systems to fulfill their original design intent to illuminate the processes of life.
