The internet is an immense collection of data. Day by day, it gets closer to containing the sum of everything that has ever been written. ‘Everything’ is available online in one form or another. Some of it you have to pay for, in the form of e-books; some of it is free, in the form of most web pages. A further huge raft is being tirelessly assembled as you read this. For the last 9 years and counting, millions of physical books have been scanned and uploaded by Google Inc. for its Google Books project. The project is currently on track to complete its “we’ll scan everything” target of nearly 130 million books, magazines and journals by the end of this decade.
Of course, the scanning technology is only as good as the data it’s given, and a lot of old books are printed on what is now faded paper, with smudged inks, cuts, tears and blemishes. That’s where humans come in. Our brains are the most sophisticated optical character recognition tool on the planet, and there are nearly 2.5 billion of them attached to the internet. The chances are that you’ve worked on the Google Books project without even knowing it, by filling out a reCAPTCHA challenge.
You’ll have seen these before, nearly every time you fill out a web form. Most people know that their primary purpose is to stop ‘spam bots’ (specially written programs that submit registrations over and over, essentially a form of ‘hacking’) from doing their dastardly deeds. But few of us know the second function of a reCAPTCHA challenge: its sub-plot is to get humans to decipher words that the Google Books computerised systems cannot read. It simply pairs a known correct word with an unknown word and assumes that if a few people give the same answer for the unknown word, that answer is correct. We don’t know which of the two fuzzy words is the known correct one, so we can’t fudge the system.
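For the curious, the pairing idea can be sketched in a few lines of Python. This is only an illustration of the logic described above, not Google’s actual code: the function names, the word IDs, the lower-casing of answers and the threshold of three matching answers (standing in for “a few people”) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers count as "a few people".
REQUIRED_AGREEMENT = 3

# Maps an unknown word's ID to a tally of the answers people have typed for it.
votes = defaultdict(Counter)


def normalise(text):
    """Ignore case and stray spaces when comparing answers (an assumption)."""
    return text.strip().lower()


def submit(known_answer, known_guess, unknown_id, unknown_guess):
    """Check one filled-in challenge. Returns True if the user passes.

    The user is judged only on the known word. If they get it right,
    their reading of the unknown word is trusted enough to be counted.
    """
    if normalise(known_guess) != normalise(known_answer):
        return False  # failed the known word, so discard the other guess too
    votes[unknown_id][normalise(unknown_guess)] += 1
    return True


def resolved_word(unknown_id):
    """Return the agreed reading of an unknown word, or None if undecided."""
    tally = votes.get(unknown_id)
    if not tally:
        return None
    word, count = tally.most_common(1)[0]
    return word if count >= REQUIRED_AGREEMENT else None
```

In this sketch, each call to `submit` stands for one person completing a challenge, and `resolved_word` only starts returning a word once enough people who passed the known word have typed the same thing for the unknown one. Because the user never learns which word was the known one, a wrong guess on either word is a gamble.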
We often groan when we see a reCAPTCHA, as it represents a fiddly, unfriendly pain of a hurdle to jump. But next time you see one, remember that you’re one of the huge army of people resolving 200 million deciphers a day. This represents a staggering 150,000 hours of work saved each day. That’s gotta be worth a collective invite to the Google staff summer barbecue, right?