Those annoying reCaptcha codes are actually helping to digitize old books

Posted by admin on Nov 23, 2011 in geekery

ReCaptcha codes are one of those necessary evils of everyday life. Filling them in for security reasons is a bit like emptying the bins or picking up the dry cleaning: useful and important, but hardly fun. If you’re anything like me, you may even have uttered a curse word or two as the lovely ReCaptcha code popped up on the screen, because sometimes the words they give you are so faded and hard to decipher that you feel like you’re decoding some ancient text.

Perhaps it will make you feel a little better to know that this is the truth: your squinty-eyed keyboard stabbing at the latest frustrating ReCaptcha code actually IS DECODING an ancient text. Seriously.

So, how did this come about, and what does interpreting those ReCaptcha scrawls actually mean?

Well, we know why we use these codes: it’s so we don’t end up spammed with adverts from Nigerian princes and cheap designer deals, and so we can securely log in to our social networks when we’re travelling. The text you’re asked to type is heavily distorted, and humans are much better at deciphering this kind of distortion than computers are, so you’re unlikely to have a spambot filling in your email for you (1 point to humans!).

However, the strange pieces of text that we decode (estimated at around 200 million Captchas a day worldwide) do have a benefit to us. All our hard work collectively adds up to around 150,000 hours a day, and reCaptcha has put this to use in helping to digitize books. Yes, by effectively crowdsourcing a whole lot of Captcha codes worldwide, you are helping to transcribe old books and make them available to the public in a readable digital format.
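For the curious, here is a quick back-of-the-envelope check in Python of what those two estimates imply per Captcha. It uses only the figures quoted above, which are rough estimates; the constant and function names are made up for illustration.

```python
# Rough estimates quoted in the post (not measured values).
CAPTCHAS_PER_DAY = 200_000_000
HOURS_PER_DAY = 150_000


def seconds_per_captcha(total_hours: float, captcha_count: int) -> float:
    """Average time spent on one Captcha, in seconds."""
    return total_hours * 3600 / captcha_count


if __name__ == "__main__":
    avg = seconds_per_captcha(HOURS_PER_DAY, CAPTCHAS_PER_DAY)
    print(f"Implied time per Captcha: {avg:.1f} seconds")  # prints 2.7 seconds
```

In other words, the two figures together assume that each Captcha takes a little under three seconds to type.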

The phrases you see come from book pages that have been photographically scanned and turned into digital text using something called “Optical Character Recognition” (OCR). The issue is that OCR is not 100% accurate, and on many of these pages it produces characters that don’t make sense, leading to nonsense text that would make Edward Lear happy. To correct this, the words that OCR struggles with are fed into the Captcha codes we see, and every person who types in their reading of a word helps settle on a single agreed transcription.
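To make the idea of many people agreeing on one word a bit more concrete, here is a minimal Python sketch of a simple majority vote over typed answers. This is not reCaptcha's actual algorithm: the function name, the lowercasing step and both thresholds (`min_votes`, `min_share`) are assumptions chosen purely for illustration.

```python
from collections import Counter
from typing import List, Optional


def agreed_reading(
    answers: List[str],
    min_votes: int = 3,      # hypothetical: how many people must agree
    min_share: float = 0.6,  # hypothetical: fraction of answers that must match
) -> Optional[str]:
    """Return the reading most people typed, or None if there is no clear winner."""
    # Ignore empty answers and differences in case or surrounding spaces.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if not cleaned:
        return None

    # Pick the most frequently typed reading and count its votes.
    word, votes = Counter(cleaned).most_common(1)[0]
    if votes >= min_votes and votes / len(cleaned) >= min_share:
        return word
    return None


if __name__ == "__main__":
    typed = ["morning", "Morning", "moming", "morning "]
    print(agreed_reading(typed))  # prints: morning
```

With only a couple of conflicting answers (say `["cat", "cot"]`) the function returns `None`, which in a scheme like this would simply mean the word needs to be shown to more people.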

The problem with the scanned text

ReCaptcha was purchased by Google in 2009, and Google is using the system to digitize old issues of The New York Times and books from the Google Books catalogue. Since the project was combined with reCaptcha, over twenty years of The New York Times have been digitized.

OK, the codes are no less annoying than before, but at least you can have a little glow of pride that you’re helping the bigger picture. Well, it’s not like you have a choice, is it?


Copyright © 2021 Zara Stone. All rights reserved.