I recently read about reCaptcha and found it really interesting. Maybe it’s new only to me, and you probably already know about it. reCaptcha is a CAPTCHA (which stands for Completely Automated Public Turing test to tell Computers and Humans Apart) system owned by Google. You will find it almost everywhere on websites nowadays, where it is used to prevent spam, bot attacks and so on. Remember when a website required you to type in the words from some ugly-looking image in order to proceed or submit a form? That is a CAPTCHA.
Why is it interesting? Because apart from keeping bots (programs written by humans) out of our websites, it turns us (the internet users) into a voluntary human OCR machine. Yes, we are working for Google for free!!
Well, that’s not my point. I am willingly contributing to this project because reCaptcha itself is free for me to use. So, it’s fair enough.
How It Works
Google apparently scans a lot of old magazines, newspapers, textbooks and so on in order to digitize them. Those old pages are often distorted and ugly, of course, so a normal OCR system cannot convert them into digital text accurately. Therefore, Google ends up with a collection of word images from those documents that the computer cannot read.
reCaptcha presents 2 words to us. One of them is taken from the documents above (a word that Google can’t read yet). This will be the “fake” word. The other is a computer-generated word (probably from those documents as well, but already converted to digital text) and will be the “real” word.
Humans can read distorted text a lot more accurately than machines. So, when we see these images, we have a better chance of identifying the words. When we enter the 2 words and submit the form, reCaptcha checks whether the “real” word was answered correctly. If it was, our answer for the “fake” word is added to the database. In other words, we only need to answer the “real” word correctly in order to pass the test. But as we don’t know which one is real and which is fake, and since we have already offered to volunteer for this project, we will normally answer both words.
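To make the check concrete, here is a minimal Python sketch of the idea. It is only my own illustration, not Google’s actual code: the names check_submission and collected_answers, the image id "img-001" and the case-insensitive comparison are all assumptions.

```python
from collections import defaultdict

# Hypothetical store (an assumption for this sketch):
# id of an unread "fake" word image -> list of answers typed by humans.
collected_answers = defaultdict(list)


def check_submission(real_expected, real_answer, fake_id, fake_answer):
    """Pass the test only if the known "real" word was typed correctly.

    The answer for the "fake" word is recorded only when the test passes.
    """
    passed = real_answer.strip().lower() == real_expected.strip().lower()
    if passed:
        collected_answers[fake_id].append(fake_answer.strip().lower())
    return passed


# The "real" word is right, so the "fake" answer is stored.
print(check_submission("morning", "Morning", "img-001", "overlooks"))  # True
# The "real" word is wrong, so the test fails and nothing is stored.
print(check_submission("morning", "mourning", "img-001", "xyz"))  # False
print(dict(collected_answers))  # {'img-001': ['overlooks']}
```

Notice that the “fake” answer is never checked at all. The test has nothing to compare it against.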
reCaptcha will normally reuse the same “fake” word many times in order to collect more answers. For sure, some of us will answer correctly and some will not. So, Google will have a set of different answers for every single word. The answer given most often will then be used as the transcription of that word. Two birds with one stone.
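Picking the most frequent answer can be sketched in a few lines of Python. Again, this is only a guess at the idea and not the real implementation: the function name pick_transcription and the min_votes threshold are my own assumptions.

```python
from collections import Counter


def pick_transcription(answers, min_votes=3):
    """Return the most common answer, or None if there are too few votes.

    min_votes is an assumed threshold, added so that a single answer
    is not trusted on its own.
    """
    if len(answers) < min_votes:
        return None
    word, _count = Counter(answers).most_common(1)[0]
    return word


answers = ["overlooks", "overlooks", "overlocks", "overlooks", "0verlooks"]
print(pick_transcription(answers))        # overlooks
print(pick_transcription(["overlooks"]))  # None
```

The more people answer the same word, the less a single typo matters.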
Using It in PHP
First of all, you need to register your website and get 2 keys. They are strings of random letters that you need to put in your PHP code. Then you need to download the library from the reCaptcha website and include it in your PHP files. Call the recaptcha_get_html() function to display the CAPTCHA input box and recaptcha_check_answer() to check whether the answer is correct. Here is the complete tutorial.
I am so proud that I can contribute to this project. But hackers are everywhere, and many technologies have been abandoned just because they were hacked once. According to Google, they have already applied some security measures to it. Read more here. So, no worries about that.
But I was sure there must be some flaw in the system. So, I googled again. Something very interesting showed up here. The system has not been hacked so far (there are some rumors, but I think they are just rumors). But there is something called the “P**** Flood Attack”.
The attack is surprisingly easy to launch. In fact, anybody can do it just by following some simple guidelines provided here. The key to performing this attack is to identify which word is the “real” one. After that, you can answer the “fake” word with whatever you want, including but not limited to the “P****” word. So why is it called flooding? If millions of people do this, the answer set discussed above will be flooded with the P word. And don’t be surprised if, in the near future, you are reading some books or magazines online with random P words appearing in the text.
What benefit would you get from that? I thought we had agreed to volunteer for this project! Don’t worry, though, because the bad news (for the attackers) is that the reCaptcha team already knows about this and has implemented numerous protections to prevent the flooding. I don’t know how the protections work, but I think it is safe enough to use reCaptcha on my websites. Happy CAPTCHA-ing!