DOTE

Chain And Rate


Sunday, December 21, 2008

Google can now OCR all PDFs

When you scan a document, your computer stores the scan as an image. You can see the words on the screen, but your computer can’t. As far as your computer is concerned, the letters could be birds or your child or a boat.

When you put this scan up on a website, search engines haven’t been able to index any of the content of your documents because they didn’t recognize the text as text … until now. Google has a new system that uses Optical Character Recognition (OCR) to find the words in scanned Acrobat PDFs on the web. Google already indexes the words in PDFs that contain real text, such as those that have already been through OCR; the new system does the same for scanned documents posted online that haven’t yet undergone OCR.

If you have scanned PDFs and are interested in having them converted into text, you can upload the images to your website and take advantage of this service. Simply follow the instructions for how to use Google OCR from the Digital Inspiration website: Create a folder in your website (say abc.com/pdf) and upload all the PDF images to that folder. Now create a public web page that links to all the PDF files. Wait for the Google bots to spider your stuff.
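If your folder holds more than a handful of scans, the "public web page that links to all the PDF files" can be generated rather than typed by hand. The following is only a minimal sketch in Python, not part of the Digital Inspiration instructions: the function names `build_index` and `write_index`, the `index.html` file name, and the `http://abc.com/pdf` address are all assumptions made for illustration.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index(pdf_names, base_url):
    """Return a simple HTML page with one link per PDF file name."""
    base = base_url.rstrip("/")
    items = "\n".join(
        # quote() makes the file name safe in a URL; escape() makes it safe as link text
        f'  <li><a href="{base}/{quote(name)}">{escape(name)}</a></li>'
        for name in sorted(pdf_names)
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head><title>Scanned PDFs</title></head>\n<body>\n<ul>\n"
        f"{items}\n"
        "</ul>\n</body>\n</html>\n"
    )


def write_index(folder, base_url):
    """Write index.html (hypothetical name) into the folder of scanned PDFs."""
    folder = Path(folder)
    names = [p.name for p in folder.iterdir() if p.suffix.lower() == ".pdf"]
    (folder / "index.html").write_text(build_index(names, base_url), encoding="utf-8")
    return names
```

Calling `write_index("pdf", "http://abc.com/pdf")` on a local copy of the folder would produce an `index.html` that you upload alongside the scans, so that the crawler has a page from which to reach each file.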

Once done, type the query “site:abc.com/pdf filetype:pdf” [into Google] to see the PDF documents as HTML. Lifehacker recommends using Google’s Webmaster Tools to rein in “what gets scanned and indexed on your site, although you should assume anything you put online can be found by those looking for it.” This is a really terrific way to get rid of paper clutter in your work space and in your home, since the words in your scanned documents can now be read and searched as text.
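If you want to bookmark that search instead of retyping it, the query can be turned into a link. This is a small sketch under stated assumptions: the function name `site_pdf_query_url` is made up for this example, and it relies on the familiar `https://www.google.com/search?q=...` address format for Google searches.

```python
from urllib.parse import urlencode


def site_pdf_query_url(site_path):
    """Build a Google search link for 'site:<site_path> filetype:pdf'."""
    query = f"site:{site_path} filetype:pdf"
    # urlencode() percent-encodes the colon and slash and turns the space into '+'
    return "https://www.google.com/search?" + urlencode({"q": query})


print(site_pdf_query_url("abc.com/pdf"))
```

Replace `abc.com/pdf` with your own folder address. Results will only appear once Google has actually crawled the folder.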