Google’s Optical Character Recognition (OCR) software now works for more than 248 of the world’s languages, including all the major South Asian languages. It is easy to use and achieves over 90 percent accuracy for most of them.
OCR software has been extremely beneficial for the study of language: it extracts text from images of virtually any printed material, and sometimes even handwriting, which opens the door to old texts, manuscripts, and more.
Typically, OCR software has difficulty reading old documents or pages with blemishes and ink marks, and spits out gibberish instead of legible text.
Google’s support page on this project shares additional details about character formatting, such as the software’s ability to preserve bold and italic text in the output.
Old and valuable texts in many languages can now be digitized, preserved, and shared over the internet on platforms like Wikisource, making the knowledge they contain widely available.
Google’s OCR partly uses Tesseract, a free and open-source OCR engine. Originally developed at Hewlett-Packard between 1985 and 1994, it was released as open source in 2005, and its development has been sponsored by Google since 2006. Tesseract is considered to be one of the world’s most accurate OCR engines and works for over 60 languages.
Read more: globalvoicesonline.org – Written by Subhashish Panigrahi