What Is OCR and How Does It Help With Data Extraction?

By kindergarten, Emily had mastered reciting the alphabet and could recognize and write all of her letters. She knew the letters, but she couldn’t read just yet. In other words, Emily couldn’t interpret what those letters meant when they were strung together on a page.

Optical Character Recognition (OCR) is the ability of a machine to recognize that a set of black dots is a character, which the machine can then convert into machine-encoded text. OCR takes images of documents and turns them into text that a computer can store and search. But OCR alone cannot interpret that text, just as Emily could not read even though she knew all of her letters.

Itemize adds interpretation to OCR

Certain letter patterns and groups of words repeat; teaching children these patterns and sight words (e.g., the, it, and) is fundamental to learning how to read. By first grade, Emily had mastered her sight words and was excited when she could read short phrases like “Sam ran fast.” Emily also used the pictures on the page to attribute meaning to words and rule out others; the cute animal with big ears that is eating cheese is a mouse, not a moose, for instance.

OCR vendors use libraries of fonts and ‘pattern matching,’ which, like sight words, enable faster reading. OCR post-processing then uses a lexicon, a list of the words expected in a specific type of document, to narrow down what each recognized word is likely to be. For example, on invoices, certain words such as freight, quantity, and due date are more expected than others, so an ambiguous result can be resolved in their favor. The result of lexicon post-processing is a complete document translated into machine-encoded text.
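
To make the lexicon idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular OCR vendor implements post-processing: the word list, the function names, and the 0.75 similarity cutoff are all assumptions chosen for the example, and the fuzzy matching comes from Python's standard difflib module.

```python
import difflib

# Hypothetical invoice lexicon (an assumption for this sketch);
# a production lexicon would be far larger.
INVOICE_LEXICON = [
    "freight", "quantity", "due", "date",
    "invoice", "total", "subtotal", "tax",
]


def correct_token(token, lexicon=INVOICE_LEXICON, cutoff=0.75):
    """Return the closest lexicon word for an OCR token.

    If nothing in the lexicon is similar enough, the token is
    returned unchanged.
    """
    matches = difflib.get_close_matches(
        token.lower(), lexicon, n=1, cutoff=cutoff
    )
    return matches[0] if matches else token


def correct_line(line):
    """Apply lexicon correction to every whitespace-separated token."""
    return " ".join(correct_token(tok) for tok in line.split())


if __name__ == "__main__":
    # "frei9ht" and "quantlty" are typical OCR misreads (9 for g, l for i).
    print(correct_line("frei9ht quantlty 12"))
```

A token such as "frei9ht" is pulled back to "freight" because it is close to an expected invoice word, while numbers and unfamiliar names pass through untouched. The cutoff is the main design choice: set it too low and real vendor names get "corrected" into lexicon words.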

Itemize Receipt OCR and Data Extraction adds context and understanding specifically to payments documents, invoices, receipts, and folios. This context includes, for example, telling taxes apart from VAT and a grand total apart from subtotals.

Today, Emily is in high school and reading complex novels like The Great Gatsby and To Kill a Mockingbird. She understands the nuances of character, perspective, and voice, and uses real intelligence to process language. She inherently knows to apply a different context to Fitzgerald than to Lee.

By using artificial intelligence, Itemize’s engine understands the difference between a 9 next to the word total (an amount), a 9 in 9/15/2019 (part of a date), and a 9 next to the word West (part of an address).
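
The kind of distinction described above can be illustrated with a small rule-based sketch in Python. To be clear, this is not Itemize's engine, which the article describes as AI-based; it is a hand-written stand-in that only shows why the neighboring words matter. The label names, keyword sets, and function names are all assumptions made for the example.

```python
import re

# Hypothetical patterns and keyword sets, chosen only for illustration.
DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
TOTAL_WORDS = {"total", "subtotal", "amount"}
DIRECTIONS = {"north", "south", "east", "west"}


def _clean(token):
    """Lowercase a token and strip common trailing punctuation."""
    return token.lower().strip(".,;:")


def classify_number(tokens, i):
    """Label tokens[i] as 'date', 'amount', 'address', or 'unknown'."""
    tok = tokens[i].strip(".,;:")
    if DATE_RE.match(tok):
        return "date"
    # Look only at the immediate left and right neighbors.
    neighbors = set()
    if i > 0:
        neighbors.add(_clean(tokens[i - 1]))
    if i + 1 < len(tokens):
        neighbors.add(_clean(tokens[i + 1]))
    if neighbors & TOTAL_WORDS:
        return "amount"
    if neighbors & DIRECTIONS:
        return "address"
    return "unknown"


def label_numbers(text):
    """Return (token, label) pairs for every token containing a digit."""
    tokens = text.split()
    return [
        (tok, classify_number(tokens, i))
        for i, tok in enumerate(tokens)
        if any(ch.isdigit() for ch in tok)
    ]


if __name__ == "__main__":
    for token, label in label_numbers("Total 9 on 9/15/2019 at 9 West Street"):
        print(token, "->", label)
```

Fixed rules like these break down quickly on real receipts, where layouts, languages, and abbreviations vary widely. That gap between rules and real documents is where a learned model earns its keep.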

If you are looking for a solution that simply identifies the ABCs, many standard OCR vendors will suffice. If, however, you are looking for a partner to read, interpret, and categorize financial values and vendors in receipts and payments documents, Itemize should be your go-to choice for data automation. Try us for free and get real-time insight into your data in as little as 20 minutes.