Optical Character Recognition (OCR) ~ Vinay's Blog

Dec 26, 2009

Optical Character Recognition (OCR)

This article looks at implementing OCR solutions in transaction handling processes to improve accuracy and productivity.

What is OCR?

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable and searchable text. In today’s world, where information is critical, the ability to create live content where before there were only static images is arguably worth ten times as much as the source.

Why do we need it?

Technologies like electronic data transfer, workflow optimization and OCR can significantly increase the efficiency and accuracy of operations by automating processes and enabling proactive management.

A recent case study published by Accenture, based on the finance operations of 120 leading organizations, indicates that only about 15 percent of companies currently transact 60 percent or more of their accounts payable and accounts receivable functions on a fully automated basis. In other words, a significant number of organizations still rely on manual, labor-intensive methods of transaction processing.

While the cost of new tools and technologies is one of the main barriers, the absence of an automated solution most often stems from a lack of awareness within organizations about which technologies could actually improve transaction processing.

What setup suits me?

Generally, a desktop solution is most appropriate for low-volume workflows, or for environments where the quality of the scanned documents is very poor and requires inline quality control and validation. Desktop solutions add value when users need to be directly involved in the recognition process. If one anticipates requirements such as scanning the original paper document, manually selecting the zone on the page to extract content from, and then validating the recognized text, desktop might be the way to go, particularly if the content is to be repurposed.

Companies supplying SaaS or outsourcing solutions often possess cutting-edge, high-performance hardware and software, and have access to offshore resources to rapidly turn images into searchable output. However, sending images to a third party can compromise their security and confidentiality. Some organizations have serious concerns about the confidentiality and security of external document processing, which rules out this approach for them.

For the vast majority of today’s enterprises, server-based OCR is the optimal solution, enabling them to extract maximum value from corporate documentation at a reasonable cost. While desktop solutions offer more functionality, most businesses find these extras unnecessary. In a typical workflow the scanned images already exist, created by scanning solutions. The content is most often destined for searchable content repositories, where it needs to be indexed for later retrieval while its relationship with the original image remains intact. In these instances an “original image + hidden text” PDF maintains the visual and printable fidelity of the original scan while providing a fully searchable and indexable layer of content underneath.

Conversion Formats and Features

One of the first features to consider when selecting an OCR solution is the set of output formats. It begins with PDF (Portable Document Format), the globally recognized format for standardized electronic documents. PDF offers the advantage of maintaining the original image fidelity while creating a searchable document, but it doesn’t end there. Often, companies simply want text output for import into databases or other applications. With an OCR solution, one can also transform a scanned image into a fully editable format such as:

MS Word: where even the headers, footers, tables and page numbering are all properly formatted.

MS Excel: where data is extracted into cells based on the layout of the original PDF files.

HTML: which facilitates sharing document contents that were originally locked in an image format.

Some crucial features include:

Zonal OCR: Performing OCR on a specific zone of the page to extract or read specific information. With zonal OCR we can define on which pages and page regions OCR is performed.

Barcode Recognition: Whether a page carries a simple 1D barcode or a more advanced 2D barcode, the OCR solution can extract the encoded information for use in routing, management, storage, profiling and more.

Optical Mark Recognition (OMR): Working together with zonal OCR, OMR examines a specific region of a page to determine the presence of content. Performing zonal recognition on specific boxes on an exam paper or mail-in survey can return a true or false value depending on whether the box was selected or not.

Compression: Scanned images are significantly larger than their electronic counterparts. For example, a 25 KB MS Word document could result in a 1 MB TIFF when scanned. The ability to compress output files reduces server storage costs and allows faster, more efficient distribution or download via web or email.

Page orientation automation: In an automated environment, documents that have been rotated or that arrive in varying orientations can be automatically rotated to enable optimal reading by the OCR application.

Blank page detection: OCR automatically separates documents when a blank page is found during conversion.

User-definable dictionaries: Allow users to set up specific collections of terms that the OCR engine can use to compare what it thinks it is seeing on a page with typical words. Law firms, for example, have a wide collection of terms that can supplement a standard OCR engine dictionary to help the engine guess more accurately what the correct word might be.

Despeckle / Deskew: Image clean-up processes are often handled upstream by the imaging software, but many OCR technologies offer additional clean-up to further improve output quality. Any stray black ‘dots’ present when the OCR process starts can result in inaccurate output or longer conversion times, because the tool has to determine whether the dots are actual characters or not.

Forms/Document Recognition: Matches the structure of a page against a known document type or format, which can then drive a specific process.

Intelligent Character Recognition (ICR): OCR falls short when it comes to reading handwriting. In theory, ICR could “learn” to read handwriting, but the current technology is far from perfect and is used only for specialized tasks. A handwriting recognition feature is useful to have, but the technology is still being developed and does not yet perform with dependable accuracy.
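To make a couple of these features concrete, here is a minimal sketch of how zonal mark recognition and blank page detection can both be reduced to measuring how much ink sits in a region. It assumes the page has already been binarized into a list of rows, with 1 for ink and 0 for background. The function names, the (top, left, bottom, right) zone layout and the thresholds are all illustrative assumptions, not the behavior of any particular OCR product.

```python
def ink_ratio(page, zone=None):
    """Fraction of ink pixels inside a zone (top, left, bottom, right).

    With no zone given, the whole page is measured.
    """
    if zone is None:
        zone = (0, 0, len(page), len(page[0]))
    top, left, bottom, right = zone
    pixels = [px for row in page[top:bottom] for px in row[left:right]]
    return sum(pixels) / len(pixels) if pixels else 0.0


def box_is_marked(page, zone, threshold=0.3):
    """OMR-style check: True if enough of the zone is filled in.

    The 0.3 threshold is an arbitrary illustrative value.
    """
    return ink_ratio(page, zone) >= threshold


def split_on_blank_pages(pages, blank_threshold=0.01):
    """Group pages into documents, using blank pages as separators."""
    documents, current = [], []
    for page in pages:
        if ink_ratio(page) <= blank_threshold:
            # A blank page closes the current document and is dropped.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


# Tiny 4x8 page: a filled check box on the left, an empty one on the right.
page = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
blank = [[0] * 8 for _ in range(4)]

print(box_is_marked(page, (0, 0, 3, 3)))  # left box
print(box_is_marked(page, (0, 5, 3, 8)))  # right box
print(len(split_on_blank_pages([page, page, blank, page])))  # document count
```

In practice the thresholds would need tuning against real scans, since scanner noise means a "blank" page is rarely perfectly white.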

Accuracy of Conversion

OCR in a commercial domain such as free-form invoice processing or claims data entry faces significant challenges in image quality. Since every operation performed on an image after its acquisition hinges on the quality of the raw data, it is important to maximize the semantic content even before OCR takes place. It has been found that data extraction operations have error rates equal to about twice the OCR error rates in normal domains.

It becomes imperative for any process to identify the right set of tools, with the right set of features, to cater to the needs of the business. A blob of pixels, a line, a part of an image, or noise can result in inaccurate conversion. Applying noise removal and character enhancement algorithms before recognition can improve the accuracy of the OCR output.
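The "twice the error rate" rule of thumb is easy to put into numbers. The short sketch below is only an illustration of that arithmetic: the function name and the factor parameter are assumptions made for the example, and the factor of 2 is the approximate figure quoted above, not a measured constant.

```python
def extraction_accuracy(ocr_accuracy, factor=2.0):
    """Estimate data extraction accuracy from OCR accuracy.

    Assumes the extraction error rate is roughly `factor` times the
    OCR error rate (the rule of thumb quoted in the text).
    """
    ocr_error = 1.0 - ocr_accuracy
    extraction_error = min(1.0, factor * ocr_error)
    return 1.0 - extraction_error


for acc in (0.99, 0.97, 0.95):
    print(f"OCR {acc:.0%} -> extraction about {extraction_accuracy(acc):.0%}")
```

Under this assumption, an OCR accuracy of 97% corresponds to an extraction accuracy of about 94%, which is why small gains at the OCR stage matter so much downstream.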

There are four main types of semantic enhancement for a document before OCR; three of them are described below. These can be thought of as pre-OCR steps that help improve the quality of conversion.

Line Removal

In claims processing, graphical lines that interfere with text present a significant problem. Even a small misalignment against pre-printed forms can result in a majority of the text on a page being partially obscured by horizontal or vertical lines. Line removal handles this misalignment using a three-stage approach: initial detection, line removal, and obscured text enhancement.

Consider an image of an invoice that suffers from misalignment, resulting in a line defect.

[Figure: invoice image before (pre) and after (post) line removal; images not available]
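The three stages of detection, removal and obscured text enhancement can be sketched on a toy binary image. This is a simplified illustration under stated assumptions, not the algorithm used by any specific product. It only handles horizontal lines, it treats any row containing a long unbroken run of ink as a line, and its "enhancement" step simply keeps a line pixel wherever ink continues both above and below the line. All names and the 0.8 fraction are hypothetical.

```python
def longest_run(row):
    """Length of the longest unbroken run of ink (1s) in a row."""
    best = run = 0
    for px in row:
        run = run + 1 if px else 0
        best = max(best, run)
    return best


def detect_line_rows(image, min_fraction=0.8):
    """Stage 1: rows whose longest ink run spans most of the width."""
    width = len(image[0])
    return [y for y, row in enumerate(image)
            if longest_run(row) >= min_fraction * width]


def remove_lines(image, line_rows):
    """Stages 2 and 3: erase line rows, keeping pixels where text crosses."""
    height = len(image)
    lines = set(line_rows)
    cleaned = [row[:] for row in image]

    def ink_beyond_line(y, x, step):
        # Walk past any adjacent line rows (thick lines), then test for ink.
        y += step
        while 0 <= y < height and y in lines:
            y += step
        return 0 <= y < height and image[y][x] == 1

    for y in line_rows:
        for x in range(len(image[y])):
            crosses = ink_beyond_line(y, x, -1) and ink_beyond_line(y, x, +1)
            cleaned[y][x] = 1 if crosses else 0
    return cleaned


# A vertical stroke at column 2 crossed by a form line on row 2.
image = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
rows = detect_line_rows(image)
cleaned = remove_lines(image, rows)
for row in cleaned:
    print("".join("#" if px else "." for px in row))
```

The form line disappears while the vertical stroke that crossed it stays connected. Real forms need more care, for example with skewed lines and with characters that merely touch a line instead of crossing it.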

Noise Removal

Noise comes in three flavors - patterned noise, associated noise, and random noise. Patterned noise comes from graphical patterns, especially half tone shaded areas (commonly seen on scanned forms).

Associated noise occurs when a scanned document is incorrectly thresholded and pixels surrounding the valid ones appear in the image.

Random noise comes from a bad threshold or a garbled source document or scanner.

All types of noise are removed using a combination of global and local statistical analysis of blob sizes and shapes.

[Figure: example of noise removal in a patterned area; image not available]
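As a rough illustration of size-based blob analysis, the sketch below finds connected blobs of ink in a binary image and removes those that are much smaller than the typical blob. It uses only a single global statistic (the median blob size) and ignores blob shape and local context, so it is a much simpler stand-in for the combined analysis described above. The function names and the 0.5 ratio are assumptions for the example.

```python
from statistics import median


def find_blobs(image):
    """Return connected groups of ink pixels (4-connectivity)."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for y in range(height):
        for x in range(width):
            if image[y][x] and not seen[y][x]:
                stack, blob = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    blob.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                blobs.append(blob)
    return blobs


def remove_small_blobs(image, ratio=0.5):
    """Erase blobs smaller than `ratio` times the median blob size."""
    blobs = find_blobs(image)
    cleaned = [row[:] for row in image]
    if not blobs:
        return cleaned
    cutoff = ratio * median(len(blob) for blob in blobs)
    for blob in blobs:
        if len(blob) < cutoff:
            for y, x in blob:
                cleaned[y][x] = 0
    return cleaned


# Two character-sized blobs and one isolated speck at row 1, column 7.
image = [
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_small_blobs(image)
print(len(find_blobs(image)), "blobs before,", len(find_blobs(cleaned)), "after")
```

A median-based cutoff only works when genuine characters outnumber specks. On heavily patterned areas, where noise blobs dominate, a local or shape-based test would be needed instead.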

Dot Matrix Enhancement (Blob Aggregation)

Once noise and lines have been removed, any blob that is separated from a neighboring blob by less than the average horizontal character separation distance is likely to be a good candidate for aggregation. A statistical analysis is performed on all text lines over a certain height to determine whether the form has been filled out with a dot matrix printer. If it has not, the aggregation routine is bypassed.
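The aggregation rule can be sketched in one dimension. In the example below, each blob on a text line is reduced to its horizontal extent as an inclusive (left, right) pair of pixel columns, and neighbors are merged when the empty gap between them is smaller than the character separation distance. The representation, the function name and the char_gap value are assumptions for illustration; the check for whether the form is dot matrix printed at all is left out.

```python
def aggregate_blobs(extents, char_gap):
    """Merge blobs whose horizontal gap is smaller than `char_gap`.

    `extents` holds inclusive (left, right) column ranges of blobs on one
    text line. `char_gap` stands in for the average horizontal character
    separation distance and is assumed to be known already.
    """
    merged = []
    for left, right in sorted(extents):
        if merged:
            prev_left, prev_right = merged[-1]
            gap = left - prev_right - 1  # empty columns between the blobs
            if gap < char_gap:
                merged[-1] = (prev_left, max(prev_right, right))
                continue
        merged.append((left, right))
    return merged


# Five dot matrix dots: three belong to one character, two to the next.
dots = [(0, 1), (3, 4), (6, 7), (14, 15), (17, 18)]
print(aggregate_blobs(dots, char_gap=4))
```

With a gap of one column between dots inside a character and six columns between characters, the five dots collapse into two character-sized blobs.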

Conclusion/Summary

OCR preprocessing and image enhancement greatly increase the accuracy of OCR, even on lower quality images. Since every point of decline in OCR accuracy causes roughly a two-point decline in data extraction accuracy, this part of the process is critical to the productivity gains realized when using a complete system for data extraction.

Once the setup is accurate and the tools are in place, OCR deployment can improve productivity by as much as 150 to 180%.
