Optical Character Recognition (OCR)

What is OCR, how does it work, and what should you expect from it?

Optical character recognition, or OCR, is a method of converting a scanned image into text. When a page is scanned, it is typically stored as a bitmapped image file, often in TIFF format. When the image is displayed on the screen, we can read it, but to the computer it is just a grid of black and white dots. The computer does not recognize any "words" in the image.

This is where OCR comes in. OCR looks at each line of the image and attempts to determine whether the black and white dots represent a particular letter or number. OCR was originally developed to help sight-impaired individuals gain access to printed information. That same technology has since been updated and improved and is now used to "read" scanned image files.
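
To make the idea of matching dots to letters concrete, here is a toy sketch in Python. It assumes a deliberately simplified approach (comparing a tiny 3x3 grid of dots against stored letter shapes), and the shapes and function names are invented for illustration. Real OCR software is far more sophisticated, so treat this as a picture of the principle rather than a description of how any actual product works.

```python
# Toy glyph templates: 3x3 bitmaps, 1 = black dot, 0 = white dot.
# These shapes are invented for illustration only.
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def match_score(glyph, template):
    """Count the dots on which the scanned glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the template letter that agrees with the glyph on the most dots."""
    return max(TEMPLATES, key=lambda letter: match_score(glyph, TEMPLATES[letter]))


# A scanned "L" with one stray black dot (noise) in the top-right corner.
noisy_l = ((1, 0, 1),
           (1, 0, 0),
           (1, 1, 1))

print(recognize(noisy_l))  # prints "L": 8 of 9 dots agree with the "L" template
```

Even in this toy version, a single stray dot lowers the match score. That hints at why marks, smudges, and poor copies cause real OCR to misread words.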

OCR can be a very powerful tool for a law firm. The key is its ability to produce a text version of the scanned documents. Once a text file has been created, it then becomes possible to launch a text search and locate any page with a given word or set of words. 

For example, let's say you are working on a case that has 100 boxes of documents. You need to find every page where the name John Jones appears. How do you do it? Well, the traditional way is to have someone sit down and read each and every piece of paper in all 100 boxes and pick out the pages that are relevant. 

There are two obvious problems here. First, the task takes an enormous amount of time, and every hour of that time must be paid for. Given the level of expertise that the individual reviewing the documents must have, and the associated payroll costs, this can be a very substantial cost to the firm.

Secondly, there is no guarantee that a critical page will not be missed. Manually reading all of those pages is a very boring task. Fatigue, boredom, and human error almost ensure that a page will be missed here and there. It is just a gamble as to whether the pages that are missed turn out to be important. Given the huge amount of time involved in the task, no one is going to pay for a second pass. Firms have just had to accept the fact that pages will be missed.

With OCR, though, this whole process is simplified and made more accurate. Once the documents have been scanned and processed through the OCR module, there is a text version of every page available. Now someone can launch a search for John Jones and let the computer do the searching. It will find every page of every document where the OCR text contains that name. The process may take some time, depending upon how many pages are to be searched, but no matter how many pages there are, there is no labor cost involved. No one has to dedicate any time to the process once it starts.

When the search is completed, it will have assembled a list of every page from every document that contains the word or words that were used in the search. Those pages can be selected, reviewed, or even printed. OCR is a great research tool and can provide vastly better access to critical information than manual searches can.
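
The search step described above can be sketched in a few lines of Python. This is a minimal illustration, not the method any particular litigation-support product uses. The page identifiers (such as "box001-p0001"), the sample text, and the function names are all hypothetical, and the sketch assumes the OCR text of each page is already available as a string.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page IDs whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def find_pages(index, pages, phrase):
    """Return the sorted page IDs whose text contains the whole phrase."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return []
    # Candidate pages contain every word somewhere on the page.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Confirm the words appear next to each other, ignoring case and line breaks.
    pattern = re.compile(
        r"\b" + r"\W+".join(re.escape(w) for w in words) + r"\b",
        re.IGNORECASE,
    )
    return sorted(p for p in candidates if pattern.search(pages[p]))


# Hypothetical OCR output, keyed by an invented box/page identifier.
pages = {
    "box001-p0001": "Memo from John Jones regarding the lease.",
    "box001-p0002": "Jones called. John was not available.",
    "box017-p0342": "Deposition of JOHN\nJONES, taken in March.",
}

index = build_index(pages)
print(find_pages(index, pages, "John Jones"))
# prints ['box001-p0001', 'box017-p0342']
```

The second page mentions both words but not the name itself, so it is left out. Note that a search like this can only find what the OCR step read correctly: a page where the name was misread will not appear in the list.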

However, it is important to understand the limitations and capabilities of OCR. While it is a great tool, it is not perfect. The biggest factor in the success or failure of an OCR process is the quality of the original documents. It has been our experience that if the original documents were clean, laser printed pages, OCR should read 98+% of the words correctly. Some words may not be read correctly if there is handwriting over them, or if there are stamps or other marks that partially cover the text.

If the original documents were faxes, or multi-generational photocopies, or were printed with a dot matrix printer, the success rate of OCR drops off quickly. These types of documents may be read with only 60% to 80% accuracy. The same is true even of laser printed documents that have lots of lines and boxes. The lines and boxes confuse OCR, because OCR tries to read the lines as part of the text. If the original documents were handwritten, OCR will NOT read the information at all. In spite of the claims of some OCR or ICR companies, we have yet to see any software that will successfully read "real" handwritten material. There are some packages that will read carefully printed block letters, but this is not generally how people write in the real world.
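
To see what these read rates mean for a search, consider a rough back-of-the-envelope calculation. It assumes, purely for illustration, that each word is misread independently of its neighbors, which is a simplification since real OCR errors tend to cluster on damaged areas of a page. Under that assumption, the chance that both words of a two-word name such as John Jones are read correctly is the word accuracy multiplied by itself. The function name below is our own.

```python
def phrase_hit_rate(word_accuracy, words_in_phrase):
    """Chance that every word of a phrase is read correctly.

    Assumes each word is misread independently (a simplifying assumption).
    """
    return word_accuracy ** words_in_phrase


# Word accuracies taken from the figures quoted in the text above.
scenarios = [
    ("clean laser print (98%)", 0.98),
    ("poor original, high end (80%)", 0.80),
    ("poor original, low end (60%)", 0.60),
]

for label, accuracy in scenarios:
    rate = phrase_hit_rate(accuracy, 2)
    print(f"{label}: two-word name readable about {rate:.0%} of the time")
```

With these inputs the estimates come out to roughly 96%, 64%, and 36%. In other words, on poor originals a search for a full name could miss a large share of the places where it appears, which is why the quality of the source documents matters so much.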

In summary, OCR is a very powerful tool for research. It can create vast amounts of textual data that can then be searched. As long as the limitations of OCR are understood, it can be of great benefit to many kinds of organizations, not only law firms.