Pipeline overview
In the last post, we gave a brief overview of the project, our motivations, and initial results for text detection. This post presents an overview of the full text extraction pipeline.
Figure 1 describes the full pipeline. The text detection module, described in the last post, produces a set of bounding boxes marking potential text regions. Figure 2 shows an example result.
Text can be divided into two categories: machine-printed and handwritten. We decided to treat the recognition of these two types separately.
Figure 1: Full text extraction pipeline.
Figure 2. The top figure shows the bounding boxes produced by the text detection module. The bottom figure shows the heat map of text probabilities; the brighter the color, the higher the probability that the region contains text.
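To make the connection between the two panels of Figure 2 concrete, below is a minimal sketch of one way such a heat map could be turned into boxes: threshold the per-pixel probabilities and take connected components. The 0.5 threshold, the minimum-area filter, and the OpenCV calls are illustrative assumptions, not the exact logic of our detection module.

```python
# Sketch: turning a text-probability heat map (Figure 2, bottom) into
# bounding boxes (Figure 2, top). Threshold and area filter are
# illustrative assumptions, not our detector's actual parameters.
import cv2
import numpy as np

def heatmap_to_boxes(heatmap: np.ndarray, threshold: float = 0.5, min_area: int = 20):
    """heatmap: float array in [0, 1], one text probability per pixel."""
    binary = (heatmap >= threshold).astype(np.uint8)
    num_labels, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, num_labels):          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:                # drop tiny spurious regions
            boxes.append((x, y, w, h))
    return boxes
```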
1. Machine-printed text
Approach. We will leverage the great success of OCR and use commercial or open-source OCR engines, such as ABBYY and Tesseract, to recognize machine-printed text. First, we perform a series of preprocessing steps, such as thresholding, denoising, and binarization, to enhance the inputs. The preprocessed output is then passed directly through an OCR engine to obtain the final recognition. Figure 3 shows selected examples of detected machine-printed text in the data set.
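As a rough sketch of this branch, the snippet below chains denoising and Otsu binarization with Tesseract through the pytesseract wrapper. The specific preprocessing parameters, and whether we ultimately use ABBYY or Tesseract, are still open; treat this as an illustration rather than the final pipeline.

```python
# Sketch of the machine-printed branch: simple preprocessing followed by
# an off-the-shelf OCR engine. The preprocessing parameters below are
# illustrative assumptions, not our final settings.
import cv2
import pytesseract

def recognize_printed(crop_bgr):
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.fastNlMeansDenoising(gray, h=10)                 # denoise
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # Otsu binarization
    return pytesseract.image_to_string(binary)                 # OCR on the cleaned crop
```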
Figure 3. Selected examples of machine-printed text. Most of these appear in a nearly frontal pose and contain little noise.
2. Handwritten text
Compared to machine-printed text, handwritten text recognition is a much harder problem. Handwriting often contains more noise and wider variations in color, shape, and appearance. Figure 4 shows an example of handwritten text cropped from an image in the data set.
Figure 4. An example of handwritten text in the data set. The text in the picture is "Junniper Flat. Modoc Co., Calif. June 1939."
We observe that humans are reasonably good at recognizing handwritten text. This is especially true if the person is an expert in the field from which the handwriting comes. For example, a pharmacist is better at reading a prescription note than an average person.
Approach. We approach this problem by combining human intelligence, through crowdsourcing (Zooniverse or Amazon Mechanical Turk), with machine intelligence (computer vision and machine learning). This approach is often referred to as human-in-the-loop. We aim to create a system where humans interact with the machine by performing a small number of manual annotations, far fewer than the total number of examples in the data set. Our algorithm then learns from these annotations, becomes smarter as more annotations become available, and improves its ability to recognize handwritten text automatically. Figure 5 describes the general flow of a human-in-the-loop system.
Figure 5. General flow of a human-in-the-loop system. Users from an online crowdsourcing platform provide annotations for handwritten text. These annotations improve the machine learning model. The model then selects a new set of images to be annotated. As more and more handwritten text is annotated, the model becomes "smarter".
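To make the loop in Figure 5 more tangible, here is a minimal active-learning-style sketch. The logistic regression classifier, the fixed-length feature vectors, and the uncertainty-based selection of the next batch are all assumptions made for illustration; the actual model and selection strategy will be discussed in the next post.

```python
# Sketch of the human-in-the-loop cycle in Figure 5. The classifier and
# the uncertainty-based query strategy are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def human_in_the_loop(features, request_annotations, rounds=5, batch_size=20):
    """features: (n_samples, n_dims) descriptors of handwritten word images.
    request_annotations: callback that sends indices to crowd workers
    (e.g. Zooniverse / Mechanical Turk) and returns their transcriptions."""
    unlabeled = list(range(len(features)))
    # seed the loop with a small random batch of human annotations
    seed = list(np.random.choice(unlabeled, batch_size, replace=False))
    labeled_idx, labels = seed, list(request_annotations(seed))
    unlabeled = [i for i in unlabeled if i not in set(labeled_idx)]

    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(features[labeled_idx], labels)
        # ask humans about the examples the model is least sure of
        probs = model.predict_proba(features[unlabeled])
        uncertainty = 1 - probs.max(axis=1)
        query = [unlabeled[i] for i in np.argsort(-uncertainty)[:batch_size]]
        labeled_idx.extend(query)
        labels.extend(request_annotations(query))
        unlabeled = [i for i in unlabeled if i not in set(query)]
    return model
```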
In the next blog post, we will discuss further details of the handwriting recognition module, such as the preprocessing steps prior to recognition, the interface through which users perform annotations, and the back-end machine learning model. Figure 6 shows an example of a back-end approach, proposed by Manmatha et al. (1996), that helps reduce human annotation time.
Figure 6. Illustration of the word spotting idea. Similar-looking words are clustered into the same set. When a user annotates a word in the set, the system broadcasts the annotation to the rest of the group, thereby reducing the number of manual annotations required.
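For intuition, the sketch below clusters word crops by a generic visual descriptor and copies one human transcription to every member of the chosen cluster. The HOG features and k-means clustering are stand-ins of our own; Manmatha et al. use their own matching technique to group word images.

```python
# Sketch of the word-spotting idea: group visually similar word images,
# then broadcast a single annotation to the whole group. HOG + k-means
# are illustrative stand-ins, not the original paper's matching method.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.cluster import KMeans

def cluster_word_images(word_images, n_clusters=50):
    # describe each grayscale word crop with a fixed-length visual descriptor
    descriptors = np.array([
        hog(resize(img, (32, 128)), pixels_per_cell=(8, 8))
        for img in word_images
    ])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(descriptors)

def broadcast_annotation(cluster_ids, annotated_idx, transcription):
    """Copy one human transcription to every word in the same cluster."""
    target = cluster_ids[annotated_idx]
    return {i: transcription for i, c in enumerate(cluster_ids) if c == target}
```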
References
R. Manmatha, C. Han, and E. Riseman. Word Spotting: A New Approach to Indexing Handwriting. In: Proc. of the IEEE Computer Vision and Pattern Recognition Conference, San Francisco, CA, June 1996. A longer version is available as UMass Technical Report TR95-105.