Friday, February 24, 2012


OCR Bug Project

Updated (3/1/2012):
1. Added heat map
2. Explained the results more clearly


Introduction
The UC Berkeley Entomology department is looking to digitize and geo-reference 1.2 million specimens. One of the tasks is to record information related to each specimen in a database. Due to the large amount of data, manual solutions are unscalable. A more automatic approach is to use commercial OCR software to perform text extraction. However, due to the complex nature of the text (both printed and handwritten), this approach has not seen much success. A computer vision project was started at UCSD to automatically (or semi-automatically) extract text from a series of images containing text and the specimen. Figure 1 shows two examples of such images.


Figure 1: Examples of images from the data set.


Who are we?


We are part of the Computer Vision group at UC San Diego. Professor Belongie directly oversees the progress and technical design of the project. Kai Wang acts as the direct technical consultant for this project. I, Phuc Nguyen, am implementing and managing the text extraction algorithm and this blog.
This blog describes the progress and brief technical details of the text extraction process.

Technical details


In this section, we briefly describe the technical aspects of the current progress. If you are more interested in the results, please scroll down to the results section.


General Pipeline


We break the text extraction problem into two sub-problems: text detection and text recognition. One expected difficulty is recognizing handwritten text. A proposed solution to this problem is to combine clustering with the Amazon Mechanical Turk service.
In this blog post, we will discuss the technical details and results of the text detection algorithm.


Text Detection


We implement a sliding-window classifier, as it has proven to be an effective technique in previous detection problems. We use the features described in [Chen 2004] and are experimenting with other features, such as local binary patterns [Ojala 2002].
For classification, we use a logistic regression model and train its parameters with stochastic gradient ascent.
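To make the training step concrete, here is a minimal sketch of logistic regression trained by stochastic gradient ascent, assuming the feature vectors (e.g., the Chen-style features mentioned above) have already been computed for each window. The function names and hyperparameters (epochs, learning rate) are illustrative, not the exact values we use.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_sga(X, y, epochs=10, lr=0.01):
    """Train logistic regression weights by stochastic gradient ascent.

    X : (n_samples, n_features) feature vectors, one per window
    y : (n_samples,) labels, 1 = text window, 0 = non-text window
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            p = sigmoid(X[i] @ w + b)   # predicted P(text | window)
            grad = y[i] - p             # gradient of the log-likelihood
            w += lr * grad * X[i]       # ascend the log-likelihood
            b += lr * grad
    return w, b

def classify_window(features, w, b, threshold=0.5):
    """Return (is_text, probability) for a single window's feature vector."""
    p = sigmoid(features @ w + b)
    return p >= threshold, p

Each pass visits the training windows in random order and nudges the weights in the direction that increases the log-likelihood of the observed labels.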


Results


We use a 2:1 aspect ratio for our windows, with sizes ranging from 35x70 pixels to 125x250 pixels. For each image in the data set, we hand-annotated the bounding boxes for the text regions. To measure accuracy at testing time, for each window classified as positive, we check whether it falls within a hand-annotated bounding box.
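The sketch below illustrates this setup: a multi-scale scan with 2:1 windows, and a simple check of each positive window against the hand-annotated boxes. The stride, the intermediate window sizes, and the center-containment test are assumptions made for illustration; classify_window refers to the earlier sketch.

# Window heights paired with 2:1 widths, from 35x70 up to 125x250
# (intermediate scales are assumed for illustration).
WINDOW_SIZES = [(35, 70), (65, 130), (95, 190), (125, 250)]  # (height, width)

def sliding_windows(image_shape, sizes=WINDOW_SIZES, stride=10):
    """Yield (x, y, w, h) window coordinates over the image at each scale."""
    H, W = image_shape[:2]
    for h, w in sizes:
        for y in range(0, H - h + 1, stride):
            for x in range(0, W - w + 1, stride):
                yield x, y, w, h

def window_is_correct(window, annotated_boxes):
    """Count a positive window as correct if its center lies inside any
    hand-annotated text box (one possible way to score detections)."""
    x, y, w, h = window
    cx, cy = x + w / 2.0, y + h / 2.0
    for bx, by, bw, bh in annotated_boxes:
        if bx <= cx <= bx + bw and by <= cy <= by + bh:
            return True
    return False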

We achieve 96.2% accuracy in detection. Figure 2 shows two instances of text detection. Figure 3 shows another example with its corresponding heat map.

Figure 2: Examples of text detection. The overlapping red boxes are the sliding windows that the classifier detects as text.

Figure 3: An image and its corresponding heat map. Brighter colors in the heat map indicate a higher probability that the region contains text.
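As a rough illustration, a heat map like the one in Figure 3 can be assembled by accumulating the per-window probabilities produced by the classifier; the per-pixel averaging below is one possible scheme, not necessarily the exact one used for the figure.

import numpy as np

def build_heat_map(image_shape, detections):
    """Accumulate per-window probabilities into a pixel-level heat map.

    detections : list of ((x, y, w, h), probability) pairs from the classifier
    """
    H, W = image_shape[:2]
    heat = np.zeros((H, W), dtype=np.float32)
    counts = np.zeros((H, W), dtype=np.float32)
    for (x, y, w, h), p in detections:
        heat[y:y + h, x:x + w] += p
        counts[y:y + h, x:x + w] += 1.0
    counts[counts == 0] = 1.0
    return heat / counts  # average probability per pixel; brighter = more likely text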

Next step

In the next post, we will discuss an attempt to cluster the windows from the detected regions using different clustering techniques.

Other results


To further test the performance of our text detection algorithm, we use the Street View Text dataset [Kai]. The algorithm does not perform as well on this data set, as there is more noise and variation in text in the wild. We will not attempt to improve the performance of the text detector for this data set, as the current detector is sufficient for our objective. Figure 4 shows two examples from the street view data set.
Figure 4: Examples of text detection in the wild. There are a few false positives and false negatives in the images.