Monday, August 13, 2012

Initial interface

Hello! This blog post describes the current and envisioned functionality of the interface for processing specimen records.

DEMO: link

A short tutorial:

  1. To start, click the "New Annotations" button at the top right
  2. Draw a bounding box around a text region
  3. A new window will pop up
    1. Here you can select whether the text is machine-printed or handwritten
    2. In the text area, enter the text contained within the bounding box
    3. Press "Done" when finished.

Figure 1. Click "New Annotations" to start the process.
Figure 2. A drawn bounding box with its information filled in. Click "Done" to finish.

Figure 3. A complete example. The text area on the right shows the information that will be sent back to the back-end server of choice.
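To make this concrete, the sketch below shows roughly what a single annotation record could look like when it is serialized and sent to a back-end server. The field names and values are purely illustrative assumptions, not the interface's actual format.

    import json

    # Hypothetical annotation record; field names are illustrative only.
    annotation = {
        "image_id": "specimen_0001.jpg",
        "bbox": {"x": 120, "y": 245, "width": 310, "height": 60},  # in pixels
        "text_type": "machine-printed",        # or "handwritten"
        "transcription": "Modoc Co., Calif.",
    }

    # Serialize before posting to the back-end server of choice.
    print(json.dumps([annotation], indent=2))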
Interactions:
  • Draw bounding boxes around text
  • Resize and move bounding boxes
  • Enter and modify the text for a drawn bounding box by clicking on it
  • Delete bounding boxes

Current benefits:
  • Allows easy annotation of text and data
  • Interfaces easily with other systems (Mechanical Turk, ZooUniverse, etc.)
  • Easy access: a web interface, so no installation is required.

Before reading the next part, please take a few minutes to try the interface; the next section will make more sense.

Upcoming features:
  • Using computer vision and machine learning to streamline the process
    • Use automatic text detection for automatic bounding box localization
    • Use OCR to automatically fill in the text box
    • Use a word spotting algorithm to handle difficult cases
  • A time measurement mechanism to evaluate the performance of the interface.
In the next blog post, we will describe how text detection can assist in automatically localizing bounding boxes.

Thursday, April 19, 2012

Pipeline overview


In the last post, we gave a brief overview of the project, its motivation, and initial results for text detection. This post presents an overview of the text extraction pipeline.


Figure 1 describes the full pipeline. The text detection module, described in the last post, produces a set of bounding boxes marking potential text regions. Figure 2 shows an example result.

Text falls into two categories: machine-printed and handwritten. We decided to treat the recognition steps for these two types separately.
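As a minimal sketch of how this split could be organized in code, the snippet below routes each detected region to a printed or handwritten recognizer. Every function here is a stub standing in for a real module; the names and return values are assumptions for illustration only.

    # Minimal pipeline sketch; all component functions are hypothetical stubs.

    def detect_text_regions(image):
        """Stand-in for the text detection module: returns candidate boxes."""
        return [(120, 245, 310, 60)]  # (x, y, width, height), illustrative

    def classify_text_type(image, box):
        """Stand-in for deciding printed vs. handwritten."""
        return "printed"

    def recognize_printed(image, box):
        """Stand-in for the OCR engine (e.g. Tesseract or ABBYY)."""
        return "Modoc Co., Calif."

    def recognize_handwritten(image, box):
        """Stand-in for the human-in-the-loop handwriting path."""
        return "<pending annotation>"

    def extract_text(image):
        results = []
        for box in detect_text_regions(image):
            if classify_text_type(image, box) == "printed":
                results.append((box, recognize_printed(image, box)))
            else:
                results.append((box, recognize_handwritten(image, box)))
        return results

    print(extract_text(image=None))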

Figure 1: Full text extraction pipeline.

Figure 2. The top figure shows bounding boxes produced by the text detection module. The bottom figure shows the heat map of text probabilities; brighter colors indicate a higher probability that the region contains text.



1. Machine-printed text


Approach. We will leverage the success of OCR and use commercial or open-source OCR engines, such as ABBYY and Tesseract, to recognize machine-printed text. First, we perform a series of preprocessing steps, such as thresholding, denoising, and binarization, to enhance the inputs. The output of the preprocessing step is then passed directly to an OCR engine to obtain the final recognition. Figure 3 shows selected examples of detected machine-printed text in the data set.
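As a rough illustration of this step, the sketch below denoises and binarizes a cropped label image with OpenCV and then calls Tesseract through the pytesseract wrapper. It assumes both packages and the Tesseract binary are installed; the file name and preprocessing parameters are placeholders, not the project's actual settings.

    import cv2
    import pytesseract

    def recognize_printed(image_path):
        """Simple preprocessing + OCR sketch for a cropped machine-printed region."""
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        gray = cv2.medianBlur(gray, 3)                       # light denoising
        _, binary = cv2.threshold(gray, 0, 255,              # Otsu binarization
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return pytesseract.image_to_string(binary)

    print(recognize_printed("printed_label_crop.png"))  # hypothetical file name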

Figure 3. Selected examples of machine-printed text. Most of these appear in a nearly frontal pose and contain little noise.

2. Handwritten text
Compared to machine-printed text, handwritten text recognition is a much harder problem. Handwriting often contains more noise and wider variation in color, shape, and appearance. Figure 4 shows an example of handwritten text cropped from an image in the data set.
Figure 4. An example of handwritten text in the data set. The text in the picture is "Junniper Flat. Modoc Co., Calif. June 1939."

We observe that humans are reasonably good at recognizing handwritten text, especially when the reader is an expert in the field the handwriting comes from. For example, a pharmacist is better at reading a prescription note than an average person.

Approach. We approach this problem by combining human intelligence, through crowdsourcing (ZooUniverse or Amazon Mechanical Turk), with machine intelligence (computer vision and machine learning). This approach is often referred to as human-in-the-loop. We aim to create a system where humans interact with the machine by performing a small number of manual annotations, far fewer than the total number of examples in the data set. Our algorithm then learns from these annotations, becomes smarter as more annotations become available, and improves its ability to recognize handwritten text automatically. Figure 5 describes the general flow of a human-in-the-loop system.

Figure 5. General flow of a human-in-the-loop system. Users from an online crowdsourcing platform provide annotations for handwritten text. These annotations improve the machine learning model. The model then selects a new set of images to be annotated. As more and more handwritten text is annotated, the model becomes "smarter".
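A highly simplified sketch of this loop is shown below, using scikit-learn's logistic regression as an assumed model and a made-up request_annotations function standing in for the crowdsourcing step; all data here is fabricated purely to show the plumbing.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    np.random.seed(0)

    def request_annotations(indices):
        """Placeholder for the crowdsourcing step (ZooUniverse / Mechanical Turk)."""
        return list(np.random.randint(0, 2, size=len(indices)))  # fake labels

    # Fake word-image features; in practice these come from the vision pipeline.
    features = np.random.rand(1000, 64)
    labeled_idx = list(range(20))              # small initial annotation set
    labels = request_annotations(labeled_idx)

    model = LogisticRegression()
    for _ in range(5):
        model.fit(features[labeled_idx], labels)
        # Ask humans about the examples the model is least certain about.
        uncertainty = np.abs(model.predict_proba(features)[:, 1] - 0.5)
        new_idx = [i for i in np.argsort(uncertainty) if i not in labeled_idx][:10]
        labels += request_annotations(new_idx)
        labeled_idx += new_idx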


In the next blog post, we will discuss further details of the handwriting recognition module, such as the preprocessing steps prior to recognition, the interface through which users perform annotations, and the back-end machine learning model. Figure 6 shows an example of a back-end approach, proposed by Manmatha et al. (1996), that helps reduce human annotation time.

Figure 6. Illustration of the word spotting idea. Similar-looking word images are clustered into the same set. When a user annotates one word in the set, the system broadcasts the annotation to the rest of the group, thereby reducing the number of manual annotations required.
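The sketch below illustrates this cluster-and-broadcast idea on fabricated feature vectors: word images are grouped with k-means and a single user-provided transcription is propagated to every member of the cluster. The features, cluster choice, and label are made up for illustration; the actual word spotting features follow Manmatha et al. (1996).

    import numpy as np
    from sklearn.cluster import KMeans

    np.random.seed(0)

    # Fake descriptors for 100 cropped word images.
    word_features = np.random.rand(100, 32)
    clusters = KMeans(n_clusters=10, n_init=10).fit_predict(word_features)

    # A user transcribes one representative word from cluster 3 ...
    user_label, annotated_cluster = "Calif.", 3

    # ... and the transcription is broadcast to every word image in that cluster.
    transcriptions = {int(i): user_label
                      for i in np.where(clusters == annotated_cluster)[0]}
    print(f"One manual annotation labeled {len(transcriptions)} word images.")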


Reference


R. Manmatha, C. Han, and E. Riseman. Word Spotting: A New Approach to Indexing Handwriting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 1996. A longer version is available as UMass Technical Report TR95-105.

Friday, February 24, 2012

Introduction

OCR Bug Project

Updated (3/1/2012):
1. Added a heat map
2. Explained the results more clearly


Introduction
The UC Berkeley Entomology Department is looking to digitize and geo-reference 1.2 million specimens. One of the tasks is to record information related to each specimen in a database. Because of the large amount of data, manual solutions do not scale. A more automatic approach is to use commercial OCR software to perform text extraction; however, due to the complex nature of the text (both printed and handwritten), this approach has not seen much success. A computer vision project was started at UCSD to automatically (or semi-automatically) extract text from a series of images containing text and the specimen. Figure 1 shows two examples of such images.


Figure 1. Examples of images from the data set.


Who are we?


We are part of the Computer Vision group at UC San Diego. Professor Belongie directly oversees the progress and technical design of the project. Kai Wang acts as a direct technical consultant for this project. I, Phuc Nguyen, am implementing and managing the text extraction algorithm and this blog.
This blog describes the progress and brief technical details of the text extraction process.

Technical details


In this section, we briefly describe the technical aspects of the current progress. If you are more interested in the results, please scroll down to the results section.


General Pipeline


We break the text extraction problem into two sub-problems: text detection and text recognition. One expected difficulty is recognizing handwritten text. A proposed solution combines clustering with the power of the Amazon Mechanical Turk service.
In this blog post, we discuss the technical details and results of the text detection algorithm.


Text Detection


We implement a sliding-window classifier, as it has proven to be an effective technique in previous detection problems. We use the features described in [Chen 2004] and are experimenting with other features, such as local binary patterns [Ojala 2002].
For classification, we use a logistic regression model and train its parameters with stochastic gradient ascent.
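A bare-bones sketch of such a detector is shown below: each window is scored with a logistic model over its raw pixels. The random weights and plain-pixel features are stand-ins; the actual system uses the features of [Chen 2004] and parameters trained with stochastic gradient ascent.

    import numpy as np

    def sliding_windows(image, win_h, win_w, step):
        """Yield (row, col, patch) for each window position in a grayscale image."""
        H, W = image.shape
        for r in range(0, H - win_h + 1, step):
            for c in range(0, W - win_w + 1, step):
                yield r, c, image[r:r + win_h, c:c + win_w]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def detect_text(image, weights, bias, win_h=35, win_w=70, step=8, thresh=0.5):
        """Score each window with logistic regression; return windows above threshold."""
        boxes = []
        for r, c, patch in sliding_windows(image, win_h, win_w, step):
            x = patch.astype(float).ravel() / 255.0   # simplistic pixel features
            if sigmoid(x @ weights + bias) > thresh:
                boxes.append((r, c, win_h, win_w))
        return boxes

    # Toy usage with random weights, purely to show the plumbing.
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(200, 300), dtype=np.uint8)
    weights = rng.normal(scale=0.01, size=35 * 70)
    print(len(detect_text(image, weights, bias=0.0)))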


Results


We use a 2:1 aspect ratio for our windows, with sizes ranging from 35×70 pixels to 125×250 pixels. For each image in the data set, we hand-annotated bounding boxes for the text regions. To measure accuracy at testing time, for each window classified as positive, we check whether it falls within one of the hand-annotated bounding boxes.
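One simple way to implement this check is sketched below: a detected window counts as correct if its center falls inside some hand-annotated box. This containment test and the toy boxes are assumptions for illustration; the project may use a stricter overlap criterion.

    def center_inside(window, gt_box):
        """True if the center of a detected window lies inside a ground-truth box.
        Boxes are (x, y, width, height)."""
        wx, wy, ww, wh = window
        gx, gy, gw, gh = gt_box
        cx, cy = wx + ww / 2.0, wy + wh / 2.0
        return gx <= cx <= gx + gw and gy <= cy <= gy + gh

    def detection_accuracy(positive_windows, gt_boxes):
        hits = sum(any(center_inside(w, g) for g in gt_boxes)
                   for w in positive_windows)
        return hits / float(len(positive_windows)) if positive_windows else 0.0

    # Toy example with made-up boxes.
    print(detection_accuracy([(100, 50, 70, 35), (400, 10, 70, 35)],
                             [(90, 40, 200, 60)]))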

We achieve 96.2% accuracy in detection. Figure 2 shows two instances of text detection. Figure 3 shows another example with its corresponding heat map.

Figure 2: Examples of text detection. The overlapping red boxes are the sliding windows that the classifier detects as text.

Figure 3: An image and its corresponding heat map. Brighter colors in the heat map indicate a higher probability that the region contains text.

Next step

In the next post, we will discuss an attempt to cluster the windows from the detected regions using different clustering techniques.

Other results


To further test the performance of our text detection algorithm, we use the Street View Text dataset [Kai]. The algorithm does not perform as well on this data set, since text in the wild exhibits more noise and variation. We will not attempt to improve the detector's performance on this data set, as the current detector is sufficient for our objective. Figure 4 shows two examples from the Street View data set.
Figure 4: Examples of text detection in the wild. Some false positives and false negatives are visible in these images.