OCR in the Wild: SOTA in Text Detection and Text Recognition

Written by Noé Achache

Deep-learning-based approaches enable the detection and recognition of complex text instances in natural images.

Introduction

Optical Character Recognition (OCR) consists of automatically decoding text instances in images. The complexity of this task varies substantially from one application to another. On the one hand, reading words from a scanned newspaper is relatively easy, since the text is straight, consistent and well contrasted with the background. On the other hand, a curved text instance with a creative font in a natural image is not only much more challenging to read (text recognition); it also requires a tricky preliminary step to locate the text in the image (text detection). While traditional computer vision techniques handle the former case well, the latter requires State-Of-The-Art (SOTA) deep learning methods combining CNNs and RNNs.

Two steps are necessary to read scene text, i.e. text instances “in the wild”:

Text Detection: finding the text instances in the image.
Text Recognition: decoding each character of the detected text instances.

Fig.2: Scene Text Detection Pipeline

Due to the complexity of the font, reading the words ‘Metropolitain’ and ‘Rome’ of Fig.2 is challenging!

To better understand scene text detection and recognition, this post discusses:

Textsnake [Long et al., 2018], a text detection algorithm notable for handling very complex text shapes.
MORAN [Luo et al., 2019], a text recognition algorithm using a rectification network and the attention mechanism to correct and read complicated textboxes.
FOTS [Liu et al., 2018], an end-to-end approach sharing the convolutions between the detection step and the recognition step to improve robustness and efficiency.
An interesting result from an experiment mixing up FOTS and Textsnake.

Text Detection

SOTA detection algorithms typically build on a U-net (Fig.3). They mostly differ from each other in the channels they choose to output and in their postprocessing methods. Indeed, the raw segmentation is usually not accurate enough, so postprocessing is essential to reconstruct the text box.

U-net in the Context of Text Detection

The segmentation maps shown next to each step of the encoder in Fig.3 are the result of performing a single deconvolution from there (images taken from [Zhang et al., 2019]). This visualisation highlights how the encoder progressively extracts deep features at the expense of spatial information, because of the height and width reduction. Hence, progressively merging the encoded maps with the decoded maps (grey arrows) makes it possible to retrieve the spatial information lost during the encoding.

Fig.3: The general structure of the U-net with a visualization of the deep features in each step of the encoder. The number of convolutions and channels at each step depends on the backbone architecture.
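To make the trade-off between depth and spatial resolution concrete, here is a minimal sketch that only tracks the shapes of the feature maps through a toy U-net. The function name `unet_shapes`, the channel counts and the halving/doubling at each stage are assumptions chosen for illustration; a real backbone applies convolutions after each concatenation, which usually reduces the number of channels again.

```python
def unet_shapes(height, width, encoder_channels):
    """Trace (channels, height, width) through a toy U-net.

    Assumption: each encoder stage halves height and width, and each decoder
    stage doubles them and concatenates the matching encoder map.
    """
    encoder = []
    h, w = height, width
    for c in encoder_channels:
        h, w = h // 2, w // 2          # spatial resolution is lost here
        encoder.append((c, h, w))

    decoder = []
    c, h, w = encoder[-1]              # start from the deepest feature map
    for skip_c, skip_h, skip_w in reversed(encoder[:-1]):
        h, w = h * 2, w * 2            # upsample back towards the input size
        assert (h, w) == (skip_h, skip_w), "skip connection needs equal sizes"
        c = c + skip_c                 # concatenate along the channel axis
        decoder.append((c, h, w))
    return encoder, decoder


encoder, decoder = unet_shapes(256, 256, [64, 128, 256, 512])
print(encoder)  # deepest map: many channels, small height and width
print(decoder)  # merged maps: spatial size grows back stage by stage
```

The encoder ends with a 16x16 map, while the last merged map is back at 128x128: the skip connections are what bring the fine spatial detail back.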

Textsnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [Long et al., 2018]

Unlike most detection algorithms, Textsnake handles text structures of any shape. To do so, the algorithm computes a one-pixel-thick line passing through the centre of the text instance, shown in green in Fig.4. A radius r associated with each point of this line makes it possible to draw the circles delineating the text region. Additionally, an angle θ indicates the rotation to apply to each circle to flatten the text.

Fig.4: Textsnake’s detection parameters
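The idea of delineating a text region with circles along a centre line can be sketched in a few lines. The snippet below is only an illustration in plain Python: the function name `disks_to_mask` and the sample centre-line points are hypothetical, and it ignores the angle θ, which is needed for flattening rather than for drawing the region.

```python
def disks_to_mask(height, width, centre_line):
    """Rasterise the union of disks (x, y, r) into a binary mask."""
    mask = [[0] * width for _ in range(height)]
    for cx, cy, r in centre_line:
        # Only scan the bounding square of the disk, clipped to the image.
        for y in range(max(0, int(cy - r)), min(height, int(cy + r) + 1)):
            for x in range(max(0, int(cx - r)), min(width, int(cx + r) + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2:
                    mask[y][x] = 1
    return mask


# Hypothetical centre line of a slightly curved word: (x, y, radius) triples.
centre_line = [(3, 3, 2), (6, 4, 2), (9, 6, 2)]
mask = disks_to_mask(10, 13, centre_line)
for row in mask:
    print("".join("#" if v else "." for v in row))
```

Because each point carries its own radius, the same representation covers straight, tilted and curved text without changing the code.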

The output channels are as follows (Fig.5):

TR (Text Region): Segmentation map indicating text and non-text pixels.
TCL (Text Center Line): Regions passing through the centre of the text instances.
radius, cos θ, sin θ: Parameters required for the construction of the circles.

For the TCL’s training, only pixels within the TR are considered. Consequently, the TCL does not need to learn to discriminate between text and non-text pixels, which simplifies its task. At inference, the TR map is therefore used to mask the TCL map and filter out false positives. A string of pixels is then obtained for each remaining TCL region using an iterative striding algorithm. Lastly, the text boxes are reconstructed (Fig.4 and Fig.5).
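The masking step described above can be sketched as follows. This is a simplified illustration: the function name `mask_tcl`, the 0.5 thresholds and the toy probability maps are assumptions, and the striding and reconstruction steps are left out.

```python
def mask_tcl(tr_prob, tcl_prob, tr_threshold=0.5, tcl_threshold=0.5):
    """Keep a centre-line pixel only if it also lies inside the text region."""
    return [
        [int(tr > tr_threshold and tcl > tcl_threshold)
         for tr, tcl in zip(tr_row, tcl_row)]
        for tr_row, tcl_row in zip(tr_prob, tcl_prob)
    ]


# Toy 2x4 probability maps (assumed values).
tr_prob = [[0.1, 0.9, 0.9, 0.2],
           [0.1, 0.8, 0.9, 0.1]]
tcl_prob = [[0.7, 0.2, 0.8, 0.9],   # 0.7 and 0.9 fire outside the text region
            [0.1, 0.9, 0.6, 0.1]]

masked = mask_tcl(tr_prob, tcl_prob)
print(masked)  # [[0, 0, 1, 0], [0, 1, 1, 0]]
```

The two high TCL responses that fall outside the text region are discarded, which is exactly the kind of false positive the TCL was never trained to avoid on its own.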

Fig.5: Outputs of Textsnake and boxes reconstruction pipeline

Textsnake only requires box coordinates to train. Indeed, training the TCL map does not require any extra labelling, as its ground truth can be computed directly from the bounding boxes (more details in the paper).

The algorithm accurately detects the contours of complex text boxes in the wild (Fig.6.a). Fig.6.b and c demonstrate the flattening ability of Textsnake. It performs very well on slightly curved text instances. However, as the angle increases, the rectification’s quality decreases, as observed with the first and last letters of “STARBUCKS”.

Fig.6: Results of Textsnake. (a) Green: ground truth. Yellow: detected text contours. (b) Text identified and rectified (from the paper). (c) Same as (b) but with our implementation.

Text Recognition

There are two main approaches to text recognition, both using a CNN to preprocess the image followed by an RNN to decode the text.

CRNN* + Connectionist Temporal Classification (CTC) [Shi et al., 2016]: Currently the most common approach (used by Tesseract for instance).
*CRNN = RNN on top of a CNN
CNN + Attention RNN: Currently the main area of research in text recognition. The paper detailed in this section follows this method.
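For the CTC approach, the simplest way to turn per-frame predictions into text is greedy decoding: collapse consecutive repeats, then drop the blank symbol. The sketch below assumes the most likely label of each frame has already been picked; the function name and the use of "-" as the blank are illustrative choices, not taken from any particular library.

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse repeated labels, then remove blanks (greedy CTC decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


print(ctc_greedy_decode("--RR-OO--MM-EE"))  # ROME
print(ctc_greedy_decode("hh-e-ll-lll-oo"))  # hello
```

The blank is what allows genuine double letters: in the second example, the two runs of "l" are separated by a blank and therefore both survive.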

Attention RNNs in the Context of Text Recognition

As specified above, a CNN first preprocesses the input text image. Most of the time, this includes flattening the image by reducing its height to 1 pixel while increasing its number of channels. The resulting array is then split along the width axis into w vectors, w being the width, which constitute the inputs of the attention RNN.
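The split along the width axis amounts to reading the feature map column by column. Here is a minimal sketch with nested lists, assuming a map already reduced to height 1 and indexed as [channel][row][column]; the name `to_sequence` and the toy values are hypothetical.

```python
def to_sequence(feature_map):
    """Turn a (channels, 1, width) map into `width` vectors of size `channels`."""
    channels = len(feature_map)
    assert all(len(ch) == 1 for ch in feature_map), "height must already be 1"
    width = len(feature_map[0][0])
    return [[feature_map[c][0][x] for c in range(channels)] for x in range(width)]


# Toy map with 3 channels, height 1 and width 4.
feature_map = [[[1, 2, 3, 4]], [[5, 6, 7, 8]], [[9, 10, 11, 12]]]
sequence = to_sequence(feature_map)
print(sequence)  # [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```

Each of the 4 vectors describes one vertical slice of the text image, ordered from left to right.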

The attention mechanism allows the RNN to focus on the right inputs when generating each output, i.e. each letter of the text. It therefore specifies how much attention should be paid to each input (i.e. each region of the image), regardless of where they are in the sequence. Fig.7 summarizes the functioning of an attention RNN. To get a better intuitive understanding of the attention mechanism, watch this video and this one from Andrew Ng.
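As a rough illustration of a single decoding step, the sketch below scores every input vector against the current decoder state, normalises the scores with a softmax and builds a weighted average of the inputs. The dot-product score is an assumption made for brevity (attention RNNs often use a small learned scoring network instead), and all names and values are hypothetical.

```python
import math


def attend(decoder_state, inputs):
    """Return attention weights and the context vector for one output step."""
    # Score each input against the decoder state (dot product, assumed here).
    scores = [sum(s * x for s, x in zip(decoder_state, vec)) for vec in inputs]
    # Numerically stable softmax: the weights are positive and sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted average of the inputs.
    size = len(inputs[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, inputs))
               for i in range(size)]
    return weights, context


inputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]   # three image regions
weights, context = attend([0.0, 2.0], inputs)
print([round(w, 3) for w in weights])
```

Here the third region receives almost all of the attention, whatever its position in the sequence, because it matches the decoder state best.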

MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition [Luo et al., 2019]

MORAN’s architecture combines two networks. The first one, MORN, rectifies the text box (as Textsnake does, although not all detection networks do so) before the second one, ASRN, reads the text (Fig.8).

MORN first processes the input image (i.e. the text box) with a CNN that rectifies the image. To do so, its output maps specify, for each corresponding region of the image, the number of pixels by which it should be shifted in the x and y directions (i.e. 2 output maps) to straighten the text.
ASRN decodes the text with an attention RNN as explained above. The significant amount of noise in natural images often misleads the network into focusing on the wrong part of the image. The ASRN overcomes this issue with a regularization method called fractional pickup, which consists of randomly mixing up fractions of the attention weights of neighbouring inputs.
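One possible reading of fractional pickup is sketched below: pick a random pair of neighbouring attention weights and blend them with a random fraction. This is an interpretation of the description above rather than a transcription of the paper, so the exact formulation should be checked against [Luo et al., 2019]; the function name, the `rng` parameter and the sample weights are hypothetical.

```python
import random


def fractional_pickup(weights, rng=random):
    """Randomly blend one pair of neighbouring attention weights."""
    if len(weights) < 2:
        return list(weights)
    k = rng.randrange(len(weights) - 1)   # index of the left neighbour
    beta = rng.random()                   # mixing fraction in [0, 1)
    mixed = list(weights)
    mixed[k] = beta * weights[k] + (1 - beta) * weights[k + 1]
    mixed[k + 1] = (1 - beta) * weights[k] + beta * weights[k + 1]
    return mixed


weights = [0.05, 0.80, 0.10, 0.05]
mixed = fractional_pickup(weights, random.Random(0))
print(mixed)
```

With this formulation the two blended weights keep the same sum, so the attention weights still add up to 1. Being a regularizer, such a perturbation would be applied during training only.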

A remarkable aspect of MORAN is that it only requires weak labels, i.e. the output text. Indeed, the network is trained end-to-end with a single loss function. By backpropagating the decoding error, the MORN manages to learn to rectify the image!

MORN does not perform as well with curved text as with tilted text (Fig.9.a). However, it still manages to remove useless features (red circle) which are likely to confuse the attention layer. To decode the text, the ASRN draws its attention to each letter one by one (Fig.9.b). Furthermore, due to fractional pickup, it does not get confused by features such as shadows. Although the MORN rectification is far from perfect, it is effective enough to allow the ASRN to decode the text correctly (Fig.9.c).

Fig.9: Results of MORAN. (a) Results of MORN. (b) Decoding with and without Fractional Pickup (FP). (c) Comparison of MORAN with another network.

End-to-end approach

FOTS: Fast Oriented Text Spotting with a Unified Network [Liu et al., 2018]

The CNNs used in text detection and recognition both aim to extract the text features from the input image. Therefore, the features they compute are highly correlated. Hence, the major drawback of using a different algorithm for each of these two steps is computing the same information twice. FOTS addresses this issue by reusing the convolutions computed by the detection algorithm EAST (Efficient and Accurate Scene Text Detector) [Zhou et al., 2017] in a CRNN+CTC decoder (mentioned above).

In FOTS, the text boxes are directly extracted from the shared features instead of the original image. The shared features denote the feature maps produced by the penultimate layer of the detection step, i.e. the maps from which the detection outputs are computed. Thus, the recognition part can use a smaller CNN since its input has already been processed.
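Cropping from the shared features rather than from the image mostly comes down to rescaling the box coordinates by the downsampling factor of the network. The sketch below handles axis-aligned boxes only; the function name, the stride of 4 and the toy feature map are assumptions for illustration, and tilted boxes would additionally need to be rotated.

```python
def crop_shared_features(features, box, stride=4):
    """Crop an axis-aligned image box (x0, y0, x1, y1) from a feature map.

    `features` is indexed as [channel][row][column]; `stride` is the assumed
    downsampling factor between the image and the shared features.
    """
    x0, y0, x1, y1 = box
    fx0, fy0 = x0 // stride, y0 // stride           # round the start down
    fx1, fy1 = -(-x1 // stride), -(-y1 // stride)   # round the end up
    return [[row[fx0:fx1] for row in channel[fy0:fy1]] for channel in features]


# Toy shared features: 2 channels, 6 rows, 8 columns (a 24x32 image at stride 4).
features = [[[c * 100 + y * 10 + x for x in range(8)] for y in range(6)]
            for c in range(2)]
crop = crop_shared_features(features, box=(8, 4, 24, 12))
print(crop[0])  # [[12, 13, 14, 15], [22, 23, 24, 25]]
```

The recognition branch then works on this small crop, which already contains processed features, instead of re-encoding the corresponding image patch.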

FOTS is very efficient: it employs EAST, known for its speed, and shared features, which avoid redundant computations. The algorithm also handles tilted text instances by rotating the corresponding boxes cropped from the shared features. Additionally, the end-to-end approach improves the robustness of the detection step (Fig.10).

Fig.10: Comparison between FOTS and a classic approach

Experiment: Mixing Textsnake and FOTS

During a previous experiment, I implemented a custom algorithm based on FOTS, but with Textsnake instead of EAST for the detection part. Curved texts were rectified as in Fig.6, but directly within the shared features instead of the original image. The algorithm performed very well in both detection and recognition. However, the rectification of the text was sometimes very poor, and the result was challenging to read even for a human. Nevertheless, in the case of Fig.11, the decoder managed to give a great output: only the “M” of “MÉDECINS” is missing! Indeed, each element of the shared features has a large receptive field, and hence the elements in the box contain information about the missing parts of the letters. For instance, the information about the missing segments of the letter “E” in “MONDE” is contained within the shared features.

Fig.11: Result obtained using a mix of FOTS and Textsnake, with a poor text rectification but a great recognition. Note that the text rectified from the original image is only for visualisation purposes.
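To give an idea of how quickly the receptive field grows, the sketch below applies the standard recurrence for a stack of convolution and pooling layers. The layer list is a hypothetical example, not the actual backbone used in the experiment.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output element.

    `layers` is a list of (kernel_size, stride) tuples, from input to output.
    """
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump   # each layer widens the field
        jump *= stride                # strides make later layers widen it faster
    return size


# Hypothetical stack: three blocks of a 3x3 convolution and a 2x2 pooling.
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(layers))  # 22
```

Even this shallow stack lets a single feature element see a 22-pixel-wide window of the image, so an element inside a badly rectified box can still carry information about letter segments lying outside the crop.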

I hope this post gave you a good understanding of the papers described! I would be thrilled to hear your feedback and answer any questions you may have (noe.achache@gmail.com).
