Generate text boxes yourself with the TextBoxGAN repo (trained model available).
Generative Adversarial Networks (GANs) have gained a lot of attention recently, mostly thanks to the mesmerizing results of StyleGAN2. How do they work? How can we use them to generate readable text boxes from input words? Why is this useful? We are going to answer all these questions.
Why generate text boxes?
Labeling data to train an Optical Character Recognition (OCR) network, commonly used to read text in natural images, is expensive. Indeed, it involves cropping the text instances within your images to get text boxes, then manually transcribing the text. If you consider upper and lower cases and special characters, the number of classes becomes quite high. Hence, a large amount of data is needed for training.
To solve this problem, a few synthetic datasets (SynthText, Synth90K, ...) provide millions of images. They render the text boxes with traditional methods (i.e. with no AI), randomizing many parameters such as the font, the size, or the presence of a border.
Our model, TextBoxGAN, generates text boxes from input words with a GAN, as a new approach to create a synthetic dataset. The main advantage compared to other synthetic datasets is that the texts generated are not constrained by a pre-defined font.
How do GANs work?
The aim of a GAN is to train a generator to synthesize images from random noise, that resemble the images of a training set. To do so, another network is trained alongside the generator, that tries to discriminate between real images (from the dataset) and fake images (synthesized by the generator). The loss of the discriminator is backpropagated through the discriminator and the generator, training both networks.
There are plenty of online resources available if you wish to gain more in-depth knowledge of GANs, such as this article.
We incorporated several adjustments to the original GAN architecture, mainly to allow control over the characters our network generates.
Generating text boxes
TextBoxGAN mainly consists of a generator synthesizing text boxes and a discriminator trying to tell the difference between real and fake text boxes, which is the common pattern of a GAN, as explained above.
The generator takes as input normally distributed random noise, encoded by the latent encoder. This network, introduced in the StyleGAN paper, passes the noise through a succession of dense layers and outputs a style vector, which defines the style of the text box (color, texture, ...). The other input of the generator is a word, encoded by the word encoder.
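As a rough picture, the latent encoder is a small MLP mapping noise to style. The sketch below is illustrative only: the dimensions, depth, and activation are assumptions, not the repo's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the repo's latent encoder may differ.
LATENT_DIM, N_LAYERS = 512, 8
weights = [rng.normal(0.0, 0.02, (LATENT_DIM, LATENT_DIM)) for _ in range(N_LAYERS)]

def latent_encoder(z):
    # a succession of dense layers, as in StyleGAN's mapping network
    w = z
    for W in weights:
        h = w @ W
        w = np.maximum(0.2 * h, h)  # leaky ReLU
    return w  # style vector controlling color, texture, ...

z = rng.normal(size=(1, LATENT_DIM))  # normally distributed noise
style = latent_encoder(z)
```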
The word encoder transforms each character of the input word individually through an embedding followed by a dense layer. The character encodings are then concatenated while preserving their order. Since the synthesis network requires a fixed-size input, the maximum number of characters in the input word is fixed. When the input word has fewer characters than this maximum, the output tensor is simply padded with zeros.
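The embed-transform-concatenate-pad scheme can be sketched as follows (charset, embedding size, and maximum word length are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the repo uses its own charset and dimensions.
CHARSET = "abcdefghijklmnopqrstuvwxyz"
EMBED_DIM, MAX_CHARS = 16, 10
embedding = rng.normal(size=(len(CHARSET), EMBED_DIM))  # one row per character
dense_w = rng.normal(size=(EMBED_DIM, EMBED_DIM))       # shared dense layer

def encode_word(word):
    # embed + transform each character individually, preserving order
    chars = [embedding[CHARSET.index(c)] @ dense_w for c in word]
    out = np.zeros((MAX_CHARS, EMBED_DIM))  # fixed-size output tensor
    out[: len(chars)] = chars               # unused slots stay zero-padded
    return out

enc = encode_word("hello")
```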
Both the encoded word and the style vector are fed to the synthesis network, which generates the text box through a succession of deconvolutions.
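To make the upsampling concrete, here is a shape-only sketch of such a succession of deconvolutions (transposed convolutions); the sizes are illustrative assumptions, not the repo's actual architecture:

```python
# Each deconvolution block doubles the spatial resolution while halving
# the channel count, until the feature map reaches the text-box size.
h, w, c = 4, 16, 512  # initial feature map built from the word and the style
for _ in range(3):    # three illustrative upsampling blocks
    h, w, c = h * 2, w * 2, c // 2
final_shape = (h, w, c)
```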
The network uses two losses: the GAN loss to ensure the generated text box resembles a real text box and the OCR loss, which allows controlling the text on the image.
The GAN loss consists of a softplus (a smooth approximation of the ReLU) applied to the discriminator's output. Naturally, the generator is trained with the opposite of this loss, since its objective is to fool the discriminator.
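In formulas, this is the non-saturating logistic GAN loss. A minimal NumPy sketch, where `d_real` and `d_fake` stand for the discriminator's raw scores (logits) on real and generated text boxes:

```python
import numpy as np

def softplus(x):
    # smooth approximation of the ReLU: log(1 + exp(x))
    return np.log1p(np.exp(x))

def discriminator_loss(d_real, d_fake):
    # push real scores up and fake scores down
    return softplus(-d_real).mean() + softplus(d_fake).mean()

def generator_loss(d_fake):
    # the opposite objective: make fakes score as real
    return softplus(-d_fake).mean()
```

Intuitively, the generator's loss shrinks as the discriminator's score on fake images rises, i.e. as the fakes become more convincing.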
To compute the OCR loss, the generated text box is passed through Aster, a pre-trained OCR with frozen weights. There are two possible formulations for the OCR loss:
- A softmax crossentropy between the logits (probability distribution over all possible characters, produced by the OCR) and the ground truth (input word).
- A mean squared error between the resulting logits and the logits computed on the real text box. This requires that the input word matches the word on the real text box. In doing so, our classification problem becomes more of a regression problem, because the objective is not to write perfect text, but rather to write the text as it appears on the image.
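The first option is a standard per-character softmax crossentropy; a minimal sketch (with a toy three-character vocabulary, far smaller than a real OCR's):

```python
import numpy as np

def ocr_loss(logits, target):
    # logits: (num_chars, vocab_size) scores from the OCR per character slot
    # target: integer class index of each ground-truth character
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # negative log-likelihood of the input word's characters
    return -log_probs[np.arange(len(target)), target].mean()

logits = np.array([[5.0, 0.0, 0.0],   # OCR confident the 1st char is class 0
                   [0.0, 5.0, 0.0]])  # and the 2nd char is class 1
loss = ocr_loss(logits, np.array([0, 1]))  # ground truth matches: small loss
```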
Using the softmax crossentropy loss gives significantly better results. The README of the repo provides more details on the related experiments.
Running the test set through our model leads to an average OCR loss (softmax crossentropy) 5-6 times larger than the baseline loss computed on real images. Indeed, some style vectors lead to hardly readable text boxes, which explains the difference between the two losses.
The projector is a concept borrowed from the StyleGAN paper, which allows extracting the style of an image. In our version, the process starts by generating a text box with the word from the original image. Then, the style produced by the latent encoder is optimized using the OCR loss and a perceptual loss, to make the generated text box as similar as possible to the original image.
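The key idea is that the optimization target is the style vector itself, not the network weights. A toy, self-contained sketch of that loop, with a one-line stand-in "generator" and an MSE stand-in for the perceptual loss (the real projector backpropagates through the frozen generator and Aster):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(style):
    return np.tanh(style)  # stand-in for the frozen generator

def loss_fn(style, target):
    # stand-in perceptual loss; the real projector adds the OCR loss too
    return np.sum((generate(style) - target) ** 2)

target = 0.9 * np.tanh(rng.normal(size=64))  # "original image" to match
style = 0.5 * rng.normal(size=64)            # initial style guess
for _ in range(500):                         # gradient descent on the style
    fake = generate(style)
    grad = 2.0 * (fake - target) * (1.0 - fake ** 2)  # analytic gradient
    style -= 0.25 * grad
```

After the loop, `generate(style)` closely reproduces the target, and the recovered style can be reused with other inputs, just as the projector reuses a recovered style to write new words.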
The resulting style can then be reused to write new words, as shown below. This feature could be very useful if you know that your OCR will have to read text boxes that share some kind of similarity, because you will want to generate data from a specific distribution, to make it as close as possible to your use case.
Our model manages to generate readable text boxes that match the input words, with various styles. However, there are a few limitations. The main one is probably that changing the style vector does not significantly change the shape of the characters. Hence, training an OCR only with data generated by our model will not allow it to generalize enough. However, TextBoxGAN may be appropriate for data augmentation, at the risk of creating a bias towards some character shapes.
Nevertheless, with some improvements, the model could probably generate diverse and unbiased text boxes. Thus, it certainly has the potential to train OCRs in a better way than the synthetic datasets discussed at the beginning of this article.
To try it yourself, head over to the repo! A Docker setup and a trained model are available to get you started quickly. You will also find more technical information there, mostly on the experiments and the datasets used.