Monday, April 23, 2018

Tesseract OCR fails to detect varying font size and letters that are not horizontally aligned


I am trying to read the text on these price labels, which have always been cleanly preprocessed. Tesseract easily reads the text written above the price, but it fails to detect the price values. I am using the Python bindings (pytesseract), although it also fails when run from the CLI. Most of the time it recognizes the price as only one or two characters.

Sample 1:

tesseract D:\tesseract\tesseract_test_images\test.png output 

And the output of the sample image is this.

je Beutel

13

However, if I crop and stretch the price so that the characters look separated and are the same font size, the output is just fine.

Processed image (price cropped and shrunk):

je Beutel

1,89

How do I get Tesseract OCR to work as intended, given that I will be going over a lot of similar images?

Edit: Added more price tags: sample2, sample3, sample4, sample5, sample6, sample7

2 Answers

Answer 1

The problem is that the image you are using is small. When Tesseract processes it, it may treat '8', '9' and ',' as a single letter and predict '3', or treat '8' and ',' as one letter and '9' as another, and so it produces the wrong output. The image below illustrates this.

detected contours of original(small) image

A simple solution is to enlarge the image by a factor of 2 or 3 (or more, depending on the size of your original image) before passing it to Tesseract, so that it detects each letter individually, as shown below. (Here I enlarged it by a factor of 2.)

detected contours of resized(larger) image

Below is a simple Python script that does this:

import pytesseract
import cv2

img = cv2.imread('dKC6k.png')
img = cv2.resize(img, None, fx=2, fy=2)

data = pytesseract.image_to_string(img)
print(data)

Detected text:

je Beutel

89
1.

Now you can simply extract the required data from the text and format it as per your requirement.

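# Collapse blank lines, then split the OCR output into separate lines
data = data.replace('\n\n', '\n')
data = data.split('\n')

# The whole-number part comes out last with a trailing separator; the cents come before it
dollars = data[2].strip(',').strip('.')
cents = data[1]

print('{}.{}'.format(dollars, cents))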

Desired Format:

1.89 
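The fixed indices above depend on the OCR output having exactly that line layout. A slightly more defensive variant might look like the sketch below. It assumes, based only on the single sample output shown, that the whole-number part is the numeric line ending in ',' or '.' and that the cents are the other numeric line. The helper name parse_price is hypothetical.

```python
import re


def parse_price(ocr_text):
    """Rebuild 'units.cents' from OCR text whose price parts land on separate lines.

    Assumption: the whole-number part is the numeric line that ends with a
    separator (',' or '.'); the first other numeric line holds the cents.
    Returns None when either part is missing.
    """
    # Keep only lines made up of digits and separators, e.g. "89" or "1."
    numeric = [line.strip() for line in ocr_text.splitlines()
               if re.fullmatch(r"[\d.,]+", line.strip())]

    units = cents = None
    for token in numeric:
        digits = re.sub(r"\D", "", token)
        if not digits:
            continue  # a stray "," or "." on its own line
        if token.endswith((",", ".")):
            units = digits
        elif cents is None:
            cents = digits

    if units is None or cents is None:
        return None
    return "{}.{}".format(units, cents)


print(parse_price("je Beutel\n\n89\n1."))
```

Returning None instead of raising makes it easier to flag labels that still need manual review when processing many images.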

Answer 2

The problem is that the Tesseract engine was not trained to read this kind of text topology.

You can:

  • train your own model. In particular, you will need to provide images with variations in topology (the position of the characters). You can actually use the same image and shuffle the positions of the characters.
  • reorganize the image into clusters of text and then use Tesseract. In particular, I would take the cents part and move it to the right of the comma; in that case you can use Tesseract out of the box. A few relevant criteria would be the height of the clusters (to differentiate cents from integers) and the position of the clusters (read from left to right).

In general, computer vision algorithms (including CNNs) give you tools to obtain a higher-level representation of an image (features or descriptors), but they do not create the logic or algorithm needed to process intermediate results in a certain way.

In your case that would be:

  • "if the height of those characters is smaller, they are the cents",
  • "if the height and vertical position are the same, the characters belong to the same number, either to the left or to the right of the comma".

The thing is that this is difficult to reach through training, while it is extremely simple for a human to write as an algorithm. Sorry for not giving you an actual implementation, but my text is the pseudocode.
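As an illustration only, the two rules above could be sketched in Python roughly as follows. The sketch assumes you already have one bounding box per recognized character as a (char, x, y, width, height) tuple, for example from contour detection or from something like pytesseract.image_to_boxes. The function name assemble_price, the 0.75 height ratio and the sample box values are all made up for this example.

```python
def assemble_price(boxes, height_ratio=0.75):
    """Combine per-character boxes into a 'units.cents' string.

    boxes: iterable of (char, x, y, width, height) tuples.
    Digits noticeably shorter than the tallest digit are treated as cents;
    the rest form the whole-number part. Both groups are read left to right.
    """
    # Ignore the comma and anything else that is not a digit
    digits = [b for b in boxes if b[0].isdigit()]
    if not digits:
        return None

    tallest = max(b[4] for b in digits)
    units = [b for b in digits if b[4] >= height_ratio * tallest]
    cents = [b for b in digits if b[4] < height_ratio * tallest]

    def read(group):
        # Sort by x position so characters are read from left to right
        return "".join(b[0] for b in sorted(group, key=lambda b: b[1]))

    if not cents:
        return read(units)
    return "{}.{}".format(read(units), read(cents))


# Hypothetical boxes for a label showing a large "1," followed by a small raised "89"
sample_boxes = [
    ("8", 60, 5, 20, 30),
    ("1", 10, 5, 30, 60),
    (",", 42, 55, 8, 12),
    ("9", 82, 5, 20, 30),
]
print(assemble_price(sample_boxes))
```

The height threshold is the only tunable part. It would need adjusting if the cents on your labels are printed closer in size to the whole-number digits.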

TrainingTesseract2

TrainingTesseract4

Joint Unsupervised Learning of Deep Representations and Image Clusters
