Monday, September 3, 2018

Python OCR: ignore signatures in documents


I'm trying to OCR a scanned document that has handwritten signatures in it. See the image below.

[Image: scanned document with printed names overlapped by handwritten signatures]

My question is simple: is there a way to extract the names of the people with OCR while ignoring the signatures? When I run Tesseract OCR it fails to retrieve the names. I tried grayscaling, blurring, and thresholding with the code below, but without luck. Any suggestions?

import cv2

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # convert to grayscale
image = cv2.GaussianBlur(image, (5, 5), 0)        # smooth out noise
image = cv2.threshold(image, 0, 255,
                      cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]  # Otsu binarization
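For reference, the preprocessed array can be passed straight to Tesseract through the pytesseract wrapper. This is a minimal sketch, assuming pytesseract and the Tesseract binary are installed:

import pytesseract

# Tesseract generally expects dark text on a light background, so invert
# the THRESH_BINARY_INV result back before recognition.
text = pytesseract.image_to_string(255 - image)
print(text)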

4 Answers

Answer 1

You can use scikit-image's Gaussian filter to blur the thin lines first (with an appropriate sigma), then binarize the image (e.g., with a thresholding function), and then apply morphological operations (such as remove_small_objects, or opening with an appropriate structuring element) to remove most of the signature. After that, try classifying the characters with a sliding window (assuming the classifier has already been trained on blurred characters like those in the test image). The following shows an example.

import numpy as np
import matplotlib.pyplot as plt
from skimage.morphology import binary_opening, square
from skimage.filters import threshold_minimum, gaussian
from skimage.io import imread
from skimage.color import rgb2gray

# Blur the thin signature strokes, then binarize.
im = gaussian(rgb2gray(imread('lettersig.jpg')), sigma=2)
thresh = threshold_minimum(im)
im = (im > thresh).astype(bool)

# Opening removes the remaining thin strokes.
im1 = binary_opening(im, square(3))
plt.figure(figsize=(20, 20))
plt.imshow(im1)
plt.axis('off')
plt.show()
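For the sliding-window classification step, a minimal sketch could look like the following. The classifier clf is a hypothetical, already-trained model with a scikit-learn-style predict method, and the window and step sizes are illustrative:

import numpy as np

def sliding_window(img, win=32, step=8):
    # Yield the top-left coordinates and a win x win patch at each position.
    for y in range(0, img.shape[0] - win + 1, step):
        for x in range(0, img.shape[1] - win + 1, step):
            yield x, y, img[y:y + win, x:x + win]

detections = []
for x, y, patch in sliding_window(im1.astype(np.uint8)):
    if patch.any():                                   # skip empty background
        label = clf.predict(patch.reshape(1, -1))[0]  # hypothetical classifier
        detections.append((x, y, label))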

[Image: the binarized and opened result, with most of the signature strokes removed]

[EDIT]: Use Deep Learning Models

Another option is to pose the problem as object detection, where the characters are the objects. We can use deep learning models such as CNN/R-CNN/Fast R-CNN (with TensorFlow/Keras) for object detection, or a YOLO model (refer to this article on car detection with a YOLO model).
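As an illustration, a small character classifier in Keras could serve as a building block for such a pipeline (this is a sketch, not a full detector; the 28x28 input size, the layer sizes, and the 26 output classes are assumptions):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(26, activation='softmax'),  # one class per letter
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])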

Answer 2

I suppose the input pictures are grayscale; otherwise, the different color of the ink itself could be a distinguishing feature.

The problem here is that your training set, I would guess, contains almost only 'normal' letters, without the disturbance of a signature, so naturally the classifier won't work on letters with signature ink over them. One way forward is to extend the training set with letters of this type. Of course, extracting and labeling these letters one by one is quite a job.

You can use real letters with different signatures on them, but it may also be possible to generate similar letters artificially: you just need different letters with different snippets of signatures placed over them. This process can be automated, as in the sketch below.
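A minimal sketch of that automation (the file names letter.png and signature.png are hypothetical placeholders for a clean letter crop and a signature-stroke crop):

import random
import numpy as np
import cv2

letter = cv2.imread('letter.png', cv2.IMREAD_GRAYSCALE)      # clean letter crop
snippet = cv2.imread('signature.png', cv2.IMREAD_GRAYSCALE)  # signature stroke crop

# Crop the snippet so it fits inside the letter image, pick a random
# offset, and combine by taking the darker pixel, as overlapping ink
# would appear on paper.
h, w = letter.shape
snippet = snippet[:h, :w]
sh, sw = snippet.shape
dy = random.randint(0, h - sh)
dx = random.randint(0, w - sw)
sample = letter.copy()
roi = sample[dy:dy + sh, dx:dx + sw]
sample[dy:dy + sh, dx:dx + sw] = np.minimum(roi, snippet)
cv2.imwrite('augmented.png', sample)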

Answer 3

You may try preprocessing the image with morphological operations.

You can try opening to remove the thin lines of the signature. The drawback is that it may remove the punctuation as well.

import cv2

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (5, 5))  # cross-shaped structuring element
image = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)      # opening: erosion then dilation

You may have to alter the kernel size or shape; just try different combinations, for example with the loop below.
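A quick sketch that writes out one result per shape/size pair for visual comparison:

import cv2

for shape in (cv2.MORPH_CROSS, cv2.MORPH_RECT, cv2.MORPH_ELLIPSE):
    for size in (3, 5, 7):
        kernel = cv2.getStructuringElement(shape, (size, size))
        opened = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
        cv2.imwrite('opened_shape{}_size{}.png'.format(shape, size), opened)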

Answer 4

You can try other OCR providers for the same task, for example Google Cloud Vision: https://cloud.google.com/vision/. You can upload an image there and try it for free.

You will get a response from the API from which you can extract the text you need. Documentation for extracting that text is available on the same page.

Check out this answer of mine from when I faced the same problem; it will help you fetch that text: Convert Google Vision API response to JSON.
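For reference, a minimal sketch using the official google-cloud-vision Python client (assuming the package is installed and credentials are configured; the file name lettersig.jpg is a placeholder, and the exact Image constructor varies between client-library versions):

from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open('lettersig.jpg', 'rb') as f:
    image = vision.Image(content=f.read())

# document_text_detection is tuned for dense document text.
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)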

