
Tuesday, September 11, 2018

Convert Non-Searchable Pdf to Searchable Pdf in Windows Python


I need a solution to convert a PDF file in which every page is an image, where a page can contain text, a table, or a combination of both, into a searchable PDF.

I have used ABBYY FineReader Online, which does the job perfectly well, but I am looking for a solution that can be achieved with Python on Windows.

I have done a detailed analysis, and below are the links that came close to what I want, but not exactly:

Scanned Image/PDF to Searchable Image/PDF

It suggests using Ghostscript to convert the PDF to images first, and then converting those images directly to text. I don't believe Tesseract converts non-searchable PDFs to searchable PDFs.

Converting searchable PDF to a non-searchable PDF

The above solution helps with the reverse, i.e. converting a searchable PDF to a non-searchable one. Also, I think these solutions only apply to Ubuntu/Linux/macOS.

Can someone please tell me what the Python code should be for converting a non-searchable PDF to a searchable one on Windows?


UPDATE 1

I have got the desired result with Asprise Web Ocr. Below is the link and code:

https://asprise.com/royalty-free-library/python-ocr-api-overview.html

However, I am still looking for a solution that uses only Python libraries on Windows, because:

  1. I don't want to pay subscription costs in the future.
  2. I need to convert thousands of documents daily, and it would be cumbersome to upload each one to an API and then download the result, and so on.

UPDATE 2

I know how to convert a non-searchable PDF directly to text; I have code for that using PyPDF2. What I am looking for is a way to convert a non-searchable PDF to a searchable PDF.

3 Answers

Answers 1

Well, you don't actually need to transform everything inside the PDF to text. Text will remain text, tables will remain tables, and where possible images should become text. You would need a script that reads the PDF as-is and converts it block by block, writing blocks of text until the document has been read completely, and then transforms the result back into a PDF. Something like:

if line_is_text():
    write_the_line_as_is()
elif line_is_img():
    transform_img_in_text()
# ...

As for transform_img_in_text(), I think it could be done with many external libraries; one you could use is:

Tesseract OCR Python

You can download this lib via pip, instructions provided in the link above.

Answers 2

If an online ocr solution is acceptable to you, the free OCR API from OCR.space can also create searchable PDFs and works well.

In the free version the created PDF contains a watermark. To remove the watermark you need to upgrade to their commercial PRO plan. You can test the api with the web form on the front page.

OCR.space is also available as non-subscription on-premise option, but I am unsure about the price. Personally I use the free ocr api with good success.

Answers 3

I've used pypdfocr in the past to do this. It hasn't been updated recently though.

From the README:

pypdfocr filename.pdf --> filename_ocr.pdf will be generated 

Read carefully the Install instructions for Windows.


Monday, September 3, 2018

Python OCR: ignore signatures in documents


I'm trying to do OCR of a scanned document which has handwritten signatures in it. See the image below.

[image: scanned document with handwritten signatures]

My question is simple, is there a way to still extract the names of the people using OCR while ignoring the signatures? When I run Tesseract OCR it fails to retrieve the names. I tried grayscaling/blurring/thresholding, using the code below, but without luck. Any suggestions?

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
image = cv2.GaussianBlur(image, (5, 5), 0)
image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

4 Answers

Answers 1

You can use scikit-image's Gaussian filter to blur the thin lines first (with an appropriate sigma), followed by binarization of the image (e.g., with some thresholding function), then by morphological operations (such as remove_small_objects, or opening with an appropriate structuring element) to remove most of the signatures. You can then try classifying the characters with a sliding window (assuming a classifier has already been trained on blurred characters like those in the test image). The following shows an example.

import numpy as np
import matplotlib.pyplot as plt
from skimage.morphology import binary_opening, square
from skimage.filters import threshold_minimum, gaussian
from skimage.io import imread
from skimage.color import rgb2gray

im = gaussian(rgb2gray(imread('lettersig.jpg')), sigma=2)
thresh = threshold_minimum(im)
im = (im > thresh).astype(np.bool)
plt.figure(figsize=(20, 20))
im1 = binary_opening(im, square(3))
plt.imshow(im1)
plt.axis('off')
plt.show()

[image: result after binarization and opening]

[EDIT]: Use Deep Learning Models

Another option is to pose the problem as an object detection problem, where the characters are the objects. We can use deep learning CNN/RNN/Fast RNN models (with tensorflow/keras) for object detection, or a YOLO model (refer to this article for car detection with the YOLO model).

Answers 2

I suppose the input pictures are grayscale, otherwise maybe the different color of the ink could have a distinctive power.

The problem here is that your training set - I guess - contains almost only 'normal' letters, without the disturbance of the signature, so naturally the classifier won't work on letters with signature ink on them. One way to go could be to extend the training set with letters of this type. Of course, it is quite a job to extract and label these letters one by one.

You can use real letters with different signatures on them, but it might be also possible to artificially generate similar letters. You just need different letters with different snippets of signatures moved above them. This process might be automated.
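That generation step could be sketched as follows (a hypothetical illustration, not from the answer): darken a clean letter image with a randomly placed signature snippet, so the classifier sees letters disturbed by ink. Grayscale uint8 arrays are assumed, with 0 as ink and 255 as background.

```python
import random

import numpy as np


def overlay_signature(letter_img, snippet, seed=None):
    """Return a copy of letter_img with the snippet's ink stamped on top."""
    rng = random.Random(seed)
    h, w = letter_img.shape
    sh, sw = snippet.shape
    y = rng.randint(0, max(h - sh, 0))   # random placement on the letter
    x = rng.randint(0, max(w - sw, 0))
    out = letter_img.copy()
    region = out[y:y + sh, x:x + sw]
    # the pixel-wise minimum keeps whichever image has darker ink
    out[y:y + sh, x:x + sw] = np.minimum(region, snippet[:region.shape[0], :region.shape[1]])
    return out
```

Looping this over many letters and many signature snippets would automate building the extended training set.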

Answers 3

You may try to preprocess the image with morphologic operations.

You can try opening to remove the thin lines of the signature. The problem is that it may remove the punctuation as well.

image = cv2.imread(file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (5, 5))
image = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

You may have to alter the kernel size or shape. Just try different sets.

Answers 4

You can try other OCR providers for the same task, for example Google Cloud Vision: https://cloud.google.com/vision/. You can upload an image and check it for free.

You will get a response from API from where you can extract the text which you need. Documentation for extracting that text is also given on the same webpage.

Check out this answer of mine, from when I faced the same problem; it will help you fetch that text: Convert Google Vision API response to JSON



Monday, June 4, 2018

Tesseract does not recognize german “für”


I use the tesseract 4.0 via docker image tesseractshadow/tesseract4re

I use the option -l=deu to give tesseract the hint that the text is in "Deutsch" (German).

Still, the result for the German word "für" is not good. The word is very common (it means "for" in English).

Tesseract often detects "fiir" or "fur".

What can I do to improve this?

reproducible example

docker run --name self.container_name --rm \
    --volume $PWD:/pwd \
    tesseractshadow/tesseract4re \
    tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu

Result:

cat die-fuer-das.png.ocr-result.txt
die fur das

Image die_fuer_das.png:


2 Answers

Answers 1

I found the solution. It should be -l deu; otherwise the German language data does not get used.

Works:

===> tesseract die-fuer-das.png out -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das

Wrong language:

===> tesseract die-fuer-das.png out -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das

Answers 2

This is more a comment than a direct answer to your question. Here is another data point: If I use the link with your image with the OCR.space API it gets it perfectly right:

****** Result for Image/Page 1 ******
die für das

In the past, upscaling to 300 DPI often improved Tesseract results, but I am surprised that this should still be needed in version 4.


Sunday, June 3, 2018

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?


How can I read pdf files and save contents to a text file using Spire.PDF? For example: Here is a pdf file and here is the desired text file from that pdf

I tried the below code to read the file and save it to a text file

PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");

StringBuilder buffer = new StringBuilder();

foreach (PdfPageBase page in doc.Pages)
{
    buffer.Append(page.ExtractText());
}

doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);

But the output text file is not properly formatted: it has unnecessary whitespace, and complete paragraphs are broken into multiple lines, etc.

How do I get the desired result as in the desired text file?

Additionally, is it possible to detect and mark (e.g. add a tag to) text that is bold, italic, or underlined? Also, things get more problematic for pages that have multiple columns of text.

3 Answers

Answers 1

Using iText

File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));

SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));

System.out.println(stes.getResultantText());

This is (as the code says) a basic/simple text extraction strategy. More advanced examples can be found in the documentation.

Answers 2

Use IronOCR

var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));

For reference https://ironsoftware.com/csharp/ocr/

Using this you should get formatted text output, but not exactly the desired output you want.

If you want exact, pre-interpreted output, then you should check paid OCR SDKs like the OmniPage Capture SDK and the ABBYY FineReader SDK.

Answers 3

That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention, and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
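The clean-up step this answer alludes to could start with something as simple as the following sketch (shown in Python rather than C#, purely as a language-neutral illustration): collapse the hard line breaks inside each paragraph while keeping blank-line paragraph breaks. The blank-line heuristic is an assumption about the extractor's output.

```python
import re


def rejoin_paragraphs(text):
    """Merge wrapped lines into paragraphs separated by blank lines."""
    paragraphs = re.split(r'\n\s*\n', text)
    return '\n\n'.join(' '.join(p.split()) for p in paragraphs if p.strip())
```

For example, rejoin_paragraphs("first line\nsecond line\n\nnext para") collapses the first two lines into one paragraph while preserving the paragraph break.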


Monday, April 23, 2018

Tesseract OCR fails to detect varying font size and letters that are not horizontally aligned


I am trying to detect the text on these price labels, which are always cleanly preprocessed. Although tesseract can easily read the text written above the price, it fails to detect the price values. I am using the Python bindings pytesseract, although it also fails when run from the CLI. Most of the time it tries to recognize the price as one or two characters.

Sample 1:

tesseract D:\tesseract\tesseract_test_images\test.png output 

And the output of the sample image is this.

je Beutel

13

However, if I crop and stretch the price so the characters look separated and are the same font size, the output is just fine.

Processed image(cropped and shrinked price):

je Beutel

1,89

How do get OCR tesseract to work as I intended, as I will be going over a lot of similar images? Edit: Added more price tags:
sample2sample3sample4sample5 sample6 sample7

2 Answers

Answers 1

The problem is that the image you are using is small. When tesseract processes the image, it considers '8', '9' and ',' to be a single letter and thus predicts '3', or it may consider '8' and ',' as one letter and '9' as a different letter, and so produces the wrong output. The image shown below explains it.

detected contours of original(small) image

A simple solution could be increasing its size by factor of 2 or 3 or even more as per the size of your original image and then passing to tesseract so that it detects each letter individually as shown below. (Here I increased its size by factor of 2)

detected contours of resized(larger) image

Below is a simple Python script that will solve your purpose:

import pytesseract
import cv2

img = cv2.imread('dKC6k.png')
img = cv2.resize(img, None, fx=2, fy=2)

data = pytesseract.image_to_string(img)
print(data)

Detected text:

je Beutel

89
1.

Now you can simply extract the required data from the text and format it as per your requirement.

data = data.replace('\n\n', '\n')
data = data.split('\n')

dollars = data[2].strip(',').strip('.')
cents = data[1]

print('{}.{}'.format(dollars, cents))

Desired Format:

1.89 

Answers 2

The problem is that the Tesseract engine was not trained to read this kind of text topology.

You can:

  • train your own model; you'll need in particular to provide images with variations in topology (position of characters). You can actually use the same image and shuffle the positions of the characters.
  • reorganize the image into clusters of text and use tesseract. In particular, I would take the cents part and move it to the right of the comma; in that case you can use tesseract out of the box. A few relevant criteria would be the height of the clusters (to differentiate cents from whole units) and the position of the clusters (read from left to right).

In general, computer vision algorithms (including CNNs) give you tools to obtain a higher-level representation of an image (features or descriptors), but they fail to create the logic or algorithm needed to process intermediate results in a certain way.

In your case that would be:

  • "if the height of those letters are smaller, it's cents",
  • "if the height, and vertical position is the same, it's about the same number, either on left of coma, or on the right of coma".

The thing is that it's difficult to achieve this through training, while at the same time it's extremely simple for a human to write it as an algorithm. Sorry for not giving you an actual implementation; my text is the pseudocode.
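The height criterion from the bullets above can be written as a pure-Python sketch over character bounding boxes (x, y, w, h); the 0.6 ratio is an assumption to tune against real price tags.

```python
def split_by_height(boxes, ratio=0.6):
    """Split boxes into tall (whole units) and short (cents) groups,
    measured relative to the tallest box, each sorted left to right."""
    max_h = max(h for _, _, _, h in boxes)
    tall = sorted((b for b in boxes if b[3] >= ratio * max_h), key=lambda b: b[0])
    short = sorted((b for b in boxes if b[3] < ratio * max_h), key=lambda b: b[0])
    return tall, short
```

The boxes themselves could come from any contour or connected-component step; once split, each group can be cropped and passed to tesseract separately.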

TrainingTesseract2

TrainingTesseract4

Joint Unsupervised Learning of Deep Representations and Image Clusters


Saturday, December 16, 2017

Image processing with ionic1 (apk) versus Chrome (debugger) using ocrad.js


I've developed an application which captures the image and processes it to extract some data. Here is the image:

[image: captured photo containing the text "Lot # 170814"]

When I'm running the code in Chrome debugger, I clearly receive the desired text

Lot # 170814


But when I'm running the same code as an Android application, I receive some gibberish.

Common functions:

function OCRImage(image) {
  var canvas = document.createElement('canvas')
  canvas.width = image.naturalWidth;
  canvas.height = image.naturalHeight;
  canvas.getContext('2d').drawImage(image, 0, 0)
  return OCRAD(canvas)
}

function OCRPath(url, callback) {
  var image = new Image()
  image.src = url;
  image.onload = function () {
    callback(OCRImage(image))
  }
}

JS code for Chrome:

OCRPath('img.png', function (words) {
  alert(words)
})

JS code for Android:

var options = {
  quality: 100,
  destinationType: Camera.DestinationType.FILE_URI,
  sourceType: Camera.PictureSourceType.PHOTOLIBRARY,
  encodingType: Camera.EncodingType.PNG,
  mediaType: Camera.MediaType.PICTURE,
  allowEdit: true,
  correctOrientation: true
}

$cordovaCamera.getPicture(options)
  .then(
    function (imageURI) {
      OCRPath('img.png',
        function (words) {
          alert(words)
        })
    },
    function (err) {
      alert('Error');
    });

What could be the difference? It's literally the same image and the same image processing code. Any suggestions? Maybe another way to do OCR?

1 Answer

Answers 1

literally the same image and same image processing code

It isn't - in the web you're loading 'img.png' into an <img>, then on load copying that into a <canvas> that you then pass to OCRAD.

On Android you're calling $cordovaCamera.getPicture(options) first to get a promise that resolves with imageURI once the user has selected an image.

Once you have that you ignore imageURI and load 'img.png' directly instead.

The results show it has accessed the image, but it may have got a scaled or otherwise lower-resolution version - for instance, Lot # 170814 becomes _l # 1 TO81, but FISSURE DIAMOND seems to be found in the larger title text OK.

Possible solutions:

Switch to base-64 encoded image using destinationType : Camera.DestinationType.DATA_URL

var options = {
  quality: 100,
  destinationType: Camera.DestinationType.DATA_URL,
  sourceType: Camera.PictureSourceType.PHOTOLIBRARY,
  encodingType: Camera.EncodingType.PNG,
  mediaType: Camera.MediaType.PICTURE,
  allowEdit: true,
  correctOrientation: true
}

$cordovaCamera.getPicture(options)
  .then(
    function (imageURI) {
      OCRPath(imageURI, alert, alert);
    });

Alternatively you shouldn't need the slightly roundabout approach to load the <img> then put that into a <canvas> - the only reason you're doing that is to get the bytes of the image to OCRAD. It's the best way to do that on the web, but on Android you already have the file - you should be able to pass it directly to OCRAD.


Sunday, July 9, 2017

How can I use the Keras OCR example?


I found examples/image_ocr.py, which seems to be for OCR. Hence it should be possible to give the model an image and receive text. However, I have no idea how to do so. How do I feed the model a new image? What kind of preprocessing is necessary?

What I did

Installing the dependencies:

  • Install cairocffi: sudo apt-get install python-cairocffi
  • Install editdistance: sudo -H pip install editdistance
  • Change train to return the model and save the trained model.
  • Run the script to train the model.

Now I have a model.h5. What's next?

See https://github.com/MartinThoma/algorithms/tree/master/ML/ocr/keras for my current code. I know how to load the model (see below) and this seems to work. The problem is that I don't know how to feed new scans of images with text to the model.

Related side questions

  • What is CTC? Connectionist Temporal Classification?
  • Are there algorithms which reliably detect the rotation of a document?
  • Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?

What I tried

#!/usr/bin/env python

import os

import keras
import numpy
import scipy.ndimage
from keras import backend as K
from keras.layers import Lambda
from keras.models import load_model
from keras.utils.data_utils import get_file

from image_ocr import ctc_lambda_func, create_model, TextImageGenerator

img_h = 64
img_w = 512
pool_size = 2
words_per_epoch = 16000
val_split = 0.2
val_words = int(words_per_epoch * (val_split))
if K.image_data_format() == 'channels_first':
    input_shape = (1, img_w, img_h)
else:
    input_shape = (img_w, img_h, 1)

fdir = os.path.dirname(get_file('wordlists.tgz',
                                origin='http://www.mythic-ai.com/datasets/wordlists.tgz',
                                untar=True))

img_gen = TextImageGenerator(monogram_file=os.path.join(fdir, 'wordlist_mono_clean.txt'),
                             bigram_file=os.path.join(fdir, 'wordlist_bi_clean.txt'),
                             minibatch_size=32,
                             img_w=img_w,
                             img_h=img_h,
                             downsample_factor=(pool_size ** 2),
                             val_split=words_per_epoch - val_words)
print("Input shape: {}".format(input_shape))
model, _, _ = create_model(input_shape, img_gen, pool_size, img_w, img_h)

model.load_weights("my_model.h5")

x = scipy.ndimage.imread('example.png', mode='L').transpose()
x = x.reshape(x.shape + (1,))

# Does not work
print(model.predict(x))

this gives

2017-07-05 22:07:58.695665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
  File "eval_example.py", line 45, in <module>
    print(model.predict(x))
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1567, in predict
    check_batch_axis=False)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 106, in _standardize_input_data
    'Found: array with shape ' + str(data.shape))
ValueError: The model expects 4 arrays, but only received one array. Found: array with shape (512, 64, 1)

2 Answers

Answers 1

Here, you created a model that needs 4 inputs:

model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out) 

Your predict attempt, on the other hand, is loading just an image.
Hence the message: The model expects 4 arrays, but only received one array

From your code, the necessary inputs are:

input_data = Input(name='the_input', shape=input_shape, dtype='float32')
labels = Input(name='the_labels', shape=[img_gen.absolute_max_string_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')

The original code and your training work because they're using the TextImageGenerator. This generator takes care of giving you the four necessary inputs for the model.

So, what you have to do is to predict using the generator. As you have the fit_generator() method for training with the generator, you also have the predict_generator() method for predicting with the generator.


Now, for a complete answer and solution, I'd have to study your generator and see how it works (which would take me some time). But now you know what is to be done, you can probably figure it out.

You can either use the generator as it is, and predict probably a huge lot of data, or you can try to replicate a generator that will yield just one or a few images with the necessary labels, length and label length.

Or maybe, if possible, just create the 3 remaining arrays manually, but making sure they have the same shapes (except for the first, which is the batch size) as the generator outputs.

The one thing you must assert, though, is: have 4 arrays with the same shapes as the generator outputs, except for the first dimension.
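A sketch of the "create the 3 remaining arrays manually" route, using dummy values; the shapes mirror what TextImageGenerator yields, and the max string length of 16 is an assumption - match it to your generator's setting.

```python
import numpy as np

img_w, img_h, pool_size = 512, 64, 2
absolute_max_string_len = 16                     # assumption

x = np.zeros((img_w, img_h, 1))                  # one preprocessed image
batch = x[np.newaxis, ...]                       # add batch axis -> (1, 512, 64, 1)
labels = np.zeros((1, absolute_max_string_len))  # dummy labels, unused at inference
input_length = np.array([[img_w // (pool_size ** 2)]])
label_length = np.array([[1]])

# model.predict([batch, labels, input_length, label_length])  # 4 arrays, as required
```

With these four arrays the shape check passes; the labels and lengths only matter for the CTC loss, not for the decoded prediction.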

Answers 2

Now I have a model.h5. What's next?

First, I should note that model.h5 contains only the weights of your network; if you wish to save the architecture of your network as well, you should save it as JSON, as in this example:

model_json = model.to_json()
with open("model_arch.json", "w") as json_file:
    json_file.write(model_json)

Now, once you have your model and its weights you can load them on demand by doing the following:

json_file = open('model_arch.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into the new model
# (if you already have a loaded model and don't need to save, start from here)
loaded_model.load_weights("model.h5")

# compile the loaded model with certain specifications
sgd = SGD(lr=0.01)
loaded_model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])

Then, with that loaded_model, you can proceed to predict the classification of a certain input like this:

prediction = loaded_model.predict(some_input, batch_size=20, verbose=0) 

Which will return the classification of that input.

About the Side Questions:

  1. CTC seems to be a term defined in the paper you referred to; quoting from it:

In what follows, we refer to the task of labelling unsegmented data sequences as temporal classification (Kadous, 2002), and to our use of RNNs for this purpose as connectionist temporal classification (CTC).

  2. To compensate for the rotation of a document, images, or similar, you could either generate more data from your current data by applying such transformations (take a look at this blog post that explains a way to do that), or you could use a convolutional neural network approach, which is actually what the Keras example you are using does, as we can see from the repo:

This example uses a convolutional stack followed by a recurrent stack and a CTC logloss function to perform optical character recognition of generated text images.

You can check this tutorial that is related to what you are doing and where they also explain more about Convolutional Neural Networks.

  3. Well, this one is a broad question, but to detect lines you could use the Hough line transform, and Canny edge detection could also be a good option.

Edit: The error you are getting is because predict expects more parameters than just one array; from the Keras docs we can see:

predict(self, x, batch_size=32, verbose=0) 

Raises ValueError: In case of mismatch between the provided input data and the model's expectations, or in case a stateful model receives a number of samples that is not a multiple of the batch size.


Thursday, April 27, 2017

Can tesseract be trained for non-font symbols?


I'm curious about how I may be able to more reliably recognise the value and the suit of playing card images. Here are two examples:

[images: two example playing-card captures]

There may be some noise in the images, but I have a large dataset of images that I could use for training (roughly 10k pngs, including all values & suits).

I can reliably recognise images that I've manually classified, if I have a known exact-match using a hashing method. But since I'm hashing images based on their content, then the slightest noise changes the hash and results in an image being treated as unknown. This is what I'm looking to reliably address with further automation.

I've been reviewing the 3.05 documentation on training tesseract: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#automated-method

Can tesseract only be trained with images found in fonts? Or could I use it to recognise the suits for these cards?

I was hoping that I could say that all images in this folder correspond to 4c (e.g. the example images above), and that tesseract would see the similarity in any future instances of that image (regardless of noise) and also read that as 4c. Is this possible? Does anyone here have experience with this?

1 Answer

Answers 1

This has been my non-tesseract solution, until someone proves there's a better way. I set up Caffe and DIGITS.

Getting these up and running was the hardest part. Next, I used my dataset to train a new Caffe network. I prepared my dataset into a single-depth folder structure:

./card
./card/2c ./card/2d ./card/2h ./card/2s
./card/3c ./card/3d ./card/3h ./card/3s
./card/4c ./card/4d ./card/4h ./card/4s
./card/5c ./card/5d ./card/5h ./card/5s
./card/6c ./card/6d ./card/6h ./card/6s
./card/7c ./card/7d ./card/7h ./card/7s
./card/8c ./card/8d ./card/8h ./card/8s
./card/9c ./card/9d ./card/9h ./card/9s
./card/_noise ./card/_table
./card/Ac ./card/Ad ./card/Ah ./card/As
./card/Jc ./card/Jd ./card/Jh ./card/Js
./card/Kc ./card/Kd ./card/Kh ./card/Ks
./card/Qc ./card/Qd ./card/Qh ./card/Qs
./card/Tc ./card/Td ./card/Th ./card/Ts

Within DIGITS, I chose:

  1. Datasets tab
  2. New Dataset Images
  3. Classification
  4. I pointed it to my card folder, e.g: /path/to/card
  5. I set the validation % to 13.0%, based on the discussion here: http://stackoverflow.com/a/13612921/880837
  6. After creating the dataset, I opened the models tab
  7. Chose my new dataset.
  8. Chose the GoogLeNet under Standard Networks, and left it to train.
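The 13% validation split of step 5 amounts to something like this plain-Python sketch (DIGITS does this internally; shown only to make the step concrete):

```python
import random


def split_dataset(paths, val_fraction=0.13, seed=0):
    """Shuffle file paths and carve off a validation slice."""
    rng = random.Random(seed)
    shuffled = list(paths)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]    # (train, validation)
```

Shuffling before splitting matters so that the validation set samples every class folder rather than just the last few.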

I did this several times, each time I had new images in the dataset. Each learning session took 6-10 hours, but at this stage I can use my caffemodel to programmatically estimate what each image is expected to be, using this logic: https://github.com/BVLC/caffe/blob/master/examples/cpp_classification/classification.cpp

The results are either a card (2c, 7h, etc.), noise, or table. Any estimate with a confidence higher than 90% is most likely correct. The latest run correctly recognised 300 out of 400 images, with only 3 mistakes. I'm adding new images to the dataset and retraining the existing model, further tuning the resulting accuracy. Hope this is valuable to others!

While I only wanted to give the high-level steps here, this was all done with large thanks to David Humphrey and his GitHub post; I really recommend reading it and trying it out if you're interested in learning more: https://github.com/humphd/have-fun-with-machine-learning


Friday, October 7, 2016

tesseract didn't get the little labels


I've installed tesseract on my linux environment.

It works when I execute something like

# tesseract myPic.jpg /output 

But my pic has some little labels, and tesseract doesn't see them.

Is there an option available to set a pitch or something like that?

Example of text labels:

[image 1: board with small text labels]

With this pic, tesseract doesn't recognize any value...

But with this pic:

[image 2: board with small text labels]

I have the following output:

J8  J7A-J7B P7 \  2 40 50 0 180 190  200  P1 P2 7  110 110 \ l 

For example, in this case, the 90 (at top left) is not seen by tesseract...

I think it's just an option to define or something like that, no?

Thx

1 Answer

Answers 1

In order to get accurate results from Tesseract (as well as any OCR engine) you will need to follow some guidelines as can be seen in my answer on this post: Junk results when using Tesseract OCR and tess-two

Here is the gist of it:

  • Use a high-resolution image (if needed); 300 DPI is the minimum

  • Make sure there are no shadows or bends in the image

  • If there is any skew, you will need to fix the image in code prior to OCR

  • Use a dictionary to help get good results

  • Adjust the text size (12 pt font is ideal)

  • Binarize the image and use image processing algorithms to remove noise
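The resize and binarize guidelines above could be sketched in Python with Pillow as follows; the scale factor and threshold are assumptions to tune per image.

```python
from PIL import Image


def preprocess_for_ocr(path, scale=3, thresh=128):
    """Upscale small text and binarize the image before running OCR."""
    img = Image.open(path).convert('L')                    # grayscale
    w, h = img.size
    img = img.resize((w * scale, h * scale), Image.LANCZOS)
    return img.point(lambda p: 255 if p > thresh else 0, mode='1')
```

The returned 1-bit image can then be saved to PNG and passed to tesseract in place of the original photo.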

It is also recommended to spend some time training the OCR engine to receive better results as seen in this link: Training Tesseract

I took the 2 images that you shared and ran some image processing on them using the LEADTOOLS SDK (disclaimer: I am an employee of this company) and was able to get better results than you were getting with the processed images, but since the original images aren't the greatest - it still was not 100%. Here is the code I used to try and fix the images:

// initialize the codecs class
using (RasterCodecs codecs = new RasterCodecs())
{
   // load the file
   using (RasterImage img = codecs.Load(filename))
   {
      // Run the image processing sequence, starting by resizing the image
      double newWidth = (img.Width / (double)img.XResolution) * 300;
      double newHeight = (img.Height / (double)img.YResolution) * 300;
      SizeCommand sizeCommand = new SizeCommand((int)newWidth, (int)newHeight, RasterSizeFlags.Resample);
      sizeCommand.Run(img);

      // binarize the image
      AutoBinarizeCommand autoBinarize = new AutoBinarizeCommand();
      autoBinarize.Run(img);

      // change it to 1BPP
      ColorResolutionCommand colorResolution = new ColorResolutionCommand();
      colorResolution.BitsPerPixel = 1;
      colorResolution.Run(img);

      // save the image as PNG
      codecs.Save(img, outputFile, RasterImageFormat.Png, 0);
   }
}

Here are the output images from this process:

image1 processed image2 processed


Tuesday, March 8, 2016

Why am I getting a “tiff page 1 not found” Leptonica warning in Tesseract?


I just started using Tesseract.

I am following the instructions described here.

I have created a test image like this:

training/text2image --text=test.txt --outputbase=eng.Arial.exp0 --font='Arial' --fonts_dir=/usr/share/fonts 

Now I want to train the Tesseract like follows:

tesseract eng.Arial.exp0.tif eng.Arial.exp0 box.train 

Here is the output that I have:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:     112
   Found 112 good blobs.
Generated training data for 21 words
Warning in pixReadMemTiff: tiff page 1 not found

This prevents the creation of the fontfile.tr file. I have tried continuing by ignoring the warning, but when creating the character sets I get awful content:

unicharset_extractor lang.fontname.exp0.box

58
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # Broken
T 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # T [54 ]
h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # h [68 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ]
( 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # ( [28 ]
q 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # q [71 ]
u 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # u [75 ]
...

Here is the version I am using:

tesseract 3.04.00
 leptonica-1.72
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

Any idea why this happens?

Thanks!

0 Answers
