Showing posts with label tesseract.

Monday, June 4, 2018

Tesseract does not recognize German “für”


I use Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.

I use the option -l=deu to give Tesseract the hint that the text is in German ("Deutsch").

Still, the result for the German word "für" is not good. The word is very common (it means "for" in English).

Tesseract often detects "fiir" or "fur".

What can I do to improve this?

Reproducible example:

docker run --name self.container_name --rm \
    --volume $PWD:/pwd \
    tesseractshadow/tesseract4re \
    tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu

Result:

cat die-fuer-das.png.ocr-result.txt
die fur das

Image die_fuer_das.png:


2 Answers

Answer 1

I found the solution. It should be -l deu; otherwise the German language data does not get used:

Works:

===> tesseract die-fuer-das.png out -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das

Wrong language:

===> tesseract die-fuer-das.png out -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das

Answer 2

This is more a comment than a direct answer to your question. Here is another data point: if I run your image through the OCR.space API, it gets it perfectly right:

****** Result for Image/Page 1 ******
die für das
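For reference, a minimal sketch of calling that API from Python with the requests library; the endpoint, the "helloworld" demo key, and the "ger" language code are taken from the public API documentation and may have changed:

import requests

# Hedged sketch: POST the image to the OCR.space parse endpoint.
with open("die-fuer-das.png", "rb") as f:
    resp = requests.post(
        "https://api.ocr.space/parse/image",
        files={"file": f},
        data={"apikey": "helloworld", "language": "ger"},  # demo key, German
    )

# The recognized text is under ParsedResults in the JSON response.
print(resp.json()["ParsedResults"][0]["ParsedText"])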

In the past, upscaling to 300 DPI often improved Tesseract results, but I am surprised that this should still be needed in version 4.
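If upscaling does turn out to help, a minimal sketch with Pillow and pytesseract (both assumed installed; the 3x factor is a guess) would be:

import pytesseract
from PIL import Image

# Upscale the small source image before OCR; LANCZOS keeps edges smooth.
img = Image.open("die-fuer-das.png")
img = img.resize((img.width * 3, img.height * 3), Image.LANCZOS)

# Note: the language is passed as lang="deu", not "-l=deu" (see Answer 1).
print(pytesseract.image_to_string(img, lang="deu"))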


Wednesday, May 23, 2018

tesseract api with load_system_dawg and load_freq_dawg


How do I set load_system_dawg and load_freq_dawg to false?

I need to disable the dictionary, so I guess these are the two parameters I need to set to false?

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
if (api->Init(NULL, "dan+eng")) {
    // error
}
api->SetImage(image);
api->Recognize(0);

tesseract 3.05.01

1 Answer

Answer 1

  1. In your tessdata directory, create a configs directory.
  2. Create a config file (you will pass the name of the config file later in code).
  3. Fill your config file with the following text:

load_system_dawg     F
load_freq_dawg       F

  4. Modify your code:

auto numOfConfigs = 1;
auto **configs = new char *[numOfConfigs];
configs[0] = (char *) "name of your config file";

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
if (api->Init(NULL, "dan+eng", tesseract::OEM_DEFAULT,
              configs, numOfConfigs, nullptr, nullptr, false)) {
    // error
}

P.S. It is also possible to do this with the last couple of arguments of the Init function; feel free to try them out yourself.
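For comparison, a sketch of the same idea from Python via pytesseract (my own suggestion, not the C++ API from the question); the -c flags set the same parameters the config file does:

import pytesseract
from PIL import Image

img = Image.open("sample.png")  # hypothetical input image

# Disable the system and frequent-word dictionaries at init time.
text = pytesseract.image_to_string(
    img,
    lang="dan+eng",
    config="-c load_system_dawg=F -c load_freq_dawg=F",
)
print(text)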


Monday, April 23, 2018

Tesseract OCR fails to detect varying font size and letters that are not horizontally aligned


I am trying to detect the text on these price labels, which is always clearly preprocessed. Although Tesseract can easily read the text written above the price, it fails to detect the price values. I am using the Python bindings (pytesseract), although it also fails when reading from the CLI. Most of the time it tries to recognize the price part as one or two characters.

Sample 1:

tesseract D:\tesseract\tesseract_test_images\test.png output 

And the output for the sample image is this:

je Beutel

13

However, if I crop and stretch the price so the digits look separated and are the same font size, the output is just fine.

Processed image (cropped and shrunk price):

je Beutel

1,89

How do I get Tesseract OCR to work as intended, given that I will be going over a lot of similar images?
Edit: added more price tags: sample2 sample3 sample4 sample5 sample6 sample7

2 Answers

Answer 1

The problem is that the image you are using is small. When Tesseract processes it, it considers '8', '9' and ',' to be a single letter and thus predicts '3', or it may consider '8' and ',' as one letter and '9' as a different letter, and so produces the wrong output. The image shown below illustrates this.

detected contours of original(small) image

A simple solution is to increase the image size by a factor of 2 or 3, or even more depending on the size of your original image, before passing it to Tesseract, so that it detects each letter individually, as shown below. (Here I increased the size by a factor of 2.)

detected contours of resized(larger) image

Below is a simple Python script that will solve your purpose:

import pytesseract
import cv2

img = cv2.imread('dKC6k.png')
img = cv2.resize(img, None, fx=2, fy=2)

data = pytesseract.image_to_string(img)
print(data)

Detected text:

je Beutel

89
1.

Now you can simply extract the required data from the text and format it as per your requirement.

data = data.replace('\n\n', '\n')
data = data.split('\n')

dollars = data[2].strip(',').strip('.')
cents = data[1]

print('{}.{}'.format(dollars, cents))

Desired Format:

1.89 

Answer 2

The problem is that the Tesseract engine was not trained to read this kind of text topology.

You can:

  • train your own model; in particular, you'll need to provide images with variations of topology (position of characters). You can actually use the same image and shuffle the positions of the characters.
  • reorganize the image into clusters of text and use Tesseract. In particular, I would take the cents part and move it to the right of the comma; in that case you can use Tesseract out of the box. A few relevant criteria would be the height of the clusters (to differentiate cents from integers) and the position of the clusters (read from left to right).

In general, computer vision algorithms (including CNNs) give you tools to obtain a higher-level representation of an image (features or descriptors), but they fail to create the logic or algorithm needed to process intermediate results in a certain way.

In your case that would be:

  • "if the height of those letters are smaller, it's cents",
  • "if the height, and vertical position is the same, it's about the same number, either on left of coma, or on the right of coma".

The thing is that it's difficult to achieve this through training, while at the same time it's extremely simple for a human to write it as an algorithm. Sorry for not giving you an actual implementation; my text is the pseudo code, and a rough sketch of it follows below.
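As a starting point, here is a rough sketch of that pseudo code using OpenCV (my own choice of library; the 0.7 height threshold is a guess), splitting glyphs into integer and cent groups by contour height:

import cv2

# Hypothetical file name; binarize so glyphs become white blobs.
img = cv2.imread("price.png", cv2.IMREAD_GRAYSCALE)
_, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# OpenCV 4 signature; OpenCV 3 returns an extra leading value.
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]  # (x, y, w, h)

# "If the height of these letters is smaller, it's cents."
max_h = max(h for (_, _, _, h) in boxes)
integers = sorted((b for b in boxes if b[3] > 0.7 * max_h), key=lambda b: b[0])
cents = sorted((b for b in boxes if b[3] <= 0.7 * max_h), key=lambda b: b[0])

# Each group can now be cropped and passed to Tesseract left to right.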

TrainingTesseract2

TrainingTesseract4

Joint Unsupervised Learning of Deep Representations and Image Clusters


Thursday, April 27, 2017

Can tesseract be trained for non-font symbols?


I'm curious about how I may be able to more reliably recognise the value and the suit of playing card images. Here are two examples:


There may be some noise in the images, but I have a large dataset of images that I could use for training (roughly 10k pngs, including all values & suits).

I can reliably recognise images that I've manually classified, if I have a known exact match, using a hashing method. But since I'm hashing images based on their content, the slightest noise changes the hash, and the image is treated as unknown. This is what I'm looking to reliably address with further automation.

I've been reviewing the 3.05 documentation on training tesseract: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#automated-method

Can tesseract only be trained with images found in fonts? Or could I use it to recognise the suits of these cards?

I was hoping that I could say that all images in this folder correspond to 4c (e.g. the example images above), and that tesseract would see the similarity in any future instances of that image (regardless of noise) and also read that as 4c. Is this possible? Does anyone here have experience with this?

1 Answer

Answer 1

This has been my non-Tesseract solution to this, until someone proves there's a better way. I set up Caffe and NVIDIA DIGITS.

Getting these running was the hardest part. Next, I used my dataset to train a new Caffe network. I organized my dataset into a single-depth folder structure:

./card
./card/2c ./card/2d ./card/2h ./card/2s
./card/3c ./card/3d ./card/3h ./card/3s
./card/4c ./card/4d ./card/4h ./card/4s
./card/5c ./card/5d ./card/5h ./card/5s
./card/6c ./card/6d ./card/6h ./card/6s
./card/7c ./card/7d ./card/7h ./card/7s
./card/8c ./card/8d ./card/8h ./card/8s
./card/9c ./card/9d ./card/9h ./card/9s
./card/_noise ./card/_table
./card/Ac ./card/Ad ./card/Ah ./card/As
./card/Jc ./card/Jd ./card/Jh ./card/Js
./card/Kc ./card/Kd ./card/Kh ./card/Ks
./card/Qc ./card/Qd ./card/Qh ./card/Qs
./card/Tc ./card/Td ./card/Th ./card/Ts

Within DIGITS, I chose:

  1. Datasets tab
  2. New Dataset Images
  3. Classification
  4. I pointed it to my card folder, e.g: /path/to/card
  5. I set the validation % to 13.0%, based on the discussion here: http://stackoverflow.com/a/13612921/880837
  6. After creating the dataset, I opened the models tab
  7. Chose my new dataset.
  8. Chose the GoogLeNet under Standard Networks, and left it to train.

I did this several times; each time I had new images in the dataset. Each learning session took 6-10 hours, but at this stage I can use my caffemodel to programmatically estimate what each image is expected to be, using this logic: https://github.com/BVLC/caffe/blob/master/examples/cpp_classification/classification.cpp
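For the same step from Python instead of the C++ example, a hedged sketch using Caffe's Classifier wrapper might look like this (the file names are hypothetical, and the preprocessing arguments follow the BVLC examples):

import caffe

# Hypothetical deploy definition and trained weights from the DIGITS run.
net = caffe.Classifier(
    "deploy.prototxt",
    "card_googlenet.caffemodel",
    image_dims=(256, 256),
    raw_scale=255,            # images load as [0,1]; Caffe expects [0,255]
    channel_swap=(2, 1, 0),   # RGB -> BGR, as the reference models expect
)

img = caffe.io.load_image("unknown_card.png")
probs = net.predict([img])[0]
print("class index:", probs.argmax(), "confidence:", probs.max())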

The results are either a card (2c, 7h, etc.), noise, or table. Any estimate with a confidence greater than 90% is most likely correct. The latest run correctly recognised 300 out of 400 images, with only 3 mistakes. I'm adding new images to the dataset and retraining the existing model, further tuning the accuracy. I hope this is valuable to others!

While I only wanted to give the high-level steps here, this was all done with large thanks to David Humphrey and his GitHub post; I really recommend reading it and trying it out if you're interested in learning more: https://github.com/humphd/have-fun-with-machine-learning


Sunday, April 9, 2017

Compile/bundle tesseract into one binary


Is it possible to compile tesseract into one binary?

I use the following to compile a program, but how is it possible to compile the Tesseract shared libraries into one binary, so the program is 100% portable and you don't need Tesseract to be installed on the target system?

It's not necessary to compile Leptonica into the binary.

g++ -std=c++11 txtocr.cpp -o txtocr -llept -ltesseract 

3 Answers

Answer 1

For that you need to use a static library; on Unix systems they usually end with the .a extension, while a shared library ends with .so.

If you only have the .so (or .dylib on Mac, .dll on Windows) library of Tesseract, then you cannot compile it into a single binary.

Answer 2

The link below will help you more, whether you are going to compile from scratch or use libraries which are already compiled for the desired OS.

Answer 3

Use the -static argument to g++ to compile a static binary.


Friday, October 7, 2016

tesseract didn't get the little labels


I've installed Tesseract in my Linux environment.

It works when I execute something like

# tesseract myPic.jpg /output 

But my pic has some little labels, and Tesseract didn't see them.

Is there an option available to set a pitch or something like that?

Example of text labels:


With this pic, tesseract doesn't recognize any value...

But with this pic:


I have the following output:

J8  J7A-J7B P7 \  2 40 50 0 180 190  200  P1 P2 7  110 110 \ l 

For example, in this case, the 90 (on top left) is not seen by tesseract...

I think it's just an option to define or something like that, no?

Thanks

1 Answer

Answer 1

In order to get accurate results from Tesseract (as well as any OCR engine) you will need to follow some guidelines as can be seen in my answer on this post: Junk results when using Tesseract OCR and tess-two

Here is the gist of it:

  • Use a high resolution image (if needed); 300 DPI is the minimum

  • Make sure there are no shadows or bends in the image

  • If there is any skew, you will need to fix the image in code prior to OCR

  • Use a dictionary to help get good results

  • Adjust the text size (12 pt font is ideal)

  • Binarize the image and use image processing algorithms to remove noise
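As a rough open-source sketch of those steps (using OpenCV rather than a commercial SDK, and with guessed parameters), you could do something like:

import cv2

# Hypothetical input; load as grayscale.
img = cv2.imread("labels.jpg", cv2.IMREAD_GRAYSCALE)

# Upscale toward the 300 DPI guideline (factor 2 is a guess).
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

# Remove salt-and-pepper noise, then binarize with Otsu's threshold.
img = cv2.medianBlur(img, 3)
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("labels-processed.png", img)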

It is also recommended to spend some time training the OCR engine to receive better results as seen in this link: Training Tesseract

I took the 2 images that you shared and ran some image processing on them using the LEADTOOLS SDK (disclaimer: I am an employee of this company) and was able to get better results with the processed images than you were getting, but since the original images aren't the greatest, it still was not 100%. Here is the code I used to try and fix the images:

//initialize the codecs class
using (RasterCodecs codecs = new RasterCodecs())
{
   //load the file
   using (RasterImage img = codecs.Load(filename))
   {
      //Run the image processing sequence starting by resizing the image
      double newWidth = (img.Width / (double)img.XResolution) * 300;
      double newHeight = (img.Height / (double)img.YResolution) * 300;
      SizeCommand sizeCommand = new SizeCommand((int)newWidth, (int)newHeight, RasterSizeFlags.Resample);
      sizeCommand.Run(img);

      //binarize the image
      AutoBinarizeCommand autoBinarize = new AutoBinarizeCommand();
      autoBinarize.Run(img);

      //change it to 1BPP
      ColorResolutionCommand colorResolution = new ColorResolutionCommand();
      colorResolution.BitsPerPixel = 1;
      colorResolution.Run(img);

      //save the image as PNG
      codecs.Save(img, outputFile, RasterImageFormat.Png, 0);
   }
}

Here are the output images from this process:

image1 processed image2 processed


Sunday, June 19, 2016

How to read information from bank cheque using Tesseract?


I have a sample cheque. I am trying to read the following:

a) Branch Name (i.e. Salwa Branch)

b) Doha on (i.e. 1/7/2016)

c) Pay against this cheque to/order

d) Riyals

e) QR

f) Cheque No.

I am using Tesseract. What extra steps do I need to take to get the relevant info, since I am not able to extract the information properly?

Or is there any other OCR SDK specific to this purpose? Please help.

Thanks in advance.


2 Answers

Answer 1

Anything that is handwritten will not be recognized in any reliable way. You only have a chance with known or similar fonts. In your case I would analyse the complete image/TIFF and then go through all the blocks that Tesseract creates.

Answer 2

This is easy with Tesseract.

Use this .NET wrapper; it works very well:

https://www.nuget.org/packages/Tesseract/

Examples can be found on its project page.

In some cases you have to train the fonts; you can read how here:

http://www.joyofdata.de/blog/a-guide-on-ocr-with-tesseract-3-03/

For the numbers and symbols (middle bottom) there is a trained font for Tesseract available, which I found via Google.


Tuesday, March 8, 2016

Why am I getting a “tiff page 1 not found” Leptonica warning in Tesseract?


I just started using Tesseract.

I am following the instructions described here.

I have created a test image like this:

training/text2image --text=test.txt --outputbase=eng.Arial.exp0 --font='Arial' --fonts_dir=/usr/share/fonts 

Now I want to train Tesseract as follows:

tesseract eng.Arial.exp0.tif eng.Arial.exp0 box.train 

Here is the output that I have:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
APPLY_BOXES:
   Boxes read from boxfile:     112
   Found 112 good blobs.
Generated training data for 21 words
Warning in pixReadMemTiff: tiff page 1 not found

This prevents the creation of the fontfile.tr file. I have tried continuing by ignoring the warning, but when creating the character sets I get awful content:

unicharset_extractor lang.fontname.exp0.box

58
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # Broken
T 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # T [54 ]
h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # h [68 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ]
( 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # ( [28 ]
q 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # q [71 ]
u 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # u [75 ]
...

Here is the version I am using:

tesseract 3.04.00
 leptonica-1.72
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

Any idea why this happens?

Thanks!

0 Answers
