Monday, June 4, 2018

Tesseract does not recognize German “für”

By Hường Hana 6:30 AM ocr, tesseract

I use Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.

I pass the option -l=deu to tell Tesseract that the text is in German ("deutsch").

Still, the result for the German word "für" is poor. The word is very common (it means "for" in English).

Tesseract often detects "fiir" or "fur" instead.

What can I do to improve this?

Reproducible example:

docker run --name self.container_name --rm \
    --volume $PWD:/pwd \
    tesseractshadow/tesseract4re \
    tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu

Result:

cat die-fuer-das.png.ocr-result.txt
die fur das

Image die_fuer_das.png:

2 Answers

Answer 1

I found the solution. It should be -l deu (with a space, not an equals sign); otherwise the German language is not used:

Works:

===> tesseract die-fuer-das.png out -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das

Wrong language:

===> tesseract die-fuer-das.png out -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das
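If the language value comes from a script or a config variable, one way to avoid this slip is to normalise it before calling tesseract. The following is only a sketch: normalize_lang is a hypothetical helper of my own, not part of Tesseract, and it assumes the value arrives as "deu", "=deu" or "-l=deu".

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): reduce "-l=deu", "=deu"
# or "deu" to the bare language code, so the caller can always pass it
# as a separate argument after -l.
normalize_lang() {
    nl_value=$1
    nl_value=${nl_value#-l}   # drop a leading "-l", if present
    nl_value=${nl_value#=}    # drop a leading "=", if present
    printf '%s\n' "$nl_value"
}
```

It could then be used as tesseract die-fuer-das.png out -l "$(normalize_lang "-l=deu")", which passes -l and deu as two separate arguments. Running tesseract --list-langs first shows whether deu is installed at all.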

Answer 2

This is more a comment than a direct answer to your question, but here is another data point: if I feed the link to your image into the OCR.space API, it gets it exactly right:

****** Result for Image/Page 1 ******
die für das

In the past, upscaling to 300 dpi often improved Tesseract results, but I would be surprised if this were still needed in version 4.
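For anyone who still wants to try the upscaling route, here is a rough sketch of how the resize factor could be worked out. scale_percent is a hypothetical helper, and the 72 dpi used in the usage note is an assumed value, not the measured resolution of the image in the question.

```shell
#!/bin/sh
# Hypothetical helper: percentage by which an image must be resized to
# go from its current resolution to a target resolution (both in dpi).
# Integer arithmetic, rounded up so the result never undershoots.
scale_percent() {
    sp_current=$1
    sp_target=$2
    echo $(( (sp_target * 100 + sp_current - 1) / sp_current ))
}
```

Assuming ImageMagick is installed, the result could be applied with convert die-fuer-das.png -resize "$(scale_percent 72 300)%" upscaled.png before running tesseract on upscaled.png.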
