Monday, June 4, 2018

Tesseract does not recognize German “für”


I use Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.

I pass the option -l=deu to tell Tesseract that the text is in German ("Deutsch").

Still, the result for the German word "für" is poor. The word is very common (it means "for" in English).

Tesseract often detects "fiir" or "fur" instead.

What can I do to improve this?

Reproducible example:

docker run --name self.container_name --rm \
    --volume $PWD:/pwd \
    tesseractshadow/tesseract4re \
    tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu

Result:

cat die-fuer-das.png.ocr-result.txt
die fur das

Image die-fuer-das.png: [image]

2 Answers

Answer 1

I found the solution. It should be -l deu (with a space, not an equals sign); otherwise the German language data is not used:

Works:

===> tesseract die-fuer-das.png out -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das

Wrong language:

===> tesseract die-fuer-das.png out -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das
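
Because the -l=deu form runs without any error message, the mistake is easy to miss in scripts. As a minimal sketch, a small shell helper could rewrite -l=LANG into the two separate arguments -l LANG before the command is run. The function name normalize_lang_args is hypothetical and not part of Tesseract; the helper only rewrites arguments and does not call tesseract itself.

```shell
# Hypothetical helper (not part of Tesseract): print each argument on its
# own line, splitting "-l=LANG" into the two arguments "-l" and "LANG".
normalize_lang_args() {
  for arg in "$@"; do
    case "$arg" in
      -l=*)
        # Strip the leading "-l=" and emit flag and value separately.
        printf '%s\n' "-l" "${arg#-l=}"
        ;;
      *)
        # Every other argument is passed through unchanged.
        printf '%s\n' "$arg"
        ;;
    esac
  done
}

# Possible usage in bash (assumes tesseract is installed):
#   mapfile -t args < <(normalize_lang_args die-fuer-das.png out -l=deu)
#   tesseract "${args[@]}"
```

With the arguments from the question, the helper emits die-fuer-das.png, out, -l and deu as four separate arguments, which matches the working call above.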

Answer 2

This is more of a comment than a direct answer to your question, but here is another data point: when I pass the link to your image to the OCR.space API, it gets the text exactly right:

****** Result for Image/Page 1 ******
die für das

In the past, upscaling to 300 dpi often improved Tesseract results, but I am surprised that this would still be needed in version 4.
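
If you want to try the upscaling idea, the scale factor follows from the ratio of the target resolution to the current one. The sketch below only does that arithmetic; the function name scale_percent_for_dpi is hypothetical, and the current resolution of the image (for example 96 dpi for a screenshot) is an assumption you would need to check yourself.

```shell
# Hypothetical helper: compute the resize percentage needed to bring an
# image from its current resolution to a target resolution (both in dpi).
# Uses integer arithmetic and rounds to the nearest whole percent.
scale_percent_for_dpi() {
  current_dpi=$1
  target_dpi=$2
  echo $(( (target_dpi * 100 + current_dpi / 2) / current_dpi ))
}

# Possible usage (assumes ImageMagick is installed and the source image
# really is 96 dpi):
#   pct=$(scale_percent_for_dpi 96 300)
#   convert die-fuer-das.png -resize "${pct}%" die-fuer-das-300dpi.png
```

Whether this helps with Tesseract 4 on this particular image is not confirmed here; in this thread the wrong language flag was the actual cause.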
