I am using Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.
I pass the option -l=deu
to tell Tesseract that the text is in German ("Deutsch").
The result for the German word "für" is still poor, even though it is a very common word (it means "for" in English).
Tesseract often detects "fiir" or "fur" instead.
What can I do to improve this?
Reproducible example:
docker run --name self.container_name --rm \
    --volume $PWD:/pwd \
    tesseractshadow/tesseract4re \
    tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu
Result:
cat die-fuer-das.png.ocr-result.txt
die fur das
Image die_fuer_das.png:
2 Answers
Answer 1
I found the solution. It should be -l deu (a space, not an equals sign);
otherwise German is not used as the recognition language:
Works:
===> tesseract die-fuer-das.png out -l deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die für das
Wrong language:
===> tesseract die-fuer-das.png out -l=deu; cat out.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
die fur das
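To avoid repeating this mistake in scripts, the command line can be assembled in one place. The following is only a minimal sketch: build_ocr_cmd is a hypothetical helper (not part of Tesseract) that prints the command with the language as a separate argument after -l and refuses a language value that is empty or starts with "=".

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): prints the tesseract command
# line with the language passed as a separate argument after -l.
build_ocr_cmd() {
    image=$1
    outbase=$2
    lang=$3
    # Guard against the "-l=deu" style mistake: an empty value or a value
    # starting with "=" is rejected instead of being passed on.
    case $lang in
        ''|=*)
            echo "invalid language code: '$lang'" >&2
            return 1
            ;;
    esac
    printf 'tesseract %s %s -l %s\n' "$image" "$outbase" "$lang"
}

# Usage: print the command for the image from the question.
# prints: tesseract die-fuer-das.png out -l deu
build_ocr_cmd die-fuer-das.png out deu
```

The helper only prints the command; to run it directly, the printed line can be copied into the shell or the printf replaced by a real tesseract call.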
Answer 2
This is more of a comment than a direct answer to your question. Here is another data point: if I pass the link to your image to the OCR.space API, it gets the text exactly right:
****** Result for Image/Page 1 ******
die für das
In the past, upscaling to 300 dpi often improved Tesseract results, but I am surprised that this would still be needed in version 4.
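If you want to try the upscaling idea anyway, the resize factor can be derived from the image's current resolution. This is a rough sketch under assumptions: scale_percent is a hypothetical helper, the 96 dpi source resolution is made up for illustration, and the commented usage assumes ImageMagick's convert is installed.

```shell
#!/bin/sh
# Hypothetical helper: integer resize percentage needed to bring an image
# with a known DPI up to a target DPI (default 300).
scale_percent() {
    current_dpi=$1
    target_dpi=${2:-300}
    # Integer arithmetic; any fractional part is truncated.
    echo $(( target_dpi * 100 / current_dpi ))
}

# Usage (assumes ImageMagick is installed and the source image is 96 dpi):
#   convert die-fuer-das.png -resize "$(scale_percent 96)%" die-fuer-das-300dpi.png
#   tesseract die-fuer-das-300dpi.png out -l deu
scale_percent 96
```

Whether upscaling helps at all depends on the image; in this thread the wrong -l syntax was the actual cause.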