I just started using Tesseract.
I am following the instructions described here.
I have created a test image like this:
    training/text2image --text=test.txt --outputbase=eng.Arial.exp0 --font='Arial' --fonts_dir=/usr/share/fonts

Now I want to train Tesseract as follows:
    tesseract eng.Arial.exp0.tif eng.Arial.exp0 box.train

Here is the output I get:
    Tesseract Open Source OCR Engine v3.04.00 with Leptonica
    Page 1
    APPLY_BOXES:
    Boxes read from boxfile: 112
    Found 112 good blobs.
    Generated training data for 21 words
    Warning in pixReadMemTiff: tiff page 1 not found

This prevents the fontfile.tr file from being created. I tried ignoring the warning and continuing, but when I generate the character set I get awful content:
    unicharset_extractor lang.fontname.exp0.box

    58
    NULL 0 NULL 0
    Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
    |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
    T 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # T [54 ]
    h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # h [68 ]
    e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
    ( 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # ( [28 ]
    q 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # q [71 ]
    u 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # u [75 ]
    ...

Here is the version I am using:
    tesseract 3.04.00
    leptonica-1.72
    libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

Any idea why this happens?
Thanks!