I found examples/image_ocr.py
which seems to for OCR. Hence it should be possible to give the model an image and receive text. However, I have no idea how to do so. How do I feed the model with a new image? Which kind of preprocessing is necessary?
What I did
Installing the depencencies:
- Install
cairocffi
:sudo apt-get install python-cairocffi
- Install
editdistance
:sudo -H pip install editdistance
- Change
train
to return the model and save the trained model. - Run the script to train the model.
Now I have a model.h5
. What's next?
See https://github.com/MartinThoma/algorithms/tree/master/ML/ocr/keras for my current code. I know how to load the model (see below) and this seems to work. The problem is that I don't know how to feed new scans of images with text to the model.
Related side questions
- What is CTC? Connectionist Temporal Classification?
- Are there algorithms which reliably detect the rotation of a document?
- Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?
What I tried
#!/usr/bin/env python from keras import backend as K import keras from keras.models import load_model import os from image_ocr import ctc_lambda_func, create_model, TextImageGenerator from keras.layers import Lambda from keras.utils.data_utils import get_file import scipy.ndimage import numpy img_h = 64 img_w = 512 pool_size = 2 words_per_epoch = 16000 val_split = 0.2 val_words = int(words_per_epoch * (val_split)) if K.image_data_format() == 'channels_first': input_shape = (1, img_w, img_h) else: input_shape = (img_w, img_h, 1) fdir = os.path.dirname(get_file('wordlists.tgz', origin='http://www.mythic-ai.com/datasets/wordlists.tgz', untar=True)) img_gen = TextImageGenerator(monogram_file=os.path.join(fdir, 'wordlist_mono_clean.txt'), bigram_file=os.path.join(fdir, 'wordlist_bi_clean.txt'), minibatch_size=32, img_w=img_w, img_h=img_h, downsample_factor=(pool_size ** 2), val_split=words_per_epoch - val_words ) print("Input shape: {}".format(input_shape)) model, _, _ = create_model(input_shape, img_gen, pool_size, img_w, img_h) model.load_weights("my_model.h5") x = scipy.ndimage.imread('example.png', mode='L').transpose() x = x.reshape(x.shape + (1,)) # Does not work print(model.predict(x))
this gives
2017-07-05 22:07:58.695665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:01:00.0) Traceback (most recent call last): File "eval_example.py", line 45, in <module> print(model.predict(x)) File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1567, in predict check_batch_axis=False) File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 106, in _standardize_input_data 'Found: array with shape ' + str(data.shape)) ValueError: The model expects 4 arrays, but only received one array. Found: array with shape (512, 64, 1)
2 Answers
Answers 1
Here, you created a model that needs 4 inputs:
model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)
Your predict attempt, on the other hand, is loading just an image.
Hence the message: The model expects 4 arrays, but only received one array
From your code, the necessary inputs are:
input_data = Input(name='the_input', shape=input_shape, dtype='float32') labels = Input(name='the_labels', shape=[img_gen.absolute_max_string_len],dtype='float32') input_length = Input(name='input_length', shape=[1], dtype='int64') label_length = Input(name='label_length', shape=[1], dtype='int64')
The original code and your training work because they're using the TextImageGenerator
. This generator cares to give you the four necessary inputs for the model.
So, what you have to do is to predict using the generator. As you have the fit_generator()
method for training with the generator, you also have the predict_generator() method for predicting with the generator.
Now, for a complete answer and solution, I'd have to study your generator and see how it works (which would take me some time). But now you know what is to be done, you can probably figure it out.
You can either use the generator as it is, and predict probably a huge lot of data, or you can try to replicate a generator that will yield just one or a few images with the necessary labels, length and label length.
Or maybe, if possible, just create the 3 remaining arrays manually, but making sure they have the same shapes (except for the first, which is the batch size) as the generator outputs.
The one thing you must assert, though, is: have 4 arrays with the same shapes as the generator outputs, except for the first dimension.
Answers 2
Now I have a model.h5. What's next?
First I should comment that the model.h5
contains the weights of your network, if you wish to save the architecture of your network as well you should save it as a json
like this example:
model_json = model_json = model.to_json() with open("model_arch.json", "w") as json_file: json_file.write(model_json)
Now, once you have your model and its weights you can load them on demand by doing the following:
json_file = open('model_arch.json', 'r') loaded_model_json = json_file.read() json_file.close() loaded_model = model_from_json(loaded_model_json) # load weights into new model # if you already have a loaded model and dont need to save start from here loaded_model.load_weights("model.h5") # compile loaded model with certain specifications sgd = SGD(lr=0.01) loaded_model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
Then, with that loaded_module
you can proceed to predict the classification of certain input like this:
prediction = loaded_model.predict(some_input, batch_size=20, verbose=0)
Which will return the classification of that input.
About the Side Questions:
- CTC seems to be a term they are defining in the paper you refered, extracting from it says:
In what follows, we refer to the task of labelling un- segmented data sequences as temporal classification (Kadous, 2002), and to our use of RNNs for this pur- pose as connectionist temporal classification (CTC).
- To compensate the rotation of a document, images, or similar you could either generate more data from your current one by applying such transformations (take a look at this blog post that explains a way to do that ), or you could use a Convolutional Neural Network approach, which also is actually what that
Keras
example you are using does, as we can see from that git:
This example uses a convolutional stack followed by a recurrent stack and a CTC logloss function to perform optical character recognition of generated text images.
You can check this tutorial that is related to what you are doing and where they also explain more about Convolutional Neural Networks.
- Well this one is a broad question but to detect lines you could use the Hough Line Transform, or also Canny Edge Detection could be good options.
Edit: The error you are getting is because it is expected more parameters instead of 1, from the keras docs we can see:
predict(self, x, batch_size=32, verbose=0)
Raises ValueError: In case of mismatch between the provided input data and the model's expectations, or in case a stateful model receives a number of samples that is not a multiple of the batch size.
0 comments:
Post a Comment