Has anyone had any success with efficient data parallelism, where you send an identical model definition to multiple GPUs but send different user data to each GPU?
It looks like dist-keras might be promising, but I would love to hear feedback on any approaches taken along these lines.
We have user behavioral data: 100k users, 200 fields (one-hot vectors), and 30,000 records per user. We built an RNN, using Keras on top of TensorFlow, to predict the next action (out of 20+ possible actions) for a single user. It takes about 30 minutes to train on 1 GPU (my box has 8 GPUs). Now we would like to build models for all 100k users.
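For reference, each per-user model is a Keras RNN along these general lines (a simplified sketch only; the layer sizes and sequence length here are placeholders, not our actual architecture):

from keras.models import Sequential
from keras.layers import LSTM, Dense

NUM_FIELDS = 200     # one-hot encoded fields per record
SEQ_LEN = 50         # placeholder window length over a user's history
NUM_ACTIONS = 20     # 20+ possible next actions

# Simplified per-user next-action model: a sequence of one-hot records in,
# a softmax over the possible next actions out.
model = Sequential()
model.add(LSTM(128, input_shape=(SEQ_LEN, NUM_FIELDS)))
model.add(Dense(NUM_ACTIONS, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])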
We were able to perform data parallelism using a multi-GPU approach on a single user's data.
But since the model takes 30 minutes per user and there are 100k users, we want to partition the data by user, run the same model on each user's data in a distributed way across a cluster, and generate a model output for each user.
I am currently using Keras 2.1.x with TensorFlow 1.4.
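To make this concrete, on a single 8-GPU box the kind of per-user parallelism we mean might look roughly like the following sketch (build_model, load_user_data, and all_user_ids are hypothetical placeholders for our own code, not real library calls):

import os
from multiprocessing import Process, Queue

NUM_GPUS = 8

def worker(gpu_id, user_queue):
    # Pin this worker to one GPU before Keras/TensorFlow initializes a session.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    from keras import backend as K
    while True:
        user_id = user_queue.get()
        if user_id is None:                        # sentinel: no more work
            break
        X, y = load_user_data(user_id)             # hypothetical data loader
        model = build_model()                      # hypothetical: same architecture per user
        model.fit(X, y, epochs=5, batch_size=128)  # illustrative settings
        model.save('model_user_%s.h5' % user_id)
        K.clear_session()                          # release GPU memory before the next user

if __name__ == '__main__':
    queue = Queue()
    for user_id in all_user_ids:                   # hypothetical list of 100k user ids
        queue.put(user_id)
    for _ in range(NUM_GPUS):
        queue.put(None)                            # one stop sentinel per worker
    workers = [Process(target=worker, args=(gpu, queue)) for gpu in range(NUM_GPUS)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

The same pattern would extend to a cluster by running one such script per machine over a different shard of the user ids.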
1 Answer
This is not exactly what you are describing; however, something that might work is to take slices of each batch and train them on different GPUs in parallel, by taking the model and constructing a separate wrapper model that does this splitting automatically.
So say we want to parallelize the model so that each training batch is split among the available GPUs.
import tensorflow as tf
from keras.layers import Lambda, concatenate
from keras.models import Model

def make_parallel(model, gpu_count):
    """
    Make a parallelized model from the input model on the given GPU count
    that splits the input batch amongst the hardware.

    :param model: The model you want to make parallel
    :param gpu_count: The GPU count
    :return: The parallelized model
    """
    def get_slice(data, idx, parts):
        # Take the idx-th slice of the batch dimension
        shape = tf.shape(data)
        size = tf.concat([shape[:1] // parts, shape[1:]], axis=0)
        stride = tf.concat([shape[:1] // parts, shape[1:] * 0], axis=0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = [[] for _ in range(len(model.outputs))]

    # Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i):
                inputs = []
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape,
                                     arguments={'idx': i, 'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save all outputs so they can be merged later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # Merge the per-GPU outputs back into full batches on the CPU
    with tf.device('/cpu:0'):
        merged = [concatenate(output, axis=0) for output in outputs_all]

    return Model(inputs=model.inputs, outputs=merged)
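Usage might look roughly like this (model, X_train, and y_train are assumed placeholders; the batch size should be divisible by gpu_count so every tower gets a slice):

# Hypothetical usage sketch: wrap an existing single-GPU Keras model and
# train it across 8 GPUs. X_train / y_train stand in for your own data.
parallel_model = make_parallel(model, gpu_count=8)
parallel_model.compile(optimizer='adam',               # illustrative settings
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])
parallel_model.fit(X_train, y_train, batch_size=256, epochs=10)

Because every tower reuses the same underlying layers, the trained weights live in the original model object, which can be saved as usual.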
Can you report back speed results when you train with this model?