I would like to use batch normalization in TensorFlow, since I found it in the source code core/ops/nn_ops.cc. However, it is not documented on tensorflow.org.
BN has different semantics in MLPs and CNNs, so I am not sure what exactly this BN does.
I did not find a method called MovingMoments either.
The C++ code is copied here for reference:
REGISTER_OP("BatchNormWithGlobalNormalization") .Input("t: T") .Input("m: T") .Input("v: T") .Input("beta: T") .Input("gamma: T") .Output("result: T") .Attr("T: numbertype") .Attr("variance_epsilon: float") .Attr("scale_after_normalization: bool") .Doc(R"doc( Batch normalization. t: A 4D input Tensor. m: A 1D mean Tensor with size matching the last dimension of t. This is the first output from MovingMoments. v: A 1D variance Tensor with size matching the last dimension of t. This is the second output from MovingMoments. beta: A 1D beta Tensor with size matching the last dimension of t. An offset to be added to the normalized tensor. gamma: A 1D gamma Tensor with size matching the last dimension of t. If "scale_after_normalization" is true, this tensor will be multiplied with the normalized tensor. variance_epsilon: A small float number to avoid dividing by 0. scale_after_normalization: A bool indicating whether the resulted tensor needs to be multiplied with gamma. )doc");
6 Answers
Answer 1
Update July 2016: The easiest way to use batch normalization in TensorFlow is through the higher-level interfaces provided in contrib/layers, tflearn, or slim.
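With the contrib.layers interface, for example, the whole layer is a single call. A hedged sketch (net stands in for the previous layer's output; the argument names follow the contrib batch_norm signature pasted in Answer 4 below):

import tensorflow as tf

# 'net' is a hypothetical tensor, the output of the previous layer.
h = tf.contrib.layers.batch_norm(
    net,
    decay=0.999,               # moving-average decay for the stored statistics
    center=True, scale=True,   # learn beta (offset) and gamma (scale)
    is_training=True,          # use False when building the inference graph
    updates_collections=None,  # run the moving-average updates inline
    scope='bn')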
Previous answer, if you want to DIY: the documentation string for this has improved since the release; see the docs comment in the master branch instead of the one you found. It clarifies, in particular, that m and v are the outputs from tf.nn.moments.
You can see a very simple example of its use in the batch_norm test code. For a more real-world use example, I've included below the helper class and use notes that I scribbled up for my own use (no warranty provided!):
"""A helper class for managing batch normalization state. This class is designed to simplify adding batch normalization (http://arxiv.org/pdf/1502.03167v3.pdf) to your model by managing the state variables associated with it. Important use note: The function get_assigner() returns an op that must be executed to save the updated state. A suggested way to do this is to make execution of the model optimizer force it, e.g., by: update_assignments = tf.group(bn1.get_assigner(), bn2.get_assigner()) with tf.control_dependencies([optimizer]): optimizer = tf.group(update_assignments) """ import tensorflow as tf class ConvolutionalBatchNormalizer(object): """Helper class that groups the normalization logic and variables. Use: ewma = tf.train.ExponentialMovingAverage(decay=0.99) bn = ConvolutionalBatchNormalizer(depth, 0.001, ewma, True) update_assignments = bn.get_assigner() x = bn.normalize(y, train=training?) (the output x will be batch-normalized). """ def __init__(self, depth, epsilon, ewma_trainer, scale_after_norm): self.mean = tf.Variable(tf.constant(0.0, shape=[depth]), trainable=False) self.variance = tf.Variable(tf.constant(1.0, shape=[depth]), trainable=False) self.beta = tf.Variable(tf.constant(0.0, shape=[depth])) self.gamma = tf.Variable(tf.constant(1.0, shape=[depth])) self.ewma_trainer = ewma_trainer self.epsilon = epsilon self.scale_after_norm = scale_after_norm def get_assigner(self): """Returns an EWMA apply op that must be invoked after optimization.""" return self.ewma_trainer.apply([self.mean, self.variance]) def normalize(self, x, train=True): """Returns a batch-normalized version of x.""" if train: mean, variance = tf.nn.moments(x, [0, 1, 2]) assign_mean = self.mean.assign(mean) assign_variance = self.variance.assign(variance) with tf.control_dependencies([assign_mean, assign_variance]): return tf.nn.batch_norm_with_global_normalization( x, mean, variance, self.beta, self.gamma, self.epsilon, self.scale_after_norm) else: mean = self.ewma_trainer.average(self.mean) variance = self.ewma_trainer.average(self.variance) local_beta = tf.identity(self.beta) local_gamma = tf.identity(self.gamma) return tf.nn.batch_norm_with_global_normalization( x, mean, variance, local_beta, local_gamma, self.epsilon, self.scale_after_norm)
Note that I called it a ConvolutionalBatchNormalizer because it pins the use of tf.nn.moments to sum across axes 0, 1, and 2, whereas for non-convolutional use you might only want axis 0.
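In other words, the only difference between the convolutional and fully connected cases is the set of axes handed to tf.nn.moments. A small illustration (the shapes here are made up):

import tensorflow as tf

conv_act = tf.placeholder(tf.float32, [None, 28, 28, 64])  # NHWC feature maps
fc_act = tf.placeholder(tf.float32, [None, 256])           # fully connected activations

# Convolutional case: reduce over batch, height and width -> one mean/variance per channel.
conv_mean, conv_var = tf.nn.moments(conv_act, [0, 1, 2])   # shape [64]

# Fully connected case: reduce over the batch dimension only -> one mean/variance per unit.
fc_mean, fc_var = tf.nn.moments(fc_act, [0])                # shape [256]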
Feedback appreciated if you use it.
Answer 2
The following works fine for me; it does not require invoking the EMA apply op separately outside the function.
import numpy as np
import tensorflow as tf
from tensorflow.python import control_flow_ops

def batch_norm(x, n_out, phase_train, scope='bn'):
    """
    Batch normalization on convolutional maps.
    Args:
        x:           Tensor, 4D BHWD input maps
        n_out:       integer, depth of input maps
        phase_train: boolean tf.Variable, true indicates training phase
        scope:       string, variable scope
    Return:
        normed:      batch-normalized maps
    """
    with tf.variable_scope(scope):
        beta = tf.Variable(tf.constant(0.0, shape=[n_out]),
                           name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[n_out]),
                            name='gamma', trainable=True)
        batch_mean, batch_var = tf.nn.moments(x, [0, 1, 2], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        mean, var = tf.cond(phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean), ema.average(batch_var)))
        normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
    return normed
Example:
import math

n_in, n_out = 3, 16
ksize = 3
stride = 1
phase_train = tf.placeholder(tf.bool, name='phase_train')
input_image = tf.placeholder(tf.float32, name='input_image')
kernel = tf.Variable(tf.truncated_normal([ksize, ksize, n_in, n_out],
                                         stddev=math.sqrt(2.0/(ksize*ksize*n_out))),
                     name='kernel')
conv = tf.nn.conv2d(input_image, kernel, [1, stride, stride, 1], padding='SAME')
conv_bn = batch_norm(conv, n_out, phase_train)
relu = tf.nn.relu(conv_bn)

with tf.Session() as session:
    session.run(tf.initialize_all_variables())
    for i in range(20):
        test_image = np.random.rand(4, 32, 32, 3)
        sess_outputs = session.run([relu],
                                   {input_image.name: test_image,
                                    phase_train.name: True})
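For completeness, the same graph can then be evaluated with phase_train fed as False, so the tf.cond in batch_norm falls through to the accumulated moving averages rather than the batch statistics. A sketch continuing inside the with tf.Session() block above, after the training loop:

    # Inference pass: feed phase_train=False so batch_norm uses the
    # exponential moving averages instead of the current batch statistics.
    eval_image = np.random.rand(4, 32, 32, 3)
    eval_outputs = session.run([relu],
                               {input_image.name: eval_image,
                                phase_train.name: False})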
Answer 3
Just a heads up (I can't comment because I don't have 50 rep): @bgshi's answer above does not seem correct. When phase_train is set to False, it still updates the EMA mean and variance. This can be verified with the following code snippet.
# assumes an active default session, e.g. sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32, [None, 20, 20, 10], name='input')
phase_train = tf.placeholder(tf.bool, name='phase_train')

# generate random noise to pass into batch norm
x_gen = tf.random_normal([50, 20, 20, 10])
pt_false = tf.Variable(tf.constant(True))  # generate a constant variable to pass into batch norm
y = x_gen.eval()

# batch_norm is the function from the answer above
bn = batch_norm(x, 10, phase_train)

tf.initialize_all_variables().run()
train_step = lambda: bn.eval({x: x_gen.eval(), phase_train: True})
test_step = lambda: bn.eval({x: y, phase_train: False})
test_step_c = lambda: bn.eval({x: y, phase_train: True})

# Verify that this is different as expected; two different x's have different norms
print(train_step()[0][0][0])
print(train_step()[0][0][0])

# Verify that this is the same as expected; the same x (y) has the same norm
print(test_step_c()[0][0][0])
print(test_step_c()[0][0][0])

# THIS IS DIFFERENT but should be the same; it should only be reading from the ema
print(test_step()[0][0][0])
print(test_step()[0][0][0])
Answer 4
There is also an "official" batch normalization layer coded by the developers. They don't have very good docs on how to use it, but here is how I use it:
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm

def batch_norm_layer(x, train_phase, scope_bn):
    bn_train = batch_norm(x, decay=0.999, center=True, scale=True,
                          is_training=True,
                          reuse=None,  # is this right?
                          trainable=True,
                          scope=scope_bn)
    bn_inference = batch_norm(x, decay=0.999, center=True, scale=True,
                              is_training=False,
                              reuse=True,  # is this right?
                              trainable=True,
                              scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
    return z
To actually use it you need to create a placeholder for train_phase that indicates whether you are in the training or inference phase (as in train_phase = tf.placeholder(tf.bool, name='phase_train')). Its value can then be fed during inference or training in a tf.Session, as in:
test_error = sess.run(fetches=cross_entropy, feed_dict={x: batch_xtest, y_:batch_ytest, train_phase: False})
or during training:
sess.run(fetches=train_step, feed_dict={x: batch_xs, y_:batch_ys, train_phase: True})
and here is the function that implements BN according to them (it can be found at the link I gave above, but I'm pasting it here just in case they change it):
@add_arg_scope
def batch_norm(inputs,
               decay=0.999,
               center=True,
               scale=False,
               epsilon=0.001,
               activation_fn=None,
               updates_collections=ops.GraphKeys.UPDATE_OPS,
               is_training=True,
               reuse=None,
               variables_collections=None,
               outputs_collections=None,
               trainable=True,
               scope=None):
  """Adds a Batch Normalization layer from http://arxiv.org/abs/1502.03167.

    "Batch Normalization: Accelerating Deep Network Training by Reducing
    Internal Covariate Shift"

    Sergey Ioffe, Christian Szegedy

  Can be used as a normalizer function for conv2d and fully_connected.

  Args:
    inputs: a tensor of size `[batch_size, height, width, channels]`
      or `[batch_size, channels]`.
    decay: decay for the moving average.
    center: If True, subtract `beta`. If False, `beta` is ignored.
    scale: If True, multiply by `gamma`. If False, `gamma` is not used. When the
      next layer is linear (also e.g. `nn.relu`), this can be disabled since the
      scaling can be done by the next layer.
    epsilon: small float added to variance to avoid dividing by zero.
    activation_fn: Optional activation function.
    updates_collections: collections to collect the update ops for computation.
      If None, a control dependency would be added to make sure the updates are
      computed.
    is_training: whether or not the layer is in training mode. In training mode
      it would accumulate the statistics of the moments into `moving_mean` and
      `moving_variance` using an exponential moving average with the given
      `decay`. When it is not in training mode then it would use the values of
      the `moving_mean` and the `moving_variance`.
    reuse: whether or not the layer and its variables should be reused. To be
      able to reuse the layer scope must be given.
    variables_collections: optional collections for the variables.
    outputs_collections: collections to add the outputs.
    trainable: If `True` also add variables to the graph collection
      `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
    scope: Optional scope for `variable_op_scope`.

  Returns:
    a tensor representing the output of the operation.
  """
  with variable_scope.variable_op_scope([inputs], scope, 'BatchNorm',
                                        reuse=reuse) as sc:
    inputs_shape = inputs.get_shape()
    dtype = inputs.dtype.base_dtype
    axis = list(range(len(inputs_shape) - 1))
    params_shape = inputs_shape[-1:]
    # Allocate parameters for the beta and gamma of the normalization.
    beta, gamma = None, None
    if center:
      beta_collections = utils.get_variable_collections(variables_collections,
                                                        'beta')
      beta = variables.model_variable('beta',
                                      shape=params_shape,
                                      dtype=dtype,
                                      initializer=init_ops.zeros_initializer,
                                      collections=beta_collections,
                                      trainable=trainable)
    if scale:
      gamma_collections = utils.get_variable_collections(variables_collections,
                                                         'gamma')
      gamma = variables.model_variable('gamma',
                                       shape=params_shape,
                                       dtype=dtype,
                                       initializer=init_ops.ones_initializer,
                                       collections=gamma_collections,
                                       trainable=trainable)
    # Create moving_mean and moving_variance variables and add them to the
    # appropriate collections.
    moving_mean_collections = utils.get_variable_collections(
        variables_collections, 'moving_mean')
    moving_mean = variables.model_variable('moving_mean',
                                           shape=params_shape,
                                           dtype=dtype,
                                           initializer=init_ops.zeros_initializer,
                                           trainable=False,
                                           collections=moving_mean_collections)
    moving_variance_collections = utils.get_variable_collections(
        variables_collections, 'moving_variance')
    moving_variance = variables.model_variable('moving_variance',
                                               shape=params_shape,
                                               dtype=dtype,
                                               initializer=init_ops.ones_initializer,
                                               trainable=False,
                                               collections=moving_variance_collections)
    if is_training:
      # Calculate the moments based on the individual batch.
      mean, variance = nn.moments(inputs, axis, shift=moving_mean)
      # Update the moving_mean and moving_variance moments.
      update_moving_mean = moving_averages.assign_moving_average(
          moving_mean, mean, decay)
      update_moving_variance = moving_averages.assign_moving_average(
          moving_variance, variance, decay)
      if updates_collections is None:
        # Make sure the updates are computed here.
        with ops.control_dependencies([update_moving_mean,
                                       update_moving_variance]):
          outputs = nn.batch_normalization(
              inputs, mean, variance, beta, gamma, epsilon)
      else:
        # Collect the updates to be computed later.
        ops.add_to_collections(updates_collections, update_moving_mean)
        ops.add_to_collections(updates_collections, update_moving_variance)
        outputs = nn.batch_normalization(
            inputs, mean, variance, beta, gamma, epsilon)
    else:
      outputs = nn.batch_normalization(
          inputs, moving_mean, moving_variance, beta, gamma, epsilon)
    outputs.set_shape(inputs.get_shape())
    if activation_fn:
      outputs = activation_fn(outputs)
    return utils.collect_named_outputs(outputs_collections, sc.name, outputs)
I'm pretty sure this is correct according to the discussion on GitHub.
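One practical note, following the docstring above: with the default updates_collections (ops.GraphKeys.UPDATE_OPS), the moving_mean / moving_variance update ops are only collected, not run automatically, so they have to be attached to the training op yourself. A hedged sketch (loss stands in for your own objective):

import tensorflow as tf

# Run the collected batch-norm moving-average updates before each training step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)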
Answer 5
So, a simple example of using the ConvolutionalBatchNormalizer class from Answer 1:
from bn_class import *

with tf.name_scope('Batch_norm_conv1') as scope:
    ewma = tf.train.ExponentialMovingAverage(decay=0.99)
    bn_conv1 = ConvolutionalBatchNormalizer(num_filt_1, 0.001, ewma, True)
    update_assignments = bn_conv1.get_assigner()
    a_conv1 = bn_conv1.normalize(a_conv1, train=bn_train)
    h_conv1 = tf.nn.relu(a_conv1)
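Per the use note in the helper class's docstring (Answer 1), the assigner op then has to be forced to run alongside the optimizer. A sketch of how that could be wired up (loss here is a hypothetical objective defined elsewhere in the model):

# Force the EMA update (the assigner) to run whenever the optimizer runs,
# following the use note in the ConvolutionalBatchNormalizer docstring.
optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss)
with tf.control_dependencies([optimizer]):
    train_step = tf.group(update_assignments)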
Answer 6
Using TensorFlow's built-in batch_norm layer, below is code to load data, build a network with one hidden ReLU layer and L2 regularization, and introduce batch normalization for both the hidden and output layers. It runs and trains fine. Just FYI, this example is mostly built on the data and code from the Udacity Deep Learning course. P.S. Yes, parts of this were discussed one way or another in the earlier answers, but I decided to gather everything into one code snippet so that you have an example of the whole network training process with batch normalization and its evaluation.
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

pickle_file = '/home/maxkhk/Documents/Udacity/DeepLearningCourse/SourceCode/tensorflow/examples/udacity/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valid_dataset.shape, valid_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 2 to [0.0, 1.0, 0.0 ...], 3 to [0.0, 0.0, 1.0 ...]
    labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])

# for NeuralNetwork model code is below
# We will use SGD for training to save our time. Code is from Assignment 2
# beta is the new parameter - controls level of regularization.
# Feel free to play with it - the best one I found is 0.001
# notice, we introduce L2 for both biases and weights of all layers

batch_size = 128
beta = 0.001

# building tensorflow graph
graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # introduce batchnorm
    tf_train_dataset_bn = tf.contrib.layers.batch_norm(tf_train_dataset)

    # now let's build our new hidden layer
    # that's how many hidden neurons we want
    num_hidden_neurons = 1024
    # its weights
    hidden_weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
    hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

    # now the layer itself. It multiplies data by weights, adds biases
    # and takes ReLU over result
    hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset_bn, hidden_weights) + hidden_biases)

    # adding the batch normalization layer
    hidden_layer_bn = tf.contrib.layers.batch_norm(hidden_layer)

    # time to go for output linear layer
    # out weights connect hidden neurons to output labels
    # biases are added to output labels
    out_weights = tf.Variable(
        tf.truncated_normal([num_hidden_neurons, num_labels]))
    out_biases = tf.Variable(tf.zeros([num_labels]))

    # compute output
    out_layer = tf.matmul(hidden_layer_bn, out_weights) + out_biases

    # our real output is a softmax of prior result
    # and we also compute its cross-entropy to get our loss
    # Notice - we introduce our L2 here
    loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        out_layer, tf_train_labels) +
        beta * tf.nn.l2_loss(hidden_weights) +
        beta * tf.nn.l2_loss(hidden_biases) +
        beta * tf.nn.l2_loss(out_weights) +
        beta * tf.nn.l2_loss(out_biases)))

    # now we just minimize this loss to actually train the network
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    # nice, now let's calculate the predictions on each dataset for evaluating the
    # performance so far
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(out_layer)
    valid_relu = tf.nn.relu(tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
    valid_prediction = tf.nn.softmax(tf.matmul(valid_relu, out_weights) + out_biases)

    test_relu = tf.nn.relu(tf.matmul(tf_test_dataset, hidden_weights) + hidden_biases)
    test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

# now is the actual training on the ANN we built
# we will run it for some number of steps and evaluate the progress after
# every 500 steps

# number of steps we will train our ANN
num_steps = 3001

# actual training
with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print("Initialized")
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))