In a TensorFlow optimizer (Python), the method _apply_dense gets called separately for the neuron weights (layer connections) and for the bias weights, but I would like to use both of them within a single call of this method.
def _apply_dense(self, grad, weight): ...
For example: a fully connected neural network with two hidden layers, each with two neurons and a bias.
If we take a look at layer 2, _apply_dense gets one call for the neuron weights,

X_1X_3  X_2X_3
X_1X_4  X_2X_4

and a separate call for the bias weights,

B_1X_3
B_1X_4

But I would need either both matrices in one call of _apply_dense, or a single weight matrix like this:

X_1X_3  X_2X_3  B_1X_3
X_1X_4  X_2X_4  B_1X_4

X_2X_4, B_1X_4, ... is just a notation for the weight of the connection between the two neurons. B_1X_4 is therefore only a placeholder for the weight between B_1 and X_4.
How to do this?
MWE
As a minimal working example, here is a stochastic gradient descent optimizer implementation with momentum. For every layer, the momentum of all incoming connections from other neurons is reduced to its mean (see the ndims == 2 case). What I need instead is the mean over not only the momentum values of the incoming neuron connections but also those of the incoming bias connections (as described above).
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tensorflow.python.training import optimizer


class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, mu=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = mu

        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, weight):
        learning_rate_t = tf.cast(self._lr_t, weight.dtype.base_dtype)
        mu_t = tf.cast(self._mu_t, weight.dtype.base_dtype)
        momentum = self.get_slot(weight, "a")

        if momentum.get_shape().ndims == 2:  # neuron weights
            momentum_mean = tf.reduce_mean(momentum, axis=1, keep_dims=True)
        elif momentum.get_shape().ndims == 1:  # bias weights
            momentum_mean = momentum
        else:
            momentum_mean = momentum

        momentum_update = grad + (mu_t * momentum_mean)
        momentum_t = tf.assign(momentum, momentum_update, use_locking=self._use_locking)

        weight_update = learning_rate_t * momentum_t
        weight_t = tf.assign_sub(weight, weight_update, use_locking=self._use_locking)

        return tf.group(*[weight_t, momentum_t])

    def _prepare(self):
        self._lr_t = tf.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = tf.convert_to_tensor(self._mu, name="momentum_term")
For a simple neural network: https://raw.githubusercontent.com/aymericdamien/TensorFlow-Examples/master/examples/3_NeuralNetworks/multilayer_perceptron.py (only change the optimizer to the custom SGDmomentum optimizer)
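To plug it into that script, the only line that should need to change is the one that builds the training op. A minimal sketch, assuming the linked example's loss tensor is named cost:

# Swap the script's built-in optimizer for the custom one (cost is the example's loss tensor)
train_op = SGDmomentum(learning_rate=0.001, mu=0.9).minimize(cost)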
1 Answer
I'm not 100% clear on what you are trying to do, so I'm not sure if this really answers your question.
Let's say you have a dense layer transforming an input of size M into an output of size N. Following the convention you show, you would have an N × M weights matrix W and an N-sized bias vector B. An input vector X of size M (or a batch of inputs of size M × K) is then processed by the layer as W · X + B, followed by the activation function (in the case of a batch, the addition is a "broadcasted" operation). In TensorFlow:
X = ...  # Input batch of size M x K
W = ...  # Weights of size N x M
B = ...  # Biases of size N
Y = tf.matmul(W, X) + B[:, tf.newaxis]  # Output of size N x K
# Activation...
If you want, you can always put W and B together in a single extended weights matrix W*, basically adding B as a new column of W, so W* would be N × (M + 1). Then you just need to append one element containing a constant 1 to the input vector X (or a row of ones if it's a batch), so you would get X* with size M + 1 (or (M + 1) × K for a batch). The product W* · X* then gives you the same result as before. In TensorFlow:
X = ...  # Input batch of size M x K
W_star = ...  # Extended weights of size N x (M + 1)
# You can still have a "view" of the original W and B if you need it
W = W_star[:, :-1]  # Size N x M
B = W_star[:, -1]   # Size N
X_star = tf.concat([X, tf.ones_like(X[:1])], axis=0)  # Size (M + 1) x K
Y = tf.matmul(W_star, X_star)  # Output of size N x K
# Activation...
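Connecting this back to the MWE in the question: if each layer is parameterized by a single extended variable like W_star, then _apply_dense receives one rank-2 variable (and one momentum slot of the same shape) per layer, so the mean over incoming connections automatically covers the bias column as well. A sketch, assuming the N × (M + 1) layout above:

# Inside _apply_dense: momentum has shape N x (M + 1), i.e. M neuron columns plus one bias column,
# so averaging over axis 1 includes the bias momentum in the mean
momentum_mean = tf.reduce_mean(momentum, axis=1, keep_dims=True)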
Now you can compute gradients and updates for the weights and biases together. A drawback of this approach is that, if you want to apply regularization, you should be careful to apply it only to the weights part of the matrix, not to the bias column.
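For example, an L2 penalty restricted to the weights part can be built from the slice of W_star that excludes the bias column. A sketch; data_loss and the 0.01 factor are placeholders:

l2_weights_only = tf.nn.l2_loss(W_star[:, :-1])  # weights only, bias column excluded
loss = data_loss + 0.01 * l2_weights_only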