This is my custom extension of one of Andrew Ng's neural networks from the Deep Learning course, where instead of producing 0 or 1 for binary classification I'm attempting to classify multiple examples.
Both the inputs and outputs are one-hot encoded.
With not much training I get an accuracy of 'train accuracy: 67.51658067499625 %'.
How can I classify a single training example instead of classifying all training examples?
I think a bug exists in my implementation: the training examples (train_set_x) and output values (train_set_y) both need to have the same dimensions, otherwise an error related to the dimensionality of the matrices is raised. For example, using:
train_set_x = np.array([ [1,1,1,1],[0,1,1,1],[0,0,1,1] ])
train_set_y = np.array([ [1,1,1],[1,1,0],[1,1,1] ])
returns the error:
ValueError                                Traceback (most recent call last)
<ipython-input-11-0d356e8d66f3> in <module>()
     27 print(A)
     28
---> 29 np.multiply(train_set_y,A)
     30
     31 def initialize_with_zeros(numberOfTrainingExamples):
ValueError: operands could not be broadcast together with shapes (3,3) (1,4)
Network code:
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from scipy import ndimage
import pandas as pd

%matplotlib inline

train_set_x = np.array([ [1,1,1,1],[0,1,1,1],[0,0,1,1] ])
train_set_y = np.array([ [1,1,1,0],[1,1,0,0],[1,1,1,1] ])

numberOfFeatures = 4
numberOfTrainingExamples = 3

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

w = np.zeros((numberOfTrainingExamples , 1))
b = 0
A = sigmoid(np.dot(w.T , train_set_x))
print(A)

np.multiply(train_set_y,A)

def initialize_with_zeros(numberOfTrainingExamples):
    w = np.zeros((numberOfTrainingExamples , 1))
    b = 0
    return w, b

def propagate(w, b, X, Y):
    m = X.shape[1]
    A = sigmoid(np.dot(w.T , X) + b)
    cost = -(1/m)*np.sum(np.multiply(Y,np.log(A)) + np.multiply((1-Y),np.log(1-A)), axis=1)
    dw = ( 1 / m ) * np.dot( X, ( A - Y ).T )    # consumes ( A - Y )
    db = ( 1 / m ) * np.sum( A - Y )             # consumes ( A - Y ) again
    # cost = np.squeeze(cost)
    grads = {"dw": dw, "db": db}
    return grads, cost

def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = True):
    costs = []
    for i in range(num_iterations):
        grads, cost = propagate(w, b, X, Y)
        dw = grads["dw"]
        db = grads["db"]
        w = w - (learning_rate * dw)
        b = b - (learning_rate * db)
        if i % 100 == 0:
            costs.append(cost)
        if print_cost and i % 10000 == 0:
            print(cost)
    params = {"w": w, "b": b}
    grads = {"dw": dw, "db": db}
    return params, grads, costs

def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False):
    w, b = initialize_with_zeros(numberOfTrainingExamples)
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = True)
    w = parameters["w"]
    b = parameters["b"]
    Y_prediction_train = sigmoid(np.dot(w.T , X_train) + b)
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))

model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.0001, print_cost = True)
Update: A bug exists in this implementation in that the training example pairs (train_set_x, train_set_y) must have the same dimensions. Can someone point me in the direction of how the linear algebra should be modified?
Update 2:

I modified @Paul Panzer's answer so that the learning rate is 0.001 and the train_set_x, train_set_y pairs are unique:
train_set_x = np.array([ [1,1,1,1,1],[0,1,1,1,1],[0,0,1,1,0],[0,0,1,0,1] ])
train_set_y = np.array([ [1,0,0],[0,0,1],[0,1,0],[1,0,1] ])

grads = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True)

# To classify single training example :
print(sigmoid(dw @ [0,0,1,1,0] + db))
This update produces the following output:
-2.09657359028
-3.94918577439
[[ 0.74043089  0.32851512  0.14776077  0.77970162]
 [ 0.04810012  0.08033521  0.72846174  0.1063849 ]
 [ 0.25956911  0.67148488  0.22029838  0.85223923]]
[[1 0 0 1]
 [0 0 1 0]
 [0 1 0 1]]
train accuracy: 79.84462279013312 %
[[ 0.51309252  0.48853845  0.50945862]
 [ 0.5110232   0.48646923  0.50738869]
 [ 0.51354109  0.48898712  0.50990734]]
Should print(sigmoid(dw @ [0,0,1,1,0] + db)) produce a vector that, once rounded, matches the corresponding train_set_y value [0,1,0]?
Modifying it to produce a column vector (wrapping [0,0,1,1,0] in a numpy array and taking the transpose):

print(sigmoid(dw @ np.array([[0,0,1,1,0]]).T + db))

returns:

array([[ 0.51309252],
       [ 0.48646923],
       [ 0.50990734]])
Again, rounding these values to the nearest whole number produces the vector [1,0,1] when [0,1,0] is expected.

Are these incorrect operations for producing a prediction for a single training example?
3 Answers
Answer 1
Your difficulties come from mismatched dimensions, so let's walk through the problem and try to get them straight.
Your network has a number of inputs, the features; let's call their number N_in (numberOfFeatures in your code). And it has a number of outputs which correspond to different classes; let's call their number N_out. Inputs and outputs are connected by the weights w.
Now here is the problem. Connections are all-to-all, so we need a weight for each of the N_out x N_in pairs of outputs and inputs. Therefore in your code the shape of w must be changed to (N_out, N_in). You probably also want an offset b for each output, so b should be a vector of size (N_out,) or rather (N_out, 1) so it plays well with the 2d terms.
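A minimal sketch of the shapes being described (the sample count m and the toy data here are just for illustration):

import numpy as np

N_in, N_out, m = 4, 3, 5                 # features, classes, training samples

w = np.zeros((N_out, N_in))              # one weight per (output, input) pair
b = np.zeros((N_out, 1))                 # one offset per output, broadcasts over samples

X = np.random.randint(0, 2, (N_in, m))   # inputs as columns
A = 1 / (1 + np.exp(-(w @ X + b)))       # forward pass; shape (N_out, m)
print(A.shape)                           # (3, 5)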
I've fixed that in the modified code below and I tried to make it very explicit. I've also thrown a mock data creator into the bargain.
Re the one-hot encoded categorical output, I'm not an expert on neural networks, but I think most people understand it so that classes are mutually exclusive, so each sample in your mock output should have exactly one 1 and the rest zeros.
Side note:
At one point a competing answer advised you to get rid of the 1-... terms in the cost function. While that looks like an interesting idea to me, my gut feeling (Edit: now confirmed using a gradient-free minimizer; use activation="hybrid" in the code below. The solver will simply maximize all outputs which are active in at least one training example.) is that it won't work just like that, because the cost will then fail to penalise false positives (see below for a detailed explanation). To make it work you'd have to add some kind of regularization. One method that appears to work is using the softmax instead of the sigmoid. The softmax is to one-hot what the sigmoid is to binary. It makes sure the output is a "fuzzy one-hot".
Therefore my recommendation is:
- If you want to stick with sigmoid and not explicitly enforce one-hot predictions, keep the 1-... term.
- If you want to use the shorter cost function, enforce one-hot predictions, for example by using softmax instead of sigmoid.
I've added an activation="sigmoid"|"softmax"|"hybrid" parameter to the code that switches between models. I've also made the scipy general-purpose minimizer available, which may be useful when the gradient of the cost is not at hand.
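For reference, a minimal sketch of the scipy call that the gradient-free path relies on (the cost function and starting point here are placeholders, not the network's actual cost):

import numpy as np
from scipy import optimize as opt

def cost(x):
    # placeholder scalar cost over the flattened parameters
    return np.sum((x - 1.0) ** 2)

x0 = np.zeros(6)                              # e.g. flattened w and b
res = opt.minimize(cost, x0, options=dict(maxiter=10000))
print(res.x)                                  # approaches [1. 1. 1. 1. 1. 1.]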
Recap on how the cost function works:
The cost is a sum over all classes and all training samples of the term
-y log (y') - (1-y) log (1-y')
where y is the expected response, i.e. the one given by the "y" training sample for the input (the "x" training sample), and y' is the prediction, the response the network generates with its current weights and biases. Now, because the expected response is either 0 or 1, the cost for a single category and a single training sample can be written
-log (y')    if y = 1
-log(1-y')   if y = 0
because in the first case (1-y) is zero, so the second term vanishes, and in the second case y is zero, so the first term vanishes. One can now convince oneself that the cost is high if
- the expected response y is 1 and the network prediction y' is close to zero
- the expected response y is 0 and the network prediction y' is close to one
In other words, the cost does its job in punishing wrong predictions. Now, if we drop the second term (1-y) log (1-y'), half of this mechanism is gone. If the expected response is 1, a low prediction will still incur a cost, but if the expected response is 0, the cost will be zero regardless of the prediction; in particular, a high prediction (or false positive) will go unpunished.
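To make that concrete, a small numeric check (the probability values are arbitrary illustration values):

import numpy as np

y, y_pred = 0, 0.9        # expected response 0, network confidently predicts 1: a false positive

full_cost  = -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)
short_cost = -y * np.log(y_pred)                # cost with the (1-y) log(1-y') term dropped

print(full_cost, short_cost)                    # ~2.3 vs 0: only the full cost punishes the false positive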
Now, because the total cost is a sum over all training samples, there are three possibilities.
- All training samples prescribe that the class be zero: then the cost will be completely independent of the predictions for this class and no learning can take place.
- Some training samples put the class at zero, some at one: then, because "false negatives" or "misses" are still punished but false positives aren't, the net will find the easiest way to minimize the cost, which is to indiscriminately increase the prediction of the class for all samples.
- All training samples prescribe that the class be one: essentially the same as in the second scenario will happen, only here it's no problem, because that is the correct behavior.
And finally, why does it work if we use softmax instead of sigmoid? False positives will still be invisible. Now it is easy to see that the sum over all classes of the softmax is one. So I can only increase the prediction for one class if at least one other class is reduced to compensate. In particular, there can be no false positives without a false negative, and the false negative the cost will detect.
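A quick numpy illustration of that constraint (this softmax mirrors the one used in the code below; the logits are arbitrary):

import numpy as np

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)     # subtract max for numerical stability
    z = np.exp(z)
    return z / z.sum(axis=0, keepdims=True)

p = softmax(np.array([[2.0], [0.5], [-1.0]]))
print(p.ravel())        # roughly [0.79 0.18 0.04]
print(p.sum(axis=0))    # [1.] -- raising one class necessarily lowers the others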
import numpy as np
from scipy import optimize as opt
from collections import namedtuple

# First, a few structures to keep ourselves organized
Problem_Size = namedtuple('Problem_Size', 'Out In Samples')
Data = namedtuple('Data', 'Out In')
Network = namedtuple('Network', 'w b activation cost gradient')

def get_dims(Out, In, transpose=False):
    """extract dimensions and ensure everything is 2d
    return Data, Dims"""
    # gracefully accept lists etc.
    Out, In = np.asanyarray(Out), np.asanyarray(In)
    if transpose:
        Out, In = Out.T, In.T
    # if it's a single sample make sure it's n x 1
    Out = Out[:, None] if len(Out.shape) == 1 else Out
    In = In[:, None] if len(In.shape) == 1 else In
    Dims = Problem_Size(Out.shape[0], *In.shape)
    if Dims.Samples != Out.shape[1]:
        raise ValueError("number of samples must be the same for Out and In")
    return Data(Out, In), Dims

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

def sig_cost(Net, data):
    A = process(data.In, Net)
    logA = np.log(A)
    return -(data.Out * logA + (1-data.Out) * (1-logA)).sum(axis=0).mean()

def sig_grad(Net, Dims, data):
    A = process(data.In, Net)
    return dict(dw=(A - data.Out) @ data.In.T / Dims.Samples,
                db=(A - data.Out).mean(axis=1, keepdims=True))

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)
    z = np.exp(z)
    return z / z.sum(axis=0, keepdims=True)

def sof_cost(Net, data):
    A = process(data.In, Net)
    logA = np.log(A)
    return -(data.Out * logA).sum(axis=0).mean()

sof_grad = sig_grad

def get_net(Dims, activation='softmax'):
    activation, cost, gradient = {
        'sigmoid': (sigmoid, sig_cost, sig_grad),
        'softmax': (softmax, sof_cost, sof_grad),
        'hybrid': (sigmoid, sof_cost, None)}[activation]
    return Network(w=np.zeros((Dims.Out, Dims.In)), b=np.zeros((Dims.Out, 1)),
                   activation=activation, cost=cost, gradient=gradient)

def process(In, Net):
    return Net.activation(Net.w @ In + Net.b)

def propagate(data, Dims, Net):
    return Net.gradient(Net, Dims, data), Net.cost(Net, data)

def optimize_no_grad(Net, Dims, data):
    def f(x):
        Net.w[...] = x[:Net.w.size].reshape(Net.w.shape)
        Net.b[...] = x[Net.w.size:].reshape(Net.b.shape)
        return Net.cost(Net, data)
    x = np.r_[Net.w.ravel(), Net.b.ravel()]
    res = opt.minimize(f, x, options=dict(maxiter=10000)).x
    Net.w[...] = res[:Net.w.size].reshape(Net.w.shape)
    Net.b[...] = res[Net.w.size:].reshape(Net.b.shape)

def optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True):
    w, b = Net.w, Net.b
    costs = []
    for i in range(num_iterations):
        grads, cost = propagate(data, Dims, Net)
        dw = grads["dw"]
        db = grads["db"]
        w -= learning_rate * dw
        b -= learning_rate * db
        if i % 100 == 0:
            costs.append(cost)
        if print_cost and i % 10000 == 0:
            print(cost)
    return grads, costs

def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False, activation='sigmoid'):
    data, Dims = get_dims(Y_train, X_train, transpose=True)
    Net = get_net(Dims, activation)
    if Net.gradient is None:
        optimize_no_grad(Net, Dims, data)
    else:
        grads, costs = optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True)
    Y_prediction_train = process(data.In, Net)
    print(Y_prediction_train)
    print(data.Out)
    print(Y_prediction_train.sum(axis=0))
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - data.Out)) * 100))

def create_data(Dims):
    Out = np.zeros((Dims.Out, Dims.Samples), dtype=int)
    Out[np.random.randint(0, Dims.Out, (Dims.Samples,)), np.arange(Dims.Samples)] = 1
    In = np.random.randint(0, 2, (Dims.In, Dims.Samples))
    return Data(Out, In)

train_set_x = np.array([ [1,1,1,1,1],[0,1,1,1,1],[0,0,1,1,0],[0,0,1,0,1] ])
train_set_y = np.array([ [1,0,0],[1,0,0],[0,0,1],[0,0,1] ])

model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='sigmoid')
model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='softmax')
model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='hybrid')

Dims = Problem_Size(8, 100, 50)
data = create_data(Dims)
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='softmax')
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='sigmoid')
Answer 2
Both the question of how to fix the bug and of how to extend the implementation to classify between more classes can be solved with some dimensionality analysis.
I am assuming that by classifying multiple examples you mean multiple classes and not multiple samples, as we need multiple samples to train even for 2 classes.
Where N = number of samples, D = number of features, K = number of categories (with K=2 being a special case where one can reduce this down to one dimension, i.e. K=1 with y=0 signifying one class and y=1 the other), the data should have the following dimensions:
X: N * D    #input
y: N * K    #output
W: D * K    #weights, also dW has same dimensions
b: 1 * K    #bias, also db has same dimensions
#A should have same dimensions as y
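A minimal sketch that instantiates these shapes and runs a forward pass (the concrete values of N, D and K are placeholders):

import numpy as np

N, D, K = 5, 4, 3                            # samples, features, categories

X = np.random.randint(0, 2, (N, D))          # N * D input
y = np.zeros((N, K)); y[np.arange(N), np.random.randint(0, K, N)] = 1   # N * K one-hot output
W = np.zeros((D, K))                         # D * K weights
b = np.zeros((1, K))                         # 1 * K bias

A = 1 / (1 + np.exp(-(np.dot(X, W) + b)))    # same dimensions as y
print(A.shape)                               # (5, 3)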
The order of the dimensions can be switched around, as long as the dot products are done correctly.
First, dealing with your bug: you are initializing W as N * K instead of D * K, i.e. in the binary case:

w = np.zeros((numberOfTrainingExamples , 1))
#instead of
w = np.zeros((numberOfFeatures , 1))

This means that the only time you are initializing W to the correct dimensions is when y and X (coincidentally) have the same dimensions. This will mess with your dot products as well:

np.dot(X, w)    # or np.dot(w.T, X.T) if you define y as [K * N] dimensions
#instead of
np.dot(w.T , X)

and

np.dot( X.T, ( A - Y ) )    # np.dot( X.T, ( A - Y ).T ) if y:[K * N]
#instead of
np.dot( X, ( A - Y ).T )
Also make sure that the cost function returns one number (i.e. not an array).
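For instance, summing over the class axis and then averaging over samples collapses the cost to a scalar (a sketch using the dimension conventions above; the arrays are illustrative):

import numpy as np

A = np.full((5, 3), 0.5)                     # predictions, N * K
y = np.zeros((5, 3)); y[:, 0] = 1            # one-hot targets, N * K

cost = -np.mean(np.sum(y * np.log(A) + (1 - y) * np.log(1 - A), axis=1))
print(cost, np.ndim(cost))                   # one number, 0-dimensional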
Secondly, going on to K>2 you need to make some changes. b is no longer a single number, but a vector (1D array). y and W go from being 1D arrays to 2D arrays. To avoid confusion and hard-to-find bugs it could be good to set K, N and D to different values.
Answer 3
The shape of A is (1,4), while the shape of train_set_y is (3,3).
You can't element-wise multiply two matrices whose shapes are not compatible along each axis.
Try to make train_set_y to be of a shape that would match the shape of A. For example:
train_set_y = np.array([ [1,1,1,0], [1,1,0,0], [1,1,1,1], [0,0,0,0] ])
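To see the broadcasting rule in action (illustrative arrays only), compare a mismatched and a matching pair of shapes:

import numpy as np

A = np.zeros((1, 4))

try:
    np.multiply(np.zeros((3, 3)), A)             # (3,3) vs (1,4): no axis lines up
except ValueError as e:
    print(e)                                      # operands could not be broadcast together ...

print(np.multiply(np.zeros((4, 4)), A).shape)    # (4,4) vs (1,4) broadcasts to (4, 4)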