Deconstructing Deep Learning + δeviations
Drop me an email
Format :
Date | Title
TL; DR
Looking at backpropagation from scratch because somehow I have not done that yet.
So I realized we cheated and used Zygote the last time around and so I wanted to do backprop from scratch. Also, I found some more nice things which I did not know so oh well why not.
First let us load our data. For this we use a package for now. (I might do a separate one on loading different types of data and maybe fixing our dataloader at some point). We just load the train and test directly. We are going with MNIST which is basically a set of images of numbers from 1 to 10 in black and white.
using MLDatasets
train_x, train_y = MNIST.traindata()
test_x, test_y = MNIST.testdata();
Now we need to one hot encode the labels. This means that we make a dataframe of sorts with 10 rows and if the current element is present in it, we just put a 1 there and leave the rest of the rows blank. We do this for the entire dataset. Another thing to notice is the reshape. So for our test, we take just 1000 examples. We also reshape the entire array into (nn, num_of_data) where n is the size of the image. Here all our images are 28x28 so we take (2828 = 784, 1000). We then apply one hot to them.
images, labels = (reshape(train_x[:,:,1:1000], (28*28, 1000)), train_y[1:1000])
one_hot_labels = zeros(10,length(labels))
for (i,l) in enumerate(labels)
one_hot_labels[l+1, i] = 1.0
end
labels = one_hot_labels
Rinse and repeat for test set.
test_images = reshape(test_x, (28*28, size(test_x,3)))
test_labels = zeros((10, size(test_x,3)))
for (i,l) in enumerate(test_y)
test_labels[l+1, i] = 1.0
end
We now define ReLU and its derivative. $$\frac{\mathrm{d}\left( x \right)}{dx} = 1$$
We also define some constants like batch size (number of elements taken at once). Learning rate α and number of iterations/epochs. Pixels per image is just 28*28 and we have 10 labels. As for hidden size, that is a parameter which defines how deep our network will be.
relu(x) = x > 0 ? x : 0
relu2deriv(x) = x > 0 ? 1 : 0
batch_size = 100
α, iterations = (0.001, 100)
pixels_per_image, num_labels, hidden_size = (784, 10, 100)
We initialize our weights with a custom function here. (Could replace with He Inuit)
weights_0_1 = 0.2 .* rand(pixels_per_image,hidden_size) .- 0.1
weights_1_2 = 0.2 .* rand(hidden_size,num_labels) .- 0.1
Now for the main loop.
We iterate till the number of iterations and then for every batch in the data. The first bit is forward propagation. We first take the batch of images as an array and then run them through a Linear layer and apply relu.on it. After that we also add dropout. This is randomly choosing a certain number of weights and initializing them to 0. This regulates the weights and prevents the network from becoming over confident. We also calculate the total loss using Mean squared error and store that away as well.
for j = 1:iterations
Error, Correct_cnt = (0.0, 0)
for i = 1:batch_size:size(images,2)-batch_size
batch_start, batch_end = i, i+batch_size-1
layer_0 = images[:, batch_start:batch_end]
layer_1 = relu.(layer_0' * weights_0_1)
dropout_mask = bitrand(size(layer_1))
layer_1 .*= (dropout_mask .* 2)
layer_2 = layer_1 * weights_1_2
Error += sum((labels[:, batch_start:batch_end]' .- layer_2) .^ 2)
Time for backprop. If you notice, all we are doing is calculating the derivatives. After we do that we apply dropout and most importantly, update our weights. Note that this is GD and so there is no momentum term or anything like that. A simple vanilla network.
for k=1:batch_size
Correct_cnt += Int(argmax(layer_2[k, :]) == argmax(labels[:, batch_start+k-1]))[]
layer_2_delta = (labels[:, batch_start:batch_end]' .- layer_2) ./batch_size
layer_1_delta = (layer_2_delta * weights_1_2') .* relu2deriv.(layer_1)
layer_1_delta .*= dropout_mask
weights_1_2 += α .* layer_1' * layer_2_delta
weights_0_1 += α .* layer_0 * layer_1_delta
end
end
This is technically the whole loop. Every few epochs we can calculate the error on the test set, this makes it much easier for us to work with it so we can see how the functions work. Here is the entire code.
for j = 1:iterations
Error, Correct_cnt = (0.0, 0)
for i = 1:batch_size:size(images,2)-batch_size
batch_start, batch_end = i, i+batch_size-1
layer_0 = images[:, batch_start:batch_end]
layer_1 = relu.(layer_0' * weights_0_1)
dropout_mask = bitrand(size(layer_1))
layer_1 .*= (dropout_mask .* 2)
layer_2 = layer_1 * weights_1_2
Error += sum((labels[:, batch_start:batch_end]' .- layer_2) .^ 2)
for k=1:batch_size
Correct_cnt += Int(argmax(layer_2[k, :]) == argmax(labels[:, batch_start+k-1]))[]
layer_2_delta = (labels[:, batch_start:batch_end]' .- layer_2) ./batch_size
layer_1_delta = (layer_2_delta * weights_1_2') .* relu2deriv.(layer_1)
layer_1_delta .*= dropout_mask
weights_1_2 -= α .* layer_1' * layer_2_delta
weights_0_1 -= α .* layer_0 * layer_1_delta
end
end
if (j % 10 == 0)
test_Error, test_Correct_cnt = (0.0, 0)
for i = 1:size(test_images, 2)
layer_0 = test_images[:, i]
layer_1 = relu.(layer_0' * weights_0_1)
layer_2 = layer_1 * weights_1_2
test_Error += sum((test_labels[:, i]' .- layer_2) .^ 2)
test_Correct_cnt += Int(argmax(layer_2[1,:]) == argmax(test_labels[:, i]))
end
println("I: $(j) Train error: $(Error/size(images, 2)) Train accuracy: $(Correct_cnt/size(images, 2)) Test-Err:: $(test_Error/size(test_images, 2)) Test-Acc:: $(test_Correct_cnt/size(test_images, 2))")
end
end
I loved this repository and I found a lot of it from here. Link