Backprop

Looking at backpropagation from scratch because somehow I have not done that yet.

I realized we cheated last time and used Zygote for the gradients, so this time I wanted to do backprop from scratch. I also found a few nice things I did not know along the way, so why not.

Data

First let us load our data. For this we use a package for now. (I might do a separate post on loading different types of data and maybe fixing our dataloader at some point.) We just load the train and test sets directly. We are going with MNIST, which is a set of 28x28 grayscale images of handwritten digits from 0 to 9.

using MLDatasets
train_x, train_y = MNIST.traindata()
test_x,  test_y  = MNIST.testdata();
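
As a quick sanity check, the shapes should look like this (the exact element type depends on your MLDatasets version):

size(train_x)   # (28, 28, 60000): 60000 grayscale images of 28x28 pixels
size(train_y)   # (60000,): one integer label per image, 0 through 9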

One hot

Now we need to one hot encode the labels. This means we build a matrix of sorts with 10 rows, and for each example we put a 1 in the row corresponding to its label and leave the rest as 0. We do this for the entire dataset. Another thing to notice is the reshape. For training we take just the first 1000 examples, and we reshape the image array into (n*n, num_of_data) where n is the side length of an image. Here all our images are 28x28, so we get (28*28 = 784, 1000). We then apply one hot to the labels.

images, labels = (reshape(train_x[:,:,1:1000], (28*28, 1000)), train_y[1:1000])
one_hot_labels = zeros(10,length(labels))
for (i,l) in enumerate(labels)
    one_hot_labels[l+1, i] = 1.0
end
labels = one_hot_labels

Rinse and repeat for the test set (here we keep all of the test examples).

test_images = reshape(test_x, (28*28, size(test_x,3)))
test_labels = zeros((10, size(test_x,3)))

for (i,l) in enumerate(test_y)
    test_labels[l+1, i] = 1.0
end
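
Since the train and test loops do exactly the same thing, they could also be folded into a small helper. A minimal sketch (the one_hot name is mine, not from any package):

one_hot(ys, num_classes) = begin
    out = zeros(num_classes, length(ys))
    for (i, y) in enumerate(ys)
        out[y+1, i] = 1.0   # labels run 0-9, Julia indices start at 1
    end
    out
end

labels      = one_hot(train_y[1:1000], 10)
test_labels = one_hot(test_y, 10)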

Constants and activations

We now define ReLU and its derivative. $$\mathrm{ReLU}(x) = \max(0, x), \qquad \frac{\mathrm{d}\,\mathrm{ReLU}(x)}{\mathrm{d}x} = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}$$

We also define some constants: the batch size (number of examples processed at once), the learning rate α, and the number of iterations/epochs. Pixels per image is just 28*28, and we have 10 labels. As for the hidden size, that is the number of units in the hidden layer, i.e. how wide the middle of the network is.

relu(x) = x > 0 ? x : 0
relu2deriv(x) = x > 0 ? 1 : 0
batch_size = 100
α, iterations = (0.001, 100)
pixels_per_image, num_labels, hidden_size = (784, 10, 100)

We initialize our weights with a custom expression here: uniform values in [-0.1, 0.1). (This could be replaced with He initialization, sketched below.)

weights_0_1 = 0.2 .* rand(pixels_per_image,hidden_size) .- 0.1
weights_1_2 = 0.2 .* rand(hidden_size,num_labels) .- 0.1
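
For reference, He initialization would look something like this; it is a sketch that draws from a normal distribution scaled by each layer's fan-in, which tends to play nicely with ReLU:

# He initialization: weights ~ N(0, sqrt(2 / fan_in))
weights_0_1 = randn(pixels_per_image, hidden_size) .* sqrt(2 / pixels_per_image)
weights_1_2 = randn(hidden_size, num_labels) .* sqrt(2 / hidden_size)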

Now for the main loop.

Forward prop + dropout

We iterate for the given number of iterations and, within each iteration, over every batch in the data. The first bit is forward propagation. We take the batch of images as an array, run it through a linear layer and apply relu on it. After that we also add dropout: the bitrand mask (bitrand comes from the Random standard library) randomly sets roughly half of the hidden activations to 0, which regularizes the network and prevents it from becoming over-confident, and the surviving activations are scaled by 2 to keep their expected value the same. We also calculate the total loss using mean squared error and store that away as well.

for j = 1:iterations
    Error, Correct_cnt = (0.0, 0)
    for i = 1:batch_size:size(images,2)-batch_size
        batch_start, batch_end = i, i+batch_size-1
        layer_0 = images[:, batch_start:batch_end]
        layer_1 = relu.(layer_0' * weights_0_1)

        dropout_mask = bitrand(size(layer_1))
        layer_1 .*= (dropout_mask .* 2)
        layer_2 = layer_1 * weights_1_2

        Error += sum((labels[:, batch_start:batch_end]' .- layer_2) .^ 2)

Backprop

Time for backprop. If you notice, all we are doing is calculating the derivatives (the deltas), applying the dropout mask to them, and, most importantly, updating our weights. Note that this is plain gradient descent, so there is no momentum term or anything like that. A simple vanilla network.
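
Written out with the same names as the code (ℓ₀ = layer_0, ℓ₁ = layer_1, y = labels, ŷ = layer_2), the deltas and updates look like this; the constant factor of 2 from differentiating the squared error is folded into α, and because layer_2_delta is target minus prediction, the descent step adds rather than subtracts:

$$\delta_2 = \frac{y - \hat{y}}{\text{batch\_size}}, \qquad \delta_1 = \left(\delta_2 W_{1,2}^{\top}\right) \odot \mathrm{ReLU}'(\ell_1)$$

$$W_{1,2} \leftarrow W_{1,2} + \alpha\, \ell_1^{\top} \delta_2, \qquad W_{0,1} \leftarrow W_{0,1} + \alpha\, \ell_0\, \delta_1$$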

        for k = 1:batch_size
            Correct_cnt += Int(argmax(layer_2[k, :]) == argmax(labels[:, batch_start+k-1]))
            layer_2_delta = (labels[:, batch_start:batch_end]' .- layer_2) ./ batch_size
            layer_1_delta = (layer_2_delta * weights_1_2') .* relu2deriv.(layer_1)

            layer_1_delta .*= dropout_mask

            weights_1_2 += α .* layer_1' * layer_2_delta
            weights_0_1 += α .* layer_0 * layer_1_delta
        end
    end

Entire code

This is technically the whole loop. Every 10 epochs we also calculate the error and accuracy on the test set, which makes it much easier to see whether the network is actually learning. Here is the entire code.

using Random   # bitrand lives in the Random standard library

for j = 1:iterations
    Error, Correct_cnt = (0.0, 0)
    for i = 1:batch_size:size(images,2)-batch_size
        batch_start, batch_end = i, i+batch_size-1
        layer_0 = images[:, batch_start:batch_end]
        layer_1 = relu.(layer_0' * weights_0_1)

        dropout_mask = bitrand(size(layer_1))
        layer_1 .*= (dropout_mask .* 2)
        layer_2 = layer_1 * weights_1_2

        Error += sum((labels[:, batch_start:batch_end]' .- layer_2) .^ 2)

        for k = 1:batch_size
            Correct_cnt += Int(argmax(layer_2[k, :]) == argmax(labels[:, batch_start+k-1]))
            layer_2_delta = (labels[:, batch_start:batch_end]' .- layer_2) ./ batch_size
            layer_1_delta = (layer_2_delta * weights_1_2') .* relu2deriv.(layer_1)

            layer_1_delta .*= dropout_mask

            # the delta is (target - prediction), so gradient descent adds these terms
            weights_1_2 += α .* layer_1' * layer_2_delta
            weights_0_1 += α .* layer_0 * layer_1_delta
        end
    end

    if (j % 10 == 0)
        test_Error, test_Correct_cnt = (0.0, 0)
        for i = 1:size(test_images, 2)

            layer_0 = test_images[:, i]
            layer_1 = relu.(layer_0' * weights_0_1)
            layer_2 = layer_1 * weights_1_2

            test_Error += sum((test_labels[:, i]' .- layer_2) .^ 2)
            test_Correct_cnt += Int(argmax(layer_2[1,:]) == argmax(test_labels[:, i]))
        end
        println("I: $(j) Train error: $(Error/size(images, 2)) Train accuracy: $(Correct_cnt/size(images, 2)) Test-Err:: $(test_Error/size(test_images, 2)) Test-Acc:: $(test_Correct_cnt/size(test_images, 2))")
    end
end
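
Once training is done, the weights can be used directly for inference (note that dropout is only applied during training, not here). A minimal sketch, with a predict helper name of my own choosing:

# Forward pass on a single 784-pixel image, returning the predicted digit 0-9
predict(x) = argmax(vec(relu.(x' * weights_0_1) * weights_1_2)) - 1

predict(test_images[:, 1])   # compare against argmax(test_labels[:, 1]) - 1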

I loved this repository and found a lot of this material there: Link