Deconstructing Deep Learning + δeviations
Here we will look at the various ways of implementing convolutions and benchmark them.
Let us first define a large image and a smallish kernel so we can compare the approaches.
img = rand(Float32,1000,1000)
kernel = rand(Float32,15,15);
We already implemented Conv2D once, so let us just take that and run it on these arrays as a baseline. (Later on we will also try a truly crazy 10k x 10k image, just to see what happens.)
# Setup assumed here: stride 1 and a "valid" (no padding) output size.
kernel_h, kernel_w = size(kernel)
stride = 1
result_h = size(img, 1) - kernel_h + 1
result_w = size(img, 2) - kernel_w + 1
result = zeros(Float32, result_h, result_w)
ih, iw = 1, 1
for i in 1:result_h
    for j in 1:result_w
        for k in 1:kernel_h
            for l in 1:kernel_w
                result[i,j] += img[ih+k-1, iw+l-1]*kernel[k,l]
            end
        end
        iw += stride      # slide along the image columns together with the output column
    end
    ih += stride          # slide down the image rows together with the output row
    iw = 1
end
Time taken ->
0.536841 seconds (53.66 k allocations: 10.141 MiB)
Okay that's good but toooo slow xD
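(Quick aside, and not something from the original run: a chunk of that time is probably just because these loops run at global scope. Wrapping the exact same logic in a function lets Julia specialize on the argument types, and that usually helps a lot. A rough sketch of what that could look like:)
# A sketch of the same naive convolution wrapped in a function (assuming stride 1 by default).
function conv2d_naive(img, kernel; stride=1)
    kernel_h, kernel_w = size(kernel)
    result_h = (size(img, 1) - kernel_h) ÷ stride + 1
    result_w = (size(img, 2) - kernel_w) ÷ stride + 1
    result = zeros(eltype(img), result_h, result_w)
    for i in 1:result_h, j in 1:result_w
        ih = (i - 1) * stride + 1      # top-left corner of the current window
        iw = (j - 1) * stride + 1
        acc = zero(eltype(img))
        for k in 1:kernel_h, l in 1:kernel_w
            acc += img[ih + k - 1, iw + l - 1] * kernel[k, l]
        end
        result[i, j] = acc
    end
    return result
end
@time conv2d_naive(img, kernel);   # usually noticeably faster than the same loops at global scope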
Let us remove those loops. First let us remove the inner loop with k and l.
Here is our new code:
# Same setup as above; reset the output and the sliding indices, then let a comprehension do the inner accumulation.
result = zeros(Float32, result_h, result_w)
ih, iw = 1, 1
for i in 1:result_h
    for j in 1:result_w
        [result[i,j] += img[ih+k-1, iw+l-1]*kernel[k,l] for k in 1:kernel_h, l in 1:kernel_w]
        iw += stride
    end
    ih += stride
    iw = 1
end
Time taken ->
34.546620 seconds (1.09 G allocations: 18.201 GiB, 3.21% gc time)
Wow, okay, that blew up fast. That did not work xD. The comprehension builds a brand-new throwaway array for every single output pixel, which is where that 1.09 G allocations and 18 GiB came from.
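(If we really want to get rid of the explicit inner loops, a sum over a generator is probably the better tool, since unlike a comprehension it never materializes a temporary array. A rough sketch of the line that would replace the comprehension inside the j loop, not something benchmarked here:)
# Reduction over a generator: accumulates directly, with no throwaway array per output pixel.
result[i,j] = sum(img[ih+k-1, iw+l-1] * kernel[k,l] for k in 1:kernel_h, l in 1:kernel_w)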
So I did a Google search, and it turns out the way to speed this up is the convolution theorem. Basically it says
$$f * g = \mathcal{F}^{-1}\big\{\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\big\}$$
where $$\mathcal{F}$$ is the Fourier transform. We first take the FT of both the image and the kernel, then perform a pointwise multiplication, and finally take an inverse FT. In practice we compute all of these with the FFT, the fast Fourier transform.
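Just to convince myself the theorem actually holds numerically, here is a tiny 1-D check. (This is a sketch of mine, not from the post; it assumes FFTW is installed, and that both signals are already zero-padded to the same length, because the FFT computes a circular convolution.)
using FFTW

f = [1.0, 2.0, 3.0, 0.0, 0.0]   # signal, zero-padded
g = [0.5, 0.5, 0.0, 0.0, 0.0]   # kernel, zero-padded to the same length
N = length(f)

# Convolution theorem: transform, multiply pointwise, transform back.
via_fft = real.(ifft(fft(f) .* fft(g)))

# Direct circular convolution, for comparison.
direct = [sum(f[mod1(n - k + 1, N)] * g[k] for k in 1:N) for n in 1:N]

via_fft ≈ direct   # true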
The sad part is that to do this, both the image and the kernel need to be of the same size, which means that we need to apply padding.
So I will take a break from this post, write another one on padding, and come back once I have that working.
But before that, I want to write a bit more about the FT. Let us see what it actually does; I am curious.
Here is an awesome video I found, by 3Blue1Brown.
Okay so I managed to solve the padding issue. (Check the next post) For now, I will take the example of constant padding.
using Images, ImageView, Plots, LinearAlgebra, Statistics, FFTW, TestImages
img = rand(Float32,1000,1000)
kernel = rand(Float32,15,15);
function pad_constant(img, kernel, constant=0)
    kernel_h, kernel_w = size(kernel)
    img_h, img_w = size(img)
    # Fill an image-sized array with a constant value (here the sigmoid of `constant`)...
    padded_kernel = ones(img_h, img_w) .* (1 / (1 + exp(-constant)))
    pad_h, pad_w = size(padded_kernel)
    # ...and place the kernel block at its centre (rows first, then columns).
    center_x, center_y = pad_w ÷ 2, pad_h ÷ 2
    tmp_x = center_x - (kernel_w ÷ 2)
    tmp_y = center_y - (kernel_h ÷ 2)
    padded_kernel[tmp_y:tmp_y+kernel_h-1, tmp_x:tmp_x+kernel_w-1] = kernel
    return padded_kernel
end
ker_pad = pad_constant(img, kernel, .3);
@time ifft(fft(channelview(img)).*fft(ker_pad))
Now I am not sure if this is the perfect approach but I will trust the internet for now.
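(One thing I would add, as an assumption on my part rather than something from the benchmark: ifft hands back a complex array, and because the kernel sits in the centre of the padded array instead of at the origin, the result comes back circularly shifted. Something along these lines should give a usable image, give or take an off-by-one from how the kernel was centred:)
out = real.(ifft(fft(channelview(img)) .* fft(ker_pad)))   # drop the (numerically tiny) imaginary part
out = fftshift(out)                                        # roughly undo the shift from centring the kernel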
Speed?
0.063075 seconds (190 allocations: 76.304 MiB)
Wait. WHAT. That is around 8 times faster. What even. (It does make sense, though: the direct method does on the order of N²k² multiplications, while the FFT route is roughly N² log N operations.)
Okay let us try this for a 10000x10000 array then. Just because we can. xD
Okay so our old method took
53.943691 seconds (59.00 k allocations: 763.824 MiB, 0.08% gc time)
And FFT takes
0.075766 seconds (190 allocations: 76.304 MiB, 5.42% gc time)
Um
GPUs were meant for performance, so how about we try our implementation on a GPU? I have an Nvidia RTX 2070, which is pretty great. CUDA (Compute Unified Device Architecture) is Nvidia's parallel computing platform. It allows massive parallelism, which gives us the boost we need for deep learning.
I have never actually used a CUDA array directly and I need to find out how to do it first.
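(For anyone else in the same boat, the basics turn out to be pretty painless. A minimal sketch, assuming CUDA.jl is installed and a GPU is visible:)
using CUDA
x = CUDA.rand(Float32, 4, 4)   # allocate a random array directly on the GPU
y = x .+ 1f0                   # broadcasting runs as a GPU kernel
Array(y)                       # copy the result back to the host when you need it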
Okay, so let us use the FFT from CUDA. We can't use the old loop-based method because it relies on scalar indexing (element-by-element, non-vectorized access), which is painfully slow on a GPU.
using CUDA.CUFFT, CUDA, Images
function pad_constant(img, kernel, constant=0)
    kernel_h, kernel_w = size(kernel)
    img_h, img_w = size(img)
    # CUDA.ones allocates directly on the GPU (Float32 by default).
    padded_kernel = CUDA.ones(img_h, img_w) .* (1 / (1 + exp(-constant)))
    pad_h, pad_w = size(padded_kernel)
    center_x, center_y = pad_w ÷ 2, pad_h ÷ 2
    tmp_x = center_x - (kernel_w ÷ 2)
    tmp_y = center_y - (kernel_h ÷ 2)
    # Range indexing keeps this as one device-side copy instead of scalar indexing.
    padded_kernel[tmp_y:tmp_y+kernel_h-1, tmp_x:tmp_x+kernel_w-1] = kernel
    return padded_kernel   # already a CuArray
end
img = CuArray(channelview(rand(Float32,1000,1000)));
kernel = CuArray(rand(Float32,15,15));
ker_pad = pad_constant(img, kernel, .3);
@time ifft(fft(img) .* fft(ker_pad))
So that took...
0.018873 seconds (851 allocations: 26.781 KiB)
Okay..........
How about a 10k x 10k array?
Oh no.. I do not think my GPU can take this ):
Okay 9k x 9k
23.501186 seconds (973 allocations: 31.203 KiB, 1.67% gc time)
Heh. I guess it took time to send it to the GPU. And bring it back :/ I am confusion.
Ah yes, it was that. Once the data did get to the GPU, it took about 1 second, which means it should take even less on the next run, except my dear ol' GPU won't let me do it for such a huge array.
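(Side note, and again an assumption of mine rather than something measured above: a plain @time on GPU code can also return before the asynchronous kernels finish, and the very first call pays the compilation cost. Something like this should give a fairer number:)
ifft(fft(img) .* fft(ker_pad));            # warm-up run: pays the one-time compilation cost
CUDA.@time ifft(fft(img) .* fft(ker_pad))  # synchronizes before reporting, and also shows GPU allocations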
But I guess I should be satisfied with 1k x 1k. I mean, come on, do we even use such huge images for DL? (Unless you work at Google LOL)