Published Aug 2, 2024 · 7 min read
So far we have created all the building blocks (auto-differentiation, loss functions, layers, datasets, transforms). In order to process images efficiently, we need layers designed specifically for image processing. Here we introduce and implement one such layer: 2D convolution.
All the source code with samples is available on GitHub (under dark 5).
Before we dive into the topic I suggest checking out this blog that provides a comprehensive explanation of how convolutions are implemented from scratch. The blog also features delightful animations (also used here) that perfectly illustrate the concept.
This article provides a summary explanation on how 2D convolution works with the following characteristics:
- The explanation is provided to convey a general idea and differences between implementations. Details are provided as references.
- The shown implementations (CPU/GPU) have been created for readability rather than performance. They may not fully correspond to the provided source code, but the concept is the same.
- Image padding is omitted to shorten the shown code.
We will provide both CPU and GPU versions. They are implemented differently: CPU version in a straightforward way and GPU one with some tricks to improve the performance. First, we give details on how GPU support is achieved in general.
As NumPy arrays are our basis, it was logical to use a library with a similar API but for the GPU: CuPy [4]. The two libraries share nearly the same API, but CuPy can only be installed if you have an NVIDIA CUDA-compatible card.
We will create our own tensor module, which we will import and refer to as dt (dark-tensor). This will enable us to run the same code on CPU and CUDA devices. The implementation is shown below:
import os

use_cpu = os.environ.get("USE_CPU") in ["1", "True"]

try:
    if use_cpu:
        raise Exception("CPU forced")
    from cupy import *
    from .conv_gpu import corr2d
except:
    from numpy import *
    from .conv_cpu import corr2d
I understand that using torch.nn.Conv2d makes convolution seem like magic, but it is beneficial to grasp the inner workings of convolution algorithms and how they are implemented in popular DL libraries.
During the convolution process, the kernel slides across the entire input. With each slide, an element-wise multiplication is performed followed by summation to obtain a single value.
Let us look at the following sample: we have an image of size (5,5) with 3 layers (channels): (5,5,3). The kernel, on the other hand, has a size of (3,3) with the same number of channels: (3,3,3). When the image is convolved with this kernel, we get a (3,3,1) image as shown below:
The implementation shown below consists of only two for loops. The multichannel patch multiplication and summation are done in the last line of the shown code:
def _corr2d_im(X, W, stride, out):
    _, kh, kw = W.shape
    oh, ow = out.shape

    for h in range(oh):  # slide the filter vertically
        h_start = h * stride
        h_end = h_start + kh

        for w in range(ow):  # slide the filter horizontally
            w_start = w * stride
            w_end = w_start + kw

            # elementwise multiplication + sum
            out[h, w] = dt.sum(X[:, h_start:h_end, w_start:w_end] * W)
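As a quick sanity check, the routine can be exercised directly, with NumPy standing in for dt (a small sketch, not part of the library code; the input and kernel values are made up):

```python
import numpy as np

def _corr2d_im(X, W, stride, out):
    _, kh, kw = W.shape
    oh, ow = out.shape
    for h in range(oh):  # slide vertically
        h_start = h * stride
        for w in range(ow):  # slide horizontally
            w_start = w * stride
            out[h, w] = np.sum(X[:, h_start:h_start + kh, w_start:w_start + kw] * W)

# a (1, 4, 4) image of ones and a (1, 2, 2) kernel of ones:
# every 2x2 patch sums to 4
X = np.ones((1, 4, 4))
W = np.ones((1, 2, 2))
out = np.zeros((3, 3))
_corr2d_im(X, W, stride=1, out=out)
print(out)  # every entry is 4.0
```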
Unfortunately, the implementation above (_corr2d_im) is too slow, so we need to speed it up. We did that using the Numba library, which optimizes and compiles the code; in most cases we just need to decorate a function with the @njit decorator. The implementation itself stays the same:
@njit()
def _corr2d_im(X, W, stride, out):
    ...  # the same as the code in the last paragraph
Our tensors and kernels are 4-dimensional: the first dimension of the input tensor is the number of images (the batch size), and the first dimension of the kernel is the number of output channels. Therefore, we need two more loops to cover this case. Moreover, the batch loop is made to run in parallel so that all CPU cores are utilized:
@njit(parallel=True)
def corr2d(X, W, stride):
    tb, tc, th, tw = X.shape
    kb, kc, kh, kw = W.shape
    assert tc == kc

    # define output size
    oh = int((th - kh) / stride) + 1
    ow = int((tw - kw) / stride) + 1
    out = np.zeros((tb, kb, oh, ow))

    for b in prange(tb):  # for each image
        for oc in range(kb):  # for each output channel
            _corr2d_im(X[b, ...], W[oc, ...], stride, out[b, oc, ...])

    return out
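For illustration, here is how the batched version can be exercised end to end. The Numba decorators are dropped so the sketch runs with plain NumPy (prange becomes range); the shapes and values are made up:

```python
import numpy as np

def _corr2d_im(X, W, stride, out):
    _, kh, kw = W.shape
    oh, ow = out.shape
    for h in range(oh):
        for w in range(ow):
            hs, ws = h * stride, w * stride
            out[h, w] = np.sum(X[:, hs:hs + kh, ws:ws + kw] * W)

def corr2d(X, W, stride):
    tb, tc, th, tw = X.shape
    kb, kc, kh, kw = W.shape
    assert tc == kc
    oh = (th - kh) // stride + 1
    ow = (tw - kw) // stride + 1
    out = np.zeros((tb, kb, oh, ow))
    for b in range(tb):       # prange in the Numba version
        for oc in range(kb):  # one output channel per filter
            _corr2d_im(X[b], W[oc], stride, out[b, oc])
    return out

X = np.random.rand(2, 3, 5, 5)  # 2 images, 3 channels, 5x5 pixels
W = np.random.rand(4, 3, 3, 3)  # 4 filters with 3 channels each
out = corr2d(X, W, stride=1)
print(out.shape)  # (2, 4, 3, 3)
```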
Such an implementation is sufficiently fast that it can be used for training smaller models or for forward evaluation in general.
While the CPU implementation performs decently, we unfortunately cannot use the same elegant approach on a GPU.
Here we need to embrace the power of vectorization using CuPy and approach the challenge from a different direction: we will express 2D convolution as matrix multiplication, which itself is a highly optimized procedure.
1. First, each 3D input image is transformed into a 2D format, such that each multichannel image patch is unrolled into a single column. The columns are then stacked together. The procedure is known as im2col in MATLAB, and it is shown below:
If you have multiple images (N>1), this matrix will have N 2D planes.
The actual implementation of im2col is done in a vectorized way using CuPy, following the idea explained in this blog. Here we give a naive implementation using for loops:
def im2col(X, bh, bw, stride):
    tc, th, tw = X.shape  # a single image: (channels, height, width)
    horz_blocks = (th - bh) // stride + 1
    vert_blocks = (tw - bw) // stride + 1

    out = dt.zeros((tc * bh * bw, horz_blocks * vert_blocks))
    itr = 0
    for h_b in range(horz_blocks):
        for v_b in range(vert_blocks):
            h = h_b * stride
            v = v_b * stride
            out[:, itr] = X[:, h:h + bh, v:v + bw].ravel()
            itr += 1

    return out
Notice that a downside of this procedure is high memory usage: we unroll every possible sliding-window position, so the algorithm is memory bound (especially on GPUs), but it is fast.
2. Next, we reshape the kernel: each filter is flattened, and the flattened filters are stacked together. Therefore, for N filters we obtain N stacked rows.
kb, kc, kh, kw = W.shape
W_col = W.reshape(kb, -1)
3. Finally, the obtained 2D arrays are multiplied as matrices as shown below. This is a highly optimized operation.
4. The result is then reshaped into the valid form using the reshape command.
The entire code is shown below:
def corr2d(X, W, stride):
    tb, tc, th, tw = X.shape
    kb, kc, kh, kw = W.shape
    assert tc == kc

    # define output size
    oh = int((th - kh) / stride) + 1
    ow = int((tw - kw) / stride) + 1

    # unroll X and the kernel
    X_col = im2col(X, kh, kw, stride)
    W_col = W.reshape(kb, -1)

    # multiply them (convolve)
    out = W_col @ X_col

    # reshape the result
    out = out.reshape(kb, oh, ow, tb)
    out = out.transpose(3, 0, 1, 2)
    return out
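To convince ourselves that the matmul formulation is equivalent, we can compare it against a direct sliding-window computation for a single image (NumPy standing in for dt; the shapes and random values are made up for this check):

```python
import numpy as np

def im2col(X, bh, bw, stride):
    # naive single-image version, as above, with NumPy
    tc, th, tw = X.shape
    hb = (th - bh) // stride + 1
    vb = (tw - bw) // stride + 1
    out = np.zeros((tc * bh * bw, hb * vb))
    itr = 0
    for h in range(hb):
        for v in range(vb):
            hs, vs = h * stride, v * stride
            out[:, itr] = X[:, hs:hs + bh, vs:vs + bw].ravel()
            itr += 1
    return out

rng = np.random.default_rng(0)
X = rng.random((3, 5, 5))     # one image: (channels, height, width)
W = rng.random((4, 3, 3, 3))  # 4 filters

X_col = im2col(X, 3, 3, stride=1)  # (27, 9)
W_col = W.reshape(4, -1)           # (4, 27)
out = (W_col @ X_col).reshape(4, 3, 3)

# reference: direct sliding-window correlation
ref = np.zeros((4, 3, 3))
for oc in range(4):
    for h in range(3):
        for w in range(3):
            ref[oc, h, w] = np.sum(X[:, h:h + 3, w:w + 3] * W[oc])

assert np.allclose(out, ref)
print("im2col + matmul matches the direct computation")
```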
So far we have implemented the forward pass, both for CPU and GPU. In order to avoid a separate backward-pass implementation in which the gradient is spread out according to the stride, we will use the dilation function shown below:
And the corresponding implementation:
def dilate(x, stride):
    tb, tc, th, tw = x.shape
    xe = dt.zeros((tb, tc, (th - 1) * stride + 1, (tw - 1) * stride + 1))
    xe[:, :, ::stride, ::stride] = x
    return xe
This dilation function spreads out the gradient and enables us to reuse the same corr2d function for the backward pass. This may be inefficient, but it simplifies the implementation.
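For example, dilating a 2x2 gradient with stride 2 inserts zeros between the original values (a standalone sketch with NumPy standing in for dt):

```python
import numpy as np

def dilate(x, stride):
    tb, tc, th, tw = x.shape
    xe = np.zeros((tb, tc, (th - 1) * stride + 1, (tw - 1) * stride + 1))
    xe[:, :, ::stride, ::stride] = x
    return xe

x = np.array([[[[1., 2.],
                [3., 4.]]]])  # shape (1, 1, 2, 2)
xe = dilate(x, stride=2)
print(xe[0, 0])
# [[1. 0. 2.]
#  [0. 0. 0.]
#  [3. 0. 4.]]
```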
Operation
To do the backward pass for 2D convolution we need to calculate the gradient with respect to the input x and with respect to the kernel k. It is sufficient to say that both operations are themselves convolutions; a more detailed explanation can be found here.
class Conv2d(Operation):

    def forward(self, x, k, **kwargs):
        self.stride = kwargs["stride"]
        return dt.corr2d(x, k, self.stride)

    def backward(self, dldy, y, x, k):
        _, _, k_h, k_w = k.shape
        assert k_h == k_w

        dldye = self.dilate(dldy, self.stride)

        # pad the dilated gradient and convolve it with k to get dldx
        dldx = self.conv2d(dldye, k.transpose((1, 0, 2, 3)), k_h - 1, 1)

        # correlate the input with the dilated gradient to get dldk
        dldk = dt.corr2d(x.transpose((1, 0, 2, 3)), dldye.transpose((1, 0, 2, 3)), 1)
        dldk = dt.swapaxes(dldk, 0, 1)

        return dldx, dldk

    def dilate(self, x, stride):
        tb, tc, th, tw = x.shape
        xe = dt.zeros((tb, tc, (th - 1) * stride + 1, (tw - 1) * stride + 1))
        xe[:, :, ::stride, ::stride] = x
        return xe

    def conv2d(self, x, k, padding, stride):
        # zero-pad x by the given amount (padding code omitted for brevity)
        k = dt.flip(k, axis=(2, 3))
        return dt.corr2d(x, k, stride)


def conv2d(x, k, stride=1):
    return Conv2d.apply(x, k, stride)
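As a sanity check on the claim that the kernel gradient is itself a correlation, we can compare the analytic gradient with finite differences for a single-channel, stride-1 case (a standalone NumPy sketch; the loss L = sum(dldy * y) and all values are made up):

```python
import numpy as np

def corr2d(x, k):  # single-channel, stride 1
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for h in range(oh):
        for w in range(ow):
            out[h, w] = np.sum(x[h:h + kh, w:w + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.random((5, 5))
k = rng.random((3, 3))
dldy = rng.random((3, 3))  # upstream gradient, same shape as the output

# analytic gradient w.r.t. k: correlate the input with the upstream gradient
dldk = corr2d(x, dldy)

# numerical gradient of L = sum(dldy * corr2d(x, k)) by central differences
eps = 1e-6
num = np.zeros_like(k)
for i in range(3):
    for j in range(3):
        kp, km = k.copy(), k.copy()
        kp[i, j] += eps
        km[i, j] -= eps
        num[i, j] = np.sum(dldy * (corr2d(x, kp) - corr2d(x, km))) / (2 * eps)

assert np.allclose(dldk, num, atol=1e-5)
print("analytic kernel gradient matches finite differences")
```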
Module
The implementation of the corresponding module is just a call to the operation followed by a bias addition.
class Conv2d(Module):
    ...

    def forward(self, input):
        out = dark.conv2d(input, self.weights, self.stride)
        out = dark.add(out, self.bias)
        return out
We have implemented convolution for both CPU and GPU devices. Before we build some image-based samples, we just need to create one more layer, 2D pooling, which is done in the next part.
Let us connect: LinkedIn.
[1] https://hackmd.io/@machine-learning/blog-post-cnnumpy-slow
[2] https://hackmd.io/@machine-learning/blog-post-cnnumpy-fast
[3] https://praisethemoon.org/demystifying-the-math-and-implementation-of-convolutions-part-ii/
[4] https://cupy.dev/