Mini PyTorch from Scratch — Module 5 (part 1) (2024)


So far we have created all the building blocks (auto-differentiation, loss functions, layers, datasets, transforms). In order to process images efficiently we need layers designed specifically for image processing. Here we introduce and implement one such layer: 2D convolution.

All the source code with samples is available on GitHub (under dark 5).

Before we dive into the topic, I suggest checking out the blog posts in the references, which provide a comprehensive explanation of how convolutions are implemented from scratch. They also feature delightful animations (also used here) that perfectly illustrate the concept.

This article provides a summary explanation of how 2D convolution works, with the following caveats:

  1. The explanation conveys the general idea and the differences between the implementations; details are left to the references.
  2. The shown implementations (CPU/GPU) have been created for readability rather than performance. They may not fully correspond to the provided source code, but the concept is the same.
  3. Image padding is omitted to shorten the shown code.

We will provide both CPU and GPU versions. They are implemented differently: the CPU version in a straightforward way and the GPU one with a few tricks to improve performance. First, we give details on how GPU support is achieved in general.

As NumPy arrays are our basis, it was logical to use a library with a similar API but for the GPU: CuPy [4]. The two share essentially the same API, but CuPy can only be installed if you have an NVIDIA CUDA compatible card.

We will create our own tensor module, which we will import and refer to as dt (dark-tensor). This enables us to use the same code for CPU and CUDA devices. The implementation is shown below:

import os
use_cpu = os.environ.get("USE_CPU") in ["1", "True"]

try:
    if use_cpu: raise Exception("CPU forced")

    from cupy import *
    from .conv_gpu import corr2d
except:
    from numpy import *
    from .conv_cpu import corr2d
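
With this in place, the rest of the code can allocate arrays through dt without caring about the device. A minimal usage sketch (the import path is hypothetical and depends on how the package is laid out in the repository):

import dark.tensor as dt   # hypothetical import path

x = dt.zeros((2, 3, 5, 5))
print(type(x))   # numpy.ndarray on CPU, cupy.ndarray when CuPy is available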

I understand that using torch.nn.Conv2d makes convolution seem like magic, but it is beneficial to grasp the inner workings of convolution algorithms and their implementation in popular DL libraries.

During the convolution process, the kernel slides across the entire input. With each slide, an element-wise multiplication is performed followed by summation to obtain a single value.
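
For a single kernel position this boils down to a dot product between the flattened patch and the flattened kernel. A tiny numeric example (values chosen arbitrarily):

import numpy as np

patch  = np.array([[1, 2, 0], [0, 1, 3], [2, 1, 1]])
kernel = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

print(np.sum(patch * kernel))   # 1 + 1 + 2 + 1 = 5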

Let us look at the following sample: we have an image of size (5, 5) with 3 channels: (5, 5, 3). The kernel has a size of (3, 3) with the same number of channels: (3, 3, 3). When the convolution is performed between the image and the kernel we get a (3, 3, 1) output, since (5 - 3)/1 + 1 = 3, as shown below:

[Figure: convolution of a (5, 5, 3) input with a (3, 3, 3) kernel produces a (3, 3, 1) output]

The implementation shown below consists of only two for loops; the multichannel patch multiplication and summation are done in the last line of the code:

import numpy as np

def _corr2d_im(X, W, stride, out):
    _, kh, kw = W.shape
    oh, ow = out.shape

    for h in range(oh):  # slide the filter vertically
        h_start = h * stride
        h_end = h_start + kh

        for w in range(ow):  # slide the filter horizontally
            w_start = w * stride
            w_end = w_start + kw

            # elementwise multiplication + sum
            out[h, w] = np.sum(X[:, h_start:h_end, w_start:w_end] * W)

Unfortunately, the implementation above (_corr2d_im) is too slow in pure Python, so we need to speed it up. We do that with the Numba library, which optimizes and compiles the code; in most cases it is enough to decorate the function with the @njit decorator. The implementation itself stays the same as above.

from numba import njit, prange

@njit()
def _corr2d_im(X, W, stride, out):
    ...  # the same as the code in the last paragraph

Both the input tensor and the kernel are 4-dimensional: the first dimension of the input is the number of images in the batch, and the first dimension of the kernel is the number of output channels. Therefore, we need two more loops to cover this case. Moreover, the batch loop runs in parallel so that all CPU cores are utilized:

@njit(parallel=True)
def corr2d(X, W, stride):
    tb, tc, th, tw = X.shape
    kb, kc, kh, kw = W.shape
    assert tc == kc

    # define output size
    oh = int((th - kh) / stride) + 1
    ow = int((tw - kw) / stride) + 1
    out = np.zeros((tb, kb, oh, ow))

    for b in prange(tb):  # for each image
        for oc in range(kb):  # for each output channel
            _corr2d_im(X[b, ...], W[oc, ...], stride, out[b, oc, ...])

    return out
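
As a quick sanity check, one output value of the Numba version can be verified by hand on a small random input (a sketch; the array sizes are arbitrary):

import numpy as np

X = np.random.rand(2, 3, 5, 5)   # batch of 2 three-channel 5x5 images
W = np.random.rand(4, 3, 3, 3)   # 4 output channels, 3x3 kernels

out = corr2d(X, W, 1)
print(out.shape)                 # (2, 4, 3, 3)

# cross-check one value: the top-left patch of image 0 against filter 1
manual = np.sum(X[0, :, 0:3, 0:3] * W[1])
assert np.isclose(out[0, 1, 0, 0], manual)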

This implementation is fast enough to be used for training smaller models or for forward evaluation in general.

While this implementation has decent performance on a CPU, unfortunately we cannot use the same elegant implementation on a GPU.

Here we need to embrace the power of vectorization using CuPy and approach the challenge from a different direction: we will express 2D convolution as matrix multiplication, which is itself a highly optimized procedure.

  1. First, our 3D input image is transformed into a 2D format, so that each multichannel image patch is unrolled into a single column. The columns are then stacked side by side. The procedure is known as im2col in MATLAB and is shown below:
[Figure: im2col, where each multichannel image patch is unrolled into one column of the output matrix]

If you have multiple images (N>1), this matrix will have N 2D planes.

The actual implementation of im2col is done in a vectorized way using CuPy, following the idea explained in the referenced blog. Here we give a naive implementation using for loops:

def im2col(X, bh, bw, stride):
    tb, tc, th, tw = X.shape
    vert_blocks = (th - bh) // stride + 1
    horz_blocks = (tw - bw) // stride + 1

    out = dt.zeros((tb, tc * bh * bw, vert_blocks * horz_blocks))

    for b in range(tb):  # for each image in the batch
        itr = 0
        for v_b in range(vert_blocks):  # slide vertically
            for h_b in range(horz_blocks):  # slide horizontally
                v = v_b * stride
                h = h_b * stride
                # unroll the multichannel patch into a single column
                out[b, :, itr] = X[b, :, v: v + bh, h: h + bw].ravel()
                itr += 1

    return out
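
For instance (a small sketch):

X = dt.arange(2 * 3 * 5 * 5).reshape(2, 3, 5, 5).astype("float64")
cols = im2col(X, 3, 3, 1)
print(cols.shape)   # (2, 27, 9): 2 images, 3*3*3 values per patch, 3*3 patch positions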

Notice that a bad property of this procedure is high memory usage: because we unroll every possible sliding-window position, the algorithm is memory bound (especially on GPUs), but it is fast.
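
The blow-up factor is roughly kh * kw. A back-of-the-envelope estimate with arbitrarily chosen sizes:

# rough memory estimate for the im2col matrix (illustrative numbers only)
tb, tc, th, tw = 64, 3, 224, 224                 # input batch
kh = kw = 3                                      # kernel size, stride 1
oh, ow = th - kh + 1, tw - kw + 1

input_elems  = tb * tc * th * tw                 # ~9.6 million elements
im2col_elems = tb * (tc * kh * kw) * (oh * ow)   # ~85 million elements

print(im2col_elems / input_elems)                # ~8.8, roughly kh * kw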

  2. Next, we reshape the kernel. As shown below, each filter is flattened and the flattened filters are stacked together; for N filters (output channels) this gives a matrix with N rows.

kb, kc, kh, kw = W.shape
W_col = W.reshape(kb, -1)
[Figure: each filter is flattened into one row of the kernel matrix]

  3. Finally, the obtained arrays are multiplied as matrices, as shown below. This matrix multiplication is a highly optimized operation.

[Figure: the flattened kernel matrix is multiplied with the im2col matrix]

  4. The result is then reshaped into the final (batch, output channels, height, width) form using the reshape command.

The entire code is shown below:

def corr2d(X, W, stride):
    tb, tc, th, tw = X.shape
    kb, kc, kh, kw = W.shape
    assert tc == kc

    # define output size
    oh = int((th - kh) / stride) + 1
    ow = int((tw - kw) / stride) + 1

    # unroll X and the kernel
    X_col = im2col(X, kh, kw, stride)   # (tb, kc*kh*kw, oh*ow)
    W_col = W.reshape(kb, -1)           # (kb, kc*kh*kw)

    # multiply them (convolve); the matmul broadcasts over the batch dimension
    out = W_col @ X_col                 # (tb, kb, oh*ow)

    # reshape the result
    out = out.reshape(tb, kb, oh, ow)

    return out
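
As a sanity check, the im2col-based version should agree with the Numba loop version from earlier. A sketch, assuming both functions are available in the same session under local aliases (corr2d_loops for the Numba version, corr2d_im2col for the one above), run on the CPU:

import numpy as np

X = np.random.rand(2, 3, 7, 7)
W = np.random.rand(4, 3, 3, 3)

assert np.allclose(corr2d_loops(X, W, 2), corr2d_im2col(X, W, 2))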

So far we have implemented the forward pass, both for CPU and GPU. To avoid a separate backward-pass implementation for strided convolutions, where the gradient has to be spread back out over the input positions, we will use the dilation function shown below:

[Figure: dilation inserts zeros between neighboring elements of the gradient]

And the corresponding implementation:

def dilate(x, stride):
    tb, tc, th, tw = x.shape
    xe = dt.zeros((tb, tc, (th - 1) * stride + 1, (tw - 1) * stride + 1))
    xe[:, :, ::stride, ::stride] = x

    return xe
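
For example, dilating a single-channel 2x2 tensor with stride 2 inserts one zero between neighboring elements:

x = dt.arange(4).reshape(1, 1, 2, 2)
print(dilate(x, 2)[0, 0])
# [[0. 0. 1.]
#  [0. 0. 0.]
#  [2. 0. 3.]]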

This dilation function spreads the gradient out so that we can reuse the same corr2d function for the backward pass. This may be inefficient, but it simplifies the implementation.

Operation

To do the backward pass for the 2D convolution we need to calculate the gradient with respect to the input x and with respect to the kernel k. It is sufficient to say that both gradients are themselves computed with convolutions; a more detailed explanation can be found in the references.
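
Schematically (a sketch matching the code below; the extra zero-padding of the gradient is omitted from the shown code, as noted earlier):

dL/dx = corr2d( pad(dilate(dL/dy)), flip(k) )   # kernel with its input/output channel axes swapped
dL/dk = corr2d( x, dilate(dL/dy) )              # with batch and channel axes swapped on both operands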

class Conv2d(Operation):

    def forward(self, x, k, **kwargs):
        self.stride = kwargs["stride"]
        return dt.corr2d(x, k, self.stride)

    def backward(self, dldy, y, x, k):
        _, _, k_h, k_w = k.shape
        assert k_h == k_w
        dldye = self.dilate(dldy, self.stride)

        # pad the dilated gradient with (k_h - 1) zeros on each side (padding omitted here)
        # and convolve it with the flipped, channel-transposed kernel -> gradient w.r.t. x
        dldx = self.conv2d(dldye, k.transpose((1, 0, 2, 3)), 1)

        # correlate x with the dilated gradient -> gradient w.r.t. k
        dldk = dt.corr2d(x.transpose((1, 0, 2, 3)), dldye.transpose((1, 0, 2, 3)), 1)
        dldk = dt.swapaxes(dldk, 0, 1)

        return dldx, dldk

    def dilate(self, x, stride):
        tb, tc, th, tw = x.shape
        xe = dt.zeros((tb, tc, (th - 1) * stride + 1, (tw - 1) * stride + 1))
        xe[:, :, ::stride, ::stride] = x

        return xe

    def conv2d(self, x, k, stride):
        # flip the kernel spatially to turn correlation into convolution
        k = dt.flip(k, axis=(2, 3))
        result = dt.corr2d(x, k, stride)

        return result

def conv2d(s, k, stride=1):
    return Conv2d.apply(s, k, stride=stride)

Module

The implementation of the corresponding module is just a call to the operation followed by a bias addition.

class Conv2d(Module):
    ...

    def forward(self, input):
        out = dark.conv2d(input, self.weights, self.stride)
        out = dark.add(out, self.bias)
        return out
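
A usage sketch (the constructor arguments and the direct forward call are hypothetical; the actual signature in the repository may differ):

conv = Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1)  # hypothetical signature
y = conv.forward(x)   # x: (N, 3, H, W) -> y: (N, 16, H-2, W-2) for a 3x3 kernel with stride 1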

We have now implemented convolution for both CPU and GPU devices. Before we build some image-based samples we just need to create one more layer, 2D pooling, which is done in the next part.

Let us connect: LinkedIn.

References

[1] https://hackmd.io/@machine-learning/blog-post-cnnumpy-slow
[2] https://hackmd.io/@machine-learning/blog-post-cnnumpy-fast
[3] https://praisethemoon.org/demystifying-the-math-and-implementation-of-convolutions-part-ii/
[4] https://cupy.dev/
