Convolutions from scratch

💡

Borderline: Complete parts A and just the first question in B. You should know what a convolution is, be able to implement a minimal one-dimensional convolution, and correctly answer questions about the algorithm's complexity.

Hire: Complete parts A, B, and C. You have shown you can complete tensor manipulations that are common in an AI codebase. A strong coder or senior engineer should get the second part of B, but if you miss that, passing C is a decent substitute.

Strong hire: Complete parts A, B, and C, and at least one problem in part D. I do not expect anyone to be able to do this in the time allotted for an interview; maybe only 1-2 interviewees per year manage it.

A. Build a 1D Convolution

Build a 1D convolution from scratch, without external libraries.

What can I use?

I recommend starting off with just Python lists of floats, so we can see the logic explicitly written out. You may not use an external library's convolution implementation, e.g., np.convolve or torch.nn.Conv1d. See Ask to Simplify.

What size is the kernel?

The kernel is 3x1. See Anticipate missing details.

What are the convolution's hyper-parameters?

For simplicity, there is no padding, no stride, no dilation, and no bias. See Anticipate missing details and Explain your suggestion.

Can I assume the kernel is square?

Yes, for simplicity, since most kernels in widely-used models are square.

Solution
# NOTE: start with a simple example that you can easily verify by hand.
# At a minimum, this means manually creating the input like this.
x = [1, 2, 3, 4, 5, 6, 7]

# NOTE: Just as a general tip, avoid names that conflict with built-in
# Python keywords, like `filter` for example. This can cause mayhem later
# on if you forget you redefined a built-in. For a bonus point, you can
# mention that you're naming this `kernel`, precisely for this reason.
kernel = [1, 1, 2]

# NOTE: It's likely no one has time for type annotations, but since I'm
# writing official solutions for you to read, I'll include them for
# completeness.
def conv1d(x: list, kernel: list) -> list:
    kernel_size = len(kernel)
    output_size = len(x) - (kernel_size - 1)

    # TIP: Always allocate the output explicitly, in advance. If you can,
    # avoid .append which may cause dynamic array resizing. In other words,
    # that may cause many runtime memory allocations, which slows down your
    # program. No one will fault you for using .append. Just a pro tip, and
    # extra brownie points if you can explain this in an interview.
    outputs = [0] * output_size

    # NOTE: For your own sanity, do not use `i` for your variable name.
    # Especially for a problem like this where there are nested for loops,
    # vague variable names can get confusing really quickly. This is for
    # your own sake -- not even to show off to the interviewer. 
    for start in range(output_size):
        output = 0
        for offset in range(kernel_size):
            output += x[start + offset] * kernel[offset]
        outputs[start] = output
    return outputs

# NOTE: Since we manually constructed very small inputs, we can also
# manually compute what the output should be. Add a simple 'test' to ensure
# that your program actually works!
assert conv1d(x, kernel) == [9, 13, 17, 21, 25]
Integrate a tensor library with your solution.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
kernel = np.array([1, 1, 2])

def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kernel_size = len(kernel)
    output_size = len(x) - (kernel_size - 1)
    outputs = np.zeros(output_size)

    for start in range(output_size):
        output = 0
        for offset in range(kernel_size):
            output += x[start + offset] * kernel[offset]
        outputs[start] = output

    return outputs

assert np.array_equal(conv1d(x, kernel), np.array([9, 13, 17, 21, 25]))

B. Algorithmic complexities

Let's assess the current performance of your algorithm.

💡

You generally build ethos with the interviewer for automatically anticipating these questions.

What is the time complexity and space complexity of your algorithm?

Say you have $n$ inputs and a kernel of size $k$. For each of the $O(n)$ output positions, we iterate over exactly $k$ values of both the input and the kernel. As a result, our runtime is $O(nk)$. For space complexity: we didn't allocate any intermediate arrays besides the output, so at a rough level, our algorithm takes $O(n)$ memory, all of it for the output array.

Can we improve our time complexity? How would you do it?

No, we cannot improve our time complexity. This is inherent to the definition of a convolution: for every one of the $n$ outputs, we need to access $k$ inputs and all $k$ values in the kernel, which means $O(nk)$ operations.

Can we improve our space complexity? How would you do it?

Yes. We could reuse our input to contain our output. After convolving the first $x_0, x_1, x_2$ inputs with the kernel, the first input $x_0$ is never used again. As a result, we can store the output $y_0$ where $x_0$ used to be. Then, like before, after convolving the next $x_1, x_2, x_3$ inputs with the kernel, the second input $x_1$ is never used again. Then, again like before, we can store the output $y_1$ where $x_1$ used to be.

💡

In most cases, the interview ends right here, because there's no time to actually implement the optimized version. At a bare minimum, you should be able to report the complexity.

Write the most algorithmically optimal version of your algorithm.

Assuming an input of size $n$ and kernel size $k$, this program runs in $O(nk)$ time but uses $O(1)$ memory, since we simply reuse the input for our output.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
kernel = np.array([1, 1, 2])

def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kernel_size = len(kernel)
    output_size = len(x) - (kernel_size - 1)

    for start in range(output_size):
        output = 0
        for offset in range(kernel_size):
            output += x[start + offset] * kernel[offset]
        # Store the output where the input used to be: x[start] is never
        # read again by any later window.
        x[start] = output

    # NOTE: numpy typically returns a *view* of the original array, without
    # actually creating a copy. In this case, we know the memory layout is
    # guaranteed to be contiguous, since we're slicing a 1d array. As a
    # result, there's no need to make sure this array is contiguous.
    return x[:output_size]

assert np.array_equal(conv1d(x, kernel), np.array([9, 13, 17, 21, 25]))
Why is this optimization a bad idea for a convolution, generally? Hint: A general convolution needs to support extra features like padding, dilation, stride etc.

Before, we assumed that after using the first input just once, we would never use it again. As a result, we could store the first output in the input's first position.

However, that assumption breaks down if we implicitly pad the input while convolving. Let me explain what I mean. Let's look at the input, padded. Since $k = 3$ in our example, we pad with $k - 1 = 2$ zeros on both sides.

[0, 0, 1, 2, ..., 6, 7, 0, 0]

Now, run a convolution normally, and you'll see that we first convolve the kernel with [0, 0, 1], then [0, 1, 2], then [1, 2, 3]. In other words, we reuse the first input $x_0 = 1$ multiple times, so after our first kernel product, we can't just replace the first input. We need it two more times!
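To make that concrete, here is a minimal sketch, using the same toy input and kernel as above, that simply enumerates the first few padded windows; the variable names are illustrative.

x = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 2]
pad = len(kernel) - 1

# Pad with kernel_size - 1 = 2 zeros on both sides, as described above.
padded = [0] * pad + x + [0] * pad

# The first input x[0] = 1 appears in the first three windows, so we cannot
# overwrite it with the first output after the first kernel product.
windows = [padded[i: i + len(kernel)] for i in range(3)]
print(windows)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]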

C. Performance optimization

Without changing the algorithmic complexity, what are some ways to make the convolution run faster?

💡

Generally speaking, you don't need to know performance optimizations at a deep level. However, there are two important exceptions: you must know how to vectorize programs, and you must know how to batch programs. Using JAX¹ somewhat obviates the need for both of these, but there are different challenges in writing a JAX program, which we'll cover in a different exercise.

Vectorize the convolution.

There are two important parts to note:

  1. It's most likely not worthwhile to vectorize the outer loop, because you'd have to explicitly materialize a large (x_size, kernel_size) array. In this example, all arrays are small, so these costs are negligible, but at scale, that array is too large.
  2. You could batch the kernel products, for example by computing outputs[i:i+4] all together. Now you're only materializing a (4, kernel_size) array instead of the full (x_size, kernel_size) one. If you suggest such an option, make sure you identify the questionable trade-off: it's unclear whether allocating a larger intermediate array is worth the savings you get from batching an element-wise product. A sketch of this variant appears after the vectorized solution below.


import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
kernel = np.array([1, 1, 2])

def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kernel_size = len(kernel)
    output_size = len(x) - (kernel_size - 1)
    outputs = np.zeros(output_size)

    for i in range(output_size):
        # Vectorize the inner loop: one element-wise product and sum
        # replaces the per-offset accumulation.
        outputs[i] = np.sum(x[i: i + kernel_size] * kernel)

    return outputs

assert np.array_equal(conv1d(x, kernel), np.array([9, 13, 17, 21, 25]))
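The following is a minimal sketch of the blocked option described in point 2, assuming a hypothetical block size of 4. The name conv1d_blocked and the block parameter are illustrative choices, not part of the solution above.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
kernel = np.array([1, 1, 2])

def conv1d_blocked(x: np.ndarray, kernel: np.ndarray, block: int = 4) -> np.ndarray:
    kernel_size = len(kernel)
    output_size = len(x) - (kernel_size - 1)
    outputs = np.zeros(output_size)

    for i in range(0, output_size, block):
        stop = min(i + block, output_size)
        # Materialize only a (block, kernel_size) window matrix at a time,
        # instead of the full (output_size, kernel_size) matrix.
        windows = np.stack([x[j: j + kernel_size] for j in range(i, stop)])
        outputs[i:stop] = np.sum(windows * kernel, axis=-1)

    return outputs

assert np.array_equal(conv1d_blocked(x, kernel), np.array([9, 13, 17, 21, 25]))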
Support batches in the convolution.

First, let's write a variant that minimally adds the batch dimension. We can add an extra for loop that iterates over all samples in the batch.

import numpy as np

x = np.array([  # 2x4
    [1, 2, 3, 4],
    [5, 6, 7, 8]
])
kernel = np.array([[1, 1, 2]]) # 1x3

def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kernel_size = kernel.shape[-1]
    batch_size = x.shape[0]
    output_size = x.shape[1] - (kernel_size - 1)
    outputs = np.zeros((batch_size, output_size))

    for sample in range(batch_size):
        for i in range(output_size):
            outputs[sample, i] = np.sum(
                x[sample, i: i + kernel_size] * kernel)

    return outputs

assert np.array_equal(conv1d(x, kernel), np.array([
    [9, 13],
    [25, 29],
]))

However, we should really be vectorizing over the batch dimension. The following accomplishes the same thing more efficiently, since we handle all samples in a batch together.

import numpy as np

x = np.array([  # 2x4
    [1, 2, 3, 4],
    [5, 6, 7, 8]
])
kernel = np.array([[1, 1, 2]]) # 1x3

def conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kernel_size = kernel.shape[-1]
    batch_size = x.shape[0]
    output_size = x.shape[1] - kernel_size + 1
    outputs = np.zeros((batch_size, output_size))

    for i in range(output_size):
        # Compute this output position for every sample in the batch at once.
        outputs[:, i] = np.sum(x[:, i: i + kernel_size] * kernel, axis=-1)

    return outputs

assert np.array_equal(conv1d(x, kernel), np.array([
    [9, 13],
    [25, 29],
]))
Unroll the for loop.

There are a number of optimizations like this one that can improve the algorithm's performance. This is outside the scope of this interview though, because these optimizations are not specific to convolutions or machine learning.
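That said, if you want a concrete picture, here is a minimal sketch of unrolling the inner loop, assuming the kernel size is fixed at 3. The name conv1d_unrolled is an illustrative choice.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
kernel = np.array([1, 1, 2])

def conv1d_unrolled(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Assumes the kernel has exactly 3 taps, so the inner loop over the
    # kernel can be written out explicitly.
    k0, k1, k2 = kernel
    output_size = len(x) - 2
    outputs = np.zeros(output_size)
    for start in range(output_size):
        outputs[start] = x[start] * k0 + x[start + 1] * k1 + x[start + 2] * k2
    return outputs

assert np.array_equal(conv1d_unrolled(x, kernel), np.array([9, 13, 17, 21, 25]))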

💡

Most interviews end here, if they haven't already ended at an earlier step. This is way more than enough material for most interviewees to get through in a 35-minute period. For practice with AI-related code, you should finish all parts of this exercise.

D. Complete convolution

Let's now complete the convolution implementation in its most general form. For simplicity, start from your solution back in Part A.

Implement padding.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
kernel = np.array([1, 1, 2])

def conv1d(
    x: np.ndarray,
    kernel: np.ndarray,
    padding: bool = False,
) -> np.ndarray:
    kernel_size = len(kernel)
    # With padding, we pad with kernel_size - 1 zeros on both sides, so
    # there are len(x) + kernel_size - 1 output positions.
    output_size = (len(x) + kernel_size - 1) if padding else len(x) - (kernel_size - 1)
    start_offset = -(kernel_size - 1) if padding else 0
    outputs = np.zeros(output_size)

    for i in range(output_size):
        start = start_offset + i
        output = 0
        for offset in range(kernel_size):
            index = start + offset
            # Treat out-of-bounds positions as implicit zeros.
            x_i = x[index] if 0 <= index < len(x) else 0
            output += x_i * kernel[offset]
        outputs[i] = output

    return outputs

assert np.array_equal(
    conv1d(x, kernel, True),
    np.array([2, 5, 9, 13, 17, 21, 25, 13, 7]))
Implement dilation.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
kernel = np.array([1, 1, 2])

def conv1d(x: np.ndarray, kernel: np.ndarray, dilation: int = 1) -> np.ndarray:
    kernel_size = len(kernel)
    output_size = len(x) - (kernel_size - 1) * dilation
    outputs = np.zeros(output_size)

    for start in range(output_size):
        output = 0
        for k in range(kernel_size):
            # With dilation, consecutive kernel taps are `dilation` apart in
            # the input, but the kernel itself is still indexed densely.
            output += x[start + k * dilation] * kernel[k]
        outputs[start] = output

    return outputs

assert np.array_equal(conv1d(x, kernel, dilation=2), np.array([14, 18, 22]))
Implement stride.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7])
kernel = np.array([1, 1, 2])

def conv1d(x: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    kernel_size = len(kernel)
    # With stride, we only place the kernel every `stride` positions.
    output_size = (len(x) - kernel_size) // stride + 1
    outputs = np.zeros(output_size)

    for i in range(output_size):
        start = i * stride
        output = 0
        for offset in range(kernel_size):
            output += x[start + offset] * kernel[offset]
        outputs[i] = output

    return outputs

assert np.array_equal(conv1d(x, kernel, stride=2), np.array([9, 17, 25]))
Implement a two-dimensional convolution. Do not implement stride, dilation, or padding. Keep the implementation focused on the core convolutional operation.
import numpy as np

x = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
kernel = np.array([
    [1, 1],
    [1, 2]
])

def conv2d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kernel_size = len(kernel)
    output_size_y = x.shape[0] - (kernel_size - 1)
    output_size_x = x.shape[1] - (kernel_size - 1)
    outputs = np.zeros((output_size_y, output_size_x))

    for iy in range(output_size_y):
        for ix in range(output_size_x):
            output = 0
            for ky in range(kernel_size):
                for kx in range(kernel_size):
                    output += x[iy + ky, ix + kx] * kernel[ky, kx]
            outputs[iy, ix] = output

    return outputs

assert np.array_equal(conv2d(x, kernel), np.array([
    [17, 22],
    [32, 37]
]))

That's it! You've now implemented convolutions from scratch, along with a bunch of commonly-used convolutional hyperparameters.


  1. In order for JAX to automatically vectorize and batch-ify a program, there are many restrictions. One of those is that you cannot use conditional logic. In other words, no if or while statements. To practice this, try Preparing argmax for training.