Sparse dot product
💡
Hire: Finishes Part A and, with guidance, can make significant progress on Part B. There are many small details in getting Part B fully correct, and the more of them you can foresee, the better. You are expected to know what a dot product is, have intuition for its representation, and be able to express memory usage algebraically.
Strong Hire: Completes Parts A and B largely autonomously. Makes progress on Part C and is able to make a recommendation for when to use this sparse representation. You are additionally expected to have number sense — to translate the expression from Part B back into a takeaway of tangible value.
A. Implement sparse dot product
Compute the dot product for sparse vectors.
How would you represent the sparse vector? Write a dataclass for your custom format.
Store the indices of the non-zero values and the non-zero values themselves. There are other design choices that will further simplify your program, which you may come across as you begin thinking through the code:
- The lists should be sorted.
- Keep one contiguous list of indices and one contiguous list of values.
```python
from dataclasses import dataclass


@dataclass
class SparseVector:
    indices: list[int]   # sorted positions of the non-zero entries
    values: list[float]  # the non-zero entries, aligned with indices
    length: int          # number of stored (non-zero) entries
```
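To make the storage concrete, here is one way a dense list could be turned into this format. `from_dense` is a hypothetical helper, not part of the prompt; it keeps the indices sorted and, to match the example below, sets `length` to the number of stored entries.

```python
def from_dense(dense: list[float]) -> SparseVector:
    """Hypothetical constructor: keep only the non-zero entries, in index order."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return SparseVector(indices=indices, values=values, length=len(indices))


assert from_dense([0, 1, 0, 0, 2, 0, 3]) == SparseVector([1, 4, 6], [1, 2, 3], 3)
```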
Now, write the dot product, assuming both vectors are represented using your custom format.
```python
def dot(a: SparseVector, b: SparseVector) -> float:
    product = a_pos = b_pos = 0
    while a_pos < a.length and b_pos < b.length:
        a_idx, b_idx = a.indices[a_pos], b.indices[b_pos]
        if a_idx == b_idx:
            # Matching index: accumulate the product and advance both pointers.
            product += a.values[a_pos] * b.values[b_pos]
            a_pos += 1
            b_pos += 1
        elif a_idx < b_idx:
            a_pos += 1
        else:
            b_pos += 1
    return product


a = SparseVector(indices=[1, 4, 6], values=[1, 2, 3], length=3)
b = SparseVector(indices=[2, 3, 6], values=[4, 5, 6], length=3)
assert dot(a, b) == 18.
```
What is the runtime and space complexity of your algorithm?
You should propose a new variable for the number of non-zero values, e.g., K. Your algorithm runs in O(K) time. The input vectors each take O(K) space, but the dot product function itself uses O(1) additional space, since only constant space is allocated for the pointers and the running result.
B. Expression for sparsity savings
Come up with an algebraic expression that tells me when it's worthwhile to use this sparse vector representation — for its space savings — over the naive way.
How many bits would it take to represent the vector naively?
First, compute the size of the naive representation. Assume our vector has length N and that each element in our vector takes B bits to represent. The naive representation takes up $\text{bits}_\text{naive} = NB$ bits.
How many bits would it take to represent the vector using your custom sparse representation?
Second, compute the size of the sparse representation. Assume our vector contains K non-zero values. The values list takes KB bits. Our indices list will also contain K entries, but each integer needs to be able to represent values up to N — so each index would ideally use only $\lceil\text{log}_2(N)\rceil$ bits. In short, our sparse representation requires $\text{bits}_\text{sparse} = K(B + \lceil\text{log}_2(N)\rceil)$ bits (a numeric sketch follows the list of pitfalls below).
- You may accidentally say the indices list is also KB bits. The interviewer should prompt you to reconsider the bit width. Failing that, they'd ask you to assign a concrete value to B, then move on to the next point — proposing a length N that a B-bit integer can't represent.
- You might propose a reasonable bit width for integers, such as a 2-byte uint16. In that case, the interviewer should ask about a length that clearly exceeds the representable maximum: what if my vector has 100,000 values? That exceeds the largest index a uint16 can hold, which is $2^{16} - 1 = 65535$.
- Make sure the $\lceil\text{log}_2(N)\rceil$ term is counted in bits, not bytes. This is crucial: an L-bit number can represent $2^L$ distinct values (unsigned integers up to $2^L - 1$), while an L-byte number can represent far more, $2^{8L}$. The unit matters.
- If you've gotten this far and realized it was $\text{log}(N)$, make sure you're (a) using base 2, not the default natural logarithm; and (b) taking the ceiling, since the logarithm may be non-integral.
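A small sketch of the two formulas (my own illustration, not part of the prompt) makes the pitfalls above easy to check, in particular the base-2 log and the ceiling:

```python
import math


def bits_naive(n: int, b: int) -> int:
    """Dense storage: N elements at B bits each."""
    return n * b


def bits_sparse(n: int, b: int, k: int) -> int:
    """Sparse storage: K values at B bits plus K indices at ceil(log2(N)) bits each."""
    index_bits = math.ceil(math.log2(n))
    return k * (b + index_bits)


# Example: a length-100 vector of 16-bit values with 10 non-zeros.
# Each index needs ceil(log2(100)) = 7 bits.
assert bits_naive(100, 16) == 1600
assert bits_sparse(100, 16, 10) == 10 * (16 + 7)
```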
💡
Most interviews would likely end at some point in the last step, possibly without arriving at the final answer. The majority of the time would be spent discussing and fixing intricacies in the expression, because those details matter for the remainder of the calculations in this question.
Come up with an expression that tells us if we should use the sparse representation for a vector.
We should pick the sparse representation whenever the number of sparse bits is less than the number of dense bits, or equivalently
$$K(B + \lceil\text{log}_2(N)\rceil) < NB$$
If the interviewer were mean, they could ask you: How would you simplify this expression?
- As is, it's impossible to isolate N; as a result, the only option you have off the bat is to isolate K instead. B is usually a fixed value, determined by how the model was pre-trained.
- To further simplify the expression, we'd ideally apply a linear bound to the log term, but that's impossible to do tightly in the general case.
Unfortunately, we can't simplify much, so here's an expression:
$$K < \frac{NB}{B + \lceil\text{log}_2(N)\rceil}$$
On its own, this fraction isn't very informative.
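One way to make it tangible (my own sketch, with values chosen purely for illustration) is to evaluate the break-even sparsity $K/N = B / (B + \lceil\text{log}_2(N)\rceil)$ implied by this inequality for a few (N, B) pairs:

```python
import math


def break_even_fraction(n: int, b: int) -> float:
    """Largest K/N for which the sparse representation is still smaller."""
    return b / (b + math.ceil(math.log2(n)))


# Illustrative values only: wider elements and shorter vectors tolerate more
# non-zeros before the sparse format stops paying off.
print(break_even_fraction(4096, 32))   # ~0.73
print(break_even_fraction(4096, 8))    # 0.40
print(break_even_fraction(65536, 8))   # ~0.33
```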
C. Applicable to Large Language Models?
Let's say that we'd like to store large amounts of Large Language Model activations on disk, and we introduce activation sparsity to save disk space. Would this sparse representation help us save space? Or would storing the dense activations be better?
What is a reasonable sparsity rate for neural network activations, broadly? Note this gives you K.
Most state-of-the-art papers achieve 30-40% sparsity with minimal degradation and exhibit some kind of tradeoff curve for more aggressive sparsity rates. Any answer in the range 20-80% is probably acceptable here. This gives you the chance to show off any knowledge of recent papers you may have. Let's say that 50% sparsity can be achieved with negligible quality degradation, so K = N/2.
For those up-to-date with literature: Since we're working with unstructured sparsity within a single tensor, we're ignoring structured sparsity techniques.
What are typical model dimensions for a Large Language Model? Note this gives you N.
You should know this from Math to memorize. A cloud model may have a model dimension as large as 16384; an on-device model could be as small as 1024. Let's use N=16384 for cloud and N=1024 for on-device.
What are typical datatypes used for inference? Training? This gives you values of B.
Training is usually done in mixed precision, using BF16, FP16, and FP32. Inference is usually done in BF16 or FP16 when not quantized. With quantization, you could see bit widths as low as 4 bits per weight. There are even more aggressively quantized models, but we can ignore those for now. Let's use B=4 bits for inference and B=32 bits for training.
Putting it all together, given what you said above, does the sparse representation actually save space for Large Language Models?
Our next step is to actually plug values into our expression.
- One way to approach this question is to simply start plugging in values directly.
- Extra bonus points if you realize you can first substitute K=N/2, which greatly simplifies the expression before plugging in any numbers.
Our simplified expression is now:
$$\begin{align}
K = \frac{N}{2} &< \frac{NB}{B + \lceil\text{log}_2(N)\rceil}\\
1 &< \frac{2B}{B + \lceil\text{log}_2(N)\rceil}\\
\lceil\text{log}_2(N)\rceil &< B\\
N &< 2^B \quad \text{(ignoring the ceiling for simplicity)}
\end{align}$$
If you don't want to ignore the ceiling, you could
- Note that both sides are integers, so $\lceil\text{log}_2(N)\rceil < B$ is equivalent to $\lceil\text{log}_2(N)\rceil \leq B - 1$.
- Then, because $B - 1$ is an integer, $\lceil\text{log}_2(N)\rceil \leq B - 1$ holds exactly when $\text{log}_2(N) \leq B - 1$, so the ceiling can be dropped without losing equivalence.
This leaves you with the following expression.
$$\text{log}_2(N) \leq B-1 \implies N \leq 2^{B-1}$$
This is much simpler to reason about, and we can quickly arrive at our final conclusion: For training with B=32, this sparse representation is definitely a good idea. At our largest model dimension of N=16384, we trivially satisfy the inequality, since $N=2^{14} \leq 2^{32-1}$. Plugging into our expressions above, the sparse representation needs $\frac{B + \lceil\text{log}_2(N)\rceil}{2B} = \frac{32 + 14}{64} \approx 72\%$ of the naive bits — roughly 28% savings.
For inference, this sparse representation is definitely a bad idea. Even our smallest model dimension of N=1024 would not benefit, since $N=1024\not\leq 2^{4-1}=8$. This is where the expression delivers real-world value: it tells us exactly when our sparse vector representation will actually save on-disk storage space.
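As a final sanity check (a sketch of my own, reusing the bit formulas from Part B), the exact counts agree with the rule of thumb:

```python
import math

# Training-style storage: N = 16384, B = 32, K = N/2 (50% sparsity).
n, b = 16384, 32
naive, sparse = n * b, (n // 2) * (b + math.ceil(math.log2(n)))
print(1 - sparse / naive)  # 0.28125 -> ~28% savings

# Inference-style storage: N = 1024, B = 4, K = N/2.
n, b = 1024, 4
naive, sparse = n * b, (n // 2) * (b + math.ceil(math.log2(n)))
print(sparse > naive)  # True -> the sparse format is larger here
```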