Numbers to memorize

There are several common numbers worth knowing for Large Language Models. Note that no one will ever explicitly quiz you on them, so you shouldn't actually sit down and memorize all of this information.

Instead, build familiarity with the rough range of potential values, as though you've spent months seeing these numbers over and over again. This can be especially helpful when you need to suggest default values, as in . Whether in an interview or in your own projects, you'll need a sense of what batch sizes, prompt lengths, etc. are reasonable to use.

Here's a brief anecdote from personal experience about why this guide is important, both for your own sanity and for sounding reasonable in an interview.

One of the first times I ever downloaded a deep learning model to run locally on my laptop, I initialized an FP32 tensor to hold a 4K image. That's 3840x2160, or about 8 million pixels; at 4 bytes per value, that's roughly 33 MB, which isn't so bad. However, I then fed it into a segmentation model, whose compute and memory requirements scale quite poorly with input resolution. Little did I know.

For context, at the time, segmentation models were usually trained on 256x256 crops of images. At test time, we would run inference on approximately 1024x512 images, or about half a million pixels, even on server-grade, world-class GPUs.[^1] So in short, I was using my MacBook to segment images 16x bigger than the ones researchers were segmenting on server-grade GPUs! That's okay for learning purposes, but don't make this mistake in an interview.

Systems

There are several defaults to keep in mind for systems, and really for all of computer science. These concepts just become exceptionally important given how large Large Language Models are.

Tip #1. Almost everything is a power of two. This goes without saying, but if you're guessing wildly, guess a power of two, or at least a multiple of a large power of two. Combine that with a rough intuition for the range of feasible values, and you'll find yourself pretty spot on.

  • For example, how much GPU RAM does your own laptop have? Most likely, it's 8, 16, or 32 GB of RAM. Have a gaming console or desktop? Most likely 16 or 32 GB of VRAM.
  • If you took this knowledge and guessed the HBM2 memory available for a V100, knowing that it was a server-grade top-of-the-line GPU back in 2017, you could then potentially guess the right answer — 32 GB.

Tip #2. Know how to translate between counts and byte prefixes. Most of the numbers in this guide range from the millions to the billions. This tip may seem silly, but much of your napkin math will rely on it: million to mega, billion to giga.

  • For example, say you see a 30-billion parameter model. At 2 bytes per parameter in FP16, you should immediately translate this into 60 GB of memory consumed; now you know this won't fit on a single 32 GB V100, and you may need to commandeer an 80 GB A100 instead.
  • Say you see a model dimension of 16,384 with a vocabulary size of 128,000. With a calculator on hand, you should be able to translate this into about 4 GB for the embedding weights alone. Now you know that an M3 Air with 2,109 MB/s of SSD bandwidth would take about 2 seconds just to load the embedding weights from disk (see the sketch after this list).
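To make the napkin math concrete, here's a minimal sketch of the two examples above in Python. The 2-bytes-per-parameter and 2,109 MB/s figures are the ones quoted in the text, so treat them as rough assumptions rather than exact hardware specs.

```python
def params_to_gb(num_params: float, bytes_per_param: float = 2) -> float:
    """Weight memory in GB, assuming FP16/BF16 at 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9

# Example 1: a 30B-parameter model in FP16 -> ~60 GB, too big for a 32 GB V100.
print(f"{params_to_gb(30e9):.0f} GB")

# Example 2: embedding table with d_model = 16,384 and vocab = 128,000.
embedding_params = 16_384 * 128_000            # ~2.1B parameters
embedding_gb = params_to_gb(embedding_params)  # ~4.2 GB in FP16
ssd_gb_per_s = 2.109                           # assumed ~2,109 MB/s SSD read speed
print(f"{embedding_gb:.1f} GB, ~{embedding_gb / ssd_gb_per_s:.1f} s to read from disk")
```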

We'll dive into those examples in more detail later, but for now, the takeaway is high level: number knowledge is critical, guess in powers of two, and know your prefixes.

There are several specific numbers to know in systems. I'll provide two buckets of numbers: numbers for end consumers running models locally, and state-of-the-art numbers for anyone serving models in the cloud.

GPU RAM: For server-grade GPUs, A100-SXMs and H100s offer 80 GB of GPU RAM, and H200s offer 141 GB. For end consumers, VRAM usually ranges from 8 GB to 24 GB, with M3 and M4 Macs offering 16 GB of unified memory and high-end GPUs like the 3090 or 4090 offering 24 GB.

CPU RAM: For cloud, CPU RAM is effectively "endless" and rarely the bottleneck — ranging from 128 GB to 256 GB. For end consumers, CPU RAM usually ranges from 16 to 32 GB.

Cache size: For server-grade GPUs, L1 cache and shared memory are around 128 to 256 KB per streaming multiprocessor. L2 cache is usually in the tens of megabytes: A100s have 40 MB, and H100s and H200s have 50 MB. These are important limits to keep in mind if you're trying to determine how fast a model can run and how much it'll cost. In the simplest case, if a matrix multiply fits entirely in L2, it'll run much faster than if it doesn't (see the sketch below).
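To see why the L2 number matters, here's a rough sketch that checks whether all three operands of a matrix multiply fit in L2 at once. The 50 MB figure is an assumed H100-class cache size, and real kernels tile their operands, so this is only a first-order check.

```python
def matmul_working_set_mb(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Approximate bytes touched by C = A @ B, with A:(m,k), B:(k,n), C:(m,n) in FP16."""
    return (m * k + k * n + m * n) * bytes_per_elem / 1e6

L2_MB = 50  # assumed L2 cache size for an H100-class GPU

for size in (1_024, 4_096, 16_384):
    ws = matmul_working_set_mb(size, size, size)
    verdict = "fits in L2" if ws <= L2_MB else "spills to HBM"
    print(f"{size:>6} x {size:<6} matmul: ~{ws:10,.0f} MB of operands -> {verdict}")
```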

Model Architecture

There are a number of values in the architecture that impact your day-to-day as a practitioner; let's start with the most common ones:

Parameter count: Based on open-source models alone, most Large Language Models deployed to run on-device range from 1 billion to 7 billion parameters. Most deployed to the cloud range from 7 billion to 70 billion parameters.

  • There are exceptions. A few models go even larger, like the older BLOOM 175B and PaLM 540B, or the newer Llama 3.1 405B. A few go smaller, like the OPT family, which goes as small as 125 million parameters. The general guidelines above cover most models.
  • Generally speaking, model weights are stored in FP16 or BF16, which uses 2 bytes per parameter. As a result, you can get a rough sense of memory usage just by multiplying the parameter count by 2. For example, a 7B model would take about 14 GB of RAM just for its weights (see the sketch after this list).
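As a quick sanity check on the multiply-by-2 rule, here's a small sketch that tabulates weight memory for a few common parameter counts and precisions. The byte counts per precision are standard conventions, not tied to any particular model.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for billions in (1, 7, 70):
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    sizes = {dtype: f"{billions * b:,.1f} GB" for dtype, b in BYTES_PER_PARAM.items()}
    print(f"{billions:>3}B params: {sizes}")
# e.g. 7B params at fp16/bf16 -> 14.0 GB, matching the rule of thumb above.
```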

Model dimension: This is the dimensionality of each token's embedding. Following the convention set in , your tensor would look like (batch_size, seq_len, d_model) right after the embedding layer. On-device models range from 1024 to 4096. Cloud models range from 4096 to 16384.

  • In the feed-forward network, the hidden dimension is often 2-5x the model dimension. In the original transformer architecture, it was 4x. In more modern architectures such as Llama and DeepSeek, it hovers around 2-3x.[^2]
  • The number of heads, multiplied by the dimensionality per head, tends to equal the model dimension. As a result, $W_Q, W_K, W_V$ tend to be square, with size d_model x d_model (see the sketch after this list).
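Putting these conventions together, here's an illustrative sketch of the weight shapes in one simplified transformer block, assuming a dense FFN with the classic 4x multiplier and standard multi-head attention. Real architectures (gated FFNs, grouped-query attention) deviate from this, so treat the shapes as a baseline rather than any specific model's layout.

```python
def block_weight_shapes(d_model: int, n_heads: int, ffn_mult: int = 4) -> dict:
    """Weight shapes for one simplified transformer block (no biases)."""
    d_head = d_model // n_heads  # heads x head_dim adds back up to d_model
    d_ff = ffn_mult * d_model    # original transformer used a 4x hidden dimension
    return {
        "W_Q / W_K / W_V / W_O": (d_model, d_model),  # square: n_heads * d_head == d_model
        "FFN up":                (d_model, d_ff),
        "FFN down":              (d_ff, d_model),
        "d_head":                d_head,
    }

# A cloud-scale example: d_model = 8,192 split across 64 heads of 128 dims each.
print(block_weight_shapes(8_192, 64))
```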

Vocabulary size: Most modern models accommodate multiple languages, using a vocabulary size of somewhere from 128,000 to 256,000. Previous models supporting just English used a vocabulary size as small as 50,000.

Context size: We previously called this "sequence length" in , as the two terms are synonymous. Generally, maximum context lengths now range from about 128,000 to 256,000 tokens for open-source and proprietary models alike.

Combining your knowledge of the architecture here with the systems knowledge from above, you can get a rough sense of how much it'd cost to serve a certain number of requests, given a number of input and output tokens.
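As an example of that kind of estimate, here's a sketch for a hypothetical 70B-class dense model with made-up but plausible dimensions, standard multi-head attention, and FP16 weights and KV cache. Real deployments use grouped-query attention, quantization, and paged KV caches, all of which change these numbers substantially, so this is strictly illustrative.

```python
def kv_cache_gb(n_layers: int, d_model: int, context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache per request for standard multi-head attention (one K and one V
    vector of size d_model per layer per token). GQA/MQA shrink this a lot."""
    return 2 * n_layers * d_model * context_len * bytes_per_elem / 1e9

# Hypothetical 70B dense model: 80 layers, d_model = 8,192, served at 8K context.
weights_gb = 70e9 * 2 / 1e9                     # ~140 GB of FP16 weights
per_request_gb = kv_cache_gb(80, 8_192, 8_192)  # ~21 GB of KV cache per request
hbm_gb = 4 * 80                                 # four 80 GB GPUs
concurrent = int((hbm_gb - weights_gb) // per_request_gb)
print(f"{per_request_gb:.0f} GB KV cache per request, ~{concurrent} concurrent requests")
# ~21 GB each -> roughly 8 concurrent 8K-context requests on 4 GPUs; from here,
# GPU-hours divided by tokens served gives a rough cost per token.
```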

Training

To get a rough sense of how much data is involved, and therefore how much it costs to train, fine-tune, and evaluate, see the dataset sizes below.

Words to tokens: On average, one English word converts to roughly 1.3 tokens with subword tokenization. That means 1,000 words typically correspond to about 1,300 tokens—a useful conversion when estimating context size and memory requirements.
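As a tiny sketch of this conversion in use (the 1.3x ratio varies by tokenizer and language, so treat it as a rough estimate):

```python
WORDS_TO_TOKENS = 1.3  # rough average for English text with subword tokenizers

def estimated_tokens(num_words: int) -> int:
    return round(num_words * WORDS_TO_TOKENS)

print(estimated_tokens(1_000))    # ~1,300 tokens
print(estimated_tokens(100_000))  # ~130,000 tokens: a book-length document already
                                  # pushes up against a 128K context window
```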

Pre-training Datasets: Pre-training datasets are enormous—often spanning tens to hundreds of billions of tokens. They aggregate data from diverse sources such as Common Crawl, StackOverflow, The Pile, and Wikipedia. Combined, as in RedPajama, these sources can even reach trillions of tokens.

Code Datasets: For models targeting programming tasks, code datasets like CodeSearchNet provide billions of tokens drawn from public source code repositories, enabling robust learning of coding patterns and syntax.

Instruction-tuning Datasets: Datasets for instruction tuning are usually much smaller than full pre-training corpora. They typically contain on the order of 1–10 million tokens. For example, collections like Stanford Alpaca and Self-Instruct fall within this range.

Evaluation: Evaluation benchmarks such as MATH and GSM8K typically range from tens of thousands to hundreds of thousands of tokens, with the largest, such as MMLU, reaching a few million tokens.

To date, there isn't a straightforward way to translate tokens into training cost, simply because there are many variables that come into play — the efficiency of your training library, your training hyper-parameters themselves, etc.

Takeaways

In short, know your numbers for systems, model architecture, and training. It'll save you hassle down the road, and during interviews, you'll have reasonable defaults to suggest. Again, don't memorize numbers for individual datasets and models. You're just looking for reasonable ranges of values — information in aggregate.


  1. The KITTI test set, used most often for self-driving papers, featured images at 1242 x 375 pixels, which was about half a million pixels. Other datasets like ADE20k featured images no bigger than this as well. Even modern datasets like SA-V still provide training and validation datasets in 1401x1037, or about 1.5 million pixels. So in short, even by today's standards, I was running on images well beyond what state-of-the-art models were trained to segment. 

  2. DeepSeek-V3 uses a hidden dimension of 2048 (config) per expert and activates 8 experts per forward pass, making an effective mixture-of-experts hidden dimension of 16,384. Compared with a model dimension of 7168, this is 2.28x. Llama3 uses ratios of about 3.25 to 3.5, based on the model size (wikipedia).