The VRAM Lie: Why Your 24GB GPU Cannot Run a 7B Model at Full Context

The math is simple: 7B parameters × 2 bytes = 14GB. So why does my 24GB card run out of memory at 8K context?

I have watched this exact scenario play out in Discord servers, GitHub issues, and Reddit threads at least a hundred times. Someone buys an RTX 4090, reads that a 7B model in FP16 is "only" 14GB, and assumes they have 10GB of headroom. Then they fire up vLLM with --max-model-len 32768, paste a long document, and the process dies with a CUDA out-of-memory error. The reaction is always the same: confusion, then anger, then the sinking realization that the GPU box lied to them.

It is not a lie, exactly. It is an omission. The parameter count tells you how much memory the weights need. It does not tell you about the KV cache, the activation buffers, the CUDA context overhead, or the memory fragmentation that happens when you actually try to serve requests. This article is the full accounting.

The Simple Math Everyone Knows

Let us start with the part that is actually simple. A 7B parameter model stored in 16-bit floating point (FP16 or BF16) needs:

7,000,000,000 parameters × 2 bytes = 14,000,000,000 bytes ≈ 13.0 GiB

If you quantize to INT8, you halve that. If you go to INT4, you halve it again. This is the number that gets printed on model cards, benchmark tables, and Reddit comments. It is also the number that misleads people into thinking they can run anything on anything.
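The halving is easy to check with a few lines of Python. This is a minimal sketch of the weight-only arithmetic; the function name weights_gib is made up for illustration, and it assumes every parameter is stored at the same bit width, which real quantized files only approximate because they also carry scales and zero-points.

```python
GIB = 1024**3

def weights_gib(num_params: float, bits_per_param: int) -> float:
    """Weight-only footprint in GiB, assuming a uniform bit width."""
    return num_params * bits_per_param / 8 / GIB

for label, bits in (("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{label:10s} {weights_gib(7e9, bits):5.2f} GiB")
```

The INT4 figure comes out a little under the 3.5 GiB used later in this article; the difference is a rough allowance for quantization metadata.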

The problem is that this 13 GiB is just the weights. It is the model's long-term memory, burned into the file you downloaded from Hugging Face. When you actually run inference, the GPU needs a lot more working memory. Think of it like this: the weights are the engine block, but you still need coolant, fuel lines, and an exhaust system. You cannot drive an engine block.

The Hidden Memory Consumers

Here is where the bill starts adding up. During inference, VRAM is consumed by four main categories beyond the weights themselves.

1. The KV Cache

This is the big one. When a transformer processes a sequence, it computes key and value vectors for every token. For efficiency, it stores these so it does not recompute them for every new token. That stored data is the KV cache, and it grows linearly with context length.

The formula for KV cache memory is:

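KV_cache = 2 × num_layers × num_kv_heads × head_dim × seq_len × batch_size × bytes_per_element

The leading 2 covers keys and values. With standard multi-head attention, num_kv_heads is simply the number of attention heads.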

For a typical 7B model (32 layers, 32 heads, 128 head dim, FP16):

Per token: 2 × 32 × 32 × 128 × 2 bytes = 524,288 bytes = 0.5 MiB per token

At 4K context:  4,096 × 0.5 MiB = 2,048 MiB = 2.0 GiB
At 8K context:  8,192 × 0.5 MiB = 4,096 MiB = 4.0 GiB
At 32K context: 32,768 × 0.5 MiB = 16,384 MiB = 16.0 GiB

Notice what just happened. At 32K context, the KV cache alone is larger than the model weights. Your 13 GiB model just became a 29 GiB model, and we have not even counted the other stuff yet.

Some newer architectures use grouped-query attention (GQA), which shares each key-value head across several query heads and shrinks the KV cache. A model with GQA might use only 8 KV heads instead of 32, cutting the cache by 75%. Mistral-7B works this way. Llama-2-7B and other models built on the same layout still use standard multi-head attention, so for them the full formula applies.
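To see where the cache overtakes the weights, you can solve the formula for sequence length. The sketch below is illustrative only: crossover_tokens is a hypothetical helper, and it assumes 14,000,000,000 bytes of FP16 weights, an FP16 cache, and batch size 1.

```python
def crossover_tokens(weight_bytes: int, layers: int, kv_heads: int,
                     head_dim: int, dtype_bytes: int = 2, batch: int = 1) -> int:
    """Smallest sequence length at which the KV cache is larger than the weights."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes * batch
    return weight_bytes // bytes_per_token + 1

WEIGHT_BYTES = 7_000_000_000 * 2  # 7B parameters at 2 bytes each

print(crossover_tokens(WEIGHT_BYTES, 32, 32, 128))  # 32 KV heads (multi-head attention)
print(crossover_tokens(WEIGHT_BYTES, 32, 8, 128))   # 8 KV heads (GQA)
```

With 32 KV heads the crossover lands at 26,703 tokens, comfortably inside a 32K window. With 8 KV heads it moves out to 106,812 tokens.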

2. Activation Memory

Every layer produces intermediate tensors during the forward pass. These activations are not stored permanently like the KV cache, but they do occupy VRAM while the layer is computing. For a 7B model, activation memory typically ranges from roughly 1 to 4 GiB, growing with batch size and sequence length.

The exact amount varies by framework. vLLM uses PagedAttention to reduce wasted KV cache space, but it still needs working buffers for the forward pass. Ollama and llama.cpp have different allocation strategies. The point is: there is no such thing as zero activation memory.

3. CUDA Overhead and Framework Buffers

CUDA itself eats a chunk of VRAM before you even load a model. The CUDA context, cuBLAS handles, and driver overhead typically consume 300-800 MB. Then your inference framework (vLLM, TGI, llama.cpp) allocates its own management structures, request queues, and scratch buffers.

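vLLM also holds memory back by design. By default it only claims a fraction of the card (its --gpu-memory-utilization setting defaults to 0.9), so roughly 10% of VRAM stays outside its pool as a safety margin. On a 24GB card, that is about 2 GB that never becomes KV cache unless you raise the setting.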

4. Memory Fragmentation

This is the silent killer. GPUs allocate memory in contiguous chunks. When you have a 13 GiB weight block, a 4 GiB KV cache, and a bunch of smaller activation buffers, the allocator starts leaving gaps. Over time, especially with variable-length requests, you end up with plenty of free VRAM but no single chunk large enough to satisfy a new allocation. The result is an OOM error even though nvidia-smi says you have 2 GB free.

vLLM mitigates this with its block-table approach, but fragmentation still happens. Other frameworks are worse. I have seen llama.cpp report 1.5 GB of "free" memory while failing to allocate a 512 MB tensor.
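The effect is easy to reproduce in a toy model. The sketch below is not how CUDA or any real framework allocates memory; ToyAllocator is a made-up first-fit allocator over a flat address space, used only to show how free memory can be plentiful in total yet useless for one large request.

```python
class ToyAllocator:
    """First-fit allocator over a flat address space measured in MiB."""

    def __init__(self, capacity_mib: int):
        self.capacity = capacity_mib
        self.blocks = {}  # name -> (offset, size)

    def gaps(self):
        """Return the free regions as (offset, size) pairs, in address order."""
        free, cursor = [], 0
        for offset, size in sorted(self.blocks.values()):
            if offset > cursor:
                free.append((cursor, offset - cursor))
            cursor = offset + size
        if cursor < self.capacity:
            free.append((cursor, self.capacity - cursor))
        return free

    def alloc(self, name: str, size: int) -> bool:
        """Place the block in the first gap that is large enough."""
        for offset, gap in self.gaps():
            if gap >= size:
                self.blocks[name] = (offset, size)
                return True
        return False

    def free(self, name: str) -> None:
        del self.blocks[name]

    def free_total(self) -> int:
        return sum(size for _, size in self.gaps())

    def largest_gap(self) -> int:
        return max((size for _, size in self.gaps()), default=0)


pool = ToyAllocator(2048)
for i in range(8):                      # eight 256 MiB requests fill the pool
    pool.alloc(f"req{i}", 256)
for i in range(0, 8, 2):                # every other request finishes
    pool.free(f"req{i}")

print("free in total:", pool.free_total(), "MiB")
print("largest contiguous gap:", pool.largest_gap(), "MiB")
print("512 MiB allocation succeeds:", pool.alloc("big", 512))
```

Half the pool is free, but it is split into four 256 MiB holes, so the 512 MiB request fails. Paged designs sidestep this by never asking for one large contiguous region in the first place.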

Real Calculation: A 7B Model at Different Context Lengths

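Let us put it all together. Here is the VRAM budget for a standard multi-head-attention 7B model (the Llama-2-7B architecture in FP16) on a single GPU, using vLLM with a batch size of 1. The stock Llama-2-7B checkpoint is trained for 4K context, so read the longer columns as what a context-extended model of the same shape would need: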

Component          | 4K context | 8K context | 32K context
-------------------|------------|------------|-------------
Model weights      | 13.0 GiB   | 13.0 GiB   | 13.0 GiB
KV cache           |  2.0 GiB   |  4.0 GiB   | 16.0 GiB
Activations        |  1.2 GiB   |  1.8 GiB   |  3.5 GiB
CUDA + framework   |  1.5 GiB   |  1.5 GiB   |  1.5 GiB
Fragmentation pad  |  0.5 GiB   |  0.8 GiB   |  2.0 GiB
-------------------|------------|------------|-------------
TOTAL              | 18.2 GiB   | 21.1 GiB   | 36.0 GiB

At 4K context, your 24GB RTX 4090 is fine. At 8K, you have about 3 GB of breathing room, which feels safe until you remember that batch size 1 is not how anyone actually uses these models. Bump the batch size to 4, and the KV cache quadruples. At 8K context with batch size 4, you are at roughly 33 GiB. Your 24GB card is dead.

At 32K context, even batch size 1 exceeds 24GB. You need a 48GB A6000 or a dual-GPU setup just to run a 7B model at full context. This is the reality that the "7B = 14GB" meme hides.
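The table and the batch-size arithmetic can be reproduced with a small script. This is a sketch built on the figures above, not a measurement: VramBudget is a hypothetical helper, only the KV cache is scaled with batch size (activations would grow too, so the batch-4 totals are optimistic), and the 24GB card is treated as 24 GiB.

```python
from dataclasses import dataclass

GPU_GIB = 24.0  # assumption: 24GB card treated as 24 GiB

@dataclass
class VramBudget:
    weights: float
    kv_cache_batch1: float   # KV cache at batch size 1
    activations: float
    cuda_framework: float
    fragmentation: float

    def total(self, batch_size: int = 1) -> float:
        """Sum the components in GiB, scaling only the KV cache with batch size."""
        return round(
            self.weights
            + self.kv_cache_batch1 * batch_size
            + self.activations
            + self.cuda_framework
            + self.fragmentation,
            1,
        )

# Rows copied from the FP16 table above.
FP16_ROWS = {
    "4K": VramBudget(13.0, 2.0, 1.2, 1.5, 0.5),
    "8K": VramBudget(13.0, 4.0, 1.8, 1.5, 0.8),
    "32K": VramBudget(13.0, 16.0, 3.5, 1.5, 2.0),
}

for name, row in FP16_ROWS.items():
    for batch in (1, 4):
        total = row.total(batch)
        verdict = "fits" if total <= GPU_GIB else "OOM"
        print(f"{name:>3} context, batch {batch}: {total:5.1f} GiB -> {verdict}")
```

Swapping in your own rows, or a different GPU_GIB, gives a quick first check before you start a server.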

Why Quantization Does Not Always Save as Much as You Think

"Just quantize to 4-bit" is the standard reply to any VRAM complaint. And yes, quantization helps. But the savings are not as dramatic as the parameter math suggests, because the KV cache does not shrink.

Here is the same table for an INT4-quantized 7B model:

Component          | 4K context | 8K context | 32K context
-------------------|------------|------------|-------------
Model weights      |  3.5 GiB   |  3.5 GiB   |  3.5 GiB
KV cache (FP16)    |  2.0 GiB   |  4.0 GiB   | 16.0 GiB
Activations        |  1.2 GiB   |  1.8 GiB   |  3.5 GiB
CUDA + framework   |  1.5 GiB   |  1.5 GiB   |  1.5 GiB
Fragmentation pad  |  0.5 GiB   |  0.8 GiB   |  2.0 GiB
-------------------|------------|------------|-------------
TOTAL              |  8.7 GiB   | 11.6 GiB   | 26.5 GiB

At 4K and 8K, quantization is transformative. The 4K configuration fits on a 12GB card, and 8K fits on a 16GB one. But at 32K context, you still need more than 24GB because the KV cache is unchanged. The weights went from 13 GiB to 3.5 GiB, but the total only dropped from 36 GiB to 26.5 GiB. The KV cache dominates at long context, and weight quantization does not touch it.

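KV cache compression is the obvious next step: quantized caches, cache eviction, and streaming attention. Some frameworks already expose part of this (vLLM has an FP8 KV cache option, and llama.cpp can store the cache in quantized formats), but it is usually off by default and can cost accuracy at long context. As of early 2026, if you want 32K context on a consumer GPU, you need either GQA architecture, KV cache compression, or a smaller model.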
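You can also turn the question around and ask how much context a given card can hold. The sketch below does that under stated assumptions: max_context_tokens is a hypothetical helper, the card is treated as 24 GiB, and the non-KV overhead is fixed at the 7.0 GiB from the 32K column of the tables (3.5 activations + 1.5 CUDA and framework + 2.0 fragmentation), which is pessimistic for short contexts.

```python
GIB = 1024**3
OVERHEAD_GIB = 3.5 + 1.5 + 2.0  # 32K-column overhead, used as a conservative constant

def max_context_tokens(vram_gib: float, weights_gib: float, overhead_gib: float,
                       layers: int = 32, kv_heads: int = 32, head_dim: int = 128,
                       kv_dtype_bytes: int = 2, batch: int = 1) -> int:
    """Tokens of KV cache that fit once weights and overhead are paid for."""
    kv_budget_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    if kv_budget_bytes <= 0:
        return 0
    bytes_per_token = 2 * layers * kv_heads * head_dim * kv_dtype_bytes * batch
    return int(kv_budget_bytes // bytes_per_token)

print("FP16 weights, 32 KV heads:", max_context_tokens(24, 13.0, OVERHEAD_GIB))
print("INT4 weights, 32 KV heads:", max_context_tokens(24, 3.5, OVERHEAD_GIB))
print("FP16 weights,  8 KV heads:", max_context_tokens(24, 13.0, OVERHEAD_GIB, kv_heads=8))
```

Under these assumptions INT4 weights alone still stop short of 32,768 tokens, while cutting the KV heads from 32 to 8 reaches it even with FP16 weights. Setting kv_dtype_bytes=1 models an 8-bit cache.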

A Practical Rule of Thumb

After deploying dozens of models across different hardware, here is the heuristic I use:

For FP16 inference, budget 2.5x the model weight size for 4K context, 3x for 8K, and 5x for 32K+. For INT4, use 1.5x, 2x, and 4x respectively.

So a 7B model in FP16:

4K context: 13 × 2.5 = 32.5 GB needed for comfortable operation
8K context: 13 × 3 = 39 GB
32K context: 13 × 5 = 65 GB

Wait, those numbers are higher than my table above. That is because the table assumed batch size 1 and only a thin fragmentation pad. In practice, you want headroom. You want to handle occasional spikes. You do not want your server dying because one user pasted a slightly longer prompt than usual.

These multipliers also account for the fact that you are rarely running at batch size 1 in production. Even a "personal" server usually has 2-4 concurrent requests. The multipliers bake in a small batch size.
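If you want the heuristic in a form you can paste into a sizing script, here is one way to write it. The function name and the bucketing are assumptions for illustration: contexts between the named sizes are rounded up to the next bucket, and anything past 32K uses the 32K+ multiplier.

```python
# Multipliers from the rule of thumb above, keyed by context-length bucket.
MULTIPLIERS = {
    "fp16": {4096: 2.5, 8192: 3.0, 32768: 5.0},
    "int4": {4096: 1.5, 8192: 2.0, 32768: 4.0},
}

def rule_of_thumb(weights_size: float, context_len: int, precision: str = "fp16") -> float:
    """Comfortable VRAM budget, in the same unit as weights_size."""
    table = MULTIPLIERS[precision]
    for limit in sorted(table):
        if context_len <= limit:
            return weights_size * table[limit]
    return weights_size * table[max(table)]  # treat longer contexts as "32K+"

for ctx in (4096, 8192, 32768):
    print(f"FP16 7B at {ctx:>5} tokens: {rule_of_thumb(13, ctx):.1f}")
```

The loop reproduces the 32.5, 39, and 65 figures worked out above.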

Caveats

GQA models (like Mistral, some Qwen variants) have smaller KV caches. You can use 0.6-0.7x the multiplier.
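Multi-GPU tensor parallelism splits the weights and, because attention heads are sharded across cards, the KV cache as well. What it does not split is the per-GPU overhead: every card pays for its own CUDA context, framework buffers, and communication scratch space, and a GQA model with fewer KV heads than GPUs has to replicate KV heads. Two 24GB cards therefore give you noticeably less than 48GB of usable space. Pipeline parallelism and sequence parallelism are the other options if you need to scale context across GPUs.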
CPU offloading (llama.cpp, Ollama) lets you exceed GPU VRAM by paging layers to system RAM. It works, but latency spikes dramatically. Do not use it for real-time APIs.
FlashAttention reduces activation memory but not KV cache size. It is worth enabling, but it will not save you from the long-context cliff.

Tools to Estimate Before You Deploy

Do not guess. Use tools that actually compute this stuff.

1. Our VRAM Estimator: Built specifically for local LLM deployment. Input your model size, quantization, context length, and batch size. It breaks down weights, KV cache, activations, and overhead separately. It also flags when your chosen GPU is insufficient and suggests alternatives.

2. vLLM's startup profiling: Start vLLM with --max-model-len set to your target and read the startup log. vLLM runs a profiling forward pass, preallocates KV cache blocks up to its --gpu-memory-utilization fraction, and reports how much cache capacity that leaves. Do not rely on nvidia-smi alone here: because of the preallocation, it shows the reserved pool rather than what your requests actually need. Whatever capacity you end up with, keep about 20% of it as headroom.

3. The Hugging Face memory estimator: Hugging Face Accelerate ships a model memory estimator (accelerate estimate-memory) that reports the memory needed to load a model at several precisions and to train it with Adam. For inference it covers the weights only, so treat its number as a floor and add the KV cache and overhead yourself.

4. Manual KV cache calculation: If you want to verify the math yourself, inspect the model config for num_hidden_layers, num_key_value_heads (or num_attention_heads if that field is absent), and hidden_size. Divide hidden_size by num_attention_heads to get head_dim. Then plug into the formula above.

Here is a quick Python snippet:

def kv_cache_gib(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """Return the KV cache size in GiB. The leading 2 covers keys and values."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    total = bytes_per_token * seq_len * batch
    return total / (1024**3)

# Llama-2-7B: 32 layers, 32 KV heads, 128 head dim
print(kv_cache_gib(32, 32, 128, 8192))   # 4.0 GiB
print(kv_cache_gib(32, 32, 128, 32768))  # 16.0 GiB

# Mistral-7B with GQA: 32 layers, 8 KV heads, 128 head dim
print(kv_cache_gib(32, 8, 128, 32768))   # 4.0 GiB (75% smaller!)
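The config lookup from step 4 can be scripted as well. The sketch below works on plain dictionaries shaped like a Hugging Face config.json; the two dictionaries are hand-written stand-ins containing only the fields named above, not files read from disk, and kv_bytes_per_token is a hypothetical helper.

```python
def kv_bytes_per_token(config: dict, dtype_bytes: int = 2) -> int:
    """Derive per-token KV cache bytes from config.json-style fields."""
    layers = config["num_hidden_layers"]
    attn_heads = config["num_attention_heads"]
    # Configs without a separate KV head count use standard multi-head attention.
    kv_heads = config.get("num_key_value_heads", attn_heads)
    head_dim = config["hidden_size"] // attn_heads
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Stand-in configs matching the shapes used in this article.
llama2_7b_like = {"num_hidden_layers": 32, "num_attention_heads": 32,
                  "hidden_size": 4096}
mistral_7b_like = {"num_hidden_layers": 32, "num_attention_heads": 32,
                   "num_key_value_heads": 8, "hidden_size": 4096}

for name, cfg in (("MHA", llama2_7b_like), ("GQA", mistral_7b_like)):
    per_token = kv_bytes_per_token(cfg)
    print(f"{name}: {per_token} bytes per token, "
          f"{per_token * 32768 / 1024**3:.1f} GiB at 32K")
```

To use it on a real model, load that model's config.json with json.load and pass the resulting dictionary in.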

The Bottom Line

The GPU manufacturers and the model publishers are not trying to deceive you. They report the numbers that are easy to measure and compare. But easy comparisons are not operational reality.

If you are planning a deployment, do not stop at "7B = 14GB." That is where the analysis starts, not where it ends. Work through the KV cache formula for your target context length. Add activation buffers. Add CUDA overhead. Add fragmentation padding. Then add 20% more because production is never as clean as a spreadsheet.

The 24GB RTX 4090 is an incredible card. It can run a 7B model at 4K context with room to spare. It can even handle 8K if you are careful with batch size. But 32K context? That is not a 7B model problem. That is a hardware problem, and no amount of forum optimism will change the math.

Know the numbers before you buy the hardware. Or at least before you promise your users a context window your GPU cannot deliver.