Fixing vLLM OOM Errors: Reading the Logs Like a Detective

It was 11:47 PM on a Thursday. I had been trying to get vLLM to serve Qwen3-32B-Instruct for three hours, and the process kept dying with one word: Killed. No Python traceback. No CUDA error. No friendly message saying "hey, you ran out of memory." Just that single word, and then silence.

I was tired. I was annoyed. And I was about to learn that debugging OOM errors is less about reading error messages and more about reading the absence of them.

The Crime Scene

Here is what I saw in my terminal at 23:47:12:

$ python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92

INFO 04-23 23:45:33 [config.py:576] Using fp16 data type.
INFO 04-23 23:45:34 [parallel_state.py:947] Initializing tensor model parallel with size 2
INFO 04-23 23:45:35 [weight_utils.py:243] Using model weights format ['*.safetensors']
INFO 04-23 23:45:36 [loader.py:402] Loading weights onto GPU. This may take a few minutes...
Killed

That was it. The entire output. After two minutes of hopeful waiting, watching the weight loading progress bar climb, the process vanished. My dual RTX 4090 setup — 48GB of VRAM total, 128GB of system RAM — had just murdered my inference server without explanation.

I stared at the screen. I checked nvidia-smi. Both GPUs showed zero memory usage. The model had not even finished loading. Something killed it mid-allocation, and Python had no idea what happened.

Wrong Guess #1: CUDA Version Mismatch

My first theory was a CUDA toolkit mismatch, the classic vLLM gotcha. A PyTorch build compiled for CUDA 12.1 running on a driver that only supports CUDA 11.8 can crash without a useful error. It fit the pattern.

At 23:52, I ran diagnostics:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131

$ nvidia-smi | grep "CUDA Version"
| NVIDIA-SMI 550.67                 Driver Version: 550.67       CUDA Version: 12.4 |

$ python -c "import torch; print(torch.version.cuda)"
12.4

$ python -c "import torch; print(torch.cuda.is_available())"
True

Everything matched. CUDA 12.4 across the board. PyTorch saw the GPUs. I even ran a quick sanity check — allocated a 10GB tensor, did a matmul. It worked fine. The CUDA theory was dead.
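
If you would rather script that comparison than eyeball it, here is a minimal sketch. It assumes the output formats shown above ("release 12.4" from nvcc, "CUDA Version: 12.4" from nvidia-smi, a bare "12.4" from torch.version.cuda). The helper names are hypothetical, and the compatibility rule is deliberately coarse: it treats the version nvidia-smi reports as the newest runtime the driver supports and ignores finer-grained compatibility modes.

```python
import re

# Regexes for the three outputs shown above. The formats are assumed from
# this article's transcripts and may differ on other tool versions.
PATTERNS = {
    "nvcc": r"release (\d+)\.(\d+)",
    "nvidia-smi": r"CUDA Version:\s*(\d+)\.(\d+)",
    "torch": r"^(\d+)\.(\d+)",
}


def parse_cuda_version(source, text):
    """Return (major, minor) found in `text`, or None if nothing matches."""
    match = re.search(PATTERNS[source], text.strip(), re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(driver_version, runtime_version):
    """Coarse check: the driver's reported CUDA version must be at least
    the version the toolkit or PyTorch build was compiled for."""
    return driver_version >= runtime_version
```

Feed each function the captured output of the matching command. A False from driver_supports is a reason to look harder at the driver, not proof of a mismatch.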

It was midnight. I wanted a simple fix, and I didn't have one.

Wrong Guess #2: Corrupt Model Weights

At 00:15, I pivoted. Maybe the model download was corrupt. My internet had hiccuped twice during the Hugging Face pull. What if a safetensors shard had silent corruption that only triggered during GPU transfer?

I checked file hashes. They matched. I loaded the model on CPU with a minimal Transformers script — it worked fine, all 65GB of it. The weights were intact.
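
For anyone repeating the hash check, here is a minimal sketch using only the standard library. It assumes you already have the expected SHA-256 for each shard from wherever you downloaded the model. The function names are hypothetical, and the article does not say which tool was actually used.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a multi-gigabyte shard never sits in RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def shard_is_intact(path, expected_hex):
    """Compare the file's digest with the published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in fixed-size chunks matters here: hashing a shard by loading it whole would add its full size to memory use.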

Two wrong guesses. It was 00:42. I had work in the morning, and my server was still dead.

The Breakthrough: Checking dmesg

At 00:47, I remembered something from a Linux administration course I took years ago. When a process gets killed by the kernel with no explanation, the explanation is usually in the kernel ring buffer. Not in the application logs. Not in stderr. In the kernel's own memory.

I typed:

$ sudo dmesg | tail -20
[Thu Apr 23 23:47:12 2026] python invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[Thu Apr 23 23:47:12 2026] CPU: 8 PID: 28471 Comm: python Not tainted 6.5.0-35-generic #35~22.04.1-Ubuntu
[Thu Apr 23 23:47:12 2026] Hardware name: ASUS System Product Name/PRIME Z790-P, BIOS 2202 01/17/2025
[Thu Apr 23 23:47:12 2026] Call trace:
[Thu Apr 23 23:47:12 2026]  dump_backtrace+0x4a/0x5f
[Thu Apr 23 23:47:12 2026]  show_stack+0x1f/0x30
[Thu Apr 23 23:47:12 2026]  dump_stack_lvl+0x4a/0x63
[Thu Apr 23 23:47:12 2026]  dump_stack+0x10/0x16
[Thu Apr 23 23:47:12 2026]  dump_header+0x4a/0x230
[Thu Apr 23 23:47:12 2026]  oom_kill_process+0x10b/0x140
[Thu Apr 23 23:47:12 2026]  out_of_memory+0x106/0x2e0
[Thu Apr 23 23:47:12 2026] Memory cgroup out of memory: Killed process 28471 (python) total-vm:142341124kB, anon-rss:128923312kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:256312kB oom_score_adj:0

There it was. The OOM killer. At exactly 23:47:12 — the same second my terminal printed "Killed" — the Linux kernel had decided my Python process was using too much memory and terminated it.

But here is the part that confused me: the log said anon-rss:128923312kB. That is about 123GB of anonymous memory, which means ordinary process memory resident in system RAM, not VRAM. My machine has 128GB of RAM. I had been watching nvidia-smi this whole time, assuming the problem was GPU memory. The GPUs were not even the bottleneck.

I felt simultaneously relieved and stupid. Relieved because I finally knew what was happening. Stupid because I had been looking at the wrong metric for three hours.
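
The conversion from that log line to "about 123GB" can be scripted. This is a minimal sketch that assumes the "Killed process" line format shown in the dmesg output above. The function name is hypothetical, and the kernel's kB is treated as 1024 bytes.

```python
import re

KIB_PER_GIB = 1024 * 1024


def parse_oom_kill(line):
    """Pull the pid, command name, and memory counters (converted to GiB)
    out of a kernel 'Killed process' line. Returns None if the line is
    not an OOM kill record."""
    head = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if head is None:
        return None
    fields = {"pid": int(head.group(1)), "comm": head.group(2)}
    # Counters look like "anon-rss:128923312kB"; fields without a kB
    # suffix (UID, oom_score_adj) are skipped on purpose.
    for name, kib in re.findall(r"([\w-]+):(\d+)kB", line):
        fields[name] = int(kib) / KIB_PER_GIB
    return fields
```

Run it over each line of sudo dmesg and compare the anon-rss value with your installed RAM. If the two are close, the problem is system memory.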

The Actual Log Line That Mattered

I went back to the vLLM startup output. I had been so focused on the "Killed" at the end that I missed the actual clue earlier in the logs. I reran the command with --verbose and watched more carefully:

INFO 04-23 23:45:35 [loader.py:402] Loading weights onto GPU. This may take a few minutes...
INFO 04-23 23:45:35 [loader.py:403] Total shards: 17
INFO 04-23 23:45:36 [loader.py:410] Loading shard 1/17 to CPU buffer...
INFO 04-23 23:45:38 [loader.py:410] Loading shard 2/17 to CPU buffer...
INFO 04-23 23:45:40 [loader.py:410] Loading shard 3/17 to CPU buffer...
...
INFO 04-23 23:46:55 [loader.py:410] Loading shard 17/17 to CPU buffer...
INFO 04-23 23:46:56 [loader.py:415] Transferring weights to GPU 0...
INFO 04-23 23:47:01 [loader.py:415] Transferring weights to GPU 1...
Killed

vLLM was loading all 17 model shards into CPU memory first — as a staging buffer — before transferring them to the GPUs. The model is 65GB of weights at FP16. With tensor parallelism across two GPUs, vLLM loads the full weights into system RAM, then splits them during transfer. But it was not just 65GB. The loader creates additional CPU-side buffers during the transfer process. My 128GB of RAM was getting eaten by the staging area, plus the OS, plus the desktop environment, plus the dozen browser tabs I had open.

The log line that mattered was not the "Killed" at the end. It was the sequence of "Loading shard X/17 to CPU buffer" that I had scrolled past without thinking. The model was too big to stage in RAM with my current workload.
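
A quick pre-flight check would have caught this before launch. The sketch below parses the contents of /proc/meminfo and subtracts the weight size. The function names and the overhead_factor knob are hypothetical, and the right overhead for a given loader is something to measure, not a known constant.

```python
def mem_available_gib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo contents, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this field in kB
            return kib / (1024 * 1024)
    raise ValueError("MemAvailable not found in meminfo text")


def staging_headroom_gib(meminfo_text, weights_gib, overhead_factor=1.0):
    """RAM left over after staging the weights on the CPU side.
    A negative result means the load is likely to hit the OOM killer."""
    return mem_available_gib(meminfo_text) - weights_gib * overhead_factor
```

On Linux you would pass in the text of /proc/meminfo and the on-disk size of the weights (65 in my case). Closing the browser before checking gives a more honest number.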

The Fix

At 01:03, I tried a memory-mapped loader flag I had read about in a GitHub issue. It still failed. Same OOM. I was getting desperate.

Then I remembered something else: vLLM pre-allocates GPU memory aggressively based on gpu-memory-utilization. I had it at 0.92. What if the CPU-side staging buffers were proportional to that reservation? I tried dropping it to 0.88:

$ python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.88

It loaded. But I was not satisfied, because it could have been a fluke. I killed the process and ran the exact same command again at 01:22 to confirm.

It loaded successfully. The server started. I sent a test request:

$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B-Instruct",
    "messages": [{"role": "user", "content": "Hello, are you working?"}],
    "max_tokens": 50
  }'

{"choices":[{"message":{"role":"assistant","content":"Yes, I'm here and operational. How can I help you today?"}}]}

It worked. The fix was simply lowering gpu-memory-utilization from 0.92 to 0.88.

Here is my working explanation: vLLM pre-allocates GPU memory based on gpu-memory-utilization. At 0.92 on two RTX 4090s, it reserved about 44GB of VRAM. The weight loader also appears to allocate CPU-side staging buffers in proportion to that reservation. At 0.92, those buffers pushed my 128GB of system RAM over the edge. At 0.88, memory use peaked at 112GB and stabilized.
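
The 44GB figure is just the utilization fraction multiplied by total VRAM. Here is a small sketch of that arithmetic, assuming 24GB per RTX 4090 and that the fraction applies to each GPU separately. The function name is hypothetical, and the per-GPU reading of the flag is worth confirming against the vLLM documentation for your version.

```python
def vram_budget_gb(utilization, gb_per_gpu=24.0, num_gpus=2):
    """Return (per-GPU budget, total budget) in GB for a utilization fraction."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    per_gpu = utilization * gb_per_gpu
    return per_gpu, per_gpu * num_gpus
```

Under these assumptions, 0.92 works out to about 44GB in total and 0.88 to about 42GB.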

What I Now Check First

This three-hour debugging session taught me a workflow. Now, whenever I see a silent "Killed" with vLLM, I run through this checklist in order:

Step 1: Check dmesg immediately. Do not guess. Do not Google. Run sudo dmesg | grep -i "killed process" and see if the OOM killer was involved. If you see a "Killed process" line naming your PID, you have your answer.

Step 2: Look at the RSS. The dmesg output tells you exactly how much memory the process was using. In my case, anon-rss:128923312kB meant 123GB of system RAM. If the RSS is close to your physical limit, the fix is memory reduction, not driver updates.

Step 3: Watch system RAM during loading. Open htop in a second terminal while vLLM starts. If system RAM climbs and hits your physical limit before the model finishes loading, you have a CPU-side staging buffer problem. This is common with large models on machines where RAM is tight relative to VRAM.

Step 4: Reduce gpu-memory-utilization. I used to set this to 0.95 by default, thinking "use as much VRAM as possible." That is the wrong mindset. vLLM needs headroom for temporary allocations. I now start at 0.85 and only increase it after verifying stable operation.

Step 5: Check max-model-len. A 32768 context window sounds nice, but the KV cache it implies is large. For Qwen3-32B (64 layers, 8 KV heads, head dimension 128), an FP16 KV cache costs about 256KB per token, so one full 32K sequence needs roughly 8GB and four concurrent full-length sequences need roughly 32GB, more than a single 24GB card holds. I now set max-model-len based on actual use cases, not theoretical maximums.
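
The scaling in Step 5 is easy to compute yourself. This is a minimal sketch, assuming a standard attention KV cache (one key and one value vector per layer, per KV head, per token) stored in FP16. The layer count, KV head count, and head dimension below are assumptions about the model's config, not values from the logs above, so read the real ones from the model's config.json before trusting the result.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache needed to hold `num_tokens` tokens.
    The leading 2 covers one key tensor plus one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed config values for illustration; replace with your model's own.
ASSUMED_CONFIG = dict(num_layers=64, num_kv_heads=8, head_dim=128)


def kv_cache_gib(context_len, concurrent_seqs, **config):
    """KV cache in GiB for `concurrent_seqs` sequences of `context_len` tokens each."""
    total_tokens = context_len * concurrent_seqs
    return kv_cache_bytes(total_tokens, **config) / 2**30
```

Call it as kv_cache_gib(32768, 4, **ASSUMED_CONFIG) to size four concurrent full-length sequences. Under tensor parallelism the cache is split across GPUs, so dividing by the tensor-parallel size gives a rough per-GPU figure.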

The Real Lesson

The most frustrating bugs are the ones that give you no information. A stack trace is a gift. An error message is a clue. "Killed" is nothing — it is the absence of information, and it forces you to become a detective.

I spent three hours on this problem. Two wrong guesses. One breakthrough at 00:47 when I remembered dmesg existed. The actual fix took thirty seconds to implement. The remaining time was verification and understanding why it worked.

If you take one thing from this article, let it be this: when a process dies silently on Linux, the kernel knows why. The kernel always knows why. You just have to know where to ask. dmesg is where the truth lives. Everything else is speculation.

I finished at 01:45. The server was running. I had a working Qwen3-32B endpoint. And I had a new habit: every time I see "Killed," I check dmesg first. No exceptions.