Docker + vLLM: The Complete Deployment Checklist I Wish I Had on Day One

I have deployed vLLM in Docker enough times to have a scar for each mistake. The first time, I spent three hours debugging why the container could not see my GPU. The second time, I filled a 2TB disk with logs in 36 hours. The third time, I forgot to set shared memory and watched a batch inference job crash with a cryptic PyTorch multiprocessing error that sent me down a rabbit hole of reading Linux kernel documentation.

This checklist is what I now run through every single time I start a vLLM container. It is not theoretical. Every item on this list is something that either caught me, still gets me, or saved me after I added it to the routine. I keep it in a text file on my server and copy-paste from it before every deployment. You should too.

1. GPU Runtime Check (nvidia-docker)

Why it matters: Docker does not speak GPU by default. Without the NVIDIA Container Toolkit, a --gpus request fails outright, and a container started without GPU access dies as soon as vLLM tries to initialize CUDA. The error message is usually something useless like "CUDA error: no CUDA-capable device is detected," which makes you think your driver is broken when it is not.

Caught me: Yes, on my very first Docker deployment. I had CUDA 12.4 installed on the host, nvidia-smi worked fine, and I assumed Docker would "just inherit" the GPU. It does not.

Verify the toolkit is installed:

nvidia-ctk --version

Test that Docker can see the GPU:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If that prints your GPU info, you are good. If it says "could not select device driver," install the toolkit (this assumes NVIDIA's apt repository is already configured on the host):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

2. Shared Memory Size (--shm-size)

Why it matters: PyTorch uses shared memory (/dev/shm) to pass tensors between processes, and vLLM leans on it for tensor-parallel workers. Docker defaults to 64MB of shared memory. vLLM with tensor parallelism or even moderate batch sizes will blow through that in seconds. The error you get is typically a bus error, such as "RuntimeError: DataLoader worker (pid XXXX) is killed by signal: Bus error," which is Linux-speak for "I ran out of shared memory and crashed your process."

Caught me: Yes, during a batch summarization job at 2 AM. The container had been running fine for hours, then died on a larger batch. I spent 45 minutes reading PyTorch GitHub issues before realizing it was a Docker flag.

Set it to at least half your system RAM, or use the host's shared memory directly:

docker run --gpus all --shm-size=32g ...

Or mount the host's /dev/shm (sharing the host IPC namespace with --ipc=host has a similar effect):

docker run --gpus all --mount type=bind,source=/dev/shm,target=/dev/shm ...

I now default to --shm-size=64g on my 128GB machine and have not had a bus error since.
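
The "half your system RAM" rule can be computed instead of guessed. This is a minimal sketch, assuming a Linux host where /proc/meminfo reports MemTotal in kB. The function name shm_size_from_kb is made up for illustration.

```shell
# Hypothetical helper: turn a MemTotal value (in kB) into a --shm-size
# argument equal to half of system RAM, rounded down to whole GiB.
shm_size_from_kb() {
  local total_kb="$1"
  echo "$(( total_kb / 2 / 1024 / 1024 ))g"
}

# Example: a machine reporting exactly 128 GiB (134217728 kB).
shm_size_from_kb 134217728   # prints 64g
```

On a real host you would feed it the live value, for example --shm-size="$(shm_size_from_kb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)")". Real machines report a little less than their nominal RAM, so expect a result slightly under half the advertised size.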

3. Port Mapping (Why 8000 Is Not Always 8000)

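Why it matters: vLLM's default API port is 8000. So is the default for Python's http.server, Django's development server, and a dozen other services you might have installed six months ago and forgotten about (Grafana defaults to 3000, but any service can be remapped onto 8000). If something is already bound to 8000 on your host and you asked for -p 8000:8000, Docker refuses to start the container with an "address already in use" or "port is already allocated" error. The worse case is publishing only the container port, for example "8000" with no host port in Docker Compose: Docker then quietly assigns a random high-numbered host port, and your client keeps knocking on 8000.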

Caught me: I had an old Grafana instance running from a monitoring experiment. My vLLM container started, mapped port 8000 to some high-numbered ephemeral port, and my API client could not connect. I spent 20 minutes checking firewall rules.

Always check before you map:

sudo lsof -i :8000

Be explicit about the mapping:

docker run --gpus all -p 8000:8000 ...

If 8000 is taken, pick a different host port:

docker run --gpus all -p 8080:8000 ...

And document which port you chose, because future-you will forget.
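
Picking the fallback port can be scripted. The sketch below assumes a Linux host with ss (from iproute2) and parses the output of ss -ltn. Both function names are hypothetical, and the candidate list is whatever you choose.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and succeed if any
# listener's local address ends in the given port.
port_is_listening() {
  local port="$1"
  # Column 4 of `ss -ltn` is "Local Address:Port", e.g. 0.0.0.0:8000 or [::]:8000.
  awk -v port="$port" '$1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# Print the first candidate port that has no listener; fail if all are taken.
first_free_port() {
  local listeners="$1"; shift
  local candidate
  for candidate in "$@"; do
    if ! printf '%s\n' "$listeners" | port_is_listening "$candidate"; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}
```

Usage would look like host_port=$(first_free_port "$(ss -ltn)" 8000 8080 8081), followed by docker run --gpus all -p "$host_port":8000 ... so the container side stays on 8000 while the host side moves.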

4. Volume Mounts (Model Path Consistency)

Why it matters: vLLM downloads models from Hugging Face on first run. Without a volume mount, those downloads live inside the container's ephemeral filesystem. When the container restarts, the model is gone, and you wait another 20 minutes for it to download again. With a volume mount, the model persists on the host, and startup drops from 20 minutes to 20 seconds.

Caught me: Not once, but three times. I kept wondering why my "optimized" container took forever to start after a reboot. The model cache was inside the container, and I was re-downloading 15GB every time.

Mount your Hugging Face cache:

docker run --gpus all \
  -v /opt/models:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B

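Use absolute paths on the host. With docker run -v, a source that does not start with / is treated as the name of a Docker volume, not a directory, so -v models:/root/.cache/huggingface quietly creates a named volume called models instead of using the folder you meant. That is almost never what you expect.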

I keep my models in /opt/models on every host and always mount that directory at /root/.cache/huggingface inside the container. Consistent paths make debugging easier when you are SSH'd into the host at midnight.
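
To tell whether a start will be a 20-second start or a 20-minute one, you can check the mounted cache for the model first. This sketch assumes the Hugging Face hub cache layout (hub/models--ORG--NAME/snapshots/ under the cache root). The helper names are hypothetical.

```shell
# Hypothetical helper: map a Hugging Face model ID to its directory name
# in the hub cache, e.g. "org/name" -> "models--org--name".
hf_cache_dirname() {
  local model_id="$1"
  printf 'models--%s\n' "$(printf '%s' "$model_id" | sed 's|/|--|g')"
}

# Succeed if the model has at least one snapshot under the given cache root.
model_is_cached() {
  local cache_root="$1" model_id="$2"
  local snapshots="$cache_root/hub/$(hf_cache_dirname "$model_id")/snapshots"
  [ -d "$snapshots" ] && [ -n "$(ls -A "$snapshots" 2>/dev/null)" ]
}
```

For example, model_is_cached /opt/models "$MODEL_ID" || echo "not cached, first start will download" tells you which kind of start to expect. It only checks that a snapshot directory exists, not that the download finished cleanly.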

5. Environment Variables (CUDA_VISIBLE_DEVICES)

Why it matters: You might have multiple GPUs and want to reserve one for another task. Or you might want to test on a specific GPU to verify it is not the one with the flaky cooler. CUDA_VISIBLE_DEVICES controls which GPUs the container sees, but it interacts weirdly with Docker's --gpus flag. If you pass --gpus all and then set CUDA_VISIBLE_DEVICES=1, the container sees all GPUs but CUDA only uses device 1. If you pass --gpus '"device=1"', the container only sees one GPU, and CUDA_VISIBLE_DEVICES inside the container will see it as device 0.

Caught me: I set CUDA_VISIBLE_DEVICES=1 inside the container while passing --gpus all, then tensor parallelism failed because it could not find the second GPU it expected. The error was "Expected 2 GPUs but found 1," which sent me checking my hardware instead of my flags.

If you want only GPU 1:

docker run --gpus '"device=1"' -e CUDA_VISIBLE_DEVICES=0 ...

If you want all GPUs but restrict vLLM to GPU 1:

docker run --gpus all -e CUDA_VISIBLE_DEVICES=1 ...

Keep a note of which approach you used. Mixing them is how you end up with a GitHub issue at 3 AM.
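
The mismatch that bit me (tensor parallelism asking for more GPUs than CUDA can see) is cheap to catch before launch. This is a rough sketch with made-up helper names; it only counts entries in a CUDA_VISIBLE_DEVICES-style list and compares that to the tensor-parallel size you intend to pass to vLLM.

```shell
# Count the entries in a CUDA_VISIBLE_DEVICES-style list ("0,1" -> 2).
# An empty list hides every GPU, so it counts as 0.
visible_gpu_count() {
  local list="$1"
  if [ -z "$list" ]; then echo 0; return; fi
  printf '%s\n' "$list" | awk -F',' '{ print NF }'
}

# Succeed only if the requested tensor-parallel size fits the visible GPUs.
tp_size_fits() {
  local tp_size="$1" list="$2"
  [ "$tp_size" -le "$(visible_gpu_count "$list")" ]
}

tp_size_fits 2 "1" || echo "tensor parallel size 2 needs 2 GPUs, but only 1 is visible"
```

Run it with the same values you are about to put after -e CUDA_VISIBLE_DEVICES= and --tensor-parallel-size. It does not know about the --gpus flag, so it checks only the CUDA side of the pair.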

6. Memory Limits (Docker vs GPU Memory)

Why it matters: Docker's --memory flag limits system RAM. It does not limit GPU memory. vLLM has its own GPU memory limit via --gpu-memory-utilization, but that is inside the container and independent of Docker's memory cgroups. If you set Docker's memory limit too low, vLLM will OOM during model loading even though your GPU has 24GB free. If you set it too high, you might starve the host OS.

Caught me: I set --memory=16g thinking I was being conservative. vLLM loaded the model weights into GPU memory fine, but the initial CPU-side preprocessing needed 20GB of system RAM. The container got killed by the cgroup OOM killer before it ever touched the GPU.

Set Docker memory generously, or not at all for vLLM:

docker run --gpus all --memory=64g ...

Control GPU memory through vLLM's own flag:

--gpu-memory-utilization 0.85

This leaves 15% headroom on the GPU for CUDA overhead and unexpected allocations. I have found 0.85 to be the sweet spot. 0.95 works until it does not, and then you get a CUDA OOM at the worst possible moment.
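
To see what 0.85 means in actual memory, multiply it by the card's total. The sketch below assumes you already know the total in MiB (nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits reports it). The helper name is hypothetical.

```shell
# Hypothetical helper: MiB that vLLM may claim for a given total and
# --gpu-memory-utilization fraction (truncated to whole MiB).
gpu_budget_mib() {
  local total_mib="$1" utilization="$2"
  awk -v total="$total_mib" -v util="$utilization" 'BEGIN { printf "%d\n", total * util }'
}

# Example: a 24 GiB card (24576 MiB) at 0.85.
budget=$(gpu_budget_mib 24576 0.85)
echo "vLLM budget: ${budget} MiB, headroom: $(( 24576 - budget )) MiB"
```

On a 24 GiB card that works out to roughly 20.4 GiB for vLLM and 3.6 GiB of headroom. Anything else already running on the GPU eats into that headroom first.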

7. Health Checks

Why it matters: vLLM can be in a state where the process is running but the API is not responding. This happens during model loading, after a CUDA error that did not crash the process, or when the model download is corrupted and the server hangs during weight initialization. Without a health check, your orchestrator thinks everything is fine.

Caught me: My container was "running" for six hours without serving a single request. The model download had been interrupted, and vLLM was stuck in a retry loop that never timed out. Docker showed the container as Up because the process had not exited; with no health check defined, that is all Docker can tell you.

Add a health check in Docker Compose:

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 120s

Or in a Dockerfile:

HEALTHCHECK --interval=30s --timeout=10s --start-period=120s \
  CMD curl -f http://localhost:8000/health || exit 1

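The start_period is crucial. Model loading can take 60-120 seconds (longer for big models or a cold cache), and failed probes during the start period do not count toward the retry limit, so the container is not marked unhealthy before it has finished starting. Two caveats: the check assumes curl exists inside the image, and plain Docker only flags an unhealthy container; restarting it is up to your orchestrator.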
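
The same probe is handy from the host in deploy scripts, so that whatever runs next waits for the API instead of racing it. This is a minimal sketch, assuming curl on the host and the /health endpoint used above. The function name and default timings are arbitrary.

```shell
# Hypothetical helper: poll a health URL until it answers with a 2xx status
# or the timeout (in seconds) runs out. Returns 0 when healthy, 1 on timeout.
wait_for_health() {
  local url="$1" timeout_s="${2:-300}" interval_s="${3:-5}"
  local waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -f turns HTTP errors into a non-zero exit; --max-time bounds a hung probe.
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval_s"
    waited=$(( waited + interval_s ))
  done
  return 1
}
```

A typical call is wait_for_health http://localhost:8000/health 300 5 || echo "vLLM did not come up in 5 minutes". The elapsed time is counted in sleep intervals, so slow probes make the real wait somewhat longer than the nominal timeout.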

8. Restart Policy

Why it matters: vLLM crashes. Not often, but it crashes. A CUDA OOM, a corrupted model file, a transient network issue during download — any of these can kill the process. Without a restart policy, your API is down until you manually notice and fix it. With a restart policy, Docker brings it back automatically.

Caught me: A power flicker rebooted my server at 3 AM. The container did not restart because I had not set a policy. I found out when my morning batch job failed.

Use unless-stopped for most cases:

docker run --gpus all --restart unless-stopped ...

In Docker Compose:

restart: unless-stopped

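Avoid always unless you genuinely want the container starting on every boot regardless of whether you manually stopped it. With always, a container you stopped by hand comes back the next time the Docker daemon restarts; with unless-stopped, it stays down until you start it again. I use unless-stopped because sometimes I want to take the container down for maintenance and have it stay down across a reboot.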

9. Log Rotation (The One That Filled My Disk)

Why it matters: Docker containers log stdout and stderr to JSON files on the host by default. These files grow forever. vLLM is verbose. A single day of serving can generate hundreds of megabytes of logs. I once filled a 2TB NVMe drive in 36 hours because I had four vLLM containers running with debug logging enabled and no rotation configured.

Caught me: This is the one that still gets me. I know about log rotation. I have been burned by it. And yet, every few months, I spin up a container without checking, and three weeks later I get a disk space alert.

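Configure log rotation in the Docker daemon settings (/etc/docker/daemon.json), then restart Docker. The new defaults apply only to containers created afterwards, so existing containers must be recreated to pick them up: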

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}

Or per-container:

docker run --gpus all \
  --log-driver json-file \
  --log-opt max-size=100m \
  --log-opt max-file=5 ...

In Docker Compose:

logging:
  driver: "json-file"
  options:
    max-size: "100m"
    max-file: "5"

100MB times 5 files per container times 4 containers is 2GB of logs, which is manageable. Without this, it is unbounded, and unbounded growth always wins.
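
The ceiling is simple multiplication, which makes it easy to rerun when you change a setting or add a container. A small sketch with a hypothetical helper name:

```shell
# Worst-case json-file log usage in MB: max-size x max-file x containers.
max_log_usage_mb() {
  local max_size_mb="$1" max_file="$2" containers="$3"
  echo $(( max_size_mb * max_file * containers ))
}

max_log_usage_mb 100 5 4   # prints 2000 (about 2 GB)
```

To compare the ceiling with reality, sudo du -sh /var/lib/docker/containers/*/*-json.log shows what each container's log currently occupies, assuming the default Docker data root.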

10. Network Mode (Host vs Bridge)

Why it matters: Docker's default bridge network adds a NAT layer between the container and the host. For most services, this is fine. For vLLM serving large models with high-throughput streaming, the NAT overhead is measurable, and port mapping can be finicky if you are running multiple containers. Host mode removes the NAT and binds the container directly to the host network stack.

Caught me: I was benchmarking vLLM throughput and getting numbers 8% lower than expected. After two hours of checking CUDA settings, I realized the bridge network was adding latency to each request. Switching to host mode fixed it.

Use host mode for performance-critical deployments:

docker run --gpus all --network host ...

Use bridge mode if you need port isolation or are running on a shared host where you do not want the container binding directly to interfaces:

docker run --gpus all -p 8000:8000 --network bridge ...

Host mode means you cannot map ports. The container's port 8000 is the host's port 8000. If you run two vLLM containers in host mode, they will fight over the port unless you give each one a different port with vLLM's --port flag.

11. User Permissions

Why it matters: By default, Docker containers run as root. When you mount a host directory for model caching, the container writes files as root. Later, when you try to manage those files from your regular user account, you get permission denied. Or worse, you run sudo chmod -R 777 on your model cache and create a security hole.

Caught me: My CI pipeline could not clean up old model files because they were owned by root. I ended up adding the CI user to the docker group and using sudo, which is a hack on top of a hack.

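Run the container as your user. A non-root UID usually cannot read /root inside the image, so point the Hugging Face cache at a neutral path with HF_HOME instead of mounting under /root (the /cache/huggingface path here is an arbitrary choice):

docker run --gpus all --user "$(id -u):$(id -g)" \
  -e HF_HOME=/cache/huggingface \
  -v /opt/models:/cache/huggingface ...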

Or create a dedicated user for vLLM and set the volume permissions beforehand:

sudo useradd -r -s /bin/false vllm
sudo mkdir -p /opt/models
sudo chown -R vllm:vllm /opt/models

If the container needs to write to the mount, the host directory must be writable by the container's UID. Map them to match, or you will spend an afternoon debugging why the model download "succeeds" but the files are not there.
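
A quick ownership check before starting the container catches the mismatch early. This sketch uses only ls and awk. The helper names are hypothetical, and it checks ownership only, so a directory made writable through group permissions would still be reported as a mismatch.

```shell
# Print the numeric UID that owns a path (column 3 of `ls -n`).
owner_uid() {
  ls -nd "$1" | awk '{ print $3 }'
}

# Succeed if the path is owned by the given UID.
owned_by_uid() {
  local path="$1" uid="$2"
  [ "$(owner_uid "$path")" = "$uid" ]
}
```

For the dedicated-user setup above, that becomes owned_by_uid /opt/models "$(id -u vllm)" || echo "fix ownership of /opt/models before starting", using the same UID you pass to --user.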

12. Timezone Sync

Why it matters: vLLM logs timestamps. Prometheus metrics have timestamps. Your log aggregation system has timestamps. If the container is in UTC and your host is in EST, your logs are five hours off, and correlating events across systems becomes a nightmare. This is especially painful when debugging performance issues where you need to line up vLLM logs with system metrics.

Caught me: I was correlating vLLM request logs with Grafana dashboards and could not figure out why the latency spikes did not line up. The container was in UTC. My dashboards were in local time. Five hours is a lot of latency to explain.

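Mount the host timezone (/etc/timezone exists on Debian and Ubuntu hosts; drop that line on hosts that lack it, or Docker will create an empty directory at that path):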

docker run --gpus all \
  -v /etc/localtime:/etc/localtime:ro \
  -v /etc/timezone:/etc/timezone:ro ...

Or set it explicitly:

docker run --gpus all -e TZ=America/New_York ...

Verify inside the container:

docker exec <container-name> date

This takes ten seconds and saves hours of confusion.

13. Resource Monitoring Setup

Why it matters: vLLM's OpenAI-compatible server exposes Prometheus metrics on /metrics out of the box, but they only help if something scrapes them. Docker exposes container stats via docker stats, but only if you are looking. You need both: vLLM metrics for inference performance, Docker metrics for container health. Without monitoring, you are flying blind, and the first sign of trouble is usually a user complaint.

Caught me: I had no monitoring on my first production vLLM deployment. The model was running, but GPU cache usage was at 98% and requests were queueing. I only found out when a user reported timeouts. By then, the queue was 200 requests deep.

Confirm the metrics endpoint responds (no extra flag is needed):

curl -s http://localhost:8000/metrics | head

Scrape them with Prometheus:

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']

Watch Docker stats in real time:

docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

Key vLLM metrics to alert on (names have shifted between releases, so check your own /metrics output): vllm:gpu_cache_usage_perc above 0.9 means the KV cache is nearly full and vLLM is about to start preempting requests. vllm:num_requests_waiting above 10 means your throughput is below your arrival rate. Set up alerts for both.
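
Until real alerting is wired up, the same thresholds can be checked from a cron job or by hand. This is a rough sketch that parses the Prometheus text format with awk. The helper names are hypothetical, and it assumes the server emits no trailing timestamps, so the value is the last field on the line.

```shell
# Read Prometheus text format on stdin and print the value(s) of one metric.
# The metric name is whatever precedes "{" (or the first space) on the line.
metric_value() {
  local name="$1"
  awk -v name="$name" '!/^#/ { split($1, parts, "{"); if (parts[1] == name) print $NF }'
}

# Succeed if value > limit (numeric comparison done by awk).
exceeds() {
  awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value > limit) }'
}
```

For example: waiting=$(curl -s http://localhost:8000/metrics | metric_value vllm:num_requests_waiting), then exceeds "$waiting" 10 && echo "queue is backing up". If the server reports several models, metric_value prints one line per label set and you would need to pick one before comparing.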

14. Backup Strategy for Models

Why it matters: Model files are large and take a long time to download. If your disk fails, or you accidentally delete the cache, or Hugging Face has an outage, you do not want to wait 30 minutes for a re-download while your API is down. A local backup of your most-used models is cheap insurance.

Caught me: I ran rm -rf on the wrong directory during a cleanup. 340GB of models vanished. It took six hours to re-download everything on my connection.

Keep a secondary copy on a different disk:

rsync -av --progress /opt/models/ /mnt/backup/models/

Or use a simple weekly cron job (Sundays at 03:00):

0 3 * * 0 rsync -av /opt/models/ /mnt/backup/models/ >> /var/log/model-backup.log 2>&1

You do not need to back up every model. Just the ones you rely on for production. I back up my three most-used models weekly. The rest can be re-downloaded if needed.

15. Update Strategy

Why it matters: vLLM releases updates frequently. New versions fix bugs, add models, and improve performance. They also occasionally break backward compatibility, change default behavior, or introduce new CUDA requirements. Updating blindly is how you turn a working deployment into a weekend debugging session.

Caught me: I pulled vllm/vllm-openai:latest and my startup scripts broke because the default API server path changed. I had not pinned the version, and "latest" had moved on without me.

Pin your image version:

vllm/vllm-openai:v0.11.2

Test updates in a separate container before touching production:

docker run --gpus all -p 8001:8000 \
  vllm/vllm-openai:v0.11.3 \
  --model Qwen/Qwen3-8B

Read the release notes before updating. vLLM's GitHub releases are detailed and usually flag breaking changes. I keep a changelog file on my server with the current version and any migration steps I needed.
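
Whether an image reference is actually pinned can be checked mechanically before a deploy script runs. A small sketch with a hypothetical function name: it treats a digest or any tag other than latest as pinned, and a missing tag as unpinned, since Docker then falls back to latest.

```shell
# Succeed if the image reference is pinned to a digest or a non-latest tag.
is_pinned_image() {
  local ref="$1" last
  case "$ref" in
    *@sha256:*) return 0 ;;          # digest pins are the strictest form
  esac
  last="${ref##*/}"                  # drop registry/namespace (may contain a port colon)
  case "$last" in
    *:latest) return 1 ;;
    *:*)      return 0 ;;
    *)        return 1 ;;            # no tag means Docker pulls :latest
  esac
}

is_pinned_image "vllm/vllm-openai:latest" || echo "refusing to deploy an unpinned image"
```

Dropping that check at the top of a deploy script turns "latest moved on without me" into an immediate, obvious failure instead of a surprise after the pull.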

16. Security (Do Not Run as Root)

Why it matters: A vLLM container running as root with GPU access and network exposure is a high-value target. If there is a vulnerability in vLLM or one of its dependencies, an attacker gets root on your GPU server. That is bad. GPU servers are expensive, and attackers love them for cryptomining.

Caught me: Not directly, but I ran a security scan on my setup and found the container listening on 0.0.0.0:8000 as root with no authentication. Anyone who could reach that port could send prompts and potentially exploit the API.

Run as a non-root user:

docker run --gpus all --user 1000:1000 ...

Bind to localhost only if you do not need external access:

docker run --gpus all -p 127.0.0.1:8000:8000 ...

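Put an authentication layer in front. vLLM's built-in option is minimal: --api-key (or the VLLM_API_KEY environment variable) makes the server require a single shared bearer token on the OpenAI-compatible endpoints, with no per-user keys or rate limiting. For anything more, use nginx with basic auth, or a reverse proxy with API key validation, or run it on a private network. Exposing vLLM directly to the internet without auth is asking for trouble.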

17. The "Did I Turn Off the Oven" Check Before Deploying

Why it matters: You have been through the checklist. You have set the flags. You are ready to deploy. But there is always one thing. The model ID you typed from memory. The port number you copied from an old script. The volume path that is slightly different on this machine. The environment variable that worked on your other server but not this one.

Still gets me: Every single time. I once deployed a container with the wrong model name and did not notice for two hours because the API was responding — it was just responding with a different model than expected. My users got answers from Llama-3.1 when they thought they were querying Qwen3. The answers were fine, but the behavior was wrong, and it took a while to trace.

Before you hit enter, run this sanity check:

# Verify the model path exists and has the right files
ls -la /opt/models/hub/models--Qwen--Qwen3-8B/snapshots/

# Verify the port is free
sudo lsof -i :8000

# Verify GPU availability
nvidia-smi

# Do a dry-run of the docker command without -d (detached)
# Watch the logs for the first 60 seconds
# Then Ctrl+C and re-run with -d

My final pre-flight script looks like this:

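#!/bin/bash
# Pre-flight check: abort on the first failed command or unset variable.
set -euo pipefail

IMAGE="vllm/vllm-openai:v0.11.2"
PORT=8000

echo "=== Pre-flight check ==="
echo "GPU status:"
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

echo "Port ${PORT} status:"
# Match listeners only. Run as root to see sockets owned by other users.
if lsof -nP -iTCP:"${PORT}" -sTCP:LISTEN >/dev/null 2>&1; then
  echo "ERROR: Port ${PORT} is in use"
  exit 1
else
  echo "OK: Port ${PORT} is free"
fi

echo "Model cache:"
# df fails (and set -e aborts) if /opt/models does not exist.
df -h /opt/models

echo "Docker image:"
# docker image inspect exits non-zero when the image is missing,
# unlike "docker images", which just prints nothing.
docker image inspect "${IMAGE}" --format '{{.RepoTags}} {{.Size}}'

echo "=== All checks passed ==="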

I run this before every deployment. It takes five seconds. It has caught wrong model paths, full disks, port conflicts, and missing images. It is the cheapest insurance you can buy.

The Final Word

This checklist lives in /opt/vllm/checklist.md on my server. I read it before every deployment, even when I am sure I remember everything. I do not remember everything. Neither will you.

The thing about Docker deployments is that they feel repeatable until they are not. You run the same command ten times and it works. The eleventh time, you are on a different machine, or a different network, or a different version of Docker, and something subtle breaks. The checklist is not about preventing every failure. It is about preventing the failures you have already had once.

Print this out. Save it somewhere you will actually look at it. Add your own items as you discover them. The best checklist is the one you use, not the one that sits in a bookmark folder collecting dust.

And if you find item 18 — the one I have not learned yet — email me. I will add it to the list and credit you for saving me from my next 2 AM debugging session.