GPUStack Setup Guide: From Zero to Local API in 20 Minutes

You don't need Kubernetes. You don't need Docker Compose. Just one binary, one config file, and twenty minutes.

I had been running vLLM directly from the command line for months. Every time I wanted to switch models, I killed the process, scrolled through my shell history to find the right flags, and started it again. When I needed two models running at once, I opened a second terminal and hoped I remembered the correct port number. It worked, but it felt like managing a server with sticky notes.

GPUStack is an open-source model management layer that sits above inference engines like vLLM and llama.cpp, giving you a web UI to deploy, switch, and monitor models without memorizing command-line flags. I was skeptical, but I decided to give it an afternoon. It took 23 minutes from a fresh Ubuntu install to my first working API call. Three of those were me typing the wrong password. Here is exactly what I did.

What GPUStack Is and Why I Chose It

GPUStack is a lightweight model serving platform — a control plane for your local GPUs. It supports multiple inference backends: vLLM for GPU serving, llama.cpp for CPU deployments. I chose it because I was tired of maintaining shell scripts for every model. GPUStack is not a replacement for vLLM — it is a manager. The inference still runs through vLLM or llama.cpp under the hood. GPUStack handles the lifecycle: download, start, expose, monitor, restart.

Prerequisites

Before you start, you need the following. I am being specific because "a modern GPU" is not enough information when you are troubleshooting a failed install at midnight.

  • GPU: NVIDIA GPU with compute capability 7.0 or higher. I tested on an RTX 4090 and a dual-RTX-3090 setup. Older cards like the GTX 1080 Ti will not work with the vLLM backend.
  • Drivers: NVIDIA driver 535.54.03 or newer. Check with nvidia-smi. I wasted 15 minutes on a machine with a 525 driver before I noticed that the CUDA check had failed silently.
  • CUDA: 12.2 or newer. GPUStack bundles its own libraries, but the host needs a compatible driver.
  • OS: Ubuntu 22.04 or 24.04 LTS. This guide assumes Ubuntu.
  • RAM: 32GB minimum. I tried on 16GB and the OOM killer terminated a Qwen3-14B download.
  • Disk: At least 100GB free. A 7B model is ~15GB, but the cache grows fast.
  • Network: Hugging Face access. GPUStack downloads from Hugging Face by default.
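
The driver and disk requirements above can be checked before running the installer. The following is a minimal sketch, not part of GPUStack: the function names are made up for illustration, and it assumes nvidia-smi is on the PATH and supports the --query-gpu=driver_version query.

```python
import shutil
import subprocess

MIN_DRIVER = (535, 54, 3)  # driver floor from the prerequisites list
MIN_FREE_GB = 100          # disk floor from the prerequisites list


def parse_version(text):
    """Turn '550.67' or '535.54.03' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_ok(version_text, minimum=MIN_DRIVER):
    """Return True if the driver version string meets the minimum."""
    found = parse_version(version_text)
    # Pad the shorter tuple with zeros so (550, 67) compares against (535, 54, 3).
    width = max(len(found), len(minimum))
    found += (0,) * (width - len(found))
    minimum += (0,) * (width - len(minimum))
    return found >= minimum


def installed_driver():
    """Ask nvidia-smi for the driver version reported for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()[0].strip()


def free_disk_gb(path="/"):
    """Free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9
```

On a GPU machine, driver_ok(installed_driver()) and free_disk_gb("/") >= MIN_FREE_GB should both hold before you continue.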

My test machine: Ryzen 9 7950X, 128GB DDR5, dual RTX 4090, Ubuntu 22.04, driver 550.67.

Installation

GPUStack provides a one-line installer script. I am normally suspicious of curl | sh installations, but the script is readable and only installs a single binary plus a systemd service.

Step 1: run the installer.

curl -sfL https://get.gpustack.io | sh -

The script detects your OS, downloads the appropriate binary, and installs it to /usr/local/bin/gpustack. On my machine, the output looked like this:

[INFO]  Detecting architecture...
[INFO]  Architecture: amd64
[INFO]  Downloading GPUStack v0.5.1...
[INFO]  Installing to /usr/local/bin/gpustack
[INFO]  Creating systemd service...
[INFO]  Starting GPUStack service...
[INFO]  GPUStack is running at http://localhost:80
[INFO]  Default username: admin
[INFO]  Default password: (randomly generated, see below)

The installer prints the default admin password at the end. Copy it immediately. I did not copy mine, and figuring out how to reset it cost me an extra 8 minutes.

Step 2: verify the service is running.

sudo systemctl status gpustack

You should see active (running). If you see failed, check the logs:

sudo journalctl -u gpustack -n 50 --no-pager

The most common failure is a CUDA driver mismatch. If the logs mention libcudart.so or nvmlInit errors, your NVIDIA driver is most likely too old, or it has not been loaded since it was installed.
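
If you check these logs often, the scan can be scripted. This is a rough sketch under stated assumptions: the error substrings are the ones named in this guide (libcudart.so, nvmlInit, and "Address already in use" for a port conflict), the hint texts and function names are illustrative, and reading the journal may require sudo on your system.

```python
import subprocess

# Substrings mentioned in this guide, mapped to the likely cause.
KNOWN_ERRORS = {
    "libcudart.so": "CUDA runtime failed to load; check the NVIDIA driver version",
    "nvmlInit": "NVML failed to initialize; driver too old or a reboot is needed",
    "Address already in use": "port conflict; another service holds the port",
}


def diagnose(log_text):
    """Return a hint for each known error substring found in log_text."""
    return [hint for needle, hint in KNOWN_ERRORS.items() if needle in log_text]


def recent_logs(unit="gpustack", lines=50):
    """Fetch the last `lines` journal entries for a systemd unit."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "-n", str(lines), "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

Calling diagnose(recent_logs()) returns an empty list when none of the known signatures appear, which means you need to read the log yourself.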

Step 3: check that GPUStack sees your GPUs.

gpustack list-gpus

Expected output on a dual-GPU machine:

+----+------+----------+-------------+-------------+
| ID | Name | Memory   | Utilization | Temperature |
+----+------+----------+-------------+-------------+
| 0  | RTX  | 24564 MB | 0%          | 34C         |
| 1  | RTX  | 24564 MB | 0%          | 36C         |
+----+------+----------+-------------+-------------+

If this command returns no GPUs, reboot and try again. A fresh driver install often requires a reboot before the NVML library becomes visible to userspace.

First Model Deployment

Open your browser and navigate to http://localhost. Log in with username admin and the password from the installer output.

The dashboard is clean: a GPU overview at the top, a model list on the left, and a "Deploy Model" button in the center. Click that button.

You will see a form with these fields:

  • Model Source: Hugging Face (default) or local path
  • Model ID: the Hugging Face model identifier
  • Backend: vLLM, llama.cpp, or auto-detect
  • GPU: which GPU to use, or auto-schedule
  • Quantization: optional, for specifying AWQ or GPTQ
  • Replicas: how many instances to run

For your first deployment, use these exact values:

  • Model Source: Hugging Face
  • Model ID: Qwen/Qwen3-7B-Instruct
  • Backend: vLLM
  • GPU: Auto
  • Replicas: 1

Click "Deploy." GPUStack downloads the model weights from Hugging Face, which takes 5-10 minutes. When the status changes from "Downloading" to "Running," the model is live.

I recommend using the OpenAI-compatible endpoint at http://localhost/v1-openai/chat/completions, which handles model routing through the standard model parameter in the JSON body. The default /v1/chat/completions path uses URL-encoded model names that some HTTP clients mishandle.

Testing the API

Open a terminal and run this curl command. Change the "model" value in the JSON body if your model name is different.

curl http://localhost/v1-openai/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-7B-Instruct",
    "messages": [{"role": "user", "content": "Explain quantum computing in one paragraph."}],
    "max_tokens": 200,
    "temperature": 0.7
  }'

The first request takes a few seconds while vLLM warms up the CUDA kernels. Here is the response I received:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1714123456,
  "model": "Qwen/Qwen3-7B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Quantum computing leverages the principles of quantum mechanics, specifically superposition and entanglement, to process information in ways that classical computers cannot. While a classical bit is either 0 or 1, a quantum bit (qubit) can exist in a superposition of both states simultaneously. This allows quantum computers to explore many possible solutions to a problem at once, making them potentially exponentially faster for certain tasks like factoring large numbers, simulating molecular structures, and optimizing complex systems. However, quantum computers are still in early development and face significant challenges including error correction and maintaining quantum coherence at scale."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 118,
    "total_tokens": 132
  }
}

The API is fully OpenAI-compatible. Point any client library at it by changing the base URL. The api_key field is required by the Python SDK but ignored by GPUStack.
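
For scripts, the same request can be made from Python. The sketch below uses only the standard library so it runs without the OpenAI SDK; the helper names are made up for illustration, and it assumes the endpoint, port, and response shape shown in this guide.

```python
import json
import urllib.request

BASE_URL = "http://localhost/v1-openai"  # endpoint recommended in this guide


def build_chat_request(model, prompt, max_tokens=200, temperature=0.7,
                       base_url=BASE_URL):
    """Build the POST request that the curl command above sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text and total token count out of a response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"], data["usage"]["total_tokens"]


def chat(model, prompt, timeout=60):
    """Send one prompt and return (reply_text, total_tokens)."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read())
```

With a model running, chat("Qwen/Qwen3-7B-Instruct", "Explain quantum computing in one paragraph.") should return the reply text and the token count. Switching models means passing a different first argument.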

Adding a Second Model

This is where GPUStack starts to shine. Without changing any running services, I deployed a second model on the other GPU.

Back in the web UI, click "Deploy Model" again and use these values:

  • Model Source: Hugging Face
  • Model ID: meta-llama/Llama-3.1-8B-Instruct
  • Backend: vLLM
  • GPU: GPU 1 (or auto-schedule if you have only one GPU)
  • Replicas: 1

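On my dual-RTX-4090 machine, each model got its own GPU. If you have a single GPU, both models would have to share it, and that does not work here: at FP16 (2 bytes per parameter) a 7B and an 8B model need roughly 14GB and 16GB for weights alone, which will not fit on a 24GB card. Query either model by changing the model name: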

curl http://localhost/v1-openai/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

The response format is identical. From the client side, switching models is a one-line change.

What I Struggled With

I promised honesty, so here are the problems I actually hit.

The default password. I mentioned this already, but it bears repeating. The installer prints a random password once. If you miss it, you reset it by running gpustack reset-password admin on the server. I did not know this command existed and spent 8 minutes reading source code before finding it in the CLI help.

Model download failures. My first attempt to download Llama-3.1-8B failed halfway through with a network timeout. GPUStack's retry restarted from the beginning instead of resuming. The workaround is to pre-download with huggingface-cli and point GPUStack at the local cache.
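
The pre-download step can also be scripted. This is a sketch under assumptions: it uses snapshot_download from the huggingface_hub package (the Python counterpart to huggingface-cli, installed separately with pip), which as far as I know reuses files already in the local cache when it is called again. The retry helper and its parameters are illustrative, not part of GPUStack.

```python
import time


def retry(fetch, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, then re-raise."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:  # deliberately broad for a sketch; narrow it in real use
            if attempt == attempts:
                raise
            # Wait a little longer after each failure.
            sleep(delay_seconds * attempt)


def predownload(repo_id):
    """Download a model repository into the local Hugging Face cache."""
    # Imported lazily: requires `pip install huggingface_hub`.
    from huggingface_hub import snapshot_download
    return retry(lambda: snapshot_download(repo_id=repo_id))
```

predownload("meta-llama/Llama-3.1-8B-Instruct") returns the local snapshot path, which you can then give to GPUStack as a local model source. Gated repositories such as Llama also need a Hugging Face login first.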

Context length defaults. GPUStack's vLLM backend defaults to a max-model-len of 4096 tokens. I did not realize this until a summarization task started truncating at 4K. The setting is buried in the "Advanced" tab. I had to redeploy with max-model-len set to 8192.

Port conflicts. GPUStack binds to port 80 by default. On a machine already running nginx, this fails silently — the service crashes in a loop. I found it by checking journalctl for "Address already in use." The fix is to edit /etc/gpustack/config.yaml and change the port to 8080.
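
You can test for this conflict before installing. A minimal sketch, assuming the conflicting service accepts TCP connections on the loopback address; the function name is made up for illustration.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return probe.connect_ex((host, port)) == 0
```

If port_in_use(80) is true on a machine where GPUStack is not installed yet, something else such as nginx already owns the port, and you should plan to use a different one.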

GPU scheduling confusion. When I set GPU to "Auto" with two models and one GPU, I expected GPUStack to queue them or load-balance. Instead, it tried to run both simultaneously and the second one failed with an OOM error. "Auto" means "pick any available GPU," not "manage resource contention." As of version 0.5.1, GPUStack does not check VRAM availability before scheduling.
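
Until the scheduler checks VRAM, a back-of-the-envelope estimate helps. The sketch below counts weights only, at 2 bytes per parameter for FP16, and reserves an assumed 10% of the card for everything else. Real usage is higher because vLLM also allocates KV cache, so treat a "fits" result as optimistic. The function names and the 10% figure are illustrative assumptions.

```python
def weights_gb(params_billion, bytes_per_param=2):
    """Approximate weight size in GB; FP16 uses 2 bytes per parameter."""
    return params_billion * bytes_per_param


def fits(models_billion, vram_gb, overhead_fraction=0.1):
    """Return True if the weights of all models fit in the usable VRAM.

    overhead_fraction is an assumed reserve, not a measured value.
    """
    needed = sum(weights_gb(size) for size in models_billion)
    return needed <= vram_gb * (1 - overhead_fraction)
```

For example, fits([7, 8], 24) is False because the two models need about 30GB for weights alone, while fits([7], 24) is True.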

Quick Comparison to Alternatives

GPUStack is not the only tool in this space. Here is how it compares to the options I considered.

vLLM alone: Maximum performance and flexibility, but you manage everything yourself. No web UI, no model registry, no automatic restarts. Best for production APIs where you have an ops team.

Ollama: Easier than GPUStack for a single-user setup. One command to install, built-in chat REPL. But no multi-model API management, no web dashboard, and no multi-GPU support for a single model. I still use Ollama on my laptop for quick experiments.

Text Generation Inference (TGI): Hugging Face's serving stack. Faster than vLLM for some workloads, but heavier to install. No built-in model management UI comparable to GPUStack's dashboard.

llama.cpp server: Excellent for CPU inference and edge devices. GPUStack uses llama.cpp as one of its backends. If you only run small models on CPU, llama.cpp alone is simpler.

Kubernetes + KServe: The enterprise choice. Massive overhead for a single machine. I would not recommend this unless you are running a cluster of at least four GPUs.

For my use case — a single workstation with 2-4 GPUs, serving 2-4 models to a small team — GPUStack hits the sweet spot. More structured than raw vLLM, more capable than Ollama, without the weight of Kubernetes.

Final Notes

My GPUStack instance has been running for three weeks without a crash. I have four models deployed: Qwen3-7B for general chat, Llama-3.1-8B for coding, a fine-tuned SQL model, and a small 3B fallback. Switching between them is a dropdown in the dashboard. The setup is not perfect — documentation could be clearer, download retry needs resume support, GPU scheduling could be smarter — but the core promise is real. It took me 23 minutes, and 3 of those were my own fault.

If you are managing local models through shell scripts and terminal tabs, GPUStack is worth the migration. Start with one model, get comfortable with the dashboard, and expand. The OpenAI-compatible API means your existing clients keep working once you change the base URL. One last tip: bookmark the dashboard and save the admin password. You will need both more often than you think.
