I wanted ChatGPT, but I wanted it to live in my house. Not on OpenAI's servers. Not subject to rate limits, content filters, or API bills that spike when I run a batch job. I wanted to open a browser tab, type a prompt, and get an answer from a model I controlled — with conversation history that persisted across sessions and access from my phone without a VPN.
This is the story of how I built that. It took six weeks, three complete rewrites of the middle layer, and one 2 AM debugging session where I realized my auth middleware was checking tokens against the wrong database table. The stack I ended up with is not the most elegant. It is the one that survived contact with reality.
My hardware: a single workstation with an RTX 4090 (24GB VRAM), 64GB DDR5 system RAM, a Ryzen 9 7950X, and a 2TB NVMe SSD. The machine sits in my office, draws about 450 watts under load, and cost roughly $3,400 to build. It runs Ubuntu 22.04 and has been online for five months.
Architecture Overview
Here is what the full stack looks like, from the browser down to the GPU:
┌─────────────────────────────────────────────┐
│ Browser / Phone / API Client │
│ (HTTPS via llm.mydomain.com) │
└──────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Cloudflare Tunnel (outbound, no open ports)│
└──────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ LobeChat (Docker, port 3210) │
│ - Chat UI, conversation history │
│ - Reads/writes to SQLite │
└──────────────┬──────────────────────────────┘
│ OpenAI-compatible API calls
▼
┌─────────────────────────────────────────────┐
│ FastAPI Wrapper (port 8001) │
│ - Token auth │
│ - Request validation & logging │
│ - Conversations stored in SQLite │
└──────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ vLLM (port 8000) │
│ - Qwen3-7B-Instruct (AWQ 4-bit) │
│ - PagedAttention, continuous batching │
└──────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ RTX 4090 (24GB VRAM) │
│ - Model weights: ~4.5GB (AWQ) │
│ - KV cache: up to ~16GB at 8K context │
└─────────────────────────────────────────────┘
Every layer has a reason for being there. I will explain each one, including what I tried first and why I switched.
Model Layer: vLLM with Qwen3-7B
I started with Ollama. It took 30 seconds to install and I had a working chat interface by dinner. But Ollama processes requests sequentially. When I opened two browser tabs and sent prompts simultaneously, the second one waited. When I tried to run a batch script against the API while chatting, the chat froze.
I switched to vLLM because of continuous batching. Under the hood, vLLM uses PagedAttention to share memory between requests and fill batches dynamically. With Ollama, two concurrent requests meant one waits. With vLLM, two concurrent requests means they run together, and the GPU utilization jumps from 35% to 90%.
I chose Qwen3-7B-Instruct because it punches above its weight. On my reasoning benchmarks, it outperformed Llama-3.1-8B and Mistral-7B on math and code tasks. The 7B size fits comfortably in VRAM even at FP16, but I run it in AWQ 4-bit to leave headroom for long contexts.
Here is the exact startup command I use, wrapped in a systemd service:
[Unit]
Description=vLLM API Server
After=network.target
[Service]
Type=simple
User=llm
WorkingDirectory=/home/llm
ExecStart=/home/llm/venv/bin/python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-7B-Instruct \
--quantization awq \
--max-model-len 8192 \
--gpu-memory-utilization 0.85 \
--port 8000 \
--host 127.0.0.1 \
--allowed-origins "*"
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
The --host 127.0.0.1 is intentional. vLLM does not face the internet directly. Only the FastAPI wrapper on port 8001 talks to it, and only Cloudflare Tunnel talks to the outside world. If something goes wrong at the edge, the model server is not exposed.
I tried GPTQ first because the files were smaller, but switched to AWQ after noticing a 12% accuracy drop on my math test set. AWQ costs slightly more VRAM but the quality is closer to FP16. For a 7B model, the difference is negligible. For a 70B model, I might reconsider.
API Layer: FastAPI Wrapper
vLLM's built-in OpenAI API server is solid, but it has no authentication, no request logging, and no conversation persistence. I needed all three. So I wrote a thin FastAPI wrapper that sits between LobeChat and vLLM.
I tried using vLLM directly at first. I put an nginx reverse proxy in front of it with basic auth and called it done. That lasted three days. The problem: LobeChat expects standard OpenAI error codes, and nginx basic auth returns 401 with an HTML body that breaks LobeChat's JSON parser. I also had no way to log which user sent which prompt, and no way to store conversation threads.
The FastAPI wrapper does four things:
- Token validation. Every request must include
Authorization: Bearer <token>. Tokens are SHA-256 hashes stored in SQLite. I have two: one for my laptop, one for my phone. If a token leaks, I revoke it in 30 seconds. - Request logging. Every prompt, response, token count, and timestamp goes into a
requeststable. I use this for debugging and for spotting anomalies — like the night my phone app got stuck in a loop and sent 200 requests in 10 minutes. - Conversation persistence. The wrapper maintains a
conversationstable. When LobeChat sends a thread ID, the wrapper fetches prior messages from SQLite, prepends them to the prompt, and sends the full context to vLLM. This means I can close a tab, reopen it tomorrow, and the conversation continues. - Error normalization. vLLM's error messages are informative for engineers and baffling for UIs. The wrapper catches common failures — OOM, model not found, context too long — and returns clean OpenAI-compatible JSON with human-readable
messagefields.
The core of the wrapper is about 180 lines of Python. Here is the simplified request handler:
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest, token: str = Header(None)):
if not validate_token(token):
raise HTTPException(status_code=401, detail="Invalid token")
# Load conversation history
messages = load_history(request.conversation_id)
messages.extend(request.messages)
# Forward to vLLM
async with httpx.AsyncClient() as client:
resp = await client.post(
"http://127.0.0.1:8000/v1/chat/completions",
json={"model": request.model, "messages": messages, **request.params},
timeout=120.0
)
# Log and store
log_request(token, request, resp)
if request.conversation_id:
append_to_conversation(request.conversation_id, messages, resp.json())
return resp.json()
I run this with uvicorn inside a virtual environment, also as a systemd service. Startup time is under 2 seconds. Memory footprint is about 80MB — negligible next to vLLM's 18GB.
UI Layer: LobeChat
I tried OpenWebUI first. It is beautiful, actively maintained, and has a huge feature set. I ran it for two weeks and then uninstalled it.
The problem was not OpenWebUI's quality. The problem was its complexity. It requires a full Node.js build, a PostgreSQL database for persistence, and a Redis cache for real-time features. On my single-GPU machine, that meant four additional services running alongside vLLM. The startup time from a cold boot was over 90 seconds. More importantly, OpenWebUI's model provider system assumes you are connecting to cloud APIs or Ollama. Getting it to talk to my FastAPI wrapper involved editing internal config files and rebuilding the frontend.
I switched to LobeChat because it is a static frontend. The Docker image is a prebuilt Next.js export. It starts in 3 seconds. It connects to any OpenAI-compatible endpoint by pasting a URL. And it stores conversations in localStorage by default, with an optional PostgreSQL backend if you need persistence.
My LobeChat Docker Compose file:
version: '3.8'
services:
lobechat:
image: lobehub/lobe-chat:latest
ports:
- "3210:3210"
environment:
- ACCESS_CODE=your-admin-password
- OPENAI_API_KEY=sk-local
- OPENAI_PROXY_URL=http://host.docker.internal:8001/v1
volumes:
- ./lobe-data:/app/.config/lobe-chat
restart: unless-stopped
The OPENAI_PROXY_URL points to my FastAPI wrapper. The ACCESS_CODE prevents random visitors from using the UI. I also override the model name in LobeChat's settings so it sends Qwen/Qwen3-7B-Instruct instead of gpt-3.5-turbo.
LobeChat is not perfect. The mobile view is usable but not polished. Plugin support is limited compared to OpenWebUI. But for my use case — a clean chat interface that talks to my local API — it is exactly the right weight.
Auth: Simple Token-Based
I considered OAuth. I even set up a Keycloak container and wired it into the FastAPI wrapper. It worked. It also added 400MB of JVM memory usage, a separate database for user sessions, and a login flow that took 8 seconds on mobile.
For a single-user system, OAuth is overkill. I tore it out after a week and replaced it with a simple token table in SQLite. Two columns: token_hash and description. I generate tokens with secrets.token_urlsafe(32), hash them with SHA-256, and store the hash. The plaintext token lives in my password manager and in environment variables on my devices.
There is no refresh flow, no session expiry, no password reset. If I need to revoke access, I delete the row. If I need to add a device, I insert a new row. The entire auth system is 12 lines of Python and one SQL table.
The only security concern is that tokens are bearer tokens — anyone who intercepts one can use it. I mitigate this by running everything over HTTPS via Cloudflare Tunnel, and by keeping vLLM itself on localhost so it is never directly exposed. For a personal deployment, this is sufficient. If I were sharing this with a team, I would add token expiry and IP allowlisting.
Persistent Memory: SQLite
Conversation history lives in SQLite. I started with PostgreSQL because it felt more "production," but maintaining a PostgreSQL container for a single-user chatbot was absurd. Backups were harder. Migrations were unnecessary. The connection pool added complexity with no benefit.
SQLite is a single file. I back it up with cp conversations.db conversations.db.backup before every model update. The schema is three tables:
CREATE TABLE tokens (
token_hash TEXT PRIMARY KEY,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE conversations (
id TEXT PRIMARY KEY,
title TEXT,
model TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE messages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
conversation_id TEXT REFERENCES conversations(id),
role TEXT,
content TEXT,
tokens_prompt INTEGER,
tokens_completion INTEGER,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
The database file is currently 14MB after five months of daily use. Query times are under 5 milliseconds. I have never had to optimize an index.
I did consider using LobeChat's built-in PostgreSQL support for persistence, but that would have tied my data to LobeChat's schema. Keeping conversations in my own SQLite database means I can swap UIs without losing history. If LobeChat breaks or I want to try something else, my data stays intact.
Reverse Proxy: Cloudflare Tunnel
I wrote a separate article about this setup, but the summary is: Cloudflare Tunnel creates an outbound connection from my server to Cloudflare's edge. No open ports. No public IP exposure. HTTPS and DDoS protection are free.
I tried port forwarding first. It worked for 20 minutes. Then I ran a port scan on my own public IP and found that port 8000 was already being probed by bots. I closed the port, installed cloudflared, and never looked back.
My tunnel config:
tunnel: <UUID>
credentials-file: /home/llm/.cloudflared/<UUID>.json
ingress:
- hostname: llm.mydomain.com
service: http://localhost:3210
- service: http_status:404
The tunnel points to LobeChat on port 3210. I do not expose the FastAPI wrapper or vLLM directly. If someone finds my subdomain, they get the LobeChat login page. They cannot reach the API without a valid token, and they cannot reach vLLM at all.
I also use Cloudflare Access as a second gate. Before LobeChat loads, Cloudflare shows an identity check. Only my email address is allowed through. This means an attacker needs to bypass Cloudflare's auth, guess my LobeChat admin password, and steal my API token before they can send a single prompt. I am comfortable with that risk profile.
Monitoring: What I Check
I do not run Prometheus or Grafana. For one GPU and one user, that is theater. Instead, I have three lightweight checks.
1. Netdata for resource alerts. I installed netdata with their kickstart script. It monitors GPU temperature, utilization, and VRAM usage. I get a Discord webhook alert if the GPU stays above 85C for more than 3 minutes, or if VRAM usage hits 95%. These thresholds have caught two issues: a runaway script that pinned the GPU at 100%, and a dust buildup that raised idle temps by 8 degrees.
2. A simple health endpoint. My FastAPI wrapper exposes GET /health, which checks that vLLM is responding and that the database connection is alive. I have a cron job on a VPS that pings this every 5 minutes. If it fails twice in a row, I get an email. In five months, it has fired once — during a CUDA driver update that required a reboot.
3. Weekly log review. Every Sunday morning, I run:
grep -i "error\|oom\|timeout" /var/log/llm-wrapper.log | tail -20
This takes 10 seconds. Most weeks, there is nothing. When there is something, it is usually a context-length exceeded error from a long paste I forgot about.
What I Abandoned and Why
GPUStack. I used GPUStack for a month and loved the web dashboard. But it added a full control plane I did not need. For a single model on a single GPU, GPUStack's scheduling and multi-model features were dead weight. I also hit a bug where GPUStack would not restart vLLM after a crash without manual intervention. I went back to raw vLLM with systemd.
OpenWebUI. Already covered, but worth repeating: it is a great tool for teams. For one person, the setup cost exceeds the benefit.
LangChain for the wrapper. I initially built the FastAPI wrapper using LangChain's OpenAI-compatible adapter. It added 300MB of dependencies and abstracted away the request flow to the point where debugging was painful. I rewrote it with plain httpx and pydantic. The new version is faster, smaller, and I can read every line and know what it does.
Redis for caching. I thought I would cache frequent prompts. In practice, I almost never send the exact same prompt twice. Redis sat idle for a week, consuming 100MB of RAM, before I removed it.
Docker for vLLM. I ran vLLM in Docker for the first month to avoid CUDA version conflicts. It worked, but passing the GPU through added friction, and container networking made localhost assumptions brittle. I switched to a system-wide virtual environment with pinned dependencies. One requirements.txt, one systemd service, no layers.
Total Resource Usage on RTX 4090
Here is what the system looks like under normal load — one user chatting, occasional API calls:
- vLLM: ~18GB VRAM, 15-45% GPU utilization during chat, 0% idle
- FastAPI wrapper: ~80MB RAM, 0% GPU
- LobeChat (Docker): ~120MB RAM, 0% GPU
- SQLite: ~15MB disk, negligible RAM
- Cloudflared: ~30MB RAM, 0% GPU
- Netdata: ~150MB RAM, 0% GPU
- System overhead: ~2GB RAM
Total system RAM usage: about 4GB out of 64GB. The remaining 60GB is available for large context windows, batch jobs, or running a second model if I ever need one.
GPU utilization during a typical chat session peaks at 60% and idles at 0%. The RTX 4090's fans do not even spin up for short prompts. Only sustained batch processing or long-context summarization pushes utilization above 80%.
Power draw: 45 watts idle, 320 watts under sustained load, 450 watts peak during model loading. At my electricity rate of $0.14 per kWh, running the server 8 hours a day costs about $0.55 daily, or $16 per month. For unlimited inference with no token limits, that is effectively free compared to cloud API pricing.
What I'd Do Differently
If I were building this again today, I would make three changes.
1. Start with the wrapper. I spent two weeks trying to make vLLM's built-in server do everything before accepting that I needed a middle layer. Writing the FastAPI wrapper took four hours. Those two weeks of fighting vLLM's limitations were wasted.
2. Use AWQ from day one. I started with FP16 because I was afraid of quality loss. Then I ran out of VRAM at 6K context and had to switch anyway. AWQ 4-bit gives me 95% of FP16 quality with 75% less memory. I should have benchmarked quantization methods before deploying, not after.
3. Document the token rotation process. I have two tokens right now. When I needed to revoke one last month, I could not remember which hash corresponded to which device. I had to revoke both and regenerate. A simple description field in the tokens table would have prevented this. I added it, but I should have designed for it from the start.
One thing I would not change: keeping it simple. Every time I added a tool that was "industry standard" — PostgreSQL, Redis, Keycloak, Docker for everything — the system got worse. The stack that works is the stack I can debug at 2 AM without reading documentation. SQLite, systemd, and a 180-line Python file are boring. They are also reliable.
Parting Notes
This stack is not a product. It is a personal tool that happens to be stable enough to share. If you are building something similar, resist the urge to over-engineer. You do not need Kubernetes for one GPU. You do not need OAuth for one user. You do not need a distributed database for 14MB of chat history.
Start with vLLM and a model that fits your VRAM. Add a thin API layer when you need auth or persistence. Pick a UI that gets out of your way. Put a tunnel in front of it so you can sleep. Then use it. The best stack is the one you actually run, not the one that looks impressive in a diagram.
My server has been running for five months. It has survived driver updates, power outages, and one accidental rm -rf in the wrong directory (backups saved me). The total maintenance time averages about 10 minutes per week. Most of that is reading logs and confirming that nothing is wrong. That is the goal: a system so boring that you forget it exists, until you open a browser tab and it answers your question.