How I Tried to Deploy Qwen3-235B on a Single RTX 4090 (And Why It Failed)
I thought 24GB of VRAM would be enough. I was wrong. Here's what happened when I tried to run the world's largest open MoE model on a consumer GPU.
Field notes from the deployment trenches — real failures, real numbers, real solutions
I thought 24GB of VRAM would be enough. I was wrong. Here's what happened when I tried to run the world's largest open MoE model on a consumer GPU.
I started with Ollama because it took 30 seconds to get running. Three months later, I run everything on vLLM. Here's what changed my mind.
The math is simple: 7B parameters × 2 bytes = 14GB. So why does my 24GB card run out of memory at 8K context? The answer involves something NVIDIA doesn't advertise.
You don't need Kubernetes. You don't need Docker Compose. Just one binary, one config file, and twenty minutes. Here's the exact setup that worked for me.
I wanted to use my local model from my phone. I did NOT want to wake up to a $5000 cloud bill because someone found my open port. Cloudflare Tunnel saved me.
I was paying $127/month for OpenAI API calls. Then I bought a $3,800 server. Six months later, the math looks very different — but not in the way you'd expect.
No stack trace, no error code, just `Killed`. Three hours of printf debugging later, I found the culprit hiding in a log line I almost scrolled past.
The 17 items I now check before every deployment. Number 12 cost me a weekend. Number 9 still catches me occasionally.
The docs say "just set the API URL." They don't mention the CORS nightmare, the timeout defaults, or why your model shows up as "unavailable" for no reason.
Everyone talks about speed. Nobody talks about the quality cliff at INT4. I ran the same prompts through every quantization method and measured what actually degrades.
Six error messages, three red herrings, and one DNS propagation that took 47 minutes. My complete Cloudflare Tunnel war journal.
Web UI, API layer, auth, and persistent memory — all running on a single RTX 4090. Here's what I built, what I abandoned, and what I'd do differently.