Posts — AI Deployment Notes

How I Tried to Deploy Qwen3-235B on a Single RTX 4090 (And Why It Failed)

I thought 24GB of VRAM would be enough. I was wrong. Here's what happened when I tried to run one of the largest open MoE models on a consumer GPU.

8 min read

Comparison

vLLM vs Ollama: A Hands-On Comparison After 3 Months of Daily Use

I started with Ollama because it took 30 seconds to get running. Three months later, I run everything on vLLM. Here's what changed my mind.

10 min read

Deep Dive

The VRAM Lie: Why Your 24GB GPU Cannot Run a 7B Model at Full Context

The math is simple: 7B parameters × 2 bytes (FP16) = 14GB. So why does my 24GB card run out of memory at 8K context? The answer involves something NVIDIA doesn't advertise.

8 min read
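
The weights-only arithmetic in that teaser can be extended with a rough sketch. The teaser does not say what the post's answer is, so this only illustrates one commonly cited contributor, the KV cache. The model shape (32 layers, 32 KV heads, head dimension 128) is an assumption modeled on a Llama-2-7B-style architecture, not a figure from the post, and activation and framework overhead are ignored.

```python
# Rough VRAM estimate: FP16 weights plus KV cache.
# All shape values below are ASSUMPTIONS (Llama-2-7B-like), not from the post.

GIB = 1024 ** 3


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (2 bytes per parameter at FP16)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Memory for cached keys and values across all layers."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


weights = weight_bytes(7e9)
kv_one = kv_cache_bytes(32, 32, 128, context_len=8192)
kv_four = kv_cache_bytes(32, 32, 128, context_len=8192, batch_size=4)

print(f"weights:              {weights / GIB:.2f} GiB")
print(f"KV cache, 1 sequence: {kv_one / GIB:.2f} GiB")
print(f"KV cache, 4 sequences: {kv_four / GIB:.2f} GiB")
print(f"total, 4 sequences:   {(weights + kv_four) / GIB:.2f} GiB")
```

Under these assumptions, a single 8K sequence still fits on a 24GB card, but four concurrent 8K sequences do not, before counting any activation or allocator overhead.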

Tutorial

GPUStack Setup Guide: From Zero to Local API in 20 Minutes

You don't need Kubernetes. You don't need Docker Compose. Just one binary, one config file, and twenty minutes. Here's the exact setup that worked for me.

7 min read

Tutorial

How to Expose Your Local LLM to the Internet Without Getting Pwned

I wanted to use my local model from my phone. I did NOT want to wake up to a $5,000 cloud bill because someone found my open port. Cloudflare Tunnel saved me.

9 min read

Field Notes

I Migrated from Cloud AI to Local LLM: A Real Cost Breakdown After 6 Months

I was paying $127/month for OpenAI API calls. Then I bought a $3,800 server. Six months later, the math looks very different — but not in the way you'd expect.

8 min read
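
A hardware-only break-even for the two figures in that teaser can be sketched as below. The `monthly_running_cost` parameter and its example value are hypothetical placeholders (for things like electricity), not numbers from the post, and the post's own breakdown may count costs differently.

```python
def break_even_months(hardware_cost: float, monthly_cloud_cost: float,
                      monthly_running_cost: float = 0.0) -> float:
    """Months until the hardware purchase is paid off by avoided cloud spend."""
    saving = monthly_cloud_cost - monthly_running_cost
    if saving <= 0:
        # Running the server costs as much as the cloud bill: never breaks even.
        return float("inf")
    return hardware_cost / saving


# Figures from the teaser: $3,800 server, $127/month in API calls.
hardware_only = break_even_months(3800, 127)

# HYPOTHETICAL: $30/month in running costs, purely for illustration.
with_running_cost = break_even_months(3800, 127, monthly_running_cost=30)

print(f"hardware only:      {hardware_only:.1f} months")
print(f"with running costs: {with_running_cost:.1f} months")
```

On hardware cost alone, the payback period is far longer than six months, so any verdict at the six-month mark has to rest on more than the monthly bill.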

Field Notes

Fixing vLLM OOM Errors: Reading the Logs Like a Detective

No stack trace, no error code, just `Killed`. Three hours of printf debugging later, I found the culprit hiding in a log line I almost scrolled past.

7 min read

Tutorial

Docker + vLLM: The Complete Deployment Checklist I Wish I Had on Day One

The 17 items I now check before every deployment. Number 12 cost me a weekend. Number 9 still catches me occasionally.

8 min read

Tutorial

Connecting LobeChat to Your Local Model: The Missing Manual

The docs say "just set the API URL." They don't mention the CORS nightmare, the timeout defaults, or why your model shows up as "unavailable" for no reason.

6 min read

Deep Dive

Quantization Deep Dive: FP16, INT8, GPTQ, AWQ — What Actually Matters

Everyone talks about speed. Nobody talks about the quality cliff at INT4. I ran the same prompts through every quantization method and measured what actually degrades.

10 min read

Field Notes

Cloudflare Tunnel for Local AI: Every Issue I Hit and How I Fixed It

Six error messages, three red herrings, and one DNS propagation that took 47 minutes. My complete Cloudflare Tunnel war journal.

7 min read

Tutorial

Building a Local ChatGPT Alternative: My Full Stack from Scratch

Web UI, API layer, auth, and persistent memory — all running on a single RTX 4090. Here's what I built, what I abandoned, and what I'd do differently.

9 min read