AI Deployment Notes
  • Home
  • Toolkit
    • vLLM Command Generator
    • VRAM Estimator
    • Cost Calculator
    • Common Errors: Causes & Fixes
    • Deployment Advisor
  • Posts
    • Tutorial
    • Troubleshooting
    • Comparison
  • About
  • FAQ
  • Contact

Posts

Field notes from the deployment trenches — real failures, real numbers, real solutions

Field Notes

How I Tried to Deploy Qwen3-235B on a Single RTX 4090 (And Why It Failed)

I thought 24GB of VRAM would be enough. I was wrong. Here's what happened when I tried to run one of the largest open MoE models on a consumer GPU.

8 min read

Comparison

vLLM vs Ollama: A Hands-On Comparison After 3 Months of Daily Use

I started with Ollama because it took 30 seconds to get running. Three months later, I run everything on vLLM. Here's what changed my mind.

10 min read

Deep Dive

The VRAM Lie: Why Your 24GB GPU Cannot Run a 7B Model at Full Context

The math is simple: 7B parameters × 2 bytes (FP16) = 14GB. So why does my 24GB card run out of memory at 8K context? The answer involves something NVIDIA doesn't advertise.

8 min read
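The weights-only arithmetic in the teaser above can be extended with a rough sketch. One common contributor to out-of-memory errors at long context is the KV cache. That is an assumption here, since this listing does not say what the post itself blames. The layer count, KV head count, and head dimension below are hypothetical values for a generic 7B-class model with full multi-head attention, and the 24 GiB budget ignores activations, the CUDA context, and fragmentation.

```python
# Rough VRAM sketch: FP16 weights plus KV cache.
# The architecture numbers are assumptions for a generic 7B-class model
# with full multi-head attention; they are not taken from the post.

GIB = 1024 ** 3


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (2 bytes per parameter at FP16)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2,
                   n_sequences: int = 1) -> int:
    """Keys and values (the factor of 2) per layer, per KV head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * n_sequences


def max_sequences(budget_bytes: float, weights: float, kv_per_seq: int) -> int:
    """How many full-length sequences fit after the weights are loaded."""
    return int((budget_bytes - weights) // kv_per_seq)


weights = weight_bytes(7e9)                    # 14e9 bytes, about 13.0 GiB
kv_one = kv_cache_bytes(32, 32, 128, 8192)     # exactly 4 GiB per sequence
fits = max_sequences(24 * GIB, weights, kv_one)

print(f"weights: {weights / GIB:.2f} GiB")
print(f"KV cache per 8K sequence: {kv_one / GIB:.2f} GiB")
print(f"8K sequences that fit in 24 GiB: {fits}")
```

Under these assumptions, only two 8K-token sequences fit next to the weights, and that is before any runtime overhead. Models that use grouped-query attention have fewer KV heads, so their cache per token is smaller than this sketch suggests.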
Tutorial

GPUStack Setup Guide: From Zero to Local API in 20 Minutes

You don't need Kubernetes. You don't need Docker Compose. Just one binary, one config file, and twenty minutes. Here's the exact setup that worked for me.

7 min read

Tutorial

How to Expose Your Local LLM to the Internet Without Getting Pwned

I wanted to use my local model from my phone. I did NOT want to wake up to a $5,000 cloud bill because someone found my open port. Cloudflare Tunnel saved me.

9 min read

Field Notes

I Migrated from Cloud AI to Local LLM: A Real Cost Breakdown After 6 Months

I was paying $127/month for OpenAI API calls. Then I bought a $3,800 server. Six months later, the math looks very different — but not in the way you'd expect.

8 min read
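As a back-of-the-envelope companion to the two figures quoted in the teaser above ($127/month and $3,800), here is a minimal break-even sketch. The monthly running cost of the local server is a hypothetical placeholder. This listing does not give the post's actual electricity or maintenance numbers, so the result illustrates the arithmetic only, not the post's conclusion.

```python
from typing import Optional


def break_even_months(hardware_cost: float, cloud_monthly: float,
                      local_monthly: float = 0.0) -> Optional[float]:
    """Months until the hardware pays for itself, or None if it never does."""
    saving = cloud_monthly - local_monthly
    if saving <= 0:
        return None  # running locally costs at least as much per month
    return hardware_cost / saving


# Figures quoted in the teaser: $3,800 server versus $127/month of API calls.
naive = break_even_months(3800, 127)

# Hypothetical $25/month for power; this number is an assumption.
with_power = break_even_months(3800, 127, local_monthly=25)

print(f"ignoring running costs: {naive:.1f} months")
print(f"with assumed power cost: {with_power:.1f} months")
```

Even with zero running costs, the quoted figures put break-even at roughly 30 months, well past the six-month mark the post covers.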
Field Notes

Fixing vLLM OOM Errors: Reading the Logs Like a Detective

No stack trace, no error code, just `Killed`. Three hours of printf debugging later, I found the culprit hiding in a log line I almost scrolled past.

7 min read

Tutorial

Docker + vLLM: The Complete Deployment Checklist I Wish I Had on Day One

The 17 items I now check before every deployment. Number 12 cost me a weekend. Number 9 still catches me occasionally.

8 min read

Tutorial

Connecting LobeChat to Your Local Model: The Missing Manual

The docs say "just set the API URL." They don't mention the CORS nightmare, the timeout defaults, or why your model shows up as "unavailable" for no apparent reason.

6 min read

Deep Dive

Quantization Deep Dive: FP16, INT8, GPTQ, AWQ — What Actually Matters

Everyone talks about speed. Nobody talks about the quality cliff at INT4. I ran the same prompts through every quantization method and measured what actually degrades.

10 min read

Field Notes

Cloudflare Tunnel for Local AI: Every Issue I Hit and How I Fixed It

Six error messages, three red herrings, and one DNS propagation that took 47 minutes. My complete Cloudflare Tunnel war journal.

7 min read

Tutorial

Building a Local ChatGPT Alternative: My Full Stack from Scratch

Web UI, API layer, auth, and persistent memory — all running on a single RTX 4090. Here's what I built, what I abandoned, and what I'd do differently.

9 min read

Hands-on guides for local LLM deployment, covering real-world troubleshooting with vLLM, GPUStack, Qwen, and more.

Field Notes · Pitfall Guides · Daily Updates



© 2026 AI Deployment Notes. All rights reserved.
