vLLM + Qwen3 Complete Deployment Guide: From Zero to API Service
A step-by-step guide to installing vLLM, downloading models, configuring parameters, launching services, and managing multiple models with GPUStack.
The minimum requirement is an 8GB VRAM GPU (e.g., RTX 3060 Ti), which can run 7B parameter quantized models (AWQ/GPTQ 4-bit). For smooth 13B model operation or running multiple models simultaneously, 24GB VRAM (RTX 3090/4090) is recommended. My current setup uses dual RTX 4090s, which can run Qwen3-7B and Qwen3-Coder-7B in parallel.
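A quick way to sanity-check those numbers: weight memory is roughly parameters × bits per parameter ÷ 8, plus runtime overhead. A back-of-envelope sketch in Python (the 20% overhead factor for KV cache and activations is a rough assumption, not a measured figure):

```python
def estimate_vram_gb(params_b: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """Back-of-envelope VRAM estimate: weight memory plus a rough
    ~20% overhead factor (assumed) for KV cache and activations."""
    weight_gb = params_b * bits_per_param / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead

# 7B model at 4-bit (AWQ/GPTQ): ~4.2 GB -> fits on an 8 GB card
print(f"7B @ 4-bit:  {estimate_vram_gb(7, 4):.1f} GB")
# 13B model at FP16: ~31 GB -> needs 24 GB+ and then some
print(f"13B @ 16-bit: {estimate_vram_gb(13, 16):.1f} GB")
```

Keep in mind the flat overhead factor understates long-context workloads, where the KV cache grows with context length and concurrency.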
vLLM is a production-grade inference engine with high concurrency, continuous batching, and PagedAttention—ideal for API service scenarios. Ollama prioritizes developer experience with simple installation and intuitive commands, perfect for personal experimentation. In short: choose vLLM for services, Ollama for quick trials. For the full breakdown, see this in-depth comparison.
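Because vLLM serves an OpenAI-compatible API, switching a service over is mostly a base-URL change. A minimal client sketch, assuming a vLLM server already running on localhost:8000 with a Qwen3 model (the port and model name below are placeholders to adjust for your deployment):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the api_key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # must match the model the server was launched with
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```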
Five ways to reduce VRAM usage: ① Use AWQ/GPTQ 4-bit quantized models (saves 50-60% VRAM); ② Reduce max-model-len (e.g., from 8192 to 4096); ③ Enable FlashAttention-2 (saves 10-15%); ④ Lower gpu-memory-utilization (e.g., from 0.9 to 0.8); ⑤ Use multi-GPU tensor parallelism (tensor-parallel-size). For detailed steps, see GPU VRAM Optimization Guide.
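Most of these knobs map directly onto vLLM engine arguments. A minimal sketch using vLLM's Python API (the Qwen/Qwen3-8B-AWQ checkpoint name is illustrative—check what is actually published on Hugging Face; ③ is typically controlled by the attention backend selection rather than a constructor flag):

```python
from vllm import LLM

# Knobs ①, ②, ④, and ⑤ from the list above; the same names
# work as --flags when launching the vLLM API server from the CLI.
llm = LLM(
    model="Qwen/Qwen3-8B-AWQ",     # ① illustrative 4-bit AWQ checkpoint
    quantization="awq",
    max_model_len=4096,            # ② shorter context -> smaller KV cache
    gpu_memory_utilization=0.8,    # ④ leave headroom below the 0.9 default
    # tensor_parallel_size=2,      # ⑤ uncomment to shard across two GPUs
)
print(llm.generate("Hello")[0].outputs[0].text)
```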
We recommend Cloudflare Tunnel (free, secure, with HTTPS). No public IP needed, no router changes, no port exposure. I wrote a complete tutorial here. Alternatives include frp, ngrok, and Peanut Shell, but Cloudflare Tunnel wins on stability and security.
It depends on usage. My calculation: a dual RTX 4090 server costs about ¥35,000 upfront, but at 5,000 requests/day it saves ¥2,000-3,000 per month compared to the OpenAI API, so the hardware pays for itself in roughly a year. If you make fewer than 100K requests/month, the API is more cost-effective. See the full comparison at Local vs Cloud: Full Comparison.
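The payback arithmetic is just upfront cost divided by monthly savings. A tiny sketch plugging in the figures above (my ballpark numbers, not universal prices):

```python
def payback_months(upfront_cost: float, monthly_savings: float) -> float:
    """Months until self-hosted hardware pays for itself vs. API spend."""
    return upfront_cost / monthly_savings

# Figures from the paragraph above: ~¥35,000 for a dual RTX 4090 server,
# ¥2,000-3,000/month saved at ~5,000 requests/day.
for savings in (2000, 3000):
    print(f"¥{savings}/month saved -> payback in {payback_months(35000, savings):.1f} months")
```

At the low end of the savings range the payback stretches closer to 18 months, so the one-year figure assumes the higher savings estimate.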