FAQ
Common questions about local LLM deployment, updated regularly
Getting Started
What GPU do I need for local LLM deployment?
The practical minimum is an 8GB VRAM card (e.g., RTX 3060 Ti), which can run 7B models quantized to 4-bit with AWQ or GPTQ. For smooth 13B operation or running multiple models simultaneously, 24GB of VRAM (RTX 3090/4090) is recommended. My current setup uses dual RTX 4090s, allowing me to run Qwen3-7B and Qwen3-Coder-7B in parallel.
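As a rough sizing rule (an approximation, not a guarantee): model weights take about params × bits / 8 bytes, plus headroom for the KV cache and activations. A minimal sketch, where the 20% overhead factor is an assumption of mine rather than a measured value:

```python
# Rough VRAM estimate: weights + headroom for KV cache and activations.
# The 20% overhead factor is an assumption, not a measured value.

def estimate_vram_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Approximate VRAM in GB for a model with `params_b` billion parameters."""
    weights_gb = params_b * bits / 8  # e.g., 7B at 4-bit ~ 3.5 GB of weights
    return weights_gb * overhead      # headroom for KV cache and activations

for params, bits in [(7, 4), (7, 16), (13, 4)]:
    print(f"{params}B @ {bits}-bit ~ {estimate_vram_gb(params, bits):.1f} GB")
# 7B @ 4-bit  ~ 4.2 GB  -> fits an 8 GB card with room for context
# 7B @ 16-bit ~ 16.8 GB -> needs a 24 GB card
# 13B @ 4-bit ~ 7.8 GB  -> tight on 8 GB once the KV cache grows
```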
Can I run models on CPU without a GPU?
Yes, but it will be slow. A 7B model on CPU achieves roughly 2-5 tokens/s, while an RTX 4090 reaches 30-40 tokens/s. For testing or infrequent use, CPU solutions like llama.cpp work fine. For production or coding assistant scenarios, a GPU is strongly recommended.
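For a quick CPU test, here is a minimal sketch using the llama-cpp-python bindings; the GGUF filename is a placeholder for whatever model file you have downloaded:

```python
# Minimal CPU-only inference with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen2-7b-instruct-q4_k_m.gguf",  # placeholder: any local GGUF file
    n_ctx=4096,    # context window
    n_threads=8,   # match your physical core count
)

out = llm("Explain PagedAttention in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```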
Can I deploy locally on a Mac?
Yes. A Mac Studio (M2 Ultra, 64GB unified memory) can smoothly run 7B-13B models using Ollama or llama.cpp with the Metal backend. However, Macs don't support CUDA, so CUDA-only frameworks like vLLM won't work. For Apple Silicon, Ollama is recommended: it installs in one click and automatically accelerates inference on the GPU via Metal.
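If you prefer scripting over the CLI, Ollama also ships a Python client. A minimal sketch, assuming you have already pulled a model (the qwen2:7b tag is just an example):

```python
# Querying a local Ollama server from Python (pip install ollama).
# Assumes you have already run `ollama pull qwen2:7b`.
import ollama

resp = ollama.chat(
    model="qwen2:7b",
    messages=[{"role": "user", "content": "Summarize Metal vs CUDA in two sentences."}],
)
print(resp["message"]["content"])
```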
Model Selection
Qwen, Llama, or ChatGLM — which should I choose?
For Chinese-language tasks, Qwen is the top choice. It has better Chinese corpus coverage, a more complete open-source ecosystem on HuggingFace (0.5B to 110B), and superior long-context Chinese support. Llama 3 excels in English but needs extra fine-tuning for Chinese. ChatGLM is decent but updated less frequently than Qwen.
Is there a big difference between 7B and 13B models?
For casual conversation, the gap is small. But for complex reasoning and code generation, 13B is noticeably stronger. In benchmarks, Qwen2-7B scores 42% on HumanEval while Qwen2-14B reaches 58%. If you mainly use AI for coding or technical docs, go for 13B. For simple Q&A and chat, 7B is sufficient and saves VRAM.
Deployment & Optimization
What's the difference between vLLM and Ollama?
vLLM is a production-grade inference engine with high concurrency, continuous batching, and PagedAttention — ideal for API serving. Ollama prioritizes developer experience with simple installation and intuitive commands, great for personal experimentation. In short: choose vLLM for services, Ollama for quick trials. See our detailed comparison for more.
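One practical consequence: both expose an OpenAI-compatible HTTP API, so client code can stay identical while you swap backends. A sketch assuming default ports (8000 for `vllm serve`, 11434 for Ollama) and an example model name:

```python
# Both vLLM and Ollama speak the OpenAI API; only the base URL changes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")   # vLLM
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")    # Ollama

resp = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",  # model name as registered by the server
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```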
What if I run out of VRAM?
Five ways to reduce memory usage: 1) Use AWQ/GPTQ 4-bit quantized models (saves 50-60% VRAM); 2) Reduce max-model-len (e.g., from 8192 to 4096); 3) Enable FlashAttention-2 (saves 10-15%); 4) Lower gpu-memory-utilization (e.g., from 0.9 to 0.8); 5) Use multi-GPU tensor parallelism. Check our GPU memory optimization guide for step-by-step instructions.
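For reference, here is how points 1, 2, 4, and 5 map onto vLLM's offline Python API; the model name is an example, and the same values work as flags on `vllm serve` (--quantization, --max-model-len, --gpu-memory-utilization, --tensor-parallel-size):

```python
# Sketch: memory-saving knobs from the list above, in vLLM's offline API.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct-AWQ",  # 1) 4-bit AWQ weights (example model)
    quantization="awq",
    max_model_len=4096,                  # 2) shorter context -> smaller KV cache
    gpu_memory_utilization=0.8,          # 4) leave headroom vs. the 0.9 default
    tensor_parallel_size=2,              # 5) shard the model across two GPUs
)
```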
Why is my model slow after starting?
Slowness usually has three causes: 1) Cold start: loading a large model from disk to VRAM takes 10-30 seconds, which is normal; 2) the first request triggers CUDA graph capture, adding 3-5 seconds of latency, with subsequent requests being faster; 3) if it stays slow, check whether --enforce-eager is enabled (it disables CUDA graphs and slows generation by 20-30%). Also verify your GPU is in a PCIe x16 slot, as bandwidth bottlenecks can hurt performance.
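A quick way to tell warmup from a real problem is to time the same request twice; if the second call is much faster, you were only paying startup costs. A sketch against an assumed OpenAI-compatible endpoint on port 8000:

```python
# Time two identical requests: a large gap means warmup, not steady-state slowness.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

for i in range(2):
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="Qwen/Qwen2-7B-Instruct",  # example model name
        messages=[{"role": "user", "content": "Say hi."}],
        max_tokens=16,
    )
    print(f"request {i + 1}: {time.perf_counter() - t0:.2f}s")
```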
Networking & Remote Access
How can I access my local model from outside without a public IP?
Cloudflare Tunnel is recommended — it's free, secure, and provides HTTPS. No public IP, router changes, or exposed ports needed. Check our full tutorial here. Alternatives include frp, ngrok, and Peanut Shell, but Cloudflare Tunnel wins on stability and security.
Is tunneling safe?
It depends on the method. Direct port forwarding (DMZ) is the riskiest — scanners probe constantly. Cloudflare Tunnel is relatively safe because traffic routes through Cloudflare edge nodes without exposing your origin IP, and includes DDoS protection. Even so, we recommend: 1) adding API access keys; 2) rate-limiting requests; 3) monitoring logs for suspicious IPs.
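As an illustration of point 1, here is a minimal key-checking proxy you could put between the tunnel and the model server. The framework choice (FastAPI + httpx), the key handling, and the non-streaming simplification are all assumptions of this sketch, not a prescribed setup:

```python
# Minimal API-key gate in front of a local model server (non-streaming only).
# Run with: uvicorn proxy:app --port 9000, then point the tunnel at :9000.
import os

import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
API_KEY = os.environ.get("LLM_PROXY_KEY", "change-me")  # set a real secret
UPSTREAM = "http://localhost:8000"  # your vLLM/Ollama instance

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    # Reject anything without the expected bearer token.
    if request.headers.get("authorization") != f"Bearer {API_KEY}":
        raise HTTPException(status_code=401, detail="invalid key")
    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(
            f"{UPSTREAM}/v1/chat/completions",
            content=await request.body(),
            headers={"content-type": "application/json"},
        )
    return upstream.json()
```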
Cost & Value
Is self-hosting cheaper than API services?
It depends on usage. My calculation: a dual RTX 4090 server costs about ¥35,000 upfront, but at 5,000 requests/day it saves ¥2,000-3,000 per month compared to the OpenAI API, paying for itself in about a year. If you make fewer than 100,000 requests monthly, APIs are more cost-effective. See our local vs. cloud comparison for full details.
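To make the arithmetic explicit, a back-of-the-envelope sketch using the figures above (assumed to be in RMB):

```python
# Payback check with the figures from this answer (all RMB).
hardware = 35_000        # dual RTX 4090 server, upfront
monthly_saving = 2_500   # midpoint of the 2,000-3,000/month estimate
monthly_power = 86       # from the electricity question below

net_monthly = monthly_saving - monthly_power
print(f"payback: {hardware / net_monthly:.1f} months")  # ~14.5 months, about a year
```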
Is the electricity bill expensive?
A dual RTX 4090 rig draws about 600W at full load. At ¥0.6 per kWh running 8 hours daily, that's roughly ¥86 per month. With GPUStack's auto-scheduling to unload models and downclock GPUs during idle periods, the actual monthly cost drops to around ¥50-60. Compared to cloud GPU rentals (e.g., AutoDL RTX 4090 at ~¥1.5/hour), local electricity is almost negligible.
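The arithmetic, for anyone double-checking:

```python
# 600 W at 8 h/day, ¥0.6 per kWh, 30-day month.
watts, hours_per_day, price_per_kwh = 600, 8, 0.6
kwh_per_month = watts / 1000 * hours_per_day * 30      # 144 kWh
print(f"~¥{kwh_per_month * price_per_kwh:.0f}/month")  # ~¥86
```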
Still Have Questions?
If your question isn't answered here, send me a message through the contact page and I'll do my best to reply. You can also browse all posts for more scenario-specific troubleshooting notes.