tech
2026-03-07
Running Llama and Qwen locally. No cloud. No API keys. No telemetry.
Privacy. Your prompts never leave your machine. No logs, no training on your data, no terms of service that change on Tuesdays.
Offline. Works on a plane. Works in a bunker. Works when AWS goes down and half the internet remembers it's centralized.
Free. No $20/month subscription. No rate limits. No "you've exceeded your quota." The weights are open. The compute is yours.
llama.cpp — the engine. Pure C/C++, runs on CPU. No Python, no PyTorch, no CUDA required. Loads GGUF model files and runs inference. Compiles on anything with a C compiler.
Ollama — the wrapper. Downloads models, manages them, exposes an API. Think of it as Docker for LLMs. One binary, one command, done.
Ollama uses llama.cpp under the hood. If you want control — use llama.cpp directly. If you want convenience — use Ollama.
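If you pick the llama.cpp route, building from source is short. A sketch assuming a recent checkout (the repo path and target names may drift over time; `cmake` and a C/C++ toolchain are the only prerequisites):

```shell
# Clone and build llama.cpp (CPU-only works out of the box; no CUDA needed).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Binaries (llama-cli, llama-server, ...) land in build/bin/.
```

On macOS the Metal backend is enabled by default, so Apple Silicon GPUs help without extra flags.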
Two families worth knowing right now:
Llama 3.x (Meta) — the default choice. Well-rounded, good at English, huge community. Llama 3.1 8B is the workhorse. Llama 3.3 70B if you have the RAM.
Qwen 2.5 / Qwen 3 (Alibaba) — strong at code, math, and multilingual tasks. Qwen 2.5 7B punches above its weight. Qwen 3 just dropped with MoE variants: the 235B flagship activates only 22B params per token.
Both come in GGUF format — the standard for local inference. Quantized versions shrink model size at minimal quality loss:
| Quantization | Size (7B) | Quality |
|---|---|---|
| Q8_0 | ~7 GB | Near-original |
| Q5_K_M | ~5 GB | Sweet spot |
| Q4_K_M | ~4 GB | Good enough |
| Q2_K | ~3 GB | Noticeable loss |
Q4_K_M is where most people land. Good balance of size and coherence.
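Those sizes are just arithmetic: parameters × bits-per-weight ÷ 8. The bits/weight figures below are approximations, not numbers from the table (Q8_0 is exactly 8.5 by construction; the K-quant rates are ballpark effective averages):

```shell
# Estimate GGUF file size for a 7B model: params * bpw / 8 bytes.
# bpw values are approximate effective rates per quantization.
for q in "Q8_0 8.5" "Q5_K_M 5.7" "Q4_K_M 4.85" "Q2_K 3.35"; do
  set -- $q
  awk -v name="$1" -v bpw="$2" \
    'BEGIN { printf "%-7s ~%.1f GB\n", name, 7e9 * bpw / 8 / 1e9 }'
done
```

The results land within rounding distance of the table above.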
You don't need a GPU. CPU inference is slow but works.
Speed depends on memory bandwidth more than clock speed. Apple Silicon (M1/M2/M3) is surprisingly good at this — unified memory means the GPU gets to help. A MacBook Air with 16 GB RAM runs a 7B model at 20-30 tokens/second.
On a typical x86 laptop with 16 GB: expect 5-15 tok/s for 7B models. Usable. Not fast. But it's running on your machine.
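Those numbers are easy to sanity-check: generation is memory-bound, so every token streams roughly the whole model through RAM, which puts a ceiling of tok/s ≈ bandwidth ÷ model size. The bandwidth figures here are illustrative assumptions (dual-channel laptop ≈ 50 GB/s, M2 ≈ 100 GB/s, M3 Max ≈ 400 GB/s), not measurements:

```shell
# tok/s ceiling ≈ memory bandwidth (GB/s) / bytes read per token (≈ model size).
model_gb=4.3                      # 7B at Q4_K_M
for hw in "x86-laptop 50" "M2-Air 100" "M3-Max 400"; do
  set -- $hw
  awk -v name="$1" -v bw="$2" -v m="$model_gb" \
    'BEGIN { printf "%-10s ~%d tok/s max\n", name, bw / m }'
done
```

Real throughput lands below the ceiling once compute and cache effects kick in, but it explains why a fanless MacBook Air keeps pace with a desktop CPU.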
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Run a model:
ollama run llama3
ollama run qwen2.5
That's it. First run downloads the model (~4 GB for 7B Q4). Then you're talking to it.
API access on localhost:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain GGUF in one paragraph"
}'
Ollama also exposes an OpenAI-compatible API under /v1. Point any tool that speaks the OpenAI API at http://localhost:11434/v1 and it just works.
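For example, hitting the chat endpoint with plain curl (this assumes Ollama is already serving on the default port and `llama3` has been pulled; the fallback echo just covers the server being down):

```shell
payload='{"model": "llama3",
  "messages": [{"role": "user", "content": "Explain GGUF in one paragraph"}]}'

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "is Ollama running?"
```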
Here's the idea: a USB stick with everything you need. Plug in, run, unplug. No installation.
# Format a fast USB stick (64 GB+, USB 3.0)
# Copy ollama binary + models
usb/
├── ollama                        # single binary, ~30 MB
├── models/
│   ├── llama3-8b-q4_k_m.gguf    # ~4.3 GB
│   └── qwen2.5-7b-q4_k_m.gguf   # ~4.5 GB
└── run.sh
The run.sh:
#!/bin/bash
export OLLAMA_MODELS="$(dirname "$0")/models"
export OLLAMA_HOST=127.0.0.1:11434
./ollama serve &
sleep 2
./ollama run llama3
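One wrinkle with run.sh as written: the backgrounded `ollama serve` keeps running after you quit the chat. A sketch of a self-cleaning variant, assuming the same USB layout:

```shell
#!/bin/bash
cd "$(dirname "$0")"
export OLLAMA_MODELS="$PWD/models"
export OLLAMA_HOST=127.0.0.1:11434
./ollama serve &
server_pid=$!
trap 'kill "$server_pid" 2>/dev/null' EXIT   # stop the server when the script exits
sleep 2
./ollama run llama3
```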
Or skip Ollama entirely — llama.cpp's llama-cli is a single binary too:
./llama-cli -m models/llama3-8b-q4_k_m.gguf \
  -p "You are a helpful assistant." \
  --interactive --color -ngl 0
Local 7B models are good at a surprising range of everyday tasks. They're not GPT-4: they hallucinate more, struggle with complex reasoning, and have smaller context windows. But for 80% of daily tasks they're enough. And they're yours.
The 70B models close the gap significantly. If you have 64 GB RAM or a GPU with 48 GB VRAM — Llama 3.3 70B is genuinely impressive.
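Those hardware figures follow from the same size arithmetic as the 7B table: 70B parameters at roughly 4.85 bits/weight (Q4_K_M, approximate) is about 42 GB of weights alone, before KV cache and runtime overhead; hence the 48 GB VRAM / 64 GB RAM bar.

```shell
# Weights-only footprint for a 70B model at Q4_K_M (~4.85 bpw, approximate).
awk 'BEGIN {
  gb = 70e9 * 4.85 / 8 / 1e9
  printf "70B @ Q4_K_M: ~%.0f GB of weights\n", gb
}'
```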
Two years ago, running a language model locally meant a research lab and six figures in hardware. Today it's a laptop and a terminal command.
The weights are open. The tools are free. The only cost is electricity and patience.
$ ollama run llama3
>>> _