tech
2026-03-07
Running Llama and Qwen locally. No cloud. No API keys. No telemetry.
Privacy. Your prompts never leave your machine. No logs, no training on your data, no terms of service that change on Tuesdays.
Offline. Works on a plane. Works in a bunker. Works when AWS goes down and half the internet remembers it's centralized.
Free. No $20/month subscription. No rate limits. No "you've exceeded your quota." The weights are open. The compute is yours.
llama.cpp — the engine. Pure C/C++, runs on CPU. No Python, no PyTorch, no CUDA required. Loads GGUF model files and runs inference. Compiles on anything with a C compiler.
Ollama — the wrapper. Downloads models, manages them, exposes an API. Think of it as Docker for LLMs. One binary, one command, done.
Ollama uses llama.cpp under the hood. If you want control — use llama.cpp directly. If you want convenience — use Ollama.
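If you pick the llama.cpp route, building from source is short. A sketch assuming a recent checkout (the repo path and target names may drift over time; `cmake` and a C/C++ toolchain are the only prerequisites):

```shell
# Clone and build llama.cpp (CPU-only works out of the box; no CUDA needed).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Binaries (llama-cli, llama-server, ...) land in build/bin/.
```

On macOS the Metal backend is enabled by default, so Apple Silicon GPUs help without extra flags.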
Two families worth knowing right now:
Llama 3.x (Meta) — the default choice. Well-rounded, good at English, huge community. Llama 3.1 8B is the workhorse. Llama 3.3 70B if you have the RAM.
Qwen 2.5 / Qwen 3 (Alibaba) — strong at code, math, and multilingual tasks. Qwen 2.5 7B punches above its weight. Qwen 3 just dropped with MoE variants: the 235B flagship activates only 22B params per token.
Both come in GGUF format — the standard for local inference. Quantized versions shrink model size at minimal quality loss:
| Quantization | Size (7B) | Quality |
|---|---|---|
| Q8_0 | ~7 GB | Near-original |
| Q5_K_M | ~5 GB | Sweet spot |
| Q4_K_M | ~4 GB | Good enough |
| Q2_K | ~3 GB | Noticeable loss |
Q4_K_M is where most people land. Good balance of size and coherence.
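Those sizes are just arithmetic: parameters × bits-per-weight ÷ 8. The bits/weight figures below are approximations, not numbers from the table (Q8_0 is exactly 8.5 by construction; the K-quant rates are ballpark effective averages):

```shell
# Estimate GGUF file size for a 7B model: params * bpw / 8 bytes.
# bpw values are approximate effective rates per quantization.
for q in "Q8_0 8.5" "Q5_K_M 5.7" "Q4_K_M 4.85" "Q2_K 3.35"; do
  set -- $q
  awk -v name="$1" -v bpw="$2" \
    'BEGIN { printf "%-7s ~%.1f GB\n", name, 7e9 * bpw / 8 / 1e9 }'
done
```

The results land within rounding distance of the table above.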
You don't need a GPU. CPU inference is slow but works.
Speed depends on memory bandwidth more than clock speed. Apple Silicon (M1/M2/M3) is surprisingly good at this — unified memory means the GPU gets to help. A MacBook Air with 16 GB RAM runs a 7B model at 20-30 tokens/second.
On a typical x86 laptop with 16 GB: expect 5-15 tok/s for 7B models. Usable. Not fast. But it's running on your machine.
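Those numbers are easy to sanity-check: generation is memory-bound, so every token streams roughly the whole model through RAM, which puts a ceiling of tok/s ≈ bandwidth ÷ model size. The bandwidth figures here are illustrative assumptions (dual-channel laptop ≈ 50 GB/s, M2 ≈ 100 GB/s, M3 Max ≈ 400 GB/s), not measurements:

```shell
# tok/s ceiling ≈ memory bandwidth (GB/s) / bytes read per token (≈ model size).
model_gb=4.3                      # 7B at Q4_K_M
for hw in "x86-laptop 50" "M2-Air 100" "M3-Max 400"; do
  set -- $hw
  awk -v name="$1" -v bw="$2" -v m="$model_gb" \
    'BEGIN { printf "%-10s ~%d tok/s max\n", name, bw / m }'
done
```

Real throughput lands below the ceiling once compute and cache effects kick in, but it explains why a fanless MacBook Air keeps pace with a desktop CPU.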
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Run a model:
ollama run llama3
ollama run qwen2.5
That's it. First run downloads the model (~4 GB for 7B Q4). Then you're talking to it.
API access on localhost:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain GGUF in one paragraph"
}'
Ollama also exposes an OpenAI-compatible API under /v1. Point any tool that speaks the OpenAI API at http://localhost:11434/v1 and it just works.
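For example, hitting the chat endpoint with plain curl (this assumes Ollama is already serving on the default port and `llama3` has been pulled; the fallback echo just covers the server being down):

```shell
payload='{"model": "llama3",
  "messages": [{"role": "user", "content": "Explain GGUF in one paragraph"}]}'

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "is Ollama running?"
```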
Here's the idea: a USB stick with everything you need. Plug in, run, unplug. No installation.
# Format a fast USB stick (64 GB+, USB 3.0)
# Copy ollama binary + models
usb/
├── ollama                        # single binary, ~30 MB
├── models/
│   ├── llama3-8b-q4_k_m.gguf    # ~4.3 GB
│   └── qwen2.5-7b-q4_k_m.gguf   # ~4.5 GB
└── run.sh
The run.sh:
#!/bin/bash
export OLLAMA_MODELS="$(dirname "$0")/models"
export OLLAMA_HOST=127.0.0.1:11434
./ollama serve &
sleep 2
./ollama run llama3
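One wrinkle with run.sh as written: the backgrounded `ollama serve` keeps running after you quit the chat. A sketch of a self-cleaning variant, assuming the same USB layout:

```shell
#!/bin/bash
cd "$(dirname "$0")"
export OLLAMA_MODELS="$PWD/models"
export OLLAMA_HOST=127.0.0.1:11434
./ollama serve &
server_pid=$!
trap 'kill "$server_pid" 2>/dev/null' EXIT   # stop the server when the script exits
sleep 2
./ollama run llama3
```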
Or skip Ollama entirely — llama.cpp's llama-cli is a single binary too:
./llama-cli -m models/llama3-8b-q4_k_m.gguf \
  -p "You are a helpful assistant." \
  --interactive --color -ngl 0
Local 7B models are good at a surprising range of everyday tasks. They're not GPT-4: they hallucinate more, struggle with complex reasoning, and have smaller context windows. But for 80% of daily tasks they're enough. And they're yours.
The 70B models close the gap significantly. If you have 64 GB RAM or a GPU with 48 GB VRAM — Llama 3.3 70B is genuinely impressive.
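Those hardware figures follow from the same size arithmetic as the 7B table: 70B parameters at roughly 4.85 bits/weight (Q4_K_M, approximate) is about 42 GB of weights alone, before KV cache and runtime overhead; hence the 48 GB VRAM / 64 GB RAM bar.

```shell
# Weights-only footprint for a 70B model at Q4_K_M (~4.85 bpw, approximate).
awk 'BEGIN {
  gb = 70e9 * 4.85 / 8 / 1e9
  printf "70B @ Q4_K_M: ~%.0f GB of weights\n", gb
}'
```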
Two years ago, running a language model locally meant a research lab and six figures in hardware. Today it's a laptop and a terminal command.
The weights are open. The tools are free. The only cost is electricity and patience.
$ ollama run llama3
>>> _