Running Llama 3 Locally with Ollama
Free LLMs on your machine: no API costs, full control, offline capable
Published July 1, 2026
•
8 min read
TL;DR: Download Ollama (5 min), run `ollama pull llama3:13b`, then `ollama run llama3:13b`. Chat with a powerful LLM for free, locally.
Why Run Models Locally?
- 🔒 Privacy: Your data never leaves your machine
- 💰 Free: No API costs after initial download
- ⚡ Fast: No network latency
- 🔌 Offline: Works without internet after download
- 🎮 Full control: Modify behavior, use custom prompts, batch processing
Step 1: Install Ollama
Mac (M1/M2/M3)
# Download and install from ollama.ai
# Or use Homebrew:
brew install ollama
Linux
curl -fsSL https://ollama.ai/install.sh | sh
Windows
Download installer from ollama.ai
Step 2: Download Llama 3
# Start Ollama service (runs in background)
ollama serve
# In another terminal, pull the model
ollama pull llama3:13b
# Or the larger version
ollama pull llama3:70b
First download takes 5-15 minutes depending on internet speed. Ollama caches models locally.
| Model | Size | RAM | Speed |
|---|---|---|---|
| llama3:7b | 3.7GB | 8GB | ~50 tokens/sec (MacBook) |
| llama3:13b | 7.4GB | 16GB | ~30 tokens/sec |
| llama3:70b | 40GB | 64GB | ~5 tokens/sec |
Step 3: Run and Chat
ollama run llama3:13b
You're now in an interactive chat. Type and press Enter:
>> Explain quantum computing in one sentence
Quantum computers harness the strange properties of quantum
mechanics—where particles exist in multiple states simultaneously—
to perform computations exponentially faster than classical computers
for specific problem types.
>>> /bye # Exit the chat
Advanced: Use Ollama Programmatically
Python
import requests
import json
def query_ollama(prompt):
response = requests.post("http://localhost:11434/api/generate",
json={
"model": "llama3:13b",
"prompt": prompt,
"stream": False
}
)
return response.json()["response"]
answer = query_ollama("What's 2+2?")
print(answer)
cURL
curl http://localhost:11434/api/generate -d '{
"model": "llama3:13b",
"prompt": "Why is the sky blue?",
"stream": false
}'
Other Models Available
Ollama supports many open-source models:
- Llama:
ollama pull llama3:70borollama pull llama2:13b - Mistral:
ollama pull mistral:latest - Qwen:
ollama pull qwen:latest - Neural Chat:
ollama pull neural-chat:latest
Performance Tips
- Mac M1/M2: Models run with GPU acceleration. 13B is recommended sweet spot.
- Linux: Add NVIDIA GPU support: `ollama run ollama/ollama:latest-gpu`
- Windows: Limited GPU support as of 2026. CPU-only recommended.
- Reduce latency: Smaller models (7B) are 2-3x faster than 13B.
- Batch requests: Send multiple prompts to amortize startup cost.
Learn More: Local LLM Architecture
Master local LLM deployment with our comprehensive course covering:
- Ollama setup and optimization
- Custom model fine-tuning locally
- Building production pipelines
- Cost analysis: cloud vs. local
- Real-world case studies
Master Local LLM Architecture
Learn to build, deploy, and optimize LLMs on your own hardware.
Start Local LLM Course →