Table of Contents
Key Takeaways
- Ollama hit 52M monthly downloads in Q1 2026 — the fastest path to running open-weight LLMs locally on your own hardware.
- Minimum viable setup: 16GB RAM and a mid-range GPU run Llama 3.3 8B comfortably; 32GB+ unlocks Gemma 2 27B and Qwen 2.5 32B.
- Install is one command on Mac, Linux, or Windows — then `ollama run llama3.3` pulls the model and drops you into a chat.
- Best 2026 local models: Llama 3.3, Gemma 2, Qwen 2.5, Mistral Small 3, and DeepSeek-Coder for programming tasks.
- Privacy, offline access, and zero per-token cost make Ollama ideal for sensitive work or heavy experimentation.
Want to run powerful AI models on your own computer — completely free, with no API keys, no usage limits, and zero data sent to the cloud? Learning how to use Ollama locally is the fastest way to get there. Ollama is an open-source tool that lets you download and run large language models (LLMs) like Llama 3, Mistral, and Gemma 2 directly on your machine. Whether you’re on Windows, Mac, or Linux, this guide walks you through every step: installing Ollama, pulling your first model, running it from the terminal, and choosing the right model for your needs. No cloud subscription required — just your hardware and a few commands.

What Is Ollama and Why Run AI Models Locally?
Ollama is a free, open-source runtime that simplifies downloading, managing, and running LLMs on your local machine. Think of it as Docker — but for AI models. With a single command you can pull a model, and with another you can start chatting with it in your terminal or integrate it into your own apps via a local REST API.
There are three compelling reasons to run AI locally instead of relying on cloud services like ChatGPT or Claude:
- Privacy: Every prompt and response stays on your machine. Nothing is logged by a third party or used to train future models.
- Cost: Once the model is downloaded, inference is completely free — no per-token billing, no monthly subscription.
- Control: You choose which model to run, you can run it offline, and you can customize its behavior through system prompts or fine-tuning.
Ollama supports macOS, Linux, and Windows (currently in preview). It handles model quantization automatically, so even consumer hardware with 8 GB of RAM can run capable 7B-parameter models at a usable speed.
How to Install Ollama on Windows, Mac, and Linux

Installation takes under five minutes on any platform. Follow the steps for your operating system below.
macOS
- Open your browser and go to ollama.com/download.
- Click Download for macOS — this gives you a standard
.dmginstaller. - Open the downloaded file, drag Ollama into your Applications folder, and launch it.
- Ollama will appear in your menu bar. Open Terminal and verify the install:
ollama --version
Ollama runs as a background service on macOS. It supports both Apple Silicon (M1/M2/M3/M4) and Intel Macs, and it automatically uses the Metal GPU on Apple Silicon for significantly faster inference.
Linux
Linux installation is a single command. Open your terminal and run:
curl -fsSL https://ollama.com/install.sh | sh
The installer detects your distribution, installs the binary to /usr/local/bin/ollama, and sets up a systemd service that starts automatically on boot. It also detects NVIDIA or AMD GPUs and configures GPU acceleration if available. After installation, confirm it’s running:
ollama --version
systemctl status ollama
Windows
- Visit ollama.com/download/windows and download the
OllamaSetup.exeinstaller. - Run the installer — it installs Ollama and adds it to your system PATH automatically.
- Open PowerShell or Command Prompt and verify:
ollama --version
Windows support is currently in preview. NVIDIA GPU acceleration works via CUDA; AMD GPU support is improving. If you hit issues, ensure your GPU drivers are up to date and that you have the Visual C++ Redistributable installed.
Minimum system requirements (all platforms): 8 GB RAM (16 GB recommended for 7B models), 10–15 GB free disk space per model, GPU optional but strongly recommended for speed.
How to Run Your First AI Model with Ollama
Once Ollama is installed, running a model is a single command. Let’s start with Meta’s Llama 3 8B — one of the best open-weight models available in 2026.
Step 1: Pull and run Llama 3
ollama run llama3
This command downloads the Llama 3 8B model (~4.7 GB) the first time you run it, then drops you straight into an interactive chat session. Type your prompt and press Enter.
Step 2: Pull a model separately (without running it immediately)
ollama pull mistral
Step 3: List all models you have downloaded
ollama list
Step 4: Serve Ollama as a local REST API
If you want to integrate Ollama with your own apps, scripts, or tools like Open WebUI, start the API server:
ollama serve
This exposes a REST API on http://localhost:11434. You can send requests to it exactly like the OpenAI API, making it a drop-in local replacement for many tools. For example:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain how transformers work in simple terms"
}'
On macOS and Linux, Ollama already runs the server in the background automatically — you only need ollama serve if you stopped it manually. Check out our how-to guides for tutorials on building apps on top of the Ollama API.
Best AI Models to Run with Ollama in 2026
Ollama’s model library has grown substantially. Here’s a comparison of the most popular models available today, including their size, ideal use case, and RAM requirements:
| Model | Pull Command | Size | RAM Needed | Best For |
|---|---|---|---|---|
| Llama 3 8B | ollama run llama3 | 4.7 GB | 8 GB | General chat, summarization |
| Mistral 7B | ollama run mistral | 4.1 GB | 8 GB | Instruction following, fast replies |
| Phi-3 Mini | ollama run phi3 | 2.3 GB | 4 GB | Low-resource devices, quick tasks |
| Gemma 2 9B | ollama run gemma2 | 5.4 GB | 8 GB | Reasoning, Q&A, analysis |
| CodeLlama 7B | ollama run codellama | 3.8 GB | 8 GB | Code generation, debugging |
| DeepSeek-R1 7B | ollama run deepseek-r1 | 4.7 GB | 8 GB | Step-by-step reasoning, math |
Recommendation: Start with Llama 3 8B for general use — it’s the most well-rounded model at this size. If you’re on a machine with only 8 GB RAM, try Phi-3 Mini first; it’s surprisingly capable and much lighter. For coding tasks, CodeLlama is purpose-built and outperforms general models on code completion and debugging. If you want chain-of-thought reasoning similar to o1-class models, DeepSeek-R1 is worth exploring. You can find the full model library with all variants at ollama.com/library.
Common Questions — How to Use Ollama Locally
Do I need a GPU to run Ollama?
No — Ollama runs on CPU-only machines. However, a GPU makes a significant difference in speed. On a modern CPU, a 7B model might generate 5–10 tokens per second, which feels slow for conversation. An NVIDIA GPU (RTX 3060 or better) or Apple Silicon chip typically delivers 40–80+ tokens per second for the same model. For occasional use on a laptop, CPU-only is fine. For regular or production use, a dedicated GPU is worth it.
Is Ollama free to use?
Yes, completely. Ollama is open-source (MIT license) and free to download and use without any account, API key, or payment. The models it supports — Llama 3, Mistral, Gemma 2, etc. — are also open-weight models released free for personal and commercial use (check each model’s specific license). You only pay for the electricity and hardware you already own.
How is Ollama different from running models on Hugging Face or Google Colab?
Hugging Face and Colab require you to write Python code, manage dependencies, and often deal with CUDA configuration. Ollama abstracts all of that — it’s a single binary that handles model download, quantization, and inference with no Python environment needed. It’s also persistent: models stay on your machine ready to use anytime, with no session timeouts or idle shutdowns. For developers who want a dead-simple local AI backend, Ollama is hard to beat.
Can I use Ollama with a chat interface instead of the terminal?
Yes. Once Ollama is running, you can connect several open-source chat UIs to it. The most popular is Open WebUI (formerly Ollama WebUI), which gives you a ChatGPT-like browser interface that talks to Ollama’s local API on port 11434. Other options include Chatbox, LM Studio (which has its own runtime but supports the same models), and VS Code extensions like Continue for AI-assisted coding. These tools require no additional configuration beyond pointing them to http://localhost:11434. For more developer tools and integrations, browse our Dev & IT Ops articles.
Conclusion
Running AI models locally with Ollama is now genuinely practical for everyday users. Installation takes under five minutes, models like Llama 3 and Mistral run well on standard laptops, and the privacy benefit — keeping every prompt on your own machine — is hard to overstate. Whether you need a general-purpose assistant, a coding helper, or a reasoning engine, there’s an open-weight model ready to pull and run right now.
To go further, explore our collection of step-by-step how-to guides for tutorials on building apps on top of the Ollama API, and check the Dev & IT Ops section for guides on self-hosting AI tools, setting up GPU servers, and integrating local models into your development workflow.
Last Updated: April 13, 2026








