OpenClaw + Ollama: Best VPS for Running Local AI Models
OpenClaw connects to cloud APIs like Anthropic and OpenAI by default, but there is another option: Ollama. By running Ollama on the same VPS as OpenClaw, you can serve open-source models like Llama 3, Mistral, Gemma, and Phi-3 locally — with zero API costs, complete data privacy, and no rate limits. The trade-off is that you need significantly more RAM and CPU than a standard OpenClaw deployment, and inference speed will be slower than cloud providers. This guide covers exactly what hardware you need, which VPS providers deliver the best value for Ollama workloads, and when local models make sense versus paying for API access.
Why Run Ollama with OpenClaw?
The standard way to power OpenClaw is through cloud AI APIs. You send prompts to Anthropic or OpenAI, they process them on GPU clusters, and you pay per token. This works well, but it introduces three friction points: ongoing costs, data leaving your server, and rate limits during peak usage. Ollama eliminates all three.
With Ollama running locally on your VPS, every prompt stays on your machine. Your conversations, files, and tool outputs never leave the server. There are no monthly API bills that scale with usage — you pay a flat VPS fee regardless of how many tokens you generate. And there are no rate limits: your model serves requests as fast as the CPU can process them, 24/7.
This matters most for three use cases:
- Privacy-sensitive deployments — legal, medical, or financial assistants where data must not reach third-party servers
- High-volume automation — OpenClaw agents that run hundreds of tasks per day, where API costs would add up quickly
- Offline or air-gapped environments — servers without reliable internet access, or environments that require complete network isolation
The catch: open-source models like Llama 3 8B are not as capable as Claude 3.5 Sonnet or GPT-4o. For complex reasoning, code generation, and nuanced conversation, cloud APIs still win. But for straightforward tasks — summarization, classification, simple Q&A, and template-based responses — a well-tuned 7B or 13B model running locally is more than adequate.
Model Requirements: RAM and Storage by Model Size
Ollama runs large language models entirely in system RAM (no GPU required on a VPS). The amount of RAM you need depends directly on the model's parameter count. Below is a breakdown of the most popular models supported by OpenClaw through Ollama:
| Model | Parameters | Download Size | Min RAM | Recommended RAM | Notes |
|---|---|---|---|---|---|
| Llama 3 8B | 8B | ~4 GB | 8 GB | 16 GB | Best all-around choice for VPS |
| Mistral 7B | 7B | ~4 GB | 8 GB | 16 GB | Strong reasoning, fast inference |
| Gemma 7B | 7B | ~4 GB | 8 GB | 16 GB | Google's open model, good at instructions |
| Phi-3 Mini | 3.8B | ~2 GB | 4 GB | 8 GB | Smallest option, limited capability |
| Llama 2 13B (Q4) | 13B | ~7 GB | 16 GB | 24 GB | Larger model, needs beefy VPS |
| Llama 3 70B | 70B | ~40 GB | 48+ GB | 64+ GB | Not practical on most VPS plans |
Key takeaway: For most OpenClaw + Ollama deployments, target a 7B parameter model and a VPS with at least 16 GB RAM. The model itself occupies 4–5 GB in memory, OpenClaw and the OS need another 2–3 GB, and the remaining RAM serves as buffer for context windows and concurrent requests. Running on 8 GB is possible but tight — you will experience swap usage and slower responses under load.
Storage is less of a concern. A single 7B model needs about 4 GB of disk space. Even with two or three models downloaded, 50 GB of SSD is more than sufficient. Prioritize RAM over storage when choosing a plan.
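The RAM arithmetic above is simple enough to script as a sanity check before picking a plan. The figures below are this guide's illustrative numbers (a Q4-quantized 7B model at ~5 GB resident, 3 GB for OpenClaw plus the OS), not measurements:

```shell
#!/bin/sh
# Rough RAM budget for a 7B model on a VPS, using the guide's figures.
model_gb=5       # Q4-quantized 7B model resident in memory
overhead_gb=3    # OpenClaw + OS, upper end of the 2-3 GB estimate

budget_gb() {
  total=$1
  headroom=$((total - model_gb - overhead_gb))
  if [ "$headroom" -ge 8 ]; then
    echo "$total GB: comfortable ($headroom GB headroom)"
  elif [ "$headroom" -ge 0 ]; then
    echo "$total GB: tight ($headroom GB headroom, expect swap under load)"
  else
    echo "$total GB: insufficient"
  fi
}

budget_gb 8    # the possible-but-tight minimum discussed above
budget_gb 16   # the recommended tier
```

Running this against the 8 GB and 16 GB tiers reproduces the table's verdicts: 8 GB leaves essentially no headroom, while 16 GB leaves a comfortable buffer for context windows and concurrent requests.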
Best VPS Providers for Ollama Workloads
Not all VPS providers are created equal for AI inference. You need high RAM allocations at reasonable prices, and ideally multi-core CPUs for faster token generation. Here are the top picks:
Contabo — Best Budget Option
Contabo is the go-to for raw specs per dollar. Their Cloud VPS 10 at $4.95/mo gives you 4 vCPU and 8 GB RAM — enough to run a 7B model, though it will be tight with OpenClaw running alongside. The better choice for Ollama is their Cloud VPS 20 at $8.99/mo with 6 vCPU and 16 GB RAM. At this tier, you get comfortable headroom for Llama 3 8B or Mistral 7B, plus room for OpenClaw, a database, and OS overhead. No other provider matches this price-to-RAM ratio.
Hetzner — Best ARM Value
Hetzner's Ampere ARM servers offer excellent price-to-performance for inference workloads. ARM CPUs handle Ollama's CPU-based inference well, and Hetzner's ARM instances tend to be 20–30% cheaper than equivalent x86 plans. Their cloud ARM options with 16 GB RAM provide a solid foundation for 7B models. Hetzner also has outstanding network performance across their European data centers, which matters if your OpenClaw agent interacts with external APIs.
Oracle Cloud — Best Free Option
Oracle Cloud's Always Free tier is remarkable: 4 ARM (Ampere A1) CPU cores and 24 GB RAM, completely free. This is more than enough to run Llama 3 8B with OpenClaw comfortably — you get the 16 GB recommended RAM plus 8 GB of headroom. The downside is availability: Oracle often restricts new free-tier sign-ups by region, and the free VMs can take days to provision. If you can get one, it is the best deal in existence for an Ollama workload. The 200 GB of free block storage also means you can download multiple models without worrying about disk space.
| Provider | Plan | vCPU | RAM | Storage | Price | Ollama Ready? |
|---|---|---|---|---|---|---|
| Contabo | Cloud VPS 10 | 4 | 8 GB | 75 GB NVMe | $4.95/mo | Tight (7B only, swap needed) |
| Contabo | Cloud VPS 20 | 6 | 16 GB | 150 GB NVMe | $8.99/mo | Yes (7B comfortable) |
| Hetzner | ARM CAX21 | 4 | 8 GB | 80 GB SSD | ~$6/mo | Tight (7B only) |
| Hetzner | ARM CAX31 | 8 | 16 GB | 160 GB SSD | ~$12/mo | Yes (7B comfortable) |
| Oracle Cloud | Always Free A1 | 4 ARM | 24 GB | 200 GB | Free | Yes (7B–13B capable) |
ARM vs x86: Does It Matter for Ollama?
Short answer: no. Ollama has full support for ARM64 (aarch64) processors, including the Ampere A1 cores used by Oracle Cloud and Hetzner's ARM instances. The inference engine (llama.cpp under the hood) compiles natively for ARM and takes advantage of ARM's NEON SIMD instructions for matrix operations.
In practice, ARM and x86 CPUs of similar core count and clock speed produce comparable tokens-per-second for quantized models. ARM instances are often cheaper per GB of RAM, making them a better value proposition for Ollama workloads where RAM is the primary bottleneck. There is no need to specifically seek out x86 instances unless you have other software on the VPS that requires it.
One consideration: some quantization formats (like AVX2-optimized GGUF files) are tuned for x86. Ollama handles this transparently by selecting the appropriate format, but if you manually download model files, ensure you pick ARM-compatible quantizations. Using ollama pull always downloads the correct version for your architecture.
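If you do fetch GGUF files by hand, the architecture check is a one-liner. The sketch below maps the kernel's uname -m string to the build family to look for; the labels are this guide's shorthand, not Ollama terminology, and ollama pull makes this check unnecessary:

```shell
#!/bin/sh
# Map the CPU architecture string to the GGUF build family to look for
# when downloading model files manually (ollama pull does this for you).
gguf_target() {
  case "$1" in
    x86_64)        echo "x86-64 (AVX2-optimized builds OK)" ;;
    aarch64|arm64) echo "ARM64 (NEON builds, as on Ampere A1)" ;;
    *)             echo "unknown: $1" ;;
  esac
}

gguf_target "$(uname -m)"
```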
Performance: What to Expect
Let's set realistic expectations. Running Llama 3 8B on a 4 vCPU VPS with 16 GB RAM, you can expect approximately 10–15 tokens per second. For context, that is roughly 7–11 words per second, or about one sentence every second or two — noticeably slower than cloud APIs (which typically return 50–80 tokens/sec), but perfectly acceptable for chat-style interactions and automated tasks.
Factors that affect inference speed:
- CPU core count — more cores = faster inference. Ollama parallelizes across available cores. Going from 2 to 4 vCPU roughly doubles throughput.
- RAM speed — memory bandwidth matters for loading model weights. DDR5-based servers will outperform DDR4 by 15–25%.
- Model quantization — Q4_K_M (4-bit quantized) models run faster and use less RAM than Q8 or FP16. The quality difference is minimal for most tasks.
- Context length — longer conversations require more computation per token. Keep context windows under 4096 tokens for best performance on a VPS.
- Concurrent requests — Ollama handles one request at a time by default. Multiple simultaneous OpenClaw agents will queue, not parallelize.
A practical benchmark: on Contabo's 6 vCPU / 16 GB plan running Mistral 7B (Q4_K_M), a typical OpenClaw task — reading a message, generating a 200-token response — completes in about 15–20 seconds. This is slower than the 2–3 seconds you would get from Claude or GPT-4o, but for background automation where latency is not user-facing, it works fine.
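That benchmark is consistent with back-of-the-envelope arithmetic: response tokens divided by tokens per second, plus a prompt-processing delay. The throughput and overhead figures below are illustrative assumptions, not measurements:

```shell
#!/bin/sh
# Estimate wall-clock seconds for a response: generation + prompt processing.
# Throughput (tok/s) and prompt overhead are illustrative, not measured.
est_seconds() {
  tokens=$1; tps=$2; prompt_overhead=$3
  echo $(( tokens / tps + prompt_overhead ))
}

# A 200-token response at ~12 tok/s with ~3 s of prompt processing:
est_seconds 200 12 3    # prints 19, inside the 15-20 s range above
```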
When to Use Ollama vs Cloud APIs
This is not an either-or decision. OpenClaw supports switching between providers, so you can use Ollama for some tasks and cloud APIs for others. Here is a practical decision framework:
Use Ollama (local models) when:
- Data privacy is a hard requirement and prompts must not leave your server
- You are running high-volume automation (100+ tasks/day) where API costs would exceed $20–50/mo
- Tasks are straightforward: summarization, classification, template filling, simple Q&A
- Your VPS has at least 16 GB RAM and 4 vCPU cores
- Response latency of 10–20 seconds is acceptable
Use cloud APIs (Claude, GPT-4o) when:
- Tasks require complex reasoning, multi-step planning, or creative writing
- You need fast response times (under 3 seconds) for user-facing interactions
- Your VPS has less than 8 GB RAM
- You run fewer than 50 tasks per day, keeping API costs under $5–10/mo
- You need the latest model capabilities (tool use, vision, long context)
The break-even point is roughly 100–150 tasks per day. Below that, cloud APIs are cheaper when you factor in the cost of a 16 GB RAM VPS ($9–12/mo) versus a minimal 2 GB VPS ($2–4/mo) plus moderate API usage. Above that volume, Ollama's flat-cost model wins decisively.
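The break-even math can be checked with rough numbers. The per-task API cost below is an assumption (0.2 cents for a short task — real costs vary widely by model and prompt size), and the VPS prices are the figures quoted above:

```shell
#!/bin/sh
# Monthly cost in whole dollars for each route, using the guide's prices.
# The 0.2-cent-per-task API cost is an assumed figure; adjust to taste.
cloud_monthly() {
  tasks_per_day=$1
  vps=3                                      # minimal 2 GB VPS, $/mo
  api=$(( tasks_per_day * 30 * 2 / 1000 ))   # tasks x $0.002 each
  echo $(( vps + api ))
}
ollama_monthly() { echo 9; }                 # 16 GB VPS flat fee, $/mo

cloud_monthly 50     # light usage: cloud stays cheaper
cloud_monthly 150    # past break-even: flat-fee Ollama wins
```

Under these assumptions the two routes cross at about 100 tasks per day, matching the break-even range above; a higher per-task cost pulls the crossover lower.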
Frequently Asked Questions
Can I run Ollama on a VPS with only 8 GB RAM?
Yes, but expect compromises. A 7B model like Llama 3 8B or Mistral 7B will load on 8 GB RAM, but with OpenClaw and the OS consuming 2–3 GB, you will have minimal headroom. The system will likely use swap, which slows inference to 5–8 tokens per second. For reliable performance, 16 GB is the recommended minimum. If you are budget-constrained, consider Phi-3 Mini (3.8B parameters), which runs comfortably on 8 GB.
Does Ollama require a GPU on a VPS?
No. Ollama runs entirely on CPU by default, which is exactly what standard VPS plans provide. GPU acceleration (via CUDA) dramatically improves speed, but GPU VPS instances cost $50–200+/mo, making them impractical for most OpenClaw users. CPU-only inference at 10–15 tokens/sec is sufficient for automated agent tasks where sub-second response times are not required.
How much does it cost to run OpenClaw with Ollama?
The total cost is just your VPS bill. There are no API fees, no per-token charges, and no usage limits. A capable setup (Contabo 6 vCPU / 16 GB RAM) costs $8.99/mo. Oracle Cloud's free tier with 24 GB RAM costs nothing. Compare this to cloud API costs of $10–50+/mo for moderate usage — Ollama pays for itself within the first month if you run more than 100 tasks per day.
Can I switch between Ollama and cloud APIs in OpenClaw?
Yes. OpenClaw supports multiple model providers simultaneously. You can configure it to use Ollama for routine tasks and fall back to Claude or GPT-4o for complex requests. This hybrid approach gives you the cost savings of local inference for 80% of tasks while preserving access to frontier model quality when you need it.
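On the local side, Ollama exposes an HTTP API on port 11434, which is what OpenClaw talks to. A quick way to sanity-check local inference by hand is a non-streaming /api/generate request; the helper below only builds the JSON body, and the model name assumes you have already pulled llama3:

```shell
#!/bin/sh
# Build a JSON body for Ollama's /api/generate endpoint.
gen_body() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# With Ollama running on its default port, send it with:
#   curl -s http://localhost:11434/api/generate \
#     -d "$(gen_body llama3:8b 'Summarize this log line')"
gen_body llama3:8b "Summarize this log line"
```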
Ready to find a VPS for OpenClaw + Ollama? Use our comparison tool to filter plans by RAM (16 GB+), CPU cores, and price — all the specs that matter for local AI inference.