Mac Mini M4 Hardware for Local AI Inference (Ollama/OpenClaw)
Research Date: April 15, 2026
Scope: Hardware specifications, benchmarks, and pricing for local LLM inference only (no training/fine-tuning)
Context: Guillaume Descoteaux-Isabelle plans to run OpenClaw/Ollama locally alongside an OpenAI Codex subscription
Key Findings
- Mac Mini ships with M4 and M4 Pro only — no M4 Max. If you need M4 Max (128GB, 546 GB/s bandwidth), you must buy a Mac Studio (~$2,499–$2,999).
- Memory bandwidth is THE bottleneck for LLM inference, not GPU cores or Neural Engine. M4: 120 GB/s, M4 Pro: 273 GB/s, M4 Max: 546 GB/s. This directly determines tokens/second.
- Scenario A (7B–14B models) is well-served by a Mac Mini M4 with 24GB RAM at $699–$999. Llama 3.1 8B Q4 runs at 28–35 tok/s (45–60 with the MLX backend); plenty of headroom for dev tools.
- Scenario B (40GB+ models like 70B Q4) requires Mac Mini M4 Pro with 64GB RAM at $1,999–$2,199. Llama 3.1 70B Q4_K_M runs at 6–8 tok/s — usable but not fast. The model alone needs ~38GB RAM.
- Ollama 0.19+ uses an MLX backend on Apple Silicon (default since March 2026), delivering up to 87% faster decode speeds than the old llama.cpp Metal backend.
- Unified memory is Apple Silicon's killer advantage: CPU and GPU share the full memory pool. No VRAM limit like discrete GPUs. A 64GB M4 Pro can load a 70B Q4 model that would require a $10,000+ multi-GPU setup on NVIDIA.
- Thermal throttling is a real concern for sustained inference on Mac Mini. Under 100% GPU/CPU load, stock cooling can throttle within 8–10 minutes, dropping performance 30–45%. External cooling solutions help significantly.
- macOS + dev tools (VS Code, Docker, browser) consume 8–14GB RAM, leaving ~14GB on a 24GB machine and ~52GB on a 64GB machine for models.
- Storage: budget 50–100GB for a typical collection of small/medium Ollama models (80–150GB once 70B-class models are in the mix). Individual models range from 1.7GB (Phi-2) to 38GB (Llama 70B Q4_K_M).
- M4 Pro 64GB Mac Mini at $1,999 is the best price/performance for serious local AI work. It covers 95% of use cases short of 70B+ FP16 models.
Mac Mini M4 Lineup (April 2026)
Important: The Mac Mini is available with M4 and M4 Pro chips only. There is no Mac Mini with M4 Max. For M4 Max, you need a Mac Studio ($2,499+) or MacBook Pro.
Chip Specifications
| Spec | M4 (Base) | M4 Pro (12-core) | M4 Pro (14-core) |
|---|---|---|---|
| CPU Cores | 10 (4P + 6E) | 12 (8P + 4E) | 14 (10P + 4E) |
| GPU Cores | 10 | 16 | 20 |
| Neural Engine | 16-core, 38 TOPS | 16-core, 38 TOPS | 16-core, 38 TOPS |
| Memory Bandwidth | 120 GB/s | 273 GB/s | 273 GB/s |
| Max Unified Memory | 32 GB | 64 GB | 64 GB |
| Max Storage | 2 TB | 8 TB | 8 TB |
| Thunderbolt | TB4 (×3) | TB5 (×3) | TB5 (×3) |
| Ethernet | Gigabit (10GbE opt.) | 10 Gigabit | 10 Gigabit |
Complete Pricing Table (Apple MSRP, USD)
Mac Mini M4 (Base Chip)
| RAM | Storage | Price (MSRP) |
|---|---|---|
| 16 GB | 256 GB | $599 |
| 16 GB | 512 GB | $799 |
| 24 GB | 256 GB | $699* |
| 24 GB | 512 GB | $999 |
| 24 GB | 1 TB | ~$1,049* |
| 32 GB | 512 GB | ~$1,199* |
*CTO (Configure to Order) pricing, approximate based on upgrade costs: +$200 for 24GB, +$400 for 32GB RAM; +$200 per SSD tier.
Mac Mini M4 Pro (12-core CPU / 16-core GPU — base Pro)
| RAM | Storage | Price (MSRP) |
|---|---|---|
| 24 GB | 512 GB | $1,399 |
| 24 GB | 1 TB | $1,599 |
| 48 GB | 512 GB | $1,799 |
| 48 GB | 1 TB | $1,999 |
| 64 GB | 512 GB | $1,999 |
| 64 GB | 1 TB | $2,199 |
| 64 GB | 2 TB | $2,599 |
Mac Mini M4 Pro (14-core CPU / 20-core GPU — upgraded Pro)
| RAM | Storage | Price (MSRP) |
|---|---|---|
| 24 GB | 1 TB | $1,799 |
| 48 GB | 1 TB | $2,199 |
| 64 GB | 1 TB | $2,399 |
The 14-core/20-core Pro is a CTO upgrade that adds ~$200 to the 12-core base.
Upgrade Cost Summary
| Upgrade | Cost |
|---|---|
| 24GB → 48GB RAM (Pro) | +$400 |
| 24GB → 64GB RAM (Pro) | +$600 |
| 16GB → 24GB RAM (base) | +$200 |
| 16GB → 32GB RAM (base) | +$400 |
| 512GB → 1TB SSD | +$200 |
| 512GB → 2TB SSD | +$600 |
| 1TB → 4TB SSD (Pro) | +$1,000 |
| 10GbE Ethernet (base) | +$100 |
RAM is soldered and cannot be upgraded after purchase. Choose wisely.
Apple Silicon Unified Memory: Why It Matters for LLMs
The Architecture Advantage
Traditional PC setups have separate CPU RAM and GPU VRAM. An NVIDIA RTX 4090 has only 24GB VRAM — a 70B Q4 model (38GB) simply won't fit without multi-GPU setups costing $10,000+.
Apple Silicon uses unified memory architecture (UMA):
- CPU and GPU share the same physical memory pool
- No data copying between CPU and GPU memory (zero-copy)
- The entire memory pool is accessible to the GPU for model weights
- A 64GB Mac Mini M4 Pro gives the GPU access to all 64GB — minus OS overhead
Memory Bandwidth Is King
LLM inference is memory-bandwidth-bound, not compute-bound. Every token generated requires reading the entire model's weights from memory. The formula:
Theoretical max tok/s ≈ Memory Bandwidth (GB/s) / Model Size in Memory (GB)
| Chip | Bandwidth | 7B Q4 (4GB) | 13B Q4 (8GB) | 34B Q4 (19GB) | 70B Q4 (38GB) |
|---|---|---|---|---|---|
| M4 | 120 GB/s | ~30 tok/s | ~15 tok/s | ~6 tok/s | N/A (RAM limit) |
| M4 Pro | 273 GB/s | ~68 tok/s | ~34 tok/s | ~14 tok/s | ~7 tok/s |
| M4 Max* | 546 GB/s | ~136 tok/s | ~68 tok/s | ~29 tok/s | ~14 tok/s |
*M4 Max not available in Mac Mini; requires Mac Studio.
Real-world performance is typically 60–80% of theoretical due to attention computation, KV cache, and system overhead.
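As a sanity check, both the ceiling and the realistic range can be computed directly from the tables above. A minimal Python sketch (the bandwidths and model sizes come from this section; the 60–80% derate is the rule of thumb quoted above, not a measured constant):

```python
# Decode speed is bandwidth-bound: each generated token streams the full
# set of quantized weights through memory once.
BANDWIDTH_GBPS = {"M4": 120, "M4 Pro": 273, "M4 Max": 546}
MODEL_SIZE_GB = {"7B Q4": 4, "13B Q4": 8, "34B Q4": 19, "70B Q4": 38}

def theoretical_tok_s(chip: str, model: str) -> float:
    return BANDWIDTH_GBPS[chip] / MODEL_SIZE_GB[model]

for chip in BANDWIDTH_GBPS:
    for model in MODEL_SIZE_GB:
        peak = theoretical_tok_s(chip, model)
        # Real-world throughput lands around 60-80% of the ceiling
        # (attention compute, KV cache traffic, system overhead).
        print(f"{chip:7s} {model}: peak ~{peak:5.1f} tok/s, "
              f"realistic ~{0.6 * peak:.1f}-{0.8 * peak:.1f} tok/s")
```

Note the M4 row for 70B is academic: the base chip tops out at 32GB of RAM, so a 38GB model cannot load at all.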
Memory Requirements by Model Size
RAM Needed for Inference (Q4_K_M Quantization)
| Model | Parameters | Quant | Model RAM | + KV Cache (4K ctx) | Total RAM Needed | Disk Space |
|---|---|---|---|---|---|---|
| Phi-2 | 2.7B | Q4 | ~1.7 GB | ~0.3 GB | ~2 GB | 1.7 GB |
| Phi-3 Mini | 3.8B | Q4 | ~2.3 GB | ~0.4 GB | ~3 GB | 2.3 GB |
| Llama 3.1 8B | 8B | Q4_K_M | ~4.5 GB | ~0.5 GB | ~5 GB | 4.6 GB |
| Mistral 7B | 7B | Q4_K_M | ~4.1 GB | ~0.5 GB | ~5 GB | 4.1 GB |
| CodeLlama 7B | 7B | Q4_K_M | ~4.0 GB | ~0.5 GB | ~5 GB | 4.0 GB |
| Llama 3.1 8B | 8B | Q8_0 | ~8.5 GB | ~0.5 GB | ~9 GB | 8.5 GB |
| CodeLlama 13B | 13B | Q4_K_M | ~7.9 GB | ~0.8 GB | ~9 GB | 7.9 GB |
| DeepSeek-Coder 6.7B | 6.7B | Q4_K_M | ~3.8 GB | ~0.5 GB | ~4 GB | 3.8 GB |
| Mixtral 8x7B (MoE) | 46.7B total (12.9B active) | Q4_K_M | ~18 GB | ~1.5 GB | ~20 GB | 26 GB |
| CodeLlama 34B | 34B | Q4_K_M | ~19 GB | ~1.2 GB | ~20 GB | 19 GB |
| DeepSeek-Coder 33B | 33B | Q4_K_M | ~19 GB | ~1.2 GB | ~20 GB | 19 GB |
| Llama 3.1 70B | 70B | Q4_K_M | ~38 GB | ~2.5 GB | ~41 GB | 38 GB |
| Llama 3.1 70B | 70B | Q8_0 | ~72 GB | ~2.5 GB | ~75 GB | 72 GB |
Notes:
- KV cache grows with context length. At 32K context, a 70B model's KV cache can reach ~10–20GB (see the estimator after these notes)
- TurboQuant (2026) compresses KV cache 5× with negligible quality loss, making 70B at 32K context viable on 64GB
- "Total RAM Needed" is the minimum — add OS + apps overhead (8–14GB) for real-world requirement
OS + Development Tools Memory Budget
| Component | Typical RAM | Heavy Usage |
|---|---|---|
| macOS system | 3–5 GB | 5–7 GB |
| VS Code / Cursor | 1–2 GB | 3–5 GB |
| Docker Desktop | 2–4 GB | 8–16 GB |
| Browser (10–20 tabs) | 1–4 GB | 5–8 GB |
| Terminal / misc | 0.5–1 GB | 1–2 GB |
| Total dev overhead | 8–14 GB | 16–30 GB |
Available RAM for Models by Configuration
| Mac Mini Config | Total RAM | OS + Dev Tools | Available for Models |
|---|---|---|---|
| M4, 16 GB | 16 GB | ~10 GB | ~6 GB (7B Q4 only) |
| M4, 24 GB | 24 GB | ~10 GB | ~14 GB (up to 13B Q4) |
| M4, 32 GB | 32 GB | ~10 GB | ~22 GB (up to 34B Q4 tight) |
| M4 Pro, 24 GB | 24 GB | ~10 GB | ~14 GB (up to 13B Q4) |
| M4 Pro, 48 GB | 48 GB | ~12 GB | ~36 GB (34B Q4 comfortable; 70B Q4 does not fit alongside dev tools) |
| M4 Pro, 64 GB | 64 GB | ~12 GB | ~52 GB (70B Q4_K_M with room to spare) |
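The table reduces to a simple budget check, which also explains the 48GB caveat in Scenario B below. A hypothetical helper (the function names and default overhead are illustrative, taken from the dev-tools table above):

```python
def available_for_models(total_ram_gb: float, overhead_gb: float = 10) -> float:
    """RAM left for model weights and KV cache after OS + dev tools."""
    return total_ram_gb - overhead_gb

def fits(total_ram_gb: float, model_ram_gb: float, kv_cache_gb: float,
         overhead_gb: float = 10) -> bool:
    return model_ram_gb + kv_cache_gb <= available_for_models(total_ram_gb, overhead_gb)

# 70B Q4_K_M (38GB weights + ~2.5GB KV cache) against the configs above:
print(fits(64, 38, 2.5, overhead_gb=12))  # True:  40.5GB <= 52GB available
print(fits(48, 38, 2.5, overhead_gb=12))  # False: 40.5GB >  36GB available
```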
Scenario A: Minimal Inference Setup
Goal
Run small/medium models (7B–14B parameters) via Ollama for coding assistance, chat, and general development, alongside standard dev tools.
Target Models
- Llama 3.1 8B / Llama 3.2 3B
- Mistral 7B / Mistral Nemo 12B
- Phi-3 Mini (3.8B) / Phi-3 Medium (14B)
- CodeLlama 7B / DeepSeek-Coder 6.7B
- Qwen 2.5 7B
Recommended Configuration
Mac Mini M4, 24GB RAM, 512GB SSD — $999 MSRP
| Spec | Detail |
|---|---|
| Chip | M4, 10-core CPU, 10-core GPU |
| RAM | 24 GB unified memory |
| Storage | 512 GB SSD (1 TB recommended for comfort: $1,049) |
| Bandwidth | 120 GB/s |
| Price | $999 (or $699 with 256GB SSD) |
What It Can Run
| Model | Size on Disk | RAM Used | Performance (tok/s) | Verdict |
|---|---|---|---|---|
| Llama 3.1 8B Q4_K_M | 4.6 GB | ~5 GB | 28–35 | ✅ Excellent |
| Mistral 7B Q4_K_M | 4.1 GB | ~5 GB | 28–35 | ✅ Excellent |
| CodeLlama 7B Q4_K_M | 4.0 GB | ~5 GB | 28–35 | ✅ Excellent |
| Phi-3 Mini Q4 | 2.3 GB | ~3 GB | 40–50 | ✅ Very fast |
| DeepSeek-Coder 6.7B Q4 | 3.8 GB | ~4 GB | 30–38 | ✅ Excellent |
| Qwen 2.5 7B Q4 | 4.0 GB | ~5 GB | 28–35 | ✅ Excellent |
| Llama 3.1 8B Q8_0 | 8.5 GB | ~9 GB | 18–22 | ✅ Good |
| CodeLlama 13B Q4_K_M | 7.9 GB | ~9 GB | 8–12 | ⚠️ Usable, slower |
| Mixtral 8x7B Q4_K_M | 26 GB | ~20 GB | N/A | ❌ Won't fit with dev tools |
With MLX Backend (Ollama 0.19+)
The MLX backend (default since March 2026) provides significant speedups on Apple Silicon (see the measurement sketch after this list):
- 7B Q4 models: 45–60 tok/s (vs 28–35 with old Metal backend)
- Smaller models (1B–3B): 200–460+ tok/s
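To reproduce these numbers on your own machine, Ollama's local REST API returns token counts and timings with every non-streamed request. A minimal sketch (assumes the `requests` package is installed and the model tag is one you have already pulled); it reports both the decode rate discussed here and the prefill rate covered in the benchmark section below:

```python
import requests

# Ollama's local server (default port 11434) includes eval_count and
# eval_duration (nanoseconds) in the final response when stream=False.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b",
          "prompt": "Explain unified memory in two sentences.",
          "stream": False},
).json()

prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
decode_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.1f} tok/s")
```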
Limitations
- Cannot run 30B+ models (insufficient RAM after OS/dev tools)
- 13B models work but leave little headroom for context or multitasking
- 512GB storage fills up if you download many models — consider 1TB
- Memory bandwidth (120 GB/s) is the main performance limiter vs M4 Pro
Budget Alternative
Mac Mini M4, 16GB RAM, 256GB SSD ($599) can run 7B Q4 models, but with only ~6GB free RAM after the OS it's tight: one model at a time, short context only. Not recommended for a developer who also runs Docker and a browser.
Scenario B: Evolved Large Model Setup
Goal
Run large models (~40GB Ollama models) including Llama 3.1 70B Q4, DeepSeek-Coder 33B, Mixtral 8x7B, alongside full development environment.
Target Models
- Llama 3.1 70B Q4_K_M (~38 GB on disk, ~41 GB RAM)
- DeepSeek-Coder 33B Q4_K_M (~19 GB on disk, ~20 GB RAM)
- Mixtral 8x7B Q4_K_M (~26 GB on disk, ~20 GB RAM)
- CodeLlama 34B Q4_K_M (~19 GB on disk, ~20 GB RAM)
Recommended Configuration
Mac Mini M4 Pro (12-core), 64GB RAM, 1TB SSD — $2,199 MSRP
| Spec | Detail |
|---|---|
| Chip | M4 Pro, 12-core CPU, 16-core GPU |
| RAM | 64 GB unified memory |
| Storage | 1 TB SSD (2 TB for model library: $2,599) |
| Bandwidth | 273 GB/s |
| Price | $2,199 |
Why Not 48GB?
A 48GB configuration ($1,799–$1,999) has enough RAM for the model itself (38GB weights + ~2.5GB KV cache ≈ 41GB), but after macOS + dev tools (~12GB) only ~36GB remains free, which is not enough. 64GB is required for 70B models when running alongside development tools.
48GB IS viable for: DeepSeek-Coder 33B, Mixtral 8x7B, CodeLlama 34B (all ~20GB loaded). These fit comfortably with dev tools on 48GB.
Is M4 Pro Enough, or Do You Need M4 Max?
M4 Pro (64GB) is sufficient for Scenario B — but with caveats:
| Factor | M4 Pro 64GB | M4 Max 128GB (Mac Studio) |
|---|---|---|
| Can load 70B Q4? | ✅ Yes (~41GB model + 12GB OS = 53GB) | ✅ Yes, with massive headroom |
| 70B tok/s | 6–8 tok/s | 14–20 tok/s |
| 70B at 32K context? | ⚠️ Tight — needs TurboQuant | ✅ Comfortable |
| Run 2 large models? | ❌ No room | ✅ Yes |
| Price | $2,199 | ~$2,999+ (Mac Studio) |
| Form factor | Mac Mini 5×5" | Mac Studio (larger) |
Verdict: M4 Pro 64GB is enough if you run one large model at a time with moderate context (<8K tokens default, up to 32K with TurboQuant). For heavy multi-model or long-context work, the Mac Studio M4 Max is the next step.
What It Can Run
| Model | Size on Disk | RAM Used | Performance (tok/s) | Verdict |
|---|---|---|---|---|
| Llama 3.1 70B Q4_K_M | 38 GB | ~41 GB | 6–8 | ✅ Usable for single-user chat |
| DeepSeek-Coder 33B Q4 | 19 GB | ~20 GB | 10–14 | ✅ Good for coding |
| Mixtral 8x7B Q4_K_M | 26 GB | ~20 GB | 12–18 | ✅ Good, MoE efficiency |
| CodeLlama 34B Q4_K_M | 19 GB | ~20 GB | 10–14 | ✅ Good for coding |
| Llama 3.1 8B Q4_K_M | 4.6 GB | ~5 GB | 35–45 | ✅ Blazing fast |
| Multiple 7B models concurrent | ~10 GB | ~12 GB | 25–35 each | ✅ Multiple models at once |
| Llama 3.1 70B Q8_0 | 72 GB | ~75 GB | N/A | ❌ Exceeds 64GB |
Storage Considerations
With large models, storage matters:
- Minimum 1TB SSD — a typical collection of 5–8 models (mix of 7B and 70B) needs 80–150GB
- 2TB recommended if you plan to keep many model variants downloaded
- Ollama stores models in `~/.ollama/models/` (see the sketch below to total its size)
- Models can be deleted and re-downloaded as needed
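`ollama list` already shows per-model sizes; to total the whole store at the path above, a quick sketch:

```python
import os

MODELS_DIR = os.path.expanduser("~/.ollama/models")

total_bytes = 0
for root, _dirs, files in os.walk(MODELS_DIR):
    for name in files:
        total_bytes += os.path.getsize(os.path.join(root, name))

print(f"Ollama model store: {total_bytes / 1e9:.1f} GB")
```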
Upgrade Path
If Scenario B proves insufficient, the next step is a Mac Studio M4 Max with 128GB ($2,499–$2,999):
- 128GB unified memory at 546 GB/s bandwidth
- Llama 70B Q4 at 14–20 tok/s (2–3× faster than M4 Pro)
- Can run 70B at Q8_0 (~4.5 tok/s) or multiple large models concurrently; 70B at full FP16 (~140GB of weights) exceeds even 128GB
- Better sustained thermal performance in the larger Studio enclosure
Benchmark Data
Tokens/Second by Chip and Model (Q4_K_M, Ollama with MLX backend)
| Model | M4 (24GB) | M4 Pro (48–64GB) | M4 Max (128GB)* |
|---|---|---|---|
| Phi-3 Mini 3.8B | 50–65 | 70–90 | 120–150 |
| Llama 3.1 8B | 28–35 | 35–45 | 55–60 |
| Qwen 2.5 7B | 28–35 | 35–45 | 55–60 |
| Mistral 7B | 28–35 | 35–45 | 55–60 |
| CodeLlama 13B | 8–12 | 14–18 | 22–28 |
| DeepSeek-Coder 33B | N/A | 10–14 | 18–22 |
| Mixtral 8x7B | N/A | 12–18 | 25–28 |
| Llama 3.1 70B Q4 | N/A | 6–8 | 18–20 |
| Llama 3.1 70B Q8_0 | N/A | N/A | ~4.5 |
*M4 Max only available in Mac Studio or MacBook Pro, not Mac Mini.
MLX Backend Speed Improvements (Ollama 0.19+, March 2026)
The MLX backend (now default in Ollama) provides dramatic speed improvements over the previous llama.cpp Metal backend:
| Model | llama.cpp (old) | MLX (new) | Improvement |
|---|---|---|---|
| Qwen3 0.6B Q4 | 281 tok/s | 526 tok/s | +87% |
| Llama 3.2 1B Q4 | 331 tok/s | 462 tok/s | +39% |
| Qwen3 8B Q4 | 77 tok/s | 93 tok/s | +21% |
| Llama 3.1 8B Q4 | ~35 tok/s | ~45 tok/s | ~29% |
Benchmarks from M4 Max 128GB; proportional improvements apply to M4 and M4 Pro.
Prompt Processing (Prefill) Speed
Prompt processing is much faster than token generation:
- M4 24GB, Llama 3.1 8B: 326 tok/s prompt ingestion
- At that rate, a 1,000-token prompt is processed in about 3 seconds (1000 ÷ 326 ≈ 3.1 s)
Comparison with NVIDIA GPUs
| Hardware | Llama 70B Q4 tok/s | Cost | Power |
|---|---|---|---|
| Mac Mini M4 Pro 64GB | 6–8 | $2,199 | ~50W |
| RTX 4070 Super (12GB) | ~12* | ~$600 GPU + PC | ~300W |
| RTX 4090 (24GB) | Can't load 70B | ~$1,600 GPU | ~450W |
| 2× RTX 4090 (48GB) | ~20 | ~$5,000+ system | ~900W |
| Mac Studio M4 Max 128GB | 18–20 | ~$2,999 | ~75W |
*RTX 4070 requires offloading layers to CPU RAM, severely degrading performance for 70B.
Thermal and Sustained Performance
Mac Mini M4 Pro Stock Cooling
The Mac Mini's compact 5×5-inch form factor presents thermal challenges for sustained AI inference:
| Workload Type | Temperature | Time to Throttle | Performance Impact |
|---|---|---|---|
| Light use (browsing, coding) | 60–75°C | No throttling | Full performance |
| CPU-heavy batch work | 68–74°C | No throttling | <3% drop (stable) |
| LLM inference (sustained 100%) | 95–118°C | 8–10 minutes | 30–45% drop |
| LLM with external fan | 85–100°C | ~25 minutes | 10–20% less throttling |
Practical Implications
For interactive chat/coding assistant use (Scenario A & B typical):
- Queries are bursty — a few seconds of inference, then idle
- Thermal throttling is not an issue for interactive use
- The Mac Mini will stay cool and quiet for normal Ollama usage
For sustained batch inference (e.g., processing documents, multi-agent hosting):
- Stock cooling will throttle within 10 minutes
- Performance plateaus at 55–70% of peak after throttling
- External cooling solutions (small rear blower fan, ~$20–30) extend full-performance window to ~25 minutes
Mitigation Strategies
- External fan ($20–30): Placed behind the Mac Mini, reduces temps by ~10°C
- Software fan control (free): Max internal fan reduces temps but increases noise
- Ambient temperature: Keeping room at 20–22°C measurably helps
- Thermal pad mod (voids warranty): +15% heat dissipation, not recommended
- Duty cycle management: If doing batch work, schedule pauses every 20–30 minutes
Verdict
For Guillaume's use case (interactive Ollama queries alongside development), thermal throttling will not be a practical concern. It only matters for continuous 24/7 batch inference workloads.
Evidence Quality
Well-Sourced (High Confidence)
| Finding | Source Quality |
|---|---|
| Mac Mini configurations and pricing | Apple official specs page, multiple retailers (B&H, Amazon) |
| M4/M4 Pro/M4 Max chip specifications | Apple official, verified by Notebookcheck, AnandTech |
| No M4 Max in Mac Mini | Apple official, confirmed by MacRumors, 9to5Mac |
| Memory bandwidth figures | Apple official silicon specs |
| RAM requirements per model size | Ollama documentation, llama.cpp community calculations, verified empirically |
| Ollama MLX backend (0.19+) | Ollama official blog, March 2026 |
Community-Validated (Medium-High Confidence)
| Finding | Source Quality |
|---|---|
| Tokens/second benchmarks | Multiple independent user reports (Reddit r/LocalLLaMA, r/ollama), consistent across sources |
| 70B Q4_K_M at 6–8 tok/s on M4 Pro 64GB | Multiple benchmark sites + community reports; cross-validated with bandwidth formula |
| MLX speed improvements (87% for small models) | DEV Community benchmarks, Ollama blog, multiple user confirmations |
| OS + dev tools RAM consumption (8–14GB) | Hacker News survey, community reports, consistent with macOS Activity Monitor data |
Less Certain (Medium Confidence)
| Finding | Source Quality |
|---|---|
| Thermal throttling at 8–10 minutes | Based on a few detailed test reports; real-world varies by ambient temp and workload |
| 118°C peak GPU temp under max load | Extreme case from one test; typical sustained is 95–105°C |
| TurboQuant KV cache compression for 70B@32K on 64GB | New technique (2026); early benchmarks promising but not widely validated yet |
| M4 Pro 14-core/20-core pricing | CTO pricing varies; some retailer listings inconsistent |
Speculative (Low Confidence)
| Finding | Source Quality |
|---|---|
| Future Ollama/MLX optimizations beyond 0.19 | Based on trajectory; no confirmed roadmap |
| M5 chip timeline for Mac Mini refresh | Rumors only as of April 2026 |
Sources
Apple Official
- Apple Mac Mini Specs Page — https://www.apple.com/mac-mini/specs/
- Apple Mac Mini Buy Page — https://www.apple.com/shop/buy-mac/mac-mini
- Apple Newsroom: Mac Mini M4 Announcement — https://www.apple.com/newsroom/2024/10/apples-new-mac-mini-is-more-mighty-more-mini-and-built-for-apple-intelligence/
Benchmarks & Technical Analysis
- Sean Kim, "M4 Max AI Inference Benchmarks" — https://blog.imseankim.com/apple-m4-max-macbook-pro-ai-inference-benchmarks/
- OwnYourAI, "Apple Silicon for Local AI: M4, M4 Pro, M4 Max Compared" — https://ownyourai.dev/hardware/apple-silicon-for-ai/
- DEV Community, "Apple Silicon LLM Inference Optimization Guide" — https://dev.to/starmorph/apple-silicon-llm-inference-optimization-the-complete-guide-to-maximum-performance-3388
- LocalAI Computer, "Mac Mini M4 Pro for Local AI" — https://localai.computer/products/systems/mac-mini-m4-pro
- TurboQuant Benchmark on Apple Silicon — https://asiai.dev/turboquant/
- llama.cpp Performance Discussion on Apple Silicon — https://github.com/ggml-org/llama.cpp/discussions/4167
Ollama
- Ollama Blog, "Ollama is now powered by MLX" — https://ollama.com/blog/mlx
- DEV Community, "Ollama Just Got 93% Faster on Mac" — https://dev.to/alanwest/ollama-just-got-93-faster-on-mac-heres-how-to-enable-it-3gce
Community & Reviews
- Reddit r/ollama, Mac Mini M4 as Ollama server — https://www.reddit.com/r/ollama/comments/1idv02o/has_anyone_been_using_the_base_m4_mac_mini_as_an/
- MacRumors, Mac Mini Roundup — https://www.macrumors.com/roundup/mac-mini/
- PCMag, Apple Mac Mini M4 Pro Review — https://www.pcmag.com/reviews/apple-mac-mini-2024-m4-pro
- yW!an, "Local LLM Performance: The 2025 Benchmark" — https://www.ywian.com/blog/local-llm-performance-2025-benchmark
- RunAI Guide, "Mac Mini M4 vs M2 Ollama Speed Test" — https://www.runaiguide.com/mac-mini-m4-vs-m2-ollama-speed-test-with-qwen-35-models
Thermal
- VPSMac, "Mac mini Thermal Performance 72-Hour Stress Test" — https://vpsmac.com/en/blog/mac-mini-thermal-performance-stress-test.html
- Apple Community Forums, M4 Pro thermals — https://discussions.apple.com/thread/255854367
Pricing
- MacPrices.net — https://www.macprices.net/mac-mini/
- AppleInsider Mac Mini Deals — https://appleinsider.com/deals/best-mac-mini-deals
- SimplyMac, Mac Mini Upgrade Options — https://www.simplymac.com/mac/mac-mini-upgrade
- B&H Photo, Mac Mini M4 Pro configurations — https://www.bhphotovideo.com/
Quick Decision Matrix
| Your Need | Recommendation | Price |
|---|---|---|
| Run 7B models for coding assist, tight budget | Mac Mini M4, 24GB, 256GB | $699 |
| Run 7B–13B models comfortably with dev tools | Mac Mini M4, 24GB, 512GB | $999 |
| Run 7B–13B with future headroom | Mac Mini M4, 32GB, 512GB | ~$1,199 |
| Run 33B–34B models + MoE models | Mac Mini M4 Pro, 48GB, 1TB | $1,999 |
| Run 70B models + full dev environment | Mac Mini M4 Pro, 64GB, 1TB | $2,199 |
| Run 70B models fast + multiple large models | Mac Studio M4 Max, 128GB | ~$2,999 |
For Guillaume Specifically
Start with Scenario A ($999) if:
- Your primary AI work uses the OpenAI Codex subscription (cloud)
- Local models are supplementary (private queries, offline work, experimentation)
- 7B–8B models cover your local needs
Go directly to Scenario B ($2,199) if:
- You want the flexibility to run ANY model up to 70B locally
- You plan to use local models as a primary tool, not just supplementary
- You want to run DeepSeek-Coder 33B or similar large coding models
- Future-proofing matters — 64GB won't feel tight for years
The $1,200 gap between Scenarios A and B buys 10× the model-size capability. Given that RAM cannot be upgraded later, erring on the side of more RAM is strongly advisable.