Mac Mini for Local AI Inference – Hardware Scenarios
Author: IAIP Research Team
Date: April 15, 2026
Audience: Guillaume Descoteaux-Isabelle
Purpose: Purchase-decision guide for Mac Mini configurations running local AI models via Ollama/OpenClaw
Revision: v2.0 – incorporates pricing verification and reviewer corrections
Executive Summary
- For small models (7B–14B), buy a Mac Mini M4 with 24 GB RAM and 512 GB SSD at $999 USD (~$1,379 CAD). This runs Llama 3.1 8B, Mistral 7B, CodeLlama 7B, and DeepSeek-Coder 6.7B at 28–35 tok/s (Metal backend) or potentially faster with the MLX preview backend. It is a capable, quiet, low-power local inference node for a developer who already has cloud subscriptions.
- For large models (>40 GB loaded, including 70B-class), buy a Mac Mini M4 Pro (12-core) with 64 GB RAM and 1 TB SSD at $2,199 USD (~$3,035 CAD). This is the only Mac Mini configuration that can load Llama 3.1 70B Q4 alongside a full development environment. Expect 6–8 tok/s for 70B – usable for single-user interactive chat but not fast. Models up to 33B–34B run comfortably at 10–14 tok/s.
- RAM is soldered and cannot be upgraded. Buy more than you think you need. The $1,200 gap between Scenario A and B buys 10× the model-size capability.
- Mac Studio M4 Max (128 GB) at $3,699 USD is the next step if the Mac Mini M4 Pro proves insufficient – it doubles memory bandwidth (546 GB/s vs 273 GB/s) and delivers 14–20 tok/s on 70B Q4.
- Use a hybrid strategy: Route latency-tolerant and privacy-sensitive queries to local Ollama; route complex reasoning and long-context tasks to cloud (GitHub Copilot with GPT-4.1/GPT-5.x, OpenAI Codex). This maximizes both cost-efficiency and capability.
Apple Silicon Unified Memory: Why It Matters for LLMs
The Architecture Advantage
Traditional PC setups separate CPU RAM from GPU VRAM. An NVIDIA RTX 4090 has 24 GB VRAM – a 70B Q4 model requiring ~38 GB of VRAM simply will not fit without a multi-GPU configuration costing $5,000–$10,000+.
Apple Silicon uses a unified memory architecture (UMA):
- CPU and GPU share the same physical memory pool – no PCIe bottleneck for data transfer.
- Zero-copy access: the GPU reads model weights directly from the same memory the CPU uses.
- A Mac Mini M4 Pro with 64 GB gives the GPU access to the entire pool, minus OS overhead (~12 GB).
- Result: a $2,199 Mac Mini can load a 70B Q4 model that would require a $5,000+ discrete-GPU PC rig.
Memory Bandwidth Is the Bottleneck
LLM token generation is memory-bandwidth-bound, not compute-bound. Each output token requires reading the model's entire weight tensor from memory. The governing equation:
Theoretical max tok/s ≈ Memory Bandwidth (GB/s) ÷ Model Size in Memory (GB)
| Chip | Bandwidth | 7B Q4 (~4 GB) | 13B Q4 (~8 GB) | 34B Q4 (~19 GB) | 70B Q4 (~38 GB) |
|---|---|---|---|---|---|
| M4 | 120 GB/s | ~30 tok/s | ~15 tok/s | ~6 tok/s | N/A (RAM limit) |
| M4 Pro | 273 GB/s | ~68 tok/s | ~34 tok/s | ~14 tok/s | ~7 tok/s |
| M4 Max | 546 GB/s | ~136 tok/s | ~68 tok/s | ~29 tok/s | ~14 tok/s |
Real-world performance is typically 60–80% of theoretical due to attention computation, KV-cache management, and OS overhead. The M4 Max is only available in the Mac Studio ($3,699+) or MacBook Pro – not the Mac Mini.
Sources: Apple Silicon specs (apple.com); bandwidth-throughput formula per llama.cpp community (GitHub ggml-org/llama.cpp#4167).
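To make the formula concrete, here is a quick back-of-the-envelope check using illustrative figures from the table above (real-world throughput typically lands at 60–80% of this ceiling):

```bash
# Theoretical ceiling: tok/s ≈ bandwidth (GB/s) ÷ model size in memory (GB).
# Figures are the illustrative values from the table above.
bandwidth=273     # M4 Pro memory bandwidth, GB/s
model_size=38     # Llama 3.1 70B Q4_K_M resident weights, GB
echo "scale=1; $bandwidth / $model_size" | bc   # ≈ 7.1 tok/s theoretical; expect ~60-80% of that
```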
Mac Mini M4 Lineup (April 2026)
Chip Specifications
| Spec | M4 (Base) | M4 Pro (12-core CPU) | M4 Pro (14-core CPU) |
|---|---|---|---|
| CPU Cores | 10 (4P + 6E) | 12 (8P + 4E) | 14 (10P + 4E) |
| GPU Cores | 10 | 16 | 20 |
| Neural Engine | 16-core, 38 TOPS | 16-core, 38 TOPS | 16-core, 38 TOPS |
| Memory Bandwidth | 120 GB/s | 273 GB/s | 273 GB/s |
| Max Unified Memory | 32 GB | 64 GB | 64 GB |
| Max Storage | 2 TB | 8 TB | 8 TB |
| Thunderbolt | TB4 (×3) | TB5 (×3) | TB5 (×3) |
| Ethernet | Gigabit (10 GbE opt.) | 10 Gigabit | 10 Gigabit |
Source: Apple Mac Mini specs page (apple.com/mac-mini/specs).
Complete Pricing Table (Apple MSRP, USD – verified April 2026)
Mac Mini M4 (Base Chip)
| RAM | Storage | Price (USD) | Price (CAD, est.) |
|---|---|---|---|
| 16 GB | 256 GB | $599 | ~$827 |
| 16 GB | 512 GB | $799 | ~$1,103 |
| 24 GB | 512 GB | $999 | ~$1,379 |
| 32 GB | 512 GB | ~$1,199 (CTO) | ~$1,655 |
CTO upgrade costs: +$200 for 16→24 GB RAM, +$400 for 16→32 GB RAM; +$200 per SSD tier.
Mac Mini M4 Pro (12-core CPU / 16-core GPU)
| RAM | Storage | Price (USD) | Price (CAD, est.) |
|---|---|---|---|
| 24 GB | 512 GB | $1,399 | ~$1,931 |
| 24 GB | 1 TB | $1,599 | ~$2,207 |
| 48 GB | 512 GB | $1,799 | ~$2,483 |
| 48 GB | 1 TB | $1,999 | ~$2,759 |
| 64 GB | 1 TB | $2,199 | ~$3,035 |
| 64 GB | 2 TB | $2,599 | ~$3,587 |
Mac Mini M4 Pro (14-core CPU / 20-core GPU – CTO upgrade)
| RAM | Storage | Price (USD) | Price (CAD, est.) |
|---|---|---|---|
| 24 GB | 1 TB | $1,799 | ~$2,483 |
| 48 GB | 1 TB | $2,199 | ~$3,035 |
| 64 GB | 1 TB | $2,399–$2,499 | ~$3,311–$3,449 |
⚠️ Pricing note: The 14-core/20-core M4 Pro upgrade adds $200–$300 over the 12-core base at equivalent RAM/storage. Prices verified against Apple Store, B&H Photo, and Klarna aggregator listings (April 2026). The 12-core variant at $2,199 offers the best value for inference workloads since memory bandwidth (273 GB/s) is identical between the 12-core and 14-core M4 Pro – additional GPU cores provide marginal benefit for LLM inference.
RAM Upgrade Cost Summary
| Upgrade | Cost |
|---|---|
| 16→24 GB RAM (base M4) | +$200 |
| 16→32 GB RAM (base M4) | +$400 |
| 24→48 GB RAM (M4 Pro) | +$400 |
| 24→64 GB RAM (M4 Pro) | +$600 |
| 512 GB→1 TB SSD | +$200 |
| 1→2 TB SSD | +$400 |
| 10 GbE Ethernet (base M4) | +$100 |
⚠️ RAM is soldered and cannot be upgraded after purchase. This is the single most important decision. Choose more than you think you need today.
Sources: Apple Store (apple.com/shop), B&H Photo (bhphotovideo.com), MacPrices.net, AppleInsider. CAD conversion at 1 USD ≈ 1.38 CAD (April 2026 rate per valutafx.com).
Memory Requirements by Model
RAM Needed for Inference (Q4_K_M Quantization Unless Noted)
| Model | Parameters | Quantization | Model RAM | + KV Cache (4K ctx) | Total RAM Needed | Disk Space |
|---|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | Q4 | ~2.3 GB | ~0.4 GB | ~3 GB | 2.3 GB |
| Phi-4 | 14B | Q4_K_M | ~8 GB | ~0.8 GB | ~9 GB | 8 GB |
| DeepSeek-Coder 6.7B | 6.7B | Q4_K_M | ~3.8 GB | ~0.5 GB | ~4 GB | 3.8 GB |
| Mistral 7B | 7B | Q4_K_M | ~4.1 GB | ~0.5 GB | ~5 GB | 4.1 GB |
| CodeLlama 7B | 7B | Q4_K_M | ~4.0 GB | ~0.5 GB | ~5 GB | 4.0 GB |
| Llama 3.1 8B | 8B | Q4_K_M | ~4.5 GB | ~0.5 GB | ~5 GB | 4.6 GB |
| Llama 3.1 8B | 8B | Q8_0 | ~8.5 GB | ~0.5 GB | ~9 GB | 8.5 GB |
| CodeLlama 13B | 13B | Q4_K_M | ~7.9 GB | ~0.8 GB | ~9 GB | 7.9 GB |
| DeepSeek-R1 32B | 32B | Q4_K_M | ~18 GB | ~1.2 GB | ~19 GB | 18 GB |
| DeepSeek-Coder 33B | 33B | Q4_K_M | ~19 GB | ~1.2 GB | ~20 GB | 19 GB |
| Mixtral 8×7B (MoE) | 46.7B eff. | Q4_K_M | ~18 GB | ~1.5 GB | ~20 GB | 26 GB |
| Qwen2.5 72B | 72B | Q4_K_M | ~39 GB | ~2.5 GB | ~42 GB | 40 GB |
| Llama 3.1 70B | 70B | Q4_K_M | ~38 GB | ~2.5 GB | ~41 GB | 38 GB |
| Llama 3.1 70B | 70B | Q8_0 | ~72 GB | ~2.5 GB | ~75 GB | 72 GB |
Notes:
- KV cache grows with context length. At 32K context, a 70B model's KV cache can reach ~10–20 GB, making 64 GB very tight.
- "Total RAM Needed" is the model-only minimum – add 8–14 GB for macOS + dev tools (see below) for real-world planning.
- Qwen2.5 72B Q4 fits on 64 GB but leaves only ~10 GB for OS + tools – tight. A 48 GB config cannot load it.
OS + Development Tools Memory Budget
| Component | Typical RAM | Heavy Usage |
|---|---|---|
| macOS system | 3–5 GB | 5–7 GB |
| VS Code / Cursor | 1–2 GB | 3–5 GB |
| Docker Desktop | 2–4 GB | 8–16 GB |
| Browser (10–20 tabs) | 1–4 GB | 5–8 GB |
| OpenClaw agent process | 0.5–1 GB | 1–2 GB |
| Terminal / misc | 0.5–1 GB | 1–2 GB |
| Total dev overhead | 8–14 GB | 16–30 GB |
Effective RAM Available for Models
| Mac Mini Config | Total RAM | OS + Dev Tools | Available for Models |
|---|---|---|---|
| M4, 16 GB | 16 GB | ~10 GB | ~6 GB → 7B Q4 only, barely |
| M4, 24 GB | 24 GB | ~10 GB | ~14 GB → up to 13B Q4 |
| M4, 32 GB | 32 GB | ~10 GB | ~22 GB → up to 33B Q4, tight |
| M4 Pro, 24 GB | 24 GB | ~10 GB | ~14 GB → up to 13B Q4 |
| M4 Pro, 48 GB | 48 GB | ~12 GB | ~36 GB → 33B comfortable; 70B won't fit |
| M4 Pro, 64 GB | 64 GB | ~12 GB | ~52 GB → 70B Q4 with room to spare |
Sources: Ollama documentation, llama.cpp community calculations, macOS Activity Monitor user reports (Reddit r/LocalLLaMA, Hacker News).
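As a planning aid, the same arithmetic can be run for any candidate model. A minimal sketch for the 70B-on-64 GB case, using the illustrative figures from the tables above:

```bash
# Rough RAM-fit check for Llama 3.1 70B Q4_K_M on a 64 GB Mac Mini (figures from the tables above).
model_gb=38       # quantized weights
kv_cache_gb=2.5   # KV cache at 4K context (grows to ~10-20 GB at 32K)
overhead_gb=12    # macOS + dev tools, typical load on a 64 GB machine
total=$(echo "$model_gb + $kv_cache_gb + $overhead_gb" | bc)
echo "Planned footprint: ${total} GB of 64 GB"   # ≈ 52.5 GB -- fits, with limited headroom
```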
Scenario A: Minimal Inference Setup
Recommended Configuration
Mac Mini M4, 24 GB RAM, 512 GB SSD – $999 USD (~$1,379 CAD)
| Spec | Detail |
|---|---|
| Chip | M4, 10-core CPU, 10-core GPU |
| RAM | 24 GB unified memory |
| Storage | 512 GB SSD (upgrade to 1 TB at ~$1,199 recommended for model library) |
| Bandwidth | 120 GB/s |
| Price | $999 USD |
What It Can Run
All benchmarks below use the default Metal backend (llama.cpp). See MLX note below for potential speedups.
| Model | Disk Space | RAM Used | tok/s (Metal) | Verdict |
|---|---|---|---|---|
| Llama 3.1 8B Q4_K_M | 4.6 GB | ~5 GB | 28–35 | ✅ Excellent |
| Mistral 7B Q4_K_M | 4.1 GB | ~5 GB | 28–35 | ✅ Excellent |
| CodeLlama 7B Q4_K_M | 4.0 GB | ~5 GB | 28–35 | ✅ Excellent |
| Phi-3 Mini Q4 | 2.3 GB | ~3 GB | 40–50 | ✅ Very fast |
| DeepSeek-Coder 6.7B Q4 | 3.8 GB | ~4 GB | 30–38 | ✅ Excellent |
| Qwen 2.5 7B Q4 | 4.0 GB | ~5 GB | 28–35 | ✅ Excellent |
| Llama 3.1 8B Q8_0 (higher quality) | 8.5 GB | ~9 GB | 18–22 | ✅ Good |
| CodeLlama 13B Q4_K_M | 7.9 GB | ~9 GB | 8–12 | ⚠️ Usable, slower |
MLX Backend (Preview – Ollama 0.19+): As of April 2026, the MLX backend is available in preview and must be enabled manually with OLLAMA_MLX=1. On models ≤8B it delivers approximately 21–29% faster token generation vs the Metal backend. However, MLX is reported to require ≥32 GB RAM for optimal operation – the 24 GB Scenario A configuration may not fully benefit. Performance gains on 24 GB systems are unconfirmed. (Sources: Ollama blog March 2026; dev.to/alanwest; byteiota.com)
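For experimentation, a minimal way to enable the preview and compare eval rates against Metal is sketched below. It assumes the server is started from a terminal (the menu-bar app needs the variable set differently), and the model tag is only an example:

```bash
# Start the Ollama server with the MLX preview backend enabled (variable name per the preview notes above).
# If you normally use the menu-bar app, quit it first and start the server from a terminal instead.
OLLAMA_MLX=1 ollama serve &
sleep 5                              # give the server a moment to start
ollama run llama3.1:8b --verbose     # --verbose prints prompt-eval and eval rates (tok/s) for comparison
```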
What It Cannot Run
| Model | Why Not |
|---|---|
| Mixtral 8×7B Q4_K_M (~20 GB loaded) | Exceeds available RAM after OS/dev tools (~14 GB free) |
| DeepSeek-Coder 33B Q4 (~20 GB) | Same – needs ≥48 GB total |
| Llama 3.1 70B Q4 (~41 GB) | Far exceeds available RAM |
| Qwen2.5 72B Q4 (~42 GB) | Far exceeds available RAM |
| Any model >13B parameters | Insufficient headroom for model + KV cache + tools |
Use Case Fit
This configuration is ideal when:
- Cloud inference (Copilot, Codex) is your primary AI tool and local models supplement it for offline or private queries.
- You primarily run 7B–8B coding assistants (CodeLlama 7B, DeepSeek-Coder 6.7B) for autocomplete and code review.
- You want a quiet, low-power (~22 W idle, ~45 W inference) always-on local inference node.
- Budget is a constraint – this is the minimum viable configuration for a developer running Docker + browser + Ollama.
Budget alternative: Mac Mini M4, 16 GB, 256 GB at $599 can technically run 7B Q4 models, but with only ~6 GB free RAM after OS, it is too tight for a developer who also runs Docker and browsers. Not recommended.
Scenario B: Evolved Large Model Setup
Recommended Configuration
Mac Mini M4 Pro (12-core CPU / 16-core GPU), 64 GB RAM, 1 TB SSD – $2,199 USD (~$3,035 CAD)
| Spec | Detail |
|---|---|
| Chip | M4 Pro, 12-core CPU, 16-core GPU |
| RAM | 64 GB unified memory |
| Storage | 1 TB SSD (2 TB for large model library: $2,599 USD) |
| Bandwidth | 273 GB/s |
| Price | $2,199 USD |
Why 12-core, not 14-core? The 14-core/20-core M4 Pro with 64 GB/1 TB costs $2,399–$2,499 – a $200–$300 premium for 4 additional GPU cores. Memory bandwidth is identical (273 GB/s), and LLM inference is bandwidth-bound, not GPU-core-bound. The extra GPU cores help for graphics and Metal compute workloads but provide negligible benefit for token generation. Save the $200–$300.
Why Not 48 GB?
A 48 GB configuration ($1,999) cannot load Llama 3.1 70B Q4:
- Model + KV cache: ~41 GB
- macOS + dev tools: ~12 GB
- Total needed: ~53 GB > 48 GB
48 GB is viable for 33B-class models (DeepSeek-Coder 33B, CodeLlama 34B, Mixtral 8×7B – all ~20 GB loaded). If you know you will never need 70B models, 48 GB at $1,999 saves $200. But given that RAM cannot be upgraded, 64 GB is the safer choice for future-proofing.
What It Can Run
All benchmarks use the Metal backend unless noted. MLX preview may improve speeds by ~25% for 8B models when enabled (OLLAMA_MLX=1); MLX performance on 64 GB systems with large models is not yet widely validated.
| Model | Disk Space | RAM Used | tok/s | Verdict |
|---|---|---|---|---|
| Llama 3.1 70B Q4_K_M | 38 GB | ~41 GB | 6–8 | ✅ Usable for single-user chat |
| Qwen2.5 72B Q4_K_M | 40 GB | ~42 GB | 5–7 | ✅ Usable but tight on RAM |
| DeepSeek-R1 32B Q4_K_M | 18 GB | ~19 GB | 11–14 | ✅ Good for reasoning tasks |
| DeepSeek-Coder 33B Q4_K_M | 19 GB | ~20 GB | 10–14 | ✅ Good for coding |
| Mixtral 8×7B Q4_K_M | 26 GB | ~20 GB | 12–18 | ✅ Good, MoE efficiency |
| CodeLlama 34B Q4_K_M | 19 GB | ~20 GB | 10–14 | ✅ Good for coding |
| Llama 3.1 8B Q4_K_M | 4.6 GB | ~5 GB | 35–45 | ✅ Blazing fast |
| Multiple 7B models concurrent | ~10 GB | ~12 GB | 25–35 each | ✅ Multi-model supported |
| Llama 3.1 70B Q8_0 | 72 GB | ~75 GB | – | ❌ Exceeds 64 GB |
Sources: like2byte.com (DeepSeek-R1 benchmarks); compute-market.com; Reddit r/LocalLLaMA user reports; Ollama community benchmarks.
Performance Expectations
| Model Class | tok/s on M4 Pro 64 GB | Feel |
|---|---|---|
| 7B–8B Q4 | 35–45 | Instant – faster than you can read |
| 13B Q4 | 14–18 | Very responsive |
| 32B–34B Q4 | 10–14 | Comfortable interactive use |
| 70B Q4 | 6–8 | Noticeably slower; usable for chat, not for rapid iteration |
Prompt processing (prefill) is much faster than token generation: ~326 tok/s for 8B models on M4. A 1,000-token prompt is ingested in ~3 seconds.
Concurrency: Ollama supports OLLAMA_NUM_PARALLEL for simultaneous requests. On 64 GB with a 70B model loaded, there is no room for a second large model. For multi-agent OpenClaw setups that route to Ollama, plan to use one large model at a time or multiple 7B models concurrently.
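A conservative server setup for this constraint might look like the sketch below; OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS are standard Ollama server environment variables, but the specific values are assumptions for a 64 GB machine:

```bash
# One large model at a time, modest request parallelism -- a cautious setup for 64 GB.
# (Run from a terminal session; the menu-bar app needs these set via launchctl setenv instead.)
export OLLAMA_MAX_LOADED_MODELS=1   # never keep two large models resident at once
export OLLAMA_NUM_PARALLEL=2        # allow two simultaneous requests against the loaded model
ollama serve
```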
When Mac Studio Is Better
If M4 Pro 64 GB proves insufficient, the upgrade path is:
Mac Studio M4 Max, 128 GB RAM, 1 TB SSD – $3,699 USD (~$5,105 CAD)
| Factor | Mac Mini M4 Pro 64 GB | Mac Studio M4 Max 128 GB |
|---|---|---|
| Bandwidth | 273 GB/s | 546 GB/s (2×) |
| 70B Q4 tok/s | 6–8 | 14–20 (2–3× faster) |
| Can load 70B Q4? | ✅ Yes (tight) | ✅ Yes, with massive headroom |
| 70B at 32K context? | ⚠️ Very tight – KV cache 10–20 GB | ✅ Comfortable |
| Run 2 large models? | ❌ No room | ✅ Yes |
| 70B Q8 (higher quality)? | ❌ Exceeds 64 GB | ✅ ~75 GB fits in 128 GB |
| Price | $2,199 | $3,699 |
| Power draw | 50–80 W | ~75–110 W |
| Form factor | 5×5 in. | Larger desktop unit |
Verdict: The Mac Studio makes sense if you routinely need 70B at long context (32K+), want multiple large models loaded, or need faster 70B generation for productive interactive use. For occasional 70B queries alongside 7B–33B daily drivers, the Mac Mini M4 Pro is sufficient and saves $1,500.
Sources: Apple Mac Studio pricing (microcenter.com, bhphotovideo.com – verified $3,699 for 128 GB/1 TB); Macworld Mac Studio M4 Max review.
Comparison Table
| | Scenario A (Minimal) | Scenario B (Evolved) |
|---|---|---|
| Machine | Mac Mini M4 | Mac Mini M4 Pro (12-core) |
| RAM | 24 GB | 64 GB |
| Storage | 512 GB (1 TB recommended) | 1 TB (2 TB recommended) |
| Bandwidth | 120 GB/s | 273 GB/s (2.3×) |
| Price (USD) | $999 | $2,199 |
| Price (CAD, est.) | ~$1,379 | ~$3,035 |
| Max model size | 13B Q4 (with tools running) | 70B Q4 (with tools running) |
| Sweet spot models | 7B–8B Q4 | 32B–34B Q4 |
| 7B Q4 tok/s | 28–35 | 35–45 |
| 70B Q4 tok/s | N/A | 6–8 |
| Multi-model concurrency | 2–3 small (7B) models | 3–4 small, or 1 large + 1 small |
| Power draw (inference) | ~45 W | ~65–80 W |
| Primary use | Supplement cloud AI | Self-sufficient local AI |
| Upgrade path | → Scenario B ($2,199) | → Mac Studio M4 Max ($3,699) |
Thermal and Sustained Performance
The Mac Mini's compact 5×5-inch enclosure has real thermal limits under sustained AI inference loads.
Thermal Behavior Under Load
| Workload Type | Temperature Range | Time to Throttle | Performance Impact |
|---|---|---|---|
| Interactive chat (bursty queries) | 60–80°C | No throttling | ✅ Full performance |
| CPU-heavy batch work (compile, Docker) | 68–74°C | No throttling | <3% drop |
| Sustained LLM inference (100% GPU/CPU) | 95–105°C (junction) | 8–10 minutes | 30–45% drop |
| Sustained inference + external fan | 85–95°C | ~25 minutes | 10–20% less throttling |
Note on "118°C" reports: Some community benchmarks report GPU thermal-junction readings up to 118°C under extreme continuous load. This is the junction temperature (hotspot on the die), not the surface or package temperature. While high, Apple Silicon is rated for junction temps up to 110–120°C. Still, sustained operation at these temperatures triggers aggressive thermal throttling.
Practical Implications for Guillaume
For interactive Ollama use (Scenario A & B typical): Queries are bursty – a few seconds of inference followed by idle time. The Mac Mini cools between queries. Thermal throttling is not a practical concern for interactive chat or coding-assistant workflows.
For sustained batch inference (processing documents, running multi-agent loops): Stock cooling throttles within 10 minutes, settling at 55–70% of peak performance. Mitigations:
- External fan ($20–30): A small USB blower placed behind the Mac Mini reduces temps by ~10°C and extends the full-performance window to ~25 minutes.
- Software fan control (free): Forcing the internal fan to maximum with a third-party fan-control utility reduces temps but increases noise; `sudo powermetrics` can be used to monitor thermals and confirm the effect.
- Duty-cycle management: For batch work, schedule 5-minute pauses every 20–30 minutes to allow cool-down (see the sketch below).
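A minimal duty-cycle sketch follows; the model tag, prompt file, and 25-minute/5-minute split are illustrative assumptions, not tuned values:

```bash
# Duty-cycled batch inference: work for ~25 minutes, then pause 5 minutes to let the Mac Mini cool.
# Assumes prompts.txt holds one prompt per line; the model tag is illustrative.
window_start=$(date +%s)
while IFS= read -r prompt; do
  ollama run deepseek-coder:33b "$prompt" >> results.txt
  if (( $(date +%s) - window_start > 1500 )); then   # ~25 minutes of sustained load
    sleep 300                                         # 5-minute cool-down
    window_start=$(date +%s)
  fi
done < prompts.txt
```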
Sources: VPSMac 72-hour stress test (vpsmac.com); Apple Community Forums (discussions.apple.com/thread/255854367); blog.shiptasks.com thermal analysis.
Storage Considerations
Model Storage Requirements
| Model Collection | Disk Space |
|---|---|
| 3–4 small models (7B Q4 variants) | ~15–20 GB |
| 5–8 mixed models (7B + 13B) | ~40–60 GB |
| Full Scenario B library (7B + 33B + 70B) | ~80–120 GB |
| Large library with multiple quantizations | ~150–200 GB |
Ollama stores models in ~/.ollama/models/. Models can be deleted and re-downloaded at any time (ollama rm <model> / ollama pull <model>).
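Checking and trimming the library takes a couple of commands; the model tag below is only an example:

```bash
ollama list               # installed models and their on-disk sizes
du -sh ~/.ollama/models   # total disk used by the model library
ollama rm codellama:13b   # delete a model; re-download later with `ollama pull codellama:13b`
```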
Storage Recommendations
- Scenario A (512 GB SSD): Sufficient for 5–8 small models plus macOS + apps. Budget ~100 GB for the OS and applications, ~50–80 GB for models, leaving ~300 GB for project files. The 1 TB upgrade ($200) is recommended if you experiment with many model variants.
- Scenario B (1 TB SSD): Comfortable for a mixed library including a 70B model (38 GB). Budget ~100 GB for OS/apps, ~120 GB for models, leaving ~700 GB for projects. 2 TB ($2,599 total, +$400) if you keep many large model variants or work with large datasets.
The Hybrid Strategy: Local + Cloud
Guillaume already has OpenAI Codex and GitHub Copilot subscriptions. The optimal strategy is not "local vs cloud" – it is routing queries to the right backend based on the task.
When to Use Local Ollama
| Scenario | Why Local |
|---|---|
| Privacy-sensitive queries (proprietary code, client data) | Data never leaves your machine |
| Offline work (travel, network outages) | No internet required |
| High-frequency small queries (autocomplete, quick Q&A) | Zero latency, no rate limits |
| Experimentation (testing prompts, comparing models) | No API cost, no quota consumption |
| Always-on background agent (file watcher, code review daemon) | No per-request cost |
When to Use Cloud (Copilot / Codex)
| Scenario | Why Cloud |
|---|---|
| Complex reasoning (architecture decisions, long analysis) | GPT-4.1, GPT-5.x, Claude Opus are far more capable than local 7B–70B |
| Long context (entire codebases, large documents) | Cloud models handle 128K+ context; local 70B struggles past 8K on 64 GB |
| Speed on large models | GPT-4.1 responds in 1–3 seconds; local 70B takes 5–10 seconds per response |
| Multi-modal tasks (image understanding, code from screenshots) | Not available locally |
| Latest model capabilities | GPT-5.4, Claude Opus 4.6 – no local equivalent |
Practical Routing via OpenClaw
Configure OpenClaw's multi-provider routing to automatically select the right backend:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "github-copilot/gpt-4.1",  // Cloud default (unlimited for paid users)
        "fallbacks": ["ollama/llama3.1:8b"]   // Local fallback
      }
    }
  }
}
```
For privacy-sensitive tasks, explicitly route to local:
```bash
# In OpenClaw, select the local model for a specific task
ollama/deepseek-coder:6.7b
```
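Before relying on the fallback chain, it is worth confirming the local endpoint responds; the request below uses Ollama's standard HTTP API on its default port (the model and prompt are illustrative):

```bash
# Quick check of the local backend OpenClaw will route to (Ollama listens on port 11434 by default).
curl -s http://localhost:11434/api/generate \
  -d '{"model": "deepseek-coder:6.7b", "prompt": "Write a function that reverses a string.", "stream": false}'
```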
Cloud Model Availability (April 2026)
Guillaume's GitHub Copilot subscription includes access to (at no additional per-request cost for most models):
| Model | Status | Best For |
|---|---|---|
| GPT-4.1 | ✅ Current default (unlimited) | General coding, reasoning |
| GPT-5.2 | ✅ Available | Advanced reasoning |
| GPT-5.3-Codex | ✅ Available | Code-optimized tasks |
| GPT-5.4 | ✅ Available | Latest flagship |
| Claude Sonnet 4.6 | ✅ Available | Nuanced analysis |
| Claude Opus 4.6 | ✅ Available (premium quota) | Deep reasoning |
| Claude Haiku 4.5 | ✅ Available | Fast, lightweight |
⚠️ Premium request quotas: Some models (Claude Opus, o1, GPT-5.4) may consume limited monthly premium request allocations depending on plan tier. GPT-4.1 is unlimited for paid Copilot users. (Source: github.blog changelog; GitHub community discussions)
⚠️ GPT-4o is deprecated in GitHub Copilot. If existing configurations reference `gpt-4o`, update them to `gpt-4.1`.
M5 Chip Timeline: Should You Wait?
The Mac Mini M5 refresh is expected around WWDC in June 2026 – approximately 2 months from this document's date. Early M5 Max benchmarks suggest:
- ~4× faster time-to-first-token (TTFT) than M4
- 19–27% faster token decode speed
- Potentially higher memory bandwidth and larger max RAM
Should Guillaume wait?
- If the need is immediate: Buy now. The M4 Pro 64 GB is a capable machine today, and models you can run on it won't suddenly become obsolete.
- If 2 months can wait: The M5 Mac Mini may offer meaningfully better inference performance at similar price points. Apple typically keeps the same pricing structure across generations.
- Risk of waiting: Apple could ship M5 Mac Mini later than June (supply chain delays are common). The M4 lineup may also see retailer discounts once M5 launches.
Sources: Macworld (macworld.com/article/2964754); TechRepublic; zeerawireless.com.
Recommendation
For Guillaume Descoteaux-Isabelle
Given that Guillaume already has OpenAI Codex and GitHub Copilot subscriptions for cloud inference, local AI is a supplement, not a replacement. The purchase decision depends on how he intends to use local models:
If Local AI Is Supplementary (Private Queries, Offline, Experimentation)
→ Buy: Mac Mini M4, 24 GB RAM, 512 GB SSD – $999 USD (~$1,379 CAD)
This configuration runs Llama 3.1 8B, Mistral 7B, CodeLlama 7B, and DeepSeek-Coder 6.7B at 28–35 tok/s – fast enough for real-time coding assistance. Cloud handles the heavy lifting. Total cost: $999 in one-time hardware plus existing subscriptions.
If Local AI Is a Primary Tool (33B–70B Models, Self-Sufficiency)
→ Buy: Mac Mini M4 Pro (12-core), 64 GB RAM, 1 TB SSD – $2,199 USD (~$3,035 CAD)
This is the configuration that opens the door to large local models: DeepSeek-Coder 33B (10–14 tok/s), DeepSeek-R1 32B (11–14 tok/s), Mixtral 8×7B (12–18 tok/s), and Llama 3.1 70B (6–8 tok/s). The $1,200 premium over Scenario A buys 10× the model-size capability and 2.3× the memory bandwidth.
My Recommendation
Go with Scenario B ($2,199 USD). Here's why:
- RAM is permanent. You cannot add memory later. The AI model landscape is trending toward larger models, and 24 GB will feel constraining within 12–18 months.
- The incremental cost is modest. $1,200 more for a machine that handles 70B models – on hardware that will serve for 4–5 years, that's ~$20/month.
- Local 33B coding models are substantially better than 7B. DeepSeek-Coder 33B and DeepSeek-R1 32B produce meaningfully higher-quality code than their 7B counterparts – and they run comfortably at 10–14 tok/s on this config.
- Future-proofing. New model architectures (Mixture of Experts, larger context windows, reasoning models) trend toward higher memory requirements. 64 GB gives headroom.
If the June 2026 M5 timeline works with Guillaume's schedule, waiting 2 months could yield a meaningful performance uplift at the same price point. But the M4 Pro 64 GB is a strong purchase today.
Sources
Apple Official
- Apple Mac Mini Specs Page – https://www.apple.com/mac-mini/specs/
- Apple Mac Mini Buy Page – https://www.apple.com/shop/buy-mac/mac-mini
- Apple Newsroom: Mac Mini M4 Announcement – https://www.apple.com/newsroom/2024/10/apples-new-mac-mini-is-more-mighty-more-mini-and-built-for-apple-intelligence/
Benchmarks & Technical Analysis
- Sean Kim, "M4 Max AI Inference Benchmarks" – https://blog.imseankim.com/apple-m4-max-macbook-pro-ai-inference-benchmarks/
- DEV Community, "Apple Silicon LLM Inference Optimization Guide" – https://dev.to/starmorph/apple-silicon-llm-inference-optimization-the-complete-guide-to-maximum-performance-3388
- like2byte.com, "Mac Mini M4 for AI: Real Tokens/sec Benchmarks with DeepSeek R1" – https://like2byte.com/mac-mini-m4-deepseek-r1-ai-benchmarks/
- compute-market.com, "Mac Mini M4 for AI 2026 – LLM Benchmarks & Review" – https://www.compute-market.com/blog/mac-mini-m4-for-ai-apple-silicon-2026
- llama.cpp Performance Discussion on Apple Silicon – https://github.com/ggml-org/llama.cpp/discussions/4167
Ollama
- Ollama Blog, "Ollama is now powered by MLX" – https://ollama.com/blog/mlx
- DEV Community, "Ollama Just Got 93% Faster on Mac" – https://dev.to/alanwest/ollama-just-got-93-faster-on-mac-heres-how-to-enable-it-3gce
- byteiota.com, "Ollama MLX: 2× Faster Local AI on Apple Silicon (2026)" – https://byteiota.com/ollama-mlx-2x-faster-local-ai-on-apple-silicon-2026/
- Lilting.ch, "Ollama Moves to MLX Backend" – https://lilting.ch/en/articles/ollama-mlx-apple-silicon-preview
Pricing Verification (April 2026)
- B&H Photo – Mac Mini M4 Pro configurations – https://www.bhphotovideo.com/
- MacPrices.net – https://www.macprices.net/mac-mini/
- AppleInsider Mac Mini Deals – https://appleinsider.com/deals/best-mac-mini-deals
- Klarna US – Mac Mini M4 Pro price comparison – https://www.klarna.com/us/shopping/sp/mac-mini-m4-pro/
- Micro Center – Mac Studio M4 Max – https://www.microcenter.com/product/694413/
- Macworld – Mac Studio M4 Max Review – https://www.macworld.com/article/2631447/mac-studio-m4-max-review.html
- AppleInsider – Mac Studio Prices – https://prices.appleinsider.com/mac-studio-2025
Currency
- ValutaFX – USD/CAD 2026 historical rates – https://www.valutafx.com/history/usd-cad-2026
Thermal
- VPSMac, "Mac Mini Thermal Performance 72-Hour Stress Test" – https://vpsmac.com/en/blog/mac-mini-thermal-performance-stress-test.html
- Apple Community Forums, M4 Pro thermals – https://discussions.apple.com/thread/255854367
M5 Timeline
- Macworld, "2026 Mac Mini (M5 & M5 Pro)" – https://www.macworld.com/article/2964754/2026-mac-mini-m5-pro-design-specs-release-date.html
- TechRepublic, "Apple Mac Mini 2026: M5 Upgrade Rumors" – https://www.techrepublic.com/article/news-apple-mac-mini-2026-m5-rumors-stock-shortage/
Community
- Reddit r/ollama – Mac Mini M4 as Ollama server – https://www.reddit.com/r/ollama/comments/1idv02o/
- Reddit r/LocalLLaMA – community benchmarks and RAM usage reports
Document produced April 15, 2026. All prices are USD MSRP unless otherwise noted. CAD estimates use 1 USD ≈ 1.38 CAD. Benchmark numbers are from community reports and may vary ±15% based on context length, quantization variant, system load, and Ollama version. RAM is soldered on all Apple Silicon Macs and cannot be upgraded after purchase.