Agent C: Apple Silicon Training & Fine-Tuning Capabilities
Research Date: April 15, 2026 Scope: Training/fine-tuning AI models locally on Mac Mini for Guillaume's Indigenous-AI collaborative platform Focus: Hardware requirements, framework maturity, practical feasibility, training time estimates
Key Findings
-
Local fine-tuning on Mac Mini is viable and practical for Guillaume's use case. LoRA/QLoRA fine-tuning of 7B–8B models completes in 10–30 minutes on M4 Pro hardware. Embedding model fine-tuning for QMD-style models takes 5–20 minutes for domain-specific datasets of 1K–10K examples.
-
MLX + MLX-LM is the gold standard for Apple Silicon training. It eliminates CPU–GPU data copying via unified memory, installs with
pip, and supports LoRA, QLoRA, DPO, GRPO, SFT, embedding fine-tuning, and model export to GGUF/HuggingFace. -
Weekend/overnight self-training by AI agents is absolutely feasible. A complete LoRA fine-tuning cycle (data prep → train → export → serve) can run unattended in under 1 hour. Multiple persona adapters could be trained sequentially in a single night.
-
Recommended hardware: Mac Mini M4 Pro with 48GB RAM (
$1,800) is the sweet spot. The 64GB config ($2,700) provides headroom for 13B models and parallel workloads. -
QMD's embedding model (embeddinggemma-300M) can be fine-tuned with Guillaume's domain-specific text using mlx-tune's embedding fine-tuning capabilities (InfoNCE/contrastive learning), or via sentence-transformers with PyTorch MPS backend.
Training Frameworks on Apple Silicon
MLX / MLX-LM (Apple's Framework) — ⭐ PRIMARY RECOMMENDATION
Maturity: Stable, actively developed by Apple (ml-explore). As of 2026, MLX 0.20+ is production-ready.
What it supports:
- LoRA and QLoRA fine-tuning (native, optimized)
- Full fine-tuning (practical for models ≤7B on 48GB+ RAM)
- Direct HuggingFace model loading (
mlx-communitypre-converted weights) - Model export: fused weights, GGUF for Ollama/llama.cpp, HuggingFace Hub upload
- Training from JSONL datasets with simple
{"text": "..."}format
Key advantage: Unified memory architecture means no CPU↔GPU data transfer bottleneck. All RAM is addressable by GPU. This gives Apple Silicon a 30–40% training speedup over PyTorch MPS on the same hardware.
Installation:
pip install mlx-lm
Example LoRA fine-tune command (from mlx-examples):
python lora.py --model mistralai/Mistral-7B-v0.1 \
--train \
--iters 600 \
--batch-size 1 \
--lora-layers 4
Benchmark from Apple's own repo: Llama 7B LoRA training on WikiSQL — validation loss drops from 2.66 → 1.23 over 1,000 iterations. Training speed: ~475 tokens/sec on M2 Ultra, ~250 tokens/sec on M1 Max 32GB.
Sources: ml-explore/mlx-examples LoRA, markaicode.com MLX-LM Guide, randalscottking.com MLX Guide
mlx-tune (Community Wrapper) — ⭐ BEST FOR EMBEDDING FINE-TUNING
What it is: An Unsloth-compatible API wrapper around MLX, built by ARahim3. Enables same training scripts to work on Mac (MLX) and cloud (CUDA/Unsloth) by changing one import line.
Critical for Guillaume: mlx-tune is the only MLX-based tool that explicitly supports embedding model fine-tuning with contrastive learning (InfoNCE loss). Supports BERT, ModernBERT, Qwen3-Embedding, and Harrier architectures.
Full capability matrix (all marked Stable as of v0.4.21):
| Capability | Status |
|---|---|
| SFT Training | ✅ Stable |
| LoRA/QLoRA | ✅ Stable |
| DPO, ORPO, GRPO, KTO, SimPO | ✅ Stable |
| Vision Model Fine-Tuning (VLMs) | ✅ Stable |
| TTS Fine-Tuning (5 models) | ✅ Stable |
| STT Fine-Tuning (6 models) | ✅ Stable |
| Embedding Fine-Tuning | ✅ Stable |
| OCR Fine-Tuning | ✅ Stable |
| Continual Pretraining | ✅ Stable |
| MoE Fine-Tuning | ✅ Stable |
| Export to HuggingFace / GGUF | ✅ Stable |
| Push to HuggingFace Hub | ✅ Stable |
Installation:
pip install mlx-tune
# With audio support:
pip install 'mlx-tune[audio]'
Source: github.com/ARahim3/mlx-tune, PyPI
PyTorch MPS Backend
Maturity: Stable in PyTorch 2.7+ (2025). Included automatically in standard macOS PyTorch builds.
What it supports:
- Training and fine-tuning of PyTorch models with GPU acceleration via Metal Performance Shaders
- HuggingFace
TrainerAPI auto-detects MPS device - sentence-transformers training works on MPS
Limitations (significant):
- No distributed/multi-GPU training — single device only
- Partial operator coverage — some ops fall back to CPU (use
PYTORCH_ENABLE_MPS_FALLBACK=1) - Limited precision modes — float16/bfloat16 not on par with CUDA; mostly float32
- No fine-grained VRAM tracking — unlike CUDA, can't query GPU memory usage precisely
- 30–40% slower than MLX for equivalent tasks on Apple Silicon due to data copying overhead
Best for: sentence-transformers fine-tuning (since sentence-transformers library is PyTorch-native and doesn't have MLX port), HuggingFace Trainer-based workflows that need MPS fallback.
Device selection:
import torch
device = "mps" if torch.backends.mps.is_available() else "cpu"
Sources: PyTorch MPS docs, HuggingFace Apple Silicon guide
Core ML Training (On-Device)
Status: Limited. Supports incremental on-device model updates via MLUpdateTask API (Swift).
What it does: Small-scale, privacy-preserving model personalization — e.g., keyboard auto-correction adapting to user slang, photo tagging learning user preferences.
NOT suitable for Guillaume's use case. Core ML training is designed for small incremental updates within iOS/macOS apps, not for LoRA fine-tuning of LLMs or embedding models. The API is Swift-only and targets deployed app personalization, not ML research workflows.
llama.cpp Training
Status: Experimental, not recommended. llama.cpp is inference-focused. Native LoRA training support (train-lora) exists but is poorly documented, less feature-rich than MLX-LM, and primarily optimized for CUDA.
Recommended workflow: Train with MLX-LM or HuggingFace PEFT, convert to GGUF, serve with llama.cpp/Ollama.
Common Community Fine-Tuning Scenarios on Mac
What People Are ACTUALLY Training (r/LocalLLaMA, community reports)
Based on r/LocalLLaMA community data (~10,000 benchmark runs across 400 models as of early 2026):
| Use Case | Models Used | Hardware | Outcome |
|---|---|---|---|
| Domain-specific LoRA adapters (legal, medical, customer service) | Mistral 7B, Llama 3B/8B | M1–M4 16–64GB | ✅ Routine success |
| QLoRA on quantized 7B models | Llama 3.1 8B, Gemma 7B, Phi-3 | M2/M3 16GB | ✅ Works in 10–30 min |
| Assistant bots for small organizations | Llama 3B, Mistral 7B | Mac Mini M4, MacBook Air | ✅ Production use |
| Vision model fine-tuning | Qwen VL, PaliGemma | M3/M4 Max 64GB+ | ✅ Stable with mlx-tune |
| TTS/STT fine-tuning | Whisper, Orpheus | M2/M3/M4 32GB+ | ✅ Stable with mlx-tune |
| Full fine-tuning of 13B+ models | Llama 2 13B | 64GB+ RAM | ⚠️ Slow, needs patience |
| Training from scratch (any size) | N/A | Any Mac | ❌ Not practical |
Success Stories (Documented)
-
LoRA fine-tuning Mistral 7B on M1 16GB — 1,000 steps in under 30 minutes, 5–8GB RAM usage (4-bit quantized). Reported in Towards Data Science article with full training logs.
-
Domain-specific lingo adaptation — Community members training legal/medical terminology via QLoRA adapters, then serving via Ollama. Multiple success reports on r/LocalLLaMA.
-
Rapid prototyping → cloud scaling — Developers using Mac for local iteration with mlx-tune, then pushing exact same scripts to CUDA cloud for production training. The "context switch" workflow.
Failure Patterns
-
Running out of RAM on 8–16GB machines with 13B+ models — Even quantized, 13B models need 16GB+ for training overhead (optimizer states, activations, KV cache).
-
Expecting CUDA-level speed — Training is 2–4× slower than RTX 4090. Users who expect GPU-class throughput are disappointed. Those who understand the tradeoff (no cloud cost, privacy, simplicity) are satisfied.
-
Full-parameter training of large models — Not practical on Mac Mini. LoRA/QLoRA is the only viable path for models ≥7B.
Trainable Models for Guillaume's Use Case
1. LoRA Adapters for AI Persona Self-Training — ✅ HIGHLY FEASIBLE
Concept: Each AI persona (e.g., Mia the architect, Miette the emotional resonator) gets its own LoRA adapter trained on persona-specific conversation data, instructions, and domain knowledge.
How it works:
- Curate persona-specific training data as JSONL (conversations, instructions, domain text)
- Run QLoRA fine-tuning on a base model (e.g., Llama 3.1 8B-Instruct, 4-bit quantized)
- Export adapter weights (~50–200MB per persona)
- Serve via Ollama with
ADAPTERdirective in Modelfile
Requirements:
- Base model: ~4–8GB RAM (4-bit 8B model)
- Training overhead: ~2–4GB additional
- Total: ~8–12GB RAM per training run
- Time: 10–30 minutes per persona for 500–1,000 training steps
- Multiple personas can be trained sequentially overnight
Training data format (JSONL):
{"text": "<|user|>\nHow should we approach this relationship with the land?\n<|assistant|>\nAs Mia, I see the architectural pattern here..."}
Weekend self-training cycle:
- Friday night: Data collection/preparation scripts run
- Saturday morning: Sequential LoRA training for each persona (30 min each × N personas)
- Saturday afternoon: Adapter export, merge into Ollama models
- Sunday: Validation/testing
- Total for 5 personas: ~2.5–5 hours of training compute
2. Fine-Tuning Embedding Models for QMD — ✅ FEASIBLE
QMD's current embedding model: embeddinggemma-300M-Q8_0 (768 dimensions, ~300MB, English-optimized). Alternative: Qwen3-Embedding-0.6B-Q8_0 (1024 dimensions, ~640MB, multilingual). Configured via QMD_EMBED_MODEL environment variable.
What fine-tuning would achieve: Train the embedding model to understand that concepts like "relational accountability," "medicine wheel," "seven grandfather teachings," and "ceremony as methodology" are semantically close to each other and to broader Indigenous epistemology concepts — improving QMD's semantic search for Guillaume's domain-specific knowledge base.
Approach A — mlx-tune contrastive learning (recommended):
from mlx_tune import FastLanguageModel
# Fine-tune embedding model with domain-specific pairs
# e.g., ("medicine wheel teachings", "four directions framework") → high similarity
# ("relational methodology", "extractive research") → low similarity
mlx-tune supports InfoNCE/contrastive loss for embedding models (BERT, ModernBERT, Qwen3-Embedding, Harrier).
Approach B — sentence-transformers on PyTorch MPS:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
device = "mps"
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
train_examples = [
InputExample(texts=['relational accountability', 'ethical research relationship'], label=0.9),
InputExample(texts=['medicine wheel', 'four directions teaching'], label=0.95),
InputExample(texts=['extractive methodology', 'relational approach'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100
)
Training data needed: 500–5,000 sentence pairs with similarity scores. Guillaume's existing QMD knowledge base content can be used to generate these pairs programmatically.
Time estimate: 5–20 minutes for 1K–10K pairs on Mac Mini M4 Pro (MPS backend). Embedding models (300M–600M params) are tiny compared to LLMs.
After fine-tuning: Export the model and configure QMD to use it via QMD_EMBED_MODEL environment variable.
3. Small Classification/NER Models — ✅ TRIVIAL ON MAC
Use case: Train models to recognize Indigenous concepts, ceremony names, teaching references, relational terms in text.
Tool: spaCy (fully supports Apple Silicon, v3.x/v4.x)
Training time: 5–15 minutes for datasets of 1,000–10,000 annotated examples.
Example:
python -m spacy train config.cfg --output ./output \
--paths.train ./train.spacy --paths.dev ./dev.spacy
RAM usage: Under 8GB for small NER models. Well within any Mac Mini config.
4. What's Realistic vs Aspirational
| Task | Feasibility | Notes |
|---|---|---|
| LoRA adapters per AI persona | ✅ Realistic, proven | 10–30 min each, automate with cron |
| Embedding fine-tuning for QMD | ✅ Realistic, proven | 5–20 min, need domain pairs |
| NER/classification for domain terms | ✅ Trivial | spaCy, minutes to train |
| Full fine-tuning 7B model | ⚠️ Feasible but slow | Hours on 48GB+, not recommended |
| Training 13B+ from LoRA | ⚠️ Marginal | Needs 64GB, slower |
| Pre-training any model from scratch | ❌ Not practical | Days/weeks, use cloud |
| Fine-tuning 30B+ models | ❌ Not on Mac Mini | Need Mac Studio/Mac Pro |
Mac Hardware Scenarios for Training
Apple Silicon Chip Comparison (Training-Relevant Specs)
| Chip | GPU Cores | Max RAM | Memory Bandwidth | Available In |
|---|---|---|---|---|
| M4 (base) | 10 | 32GB | 120 GB/s | Mac Mini ($599+) |
| M4 Pro | 20 | 64GB | 273 GB/s | Mac Mini ($1,399+) |
| M4 Max | 40 | 128GB | 546 GB/s | Mac Studio ($1,999+) |
| M4 Ultra | 80 | 192GB+ | ~800+ GB/s | Mac Studio ($3,999+) |
Memory bandwidth is the bottleneck for training, not GPU cores. More bandwidth = faster token processing during training.
Minimal Mac Mini — "Can It Train At All?"
Config: Mac Mini M4 (base), 24GB RAM, 512GB SSD Price: ~$999
What it can do:
- QLoRA fine-tuning of 3B–7B models (4-bit quantized) — fits in ~8GB, leaves room for OS
- Embedding model fine-tuning (300M–600M models) — trivial
- spaCy NER training — trivial
- sentence-transformers fine-tuning — works fine
What it can't do:
- LoRA on 13B+ models (not enough RAM for optimizer states)
- Full fine-tuning of anything larger than 3B
- Run training while other heavy workloads are active
Limitations:
- 10 GPU cores and 120 GB/s bandwidth make training ~2.3× slower than M4 Pro
- 24GB is tight — model + optimizer + activations must all fit
- Batch size limited to 1–2 for 7B models
Training time (7B QLoRA, 1K steps): ~45–90 minutes (vs 15–30 min on M4 Pro)
Verdict: Usable for prototyping and small embedding models. Not recommended for regular persona training workloads.
Maximal Mac Mini — ⭐ RECOMMENDED
Config: Mac Mini M4 Pro, 48GB RAM, 1TB SSD Price: ~$1,800
What it can do:
- QLoRA fine-tuning of 7B–8B models comfortably — ~8GB model + 4GB overhead, 36GB headroom
- LoRA fine-tuning of 7B–8B models (full precision) — ~14GB model + overhead
- QLoRA on 13B models — ~16GB model + overhead, tight but works
- Embedding model fine-tuning — trivial, all models fit easily
- Multiple sequential persona trainings overnight — completely viable
- Run training while other light workloads continue
What it can't do:
- Full fine-tuning of 13B+ models
- QLoRA on 30B+ models (not enough RAM)
Key specs: 20 GPU cores, 273 GB/s bandwidth
Training times (estimated):
| Task | Time |
|---|---|
| 7B QLoRA, 1K steps, batch 4–8 | 15–30 minutes |
| 7B LoRA (FP16), 1K steps | 30–60 minutes |
| 13B QLoRA, 1K steps, batch 1–2 | 45–90 minutes |
| Embedding fine-tune (300M), 3 epochs, 5K pairs | 5–10 minutes |
| spaCy NER training, 5K examples | 5–10 minutes |
| 5 persona LoRA adapters (sequential) | 1.5–3 hours total |
Verdict: This is the sweet spot for Guillaume. Handles all realistic training workloads. Weekend self-training is completely viable. ~$1,800 is reasonable for a dedicated development/training machine.
Upgrade option: 64GB RAM ($2,700) provides headroom for 13B models, parallel processes, and future-proofing.
Alternative: Mac Studio / Mac Pro
If Mac Mini is insufficient (it shouldn't be for Guillaume's current needs, but for future scaling):
Mac Studio M4 Max — For 13B–30B Training
Config: Mac Studio M4 Max, 128GB RAM Price: ~$2,500–$4,000 (depending on storage/config) Memory bandwidth: 546 GB/s (2× Mac Mini M4 Pro) GPU cores: 40
Unlocks:
- QLoRA on 30B models comfortably (40GB model + overhead in 128GB)
- Full LoRA on 13B models without compromise
- Multiple concurrent training jobs
- ~2× faster training than M4 Pro due to doubled bandwidth and GPU cores
Mac Studio M4 Ultra — For Maximum Local Training
Config: Mac Studio M4 Ultra, 192GB RAM Price: $3,999+ base, ~$6,000–$8,000 maxed Memory bandwidth: 800+ GB/s GPU cores: 80
Unlocks:
- QLoRA on 70B models (fits in memory)
- Full fine-tuning of 13B models
- Apple claims handling 600B+ parameter models in memory (inference)
- Multiple concurrent training runs for different personas
When Guillaume needs this: If he moves to fine-tuning 30B+ models, or needs to train multiple personas in parallel rather than sequentially, or wants to do continual pretraining.
Summary Decision Matrix
| Need | Mac Mini M4 (24GB) | Mac Mini M4 Pro (48GB) | Mac Studio M4 Max (128GB) |
|---|---|---|---|
| Embedding fine-tuning | ✅ | ✅ | ✅ |
| 7B QLoRA persona adapters | ⚠️ Tight | ✅ Comfortable | ✅ Overkill |
| 13B QLoRA | ❌ | ⚠️ Tight | ✅ Comfortable |
| 30B QLoRA | ❌ | ❌ | ✅ |
| Weekend batch training (5 personas) | ⚠️ Slow | ✅ 2–3 hours | ✅ 1–1.5 hours |
| Price | ~$999 | ~$1,800 | ~$3,000+ |
Training Time Estimates
LoRA/QLoRA Fine-Tuning (LLMs)
Based on community benchmarks and framework documentation:
| Model | Method | Hardware | Steps | Batch | Time | Source |
|---|---|---|---|---|---|---|
| Mistral 7B (4-bit) | QLoRA | M2 Pro 16GB | 1,000 | 4 | ~30 min | markaicode.com |
| Mistral 7B (4-bit) | QLoRA | M1 Max 64GB | 1,000 | 8 | ~15 min | randalscottking.com |
| Llama 7B (FP16) | LoRA | M2 Ultra | 1,000 | 4 | ~35 min | ml-explore/mlx-examples (475 tok/s) |
| Llama 7B (FP16) | LoRA | M1 Max 32GB | 1,000 | 1 | ~67 min | ml-explore/mlx-examples (250 tok/s) |
| Llama 8B (4-bit) | QLoRA | M4 Pro 48GB (est.) | 1,000 | 8 | ~10–20 min | Extrapolated from M2 benchmarks + 2× bandwidth |
| 13B (4-bit) | QLoRA | M4 Pro 48GB | 1,000 | 2 | ~45–90 min | Community estimates |
Iteration speed: 0.1–0.3 iter/s on 16GB M1/M2 Pro; 0.5–1.5 iter/s on M4 Pro 48GB (estimated based on bandwidth scaling).
Scaling rule of thumb: Training time is roughly inversely proportional to memory bandwidth. M4 Pro (273 GB/s) is ~2.3× faster than M4 base (120 GB/s) and ~2× slower than M4 Max (546 GB/s).
Embedding Model Fine-Tuning
| Model | Data Size | Epochs | Hardware | Time |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (22M params) | 1,000 pairs | 3 | Any Mac (MPS) | 1–3 min |
| all-MiniLM-L6-v2 (22M params) | 10,000 pairs | 3 | Any Mac (MPS) | 5–15 min |
| embeddinggemma-300M | 1,000 pairs | 3 | M4 Pro (MLX) | 3–5 min |
| embeddinggemma-300M | 10,000 pairs | 3 | M4 Pro (MLX) | 10–20 min |
| Qwen3-Embedding-0.6B | 5,000 pairs | 3 | M4 Pro (MLX) | 10–15 min |
Embedding models are small enough that training is fast on any Apple Silicon Mac.
Is Weekend/Overnight Training Viable?
Absolutely yes. Example weekend training schedule on Mac Mini M4 Pro 48GB:
| Time | Task | Duration |
|---|---|---|
| Friday 10 PM | Data preparation scripts run | 30 min |
| Friday 10:30 PM | Persona 1 QLoRA training (7B, 1K steps) | 20 min |
| Friday 10:50 PM | Persona 2 QLoRA training | 20 min |
| Friday 11:10 PM | Persona 3 QLoRA training | 20 min |
| Friday 11:30 PM | Persona 4 QLoRA training | 20 min |
| Friday 11:50 PM | Persona 5 QLoRA training | 20 min |
| Saturday 12:10 AM | Embedding model fine-tuning | 15 min |
| Saturday 12:25 AM | NER model training | 10 min |
| Saturday 12:35 AM | Export all adapters to GGUF, rebuild Ollama models | 15 min |
| Total | ~2.5 hours |
The entire pipeline completes before midnight. Machine is available for other work by Saturday morning. A cron-driven train.sh script could automate this entirely.
Plugins Supporting Training Workflows
Ollama — Serving Fine-Tuned Models ✅
Ollama supports importing custom GGUF models and LoRA adapters via Modelfile:
FROM ./llama-3.1-8b.Q4_K_M.gguf
ADAPTER ./persona-mia.lora
SYSTEM "You are Mia, the architectural thinker..."
ollama create mia-persona -f Modelfile
ollama run mia-persona
Workflow: Train LoRA adapter with MLX-LM → export to GGUF/LoRA format → import into Ollama → serve via API. This is a proven, documented workflow.
LoRA merging: Can also merge LoRA into base weights pre-export for single-file deployment. MLX-LM provides fuse.py for this.
HuggingFace Hub Integration
Both MLX-LM and mlx-tune support:
- Downloading base models from HuggingFace (automatic conversion to MLX format)
- Uploading fine-tuned models/adapters to HuggingFace Hub (
push_to_hub()) - mlx-community namespace has pre-converted MLX weights for popular models
Guillaume could maintain private HuggingFace repos for each persona's adapter weights, enabling version control and rollback of training.
Dataset Preparation Tools
| Tool | Purpose | Mac Support |
|---|---|---|
| Axolotl | YAML-driven fine-tuning orchestration pipeline | ✅ Works on Mac |
| Label Studio | Manual data annotation/labeling | ✅ Web-based, runs anywhere |
| nlpaug | Text augmentation for training data | ✅ Python library |
| pandas + custom scripts | JSONL generation from QMD knowledge base | ✅ Trivial |
For Guillaume's specific workflow: A Python script that reads his QMD markdown files, extracts domain-specific text, generates training pairs (for embedding fine-tuning) or instruction examples (for persona LoRA training), and outputs JSONL files. This is a straightforward scripting task, not requiring specialized tools.
Training Orchestration
No Mac-specific training orchestrator exists as a plugin, but the workflow is scriptable:
#!/bin/bash
# weekend-training.sh — run via cron on Friday night
# 1. Generate training data from latest QMD content
python scripts/generate_training_data.py
# 2. Fine-tune each persona adapter
for persona in mia miette tushell council; do
python -m mlx_lm.lora \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--data data/${persona}/ \
--train \
--iters 1000 \
--adapter-file adapters/${persona}.npz
done
# 3. Fine-tune embedding model for QMD
python scripts/train_embeddings.py
# 4. Export and rebuild Ollama models
for persona in mia miette tushell council; do
python -m mlx_lm.fuse \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--adapter-file adapters/${persona}.npz \
--export-gguf
ollama create ${persona}-persona -f Modelfiles/${persona}
done
echo "Training complete: $(date)" >> training.log
Evidence Quality
High Confidence (multiple corroborating sources, benchmarks)
- MLX-LM LoRA/QLoRA training works on Apple Silicon — confirmed by Apple's official examples, multiple community guides, and r/LocalLLaMA reports
- Training times for 7B QLoRA: 10–30 minutes — consistent across multiple independent sources
- Memory requirements for quantized models — well-documented in MLX-LM docs and community reports
- Mac Mini M4 Pro pricing and specs — Apple official specs
- Ollama GGUF/LoRA import workflow — documented in Ollama repo
Medium Confidence (extrapolated from related data)
- M4 Pro specific training speeds — extrapolated from M2/M3 benchmarks scaled by memory bandwidth ratio (reasonable methodology)
- Embedding model fine-tuning times on MLX — mlx-tune is newer, fewer community benchmarks; times estimated from PyTorch MPS equivalents and model size scaling
- QMD embedding model fine-tunability — embeddinggemma-300M is a standard Gemma embedding model; fine-tuning Gemma-class models is well-documented, but specific QMD integration path needs testing
Lower Confidence (limited data, requires verification)
- Mac Studio M4 Ultra 512GB RAM claims — some sources mention this but official Apple specs show 192GB max for current models; may be future spec or custom config
- llama.cpp native training maturity — evolving rapidly; status may have changed
- Core ML training capabilities beyond basic MLUpdateTask — Apple documentation is sparse
Sources
Framework Documentation
- ml-explore/mlx-examples LoRA — Apple's official LoRA fine-tuning example with benchmarks
- ml-explore/mlx-lm — Apple's LLM package for MLX
- ARahim3/mlx-tune — Unsloth-compatible fine-tuning for Mac (LLM, Vision, Audio, Embedding, OCR)
- PyTorch MPS docs — Official MPS backend documentation
- HuggingFace Apple Silicon guide — Trainer + MPS integration
- sentence-transformers training guide — Embedding model fine-tuning
Benchmarks and Guides
- markaicode.com — Run and Fine-Tune LLMs on Mac with MLX-LM — Comprehensive MLX-LM guide with training times
- randalscottking.com — Fine-Tuning LLMs on Mac: Complete MLX Framework Guide — Step-by-step training guide
- blog.amsayed.dev — Fine-Tuning Your First LLM on Apple Silicon — Practical walkthrough
- DZone — Fine-Tuning LLMs Locally Using MLX LM — Technical guide
- dev.to/starmorph — Apple Silicon LLM Optimization Complete Guide — Performance optimization
Hardware and Pricing
- Apple Mac Mini Specs — Official specifications
- MacPrices.net — Mac Mini pricing history
- Apple Mac Studio Newsroom — Mac Studio M4 Ultra announcement
- Mac Studio Tech Specs — Official Mac Studio specifications
- Apple M4 Pro/Max announcement — Chip specifications
Community Data
- r/LocalLLaMA community benchmarks analysis — ~10,000 benchmark runs analysis
- Towards Data Science — Local LLM Fine-Tuning on Mac (M1 16GB) — Practical success story
- like2byte.com — Mac Mini M4 Pro 64GB: Real 30B LLM Benchmarks — Real-world benchmarks
QMD-Specific
- DeepWiki — QMD Vector Embeddings — QMD embedding model details (embeddinggemma-300M)
- tobi/qmd GitHub — QMD source repository
Research compiled by Agent C — Apple Silicon Training Capabilities For Agent A (inference) and Agent B (plugin review), see companion documents.