Mac Mini for Local AI Training - Hardware & Workflow Guide
Document: RESULT-03 - Mac Mini Training Scenarios for IAIP
Date: April 15, 2026
Context: Indigenous-AI Collaborative Platform (IAIP) - Guillaume Descoteaux-Isabelle
Status: Final (revised with reviewer corrections applied)
Executive Summary
Local AI fine-tuning on a Mac Mini M4 Pro is viable and practical for Guillaume's Indigenous-AI Collaborative Platform. The three highest-value training tasks - LoRA persona adapters, QMD query expansion, and embedding model domain adaptation - all fit within Apple Silicon's capabilities using models from 300M to 8B parameters. A complete weekend training cycle for five AI personas plus QMD model enhancement can finish in roughly 3 hours on a Mac Mini M4 Pro with 48GB unified memory.
Key findings:
- What works today: LoRA/QLoRA fine-tuning of 3B–8B models via MLX-LM (10–30 min per adapter). QMD query expansion retraining via the HuggingFace Jobs cloud path (~$1.50/run). Small embedding and NER model training on PyTorch MPS.
- What requires caution: Embedding model fine-tuning for QMD works at the training step, but converting a fine-tuned encoder-only model back to GGUF format for QMD deployment is unverified and experimental; this is the single riskiest step in the proposal (see §Known Limitations).
- What doesn't work on Mac: QMD's existing `finetune/` pipeline has a CUDA dependency (nvidia-ml-py) and targets A10G GPUs. It will not run on Mac without modification. The recommended workaround is the HuggingFace Jobs cloud path or rewriting the training scripts for MLX-LM.
- Recommended hardware: Mac Mini M4 Pro, 48GB RAM, 512GB SSD ($1,799) or 1TB SSD ($2,299). The 48GB config handles all realistic workloads.
- Data sovereignty: Local-only training is a significant advantage for Indigenous knowledge sovereignty. All training data and model weights remain on-premises. Cloud training paths (HuggingFace Jobs) should be avoided for culturally sensitive material.
What to train first: QMD query expansion model (existing pipeline, highest feasibility), then persona LoRA adapters (proven MLX-LM workflow), then embedding model (highest impact but GGUF conversion risk).
Training Frameworks on Apple Silicon (April 2026)
MLX / MLX-LM
MLX is Apple's array computation framework optimized for Apple Silicon. MLX-LM (v0.31.1, March 2026) is the separate LLM-specific package built on it. These are distinct packages; `pip install mlx-lm` is what users need for LLM fine-tuning.
Maturity: Production-ready. MLX-LM is actively maintained by Apple's ml-explore team with regular releases.
Supported training methods:
- LoRA and QLoRA fine-tuning (native, optimized for unified memory)
- Full fine-tuning (practical for models ≤7B on 48GB+ RAM)
- DPO, GRPO, SFT training objectives
- Direct HuggingFace model loading (pre-converted `mlx-community` weights available)
- Model export: fused weights, GGUF for Ollama/llama.cpp, HuggingFace Hub upload
Key advantage: Unified memory architecture eliminates CPU-to-GPU data transfer. All system RAM is GPU-addressable. Community reports indicate MLX is approximately 30–40% faster than PyTorch MPS for equivalent training tasks on the same hardware, though the exact speedup varies by model size and task (no single definitive benchmark).
Installation and usage:
pip install mlx-lm
# LoRA fine-tune (current CLI syntax)
python -m mlx_lm.lora \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--train \
--data ./training-data/ \
--iters 1000 \
--batch-size 4 \
--lora-layers 8
Benchmark reference: Llama 7B LoRA training on WikiSQL: validation loss 2.66 → 1.23 over 1,000 iterations; ~475 tokens/sec on M2 Ultra, ~250 tokens/sec on M1 Max 32GB (source: ml-explore/mlx-examples).
mlx-tune
mlx-tune (v0.4.19, latest verified) is a community wrapper by ARahim3 that provides an Unsloth-compatible API around MLX. It enables the same training scripts to work on Mac (via MLX) and cloud (via CUDA/Unsloth) by changing one import line.
Relevance to Guillaume: mlx-tune claims support for embedding model fine-tuning with contrastive learning (InfoNCE loss) for architectures including BERT, ModernBERT, Qwen3-Embedding, and Harrier. However, each capability should be verified against the v0.4.19 release notes; the capability matrix below reflects claimed features, not independently verified ones.
| Capability | Claimed Status |
|---|---|
| SFT / LoRA / QLoRA Training | ✅ Stable |
| DPO, ORPO, GRPO, KTO, SimPO | ✅ Stable |
| Vision Model Fine-Tuning (VLMs) | ✅ Stable |
| TTS / STT Fine-Tuning | ✅ Stable |
| Embedding Fine-Tuning (contrastive) | ✅ Stable (claimed) |
| Export to HuggingFace / GGUF | ✅ Stable |
⚠️ Caveat: mlx-tune's GGUF export for embedding models (encoder-only architectures) is unverified for QMD's use case. The export path documented is for causal LMs. See §Known Limitations.
Installation: pip install mlx-tune
Source: github.com/ARahim3/mlx-tune
PyTorch MPS Backend
Maturity: Stable in PyTorch 2.7+ (2025). MPS (Metal Performance Shaders) backend is included automatically in macOS PyTorch builds.
What it supports:
- Training and fine-tuning with GPU acceleration via Metal
- HuggingFace `Trainer` API auto-detects the MPS device
- sentence-transformers training works on MPS
Limitations:
- No distributed/multi-GPU training; single device only
- Partial operator coverage: some ops fall back to CPU (set `PYTORCH_ENABLE_MPS_FALLBACK=1`)
- Limited precision modes: float16/bfloat16 support is not on par with CUDA; mostly float32
- No fine-grained VRAM tracking (unlike CUDA)
- Approximately 30–40% slower than MLX for equivalent tasks due to data-copying overhead
Best for: sentence-transformers fine-tuning (PyTorch-native, no MLX port), HuggingFace Trainer-based workflows.
import torch
device = "mps" if torch.backends.mps.is_available() else "cpu"
Sources: PyTorch MPS docs, HuggingFace Apple Silicon guide
sentence-transformers
The sentence-transformers library is the standard tool for fine-tuning embedding models with contrastive learning, triplet loss, or cosine similarity objectives. It runs on PyTorch and supports the MPS backend.
Relevance: Primary option for fine-tuning embedding models if mlx-tune's embedding support proves insufficient. Uses MultipleNegativesRankingLoss for retrieval tasks, widely regarded as the strongest loss for teaching domain-specific similarity.
Caveat for embeddinggemma-300M: The google/embeddinggemma-300M model is not natively published as a sentence-transformers package. Loading it requires manual wrapping with a pooling layer, which adds complexity to the fine-tuning workflow. An alternative is to fine-tune Qwen3-Embedding-0.6B or all-MiniLM-L6-v2, which have native sentence-transformers support.
What Can Be Trained: Guillaume's Use Cases
1. QMD Model Enhancement
QMD uses exactly three GGUF models, all running locally via node-llama-cpp (verified in src/llm.ts, QMD commit cfd640e):
| Model | Params | Disk | VRAM | Role | Env Var |
|---|---|---|---|---|---|
| embeddinggemma-300M (Q8_0) | 300M | ~300MB | ~400MB | Vector embeddings | QMD_EMBED_MODEL |
| Qwen3-Reranker-0.6B (Q8_0) | 600M | ~640MB | ~700MB | Result reranking | QMD_RERANK_MODEL |
| qmd-query-expansion-1.7B (Q4_K_M) | 1.7B | ~1.1GB | ~1.5GB | Query expansion | QMD_GENERATE_MODEL |
All three can be swapped via environment variables or SDK constructor. This is the deployment mechanism for fine-tuned models.
embeddinggemma-300M (Embedding Model)
Architecture correction: embeddinggemma-300M is an encoder-only transformer (conceptually similar to BERT), not a Gemma 3 decoder model. This distinction is critical: encoder-only models use bidirectional attention and require different fine-tuning procedures than causal (decoder-only) LMs.
What fine-tuning achieves: Teach the embedding model that Indigenous knowledge concepts ("relational accountability," "medicine wheel," "seven grandfather teachings," "ceremony as methodology") are semantically close to each other and distinct from superficially similar Western academic terms. This directly improves QMD's vector search for Guillaume's domain.
Fine-tuning approach (sentence-transformers on PyTorch MPS):
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# NOTE: embeddinggemma-300M requires manual wrapping (not a native ST model)
# Alternative: use Qwen3-Embedding-0.6B or all-MiniLM-L6-v2 for simpler workflow
model = SentenceTransformer("google/embeddinggemma-300M", device="mps")
train_examples = [
InputExample(texts=["medicine wheel", "sacred circle representing four directions and life stages"]),
InputExample(texts=["relational accountability", "ethical research requires maintaining relationships"]),
InputExample(texts=["seven grandfather teachings", "wisdom love respect bravery honesty humility truth"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=50,
output_path="./indigenous-embeddinggemma-finetuned",
)
⚠️ BLOCKING CAVEAT (GGUF Conversion):
Converting the fine-tuned embedding model back to GGUF format is the hardest unsolved step in this workflow:
- EmbeddingGemma-300M is an encoder-only model, not a causal LM
- llama.cpp's `convert_hf_to_gguf.py` was primarily designed for causal LMs (LLaMA, Mistral, Gemma decoder models)
- GGUF has recently added embedding model support, but converting a fine-tuned sentence-transformers model (with potentially modified pooling layers) back to GGUF is experimental and unverified
- The original pre-trained embeddinggemma-300M GGUF exists on HuggingFace, but that conversion was done by `ggml-org` on the original weights, not on fine-tuned weights
Recommended mitigation:
- Before investing in Indigenous knowledge dataset creation, test the full pipeline end-to-end with a trivially fine-tuned model (e.g., train on 10 dummy pairs, convert to GGUF, load in QMD, verify it produces embeddings)
- If conversion fails, fall back to Qwen3-Embedding-0.6B, which uses a causal (decoder) architecture and may have better GGUF conversion support. QMD already supports it as an alternative embedding model
- Another fallback: keep the fine-tuned model in PyTorch format and modify QMD to load it via sentence-transformers instead of node-llama-cpp (requires QMD source modification)
Training data needed: 500–5,000 query-document pairs with similarity signals. See §Training Data Preparation.
Time estimate: 5–20 minutes for 1K–10K pairs on a Mac Mini M4 Pro (MPS backend).
Qwen3-Reranker-0.6B (Reranker)
The reranker is a standard causal LM (0.6B parameters, Qwen3 architecture) that produces yes/no relevance scores via logprob comparison:
score = exp(logprob_yes) / (exp(logprob_yes) + exp(logprob_no))
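In code, that score is just a two-way softmax over the yes/no log-probabilities. A minimal sketch (the helper name `rerank_score` is illustrative, not a QMD API):

```python
import math

def rerank_score(logprob_yes: float, logprob_no: float) -> float:
    """Two-way softmax over the yes/no token log-probabilities.

    Subtracting the max first is the standard numerical-stability trick;
    it does not change the result.
    """
    m = max(logprob_yes, logprob_no)
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)

# A document whose "yes" token is much more probable scores near 1.0
print(round(rerank_score(-0.2, -2.0), 3))  # prints 0.858
```

Because the score is a ratio of exponentiated logprobs, it always lands in (0, 1), which is why naive SFT that shifts the yes/no logit distribution can silently break the calibration.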
Fine-tuning potential: Medium impact. Teaching domain-specific relevance judgments would improve result ordering for Indigenous knowledge queries. However:
- No fine-tuning pipeline exists in QMD for this model; it would need to be built from scratch
- The yes/no logprob scoring mechanism requires careful calibration; naive SFT could break the scoring distribution
- Being a causal LM, standard LoRA fine-tuning via MLX-LM is technically feasible
- Should be attempted only after query expansion and embedding models show improvement
Training data format:
{"query": "four directions teachings", "document": "The Medicine Wheel maps...", "relevant": true}
{"query": "four directions teachings", "document": "GPS navigation uses four...", "relevant": false}
Priority: Third, after query expansion and embedding.
qmd-query-expansion-1.7B (Query Expansion)
This is the easiest and safest model to fine-tune: QMD provides a complete, production-grade fine-tuning pipeline in its finetune/ directory.
What it does: Expands user queries into structured search expansions:
Input: "medicine wheel"
Output:
hyde: The Medicine Wheel is a sacred circle representing the four directions...
lex: medicine wheel four directions
lex: sacred circle indigenous ceremony
vec: what are the teachings of the medicine wheel in indigenous tradition
Existing pipeline details (verified in QMD source):
- LoRA SFT on Qwen3-1.7B (rank 16, alpha 32, all projection layers)
- ~2,290 training examples in production model
- Training tools: `train.py`, `eval.py`, `convert_gguf.py`, `dataset/prepare_data.py`
- Data format: JSONL with `query` and `output` fields (processed into the Qwen3 chat template by `prepare_data.py`)
⚠️ CUDA Dependency: QMD's `finetune/pyproject.toml` depends on `nvidia-ml-py`, and the training configs target A10G (CUDA) GPUs. Running `uv run train.py` on Mac will fail without modification.
Three paths forward:
| Path | Effort | Cost | Data Location |
|---|---|---|---|
| HuggingFace Jobs (recommended) | Low (use existing pipeline) | ~$1.50/run (A10G, ~45 min) | ⚠️ Sent to HF cloud |
| MLX-LM local rewrite | Medium (rewrite train.py for MLX) | $0 | ✅ Local only |
| PyTorch MPS modification | Medium (remove nvidia deps, change device config) | $0 | ✅ Local only |
For data sovereignty reasons (see §Data Sovereignty), the local MLX-LM path is recommended for Indigenous knowledge data:
pip install mlx-lm
python -m mlx_lm.convert --hf-path Qwen/Qwen3-1.7B --mlx-path mlx_models/Qwen3-1.7B
python -m mlx_lm.lora \
--model mlx_models/Qwen3-1.7B \
--train --data ./indigenous-expansion-data/ \
--batch-size 2 --lora-layers 16 --iters 1000
# Fuse and export
python -m mlx_lm.fuse \
--model mlx_models/Qwen3-1.7B \
--adapter-file adapters/adapters.npz \
--export-gguf
How many examples needed: 200–500 high-quality domain-specific expansion examples for meaningful improvement. QMD's own model used ~2,290 examples.
Training time: 15–30 minutes on a Mac Mini M4 Pro (estimated from model size and community benchmarks).
Deploying Fine-Tuned Models Back to QMD
Deployment uses environment variable swapping (verified in src/llm.ts):
# Point QMD to your fine-tuned models
export QMD_EMBED_MODEL="$HOME/models/indigenous-embed-Q8_0.gguf"
export QMD_GENERATE_MODEL="$HOME/models/indigenous-expansion-q4_k_m.gguf"
export QMD_RERANK_MODEL="$HOME/models/indigenous-reranker-Q8_0.gguf"
# CRITICAL: After changing the embedding model, re-embed ALL documents
qmd embed
⚠️ Important considerations:
- Re-embedding is required after changing the embedding model. Old embeddings are incompatible with the new model's embedding space.
- Embedding dimensions must match. If switching from embeddinggemma-300M (768-dim) to a model with different dimensions, the sqlite-vec index becomes incompatible and requires full re-indexing.
- Re-embedding time depends on corpus size; estimate 1–5 minutes for hundreds of documents, longer for thousands (unverified; test with your corpus).
- The GGUF export step for the query expansion model is proven (QMD's own pipeline does this). The GGUF export for the embedding model is unverified (see above).
Recommended version management:
~/models/qmd-indigenous/
├── v1/
│   ├── indigenous-embed-Q8_0.gguf
│   ├── indigenous-expansion-q4_k_m.gguf
│   └── training-metadata.json
├── v2/
│   └── ...
└── current -> v2/  # symlink to active version
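The promote/rollback mechanics behind the `current` symlink can be sketched as follows; paths are illustrative, and `ln -sfn` is used so an existing link is replaced rather than followed:

```shell
#!/bin/sh
# Sketch of the symlink-based version switch (uses a temp dir for illustration).
set -eu
root="$(mktemp -d)/qmd-indigenous"
mkdir -p "$root/v1" "$root/v2"
touch "$root/v1/indigenous-embed-Q8_0.gguf" "$root/v2/indigenous-embed-Q8_0.gguf"

# Promote v2: -s symlink, -f replace existing, -n do not descend into the old link
ln -sfn v2 "$root/current"
readlink "$root/current"   # prints v2

# Roll back to v1 by repointing the same link
ln -sfn v1 "$root/current"
readlink "$root/current"   # prints v1
```

Repointing a symlink is near-instant, so QMD only ever sees a fully populated version directory; the env vars from the deployment section can reference `current/` and never need to change between versions.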
2. Persona LoRA Adapters
Feasibility: ✅ Proven, routine on Apple Silicon.
Each AI persona (Mia the architect, Miette the emotional resonator, Tushell the journal keeper, etc.) gets its own LoRA adapter trained on persona-specific conversation data, instructions, and domain knowledge.
Workflow:
- Curate persona-specific training data as JSONL (conversations, instructions, domain text)
- Run QLoRA fine-tuning on a base model (e.g., Llama 3.1 8B-Instruct, 4-bit quantized)
- Export adapter weights (~50–200MB per persona)
- Serve via Ollama with the `ADAPTER` directive in a Modelfile
Resource requirements per persona:
- Base model: ~4–8GB RAM (4-bit 8B model)
- Training overhead: ~2–4GB additional
- Total: ~8–12GB RAM during training
- Time: 10–30 minutes per persona for 500–1,000 training steps
- Multiple personas can be trained sequentially overnight
Training data format (JSONL):
{"text": "<|user|>\nHow should we approach this relationship with the land?\n<|assistant|>\nAs Mia, I see the architectural pattern here โ the land relationship is a structural foundation, not a resource to be extracted. Let me map the dependencies..."}
Deployment via Ollama Modelfile:
FROM ./llama-3.1-8b.Q4_K_M.gguf
ADAPTER ./persona-mia.lora
SYSTEM "You are Mia, the architectural thinker..."
ollama create mia-persona -f Modelfile
ollama run mia-persona
Recommended base models for personas:
| Model | Size | RAM Needed | Quality | Training Time (1K steps) |
|---|---|---|---|---|
| Llama 3.2-3B (4-bit) | 3B | ~4GB | Good | ~10 min |
| Qwen3-4B (4-bit) | 4B | ~5GB | Better | ~15 min |
| Llama 3.1-8B-Instruct (4-bit) | 8B | ~8GB | Best for persona work | ~20–30 min |
3. Domain-Specific Classification/NER
Feasibility: ✅ Trivial on any Mac.
Small models for recognizing Indigenous concepts, ceremony names, teaching references, and relational terms in text.
Tool: spaCy (v3.x/v4.x, fully supports Apple Silicon) or HuggingFace token-classification pipeline.
Custom entity types: CEREMONY, DIRECTION, TEACHING, PLACE, RELATION, LANGUAGE_TERM
Requirements:
- 200–500 annotated documents for meaningful NER
- Training time: 5–15 minutes on any Mac
- RAM: Under 8GB for BERT-base models (110M params)
Impact: Auto-tagging QMD documents with rich metadata to improve search context and filtering.
Training Data Preparation
From Indigenous Knowledge Base
Guillaume's existing QMD-indexed knowledge base provides the foundation for training data generation.
For query expansion training (JSONL format):
{"query": "medicine wheel", "output": [["hyde", "The Medicine Wheel represents the four directions..."], ["lex", "sacred circle four directions"], ["lex", "indigenous cosmology ceremony"], ["vec", "what are the teachings of the medicine wheel"]]}
Generate these by using an LLM to create plausible search queries for existing documents, then manually writing the ideal expansion output. Quality matters more than quantity: 200–500 carefully crafted examples outperform 5,000 sloppy ones.
For embedding fine-tuning (query-document pairs):
pairs = [
("medicine wheel", "The sacred circle represents the four directions and life stages"),
("relational accountability", "Research as ceremony requires maintaining relationships"),
("structural tension", "The creative force between current reality and desired vision"),
]
Sources of pairs:
- Document title → document content (natural positive pairs)
- LLM-generated synthetic queries for each document
- Manual curation of concept-to-explanation mappings
- Cross-references between related documents
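The first source (title → content) is purely mechanical to generate. A minimal sketch, assuming the corpus is available as a title-to-content mapping:

```python
def pairs_from_docs(docs: dict[str, str]) -> list[tuple[str, str]]:
    """Turn title -> content mappings into (query, document) positive pairs,
    skipping empty documents."""
    return [(title, body) for title, body in docs.items() if body.strip()]

docs = {
    "medicine wheel": "The sacred circle represents the four directions and life stages",
    "relational accountability": "Research as ceremony requires maintaining relationships",
}
pairs = pairs_from_docs(docs)
print(len(pairs))  # prints 2
```

The other three sources (synthetic queries, manual curation, cross-references) feed into the same `(query, document)` tuple shape, so they can be concatenated into one training list.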
For persona LoRA training (conversation JSONL): Extract or write persona-specific conversations showing the persona's voice, knowledge, and reasoning style. Each persona needs 200–1,000 conversation examples.
Volume Requirements
| Training Type | Minimum Viable | Good Quality | Excellent |
|---|---|---|---|
| Query expansion | 100 examples | 200–500 examples | 1,000+ examples |
| Embedding fine-tuning | 500 pairs | 2,000–5,000 pairs | 10,000+ pairs |
| Persona LoRA | 200 conversations | 500–1,000 conversations | 2,000+ conversations |
| NER training | 200 annotated docs | 500–1,000 docs | 5,000+ docs |
Data augmentation: Use a local LLM to generate paraphrases of existing documents. This can multiply training data 3–5×, but requires human review to ensure paraphrases maintain semantic accuracy; this is especially important for Indigenous knowledge where nuance matters.
Data Sovereignty & Cultural Protocols
This section is foundational, not optional. For an Indigenous-AI Collaborative Platform, data sovereignty is a core requirement, not an afterthought. Local training on a Mac Mini provides inherent advantages, but specific protocols must be followed.
Local-Only Training as Sovereignty Advantage
Training AI models locally on a Mac Mini means:
- No training data leaves the machine. Unlike cloud-based fine-tuning (including HuggingFace Jobs at ~$1.50/run), local training keeps all Indigenous knowledge on-premises.
- Model weights stay local. Fine-tuned model weights encode patterns from training data. Uploading them to HuggingFace Hub or other public repositories could expose encoded Indigenous knowledge. Do not push models trained on culturally sensitive data to public repositories.
- No third-party access. Cloud training providers may log data, cache models, or retain training artifacts. Local training avoids these risks entirely.
Recommendation: Use the local MLX-LM or PyTorch MPS training paths for all Indigenous knowledge training. Reserve the HuggingFace Jobs cloud path only for non-sensitive, general-purpose training data.
OCAP Principles Applied to AI Training
The First Nations principles of OCAP® (Ownership, Control, Access, and Possession) should govern training data and model management:
- Ownership: The community owns all training data derived from its knowledge. Fine-tuned model weights are derivative works and inherit the same ownership.
- Control: Community governance determines what knowledge can be used for training, who can initiate training runs, and who can access fine-tuned models.
- Access: Fine-tuned models should be accessible only to authorized users. Ollama's local-only deployment model supports this naturally.
- Possession: Physical custody of training data and model weights remains with the community. The Mac Mini sitting in Guillaume's workspace provides this.
Knowledge Classification for Training
Not all Indigenous knowledge can or should be used for model training. Recommended classification:
| Category | Training Use | Example |
|---|---|---|
| Public Teaching | ✅ Appropriate for training | Published educational materials, public presentations, general cultural context |
| Community Knowledge | ⚠️ Requires explicit consent | Community-shared stories, local governance processes, land-based practices |
| Sacred/Restricted | ❌ Must NOT be used | Ceremonial details, sacred songs, vision quest accounts, clan-specific knowledge |
| Personal/Private | ⚠️ Requires individual consent | Personal journals, private correspondence, individual research notes |
Before any training run using community knowledge:
- Obtain free, prior, and informed consent from knowledge holders
- Document which knowledge was used in training metadata
- Ensure right to withdraw: the ability to retrain the model without specific contributions
- Include attribution mechanisms in model metadata files
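The `training-metadata.json` file from the version layout earlier is a natural place to record this consent trail. An illustrative sketch (all field names are assumptions, not a QMD schema):

```json
{
  "version": "v1",
  "trained": "2026-04-18",
  "base_model": "google/embeddinggemma-300M",
  "data_sources": [
    {
      "source": "public-teachings/",
      "classification": "Public Teaching",
      "consent_record": "n/a (published material)"
    },
    {
      "source": "community-stories/",
      "classification": "Community Knowledge",
      "consent_record": "consent-2026-03-12.pdf",
      "withdrawable": true
    }
  ],
  "excluded_classifications": ["Sacred/Restricted"],
  "training_params": { "epochs": 3, "loss": "MultipleNegativesRankingLoss" }
}
```

Keeping this file next to the model weights means the consent documentation travels with every version and survives rollbacks.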
Model Access Controls
- Store fine-tuned model weights in encrypted volumes on the Mac Mini
- Maintain a training log documenting data sources, consent records, and training parameters
- Treat fine-tuned model weights with the same access controls as the source knowledge
- If models must be shared, only share those trained exclusively on Public Teaching content
Mac Hardware Scenarios for Training
Apple Silicon Chip Comparison (Training-Relevant)
| Chip | GPU Cores | Max RAM | Memory Bandwidth | Available In |
|---|---|---|---|---|
| M4 (base) | 10 | 32GB | 120 GB/s | Mac Mini ($599+) |
| M4 Pro | 20 | 64GB | 273 GB/s | Mac Mini ($1,399+) |
| M4 Max | 40 | 128GB | 546 GB/s | Mac Studio ($1,999+) |
| M4 Ultra | 80 | 192GB | ~800+ GB/s | Mac Studio ($3,999+) |
Memory bandwidth is the primary bottleneck for training throughput, not GPU core count. Higher bandwidth = faster token processing during training. Training time scales roughly inversely with bandwidth.
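As a rough worked example of that inverse-bandwidth scaling (a first-order approximation only; real speedups also depend on compute and memory-access patterns):

```python
def scale_time(minutes: float, bw_from: float, bw_to: float) -> float:
    """Scale a measured training time by the inverse ratio of memory bandwidths."""
    return minutes * (bw_from / bw_to)

# A run that takes 60 min at 120 GB/s (M4 base) maps to ~26 min at 273 GB/s (M4 Pro)
print(round(scale_time(60, 120, 273)))  # prints 26
```

The 273/120 ratio (~2.3×) is the same factor quoted for the base M4 in Scenario 1 below.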
Scenario 1: Minimal Mac Mini
Config: Mac Mini M4 (base), 24GB RAM, 512GB SSD
Price: ~$799–$999
What it can handle:
- ✅ QLoRA fine-tuning of 3B–7B models (4-bit quantized): ~8GB model, leaves room for OS
- ✅ Embedding model fine-tuning (300M–600M models): trivial
- ✅ spaCy NER training: trivial
- ✅ sentence-transformers fine-tuning
What it cannot handle:
- ❌ LoRA on 13B+ models (insufficient RAM for optimizer states)
- ❌ Full fine-tuning of anything larger than 3B
- ❌ Training while other heavy workloads run
Limitations:
- 10 GPU cores and 120 GB/s bandwidth make training ~2.3× slower than the M4 Pro
- 24GB is tight: model + optimizer + activations must all fit
- Batch size limited to 1–2 for 7B models
- 512GB SSD may be tight if storing multiple model checkpoints (estimate 20–50GB per training cycle)
Training time (7B QLoRA, 1K steps): ~45–90 minutes
Verdict: Adequate for prototyping and small models. Not recommended for regular persona training workloads.
Scenario 2: Maximal Mac Mini (⭐ RECOMMENDED)
Config: Mac Mini M4 Pro, 48GB RAM
- With 512GB SSD: ~$1,799
- With 1TB SSD: ~$2,299 (recommended for storing model checkpoints)
Prices verified via Apple Store, B&H Photo, and Micro Center as of April 2026. Deal pricing as low as ~$1,539 has been observed on discount channels.
What it can handle:
- ✅ QLoRA fine-tuning of 7B–8B models comfortably: ~8GB model + 4GB overhead, 36GB headroom
- ✅ LoRA fine-tuning of 7B–8B models (full precision): ~14GB model + overhead
- ✅ QLoRA on 13B models: ~16GB model + overhead, tight but workable
- ✅ All three QMD models fine-tuned individually
- ✅ Multiple sequential persona training runs overnight
- ✅ Training while light workloads continue
What it cannot handle:
- ❌ Full fine-tuning of 13B+ models
- ❌ QLoRA on 30B+ models
Key specs: 20 GPU cores, 273 GB/s memory bandwidth.
Verdict: The sweet spot for Guillaume's use case. All realistic training workloads complete in reasonable time. Weekend self-training is fully viable.
Upgrade option: The 64GB RAM variant (~$2,699) provides headroom for 13B models and parallel workloads.
Scenario 3: Mac Studio Alternative
When the Mac Mini isn't enough: for future scaling to larger models or parallel training.
Config: Mac Studio M4 Max, 128GB RAM
Price: ~$3,199 (verified April 2026; note: supply shortages reported at high-end configs)
Memory bandwidth: 546 GB/s (2× Mac Mini M4 Pro)
GPU cores: 40
Unlocks:
- QLoRA on 30B models comfortably
- Full LoRA on 13B models without compromise
- Multiple concurrent training jobs
- ~2× faster training than M4 Pro
When Guillaume needs this: If he moves to fine-tuning 30B+ models, needs parallel persona training, or wants to do continual pretraining.
Comparison Table
| Capability | Mac Mini M4 24GB (~$999) | Mac Mini M4 Pro 48GB (~$1,799–$2,299) | Mac Studio M4 Max 128GB (~$3,199) |
|---|---|---|---|
| Embedding fine-tuning | ✅ | ✅ | ✅ |
| 7B QLoRA persona adapters | ⚠️ Tight, slow | ✅ Comfortable | ✅ Overkill |
| 13B QLoRA | ❌ | ⚠️ Tight | ✅ Comfortable |
| 30B QLoRA | ❌ | ❌ | ✅ |
| Weekend batch training (5 personas) | ⚠️ ~4–7 hours | ✅ ~2–3 hours | ✅ ~1–1.5 hours |
| QMD model retraining | ✅ | ✅ | ✅ |
| Thermal concerns for overnight runs | ⚠️ Possible throttling | ⚠️ Monitor | ✅ Better cooling |
Training Time Estimates
Methodology note: M4 Pro times are extrapolated from M2/M3 community benchmarks scaled by the memory bandwidth ratio (273/150 ≈ 1.8×). This is a reasonable first approximation but not benchmarked on M4 Pro hardware directly. Actual times may vary ±30%.
LLM LoRA/QLoRA Fine-Tuning
| Model | Method | Hardware | Steps | Batch | Estimated Time | Source |
|---|---|---|---|---|---|---|
| Mistral 7B (4-bit) | QLoRA | M2 Pro 16GB | 1,000 | 4 | ~30 min | markaicode.com (measured) |
| Mistral 7B (4-bit) | QLoRA | M1 Max 64GB | 1,000 | 8 | ~15 min | randalscottking.com (measured) |
| Llama 7B (FP16) | LoRA | M2 Ultra | 1,000 | 4 | ~35 min | ml-explore/mlx-examples (measured, 475 tok/s) |
| Llama 8B (4-bit) | QLoRA | M4 Pro 48GB | 1,000 | 4–8 | ~10–25 min | Extrapolated from M2 benchmarks |
| 13B (4-bit) | QLoRA | M4 Pro 48GB | 1,000 | 1–2 | ~45–90 min | Community estimates |
| Qwen3-1.7B | LoRA SFT | M4 Pro 48GB | 1,000 | 4 | ~10–15 min | Extrapolated (small model) |
Embedding Model Fine-Tuning
| Model | Data Size | Epochs | Hardware | Estimated Time |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (22M) | 1,000 pairs | 3 | Any Mac (MPS) | 1–3 min |
| embeddinggemma-300M | 1,000 pairs | 3 | M4 Pro (MPS) | 3–8 min |
| embeddinggemma-300M | 10,000 pairs | 3 | M4 Pro (MPS) | 10–25 min |
| Qwen3-Embedding-0.6B | 5,000 pairs | 3 | M4 Pro (MPS/MLX) | 10–20 min |
Complete Weekend Training Cycle
| Step | Task | Estimated Duration |
|---|---|---|
| 1 | Data preparation scripts | 15–30 min |
| 2 | Query expansion model LoRA (1.7B, 1K steps) | 10–15 min |
| 3 | Persona 1 QLoRA (8B, 1K steps) | 20–25 min |
| 4 | Persona 2 QLoRA | 20–25 min |
| 5 | Persona 3 QLoRA | 20–25 min |
| 6 | Persona 4 QLoRA | 20–25 min |
| 7 | Persona 5 QLoRA | 20–25 min |
| 8 | Embedding model fine-tuning (300M) | 10–20 min |
| 9 | NER model training | 5–10 min |
| 10 | Export adapters, GGUF conversion, rebuild Ollama models | 15–20 min |
| 11 | Validation tests | 10–15 min |
| Total | | ~2.5–4 hours |
The entire pipeline completes in a single evening; the machine is available for other work by morning.
Automated Weekend Training Pipeline
Practical launchd/cron-based Workflow
On macOS, launchd is the recommended scheduling mechanism (preferred over cron), though cron also works.
#!/bin/bash
# weekend-training.sh - scheduled via launchd for Friday night
# Includes error handling, logging, and validation gates
set -euo pipefail
LOG="$HOME/models/training-logs/$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$(dirname "$LOG")"
exec > >(tee -a "$LOG") 2>&1
echo "=== Training run started: $(date) ==="
# 0. Snapshot current working models (rollback point)
cp -r "$HOME/models/qmd-indigenous/current" \
"$HOME/models/qmd-indigenous/backup-$(date +%Y%m%d)" 2>/dev/null || true
# 1. Generate training data from latest QMD content
echo "[Step 1] Generating training data..."
python scripts/generate_training_data.py || {
echo "FAILED: Data generation. Aborting."
# Send notification (e.g., via terminal-notifier on macOS)
osascript -e 'display notification "Training data generation failed" with title "IAIP Training"'
exit 1
}
# 2. Fine-tune query expansion model
echo "[Step 2] Training query expansion model..."
python -m mlx_lm.lora \
--model mlx_models/Qwen3-1.7B \
--train --data ./data/expansion/ \
--batch-size 2 --lora-layers 16 --iters 1000 \
--adapter-file adapters/expansion.npz
# 3. Fine-tune each persona adapter
for persona in mia miette tushell council; do
echo "[Step 3] Training persona: $persona"
python -m mlx_lm.lora \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--train --data "data/${persona}/" \
--iters 1000 --batch-size 4 --lora-layers 8 \
--adapter-file "adapters/${persona}.npz" || {
echo "WARNING: Persona $persona training failed, continuing..."
continue
}
done
# 4. Export and rebuild
echo "[Step 4] Exporting models..."
python -m mlx_lm.fuse \
--model mlx_models/Qwen3-1.7B \
--adapter-file adapters/expansion.npz \
--export-gguf
for persona in mia miette tushell council; do
[ -f "adapters/${persona}.npz" ] || continue
python -m mlx_lm.fuse \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--adapter-file "adapters/${persona}.npz" \
--export-gguf
ollama create "${persona}-persona" -f "Modelfiles/${persona}"
done
# 5. Validation gate โ test search quality before deploying
echo "[Step 5] Running validation..."
python scripts/validate_models.py || {
echo "FAILED: Validation. Rolling back to previous models."
ln -sfn "$HOME/models/qmd-indigenous/backup-$(date +%Y%m%d)" \
"$HOME/models/qmd-indigenous/current"
osascript -e 'display notification "Validation failed - rolled back" with title "IAIP Training"'
exit 1
}
# 6. Deploy new models (step 4 exports should be copied into the dated version dir)
echo "[Step 6] Deploying..."
NEW_VER="$HOME/models/qmd-indigenous/v-$(date +%Y%m%d)"
mkdir -p "$NEW_VER"
mv "$HOME/models/qmd-indigenous/current" \
   "$HOME/models/qmd-indigenous/previous" 2>/dev/null || true
ln -sfn "$NEW_VER" "$HOME/models/qmd-indigenous/current"
echo "=== Training run completed: $(date) ==="
osascript -e 'display notification "Training complete - new models deployed" with title "IAIP Training"'
Monitoring and Validation
A validation script should test search quality before deploying new models:
# scripts/validate_models.py
# NOTE: qmd_search is a hypothetical thin wrapper around QMD's search CLI/SDK
test_queries = {
"medicine wheel teachings": ["medicine-wheel.md", "four-directions.md"],
"relational accountability": ["research-ceremony.md", "ethics.md"],
"seven grandfather teachings": ["seven-teachings.md", "anishinaabe.md"],
}
for query, expected_docs in test_queries.items():
results = qmd_search(query, top_k=5)
found = [r for r in results if r.path in expected_docs]
if len(found) < 1:
raise AssertionError(f"Query '{query}' failed to find expected documents")
Evaluation Metrics
To measure whether fine-tuning actually improved search:
- MRR (Mean Reciprocal Rank): Average of 1/rank for first relevant result across test queries
- Precision@5: Fraction of top-5 results that are relevant
- A/B comparison: Run the same test queries against base and fine-tuned models, compare rankings
Maintain a test query set of 20–50 domain-specific queries with known expected results.
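Both metrics are short enough to compute without a library. A minimal sketch, assuming each query's results are an ordered list of document paths and the expected documents form a set:

```python
def mrr(ranked_paths, relevant):
    """Reciprocal rank of the first relevant result (0.0 if none appears)."""
    for rank, path in enumerate(ranked_paths, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_paths, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for path in ranked_paths[:k] if path in relevant) / k

# Toy example: first relevant hit at rank 2, two relevant docs in the top 5
ranked = ["intro.md", "medicine-wheel.md", "notes.md", "four-directions.md", "misc.md"]
relevant = {"medicine-wheel.md", "four-directions.md"}
print(mrr(ranked, relevant))             # 0.5
print(precision_at_k(ranked, relevant))  # 0.4
```

For the A/B comparison, run both functions over the same test set with the base and fine-tuned models and compare the averages.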
Known Limitations & Honest Caveats
GGUF Conversion for Embedding Models – ⚠️ UNVERIFIED
The entire embedding fine-tuning → GGUF conversion → QMD deployment pipeline is theoretical. The individual steps are plausible, but the end-to-end path has not been verified by anyone in the community as of April 2026. Specific risks:
- convert_hf_to_gguf.py may not correctly handle fine-tuned sentence-transformers models with modified pooling layers
- Quantization (Q8_0) of a fine-tuned encoder-only model may degrade embedding quality unpredictably
- node-llama-cpp may not correctly load a converted encoder model that differs structurally from the original
Mitigation: Test with dummy data first. Have a fallback plan (Qwen3-Embedding-0.6B, which uses decoder architecture).
CUDA Dependencies in QMD's Pipeline
QMD's finetune/ directory is a CUDA-targeted pipeline. The pyproject.toml includes nvidia-ml-py, and training configs reference A10G GPUs. This pipeline cannot run on Mac without:
- Removing nvidia-ml-py from dependencies
- Changing the device configuration from cuda to mps
- Reducing batch sizes to fit in available memory
- Potentially modifying mixed-precision settings (MPS has limited bfloat16/float16 support)
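The device change is the heart of the port. As an illustrative sketch only (not a patch to QMD's actual training scripts), the usual PyTorch pattern is to probe for MPS and fall back to CPU, keeping float32 because of MPS's limited half-precision support:

```python
# Hedged sketch of the cuda -> mps device change. QMD's real training
# scripts would need the equivalent edit wherever they select a device.
try:
    import torch
    if torch.backends.mps.is_available():
        device = "mps"
        dtype = torch.float32  # safest on MPS; bf16/fp16 support is limited
    else:
        device = "cpu"
        dtype = torch.float32
except ImportError:
    # torch not installed in this environment; nothing to configure
    device, dtype = "cpu", None

print(device)
```

Models and tensors would then be moved with `.to(device)` in place of the pipeline's existing `.cuda()` calls.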
The HuggingFace Jobs cloud path (~$1.50/run) works without modification but sends training data off-device – a data sovereignty concern for Indigenous knowledge.
What's Proven vs Experimental
| Component | Status | Evidence |
|---|---|---|
| MLX-LM LoRA/QLoRA training on Apple Silicon | ✅ Proven | Apple's official examples, multiple community guides, r/LocalLLaMA reports |
| Ollama GGUF/LoRA import workflow | ✅ Proven | Documented in Ollama repository |
| QMD model swapping via env vars | ✅ Verified | Confirmed in source code |
| QMD query expansion fine-tuning pipeline | ✅ Proven | Production pipeline exists, documented |
| sentence-transformers fine-tuning on MPS | ✅ Proven | Standard PyTorch workflow |
| M4 Pro training speed estimates | ⚠️ Extrapolated | Scaled from M2/M3 benchmarks by bandwidth ratio |
| mlx-tune embedding fine-tuning | ⚠️ Claimed | Listed as stable in v0.4.19, limited community verification |
| embeddinggemma-300M sentence-transformers loading | ⚠️ Needs wrapping | Not a native ST model; requires a manual pooling layer |
| Fine-tuned embedding → GGUF conversion | ❌ Unverified | No community reports of a successful end-to-end path |
| Reranker fine-tuning with calibration preservation | ❌ Unverified | No pipeline or documentation exists |
Thermal and Disk Space
- Thermal: Mac Mini's compact form factor may cause thermal throttling during sustained 2–3 hour training runs. Monitor CPU/GPU temperatures. Ensure good ventilation.
- Disk space: Each training run generates checkpoints, optimizer states, and cached datasets. Budget 20–50GB of scratch space for five persona training runs plus model exports. The 1TB SSD configuration is recommended.
Multilingual Considerations
Indigenous communities often have knowledge in Indigenous languages, French (Guillaume's Québécois context), and English. The current QMD embedding model (embeddinggemma-300M) was trained on 100+ languages but may not adequately represent Indigenous languages with small digital footprints. Fine-tuning with bilingual/trilingual training pairs is important.
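As an illustration (not taken from the QMD pipeline), bilingual training pairs for contrastive embedding fine-tuning are typically (query, passage) tuples; pairing English and French queries with the same passage encourages cross-lingual alignment in the fine-tuned model:

```python
# Illustrative (query, passage) pairs for contrastive embedding fine-tuning.
# All text here is invented example data; real pairs would come from the
# Indigenous knowledge corpus. This format matches what contrastive losses
# such as MultipleNegativesRankingLoss expect: each query paired with a
# passage it should retrieve.
train_pairs = [
    ("medicine wheel teachings",
     "The medicine wheel maps the four directions onto stages of life."),
    # Same passage, French query -> teaches cross-lingual alignment
    ("enseignements de la roue de médecine",
     "The medicine wheel maps the four directions onto stages of life."),
]

print(len(train_pairs))
```

The same structure extends to trilingual data by adding Indigenous-language queries pointing at the shared passages.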
Community Resources
Framework Documentation
- ml-explore/mlx-examples LoRA – Apple's official LoRA fine-tuning example
- ml-explore/mlx-lm – Apple's LLM package (v0.31.1)
- ARahim3/mlx-tune – Unsloth-compatible fine-tuning for Mac (v0.4.19)
- PyTorch MPS docs – Official MPS backend
- sentence-transformers training guide – Embedding model fine-tuning
Guides and Benchmarks
- markaicode.com – MLX-LM Fine-Tuning Guide – Practical training walkthrough with times
- randalscottking.com – MLX Framework Guide – Step-by-step guide
- blog.amsayed.dev – Fine-Tuning on Apple Silicon – Practical walkthrough
- r/LocalLLaMA community benchmarks – ~10,000 benchmark runs
QMD-Specific
- tobi/qmd GitHub – Source repository (MIT license)
- QMD finetune/ directory – Complete query expansion training pipeline
- DeepWiki – QMD Vector Embeddings – Architecture overview
HuggingFace Model Cards
- google/embeddinggemma-300M – Embedding model
- Qwen/Qwen3-Reranker-0.6B – Reranker model
- tobil/qmd-query-expansion-1.7B – Query expansion model
Hardware
- Apple Mac Mini Specs – Official specifications
- Apple Mac Studio Specs – Official specifications
Data Sovereignty
- OCAP® Principles – First Nations Information Governance Centre
- CARE Principles for Indigenous Data Governance – Collective Benefit, Authority to Control, Responsibility, Ethics
Recommendation
What to Buy
Mac Mini M4 Pro, 48GB RAM, 1TB SSD (~$2,299) – the recommended configuration. The 1TB SSD provides space for model checkpoints, training artifacts, and multiple model versions. The 48GB RAM handles all training tasks described in this document comfortably.
If budget is constrained, the 512GB SSD variant (~$1,799) works but monitor disk space. External storage can supplement if needed.
Do not buy the base M4 Mac Mini for regular training workloads – the 120 GB/s bandwidth makes training 2.3× slower, and 24GB RAM limits model options.
What to Train First
Phase 1 (Weeks 1–2): QMD Query Expansion Model
- Lowest risk – existing pipeline, proven GGUF conversion
- Write 200–300 Indigenous knowledge expansion examples
- Train locally via MLX-LM (data stays local)
- Deploy via the QMD_GENERATE_MODEL env var
- Measure improvement with test queries
Phase 2 (Weeks 3–4): Persona LoRA Adapters
- Proven MLX-LM workflow, routine on Apple Silicon
- Start with one persona (e.g., Mia), validate quality
- Scale to all personas once workflow is solid
- Deploy via Ollama Modelfiles
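The Modelfiles referenced in the training script (e.g. Modelfiles/mia) pair each fused GGUF with a persona system prompt. A minimal sketch, with an illustrative file path and a placeholder prompt:

```
# Modelfiles/mia -- illustrative sketch, not the actual IAIP persona file
FROM ./mia-fused.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """
You are Mia. (Persona system prompt goes here.)
"""
```

Building and running it follows the standard Ollama flow: ollama create mia-persona -f Modelfiles/mia, then ollama run mia-persona.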
Phase 3 (Month 2): Embedding Model Fine-Tuning
- Highest impact but highest risk
- First: Test GGUF conversion pipeline with dummy data
- If conversion works: fine-tune with Indigenous knowledge pairs
- If conversion fails: fall back to Qwen3-Embedding-0.6B or explore QMD source modification
- Deploy via the QMD_EMBED_MODEL env var + full re-embedding
Phase 4 (Month 3+): Reranker, NER, Classification
- Build custom pipeline for reranker domain adaptation
- Train NER model for Indigenous terminology extraction
- Integrate into automated weekend pipeline
Automation
Once Phases 1–2 are validated manually, implement the automated weekend training script (see §Automated Weekend Training Pipeline). Schedule via launchd for Friday nights. Include validation gates and rollback capability.
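A launchd agent for a Friday 11 p.m. run might look like the following sketch; the label, script path, and log locations are illustrative, not taken from the IAIP setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.iaip.weekend-training</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/guillaume/scripts/weekend_training.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Weekday</key>
    <integer>5</integer> <!-- Friday -->
    <key>Hour</key>
    <integer>23</integer>
    <key>Minute</key>
    <integer>0</integer>
  </dict>
  <key>StandardOutPath</key>
  <string>/Users/guillaume/logs/training.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/guillaume/logs/training.err</string>
</dict>
</plist>
```

Place the file in ~/Library/LaunchAgents/ and load it with launchctl. Note that launchd only fires StartCalendarInterval jobs while the machine is awake, so the Mac Mini should be set to stay awake or wake on schedule.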
Sources
- ml-explore/mlx-examples LoRA – Apple's official LoRA example with benchmarks
- ml-explore/mlx-lm v0.31.1 – Apple's LLM fine-tuning package
- ARahim3/mlx-tune v0.4.19 – Community fine-tuning wrapper
- PyTorch MPS Backend – Official documentation
- HuggingFace Apple Silicon Guide – Trainer + MPS integration
- sentence-transformers – Embedding fine-tuning documentation
- markaicode.com MLX-LM Guide – Training benchmarks
- randalscottking.com MLX Guide – Step-by-step training
- tobi/qmd GitHub – QMD source (commit cfd640e, MIT license)
- QMD finetune/ pipeline – Query expansion training
- google/embeddinggemma-300M – HuggingFace model card
- Qwen/Qwen3-Reranker-0.6B – HuggingFace model card
- tobil/qmd-query-expansion-1.7B – HuggingFace model card
- Apple Mac Mini Specs – Official hardware specs
- Apple Mac Studio Specs – Official hardware specs
- r/LocalLLaMA Community Benchmarks – ~10,000 benchmark runs
- OCAP® Principles – First Nations Information Governance Centre
- CARE Principles – Indigenous Data Governance
- B&H Photo Mac Mini Pricing – Verified April 2026
- CDW Mac Studio Pricing – Verified April 2026
Final document compiled April 15, 2026. Incorporates corrections from senior technical review – BLOCKING issues (GGUF conversion, CUDA dependency, architecture misidentification) addressed. All pricing verified via web search. Training time estimates flagged as extrapolated where applicable.