Mac Mini for Local AI Training - Hardware & Workflow Guide
Document: RESULT-03 - Mac Mini Training Scenarios for IAIP
Date: April 15, 2026
Context: Indigenous-AI Collaborative Platform (IAIP) - Guillaume Descoteaux-Isabelle
Status: Final (revised with reviewer corrections applied)
Executive Summary
Local AI fine-tuning on a Mac Mini M4 Pro is viable and practical for Guillaume's Indigenous-AI Collaborative Platform. The three highest-value training tasks - LoRA persona adapters, QMD query expansion, and embedding model domain adaptation - all fit within Apple Silicon's capabilities using models from 300M to 8B parameters. A complete weekend training cycle for five AI personas plus QMD model enhancement can finish in roughly 3 hours on a Mac Mini M4 Pro with 48GB unified memory.
Key findings:
- What works today: LoRA/QLoRA fine-tuning of 3B–8B models via MLX-LM (10–30 min per adapter). QMD query expansion retraining via the HuggingFace Jobs cloud path (~$1.50/run). Small embedding and NER model training on PyTorch MPS.
- What requires caution: Embedding model fine-tuning for QMD works at the training step, but converting a fine-tuned encoder-only model back to GGUF format for QMD deployment is unverified and experimental; this is the single riskiest step in the proposal (see §Known Limitations).
- What doesn't work on Mac: QMD's existing `finetune/` pipeline has a CUDA dependency (nvidia-ml-py) and targets A10G GPUs. It will not run on Mac without modification. The recommended workaround is the HuggingFace Jobs cloud path or rewriting the training scripts for MLX-LM.
- Recommended hardware: Mac Mini M4 Pro, 48GB RAM, 512GB SSD ($1,799) or 1TB SSD ($2,299). The 48GB config handles all realistic workloads.
- Data sovereignty: Local-only training is a significant advantage for Indigenous knowledge sovereignty. All training data and model weights remain on-premises. Cloud training paths (HuggingFace Jobs) should be avoided for culturally sensitive material.
What to train first: QMD query expansion model (existing pipeline, highest feasibility), then persona LoRA adapters (proven MLX-LM workflow), then embedding model (highest impact but GGUF conversion risk).
Training Frameworks on Apple Silicon (April 2026)
MLX / MLX-LM
MLX is Apple's array computation framework optimized for Apple Silicon. MLX-LM (v0.31.1, March 2026) is the separate LLM-specific package built on it. These are distinct packages; `pip install mlx-lm` is what users need for LLM fine-tuning.
Maturity: Production-ready. MLX-LM is actively maintained by Apple's ml-explore team with regular releases.
Supported training methods:
- LoRA and QLoRA fine-tuning (native, optimized for unified memory)
- Full fine-tuning (practical for models ≤7B on 48GB+ RAM)
- DPO, GRPO, SFT training objectives
- Direct HuggingFace model loading (pre-converted `mlx-community` weights available)
- Model export: fused weights, GGUF for Ollama/llama.cpp, HuggingFace Hub upload
Key advantage: Unified memory architecture eliminates CPU-to-GPU data transfer. All system RAM is GPU-addressable. Community reports indicate MLX is approximately 30–40% faster than PyTorch MPS for equivalent training tasks on the same hardware, though the exact speedup varies by model size and task (no single definitive benchmark).
Installation and usage:
pip install mlx-lm
# LoRA fine-tune (current CLI syntax)
python -m mlx_lm.lora \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--train \
--data ./training-data/ \
--iters 1000 \
--batch-size 4 \
--lora-layers 8
Benchmark reference: Llama 7B LoRA training on WikiSQL: validation loss 2.66 → 1.23 over 1,000 iterations; ~475 tokens/sec on M2 Ultra, ~250 tokens/sec on M1 Max 32GB (source: ml-explore/mlx-examples).
mlx-tune
mlx-tune (v0.4.19, latest verified) is a community wrapper by ARahim3 that provides an Unsloth-compatible API around MLX. It enables the same training scripts to work on Mac (via MLX) and cloud (via CUDA/Unsloth) by changing one import line.
Relevance to Guillaume: mlx-tune claims support for embedding model fine-tuning with contrastive learning (InfoNCE loss) for architectures including BERT, ModernBERT, Qwen3-Embedding, and Harrier. However, each capability should be verified against the v0.4.19 release notes; the capability matrix below reflects claimed features, not independently verified ones.
| Capability | Claimed Status |
|---|---|
| SFT / LoRA / QLoRA Training | ✅ Stable |
| DPO, ORPO, GRPO, KTO, SimPO | ✅ Stable |
| Vision Model Fine-Tuning (VLMs) | ✅ Stable |
| TTS / STT Fine-Tuning | ✅ Stable |
| Embedding Fine-Tuning (contrastive) | ✅ Stable (claimed) |
| Export to HuggingFace / GGUF | ✅ Stable |
⚠️ Caveat: mlx-tune's GGUF export for embedding models (encoder-only architectures) is unverified for QMD's use case. The export path documented is for causal LMs. See §Known Limitations.
Installation: pip install mlx-tune
Source: github.com/ARahim3/mlx-tune
PyTorch MPS Backend
Maturity: Stable in PyTorch 2.7+ (2025). MPS (Metal Performance Shaders) backend is included automatically in macOS PyTorch builds.
What it supports:
- Training and fine-tuning with GPU acceleration via Metal
- HuggingFace `Trainer` API auto-detects the MPS device
- sentence-transformers training works on MPS
Limitations:
- No distributed/multi-GPU training; single device only
- Partial operator coverage: some ops fall back to CPU (set `PYTORCH_ENABLE_MPS_FALLBACK=1`)
- Limited precision modes: float16/bfloat16 support is not on par with CUDA; mostly float32
- No fine-grained VRAM tracking (unlike CUDA)
- Approximately 30–40% slower than MLX for equivalent tasks due to data-copying overhead
Best for: sentence-transformers fine-tuning (PyTorch-native, no MLX port), HuggingFace Trainer-based workflows.
import torch
device = "mps" if torch.backends.mps.is_available() else "cpu"
Sources: PyTorch MPS docs, HuggingFace Apple Silicon guide
sentence-transformers
The sentence-transformers library is the standard tool for fine-tuning embedding models with contrastive learning, triplet loss, or cosine similarity objectives. It runs on PyTorch and supports the MPS backend.
Relevance: Primary option for fine-tuning embedding models if mlx-tune's embedding support proves insufficient. Uses MultipleNegativesRankingLoss for retrieval tasks, widely regarded as the strongest loss for teaching domain-specific similarity.
Caveat for embeddinggemma-300M: The google/embeddinggemma-300M model is not natively published as a sentence-transformers package. Loading it requires manual wrapping with a pooling layer, which adds complexity to the fine-tuning workflow. An alternative is to fine-tune Qwen3-Embedding-0.6B or all-MiniLM-L6-v2, which have native sentence-transformers support.
What Can Be Trained: Guillaume's Use Cases
1. QMD Model Enhancement
QMD uses exactly three GGUF models, all running locally via node-llama-cpp (verified in src/llm.ts, QMD commit cfd640e):
| Model | Params | Disk | VRAM | Role | Env Var |
|---|---|---|---|---|---|
| embeddinggemma-300M (Q8_0) | 300M | ~300MB | ~400MB | Vector embeddings | QMD_EMBED_MODEL |
| Qwen3-Reranker-0.6B (Q8_0) | 600M | ~640MB | ~700MB | Result reranking | QMD_RERANK_MODEL |
| qmd-query-expansion-1.7B (Q4_K_M) | 1.7B | ~1.1GB | ~1.5GB | Query expansion | QMD_GENERATE_MODEL |
All three can be swapped via environment variables or SDK constructor. This is the deployment mechanism for fine-tuned models.
embeddinggemma-300M (Embedding Model)
Architecture correction: embeddinggemma-300M is an encoder-only transformer (conceptually similar to BERT), not a Gemma 3 decoder model. This distinction is critical: encoder-only models use bidirectional attention and require different fine-tuning procedures than causal (decoder-only) LMs.
What fine-tuning achieves: Teach the embedding model that Indigenous knowledge concepts ("relational accountability," "medicine wheel," "seven grandfather teachings," "ceremony as methodology") are semantically close to each other and distinct from superficially similar Western academic terms. This directly improves QMD's vector search for Guillaume's domain.
Fine-tuning approach (sentence-transformers on PyTorch MPS):
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# NOTE: embeddinggemma-300M requires manual wrapping (not a native ST model)
# Alternative: use Qwen3-Embedding-0.6B or all-MiniLM-L6-v2 for simpler workflow
model = SentenceTransformer("google/embeddinggemma-300M", device="mps")
train_examples = [
InputExample(texts=["medicine wheel", "sacred circle representing four directions and life stages"]),
InputExample(texts=["relational accountability", "ethical research requires maintaining relationships"]),
InputExample(texts=["seven grandfather teachings", "wisdom love respect bravery honesty humility truth"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=50,
output_path="./indigenous-embeddinggemma-finetuned",
)
⚠️ BLOCKING CAVEAT (GGUF Conversion):
Converting the fine-tuned embedding model back to GGUF format is the hardest unsolved step in this workflow:
- EmbeddingGemma-300M is an encoder-only model, not a causal LM
- llama.cpp's `convert_hf_to_gguf.py` was primarily designed for causal LMs (LLaMA, Mistral, Gemma decoder models)
- GGUF has recently added embedding model support, but converting a fine-tuned sentence-transformers model (with potentially modified pooling layers) back to GGUF is experimental and unverified
- The original pre-trained embeddinggemma-300M GGUF exists on HuggingFace, but that conversion was done by `ggml-org` on the original weights, not on fine-tuned weights
Recommended mitigation:
- Before investing in Indigenous knowledge dataset creation, test the full pipeline end-to-end with a trivially fine-tuned model (e.g., train on 10 dummy pairs, convert to GGUF, load in QMD, verify it produces embeddings)
- If conversion fails, fall back to Qwen3-Embedding-0.6B, which uses a causal (decoder) architecture and may have better GGUF conversion support. QMD already supports it as an alternative embedding model
- Another fallback: keep the fine-tuned model in PyTorch format and modify QMD to load it via sentence-transformers instead of node-llama-cpp (requires QMD source modification)
Training data needed: 500–5,000 query-document pairs with similarity signals. See §Training Data Preparation.
Time estimate: 5–20 minutes for 1K–10K pairs on a Mac Mini M4 Pro (MPS backend).
Qwen3-Reranker-0.6B (Reranker)
The reranker is a standard causal LM (0.6B parameters, Qwen3 architecture) that produces yes/no relevance scores via logprob comparison:
score = exp(logprob_yes) / (exp(logprob_yes) + exp(logprob_no))
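In code, that score is just a two-way softmax over the yes/no log-probabilities. A minimal sketch (the helper name `rerank_score` is illustrative, not a QMD API):

```python
import math

def rerank_score(logprob_yes: float, logprob_no: float) -> float:
    """Two-way softmax over the yes/no token log-probabilities.

    Subtracting the max first is the standard numerical-stability trick;
    it does not change the result.
    """
    m = max(logprob_yes, logprob_no)
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)

# A document whose "yes" token is much more probable scores near 1.0
print(round(rerank_score(-0.2, -2.0), 3))  # prints 0.858
```

Because the score is a ratio of exponentiated logprobs, it always lands in (0, 1), which is why naive SFT that shifts the yes/no logit distribution can silently break the calibration.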
Fine-tuning potential: Medium impact. Teaching domain-specific relevance judgments would improve result ordering for Indigenous knowledge queries. However:
- No fine-tuning pipeline exists in QMD for this model; it would need to be built from scratch
- The yes/no logprob scoring mechanism requires careful calibration; naive SFT could break the scoring distribution
- Being a causal LM, standard LoRA fine-tuning via MLX-LM is technically feasible
- Should be attempted only after query expansion and embedding models show improvement
Training data format:
{"query": "four directions teachings", "document": "The Medicine Wheel maps...", "relevant": true}
{"query": "four directions teachings", "document": "GPS navigation uses four...", "relevant": false}
Priority: Third, after query expansion and embedding.
qmd-query-expansion-1.7B (Query Expansion)
This is the easiest and safest model to fine-tune: QMD provides a complete, production-grade fine-tuning pipeline in its finetune/ directory.
What it does: Expands user queries into structured search expansions:
Input: "medicine wheel"
Output:
hyde: The Medicine Wheel is a sacred circle representing the four directions...
lex: medicine wheel four directions
lex: sacred circle indigenous ceremony
vec: what are the teachings of the medicine wheel in indigenous tradition
Existing pipeline details (verified in QMD source):
- LoRA SFT on Qwen3-1.7B (rank 16, alpha 32, all projection layers)
- ~2,290 training examples in production model
- Training tools: `train.py`, `eval.py`, `convert_gguf.py`, `dataset/prepare_data.py`
- Data format: JSONL with `query` and `output` fields (processed into the Qwen3 chat template by `prepare_data.py`)
⚠️ CUDA Dependency: QMD's `finetune/pyproject.toml` depends on `nvidia-ml-py`, and the training configs target A10G (CUDA) GPUs. Running `uv run train.py` on Mac will fail without modification.
Three paths forward:
| Path | Effort | Cost | Data Location |
|---|---|---|---|
| HuggingFace Jobs (recommended) | Low (use existing pipeline) | ~$1.50/run (A10G, ~45 min) | ⚠️ Sent to HF cloud |
| MLX-LM local rewrite | Medium (rewrite train.py for MLX) | $0 | ✅ Local only |
| PyTorch MPS modification | Medium (remove nvidia deps, change device config) | $0 | ✅ Local only |
For data sovereignty reasons (see §Data Sovereignty), the local MLX-LM path is recommended for Indigenous knowledge data:
pip install mlx-lm
python -m mlx_lm.convert --hf-path Qwen/Qwen3-1.7B --mlx-path mlx_models/Qwen3-1.7B
python -m mlx_lm.lora \
--model mlx_models/Qwen3-1.7B \
--train --data ./indigenous-expansion-data/ \
--batch-size 2 --lora-layers 16 --iters 1000
# Fuse and export
python -m mlx_lm.fuse \
--model mlx_models/Qwen3-1.7B \
--adapter-file adapters/adapters.npz \
--export-gguf
How many examples needed: 200–500 high-quality domain-specific expansion examples for meaningful improvement. QMD's own model used ~2,290 examples.
Training time: 15–30 minutes on a Mac Mini M4 Pro (estimated from model size and community benchmarks).
Deploying Fine-Tuned Models Back to QMD
Deployment uses environment variable swapping (verified in src/llm.ts):
# Point QMD to your fine-tuned models
export QMD_EMBED_MODEL="$HOME/models/indigenous-embed-Q8_0.gguf"
export QMD_GENERATE_MODEL="$HOME/models/indigenous-expansion-q4_k_m.gguf"
export QMD_RERANK_MODEL="$HOME/models/indigenous-reranker-Q8_0.gguf"
# CRITICAL: After changing the embedding model, re-embed ALL documents
qmd embed
⚠️ Important considerations:
- Re-embedding is required after changing the embedding model. Old embeddings are incompatible with the new model's embedding space.
- Embedding dimensions must match. If switching from embeddinggemma-300M (768-dim) to a model with different dimensions, the sqlite-vec index becomes incompatible and requires full re-indexing.
- Re-embedding time depends on corpus size; estimate 1–5 minutes for hundreds of documents, longer for thousands (unverified; test with your corpus).
- The GGUF export step for the query expansion model is proven (QMD's own pipeline does this). The GGUF export for the embedding model is unverified (see above).
Recommended version management:
~/models/qmd-indigenous/
├── v1/
│   ├── indigenous-embed-Q8_0.gguf
│   ├── indigenous-expansion-q4_k_m.gguf
│   └── training-metadata.json
├── v2/
│   └── ...
└── current -> v2/  # symlink to active version
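The promote/rollback mechanics behind the `current` symlink can be sketched as follows; paths are illustrative, and `ln -sfn` is used so an existing link is replaced rather than followed:

```shell
#!/bin/sh
# Sketch of the symlink-based version switch (uses a temp dir for illustration).
set -eu
root="$(mktemp -d)/qmd-indigenous"
mkdir -p "$root/v1" "$root/v2"
touch "$root/v1/indigenous-embed-Q8_0.gguf" "$root/v2/indigenous-embed-Q8_0.gguf"

# Promote v2: -s symlink, -f replace existing, -n do not descend into the old link
ln -sfn v2 "$root/current"
readlink "$root/current"   # prints v2

# Roll back to v1 by repointing the same link
ln -sfn v1 "$root/current"
readlink "$root/current"   # prints v1
```

Repointing a symlink is near-instant, so QMD only ever sees a fully populated version directory; the env vars from the deployment section can reference `current/` and never need to change between versions.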
2. Persona LoRA Adapters
Feasibility: ✅ Proven, routine on Apple Silicon.
Each AI persona (Mia the architect, Miette the emotional resonator, Tushell the journal keeper, etc.) gets its own LoRA adapter trained on persona-specific conversation data, instructions, and domain knowledge.
Workflow:
- Curate persona-specific training data as JSONL (conversations, instructions, domain text)
- Run QLoRA fine-tuning on a base model (e.g., Llama 3.1 8B-Instruct, 4-bit quantized)
- Export adapter weights (~50–200MB per persona)
- Serve via Ollama with the `ADAPTER` directive in a Modelfile
Resource requirements per persona:
- Base model: ~4–8GB RAM (4-bit 8B model)
- Training overhead: ~2–4GB additional
- Total: ~8–12GB RAM during training
- Time: 10–30 minutes per persona for 500–1,000 training steps
- Multiple personas can be trained sequentially overnight
Training data format (JSONL):
{"text": "<|user|>\nHow should we approach this relationship with the land?\n<|assistant|>\nAs Mia, I see the architectural pattern here โ the land relationship is a structural foundation, not a resource to be extracted. Let me map the dependencies..."}
Deployment via Ollama Modelfile:
FROM ./llama-3.1-8b.Q4_K_M.gguf
ADAPTER ./persona-mia.lora
SYSTEM "You are Mia, the architectural thinker..."
ollama create mia-persona -f Modelfile
ollama run mia-persona
Recommended base models for personas:
| Model | Size | RAM Needed | Quality | Training Time (1K steps) |
|---|---|---|---|---|
| Llama 3.2-3B (4-bit) | 3B | ~4GB | Good | ~10 min |
| Qwen3-4B (4-bit) | 4B | ~5GB | Better | ~15 min |
| Llama 3.1-8B-Instruct (4-bit) | 8B | ~8GB | Best for persona work | ~20–30 min |
3. Domain-Specific Classification/NER
Feasibility: ✅ Trivial on any Mac.
Small models for recognizing Indigenous concepts, ceremony names, teaching references, and relational terms in text.
Tool: spaCy (v3.x/v4.x, fully supports Apple Silicon) or HuggingFace token-classification pipeline.
Custom entity types: CEREMONY, DIRECTION, TEACHING, PLACE, RELATION, LANGUAGE_TERM
Requirements:
- 200–500 annotated documents for meaningful NER
- Training time: 5–15 minutes on any Mac
- RAM: Under 8GB for BERT-base models (110M params)
Impact: Auto-tagging QMD documents with rich metadata to improve search context and filtering.
Training Data Preparation
From Indigenous Knowledge Base
Guillaume's existing QMD-indexed knowledge base provides the foundation for training data generation.
For query expansion training (JSONL format):
{"query": "medicine wheel", "output": [["hyde", "The Medicine Wheel represents the four directions..."], ["lex", "sacred circle four directions"], ["lex", "indigenous cosmology ceremony"], ["vec", "what are the teachings of the medicine wheel"]]}
Generate these by using an LLM to create plausible search queries for existing documents, then manually writing the ideal expansion output. Quality matters more than quantity: 200–500 carefully crafted examples outperform 5,000 sloppy ones.
For embedding fine-tuning (query-document pairs):
pairs = [
("medicine wheel", "The sacred circle represents the four directions and life stages"),
("relational accountability", "Research as ceremony requires maintaining relationships"),
("structural tension", "The creative force between current reality and desired vision"),
]
Sources of pairs:
- Document title → document content (natural positive pairs)
- LLM-generated synthetic queries for each document
- Manual curation of concept-to-explanation mappings
- Cross-references between related documents
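The first source (title → content) is purely mechanical to generate. A minimal sketch, assuming the corpus is available as a title-to-content mapping:

```python
def pairs_from_docs(docs: dict[str, str]) -> list[tuple[str, str]]:
    """Turn title -> content mappings into (query, document) positive pairs,
    skipping empty documents."""
    return [(title, body) for title, body in docs.items() if body.strip()]

docs = {
    "medicine wheel": "The sacred circle represents the four directions and life stages",
    "relational accountability": "Research as ceremony requires maintaining relationships",
}
pairs = pairs_from_docs(docs)
print(len(pairs))  # prints 2
```

The other three sources (synthetic queries, manual curation, cross-references) feed into the same `(query, document)` tuple shape, so they can be concatenated into one training list.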
For persona LoRA training (conversation JSONL): Extract or write persona-specific conversations showing the persona's voice, knowledge, and reasoning style. Each persona needs 200–1,000 conversation examples.
Volume Requirements
| Training Type | Minimum Viable | Good Quality | Excellent |
|---|---|---|---|
| Query expansion | 100 examples | 200–500 examples | 1,000+ examples |
| Embedding fine-tuning | 500 pairs | 2,000–5,000 pairs | 10,000+ pairs |
| Persona LoRA | 200 conversations | 500–1,000 conversations | 2,000+ conversations |
| NER training | 200 annotated docs | 500–1,000 docs | 5,000+ docs |
Data augmentation: Use a local LLM to generate paraphrases of existing documents. This can multiply training data 3–5×, but requires human review to ensure paraphrases maintain semantic accuracy; this is especially important for Indigenous knowledge where nuance matters.
Data Sovereignty & Cultural Protocols
This section is foundational, not optional. For an Indigenous-AI Collaborative Platform, data sovereignty is a core requirement, not an afterthought. Local training on a Mac Mini provides inherent advantages, but specific protocols must be followed.
Local-Only Training as Sovereignty Advantage
Training AI models locally on a Mac Mini means:
- No training data leaves the machine. Unlike cloud-based fine-tuning (including HuggingFace Jobs at ~$1.50/run), local training keeps all Indigenous knowledge on-premises.
- Model weights stay local. Fine-tuned model weights encode patterns from training data. Uploading them to HuggingFace Hub or other public repositories could expose encoded Indigenous knowledge. Do not push models trained on culturally sensitive data to public repositories.
- No third-party access. Cloud training providers may log data, cache models, or retain training artifacts. Local training avoids these risks entirely.
Recommendation: Use the local MLX-LM or PyTorch MPS training paths for all Indigenous knowledge training. Reserve the HuggingFace Jobs cloud path only for non-sensitive, general-purpose training data.
OCAP Principles Applied to AI Training
The First Nations principles of OCAP® (Ownership, Control, Access, and Possession) should govern training data and model management:
- Ownership: The community owns all training data derived from its knowledge. Fine-tuned model weights are derivative works and inherit the same ownership.
- Control: Community governance determines what knowledge can be used for training, who can initiate training runs, and who can access fine-tuned models.
- Access: Fine-tuned models should be accessible only to authorized users. Ollama's local-only deployment model supports this naturally.
- Possession: Physical custody of training data and model weights remains with the community. The Mac Mini sitting in Guillaume's workspace provides this.
Knowledge Classification for Training
Not all Indigenous knowledge can or should be used for model training. Recommended classification:
| Category | Training Use | Example |
|---|---|---|
| Public Teaching | ✅ Appropriate for training | Published educational materials, public presentations, general cultural context |
| Community Knowledge | ⚠️ Requires explicit consent | Community-shared stories, local governance processes, land-based practices |
| Sacred/Restricted | ❌ Must NOT be used | Ceremonial details, sacred songs, vision quest accounts, clan-specific knowledge |
| Personal/Private | ⚠️ Requires individual consent | Personal journals, private correspondence, individual research notes |
Before any training run using community knowledge:
- Obtain free, prior, and informed consent from knowledge holders
- Document which knowledge was used in training metadata
- Ensure right to withdraw: the ability to retrain the model without specific contributions
- Include attribution mechanisms in model metadata files
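The `training-metadata.json` file from the version layout earlier is a natural place to record this consent trail. An illustrative sketch (all field names are assumptions, not a QMD schema):

```json
{
  "version": "v1",
  "trained": "2026-04-18",
  "base_model": "google/embeddinggemma-300M",
  "data_sources": [
    {
      "source": "public-teachings/",
      "classification": "Public Teaching",
      "consent_record": "n/a (published material)"
    },
    {
      "source": "community-stories/",
      "classification": "Community Knowledge",
      "consent_record": "consent-2026-03-12.pdf",
      "withdrawable": true
    }
  ],
  "excluded_classifications": ["Sacred/Restricted"],
  "training_params": { "epochs": 3, "loss": "MultipleNegativesRankingLoss" }
}
```

Keeping this file next to the model weights means the consent documentation travels with every version and survives rollbacks.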
Model Access Controls
- Store fine-tuned model weights in encrypted volumes on the Mac Mini
- Maintain a training log documenting data sources, consent records, and training parameters
- Treat fine-tuned model weights with the same access controls as the source knowledge
- If models must be shared, only share those trained exclusively on Public Teaching content
Mac Hardware Scenarios for Training
Apple Silicon Chip Comparison (Training-Relevant)
| Chip | GPU Cores | Max RAM | Memory Bandwidth | Available In |
|---|---|---|---|---|
| M4 (base) | 10 | 32GB | 120 GB/s | Mac Mini ($599+) |
| M4 Pro | 20 | 64GB | 273 GB/s | Mac Mini ($1,399+) |
| M4 Max | 40 | 128GB | 546 GB/s | Mac Studio ($1,999+) |
| M4 Ultra | 80 | 192GB | ~800+ GB/s | Mac Studio ($3,999+) |
Memory bandwidth is the primary bottleneck for training throughput, not GPU core count. Higher bandwidth = faster token processing during training. Training time scales roughly inversely with bandwidth.
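As a rough worked example of that inverse-bandwidth scaling (a first-order approximation only; real speedups also depend on compute and memory-access patterns):

```python
def scale_time(minutes: float, bw_from: float, bw_to: float) -> float:
    """Scale a measured training time by the inverse ratio of memory bandwidths."""
    return minutes * (bw_from / bw_to)

# A run that takes 60 min at 120 GB/s (M4 base) maps to ~26 min at 273 GB/s (M4 Pro)
print(round(scale_time(60, 120, 273)))  # prints 26
```

The 273/120 ratio (~2.3×) is the same factor quoted for the base M4 in Scenario 1 below.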
Scenario 1: Minimal Mac Mini
Config: Mac Mini M4 (base), 24GB RAM, 512GB SSD
Price: ~$799–$999
What it can handle:
- ✅ QLoRA fine-tuning of 3B–7B models (4-bit quantized): ~8GB model, leaves room for OS
- ✅ Embedding model fine-tuning (300M–600M models): trivial
- ✅ spaCy NER training: trivial
- ✅ sentence-transformers fine-tuning
What it cannot handle:
- ❌ LoRA on 13B+ models (insufficient RAM for optimizer states)
- ❌ Full fine-tuning of anything larger than 3B
- ❌ Training while other heavy workloads run
Limitations:
- 10 GPU cores and 120 GB/s bandwidth make training ~2.3× slower than the M4 Pro
- 24GB is tight: model + optimizer + activations must all fit
- Batch size limited to 1–2 for 7B models
- 512GB SSD may be tight if storing multiple model checkpoints (estimate 20–50GB per training cycle)
Training time (7B QLoRA, 1K steps): ~45–90 minutes
Verdict: Adequate for prototyping and small models. Not recommended for regular persona training workloads.
Scenario 2: Maximal Mac Mini (⭐ RECOMMENDED)
Config: Mac Mini M4 Pro, 48GB RAM
- With 512GB SSD: ~$1,799
- With 1TB SSD: ~$2,299 (recommended for storing model checkpoints)
Prices verified via Apple Store, B&H Photo, and Micro Center as of April 2026. Deal pricing as low as ~$1,539 has been observed on discount channels.
What it can handle:
- ✅ QLoRA fine-tuning of 7B–8B models comfortably: ~8GB model + 4GB overhead, 36GB headroom
- ✅ LoRA fine-tuning of 7B–8B models (full precision): ~14GB model + overhead
- ✅ QLoRA on 13B models: ~16GB model + overhead, tight but workable
- ✅ All three QMD models fine-tuned individually
- ✅ Multiple sequential persona training runs overnight
- ✅ Training while light workloads continue
What it cannot handle:
- ❌ Full fine-tuning of 13B+ models
- ❌ QLoRA on 30B+ models
Key specs: 20 GPU cores, 273 GB/s memory bandwidth.
Verdict: The sweet spot for Guillaume's use case. All realistic training workloads complete in reasonable time. Weekend self-training is fully viable.
Upgrade option: The 64GB RAM variant (~$2,699) provides headroom for 13B models and parallel workloads.
Scenario 3: Mac Studio Alternative
When the Mac Mini isn't enough: for future scaling to larger models or parallel training.
Config: Mac Studio M4 Max, 128GB RAM
Price: ~$3,199 (verified April 2026; note: supply shortages reported at high-end configs)
Memory bandwidth: 546 GB/s (2× Mac Mini M4 Pro)
GPU cores: 40
Unlocks:
- QLoRA on 30B models comfortably
- Full LoRA on 13B models without compromise
- Multiple concurrent training jobs
- ~2× faster training than M4 Pro
When Guillaume needs this: If he moves to fine-tuning 30B+ models, needs parallel persona training, or wants to do continual pretraining.
Comparison Table
| Capability | Mac Mini M4 24GB (~$999) | Mac Mini M4 Pro 48GB (~$1,799–$2,299) | Mac Studio M4 Max 128GB (~$3,199) |
|---|---|---|---|
| Embedding fine-tuning | ✅ | ✅ | ✅ |
| 7B QLoRA persona adapters | ⚠️ Tight, slow | ✅ Comfortable | ✅ Overkill |
| 13B QLoRA | ❌ | ⚠️ Tight | ✅ Comfortable |
| 30B QLoRA | ❌ | ❌ | ✅ |
| Weekend batch training (5 personas) | ⚠️ ~4–7 hours | ✅ ~2–3 hours | ✅ ~1–1.5 hours |
| QMD model retraining | ✅ | ✅ | ✅ |
| Thermal concerns for overnight runs | ⚠️ Possible throttling | ⚠️ Monitor | ✅ Better cooling |
Training Time Estimates
Methodology note: M4 Pro times are extrapolated from M2/M3 community benchmarks scaled by the memory bandwidth ratio (273/150 ≈ 1.8×). This is a reasonable first approximation but not benchmarked on M4 Pro hardware directly. Actual times may vary ±30%.
LLM LoRA/QLoRA Fine-Tuning
| Model | Method | Hardware | Steps | Batch | Estimated Time | Source |
|---|---|---|---|---|---|---|
| Mistral 7B (4-bit) | QLoRA | M2 Pro 16GB | 1,000 | 4 | ~30 min | markaicode.com (measured) |
| Mistral 7B (4-bit) | QLoRA | M1 Max 64GB | 1,000 | 8 | ~15 min | randalscottking.com (measured) |
| Llama 7B (FP16) | LoRA | M2 Ultra | 1,000 | 4 | ~35 min | ml-explore/mlx-examples (measured, 475 tok/s) |
| Llama 8B (4-bit) | QLoRA | M4 Pro 48GB | 1,000 | 4–8 | ~10–25 min | Extrapolated from M2 benchmarks |
| 13B (4-bit) | QLoRA | M4 Pro 48GB | 1,000 | 1–2 | ~45–90 min | Community estimates |
| Qwen3-1.7B | LoRA SFT | M4 Pro 48GB | 1,000 | 4 | ~10–15 min | Extrapolated (small model) |
Embedding Model Fine-Tuning
| Model | Data Size | Epochs | Hardware | Estimated Time |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (22M) | 1,000 pairs | 3 | Any Mac (MPS) | 1–3 min |
| embeddinggemma-300M | 1,000 pairs | 3 | M4 Pro (MPS) | 3–8 min |
| embeddinggemma-300M | 10,000 pairs | 3 | M4 Pro (MPS) | 10–25 min |
| Qwen3-Embedding-0.6B | 5,000 pairs | 3 | M4 Pro (MPS/MLX) | 10–20 min |
Complete Weekend Training Cycle
| Step | Task | Estimated Duration |
|---|---|---|
| 1 | Data preparation scripts | 15–30 min |
| 2 | Query expansion model LoRA (1.7B, 1K steps) | 10–15 min |
| 3 | Persona 1 QLoRA (8B, 1K steps) | 20–25 min |
| 4 | Persona 2 QLoRA | 20–25 min |
| 5 | Persona 3 QLoRA | 20–25 min |
| 6 | Persona 4 QLoRA | 20–25 min |
| 7 | Persona 5 QLoRA | 20–25 min |
| 8 | Embedding model fine-tuning (300M) | 10–20 min |
| 9 | NER model training | 5–10 min |
| 10 | Export adapters, GGUF conversion, rebuild Ollama models | 15–20 min |
| 11 | Validation tests | 10–15 min |
| Total | | ~2.5–4 hours |
The entire pipeline completes in a single evening; the machine is available for other work by morning.
Automated Weekend Training Pipeline
Practical launchd/cron-based Workflow
On macOS, launchd is the recommended scheduling mechanism (preferred over cron), though cron also works.
#!/bin/bash
# weekend-training.sh - scheduled via launchd for Friday night
# Includes error handling, logging, and validation gates
set -euo pipefail
LOG="$HOME/models/training-logs/$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$(dirname "$LOG")"
exec > >(tee -a "$LOG") 2>&1
echo "=== Training run started: $(date) ==="
# 0. Snapshot current working models (rollback point)
cp -r "$HOME/models/qmd-indigenous/current" \
"$HOME/models/qmd-indigenous/backup-$(date +%Y%m%d)" 2>/dev/null || true
# 1. Generate training data from latest QMD content
echo "[Step 1] Generating training data..."
python scripts/generate_training_data.py || {
echo "FAILED: Data generation. Aborting."
# Send notification (e.g., via terminal-notifier on macOS)
osascript -e 'display notification "Training data generation failed" with title "IAIP Training"'
exit 1
}
# 2. Fine-tune query expansion model
echo "[Step 2] Training query expansion model..."
python -m mlx_lm.lora \
--model mlx_models/Qwen3-1.7B \
--train --data ./data/expansion/ \
--batch-size 2 --lora-layers 16 --iters 1000 \
--adapter-file adapters/expansion.npz
# 3. Fine-tune each persona adapter
for persona in mia miette tushell council; do
echo "[Step 3] Training persona: $persona"
python -m mlx_lm.lora \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--train --data "data/${persona}/" \
--iters 1000 --batch-size 4 --lora-layers 8 \
--adapter-file "adapters/${persona}.npz" || {
echo "WARNING: Persona $persona training failed, continuing..."
continue
}
done
# 4. Export and rebuild
echo "[Step 4] Exporting models..."
python -m mlx_lm.fuse \
--model mlx_models/Qwen3-1.7B \
--adapter-file adapters/expansion.npz \
--export-gguf
for persona in mia miette tushell council; do
[ -f "adapters/${persona}.npz" ] || continue
python -m mlx_lm.fuse \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--adapter-file "adapters/${persona}.npz" \
--export-gguf
ollama create "${persona}-persona" -f "Modelfiles/${persona}"
done
# 5. Validation gate โ test search quality before deploying
echo "[Step 5] Running validation..."
python scripts/validate_models.py || {
echo "FAILED: Validation. Rolling back to previous models."
ln -sfn "$HOME/models/qmd-indigenous/backup-$(date +%Y%m%d)" \
"$HOME/models/qmd-indigenous/current"
osascript -e 'display notification "Validation failed - rolled back" with title "IAIP Training"'
exit 1
}
# 6. Deploy new models (step 4 exports should be copied into the dated version dir)
echo "[Step 6] Deploying..."
NEW_VER="$HOME/models/qmd-indigenous/v-$(date +%Y%m%d)"
mkdir -p "$NEW_VER"
mv "$HOME/models/qmd-indigenous/current" \
   "$HOME/models/qmd-indigenous/previous" 2>/dev/null || true
ln -sfn "$NEW_VER" "$HOME/models/qmd-indigenous/current"
echo "=== Training run completed: $(date) ==="
osascript -e 'display notification "Training complete - new models deployed" with title "IAIP Training"'
Monitoring and Validation
A validation script should test search quality before deploying new models:
# scripts/validate_models.py
# NOTE: qmd_search is a hypothetical thin wrapper around QMD's search CLI/SDK
test_queries = {
"medicine wheel teachings": ["medicine-wheel.md", "four-directions.md"],
"relational accountability": ["research-ceremony.md", "ethics.md"],
"seven grandfather teachings": ["seven-teachings.md", "anishinaabe.md"],
}
for query, expected_docs in test_queries.items():
results = qmd_search(query, top_k=5)
found = [r for r in results if r.path in expected_docs]
if len(found) < 1:
raise AssertionError(f"Query '{query}' failed to find expected documents")
Evaluation Metrics
To measure whether fine-tuning actually improved search:
- MRR (Mean Reciprocal Rank): Average of 1/rank for first relevant result across test queries
- Precision@5: Fraction of top-5 results that are relevant
- A/B comparison: Run the same test queries against base and fine-tuned models, compare rankings
Maintain a test query set of 20–50 domain-specific queries with known expected results.
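Both metrics are short enough to compute without a library. A minimal sketch, assuming each query's results are an ordered list of document paths and the expected documents form a set:

```python
def mrr(ranked_paths, relevant):
    """Reciprocal rank of the first relevant result (0.0 if none appears)."""
    for rank, path in enumerate(ranked_paths, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_paths, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for path in ranked_paths[:k] if path in relevant) / k

# Toy example: first relevant hit at rank 2, two relevant docs in the top 5
ranked = ["intro.md", "medicine-wheel.md", "notes.md", "four-directions.md", "misc.md"]
relevant = {"medicine-wheel.md", "four-directions.md"}
print(mrr(ranked, relevant))             # 0.5
print(precision_at_k(ranked, relevant))  # 0.4
```

For the A/B comparison, run both functions over the same test set with the base and fine-tuned models and compare the averages.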
Known Limitations & Honest Caveats
GGUF Conversion for Embedding Models – ⚠️ UNVERIFIED
The entire embedding fine-tuning → GGUF conversion → QMD deployment pipeline is theoretical. The individual steps are plausible, but the end-to-end path has not been verified by anyone in the community as of April 2026. Specific risks:
- convert_hf_to_gguf.py may not correctly handle fine-tuned sentence-transformers models with modified pooling layers
- Quantization (Q8_0) of a fine-tuned encoder-only model may degrade embedding quality unpredictably
- node-llama-cpp may not correctly load a converted encoder model that differs structurally from the original
Mitigation: Test with dummy data first. Have a fallback plan (Qwen3-Embedding-0.6B, which uses decoder architecture).
CUDA Dependencies in QMD's Pipeline
QMD's finetune/ directory is a CUDA-targeted pipeline. The pyproject.toml includes nvidia-ml-py, and training configs reference A10G GPUs. This pipeline cannot run on Mac without:
- Removing nvidia-ml-py from dependencies
- Changing the device configuration from cuda to mps
- Reducing batch sizes to fit in available memory
- Potentially modifying mixed-precision settings (MPS has limited bfloat16/float16 support)
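The device change is the heart of the port. As an illustrative sketch only (not a patch to QMD's actual training scripts), the usual PyTorch pattern is to probe for MPS and fall back to CPU, keeping float32 because of MPS's limited half-precision support:

```python
# Hedged sketch of the cuda -> mps device change. QMD's real training
# scripts would need the equivalent edit wherever they select a device.
try:
    import torch
    if torch.backends.mps.is_available():
        device = "mps"
        dtype = torch.float32  # safest on MPS; bf16/fp16 support is limited
    else:
        device = "cpu"
        dtype = torch.float32
except ImportError:
    # torch not installed in this environment; nothing to configure
    device, dtype = "cpu", None

print(device)
```

Models and tensors would then be moved with `.to(device)` in place of the pipeline's existing `.cuda()` calls.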
The HuggingFace Jobs cloud path (~$1.50/run) works without modification but sends training data off-device – a data sovereignty concern for Indigenous knowledge.
What's Proven vs Experimental
| Component | Status | Evidence |
|---|---|---|
| MLX-LM LoRA/QLoRA training on Apple Silicon | ✅ Proven | Apple's official examples, multiple community guides, r/LocalLLaMA reports |
| Ollama GGUF/LoRA import workflow | ✅ Proven | Documented in Ollama repository |
| QMD model swapping via env vars | ✅ Verified | Confirmed in source code |
| QMD query expansion fine-tuning pipeline | ✅ Proven | Production pipeline exists, documented |
| sentence-transformers fine-tuning on MPS | ✅ Proven | Standard PyTorch workflow |
| M4 Pro training speed estimates | ⚠️ Extrapolated | Scaled from M2/M3 benchmarks by bandwidth ratio |
| mlx-tune embedding fine-tuning | ⚠️ Claimed | Listed as stable in v0.4.19, limited community verification |
| embeddinggemma-300M sentence-transformers loading | ⚠️ Needs wrapping | Not a native ST model; requires a manual pooling layer |
| Fine-tuned embedding → GGUF conversion | ❌ Unverified | No community reports of a successful end-to-end path |
| Reranker fine-tuning with calibration preservation | ❌ Unverified | No pipeline or documentation exists |
Thermal and Disk Space
- Thermal: Mac Mini's compact form factor may cause thermal throttling during sustained 2–3 hour training runs. Monitor CPU/GPU temperatures. Ensure good ventilation.
- Disk space: Each training run generates checkpoints, optimizer states, and cached datasets. Budget 20–50GB of scratch space for five persona training runs plus model exports. The 1TB SSD configuration is recommended.
Multilingual Considerations
Indigenous communities often have knowledge in Indigenous languages, French (Guillaume's Québécois context), and English. The current QMD embedding model (embeddinggemma-300M) was trained on 100+ languages but may not adequately represent Indigenous languages with small digital footprints. Fine-tuning with bilingual/trilingual training pairs is important.
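As an illustration (not taken from the QMD pipeline), bilingual training pairs for contrastive embedding fine-tuning are typically (query, passage) tuples; pairing English and French queries with the same passage encourages cross-lingual alignment in the fine-tuned model:

```python
# Illustrative (query, passage) pairs for contrastive embedding fine-tuning.
# All text here is invented example data; real pairs would come from the
# Indigenous knowledge corpus. This format matches what contrastive losses
# such as MultipleNegativesRankingLoss expect: each query paired with a
# passage it should retrieve.
train_pairs = [
    ("medicine wheel teachings",
     "The medicine wheel maps the four directions onto stages of life."),
    # Same passage, French query -> teaches cross-lingual alignment
    ("enseignements de la roue de médecine",
     "The medicine wheel maps the four directions onto stages of life."),
]

print(len(train_pairs))
```

The same structure extends to trilingual data by adding Indigenous-language queries pointing at the shared passages.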
Community Resources
Framework Documentation
- ml-explore/mlx-examples LoRA – Apple's official LoRA fine-tuning example
- ml-explore/mlx-lm – Apple's LLM package (v0.31.1)
- ARahim3/mlx-tune – Unsloth-compatible fine-tuning for Mac (v0.4.19)
- PyTorch MPS docs – Official MPS backend
- sentence-transformers training guide – Embedding model fine-tuning
Guides and Benchmarks
- markaicode.com – MLX-LM Fine-Tuning Guide – Practical training walkthrough with times
- randalscottking.com – MLX Framework Guide – Step-by-step guide
- blog.amsayed.dev – Fine-Tuning on Apple Silicon – Practical walkthrough
- r/LocalLLaMA community benchmarks – ~10,000 benchmark runs
QMD-Specific
- tobi/qmd GitHub – Source repository (MIT license)
- QMD finetune/ directory – Complete query expansion training pipeline
- DeepWiki – QMD Vector Embeddings – Architecture overview
HuggingFace Model Cards
- google/embeddinggemma-300M – Embedding model
- Qwen/Qwen3-Reranker-0.6B – Reranker model
- tobil/qmd-query-expansion-1.7B – Query expansion model
Hardware
- Apple Mac Mini Specs – Official specifications
- Apple Mac Studio Specs – Official specifications
Data Sovereignty
- OCAP® Principles – First Nations Information Governance Centre
- CARE Principles for Indigenous Data Governance – Collective Benefit, Authority to Control, Responsibility, Ethics
Recommendation
What to Buy
Mac Mini M4 Pro, 48GB RAM, 1TB SSD (~$2,299) – the recommended configuration. The 1TB SSD provides space for model checkpoints, training artifacts, and multiple model versions. The 48GB RAM handles all training tasks described in this document comfortably.
If budget is constrained, the 512GB SSD variant (~$1,799) works but monitor disk space. External storage can supplement if needed.
Do not buy the base M4 Mac Mini for regular training workloads – the 120 GB/s bandwidth makes training 2.3× slower, and 24GB RAM limits model options.
What to Train First
Phase 1 (Weeks 1–2): QMD Query Expansion Model
- Lowest risk – existing pipeline, proven GGUF conversion
- Write 200–300 Indigenous knowledge expansion examples
- Train locally via MLX-LM (data stays local)
- Deploy via the QMD_GENERATE_MODEL env var
- Measure improvement with test queries
Phase 2 (Weeks 3–4): Persona LoRA Adapters
- Proven MLX-LM workflow, routine on Apple Silicon
- Start with one persona (e.g., Mia), validate quality
- Scale to all personas once workflow is solid
- Deploy via Ollama Modelfiles
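The Modelfiles referenced in the training script (e.g. Modelfiles/mia) pair each fused GGUF with a persona system prompt. A minimal sketch, with an illustrative file path and a placeholder prompt:

```
# Modelfiles/mia -- illustrative sketch, not the actual IAIP persona file
FROM ./mia-fused.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """
You are Mia. (Persona system prompt goes here.)
"""
```

Building and running it follows the standard Ollama flow: ollama create mia-persona -f Modelfiles/mia, then ollama run mia-persona.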
Phase 3 (Month 2): Embedding Model Fine-Tuning
- Highest impact but highest risk
- First: Test GGUF conversion pipeline with dummy data
- If conversion works: fine-tune with Indigenous knowledge pairs
- If conversion fails: fall back to Qwen3-Embedding-0.6B or explore QMD source modification
- Deploy via the QMD_EMBED_MODEL env var + full re-embedding
Phase 4 (Month 3+): Reranker, NER, Classification
- Build custom pipeline for reranker domain adaptation
- Train NER model for Indigenous terminology extraction
- Integrate into automated weekend pipeline
Automation
Once Phases 1–2 are validated manually, implement the automated weekend training script (see §Automated Weekend Training Pipeline). Schedule via launchd for Friday nights. Include validation gates and rollback capability.
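A launchd agent for a Friday 11 p.m. run might look like the following sketch; the label, script path, and log locations are illustrative, not taken from the IAIP setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.iaip.weekend-training</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/guillaume/scripts/weekend_training.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Weekday</key>
    <integer>5</integer> <!-- Friday -->
    <key>Hour</key>
    <integer>23</integer>
    <key>Minute</key>
    <integer>0</integer>
  </dict>
  <key>StandardOutPath</key>
  <string>/Users/guillaume/logs/training.log</string>
  <key>StandardErrorPath</key>
  <string>/Users/guillaume/logs/training.err</string>
</dict>
</plist>
```

Place the file in ~/Library/LaunchAgents/ and load it with launchctl. Note that launchd only fires StartCalendarInterval jobs while the machine is awake, so the Mac Mini should be set to stay awake or wake on schedule.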
Sources
- ml-explore/mlx-examples LoRA – Apple's official LoRA example with benchmarks
- ml-explore/mlx-lm v0.31.1 – Apple's LLM fine-tuning package
- ARahim3/mlx-tune v0.4.19 – Community fine-tuning wrapper
- PyTorch MPS Backend – Official documentation
- HuggingFace Apple Silicon Guide – Trainer + MPS integration
- sentence-transformers – Embedding fine-tuning documentation
- markaicode.com MLX-LM Guide – Training benchmarks
- randalscottking.com MLX Guide – Step-by-step training
- tobi/qmd GitHub – QMD source (commit cfd640e, MIT license)
- QMD finetune/ pipeline – Query expansion training
- google/embeddinggemma-300M – HuggingFace model card
- Qwen/Qwen3-Reranker-0.6B – HuggingFace model card
- tobil/qmd-query-expansion-1.7B – HuggingFace model card
- Apple Mac Mini Specs – Official hardware specs
- Apple Mac Studio Specs – Official hardware specs
- r/LocalLLaMA Community Benchmarks – ~10,000 benchmark runs
- OCAP® Principles – First Nations Information Governance Centre
- CARE Principles – Indigenous Data Governance
- B&H Photo Mac Mini Pricing – Verified April 2026
- CDW Mac Studio Pricing – Verified April 2026
Final document compiled April 15, 2026. Incorporates corrections from senior technical review – BLOCKING issues (GGUF conversion, CUDA dependency, architecture misidentification) addressed. All pricing verified via web search. Training time estimates flagged as extrapolated where applicable.