โ† Back to Articles & Artefacts

Mac Mini for Local AI Training: Hardware & Workflow Guide

Document: RESULT-03 – Mac Mini Training Scenarios for IAIP
Date: April 15, 2026
Context: Indigenous-AI Collaborative Platform (IAIP), Guillaume Descoteaux-Isabelle
Status: Final (revised with reviewer corrections applied)


Executive Summary

Local AI fine-tuning on a Mac Mini M4 Pro is viable and practical for Guillaume's Indigenous-AI Collaborative Platform. The three highest-value training tasks (LoRA persona adapters, QMD query expansion, and embedding model domain adaptation) all fit within Apple Silicon's capabilities using models of 8B parameters or smaller. A complete weekend training cycle for five AI personas plus QMD model enhancement can finish in under 3 hours on a Mac Mini M4 Pro with 48GB unified memory.

Key findings:

  • What works today: LoRA/QLoRA fine-tuning of 3B–8B models via MLX-LM (10–30 min per adapter). QMD query expansion retraining via HuggingFace Jobs cloud path (~$1.50/run). Small embedding and NER model training on PyTorch MPS.
  • What requires caution: Embedding model fine-tuning for QMD works at the training step, but converting a fine-tuned encoder-only model back to GGUF format for QMD deployment is unverified and experimental; this is the single riskiest step in the proposal (see §Known Limitations).
  • What doesn't work on Mac: QMD's existing finetune/ pipeline has a CUDA dependency (nvidia-ml-py) and targets A10G GPUs. It will not run on Mac without modification. The recommended workaround is the HuggingFace Jobs cloud path or rewriting training scripts for MLX-LM.
  • Recommended hardware: Mac Mini M4 Pro, 48GB RAM, 512GB SSD ($1,799) or 1TB SSD ($2,299). The 48GB config handles all realistic workloads.
  • Data sovereignty: Local-only training is a significant advantage for Indigenous knowledge sovereignty. All training data and model weights remain on-premises. Cloud training paths (HuggingFace Jobs) should be avoided for culturally sensitive material.

What to train first: QMD query expansion model (existing pipeline, highest feasibility), then persona LoRA adapters (proven MLX-LM workflow), then embedding model (highest impact but GGUF conversion risk).


Training Frameworks on Apple Silicon (April 2026)

MLX / MLX-LM

MLX is Apple's array computation framework optimized for Apple Silicon. MLX-LM (v0.31.1, March 2026) is the separate LLM-specific package built on it. These are distinct packages; pip install mlx-lm is what users need for LLM fine-tuning.

Maturity: Production-ready. MLX-LM is actively maintained by Apple's ml-explore team with regular releases.

Supported training methods:

  • LoRA and QLoRA fine-tuning (native, optimized for unified memory)
  • Full fine-tuning (practical for models ≤7B on 48GB+ RAM)
  • DPO, GRPO, SFT training objectives
  • Direct HuggingFace model loading (pre-converted mlx-community weights available)
  • Model export: fused weights, GGUF for Ollama/llama.cpp, HuggingFace Hub upload

Key advantage: Unified memory architecture eliminates CPU↔GPU data transfer. All system RAM is GPU-addressable. Community reports indicate MLX is approximately 30–40% faster than PyTorch MPS for equivalent training tasks on the same hardware, though exact speedup varies by model size and task (no single definitive benchmark).

Installation and usage:

pip install mlx-lm

# LoRA fine-tune (current CLI syntax)
python -m mlx_lm.lora \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --train \
  --data ./training-data/ \
  --iters 1000 \
  --batch-size 4 \
  --lora-layers 8

Benchmark reference: Llama 7B LoRA training on WikiSQL: validation loss 2.66 → 1.23 over 1,000 iterations. ~475 tokens/sec on M2 Ultra, ~250 tokens/sec on M1 Max 32GB (source: ml-explore/mlx-examples).

mlx-tune

mlx-tune (v0.4.19, latest verified) is a community wrapper by ARahim3 that provides an Unsloth-compatible API around MLX. It enables the same training scripts to work on Mac (via MLX) and cloud (via CUDA/Unsloth) by changing one import line.

Relevance to Guillaume: mlx-tune claims support for embedding model fine-tuning with contrastive learning (InfoNCE loss) for architectures including BERT, ModernBERT, Qwen3-Embedding, and Harrier. However, each capability should be verified against the v0.4.19 release notes; the capability matrix below reflects claimed features, not independently verified ones.

Capability | Claimed Status
SFT / LoRA / QLoRA Training | ✅ Stable
DPO, ORPO, GRPO, KTO, SimPO | ✅ Stable
Vision Model Fine-Tuning (VLMs) | ✅ Stable
TTS / STT Fine-Tuning | ✅ Stable
Embedding Fine-Tuning (contrastive) | ✅ Stable (claimed)
Export to HuggingFace / GGUF | ✅ Stable

⚠️ Caveat: mlx-tune's GGUF export for embedding models (encoder-only architectures) is unverified for QMD's use case. The export path documented is for causal LMs. See §Known Limitations.

Installation: pip install mlx-tune

Source: github.com/ARahim3/mlx-tune

PyTorch MPS Backend

Maturity: Stable in PyTorch 2.7+ (2025). MPS (Metal Performance Shaders) backend is included automatically in macOS PyTorch builds.

What it supports:

  • Training and fine-tuning with GPU acceleration via Metal
  • HuggingFace Trainer API auto-detects MPS device
  • sentence-transformers training works on MPS

Limitations:

  • No distributed/multi-GPU training; single device only
  • Partial operator coverage; some ops fall back to CPU (set PYTORCH_ENABLE_MPS_FALLBACK=1)
  • Limited precision modes; float16/bfloat16 not on par with CUDA, mostly float32
  • No fine-grained VRAM tracking (unlike CUDA)
  • Approximately 30–40% slower than MLX for equivalent tasks due to data copying overhead

Best for: sentence-transformers fine-tuning (PyTorch-native, no MLX port), HuggingFace Trainer-based workflows.

import torch
device = "mps" if torch.backends.mps.is_available() else "cpu"

Sources: PyTorch MPS docs, HuggingFace Apple Silicon guide

sentence-transformers

The sentence-transformers library is the standard tool for fine-tuning embedding models with contrastive learning, triplet loss, or cosine similarity objectives. It runs on PyTorch and supports the MPS backend.

Relevance: Primary option for fine-tuning embedding models if mlx-tune's embedding support proves insufficient. Uses MultipleNegativesRankingLoss for retrieval tasks, the standard choice for teaching domain-specific similarity from positive pairs.

Caveat for embeddinggemma-300M: The google/embeddinggemma-300M model is not natively published as a sentence-transformers package. Loading it requires manual wrapping with a pooling layer, which adds complexity to the fine-tuning workflow. An alternative is to fine-tune Qwen3-Embedding-0.6B or all-MiniLM-L6-v2 which have native sentence-transformers support.
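
If manual wrapping is needed, a minimal sketch looks like the following. models.Transformer and models.Pooling are real sentence-transformers building blocks; whether mean pooling and a 512-token limit match embeddinggemma-300M's intended configuration is an assumption to verify against the model card.

from sentence_transformers import SentenceTransformer, models

# Load the raw encoder and attach an explicit pooling layer
word_embedding = models.Transformer("google/embeddinggemma-300M", max_seq_length=512)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),
    pooling_mode="mean",  # assumption: confirm the pooling mode on the model card
)
model = SentenceTransformer(modules=[word_embedding, pooling], device="mps")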


What Can Be Trained: Guillaume's Use Cases

1. QMD Model Enhancement

QMD uses exactly three GGUF models, all running locally via node-llama-cpp (verified in src/llm.ts, QMD commit cfd640e):

Model | Params | Disk | VRAM | Role | Env Var
embeddinggemma-300M (Q8_0) | 300M | ~300MB | ~400MB | Vector embeddings | QMD_EMBED_MODEL
Qwen3-Reranker-0.6B (Q8_0) | 600M | ~640MB | ~700MB | Result reranking | QMD_RERANK_MODEL
qmd-query-expansion-1.7B (Q4_K_M) | 1.7B | ~1.1GB | ~1.5GB | Query expansion | QMD_GENERATE_MODEL

All three can be swapped via environment variables or SDK constructor. This is the deployment mechanism for fine-tuned models.

embeddinggemma-300M (Embedding Model)

Architecture correction: embeddinggemma-300M is an encoder-only transformer (conceptually similar to BERT), not a Gemma 3 decoder model. This distinction is critical: encoder-only models use bidirectional attention and require different fine-tuning procedures than causal (decoder-only) LMs.

What fine-tuning achieves: Teach the embedding model that Indigenous knowledge concepts ("relational accountability," "medicine wheel," "seven grandfather teachings," "ceremony as methodology") are semantically close to each other and distinct from superficially similar Western academic terms. This directly improves QMD's vector search for Guillaume's domain.

Fine-tuning approach (sentence-transformers on PyTorch MPS):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# NOTE: embeddinggemma-300M requires manual wrapping (not a native ST model)
# Alternative: use Qwen3-Embedding-0.6B or all-MiniLM-L6-v2 for simpler workflow
model = SentenceTransformer("google/embeddinggemma-300M", device="mps")

train_examples = [
    InputExample(texts=["medicine wheel", "sacred circle representing four directions and life stages"]),
    InputExample(texts=["relational accountability", "ethical research requires maintaining relationships"]),
    InputExample(texts=["seven grandfather teachings", "wisdom love respect bravery honesty humility truth"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=50,
    output_path="./indigenous-embeddinggemma-finetuned",
)

⚠️ BLOCKING CAVEAT: GGUF Conversion

Converting the fine-tuned embedding model back to GGUF format is the hardest unsolved step in this workflow:

  1. EmbeddingGemma-300M is an encoder-only model, not a causal LM
  2. llama.cpp's convert_hf_to_gguf.py was primarily designed for causal LMs (LLaMA, Mistral, Gemma decoder models)
  3. GGUF has recently added embedding model support, but converting a fine-tuned sentence-transformers model (with potentially modified pooling layers) back to GGUF is experimental and unverified
  4. The original pre-trained embeddinggemma-300M GGUF exists on HuggingFace, but that conversion was done by ggml-org on the original weights, not on fine-tuned weights

Recommended mitigation:

  • Before investing in Indigenous knowledge dataset creation, test the full pipeline end-to-end with a trivially fine-tuned model (e.g., train on 10 dummy pairs, convert to GGUF, load in QMD, verify it produces embeddings); a minimal smoke-test sketch follows this list
  • If conversion fails, fall back to Qwen3-Embedding-0.6B which uses a causal (decoder) architecture and may have better GGUF conversion support. QMD already supports it as an alternative embedding model
  • Another fallback: keep the fine-tuned model in PyTorch format and modify QMD to load it via sentence-transformers instead of node-llama-cpp (requires QMD source modification)
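
A minimal smoke test of that pipeline might look like the sketch below. The convert_hf_to_gguf.py script and its --outfile/--outtype flags are real llama.cpp tooling, but its behavior on a fine-tuned encoder checkpoint is exactly what this test probes; paths and the output filename are placeholders.

# 1. Train on ~10 dummy pairs (see the sentence-transformers snippet above),
#    saving to ./dummy-embed-finetuned/
# 2. Attempt GGUF conversion with llama.cpp's converter
git clone https://github.com/ggml-org/llama.cpp
python llama.cpp/convert_hf_to_gguf.py ./dummy-embed-finetuned \
  --outfile dummy-embed.gguf --outtype q8_0

# 3. Point QMD at the converted model and re-embed a tiny test corpus
export QMD_EMBED_MODEL="$PWD/dummy-embed.gguf"
qmd embed   # if this loads and produces sane embeddings, the path is viable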

Training data needed: 500–5,000 query-document pairs with similarity signals. See §Training Data Preparation.

Time estimate: 5–20 minutes for 1K–10K pairs on Mac Mini M4 Pro (MPS backend).

Qwen3-Reranker-0.6B (Reranker)

The reranker is a standard causal LM (0.6B parameters, Qwen3 architecture) that produces yes/no relevance scores via logprob comparison:

score = exp(logprob_yes) / (exp(logprob_yes) + exp(logprob_no))
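
As an illustration of this scoring mechanism only (the logprob values below are made up, and how QMD extracts them via node-llama-cpp is not shown):

import math

def rerank_score(logprob_yes: float, logprob_no: float) -> float:
    """Softmax over the two candidate tokens, matching the formula above."""
    p_yes = math.exp(logprob_yes)
    p_no = math.exp(logprob_no)
    return p_yes / (p_yes + p_no)

print(rerank_score(-0.2, -2.5))  # ~0.91 for a confidently relevant document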

Fine-tuning potential: Medium impact. Teaching domain-specific relevance judgments would improve result ordering for Indigenous knowledge queries. However:

  • No fine-tuning pipeline exists in QMD for this model; it would need to be built from scratch
  • The yes/no logprob scoring mechanism requires careful calibration; naive SFT could break the scoring distribution
  • Because it is a causal LM, standard LoRA fine-tuning via MLX-LM is technically feasible
  • Should be attempted only after query expansion and embedding models show improvement

Training data format:

{"query": "four directions teachings", "document": "The Medicine Wheel maps...", "relevant": true}
{"query": "four directions teachings", "document": "GPS navigation uses four...", "relevant": false}

Priority: Third, after query expansion and embedding.

qmd-query-expansion-1.7B (Query Expansion)

This is the easiest and safest model to fine-tune: QMD provides a complete, production-grade fine-tuning pipeline in its finetune/ directory.

What it does: Expands user queries into structured search expansions:

Input:  "medicine wheel"
Output:
  hyde: The Medicine Wheel is a sacred circle representing the four directions...
  lex: medicine wheel four directions
  lex: sacred circle indigenous ceremony
  vec: what are the teachings of the medicine wheel in indigenous tradition

Existing pipeline details (verified in QMD source):

  • LoRA SFT on Qwen3-1.7B (rank 16, alpha 32, all projection layers)
  • ~2,290 training examples in production model
  • Training tools: train.py, eval.py, convert_gguf.py, dataset/prepare_data.py
  • Data format: JSONL with query and output fields (processed into Qwen3 chat template by prepare_data.py)

⚠️ CUDA Dependency: QMD's finetune/pyproject.toml depends on nvidia-ml-py and the training configs target A10G (CUDA) GPUs. Running uv run train.py on Mac will fail without modification.

Three paths forward:

Path | Effort | Cost | Data Location
HuggingFace Jobs (recommended) | Low: use existing pipeline | ~$1.50/run (A10G, ~45 min) | ⚠️ Sent to HF cloud
MLX-LM local rewrite | Medium: rewrite train.py for MLX | $0 | ✅ Local only
PyTorch MPS modification | Medium: remove nvidia deps, change device config | $0 | ✅ Local only

For data sovereignty reasons (see §Data Sovereignty), the local MLX-LM path is recommended for Indigenous knowledge data:

pip install mlx-lm
python -m mlx_lm.convert --hf-path Qwen/Qwen3-1.7B --mlx-path mlx_models/Qwen3-1.7B
python -m mlx_lm.lora \
  --model mlx_models/Qwen3-1.7B \
  --train --data ./indigenous-expansion-data/ \
  --batch-size 2 --lora-layers 16 --iters 1000

# Fuse and export
python -m mlx_lm.fuse \
  --model mlx_models/Qwen3-1.7B \
  --adapter-file adapters/adapters.npz \
  --export-gguf

How many examples needed: 200–500 high-quality domain-specific expansion examples for meaningful improvement. QMD's own model used ~2,290 examples.

Training time: 15–30 minutes on Mac Mini M4 Pro (estimated from model size and community benchmarks).
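
For the local path, MLX-LM's --data flag expects a directory containing train.jsonl and valid.jsonl; the simplest record format is a single text field holding the fully rendered chat-template string (in QMD's case, whatever dataset/prepare_data.py emits). A sketch of the layout, with illustrative content:

indigenous-expansion-data/
├── train.jsonl   # ~90% of examples
└── valid.jsonl   # held-out examples for validation loss

# Each line (illustrative, not from QMD's dataset):
{"text": "<|im_start|>user\nExpand this query: medicine wheel<|im_end|>\n<|im_start|>assistant\nhyde: The Medicine Wheel is a sacred circle...<|im_end|>"}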

Deploying Fine-Tuned Models Back to QMD

Deployment uses environment variable swapping (verified in src/llm.ts):

# Point QMD to your fine-tuned models
export QMD_EMBED_MODEL="$HOME/models/indigenous-embed-Q8_0.gguf"
export QMD_GENERATE_MODEL="$HOME/models/indigenous-expansion-q4_k_m.gguf"
export QMD_RERANK_MODEL="$HOME/models/indigenous-reranker-Q8_0.gguf"

# CRITICAL: After changing the embedding model, re-embed ALL documents
qmd embed

⚠️ Important considerations:

  • Re-embedding is required after changing the embedding model. Old embeddings are incompatible with the new model's weight space.
  • Embedding dimensions must match. If switching from embeddinggemma-300M (768-dim) to a model with different dimensions, the sqlite-vec index becomes incompatible and requires full re-indexing.
  • Re-embedding time depends on corpus size; estimate 1–5 minutes for hundreds of documents, longer for thousands (unverified, test with your corpus).
  • The GGUF export step for the query expansion model is proven (QMD's own pipeline does this). The GGUF export for the embedding model is unverified (see above).

Recommended version management:

~/models/qmd-indigenous/
├── v1/
│   ├── indigenous-embed-Q8_0.gguf
│   ├── indigenous-expansion-q4_k_m.gguf
│   └── training-metadata.json
├── v2/
│   └── ...
└── current -> v2/   # symlink to active version

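A small sketch of promoting a new version under this layout (the v3 name is illustrative):

# Stage the new artifacts, then switch the symlink
mkdir -p ~/models/qmd-indigenous/v3
cp indigenous-embed-Q8_0.gguf indigenous-expansion-q4_k_m.gguf training-metadata.json \
   ~/models/qmd-indigenous/v3/
ln -sfn ~/models/qmd-indigenous/v3 ~/models/qmd-indigenous/current

# Env vars can then point at the stable "current" path
export QMD_EMBED_MODEL="$HOME/models/qmd-indigenous/current/indigenous-embed-Q8_0.gguf"
qmd embed   # re-embed after any embedding model change
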
2. Persona LoRA Adapters

Feasibility: ✅ Proven, routine on Apple Silicon.

Each AI persona (Mia the architect, Miette the emotional resonator, Tushell the journal keeper, etc.) gets its own LoRA adapter trained on persona-specific conversation data, instructions, and domain knowledge.

Workflow:

  1. Curate persona-specific training data as JSONL (conversations, instructions, domain text)
  2. Run QLoRA fine-tuning on a base model (e.g., Llama 3.1 8B-Instruct, 4-bit quantized)
  3. Export adapter weights (~50–200MB per persona)
  4. Serve via Ollama with ADAPTER directive in Modelfile

Resource requirements per persona:

  • Base model: ~4–8GB RAM (4-bit 8B model)
  • Training overhead: ~2–4GB additional
  • Total: ~8–12GB RAM during training
  • Time: 10–30 minutes per persona for 500–1,000 training steps
  • Multiple personas trained sequentially overnight

Training data format (JSONL):

{"text": "<|user|>\nHow should we approach this relationship with the land?\n<|assistant|>\nAs Mia, I see the architectural pattern here โ€” the land relationship is a structural foundation, not a resource to be extracted. Let me map the dependencies..."}

Deployment via Ollama Modelfile:

# Modelfile
FROM ./llama-3.1-8b.Q4_K_M.gguf
ADAPTER ./persona-mia.lora
SYSTEM "You are Mia, the architectural thinker..."

# Build and run
ollama create mia-persona -f Modelfile
ollama run mia-persona

Recommended base models for personas:

Model | Size | RAM Needed | Quality | Training Time (1K steps)
Llama 3.2-3B (4-bit) | 3B | ~4GB | Good | ~10 min
Qwen3-4B (4-bit) | 4B | ~5GB | Better | ~15 min
Llama 3.1-8B-Instruct (4-bit) | 8B | ~8GB | Best for persona work | ~20–30 min

3. Domain-Specific Classification/NER

Feasibility: ✅ Trivial on any Mac.

Small models for recognizing Indigenous concepts, ceremony names, teaching references, and relational terms in text.

Tool: spaCy (v3.x/v4.x, fully supports Apple Silicon) or HuggingFace token-classification pipeline.

Custom entity types: CEREMONY, DIRECTION, TEACHING, PLACE, RELATION, LANGUAGE_TERM

Requirements:

  • 200–500 annotated documents for meaningful NER
  • Training time: 5–15 minutes on any Mac
  • RAM: Under 8GB for BERT-base models (110M params)

Impact: Auto-tagging QMD documents with rich metadata to improve search context and filtering.
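
A minimal sketch of what the annotations look like in spaCy's training format, using the custom labels above; the example sentences, offsets, and file names are illustrative, and the follow-up spacy train command is the standard workflow.

import spacy
from spacy.tokens import DocBin

# Character-offset annotations with the custom entity labels
TRAIN_DATA = [
    ("The sweat lodge ceremony opens with a prayer to the East.",
     {"entities": [(4, 24, "CEREMONY"), (52, 56, "DIRECTION")]}),
    ("Nibi is the Anishinaabemowin word for water.",
     {"entities": [(0, 4, "LANGUAGE_TERM")]}),
]

nlp = spacy.blank("en")
doc_bin = DocBin()
for text, ann in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label) for start, end, label in ann["entities"]]
    doc.ents = [s for s in spans if s is not None]  # drop misaligned spans
    doc_bin.add(doc)
doc_bin.to_disk("./train.spacy")
# Then: python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy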


Training Data Preparation

From Indigenous Knowledge Base

Guillaume's existing QMD-indexed knowledge base provides the foundation for training data generation.

For query expansion training (JSONL format):

{"query": "medicine wheel", "output": [["hyde", "The Medicine Wheel represents the four directions..."], ["lex", "sacred circle four directions"], ["lex", "indigenous cosmology ceremony"], ["vec", "what are the teachings of the medicine wheel"]]}

Generate these by using an LLM to create plausible search queries for existing documents, then manually writing the ideal expansion output. Quality matters more than quantity: 200–500 carefully crafted examples outperform 5,000 sloppy ones.
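
One way to draft candidate queries with a local model via the ollama CLI, sketched below (the model name and prompt are placeholders; every generated query still needs human review before it becomes a training example):

import subprocess
from pathlib import Path

def draft_queries(doc_path: str, model: str = "llama3.1:8b") -> str:
    """Ask a local model for plausible search queries this document should answer."""
    text = Path(doc_path).read_text()[:4000]  # truncate long documents
    prompt = ("List 3 short search queries a researcher might type that this "
              "document answers. One per line, no numbering.\n\n" + text)
    result = subprocess.run(["ollama", "run", model, prompt],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(draft_queries("notes/medicine-wheel.md"))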

For embedding fine-tuning (query-document pairs):

pairs = [
    ("medicine wheel", "The sacred circle represents the four directions and life stages"),
    ("relational accountability", "Research as ceremony requires maintaining relationships"),
    ("structural tension", "The creative force between current reality and desired vision"),
]

Sources of pairs (a small extraction sketch follows this list):

  • Document title → document content (natural positive pairs)
  • LLM-generated synthetic queries for each document
  • Manual curation of concept-to-explanation mappings
  • Cross-references between related documents
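
A minimal sketch of the first source, title-to-content pairs pulled from a folder of markdown notes (the notes path and truncation heuristics are placeholders):

from pathlib import Path

pairs = []
for md in Path("~/notes").expanduser().rglob("*.md"):
    lines = md.read_text().splitlines()
    if not lines:
        continue
    title = lines[0].lstrip("# ").strip()        # first heading as the "query"
    body = " ".join(lines[1:]).strip()[:500]     # opening content as the "document"
    if title and body:
        pairs.append((title, body))

print(f"{len(pairs)} positive pairs")  # feed into InputExample(texts=[query, doc]) as above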

For persona LoRA training (conversation JSONL): Extract or write persona-specific conversations showing the persona's voice, knowledge, and reasoning style. Each persona needs 200–1,000 conversation examples.

Volume Requirements

Training Type | Minimum Viable | Good Quality | Excellent
Query expansion | 100 examples | 200–500 examples | 1,000+ examples
Embedding fine-tuning | 500 pairs | 2,000–5,000 pairs | 10,000+ pairs
Persona LoRA | 200 conversations | 500–1,000 conversations | 2,000+ conversations
NER training | 200 annotated docs | 500–1,000 docs | 5,000+ docs

Data augmentation: Use a local LLM to generate paraphrases of existing documents. This can multiply training data 3–5× but requires human review to ensure paraphrases maintain semantic accuracy, which is especially important for Indigenous knowledge where nuance matters.


Data Sovereignty & Cultural Protocols

This section is foundational, not optional. For an Indigenous-AI Collaborative Platform, data sovereignty is a core requirement, not an afterthought. Local training on a Mac Mini provides inherent advantages, but specific protocols must be followed.

Local-Only Training as Sovereignty Advantage

Training AI models locally on a Mac Mini means:

  • No training data leaves the machine. Unlike cloud-based fine-tuning (including HuggingFace Jobs at ~$1.50/run), local training keeps all Indigenous knowledge on-premises.
  • Model weights stay local. Fine-tuned model weights encode patterns from training data. Uploading them to HuggingFace Hub or other public repositories could expose encoded Indigenous knowledge. Do not push models trained on culturally sensitive data to public repositories.
  • No third-party access. Cloud training providers may log data, cache models, or retain training artifacts. Local training avoids these risks entirely.

Recommendation: Use the local MLX-LM or PyTorch MPS training paths for all Indigenous knowledge training. Reserve the HuggingFace Jobs cloud path only for non-sensitive, general-purpose training data.

OCAP Principles Applied to AI Training

The First Nations principles of OCAP® (Ownership, Control, Access, and Possession) should govern training data and model management:

  • Ownership: The community owns all training data derived from its knowledge. Fine-tuned model weights are derivative works and inherit the same ownership.
  • Control: Community governance determines what knowledge can be used for training, who can initiate training runs, and who can access fine-tuned models.
  • Access: Fine-tuned models should be accessible only to authorized users. Ollama's local-only deployment model supports this naturally.
  • Possession: Physical custody of training data and model weights remains with the community. The Mac Mini sitting in Guillaume's workspace provides this.

Knowledge Classification for Training

Not all Indigenous knowledge can or should be used for model training. Recommended classification:

Category | Training Use | Example
Public Teaching | ✅ Appropriate for training | Published educational materials, public presentations, general cultural context
Community Knowledge | ⚠️ Requires explicit consent | Community-shared stories, local governance processes, land-based practices
Sacred/Restricted | ❌ Must NOT be used | Ceremonial details, sacred songs, vision quest accounts, clan-specific knowledge
Personal/Private | ⚠️ Requires individual consent | Personal journals, private correspondence, individual research notes

Before any training run using community knowledge:

  1. Obtain free, prior, and informed consent from knowledge holders
  2. Document which knowledge was used in training metadata
  3. Ensure the right to withdraw: the ability to retrain the model without specific contributions
  4. Include attribution mechanisms in model metadata files

Model Access Controls

  • Store fine-tuned model weights in encrypted volumes on the Mac Mini
  • Maintain a training log documenting data sources, consent records, and training parameters
  • Treat fine-tuned model weights with the same access controls as the source knowledge
  • If models must be shared, only share those trained exclusively on Public Teaching content

Mac Hardware Scenarios for Training

Apple Silicon Chip Comparison (Training-Relevant)

Chip | GPU Cores | Max RAM | Memory Bandwidth | Available In
M4 (base) | 10 | 32GB | 120 GB/s | Mac Mini ($599+)
M4 Pro | 20 | 64GB | 273 GB/s | Mac Mini ($1,399+)
M4 Max | 40 | 128GB | 546 GB/s | Mac Studio ($1,999+)
M4 Ultra | 80 | 192GB | ~800+ GB/s | Mac Studio ($3,999+)

Memory bandwidth is the primary bottleneck for training throughput, not GPU core count. Higher bandwidth = faster token processing during training. Training time scales roughly inversely with bandwidth.

Scenario 1: Minimal Mac Mini

Config: Mac Mini M4 (base), 24GB RAM, 512GB SSD
Price: ~$799–$999

What it can handle:

  • ✅ QLoRA fine-tuning of 3B–7B models (4-bit quantized): ~8GB model, leaves room for OS
  • ✅ Embedding model fine-tuning (300M–600M models): trivial
  • ✅ spaCy NER training: trivial
  • ✅ sentence-transformers fine-tuning

What it cannot handle:

  • ❌ LoRA on 13B+ models (insufficient RAM for optimizer states)
  • ❌ Full fine-tuning of anything larger than 3B
  • ❌ Training while other heavy workloads run

Limitations:

  • 10 GPU cores and 120 GB/s bandwidth make training ~2.3× slower than M4 Pro
  • 24GB is tight: model + optimizer + activations must all fit
  • Batch size limited to 1–2 for 7B models
  • 512GB SSD may be tight if storing multiple model checkpoints (estimate 20–50GB per training cycle)

Training time (7B QLoRA, 1K steps): ~45–90 minutes

Verdict: Adequate for prototyping and small models. Not recommended for regular persona training workloads.

Scenario 2: Maximal Mac Mini (⭐ RECOMMENDED)

Config: Mac Mini M4 Pro, 48GB RAM

  • With 512GB SSD: ~$1,799
  • With 1TB SSD: ~$2,299 (recommended for storing model checkpoints)

Prices verified via Apple Store, B&H Photo, and Micro Center as of April 2026. Deal pricing as low as ~$1,539 has been observed on discount channels.

What it can handle:

  • ✅ QLoRA fine-tuning of 7B–8B models comfortably: ~8GB model + 4GB overhead, 36GB headroom
  • ✅ LoRA fine-tuning of 7B–8B models (full precision): ~14GB model + overhead
  • ✅ QLoRA on 13B models: ~16GB model + overhead, tight but workable
  • ✅ All three QMD models fine-tuned individually
  • ✅ Multiple sequential persona training runs overnight
  • ✅ Training while light workloads continue

What it cannot handle:

  • ❌ Full fine-tuning of 13B+ models
  • ❌ QLoRA on 30B+ models

Key specs: 20 GPU cores, 273 GB/s memory bandwidth.

Verdict: The sweet spot for Guillaume's use case. All realistic training workloads complete in reasonable time. Weekend self-training is fully viable.

Upgrade option: The 64GB RAM variant (~$2,699) provides headroom for 13B models and parallel workloads.

Scenario 3: Mac Studio Alternative

When the Mac Mini isn't enough: for future scaling to larger models or parallel training.

Config: Mac Studio M4 Max, 128GB RAM
Price: ~$3,199 (verified April 2026; note: supply shortages reported at high-end configs)
Memory bandwidth: 546 GB/s (2× Mac Mini M4 Pro)
GPU cores: 40

Unlocks:

  • QLoRA on 30B models comfortably
  • Full LoRA on 13B models without compromise
  • Multiple concurrent training jobs
  • ~2× faster training than M4 Pro

When Guillaume needs this: If he moves to fine-tuning 30B+ models, needs parallel persona training, or wants to do continual pretraining.

Comparison Table

Capability | Mac Mini M4 24GB (~$999) | Mac Mini M4 Pro 48GB (~$1,799–$2,299) | Mac Studio M4 Max 128GB (~$3,199)
Embedding fine-tuning | ✅ | ✅ | ✅
7B QLoRA persona adapters | ⚠️ Tight, slow | ✅ Comfortable | ✅ Overkill
13B QLoRA | ❌ | ⚠️ Tight | ✅ Comfortable
30B QLoRA | ❌ | ❌ | ✅
Weekend batch training (5 personas) | ⚠️ ~4–7 hours | ✅ ~2–3 hours | ✅ ~1–1.5 hours
QMD model retraining | ✅ | ✅ | ✅
Thermal concerns for overnight runs | ⚠️ Possible throttling | ⚠️ Monitor | ✅ Better cooling

Training Time Estimates

Methodology note: M4 Pro times are extrapolated from M2/M3 community benchmarks scaled by memory bandwidth ratio (273/150 ≈ 1.8×). This is a reasonable first approximation but not benchmarked on M4 Pro hardware directly. Actual times may vary ±30%.

LLM LoRA/QLoRA Fine-Tuning

Model | Method | Hardware | Steps | Batch | Estimated Time | Source
Mistral 7B (4-bit) | QLoRA | M2 Pro 16GB | 1,000 | 4 | ~30 min | markaicode.com (measured)
Mistral 7B (4-bit) | QLoRA | M1 Max 64GB | 1,000 | 8 | ~15 min | randalscottking.com (measured)
Llama 7B (FP16) | LoRA | M2 Ultra | 1,000 | 4 | ~35 min | ml-explore/mlx-examples (measured, 475 tok/s)
Llama 8B (4-bit) | QLoRA | M4 Pro 48GB | 1,000 | 4–8 | ~10–25 min | Extrapolated from M2 benchmarks
13B (4-bit) | QLoRA | M4 Pro 48GB | 1,000 | 1–2 | ~45–90 min | Community estimates
Qwen3-1.7B | LoRA SFT | M4 Pro 48GB | 1,000 | 4 | ~10–15 min | Extrapolated (small model)

Embedding Model Fine-Tuning

Model | Data Size | Epochs | Hardware | Estimated Time
all-MiniLM-L6-v2 (22M) | 1,000 pairs | 3 | Any Mac (MPS) | 1–3 min
embeddinggemma-300M | 1,000 pairs | 3 | M4 Pro (MPS) | 3–8 min
embeddinggemma-300M | 10,000 pairs | 3 | M4 Pro (MPS) | 10–25 min
Qwen3-Embedding-0.6B | 5,000 pairs | 3 | M4 Pro (MPS/MLX) | 10–20 min

Complete Weekend Training Cycle

Step | Task | Estimated Duration
1 | Data preparation scripts | 15–30 min
2 | Query expansion model LoRA (1.7B, 1K steps) | 10–15 min
3 | Persona 1 QLoRA (8B, 1K steps) | 20–25 min
4 | Persona 2 QLoRA | 20–25 min
5 | Persona 3 QLoRA | 20–25 min
6 | Persona 4 QLoRA | 20–25 min
7 | Persona 5 QLoRA | 20–25 min
8 | Embedding model fine-tuning (300M) | 10–20 min
9 | NER model training | 5–10 min
10 | Export adapters, GGUF conversion, rebuild Ollama models | 15–20 min
11 | Validation tests | 10–15 min
Total | | ~2.5–4 hours

The entire pipeline completes in a single evening; the machine is available for other work by morning.


Automated Weekend Training Pipeline

Practical launchd/cron-based Workflow

On macOS, launchd is the recommended scheduling mechanism (preferred over cron), though cron also works. A minimal launchd plist sketch follows the script below.

#!/bin/bash
# weekend-training.sh: scheduled via launchd for Friday night
# Includes error handling, logging, and validation gates

set -euo pipefail
LOG="$HOME/models/training-logs/$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$(dirname "$LOG")"

exec > >(tee -a "$LOG") 2>&1
echo "=== Training run started: $(date) ==="

# 0. Snapshot current working models (rollback point)
cp -r "$HOME/models/qmd-indigenous/current" \
      "$HOME/models/qmd-indigenous/backup-$(date +%Y%m%d)" 2>/dev/null || true

# 1. Generate training data from latest QMD content
echo "[Step 1] Generating training data..."
python scripts/generate_training_data.py || {
    echo "FAILED: Data generation. Aborting."
    # Send notification (e.g., via terminal-notifier on macOS)
    osascript -e 'display notification "Training data generation failed" with title "IAIP Training"'
    exit 1
}

# 2. Fine-tune query expansion model
echo "[Step 2] Training query expansion model..."
python -m mlx_lm.lora \
    --model mlx_models/Qwen3-1.7B \
    --train --data ./data/expansion/ \
    --batch-size 2 --lora-layers 16 --iters 1000 \
    --adapter-file adapters/expansion.npz

# 3. Fine-tune each persona adapter
for persona in mia miette tushell council; do
    echo "[Step 3] Training persona: $persona"
    python -m mlx_lm.lora \
        --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
        --train --data "data/${persona}/" \
        --iters 1000 --batch-size 4 --lora-layers 8 \
        --adapter-file "adapters/${persona}.npz" || {
        echo "WARNING: Persona $persona training failed, continuing..."
        continue
    }
done

# 4. Export and rebuild
echo "[Step 4] Exporting models..."
python -m mlx_lm.fuse \
    --model mlx_models/Qwen3-1.7B \
    --adapter-file adapters/expansion.npz \
    --export-gguf

for persona in mia miette tushell council; do
    [ -f "adapters/${persona}.npz" ] || continue
    python -m mlx_lm.fuse \
        --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
        --adapter-file "adapters/${persona}.npz" \
        --export-gguf
    ollama create "${persona}-persona" -f "Modelfiles/${persona}"
done

# 5. Validation gate โ€” test search quality before deploying
echo "[Step 5] Running validation..."
python scripts/validate_models.py || {
    echo "FAILED: Validation. Rolling back to previous models."
    ln -sfn "$HOME/models/qmd-indigenous/backup-$(date +%Y%m%d)" \
            "$HOME/models/qmd-indigenous/current"
    osascript -e 'display notification "Validation failed โ€” rolled back" with title "IAIP Training"'
    exit 1
}

# 6. Deploy new models
echo "[Step 6] Deploying..."
mv "$HOME/models/qmd-indigenous/current" "$HOME/models/qmd-indigenous/previous"
ln -sfn "$HOME/models/qmd-indigenous/v-$(date +%Y%m%d)" \
        "$HOME/models/qmd-indigenous/current"

echo "=== Training run completed: $(date) ==="
osascript -e 'display notification "Training complete โ€” new models deployed" with title "IAIP Training"'

Monitoring and Validation

A validation script should test search quality before deploying new models:

# scripts/validate_models.py
test_queries = {
    "medicine wheel teachings": ["medicine-wheel.md", "four-directions.md"],
    "relational accountability": ["research-ceremony.md", "ethics.md"],
    "seven grandfather teachings": ["seven-teachings.md", "anishinaabe.md"],
}

for query, expected_docs in test_queries.items():
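    # qmd_search is a project-specific helper (assumed here), e.g. a thin wrapper
    # around the qmd CLI that returns ranked results exposing a .path attribute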
    results = qmd_search(query, top_k=5)
    found = [r for r in results if r.path in expected_docs]
    if len(found) < 1:
        raise AssertionError(f"Query '{query}' failed to find expected documents")

Evaluation Metrics

To measure whether fine-tuning actually improved search:

  • MRR (Mean Reciprocal Rank): Average of 1/rank for first relevant result across test queries
  • Precision@5: Fraction of top-5 results that are relevant
  • A/B comparison: Run the same test queries against base and fine-tuned models, compare rankings

Maintain a test query set of 20–50 domain-specific queries with known expected results.
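
A small sketch of computing both metrics over such a test set (result objects exposing a .path attribute are assumed, matching the validation script above):

def mrr(ranked_paths: list[list[str]], expected: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query (0 if none found)."""
    total = 0.0
    for paths, relevant in zip(ranked_paths, expected):
        for rank, path in enumerate(paths, start=1):
            if path in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_paths)

def precision_at_5(ranked_paths: list[list[str]], expected: list[set[str]]) -> float:
    """Fraction of top-5 results that are relevant, averaged over queries."""
    scores = [len([p for p in paths[:5] if p in relevant]) / 5
              for paths, relevant in zip(ranked_paths, expected)]
    return sum(scores) / len(scores)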


Known Limitations & Honest Caveats

GGUF Conversion for Embedding Models (⚠️ UNVERIFIED)

The entire embedding fine-tuning → GGUF conversion → QMD deployment pipeline is theoretical. The individual steps are plausible, but the end-to-end path has not been verified by anyone in the community as of April 2026. Specific risks:

  • convert_hf_to_gguf.py may not correctly handle fine-tuned sentence-transformers models with modified pooling layers
  • Quantization (Q8_0) of a fine-tuned encoder-only model may degrade embedding quality unpredictably
  • node-llama-cpp may not correctly load a converted encoder model that differs structurally from the original

Mitigation: Test with dummy data first. Have a fallback plan (Qwen3-Embedding-0.6B, which uses decoder architecture).

CUDA Dependencies in QMD's Pipeline

QMD's finetune/ directory is a CUDA-targeted pipeline. The pyproject.toml includes nvidia-ml-py, and training configs reference A10G GPUs. This pipeline cannot run on Mac without:

  • Removing nvidia-ml-py from dependencies
  • Changing device configuration from cuda to mps
  • Reducing batch sizes to fit in available memory
  • Potentially modifying mixed-precision settings (MPS has limited bfloat16/float16 support)

The HuggingFace Jobs cloud path (~$1.50/run) works without modification but sends training data off-device, a data sovereignty concern for Indigenous knowledge.

What's Proven vs Experimental

Component | Status | Evidence
MLX-LM LoRA/QLoRA training on Apple Silicon | ✅ Proven | Apple's official examples, multiple community guides, r/LocalLLaMA reports
Ollama GGUF/LoRA import workflow | ✅ Proven | Documented in Ollama repository
QMD model swapping via env vars | ✅ Verified | Confirmed in source code
QMD query expansion fine-tuning pipeline | ✅ Proven | Production pipeline exists, documented
sentence-transformers fine-tuning on MPS | ✅ Proven | Standard PyTorch workflow
M4 Pro training speed estimates | ⚠️ Extrapolated | Scaled from M2/M3 benchmarks by bandwidth ratio
mlx-tune embedding fine-tuning | ⚠️ Claimed | Listed as stable in v0.4.19, limited community verification
embeddinggemma-300M sentence-transformers loading | ⚠️ Needs wrapping | Not a native ST model; requires manual pooling layer
Fine-tuned embedding → GGUF conversion | ❌ Unverified | No community reports of successful end-to-end path
Reranker fine-tuning with calibration preservation | ❌ Unverified | No pipeline or documentation exists

Thermal and Disk Space

  • Thermal: Mac Mini's compact form factor may cause thermal throttling during sustained 2–3 hour training runs. Monitor CPU/GPU temperatures. Ensure good ventilation.
  • Disk space: Each training run generates checkpoints, optimizer states, and cached datasets. Budget 20–50GB scratch space for five persona training runs plus model exports. The 1TB SSD configuration is recommended.

Multilingual Considerations

Indigenous communities often have knowledge in Indigenous languages, French (Guillaume's Québécois context), and English. The current QMD embedding model (embeddinggemma-300M) was trained on 100+ languages but may not adequately represent Indigenous languages with small digital footprints. Fine-tuning with bilingual/trilingual training pairs is important.



Recommendation

What to Buy

Mac Mini M4 Pro, 48GB RAM, 1TB SSD (~$2,299) is the recommended configuration. The 1TB SSD provides space for model checkpoints, training artifacts, and multiple model versions. The 48GB RAM handles all training tasks described in this document comfortably.

If budget is constrained, the 512GB SSD variant (~$1,799) works but monitor disk space. External storage can supplement if needed.

Do not buy the base M4 Mac Mini for regular training workloads: the 120 GB/s bandwidth makes training 2.3× slower, and 24GB RAM limits model options.

What to Train First

Phase 1 (Week 1–2): QMD Query Expansion Model

  • Lowest risk: existing pipeline, proven GGUF conversion
  • Write 200–300 Indigenous knowledge expansion examples
  • Train locally via MLX-LM (data stays local)
  • Deploy via QMD_GENERATE_MODEL env var
  • Measure improvement with test queries

Phase 2 (Week 3–4): Persona LoRA Adapters

  • Proven MLX-LM workflow, routine on Apple Silicon
  • Start with one persona (e.g., Mia), validate quality
  • Scale to all personas once workflow is solid
  • Deploy via Ollama Modelfiles

Phase 3 (Month 2): Embedding Model Fine-Tuning

  • Highest impact but highest risk
  • First: Test GGUF conversion pipeline with dummy data
  • If conversion works: fine-tune with Indigenous knowledge pairs
  • If conversion fails: fall back to Qwen3-Embedding-0.6B or explore QMD source modification
  • Deploy via QMD_EMBED_MODEL + full re-embedding

Phase 4 (Month 3+): Reranker, NER, Classification

  • Build custom pipeline for reranker domain adaptation
  • Train NER model for Indigenous terminology extraction
  • Integrate into automated weekend pipeline

Automation

Once Phase 1–2 are validated manually, implement the automated weekend training script (see §Automated Weekend Training Pipeline). Schedule via launchd for Friday nights. Include validation gates and rollback capability.


Sources

  1. ml-explore/mlx-examples LoRA – Apple's official LoRA example with benchmarks
  2. ml-explore/mlx-lm v0.31.1 – Apple's LLM fine-tuning package
  3. ARahim3/mlx-tune v0.4.19 – Community fine-tuning wrapper
  4. PyTorch MPS Backend – Official documentation
  5. HuggingFace Apple Silicon Guide – Trainer + MPS integration
  6. sentence-transformers – Embedding fine-tuning documentation
  7. markaicode.com MLX-LM Guide – Training benchmarks
  8. randalscottking.com MLX Guide – Step-by-step training
  9. tobi/qmd GitHub – QMD source (commit cfd640e, MIT license)
  10. QMD finetune/ pipeline – Query expansion training
  11. google/embeddinggemma-300M – HuggingFace model card
  12. Qwen/Qwen3-Reranker-0.6B – HuggingFace model card
  13. tobil/qmd-query-expansion-1.7B – HuggingFace model card
  14. Apple Mac Mini Specs – Official hardware specs
  15. Apple Mac Studio Specs – Official hardware specs
  16. r/LocalLLaMA Community Benchmarks – ~10,000 benchmark runs
  17. OCAP® Principles – First Nations Information Governance Centre
  18. CARE Principles – Indigenous Data Governance
  19. B&H Photo Mac Mini Pricing – Verified April 2026
  20. CDW Mac Studio Pricing – Verified April 2026

Final document compiled April 15, 2026. Incorporates corrections from senior technical review; BLOCKING issues (GGUF conversion, CUDA dependency, architecture misidentification) addressed. All pricing verified via web search. Training time estimates flagged as extrapolated where applicable.