Review: Track 2 – Training & QMD Models
Reviewer: Senior Technical Reviewer
Review Date: April 15, 2026
Documents Reviewed:
- Agent C: Apple Silicon Training & Fine-Tuning Capabilities (agent-c-mac-training.md)
- Agent D: QMD Models & Fine-Tuning Potential (agent-d-qmd-models.md)
Verification Method: Cross-referenced against QMD GitHub source (commit cfd640e), web searches for current framework versions, Apple hardware specs, HuggingFace model cards, and QMD src/llm.ts source code.
Review Summary
Overall Quality: B+ (Good with significant corrections needed)
Both documents are substantive and well-structured. Agent D is notably strong – its QMD source code analysis is verified accurate against the actual src/llm.ts and finetune/ directory. Agent C provides useful practical guidance but contains several version/pricing inaccuracies and one significant architectural misstatement shared with Agent D. The most critical gap across both documents is the embedding model fine-tuning-to-GGUF conversion pipeline, which is presented as straightforward but is actually the hardest unsolved step in the entire workflow. Both documents are also thin on data sovereignty and Indigenous knowledge handling – a gap that matters given the specific user's context.
Document 1: Mac Mini Training – Review
Accuracy Issues
- mlx-tune version inflated: Doc C states mlx-tune v0.4.21 with all features "Stable." Web search confirms the latest public version is v0.4.19 as of April 2026. The version number is fabricated or anticipatory. The capability matrix may be aspirational rather than verified – revision agents should confirm each feature's actual status against the v0.4.19 release notes.
- MLX version notation ambiguous: Doc C says "MLX 0.20+ is production-ready." The MLX framework (ml-explore/mlx) and MLX-LM (ml-explore/mlx-lm) are separate packages. MLX-LM is at v0.31.1 as of March 2026. The "0.20+" likely refers to the MLX array framework, not MLX-LM. This distinction matters and should be clarified – users will pip install mlx-lm, not mlx directly, for LLM work.
- Mac Mini M4 Pro 48GB pricing incorrect: Doc C claims ~$1,800 for "Mac Mini M4 Pro, 48GB RAM, 1TB SSD." Verified pricing shows:
- 48GB/512GB SSD: ~$1,799
- 48GB/1TB SSD: ~$2,199–$2,499 (depending on CPU/GPU tier)
The $1,800 figure is only accurate for the 512GB SSD config. With 1TB SSD the price is $400–$700 higher than claimed. This affects budget recommendations.
- LoRA training CLI example outdated: The example uses python lora.py --model ..., which comes from the older mlx-examples repository. Current MLX-LM uses python -m mlx_lm.lora (or the mlx_lm.lora CLI entry point directly). The document later uses the correct form in the automation script (line 508), creating internal inconsistency; a corrected invocation is sketched after this list.
- EmbeddingGemma architecture misidentified: Doc C (line 204) and Doc D (line 92) both describe embeddinggemma-300M as "Gemma 3 (T5Gemma initialization)." This is incorrect. Web search and the HuggingFace model card indicate embeddinggemma-300M is an encoder-only transformer (conceptually similar to BERT), not a Gemma 3 decoder model. It does NOT use the Gemma 3 decoder architecture. This matters for fine-tuning – encoder-only models use different training procedures than causal LMs.
- "30–40% training speedup" claim for MLX over PyTorch MPS: This appears on both line 36 (as an advantage) and line 106 (as a disadvantage for MPS). No specific benchmarks are cited to support this exact figure. Community reports suggest MLX is faster, but the precise percentage varies by model size and task. It should be qualified as "approximately" or sourced.
Completeness Gaps
- No mention of QMD's fine-tune pipeline being CUDA-only: The QMD finetune/pyproject.toml includes nvidia-ml-py as a dependency and the default training configs target A10G (CUDA) GPUs. The sft_local.yaml config is designed for multi-GPU CUDA setups, not Mac MPS. Doc C doesn't address this incompatibility – it implies Guillaume can run QMD's pipeline locally on Mac, but the pipeline would need modification to work with MPS (remove the nvidia-ml-py dependency, change the device config, possibly reduce batch sizes).
- Missing: How to actually convert LoRA adapters trained with MLX-LM to GGUF for QMD: The document describes training with MLX-LM and serving via Ollama, but QMD uses node-llama-cpp with GGUF files directly. The conversion path from MLX LoRA adapters → merged weights → GGUF → QMD is not fully documented. The mlx_lm.fuse + GGUF export path needs explicit verification; a sketch of that path appears after this list.
- Missing: Axolotl Mac compatibility caveat: Doc C lists Axolotl as "✅ Works on Mac," but Axolotl has historically had CUDA dependencies and limited MPS support. This claim needs verification.
- Missing: Power consumption and thermal management for overnight training: The Mac Mini has a compact form factor. For 2.5+ hour overnight training runs, thermal throttling is a realistic concern that isn't addressed.
- Missing: Disk space requirements for training artifacts: LoRA training generates checkpoints, optimizer states, and cached datasets. Five persona training runs could easily consume 20–50 GB of scratch space. Not mentioned.
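To make the missing conversion path concrete, here is a sketch of the fuse-then-convert route. The paths are placeholders and a local llama.cpp checkout is assumed; whether convert_hf_to_gguf.py accepts the fused MLX output (versus needing a de-quantized HuggingFace export or mlx-lm's own GGUF export option) is exactly the verification gap flagged above, so treat this as a test plan rather than a working recipe.

```python
# Sketch of the MLX LoRA adapter -> merged weights -> GGUF path flagged above as unverified.
# Paths and the llama.cpp checkout location are placeholders; whether the converter accepts
# the fused output is the open question this smoke test is meant to answer.
import subprocess

# 1. Fuse the LoRA adapter back into full weights (mlx_lm.fuse ships with mlx-lm).
subprocess.run(
    [
        "python", "-m", "mlx_lm.fuse",
        "--model", "mlx-community/Qwen2.5-7B-Instruct-4bit",  # placeholder base model
        "--adapter-path", "adapters/persona_a",
        "--save-path", "fused/persona_a",
    ],
    check=True,
)

# 2. Attempt GGUF conversion with llama.cpp's converter (assumes a local ./llama.cpp checkout).
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py", "fused/persona_a",
        "--outfile", "models/persona_a.gguf",
        "--outtype", "q8_0",
    ],
    check=True,
)
```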
Corrections Needed
| Item | Current (Incorrect) | Corrected |
|---|---|---|
| mlx-tune version | v0.4.21 | v0.4.19 (verify each capability against release notes) |
| Mac Mini M4 Pro 48GB/1TB price | ~$1,800 | ~$2,200–$2,500 |
| LoRA CLI example | python lora.py | python -m mlx_lm.lora |
| EmbeddingGemma architecture | Gemma 3 (T5Gemma) | Encoder-only transformer |
| MLX version reference | "MLX 0.20+" | Clarify: MLX framework vs MLX-LM v0.31.1 |
Training Time Verification
The training time estimates are reasonable but optimistic for M4 Pro hardware:
- 7B QLoRA 1K steps in 10–20 min on M4 Pro 48GB: Plausible. Community reports show ~15–30 min on M2 Pro/Max, and M4 Pro has 1.5–2× the bandwidth. The 10-minute low end may be achievable with aggressive quantization and batch=1.
- Embedding fine-tuning 5–20 min: Reasonable for 300M–600M parameter models with 1K–10K pairs. These models are genuinely small.
- 5 persona adapters in 1.5–3 hours: Arithmetic roughly checks out (5 × 20–30 min ≈ 1.7–2.5 hours, plus overhead).
- M4 Pro extrapolation methodology: Using memory bandwidth ratio for scaling is an imperfect but reasonable first approximation. Should be flagged as estimated, not benchmarked.
Additional Findings
- Positive: The weekend self-training schedule (lines 437–449) is one of the strongest contributions – practical, time-boxed, and actionable.
- Gap: The train.sh automation script (lines 499–529) lacks error handling, logging, notification on failure, and validation gates. A production script should verify adapter quality before deploying; a minimal validation-gate sketch follows this list.
- Gap: No mention of launchd (Mac's native service manager) as an alternative to cron for scheduling. On macOS, launchd is the recommended approach.
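To illustrate the kind of validation gate that is missing, a minimal Python wrapper sketch that the overnight job could call instead of bare train.sh. The paths, timeout, and adapter file name are assumptions (the adapter file name in particular depends on the mlx-lm version), not a drop-in replacement.

```python
# Minimal sketch of a validation gate for the overnight automation: run training, confirm an
# adapter was actually produced, and only then promote it. Paths, the timeout, and the
# adapter file name ("adapters.safetensors") are assumptions to adapt to the real setup.
import logging
import shutil
import subprocess
from pathlib import Path

logging.basicConfig(filename="training.log", level=logging.INFO)

def train_and_promote(adapter_dir: str, deploy_dir: str) -> bool:
    try:
        subprocess.run(["bash", "train.sh"], check=True, timeout=4 * 3600)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
        logging.error("Training failed: %s", exc)
        return False

    adapter = Path(adapter_dir) / "adapters.safetensors"
    if not adapter.exists() or adapter.stat().st_size == 0:
        logging.error("No adapter produced in %s; keeping the previous deployment", adapter_dir)
        return False

    # Promote only after the checks pass; otherwise the previously deployed adapter stays live.
    shutil.copytree(adapter_dir, deploy_dir, dirs_exist_ok=True)
    logging.info("Promoted adapter from %s to %s", adapter_dir, deploy_dir)
    return True

if __name__ == "__main__":
    train_and_promote("adapters/persona_a", "deployed/persona_a")
```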
Document 2: QMD Models – Review
Accuracy Issues
- EmbeddingGemma architecture (same error as Doc C): Line 92 says "Gemma 3 (T5Gemma initialization)." This is incorrect – it's an encoder-only transformer. This error appears in both documents, suggesting a shared misconception or copy-paste from the same incorrect source.
- EmbeddingGemma paper citation: Doc D cites arxiv.org/abs/2509.20354 – an arxiv ID starting with 2509 implies a September 2025 submission. This is temporally plausible, but the paper title and ID should be verified; I could not confirm this specific arxiv ID exists.
- "embeddinggemma-300M... Does NOT support float16 – requires float32 or bfloat16": This claim is stated but not sourced. If the model is encoder-only, this precision constraint should be verified against the model card. It has implications for fine-tuning (float32 uses 2× the memory of float16).
- Training data format discrepancy: Doc D shows the training data format as {"query": "...", "output": [["hyde", "..."], ["lex", "..."], ["vec", "..."]]}. But QMD's finetune/dataset/prepare_data.py processes data into Qwen3 chat template format (a single "text" field with <|im_start|> markers), as shown in sft.yaml: text_field: "text". The JSONL format shown is the raw format before prepare_data.py processing. This is clarified later in the doc but could confuse readers who skip ahead; a conceptual sketch of the transformation follows this list.
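To make the two formats concrete, a conceptual sketch of what the raw-to-chat-template step does. This is an illustration only, not QMD's actual prepare_data.py; the instruction wording is a placeholder and the real template lives in that script.

```python
# Conceptual sketch only: turn a raw {"query", "output"} record into the single Qwen3
# chat-template "text" field that training consumes (roughly what prepare_data.py does).
# The instruction wording below is a placeholder, not QMD's actual prompt.
import json

def to_chat_text(record: dict) -> dict:
    expansions = "\n".join(f"{kind}: {value}" for kind, value in record["output"])
    text = (
        "<|im_start|>user\n"
        f"Expand this search query: {record['query']}<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"{expansions}<|im_end|>\n"
    )
    return {"text": text}

raw = {"query": "solar panel maintenance", "output": [["hyde", "..."], ["lex", "..."], ["vec", "..."]]}
print(json.dumps(to_chat_text(raw), ensure_ascii=False))
```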
Model Identification Verification
All three models verified correct against src/llm.ts:
| Model | Doc D URI | Source Code URI | Match? |
|---|---|---|---|
| Embedding | hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf | hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf | ✅ Exact |
| Reranker | hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf | hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf | ✅ Exact |
| Query Expansion | hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf | hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf | ✅ Exact |
Env var swapping verified correct:
- QMD_EMBED_MODEL → line 504 of src/llm.ts
- QMD_GENERATE_MODEL → line 505 of src/llm.ts
- QMD_RERANK_MODEL → line 506 of src/llm.ts
SDK constructor API verified correct – createStore() with LlamaCpp({ embedModel, rerankModel, generateModel }) matches the source.
QMD version confirmed: @tobilu/qmd v2.1.0, node-llama-cpp 3.18.1, sqlite-vec 0.1.9 – all match package.json.
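Given the verified env vars, swapping in a custom model can be exercised with something like the sketch below. The local file path is a placeholder, and whether QMD accepts a bare path (versus an hf: URI) for these variables should be confirmed against its docs.

```python
# Sketch: point QMD at a custom embedding GGUF via the verified env var, re-embed, and run
# a query. The model path and query are placeholders; qmd must be on PATH, and whether a
# bare local path is accepted here (vs. an hf: URI) should be confirmed.
import os
import subprocess

env = {
    **os.environ,
    "QMD_EMBED_MODEL": "/path/to/custom-embedding.gguf",  # placeholder local model file
}

subprocess.run(["qmd", "embed"], env=env, check=True)                 # re-embed the corpus
subprocess.run(["qmd", "search", "traditional plant knowledge"], env=env, check=True)
```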
Fine-Tuning Proposal Assessment
- Query Expansion Model (Priority 1) – ✅ STRONG PROPOSAL:
  - The fine-tuning pipeline is verified to exist in the finetune/ directory with complete tooling: train.py, eval.py, convert_gguf.py, reward.py, dataset/schema.py, dataset/prepare_data.py.
  - LoRA config verified: rank 16, alpha 32, all projection layers, 5 epochs, 2e-4 lr, ~2,290 examples.
  - Cloud training via HuggingFace Jobs (~$1.50/run on A10G) is a documented, tested path.
  - CRITICAL CAVEAT NOT ADEQUATELY HIGHLIGHTED: The pipeline is designed for CUDA GPUs. The pyproject.toml depends on nvidia-ml-py. Running locally on Mac requires either: (a) modifying the pipeline to remove CUDA deps and use the MPS device, (b) using the HuggingFace Jobs cloud path (easiest), or (c) rewriting the training to use MLX-LM instead of PyTorch/trl/peft.
- Embedding Model Fine-Tuning (Priority 2) – ⚠️ PROPOSAL HAS CRITICAL GAP:
  - The sentence-transformers fine-tuning approach is technically sound for training.
  - THE CONVERSION BACK TO GGUF IS THE HARD PART and is inadequately addressed. Both documents treat this as a simple convert_hf_to_gguf.py call, but:
    - EmbeddingGemma-300M is an encoder-only model, not a causal LM
    - llama.cpp's convert_hf_to_gguf.py was designed for causal LMs (LLaMA, Mistral, etc.)
    - Support for converting fine-tuned encoder-only models back to GGUF is experimental and unverified
    - The GGUF format has recently added embedding model support, but converting a modified (fine-tuned) sentence-transformers model is not the same as converting the original pre-trained model
  - This is the single riskiest step in the entire proposal. If conversion fails, the embedding fine-tuning work cannot be deployed into QMD.
  - Doc D acknowledges this at evidence level "⚠️ Theoretically viable, not tested" – but the surrounding text presents the workflow as if it's straightforward. The risk needs to be elevated to a blocking concern with a recommended mitigation: test the conversion pipeline with a dummy fine-tuned model before investing in Indigenous knowledge dataset creation (a smoke-test sketch follows this list).
- Reranker Fine-Tuning (Priority 3) – ⚠️ UNDERSPECIFIED:
  - The Qwen3-Reranker is a standard causal LM (0.6B params), so LoRA fine-tuning is technically feasible.
  - However, no pipeline exists for this in QMD. No training data format, no training script, no evaluation harness.
  - The yes/no logprob scoring mechanism (lines 136–148) means the fine-tuning needs to preserve this specific behavior – standard SFT might break the calibration.
  - Should be clearly labeled as "requires building custom pipeline from scratch."
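Per the mitigation above, the blocking risk can be de-risked cheaply: save the base model through sentence-transformers without any real training and immediately attempt GGUF conversion. A sketch under those assumptions follows (paths and the llama.cpp checkout location are placeholders; if the direct SentenceTransformer load fails, the manual wrapping shown under Cross-Document Contradictions below applies).

```python
# Smoke test for the blocking conversion risk: round-trip the embedding model through
# sentence-transformers with no real training, then attempt GGUF conversion. If this step
# already fails, the embedding fine-tuning proposal needs a different deployment path.
# Paths are placeholders; the model id may require accepting the license on HuggingFace.
import subprocess
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300M")  # may need manual wrapping (see below)
model.save("checkpoints/embeddinggemma-dummy")

# Attempt conversion with llama.cpp's converter (assumes a local ./llama.cpp checkout).
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py", "checkpoints/embeddinggemma-dummy",
        "--outfile", "models/embeddinggemma-dummy.gguf",
        "--outtype", "q8_0",
    ],
    check=True,
)
```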
Completeness Gaps
- Missing: How QMD re-embeds documents after model swap. Doc D mentions qmd embed (line 348) for re-embedding, but doesn't explain:
  - How long this takes for a typical knowledge base (100s–1000s of documents)
  - Whether it can be done incrementally or must re-process everything
  - Whether the SQLite index needs to be rebuilt
  - What happens if the old and new models have different embedding dimensions
- Missing: Embedding dimension compatibility. If Guillaume switches from embeddinggemma-300M (768-dim) to a fine-tuned model, the dimensions MUST match or the sqlite-vec index will be incompatible. If he switches to Qwen3-Embedding-0.6B (1024-dim), he needs a full re-index. This is not discussed.
- Missing: Testing/evaluation methodology for domain-specific models. Both documents describe training but neither provides a rigorous evaluation framework for measuring whether Indigenous knowledge search actually improved. Needed (see the sketch after this list):
  - A test query set with expected results
  - Metrics (MRR, NDCG, precision@k)
  - A/B comparison methodology between base and fine-tuned models
- Missing: Multi-language considerations. Indigenous communities often have knowledge in Indigenous languages, French (given Guillaume's Québécois context), and English. The documents don't address multilingual embedding challenges or model selection for non-English content.
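As a starting point for the missing evaluation framework, a self-contained sketch of MRR and precision@k over a hand-curated test set. The search_fn callable is a stand-in for whatever retrieval call is being measured (base vs fine-tuned); it is an assumed interface, not a QMD API.

```python
# Sketch of an A/B evaluation harness: MRR and precision@k over a small, hand-curated
# test set. `search_fn` is a stand-in for the retrieval call under test (base model vs
# fine-tuned); it is NOT a QMD API, just the interface this harness assumes.
from typing import Callable, Dict, List

TEST_SET: Dict[str, List[str]] = {
    # query -> ids of documents a knowledgeable reviewer considers relevant
    "traditional medicine plants": ["doc-014", "doc-102"],
    "treaty rights fishing": ["doc-033"],
}

def evaluate(search_fn: Callable[[str], List[str]], k: int = 10) -> Dict[str, float]:
    mrr, prec = 0.0, 0.0
    for query, relevant in TEST_SET.items():
        results = search_fn(query)[:k]
        ranks = [i for i, doc_id in enumerate(results, start=1) if doc_id in relevant]
        mrr += 1.0 / ranks[0] if ranks else 0.0
        prec += len(ranks) / k
    n = len(TEST_SET)
    return {"MRR": mrr / n, f"precision@{k}": prec / n}

# Usage: run the same harness against both models and compare.
# baseline = evaluate(base_search); candidate = evaluate(finetuned_search)
```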
Additional Findings
- Strong: The QMD architecture diagram (lines 43–62) is clear and accurate.
- Strong: The reranker scoring formula (lines 147–148) matches the actual implementation pattern in node-llama-cpp.
- Strong: The version management recommendation (lines 552–577) with symlinked model directories is practical and well-designed.
- Gap: The qmd search "*" --all --json command (line 433) for extracting training data should be verified – the --all flag may not exist in QMD's CLI.
Cross-Document Contradictions
1. Embedding Fine-Tuning Framework Disagreement
Doc C recommends mlx-tune as the primary tool for embedding fine-tuning (lines 58–88), claiming it supports "InfoNCE/contrastive learning" for "BERT, ModernBERT, Qwen3-Embedding, Harrier architectures."
Doc D recommends sentence-transformers on PyTorch (lines 297–323), using MultipleNegativesRankingLoss.
These are two different frameworks with different conversion paths:
- mlx-tune → MLX format → ??? (no documented GGUF export for embedding models)
- sentence-transformers → HuggingFace format → GGUF (untested for encoder-only)
Neither document reconciles these approaches or recommends one definitively. The revision must pick one primary path and document the full end-to-end conversion.
2. "embeddinggemma supports sentence-transformers" vs Reality
Doc D (line 302) shows SentenceTransformer("google/embeddinggemma-300M") as if it's a drop-in. Web search indicates this model is NOT natively published as a sentence-transformers package – it requires manual wrapping with a pooling layer. This complicates the fine-tuning workflow.
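To make the sentence-transformers path concrete, including the manual wrapping just described, a minimal sketch follows. The model id, pooling mode, and toy training pairs are assumptions; whether the underlying encoder loads cleanly through models.Transformer is part of what needs verification.

```python
# Sketch of the sentence-transformers path from contradictions 1 and 2: wrap the encoder
# with an explicit pooling layer (since the model is reportedly not published as a native
# sentence-transformers package), then fine-tune with MultipleNegativesRankingLoss.
# Model id, pooling mode, and the toy pairs are assumptions, not verified settings.
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

word = models.Transformer("google/embeddinggemma-300M")
pool = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word, pool])

train_examples = [
    InputExample(texts=["traditional medicine plants", "Notes on cedar and sweetgrass preparation"]),
    InputExample(texts=["treaty rights fishing", "Summary of fishing provisions in Treaty 3"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("checkpoints/embeddinggemma-domain")
```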
3. QMD Fine-Tune Pipeline: CUDA vs Mac
Doc C implies local Mac training is seamless. Doc D correctly references the finetune/ pipeline but doesn't flag that pyproject.toml has nvidia-ml-py as a hard dependency, making uv run train.py fail on Mac without modification. The documents need to align on this: either provide Mac-specific instructions or clearly state the cloud path is recommended for QMD's existing pipeline.
4. Priority Ordering Discrepancy
Doc C (line 203) lists embedding fine-tuning as item #2 ("✅ FEASIBLE"). Doc D (line 221) lists query expansion as "Priority 1" and embedding as "Priority 2", noting query expansion has the existing pipeline.
These should be reconciled. Doc D's ordering (query expansion first, because infrastructure exists) is the more pragmatic recommendation.
5. Training Data Format
Doc C shows training data as: {"text": "<|user|>\n...\n<|assistant|>\n..."} (line 192)
Doc D shows QMD's actual format as: {"query": "...", "output": [["hyde", "..."], ...]} (line 236)
These are for different purposes (persona LoRA vs query expansion), so not truly contradictory – but the documents don't make this distinction clear enough.
Critical Gaps for Revision
- BLOCKING: Verify the embeddinggemma-300M → fine-tune → GGUF conversion pipeline end-to-end. Test with a trivially fine-tuned model before any Indigenous knowledge training. If this conversion fails, the entire embedding fine-tuning proposal collapses. Provide a fallback plan (e.g., use Qwen3-Embedding-0.6B instead, which may have better GGUF conversion support as a causal-architecture model).
- BLOCKING: Fix the embeddinggemma-300M architecture description. It is an encoder-only transformer, NOT "Gemma 3 (T5Gemma initialization)." This error appears in BOTH documents and affects fine-tuning guidance.
- IMPORTANT: Correct the mlx-tune version to v0.4.19 and verify each capability claim against actual release notes rather than projected features.
- IMPORTANT: Correct Mac Mini pricing. The $1,800 figure is for 48GB/512GB only. With 1TB SSD: ~$2,200–$2,500. The budget recommendation should specify the exact configuration.
- IMPORTANT: Address the QMD fine-tune pipeline's CUDA dependency. Either (a) provide a modified pyproject.toml and config for Mac MPS, or (b) clearly recommend HuggingFace Jobs ($1.50/run) as the path of least resistance, or (c) document rewriting the training to use MLX-LM.
- IMPORTANT: Provide a concrete evaluation framework for measuring whether domain-specific fine-tuning improved search quality. Include test queries, expected results, and metrics.
- MODERATE: Reconcile the embedding fine-tuning framework recommendation – pick either mlx-tune or sentence-transformers and document the full path including GGUF conversion.
- MODERATE: Address embedding dimension compatibility when swapping models in QMD. Document what happens with sqlite-vec when dimensions change.
- MODERATE: Add data preparation guidance specific to Indigenous knowledge. Both documents say "create 500–5,000 pairs" but provide no guidance on:
  - How to extract training pairs from an existing QMD knowledge base programmatically
  - Quality criteria for Indigenous knowledge training pairs
  - How to handle sacred/restricted knowledge that should NOT be in training data
  - Language diversity in the training set
- MODERATE: Add error handling and rollback to the weekend training automation. The current train.sh has no validation gates, no notification, no rollback capability.
- MINOR: Update the LoRA CLI example from python lora.py to python -m mlx_lm.lora.
- MINOR: Clarify MLX vs MLX-LM version numbers – they are separate packages with separate versioning.
Data Sovereignty Considerations
Both documents are significantly deficient in this area. For an Indigenous-AI Collaborative Platform, data sovereignty is not a nice-to-have – it is a foundational requirement. Issues that must be addressed:
1. Training Data Containment
- Fine-tuning creates model weights that encode training data patterns. If Indigenous knowledge is used for training, the resulting model weights carry that knowledge. Who owns these weights? Can they be shared?
- The documents recommend pushing models to HuggingFace Hub. Sacred or restricted knowledge embedded in model weights would be publicly exposed. The revision must explicitly warn against this and recommend private/local-only model storage.
2. Cloud Training Risks
- Doc D recommends HuggingFace Jobs for training (~$1.50/run). This sends training data to HuggingFace's cloud infrastructure. For culturally sensitive Indigenous knowledge, this may violate data sovereignty principles.
- The revision should clearly distinguish between paths that keep data local (MLX-LM on Mac, PyTorch MPS) and paths that send data to cloud (HuggingFace Jobs).
3. Knowledge Categories
- Not all Indigenous knowledge can or should be used for training. The documents should recommend:
- A classification system for knowledge: Public Teaching / Community Knowledge / Sacred/Restricted (a filter sketch follows this list)
- Only using "Public Teaching" category content for model training
- Community consultation before any training on shared knowledge
- OCAP principles (Ownership, Control, Access, Possession) applied to training data and model weights
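One way to operationalize the classification recommendation during data preparation is sketched below. The access_level field and its labels are hypothetical; the actual scheme, and which categories are ever trainable, must come from community consultation.

```python
# Sketch of a training-data gate based on the recommended knowledge classification.
# The "access_level" field and its labels are hypothetical; the actual scheme and the
# list of trainable categories must come from community consultation (OCAP).
import json

TRAINABLE = {"public_teaching"}  # only this category ever reaches a training set

def filter_training_records(in_path: str, out_path: str) -> int:
    kept = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if record.get("access_level", "restricted") in TRAINABLE:  # default to excluded
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept

if __name__ == "__main__":
    n = filter_training_records("knowledge_pairs.jsonl", "training_pairs.public.jsonl")
    print(f"kept {n} records classified as public teaching")
```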
4. Model Weight Sovereignty
- Fine-tuned model weights should be treated as derived works of the training data. Storage, distribution, and access controls should match the sensitivity level of the source knowledge.
- Local-only deployment (Mac Mini) inherently respects data sovereignty better than cloud deployment.
5. Consent and Attribution
- If using community-sourced Indigenous knowledge for training, the process should include:
- Free, prior, and informed consent from knowledge holders
- Attribution mechanisms in the model metadata
- Right to withdraw (ability to retrain without specific contributions)
Verified Facts
The following claims were confirmed through source code inspection and/or web search:
| Fact | Verification Method |
|---|---|
| QMD uses exactly 3 GGUF models via node-llama-cpp | ✅ Verified in src/llm.ts lines 196–199 |
| All 3 models swappable via env vars | ✅ Verified in src/llm.ts lines 504–506 |
| QMD version is 2.1.0 | ✅ Verified in package.json |
| node-llama-cpp version is 3.18.1 | ✅ Verified in package.json |
| sqlite-vec version is 0.1.9 | ✅ Verified in package.json |
| Complete fine-tuning pipeline exists in finetune/ | ✅ Verified: 20+ files including train.py, eval.py, convert_gguf.py, dataset/, configs/ |
| Fine-tuning uses LoRA SFT on Qwen3-1.7B | ✅ Verified in finetune/configs/sft.yaml |
| LoRA config: rank 16, alpha 32, all projection layers | ✅ Verified in finetune/configs/sft.yaml |
| ~2,290 training examples | ✅ Stated in finetune/README.md |
| Training pipeline uses trl/peft/transformers | ✅ Verified in finetune/pyproject.toml |
| MLX-LM latest version is v0.31.1 (March 2026) | ✅ Verified via web search (PyPI) |
| mlx-tune latest version is v0.4.19 | ✅ Verified via web search (PyPI, GitHub) |
| Mac Mini M4 Pro 48GB/512GB pricing ~$1,799 | ✅ Verified via multiple retailers |
| M4 Pro has 20 GPU cores, 273 GB/s bandwidth | ✅ Verified via Apple specs |
| QMD embeddinggemma uses nomic-style task prefix formatting | ✅ Verified in src/llm.ts formatQueryForEmbedding() |
| QMD supports Qwen3-Embedding alternative with auto-detection | ✅ Verified in src/llm.ts isQwen3EmbeddingModel() regex |
| QMD license is MIT | ✅ Verified in package.json |
Updated/Corrected Data
Version Corrections
| Item | Document Value | Corrected Value | Source |
|---|---|---|---|
| mlx-tune version | v0.4.21 | v0.4.19 | PyPI, GitHub releases |
| MLX framework reference | "MLX 0.20+" | MLX-LM v0.31.1 (separate from MLX framework) | PyPI |
Architecture Corrections
| Item | Document Value | Corrected Value | Source |
|---|---|---|---|
| EmbeddingGemma-300M architecture | "Gemma 3 (T5Gemma initialization)" | Encoder-only transformer (BERT-like) | HuggingFace model card, web search |
Pricing Corrections
| Item | Document Value | Corrected Value | Source |
|---|---|---|---|
| Mac Mini M4 Pro 48GB/1TB | ~$1,800 | ~$2,200–$2,500 | Apple Store, B&H, Micro Center |
| Mac Mini M4 Pro 48GB/512GB | (not specified separately) | ~$1,799 | Multiple retailers |
Pipeline Corrections
| Item | Document Claim | Actual Status | Impact |
|---|---|---|---|
| QMD fine-tune pipeline runs on Mac | Implied possible | CUDA-dependent (nvidia-ml-py in deps, A10G target hardware) | Must modify for Mac or use cloud |
| SentenceTransformer("google/embeddinggemma-300M") direct load | Shown as working | Requires manual wrapping (not a native ST model) | Complicates embedding fine-tuning |
| Fine-tuned embeddinggemma β GGUF conversion | Presented as straightforward | Experimental, unverified for fine-tuned encoder models | Blocking risk for embedding improvement |
Review completed April 15, 2026. All source code references verified against QMD commit cfd640e.