Review: Track 2 – Training & QMD Models
Reviewer: Senior Technical Reviewer
Review Date: April 15, 2026
Documents Reviewed:
- Agent C: Apple Silicon Training & Fine-Tuning Capabilities (agent-c-mac-training.md)
- Agent D: QMD Models & Fine-Tuning Potential (agent-d-qmd-models.md)
Verification Method: Cross-referenced against QMD GitHub source (commit cfd640e), web searches for current framework versions, Apple hardware specs, HuggingFace model cards, and QMD src/llm.ts source code.
Review Summary
Overall Quality: B+ (Good with significant corrections needed)
Both documents are substantive and well-structured. Agent D is notably strong – its QMD source code analysis is verified accurate against the actual src/llm.ts and finetune/ directory. Agent C provides useful practical guidance but contains several version/pricing inaccuracies and one significant architectural misstatement shared with Agent D. The most critical gap across both documents is the embedding model fine-tuning-to-GGUF conversion pipeline, which is presented as straightforward but is actually the hardest unsolved step in the entire workflow. Both documents are also thin on data sovereignty and Indigenous knowledge handling – a gap that matters given the specific user's context.
Document 1: Mac Mini Training – Review
Accuracy Issues
- mlx-tune version inflated: Doc C states mlx-tune v0.4.21 with all features "Stable." Web search confirms the latest public version is v0.4.19 as of April 2026. The version number is fabricated or anticipatory. The capability matrix may be aspirational rather than verified – revision agents should confirm each feature's actual status against the v0.4.19 release notes.
- MLX version notation ambiguous: Doc C says "MLX 0.20+ is production-ready." The MLX framework (ml-explore/mlx) and MLX-LM (ml-explore/mlx-lm) are separate packages. MLX-LM is at v0.31.1 as of March 2026. The "0.20+" likely refers to the MLX array framework, not MLX-LM. This distinction matters and should be clarified – users will pip install mlx-lm, not mlx directly, for LLM work.
- Mac Mini M4 Pro 48GB pricing incorrect: Doc C claims ~$1,800 for "Mac Mini M4 Pro, 48GB RAM, 1TB SSD." Verified pricing shows:
- 48GB/512GB SSD: ~$1,799
- 48GB/1TB SSD: ~$2,199–$2,499 (depending on CPU/GPU tier)
The $1,800 figure is only accurate for the 512GB SSD config. With 1TB SSD the price is $400–$700 higher than claimed. This affects budget recommendations.
- LoRA training CLI example outdated: The example uses python lora.py --model ..., which comes from the older mlx-examples repository. Current MLX-LM uses python -m mlx_lm.lora (or the mlx_lm.lora CLI entry point directly). The document later uses the correct form in the automation script (line 508), creating internal inconsistency; a corrected invocation is sketched after this list.
- EmbeddingGemma architecture misidentified: Doc C (line 204) and Doc D (line 92) both describe embeddinggemma-300M as "Gemma 3 (T5Gemma initialization)." This is incorrect. Web search and the HuggingFace model card indicate embeddinggemma-300M is an encoder-only transformer (conceptually similar to BERT), not a Gemma 3 decoder model. It does NOT use the Gemma 3 decoder architecture. This matters for fine-tuning – encoder-only models use different training procedures than causal LMs.
- "30–40% training speedup" claim for MLX over PyTorch MPS: This appears on both line 36 (as an advantage) and line 106 (as a disadvantage for MPS). No specific benchmarks are cited to support this exact figure. Community reports suggest MLX is faster, but the precise percentage varies by model size and task. It should be qualified as "approximately" or sourced.
Completeness Gaps
- No mention of QMD's fine-tune pipeline being CUDA-only: The QMD finetune/pyproject.toml includes nvidia-ml-py as a dependency and the default training configs target A10G (CUDA) GPUs. The sft_local.yaml config is designed for multi-GPU CUDA setups, not Mac MPS. Doc C doesn't address this incompatibility – it implies Guillaume can run QMD's pipeline locally on Mac, but the pipeline would need modification to work with MPS (remove the nvidia-ml-py dependency, change the device config, possibly reduce batch sizes).
- Missing: How to actually convert LoRA adapters trained with MLX-LM to GGUF for QMD: The document describes training with MLX-LM and serving via Ollama, but QMD uses node-llama-cpp with GGUF files directly. The conversion path from MLX LoRA adapters → merged weights → GGUF → QMD is not fully documented. The mlx_lm.fuse + GGUF export path needs explicit verification; a sketch of that path appears after this list.
- Missing: Axolotl Mac compatibility caveat: Doc C lists Axolotl as "✅ Works on Mac," but Axolotl has historically had CUDA dependencies and limited MPS support. This claim needs verification.
- Missing: Power consumption and thermal management for overnight training: The Mac Mini has a compact form factor. For 2.5+ hour overnight training runs, thermal throttling is a realistic concern that isn't addressed.
- Missing: Disk space requirements for training artifacts: LoRA training generates checkpoints, optimizer states, and cached datasets. Five persona training runs could easily consume 20–50 GB of scratch space. Not mentioned.
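To make the missing conversion path concrete, here is a sketch of the fuse-then-convert route. The paths are placeholders and a local llama.cpp checkout is assumed; whether convert_hf_to_gguf.py accepts the fused MLX output (versus needing a de-quantized HuggingFace export or mlx-lm's own GGUF export option) is exactly the verification gap flagged above, so treat this as a test plan rather than a working recipe.

```python
# Sketch of the MLX LoRA adapter -> merged weights -> GGUF path flagged above as unverified.
# Paths and the llama.cpp checkout location are placeholders; whether the converter accepts
# the fused output is the open question this smoke test is meant to answer.
import subprocess

# 1. Fuse the LoRA adapter back into full weights (mlx_lm.fuse ships with mlx-lm).
subprocess.run(
    [
        "python", "-m", "mlx_lm.fuse",
        "--model", "mlx-community/Qwen2.5-7B-Instruct-4bit",  # placeholder base model
        "--adapter-path", "adapters/persona_a",
        "--save-path", "fused/persona_a",
    ],
    check=True,
)

# 2. Attempt GGUF conversion with llama.cpp's converter (assumes a local ./llama.cpp checkout).
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py", "fused/persona_a",
        "--outfile", "models/persona_a.gguf",
        "--outtype", "q8_0",
    ],
    check=True,
)
```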
Corrections Needed
| Item | Current (Incorrect) | Corrected |
|---|---|---|
| mlx-tune version | v0.4.21 | v0.4.19 (verify each capability against release notes) |
| Mac Mini M4 Pro 48GB/1TB price | ~$1,800 | ~$2,200–$2,500 |
| LoRA CLI example | python lora.py | python -m mlx_lm.lora |
| EmbeddingGemma architecture | Gemma 3 (T5Gemma) | Encoder-only transformer |
| MLX version reference | "MLX 0.20+" | Clarify: MLX framework vs MLX-LM v0.31.1 |
Training Time Verification
The training time estimates are reasonable but optimistic for M4 Pro hardware:
- 7B QLoRA 1K steps in 10–20 min on M4 Pro 48GB: Plausible. Community reports show ~15–30 min on M2 Pro/Max, and M4 Pro has 1.5–2× the bandwidth. The 10-minute low end may be achievable with aggressive quantization and batch=1.
- Embedding fine-tuning 5–20 min: Reasonable for 300M–600M parameter models with 1K–10K pairs. These models are genuinely small.
- 5 persona adapters in 1.5–3 hours: Arithmetic roughly checks out (5 × 20–30 min ≈ 1.7–2.5 hours, plus overhead).
- M4 Pro extrapolation methodology: Using memory bandwidth ratio for scaling is an imperfect but reasonable first approximation. Should be flagged as estimated, not benchmarked.
Additional Findings
- Positive: The weekend self-training schedule (lines 437–449) is one of the strongest contributions – practical, time-boxed, and actionable.
- Gap: The train.sh automation script (lines 499–529) lacks error handling, logging, notification on failure, and validation gates. A production script should verify adapter quality before deploying; a minimal validation-gate sketch follows this list.
- Gap: No mention of launchd (Mac's native service manager) as an alternative to cron for scheduling. On macOS, launchd is the recommended approach.
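To illustrate the kind of validation gate that is missing, a minimal Python wrapper sketch that the overnight job could call instead of bare train.sh. The paths, timeout, and adapter file name are assumptions (the adapter file name in particular depends on the mlx-lm version), not a drop-in replacement.

```python
# Minimal sketch of a validation gate for the overnight automation: run training, confirm an
# adapter was actually produced, and only then promote it. Paths, the timeout, and the
# adapter file name ("adapters.safetensors") are assumptions to adapt to the real setup.
import logging
import shutil
import subprocess
from pathlib import Path

logging.basicConfig(filename="training.log", level=logging.INFO)

def train_and_promote(adapter_dir: str, deploy_dir: str) -> bool:
    try:
        subprocess.run(["bash", "train.sh"], check=True, timeout=4 * 3600)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
        logging.error("Training failed: %s", exc)
        return False

    adapter = Path(adapter_dir) / "adapters.safetensors"
    if not adapter.exists() or adapter.stat().st_size == 0:
        logging.error("No adapter produced in %s; keeping the previous deployment", adapter_dir)
        return False

    # Promote only after the checks pass; otherwise the previously deployed adapter stays live.
    shutil.copytree(adapter_dir, deploy_dir, dirs_exist_ok=True)
    logging.info("Promoted adapter from %s to %s", adapter_dir, deploy_dir)
    return True

if __name__ == "__main__":
    train_and_promote("adapters/persona_a", "deployed/persona_a")
```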
Document 2: QMD Models – Review
Accuracy Issues
- EmbeddingGemma architecture (same error as Doc C): Line 92 says "Gemma 3 (T5Gemma initialization)." This is incorrect – it's an encoder-only transformer. This error appears in both documents, suggesting a shared misconception or copy-paste from the same incorrect source.
- EmbeddingGemma paper citation: Doc D cites arxiv.org/abs/2509.20354 – an arxiv ID starting with 2509 implies a September 2025 submission. This is temporally plausible, but the paper title and ID should be verified; I could not confirm this specific arxiv ID exists.
- "embeddinggemma-300M... Does NOT support float16 – requires float32 or bfloat16": This claim is stated but not sourced. If the model is encoder-only, this precision constraint should be verified against the model card. It has implications for fine-tuning (float32 uses 2× the memory of float16).
- Training data format discrepancy: Doc D shows the training data format as {"query": "...", "output": [["hyde", "..."], ["lex", "..."], ["vec", "..."]]}. But QMD's finetune/dataset/prepare_data.py processes data into Qwen3 chat template format (a single "text" field with <|im_start|> markers), as shown in sft.yaml: text_field: "text". The JSONL format shown is the raw format before prepare_data.py processing. This is clarified later in the doc but could confuse readers who skip ahead; a conceptual sketch of the transformation follows this list.
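To make the two formats concrete, a conceptual sketch of what the raw-to-chat-template step does. This is an illustration only, not QMD's actual prepare_data.py; the instruction wording is a placeholder and the real template lives in that script.

```python
# Conceptual sketch only: turn a raw {"query", "output"} record into the single Qwen3
# chat-template "text" field that training consumes (roughly what prepare_data.py does).
# The instruction wording below is a placeholder, not QMD's actual prompt.
import json

def to_chat_text(record: dict) -> dict:
    expansions = "\n".join(f"{kind}: {value}" for kind, value in record["output"])
    text = (
        "<|im_start|>user\n"
        f"Expand this search query: {record['query']}<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"{expansions}<|im_end|>\n"
    )
    return {"text": text}

raw = {"query": "solar panel maintenance", "output": [["hyde", "..."], ["lex", "..."], ["vec", "..."]]}
print(json.dumps(to_chat_text(raw), ensure_ascii=False))
```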
Model Identification Verification
All three models verified correct against src/llm.ts:
| Model | Doc D URI | Source Code URI | Match? |
|---|---|---|---|
| Embedding | hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf | hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf | ✅ Exact |
| Reranker | hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf | hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf | ✅ Exact |
| Query Expansion | hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf | hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf | ✅ Exact |
Env var swapping verified correct:
- QMD_EMBED_MODEL → line 504 of src/llm.ts
- QMD_GENERATE_MODEL → line 505 of src/llm.ts
- QMD_RERANK_MODEL → line 506 of src/llm.ts
SDK constructor API verified correct – createStore() with LlamaCpp({ embedModel, rerankModel, generateModel }) matches the source.
QMD version confirmed: @tobilu/qmd v2.1.0, node-llama-cpp 3.18.1, sqlite-vec 0.1.9 – all match package.json.
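Given the verified env vars, swapping in a custom model can be exercised with something like the sketch below. The local file path is a placeholder, and whether QMD accepts a bare path (versus an hf: URI) for these variables should be confirmed against its docs.

```python
# Sketch: point QMD at a custom embedding GGUF via the verified env var, re-embed, and run
# a query. The model path and query are placeholders; qmd must be on PATH, and whether a
# bare local path is accepted here (vs. an hf: URI) should be confirmed.
import os
import subprocess

env = {
    **os.environ,
    "QMD_EMBED_MODEL": "/path/to/custom-embedding.gguf",  # placeholder local model file
}

subprocess.run(["qmd", "embed"], env=env, check=True)                 # re-embed the corpus
subprocess.run(["qmd", "search", "traditional plant knowledge"], env=env, check=True)
```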
Fine-Tuning Proposal Assessment
- Query Expansion Model (Priority 1) – ✅ STRONG PROPOSAL:
  - The fine-tuning pipeline is verified to exist in the finetune/ directory with complete tooling: train.py, eval.py, convert_gguf.py, reward.py, dataset/schema.py, dataset/prepare_data.py.
  - LoRA config verified: rank 16, alpha 32, all projection layers, 5 epochs, 2e-4 lr, ~2,290 examples.
  - Cloud training via HuggingFace Jobs (~$1.50/run on A10G) is a documented, tested path.
  - CRITICAL CAVEAT NOT ADEQUATELY HIGHLIGHTED: The pipeline is designed for CUDA GPUs. The pyproject.toml depends on nvidia-ml-py. Running locally on Mac requires either: (a) modifying the pipeline to remove CUDA deps and use the MPS device, (b) using the HuggingFace Jobs cloud path (easiest), or (c) rewriting the training to use MLX-LM instead of PyTorch/trl/peft.
- Embedding Model Fine-Tuning (Priority 2) – ⚠️ PROPOSAL HAS CRITICAL GAP:
  - The sentence-transformers fine-tuning approach is technically sound for training.
  - THE CONVERSION BACK TO GGUF IS THE HARD PART and is inadequately addressed. Both documents treat this as a simple convert_hf_to_gguf.py call, but:
    - EmbeddingGemma-300M is an encoder-only model, not a causal LM
    - llama.cpp's convert_hf_to_gguf.py was designed for causal LMs (LLaMA, Mistral, etc.)
    - Support for converting fine-tuned encoder-only models back to GGUF is experimental and unverified
    - The GGUF format has recently added embedding model support, but converting a modified (fine-tuned) sentence-transformers model is not the same as converting the original pre-trained model
  - This is the single riskiest step in the entire proposal. If conversion fails, the embedding fine-tuning work cannot be deployed into QMD.
  - Doc D acknowledges this at evidence level "⚠️ Theoretically viable, not tested" – but the surrounding text presents the workflow as if it's straightforward. The risk needs to be elevated to a blocking concern with a recommended mitigation: test the conversion pipeline with a dummy fine-tuned model before investing in Indigenous knowledge dataset creation (a smoke-test sketch follows this list).
- Reranker Fine-Tuning (Priority 3) – ⚠️ UNDERSPECIFIED:
  - The Qwen3-Reranker is a standard causal LM (0.6B params), so LoRA fine-tuning is technically feasible.
  - However, no pipeline exists for this in QMD. No training data format, no training script, no evaluation harness.
  - The yes/no logprob scoring mechanism (lines 136–148) means the fine-tuning needs to preserve this specific behavior – standard SFT might break the calibration.
  - Should be clearly labeled as "requires building custom pipeline from scratch."
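Per the mitigation above, the blocking risk can be de-risked cheaply: save the base model through sentence-transformers without any real training and immediately attempt GGUF conversion. A sketch under those assumptions follows (paths and the llama.cpp checkout location are placeholders; if the direct SentenceTransformer load fails, the manual wrapping shown under Cross-Document Contradictions below applies).

```python
# Smoke test for the blocking conversion risk: round-trip the embedding model through
# sentence-transformers with no real training, then attempt GGUF conversion. If this step
# already fails, the embedding fine-tuning proposal needs a different deployment path.
# Paths are placeholders; the model id may require accepting the license on HuggingFace.
import subprocess
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300M")  # may need manual wrapping (see below)
model.save("checkpoints/embeddinggemma-dummy")

# Attempt conversion with llama.cpp's converter (assumes a local ./llama.cpp checkout).
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py", "checkpoints/embeddinggemma-dummy",
        "--outfile", "models/embeddinggemma-dummy.gguf",
        "--outtype", "q8_0",
    ],
    check=True,
)
```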
Completeness Gaps
- Missing: How QMD re-embeds documents after model swap. Doc D mentions qmd embed (line 348) for re-embedding, but doesn't explain:
  - How long this takes for a typical knowledge base (100s–1000s of documents)
  - Whether it can be done incrementally or must re-process everything
  - Whether the SQLite index needs to be rebuilt
  - What happens if the old and new models have different embedding dimensions
- Missing: Embedding dimension compatibility. If Guillaume switches from embeddinggemma-300M (768-dim) to a fine-tuned model, the dimensions MUST match or the sqlite-vec index will be incompatible. If he switches to Qwen3-Embedding-0.6B (1024-dim), he needs a full re-index. This is not discussed.
- Missing: Testing/evaluation methodology for domain-specific models. Both documents describe training but neither provides a rigorous evaluation framework for measuring whether Indigenous knowledge search actually improved. Needed (see the sketch after this list):
  - A test query set with expected results
  - Metrics (MRR, NDCG, precision@k)
  - A/B comparison methodology between base and fine-tuned models
- Missing: Multi-language considerations. Indigenous communities often have knowledge in Indigenous languages, French (given Guillaume's Québécois context), and English. The documents don't address multilingual embedding challenges or model selection for non-English content.
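As a starting point for the missing evaluation framework, a self-contained sketch of MRR and precision@k over a hand-curated test set. The search_fn callable is a stand-in for whatever retrieval call is being measured (base vs fine-tuned); it is an assumed interface, not a QMD API.

```python
# Sketch of an A/B evaluation harness: MRR and precision@k over a small, hand-curated
# test set. `search_fn` is a stand-in for the retrieval call under test (base model vs
# fine-tuned); it is NOT a QMD API, just the interface this harness assumes.
from typing import Callable, Dict, List

TEST_SET: Dict[str, List[str]] = {
    # query -> ids of documents a knowledgeable reviewer considers relevant
    "traditional medicine plants": ["doc-014", "doc-102"],
    "treaty rights fishing": ["doc-033"],
}

def evaluate(search_fn: Callable[[str], List[str]], k: int = 10) -> Dict[str, float]:
    mrr, prec = 0.0, 0.0
    for query, relevant in TEST_SET.items():
        results = search_fn(query)[:k]
        ranks = [i for i, doc_id in enumerate(results, start=1) if doc_id in relevant]
        mrr += 1.0 / ranks[0] if ranks else 0.0
        prec += len(ranks) / k
    n = len(TEST_SET)
    return {"MRR": mrr / n, f"precision@{k}": prec / n}

# Usage: run the same harness against both models and compare.
# baseline = evaluate(base_search); candidate = evaluate(finetuned_search)
```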
Additional Findings
- Strong: The QMD architecture diagram (lines 43–62) is clear and accurate.
- Strong: The reranker scoring formula (lines 147–148) matches the actual implementation pattern in node-llama-cpp.
- Strong: The version management recommendation (lines 552–577) with symlinked model directories is practical and well-designed.
- Gap: The qmd search "*" --all --json command (line 433) for extracting training data should be verified – the --all flag may not exist in QMD's CLI.
Cross-Document Contradictions
1. Embedding Fine-Tuning Framework Disagreement
Doc C recommends mlx-tune as the primary tool for embedding fine-tuning (lines 58–88), claiming it supports "InfoNCE/contrastive learning" for "BERT, ModernBERT, Qwen3-Embedding, Harrier architectures."
Doc D recommends sentence-transformers on PyTorch (lines 297–323), using MultipleNegativesRankingLoss.
These are two different frameworks with different conversion paths:
- mlx-tune → MLX format → ??? (no documented GGUF export for embedding models)
- sentence-transformers → HuggingFace format → GGUF (untested for encoder-only)
Neither document reconciles these approaches or recommends one definitively. The revision must pick one primary path and document the full end-to-end conversion.
2. "embeddinggemma supports sentence-transformers" vs Reality
Doc D (line 302) shows SentenceTransformer("google/embeddinggemma-300M") as if it's a drop-in. Web search indicates this model is NOT natively published as a sentence-transformers package – it requires manual wrapping with a pooling layer. This complicates the fine-tuning workflow.
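To make the sentence-transformers path concrete, including the manual wrapping just described, a minimal sketch follows. The model id, pooling mode, and toy training pairs are assumptions; whether the underlying encoder loads cleanly through models.Transformer is part of what needs verification.

```python
# Sketch of the sentence-transformers path from contradictions 1 and 2: wrap the encoder
# with an explicit pooling layer (since the model is reportedly not published as a native
# sentence-transformers package), then fine-tune with MultipleNegativesRankingLoss.
# Model id, pooling mode, and the toy pairs are assumptions, not verified settings.
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

word = models.Transformer("google/embeddinggemma-300M")
pool = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word, pool])

train_examples = [
    InputExample(texts=["traditional medicine plants", "Notes on cedar and sweetgrass preparation"]),
    InputExample(texts=["treaty rights fishing", "Summary of fishing provisions in Treaty 3"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("checkpoints/embeddinggemma-domain")
```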
3. QMD Fine-Tune Pipeline: CUDA vs Mac
Doc C implies local Mac training is seamless. Doc D correctly references the finetune/ pipeline but doesn't flag that pyproject.toml has nvidia-ml-py as a hard dependency, making uv run train.py fail on Mac without modification. The documents need to align on this: either provide Mac-specific instructions or clearly state the cloud path is recommended for QMD's existing pipeline.
4. Priority Ordering Discrepancy
Doc C (line 203) lists embedding fine-tuning as item #2 ("✅ FEASIBLE"). Doc D (line 221) lists query expansion as "Priority 1" and embedding as "Priority 2", noting query expansion has the existing pipeline.
These should be reconciled. Doc D's ordering (query expansion first, because infrastructure exists) is the more pragmatic recommendation.
5. Training Data Format
Doc C shows training data as: {"text": "<|user|>\n...\n<|assistant|>\n..."} (line 192)
Doc D shows QMD's actual format as: {"query": "...", "output": [["hyde", "..."], ...]} (line 236)
These are for different purposes (persona LoRA vs query expansion), so not truly contradictory – but the documents don't make this distinction clear enough.
Critical Gaps for Revision
- BLOCKING: Verify the embeddinggemma-300M → fine-tune → GGUF conversion pipeline end-to-end. Test with a trivially fine-tuned model before any Indigenous knowledge training. If this conversion fails, the entire embedding fine-tuning proposal collapses. Provide a fallback plan (e.g., use Qwen3-Embedding-0.6B instead, which may have better GGUF conversion support as a causal-architecture model).
- BLOCKING: Fix the embeddinggemma-300M architecture description. It is an encoder-only transformer, NOT "Gemma 3 (T5Gemma initialization)." This error appears in BOTH documents and affects fine-tuning guidance.
- IMPORTANT: Correct the mlx-tune version to v0.4.19 and verify each capability claim against actual release notes rather than projected features.
- IMPORTANT: Correct Mac Mini pricing. The $1,800 figure is for 48GB/512GB only. With 1TB SSD: ~$2,200–$2,500. The budget recommendation should specify the exact configuration.
- IMPORTANT: Address the QMD fine-tune pipeline's CUDA dependency. Either (a) provide a modified pyproject.toml and config for Mac MPS, or (b) clearly recommend HuggingFace Jobs ($1.50/run) as the path of least resistance, or (c) document rewriting the training to use MLX-LM.
- IMPORTANT: Provide a concrete evaluation framework for measuring whether domain-specific fine-tuning improved search quality. Include test queries, expected results, and metrics.
- MODERATE: Reconcile the embedding fine-tuning framework recommendation – pick either mlx-tune or sentence-transformers and document the full path including GGUF conversion.
- MODERATE: Address embedding dimension compatibility when swapping models in QMD. Document what happens with sqlite-vec when dimensions change.
- MODERATE: Add data preparation guidance specific to Indigenous knowledge. Both documents say "create 500–5,000 pairs" but provide no guidance on:
  - How to extract training pairs from an existing QMD knowledge base programmatically
  - Quality criteria for Indigenous knowledge training pairs
  - How to handle sacred/restricted knowledge that should NOT be in training data
  - Language diversity in the training set
- MODERATE: Add error handling and rollback to the weekend training automation. The current train.sh has no validation gates, no notification, no rollback capability.
- MINOR: Update the LoRA CLI example from python lora.py to python -m mlx_lm.lora.
- MINOR: Clarify MLX vs MLX-LM version numbers – they are separate packages with separate versioning.
Data Sovereignty Considerations
Both documents are significantly deficient in this area. For an Indigenous-AI Collaborative Platform, data sovereignty is not a nice-to-have – it is a foundational requirement. Issues that must be addressed:
1. Training Data Containment
- Fine-tuning creates model weights that encode training data patterns. If Indigenous knowledge is used for training, the resulting model weights carry that knowledge. Who owns these weights? Can they be shared?
- The documents recommend pushing models to HuggingFace Hub. Sacred or restricted knowledge embedded in model weights would be publicly exposed. The revision must explicitly warn against this and recommend private/local-only model storage.
2. Cloud Training Risks
- Doc D recommends HuggingFace Jobs for training (~$1.50/run). This sends training data to HuggingFace's cloud infrastructure. For culturally sensitive Indigenous knowledge, this may violate data sovereignty principles.
- The revision should clearly distinguish between paths that keep data local (MLX-LM on Mac, PyTorch MPS) and paths that send data to cloud (HuggingFace Jobs).
3. Knowledge Categories
- Not all Indigenous knowledge can or should be used for training. The documents should recommend:
- A classification system for knowledge: Public Teaching / Community Knowledge / Sacred/Restricted (a filter sketch follows this list)
- Only using "Public Teaching" category content for model training
- Community consultation before any training on shared knowledge
- OCAP principles (Ownership, Control, Access, Possession) applied to training data and model weights
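One way to operationalize the classification recommendation during data preparation is sketched below. The access_level field and its labels are hypothetical; the actual scheme, and which categories are ever trainable, must come from community consultation.

```python
# Sketch of a training-data gate based on the recommended knowledge classification.
# The "access_level" field and its labels are hypothetical; the actual scheme and the
# list of trainable categories must come from community consultation (OCAP).
import json

TRAINABLE = {"public_teaching"}  # only this category ever reaches a training set

def filter_training_records(in_path: str, out_path: str) -> int:
    kept = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if record.get("access_level", "restricted") in TRAINABLE:  # default to excluded
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept

if __name__ == "__main__":
    n = filter_training_records("knowledge_pairs.jsonl", "training_pairs.public.jsonl")
    print(f"kept {n} records classified as public teaching")
```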
4. Model Weight Sovereignty
- Fine-tuned model weights should be treated as derived works of the training data. Storage, distribution, and access controls should match the sensitivity level of the source knowledge.
- Local-only deployment (Mac Mini) inherently respects data sovereignty better than cloud deployment.
5. Consent and Attribution
- If using community-sourced Indigenous knowledge for training, the process should include:
- Free, prior, and informed consent from knowledge holders
- Attribution mechanisms in the model metadata
- Right to withdraw (ability to retrain without specific contributions)
Verified Facts
The following claims were confirmed through source code inspection and/or web search:
| Fact | Verification Method |
|---|---|
| QMD uses exactly 3 GGUF models via node-llama-cpp | ✅ Verified in src/llm.ts lines 196–199 |
| All 3 models swappable via env vars | ✅ Verified in src/llm.ts lines 504–506 |
| QMD version is 2.1.0 | ✅ Verified in package.json |
| node-llama-cpp version is 3.18.1 | ✅ Verified in package.json |
| sqlite-vec version is 0.1.9 | ✅ Verified in package.json |
| Complete fine-tuning pipeline exists in finetune/ | ✅ Verified: 20+ files including train.py, eval.py, convert_gguf.py, dataset/, configs/ |
| Fine-tuning uses LoRA SFT on Qwen3-1.7B | ✅ Verified in finetune/configs/sft.yaml |
| LoRA config: rank 16, alpha 32, all projection layers | ✅ Verified in finetune/configs/sft.yaml |
| ~2,290 training examples | ✅ Stated in finetune/README.md |
| Training pipeline uses trl/peft/transformers | ✅ Verified in finetune/pyproject.toml |
| MLX-LM latest version is v0.31.1 (March 2026) | ✅ Verified via web search (PyPI) |
| mlx-tune latest version is v0.4.19 | ✅ Verified via web search (PyPI, GitHub) |
| Mac Mini M4 Pro 48GB/512GB pricing ~$1,799 | ✅ Verified via multiple retailers |
| M4 Pro has 20 GPU cores, 273 GB/s bandwidth | ✅ Verified via Apple specs |
| QMD embeddinggemma uses nomic-style task prefix formatting | ✅ Verified in src/llm.ts formatQueryForEmbedding() |
| QMD supports Qwen3-Embedding alternative with auto-detection | ✅ Verified in src/llm.ts isQwen3EmbeddingModel() regex |
| QMD license is MIT | ✅ Verified in package.json |
Updated/Corrected Data
Version Corrections
| Item | Document Value | Corrected Value | Source |
|---|---|---|---|
| mlx-tune version | v0.4.21 | v0.4.19 | PyPI, GitHub releases |
| MLX framework reference | "MLX 0.20+" | MLX-LM v0.31.1 (separate from MLX framework) | PyPI |
Architecture Corrections
| Item | Document Value | Corrected Value | Source |
|---|---|---|---|
| EmbeddingGemma-300M architecture | "Gemma 3 (T5Gemma initialization)" | Encoder-only transformer (BERT-like) | HuggingFace model card, web search |
Pricing Corrections
| Item | Document Value | Corrected Value | Source |
|---|---|---|---|
| Mac Mini M4 Pro 48GB/1TB | ~$1,800 | ~$2,200–$2,500 | Apple Store, B&H, Micro Center |
| Mac Mini M4 Pro 48GB/512GB | (not specified separately) | ~$1,799 | Multiple retailers |
Pipeline Corrections
| Item | Document Claim | Actual Status | Impact |
|---|---|---|---|
| QMD fine-tune pipeline runs on Mac | Implied possible | CUDA-dependent (nvidia-ml-py in deps, A10G target hardware) | Must modify for Mac or use cloud |
| SentenceTransformer("google/embeddinggemma-300M") direct load | Shown as working | Requires manual wrapping (not a native ST model) | Complicates embedding fine-tuning |
| Fine-tuned embeddinggemma β GGUF conversion | Presented as straightforward | Experimental, unverified for fine-tuned encoder models | Blocking risk for embedding improvement |
Review completed April 15, 2026. All source code references verified against QMD commit cfd640e.