Agent D: QMD Models & Fine-Tuning Potential for Indigenous Knowledge

IAIP Research

Research Date: April 15, 2026
Researcher: Agent D (QMD Internals + HuggingFace Models)
Subject: QMD architecture, exact model inventory, and domain-specific fine-tuning for Guillaume Descoteaux-Isabelle's Indigenous knowledge corpus


Key Findings

  1. QMD uses exactly 3 GGUF models via node-llama-cpp, all running locally with no cloud APIs:

    • Embedding: embeddinggemma-300M (Google, ~300MB, Q8_0 quantization)
    • Reranking: Qwen3-Reranker-0.6B (Alibaba, ~640MB, Q8_0 quantization)
    • Query Expansion: qmd-query-expansion-1.7B (Tobi's custom SFT of Qwen3-1.7B, ~1.1GB, Q4_K_M)
  2. All 3 models can be swapped via environment variables (QMD_EMBED_MODEL, QMD_RERANK_MODEL, QMD_GENERATE_MODEL). This is the primary mechanism for deploying fine-tuned models.

  3. QMD already has a complete fine-tuning pipeline in its finetune/ directory, but only for the query expansion model. This pipeline uses LoRA SFT on Qwen3-1.7B with HuggingFace's trl/peft libraries and converts the result to GGUF for deployment.

  4. The embedding model (embeddinggemma-300M) is the highest-impact fine-tuning target for domain-specific knowledge. It determines what QMD considers "semantically similar"; fine-tuning it on Indigenous knowledge pairs would directly improve vector search quality.

  5. The reranker (Qwen3-Reranker-0.6B) is the second-highest-impact target: it decides final result ordering. Domain-specific reranking training (yes/no relevance judgments on Indigenous queries) would improve result quality.

  6. Mac Mini M4 can handle fine-tuning of all three models via PyTorch MPS backend or Apple MLX. The models are small enough (300M–1.7B params) for LoRA fine-tuning on 16–64GB unified memory.


QMD Architecture Overview

What QMD Is

QMD ("Query Markup Documents") is an on-device local search engine for markdown files, created by Tobi Lütke (Shopify founder). It indexes markdown notes, meeting transcripts, and documentation, then searches them using a hybrid pipeline combining three techniques, all running locally with no cloud dependencies.

  • License: MIT
  • Runtime: Node.js 22+ or Bun
  • Package: @tobilu/qmd (npm, v2.1.0 as of research date)
  • Storage: SQLite (FTS5 + sqlite-vec extension)
  • LLM Runtime: node-llama-cpp 3.18.1 (bindings for llama.cpp)
  • Models: Auto-downloaded GGUF files from HuggingFace, cached at ~/.cache/qmd/models/

Search Pipeline Architecture

QMD implements a 3-stage hybrid search pipeline:

User Query
    │
    ├─► Query Expansion (Qwen3-1.7B fine-tuned)
    │   Produces: hyde: / lex: / vec: structured expansions
    │
    ├─► BM25 Full-Text Search (SQLite FTS5)
    │   Fast keyword matching, no LLM needed
    │
    ├─► Vector Similarity Search (embeddinggemma-300M)
    │   Semantic search via cosine distance on embeddings
    │
    ├─► Reciprocal Rank Fusion (RRF)
    │   Combines BM25 + vector results with position weighting
    │
    └─► LLM Reranking (Qwen3-Reranker-0.6B)
        Yes/No logprob-based relevance scoring
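
Reciprocal Rank Fusion itself is simple: every ranked list contributes 1/(k + rank) per document, and the summed scores are re-sorted. A minimal sketch using the conventional k=60 (QMD's exact constant and position weighting are not shown here):

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion over several ranked lists of document IDs.

    k=60 is the conventional smoothing constant from the RRF literature;
    QMD's exact value and position weighting may differ.
    """
    scores = {}
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks high in both lists, so it wins the fused ordering.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = rrf_fuse([bm25_hits, vector_hits])  # ["b", "a", "d", "c"]
```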

Key Technical Details

  • Chunking: 900 tokens/chunk with 15% overlap, prefers markdown heading boundaries
  • AST-aware chunking: Optional tree-sitter based chunking for code files (.ts/.js/.py/.go/.rs)
  • Embedding dimension: 768 (with MRL support for 512/256/128 truncation)
  • Score fusion: Position-aware blending, where the top 1-3 results get 75% RRF weight, 4-10 get 60%, and 11+ get 40%
  • Context system: Hierarchical metadata that travels with search results, improving LLM contextual understanding
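
The position-aware score fusion above can be sketched as a simple weight schedule. Only the 75/60/40 weight bands come from the source; the linear blend form and the function name are illustrative assumptions:

```python
def blend_score(position, rrf_score, rerank_score):
    """Blend RRF and reranker scores with a position-dependent RRF weight:
    positions 1-3 keep 75% RRF weight, 4-10 keep 60%, 11+ keep 40%.
    The linear blend itself is an assumption about the exact formula.
    """
    if position <= 3:
        w = 0.75
    elif position <= 10:
        w = 0.60
    else:
        w = 0.40
    return w * rrf_score + (1 - w) * rerank_score
```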

QMD MCP Server

QMD exposes an MCP (Model Context Protocol) server for AI agent integration:

  • Transport: stdio (default) or HTTP (:8181)
  • Tools exposed: query, get, multi_get, status
  • HTTP mode: Models stay loaded in VRAM across requests, contexts auto-disposed after 5 min idle
  • Config: qmd mcp (stdio) or qmd mcp --http --daemon (HTTP background)

HuggingFace Models Used by QMD

Model 1: Embedding Model β€” embeddinggemma-300M

| Property | Value |
| --- | --- |
| GGUF URI | hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf |
| Source model | google/embeddinggemma-300M |
| Parameters | 300M |
| Architecture | Gemma 3 (T5Gemma initialization) |
| Embedding dim | 768 (MRL: 512/256/128) |
| Context length | 2048 tokens |
| Quantization | Q8_0 (~300MB on disk) |
| Training data | ~320B tokens, 100+ languages, web + code + synthetic |
| Precision note | Does NOT support float16; requires float32 or bfloat16 |
| Paper | EmbeddingGemma: Powerful and Lightweight Text Representations |
| Override env var | QMD_EMBED_MODEL |

How QMD formats text for embedding:

For queries:

task: search result | query: {query}

For documents:

title: {title} | text: {text}

QMD also supports Qwen3-Embedding format (auto-detected via regex on model URI):

Instruct: Retrieve relevant documents for the given query
Query: {query}
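
The templates above can be wrapped in one small helper. A minimal sketch; the function name and the `none` title fallback are assumptions, not QMD's source:

```python
def format_for_embedding(text, kind="document", title=None, qwen3=False):
    """Apply the embedding prompt templates shown above.

    kind="query" uses the query template; anything else uses the document
    template. qwen3=True selects the Qwen3-Embedding format. The helper
    name and the "none" title fallback are illustrative assumptions.
    """
    if kind == "query":
        if qwen3:
            return ("Instruct: Retrieve relevant documents for the given query\n"
                    f"Query: {text}")
        return f"task: search result | query: {text}"
    return f"title: {title or 'none'} | text: {text}"
```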

Alternative embedding model supported:

  • Qwen/Qwen3-Embedding-0.6B – can be used by setting QMD_EMBED_MODEL=hf:Qwen/Qwen3-Embedding-0.6B-GGUF/...

Model 2: Reranking Model β€” Qwen3-Reranker-0.6B

| Property | Value |
| --- | --- |
| GGUF URI | hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf |
| Source model | Qwen/Qwen3-Reranker-0.6B |
| Parameters | 0.6B (600M) |
| Architecture | Qwen3 (28 layers, transformer) |
| Context length | 32K tokens |
| Quantization | Q8_0 (~640MB on disk) |
| Languages | 100+ languages |
| Override env var | QMD_RERANK_MODEL |

How the reranker works:

The reranker is a causal LM that produces yes/no logprob scores:

System: Judge whether the Document meets the requirements based on the
        Query and the Instruct provided. Answer "yes" or "no".
User: <Instruct>: {instruction}
      <Query>: {query}
      <Document>: {document}

It outputs logprobs for "yes" and "no" tokens, computing: score = exp(logprob_yes) / (exp(logprob_yes) + exp(logprob_no))
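
In code, that score is just a two-way softmax over the two log-probabilities. A minimal sketch:

```python
import math

def rerank_score(logprob_yes, logprob_no):
    """Two-way softmax over the "yes"/"no" token log-probabilities,
    matching the formula above: exp(lp_yes) / (exp(lp_yes) + exp(lp_no)).
    """
    p_yes = math.exp(logprob_yes)
    p_no = math.exp(logprob_no)
    return p_yes / (p_yes + p_no)

# Equal logprobs give 0.5; a much likelier "yes" pushes the score toward 1.
```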

Model 3: Query Expansion Model β€” qmd-query-expansion-1.7B

| Property | Value |
| --- | --- |
| GGUF URI | hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf |
| Source/base model | Qwen/Qwen3-1.7B |
| Parameters | ~2B (1.7B base + merged LoRA) |
| Quantization | Q4_K_M (~1.1GB on disk) |
| Training method | LoRA SFT (rank 16, alpha 32, all projection layers) |
| Training data | ~2,290 examples |
| Override env var | QMD_GENERATE_MODEL |
| HF repos | tobil/qmd-query-expansion-1.7B (merged), tobil/qmd-query-expansion-1.7B-gguf (GGUF), tobil/qmd-query-expansion-1.7B-sft (adapter) |

Prompt format (Qwen3 chat template):

<|im_start|>user
/no_think Expand this search query: {query}<|im_end|>
<|im_start|>assistant

Output format:

hyde: A hypothetical document passage that answers the query
lex: keyword1 keyword2
lex: another keyword variation
vec: natural language semantic query
vec: alternative semantic reformulation
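
On the consuming side, this line-prefixed format is easy to split into buckets. A sketch parser; QMD's actual parsing may be stricter:

```python
def parse_expansion(text):
    """Group hyde:/lex:/vec: lines from the expansion output into lists.
    Lines with other prefixes (or no prefix) are ignored. Sketch only;
    QMD's real parser may handle edge cases differently.
    """
    out = {"hyde": [], "lex": [], "vec": []}
    for line in text.splitlines():
        prefix, _, rest = line.partition(":")
        if prefix.strip() in out and rest.strip():
            out[prefix.strip()].append(rest.strip())
    return out

sample = (
    "hyde: A hypothetical document passage that answers the query\n"
    "lex: keyword1 keyword2\n"
    "lex: another keyword variation\n"
    "vec: natural language semantic query"
)
parsed = parse_expansion(sample)
```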

Model Size Summary

| Model | Params | Disk Size | VRAM Usage | Role |
| --- | --- | --- | --- | --- |
| embeddinggemma-300M (Q8_0) | 300M | ~300MB | ~400MB | Embedding |
| Qwen3-Reranker-0.6B (Q8_0) | 600M | ~640MB | ~700MB | Reranking |
| qmd-query-expansion-1.7B (Q4_K_M) | 1.7B | ~1.1GB | ~1.5GB | Query Expansion |
| Total | ~2.6B | ~2GB | ~2.6GB | |

Can Models Be Swapped?

Yes. All three models can be overridden via environment variables:

# Use a custom fine-tuned embedding model
export QMD_EMBED_MODEL="/path/to/my-domain-embeddings.gguf"
# or from HuggingFace
export QMD_EMBED_MODEL="hf:myuser/my-model-GGUF/model.gguf"

# Use a custom reranker
export QMD_RERANK_MODEL="/path/to/my-reranker.gguf"

# Use a custom query expansion model
export QMD_GENERATE_MODEL="/path/to/my-expander.gguf"

Or via the SDK constructor:

const store = await createStore({
  dbPath: './index.sqlite',
  llm: new LlamaCpp({
    embedModel: '/path/to/custom-embed.gguf',
    rerankModel: '/path/to/custom-reranker.gguf',
    generateModel: '/path/to/custom-expander.gguf',
  }),
});

Fine-Tuning Potential for Domain Knowledge

Priority 1: Fine-Tune the Query Expansion Model (Highest Feasibility)

Why: QMD already has a complete, production-grade fine-tuning pipeline for this model. You can add domain-specific query expansion examples that teach the model how Indigenous knowledge concepts should be expanded.

Impact: Medium-High. When Guillaume searches for "medicine wheel", the expansion model would generate domain-aware expansions like:

hyde: The Medicine Wheel is a sacred circle representing the four directions, seasons, and stages of life in Indigenous cosmology. Each direction carries specific teachings and ceremonial significance.
lex: medicine wheel four directions
lex: sacred circle indigenous ceremony
vec: what are the teachings of the medicine wheel in indigenous tradition
vec: ceremonial significance of the four directions

What you need β€” Training data format (JSONL):

{"query": "medicine wheel", "output": [["hyde", "The Medicine Wheel represents..."], ["lex", "sacred circle four directions"], ["lex", "indigenous cosmology ceremony"], ["vec", "what are the teachings of the medicine wheel"], ["vec", "ceremonial significance of four directions and seasons"]]}
{"query": "relational accountability", "output": [["hyde", "Relational accountability in Indigenous research..."], ["lex", "relational accountability indigenous"], ["lex", "research as ceremony relationships"], ["vec", "what is relational accountability in indigenous methodology"], ["vec", "how does ceremonial research differ from extractive research"]]}
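
Before feeding files to the pipeline, it is worth sanity-checking each JSONL line against this shape. The authoritative schema is the Pydantic model in finetune/dataset/schema.py; the check below is only a light stand-in:

```python
import json

VALID_KINDS = {"hyde", "lex", "vec"}

def validate_jsonl_line(line):
    """Light pre-flight check for the query/output JSONL shape shown above.
    Not a substitute for QMD's own Pydantic schema validation.
    """
    ex = json.loads(line)
    assert isinstance(ex.get("query"), str) and ex["query"].strip(), "missing query"
    assert isinstance(ex.get("output"), list) and ex["output"], "missing output"
    for kind, text in ex["output"]:
        assert kind in VALID_KINDS, f"unknown expansion kind: {kind}"
        assert isinstance(text, str) and text.strip(), "empty expansion text"
    return ex
```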

How many examples needed: QMD's own model was trained on ~2,290 examples. For domain adaptation, 200-500 high-quality examples covering your key concepts would be a strong start; plan on at least 100 examples to see measurable improvement.

Exact training procedure (using QMD's own pipeline):

# 1. Clone QMD repo
git clone https://github.com/tobi/qmd.git
cd qmd/finetune

# 2. Create your domain training data
# Add your JSONL files to data/indigenous-knowledge.jsonl

# 3. Prepare data (dedup, format for Qwen3 chat template, split)
uv run dataset/prepare_data.py

# 4. Validate
just validate

# 5. Train locally (requires CUDA GPU)
uv run train.py sft --config configs/sft.yaml

# 5b. Or train via HuggingFace Jobs (~$1.50 for A10G, ~45 min)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py

# 6. Evaluate
uv run eval.py outputs/sft

# 7. Convert to GGUF
uv run convert_gguf.py --size 1.7B

# 8. Deploy into QMD
export QMD_GENERATE_MODEL="/path/to/your-indigenous-expansion-1.7B-q4_k_m.gguf"

Priority 2: Fine-Tune the Embedding Model (Highest Impact)

Why: The embedding model determines what QMD considers semantically similar. The default embeddinggemma-300M was trained on general web text; it has no understanding of Indigenous knowledge concepts, relational science terminology, or ceremonial technology vocabulary. Fine-tuning it on domain pairs would dramatically improve vector search quality.

Impact: Very High. This is the single most impactful improvement. After fine-tuning, searching "relational accountability" would surface documents about "research as ceremony" and "Four Directions" rather than generic accountability documents.

Challenge: EmbeddingGemma-300M uses Sentence Transformers for training but QMD consumes GGUF format via llama.cpp. The workflow requires:

  1. Fine-tune the PyTorch model using sentence-transformers
  2. Convert back to GGUF
  3. Deploy via QMD_EMBED_MODEL

Training data format for embedding fine-tuning (pairs):

# Positive pairs (semantically similar)
training_pairs = [
    ("medicine wheel", "The sacred circle represents the four directions and life stages"),
    ("relational accountability", "Research as ceremony requires maintaining relationships with all participants"),
    ("seven grandfather teachings", "Wisdom, love, respect, bravery, honesty, humility, and truth guide behavior"),
    ("structural tension", "The creative force between current reality and desired vision"),
    ("two-eyed seeing", "Integrating Indigenous and Western knowledge systems"),
]

Training script for embedding fine-tuning:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load the base embedding model
model = SentenceTransformer("google/embeddinggemma-300M")

# Prepare training data
train_examples = [
    InputExample(texts=[query, positive_doc])
    for query, positive_doc in training_pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss - best for retrieval fine-tuning
# Each pair is a positive; other items in the batch are treated as negatives
train_loss = losses.MultipleNegativesRankingLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=50,
    output_path="./indigenous-embeddinggemma-300M",
    show_progress_bar=True,
)

How many examples needed:

  • Minimum viable: 500 query-document pairs
  • Good quality: 2,000-5,000 pairs
  • Excellent: 10,000+ pairs
  • Data augmentation tip: Use an LLM to generate paraphrases of your existing knowledge base documents to multiply your training data

Converting fine-tuned embedding model to GGUF:

This is the hardest step. EmbeddingGemma is a Gemma 3 architecture model and requires:

  1. Save the fine-tuned model in HuggingFace format
  2. Use llama.cpp/convert_hf_to_gguf.py to convert to GGUF
  3. Quantize with llama-quantize to Q8_0
# After fine-tuning:
python convert_hf_to_gguf.py ./indigenous-embeddinggemma-300M \
  --outfile indigenous-embeddinggemma-300M-f16.gguf --outtype f16

llama-quantize indigenous-embeddinggemma-300M-f16.gguf \
  indigenous-embeddinggemma-300M-Q8_0.gguf Q8_0

# Deploy
export QMD_EMBED_MODEL="./indigenous-embeddinggemma-300M-Q8_0.gguf"
qmd embed  # Re-embed all documents with the new model

⚠️ Critical Note: After changing the embedding model, you MUST re-run qmd embed to regenerate all embeddings. The old embeddings are incompatible with the new model.

Priority 3: Fine-Tune the Reranker (Advanced)

Why: The reranker makes final relevance judgments. Teaching it domain-specific relevance would improve result ordering.

Impact: Medium. Improves ranking quality but only affects the query command (not search or vsearch).

Training approach: The Qwen3-Reranker is a causal LM fine-tuned for yes/no relevance judgments. Domain adaptation would require:

  1. Curate query-document pairs labeled as relevant/not-relevant
  2. Fine-tune using the same yes/no judgment prompt format
  3. Convert to GGUF and deploy

Training data format:

{"query": "four directions teachings", "document": "The Medicine Wheel maps...", "relevant": true}
{"query": "four directions teachings", "document": "GPS navigation uses four...", "relevant": false}

This is more advanced and should be attempted after the embedding and expansion models show improvement.
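
To make steps 1-3 concrete, each labeled pair from the JSONL above can be rendered into a chat-style SFT example that reuses the yes/no judgment prompt shown earlier. A sketch only; the message layout and default instruction string are assumptions, not a documented Qwen recipe:

```python
SYSTEM_PROMPT = ('Judge whether the Document meets the requirements based on the '
                 'Query and the Instruct provided. Answer "yes" or "no".')

def build_rerank_example(query, document, relevant,
                         instruction="Retrieve relevant documents for the given query"):
    """Render one labeled pair as a chat-style SFT example in the yes/no
    judgment format. The default instruction and the messages layout are
    illustrative assumptions.
    """
    user = (f"<Instruct>: {instruction}\n"
            f"<Query>: {query}\n"
            f"<Document>: {document}")
    return {"messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
        {"role": "assistant", "content": "yes" if relevant else "no"},
    ]}
```

Examples serialized this way could plausibly feed the same trl SFT tooling used for the query expansion model, though QMD does not document a reranker training path.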


Other Trainable Models for the Use Case

1. NER Model for Indigenous Terminology Extraction

Purpose: Automatically identify and tag Indigenous concepts, place names, ceremony names, and relational terms in documents.

Recommended base: dslim/bert-base-NER or Jean-Baptiste/camembert-ner (for French-language content)

Training approach:

  • Annotate 200-500 documents with custom entity types: CEREMONY, DIRECTION, TEACHING, PLACE, RELATION
  • Fine-tune with HuggingFace's token-classification pipeline
  • Runs on Mac Mini M4 easily (BERT-base is 110M params)

Impact: Could auto-tag QMD documents with rich metadata, improving search context.

2. Document Classification Model

Purpose: Auto-categorize documents into domains (Ceremony, Teaching, Governance, Land, Language, etc.)

Recommended base: distilbert-base-uncased (66M params) or google/embeddinggemma-300M with a classification head

Training approach:

  • Label 100-300 documents by category
  • Fine-tune a text classification model
  • Use as a pre-processing step when indexing into QMD

3. Summarization Model via Ollama + LoRA

Purpose: Generate culturally appropriate summaries of Indigenous knowledge documents.

Recommended approach:

  • Start with Qwen3-1.7B or Qwen3-4B via Ollama
  • Apply LoRA fine-tuning using mlx-lm on Mac Mini
  • Train on document → summary pairs from the knowledge base

Why Qwen3: QMD already uses the Qwen3 family, so there's architectural consistency. The fine-tuned model could also serve double duty as a better query expansion model.

4. Ollama Models for LoRA Adaptation

Models that can be LoRA-adapted for Indigenous knowledge use cases via mlx-lm on Mac Mini:

| Model | Size | Use Case | Mac Mini Feasibility |
| --- | --- | --- | --- |
| Qwen3-1.7B | 1.7B | Query expansion, summarization | ✅ Easy (16GB RAM) |
| Qwen3-4B | 4B | Better quality summaries | ✅ Comfortable (32GB RAM) |
| Gemma 3-4B | 4B | General understanding | ✅ Comfortable (32GB RAM) |
| Llama 3.2-3B | 3B | General purpose | ✅ Easy (16GB RAM) |
| Mistral-7B | 7B | High quality generation | ⚠️ Needs 32GB+ RAM |

Practical Fine-Tuning Workflow

Step 1: Data Preparation from Existing Knowledge Base

Assuming Guillaume has a QMD-indexed knowledge base of Indigenous wisdom documents:

# Extract all documents from QMD for training data preparation
qmd search "*" --all --json -c indigenous-knowledge > all_docs.json

# Or use multi-get to retrieve full documents
qmd multi-get "**/*.md" > all_documents.txt

Generating training pairs from existing documents:

Use an LLM (Claude, GPT-4, or a local model) to generate training data from your existing corpus:

# Pseudocode for generating query-expansion training data
for document in corpus:
    # Generate likely search queries for this document
    queries = llm.generate(f"""
        Given this document about Indigenous knowledge:
        {document.text[:500]}
        
        Generate 3 realistic search queries someone might use to find this document.
    """)
    
    for query in queries:
        # Generate expansion in QMD format
        expansion = {
            "query": query,
            "output": [
                ["hyde", document.text[:200]],  # First 200 chars as hypothetical doc
                ["lex", extract_keywords(query)],
                ["vec", rephrase_as_question(query)],
            ]
        }
        save_to_jsonl(expansion)

For embedding model fine-tuning pairs:

# Generate query-document pairs from your knowledge base
pairs = []
for doc in corpus:
    # Each document's title/heading → content is a natural positive pair
    pairs.append((doc.title, doc.content[:500]))
    
    # Generate synthetic queries for the document
    synthetic_queries = llm.generate(f"Generate 3 search queries for: {doc.content[:300]}")
    for q in synthetic_queries:
        pairs.append((q, doc.content[:500]))

Step 2: Training Pipeline on Apple Silicon

For Query Expansion Model (Recommended First)

Option A: HuggingFace Jobs (easiest, ~$1.50)

cd qmd/finetune
# Place your JSONL data in data/
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py

Option B: Local on Mac Mini M4 via MLX

# Install mlx-lm
pip install mlx-lm

# Convert Qwen3-1.7B to MLX format
python -m mlx_lm.convert --hf-path Qwen/Qwen3-1.7B --mlx-path mlx_models/Qwen3-1.7B

# LoRA fine-tune
python -m mlx_lm.lora \
  --model mlx_models/Qwen3-1.7B \
  --train \
  --data ./data/train/ \
  --batch-size 2 \
  --lora-layers 16 \
  --iters 1000

Option C: Local on Mac Mini M4 via PyTorch MPS

cd qmd/finetune
# Modify configs/sft_local.yaml to use device: "mps" instead of "cuda"
# Reduce batch_size to 1-2 for memory
uv run train.py sft --config configs/sft_local.yaml

For Embedding Model

# Install sentence-transformers
pip install sentence-transformers torch

# Training script (saves to ./output/)
python train_indigenous_embeddings.py

# Convert to GGUF (requires llama.cpp)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build && cmake --build build --config Release
python convert_hf_to_gguf.py ../output/ --outfile indigenous-embed-f16.gguf --outtype f16
./build/bin/llama-quantize indigenous-embed-f16.gguf indigenous-embed-Q8_0.gguf Q8_0

Step 3: Deploying Fine-Tuned Models Back into QMD

# Option 1: Environment variables (simple, per-session)
export QMD_EMBED_MODEL="$HOME/models/indigenous-embeddinggemma-Q8_0.gguf"
export QMD_GENERATE_MODEL="$HOME/models/indigenous-query-expansion-q4_k_m.gguf"

# Option 2: Shell profile (persistent)
echo 'export QMD_EMBED_MODEL="$HOME/models/indigenous-embeddinggemma-Q8_0.gguf"' >> ~/.zshrc
echo 'export QMD_GENERATE_MODEL="$HOME/models/indigenous-query-expansion-q4_k_m.gguf"' >> ~/.zshrc

# CRITICAL: Re-embed all documents after changing the embedding model
qmd embed  # This will re-generate all vector embeddings

Step 4: Version Management of Fine-Tuned Models

Recommended directory structure:

~/models/qmd-indigenous/
├── v1/
│   ├── indigenous-embed-Q8_0.gguf
│   ├── indigenous-expansion-q4_k_m.gguf
│   └── training-metadata.json  # training date, data size, eval scores
├── v2/
│   ├── indigenous-embed-Q8_0.gguf
│   └── ...
└── current -> v2/  # symlink to active version

# Switch versions
export QMD_EMBED_MODEL="$HOME/models/qmd-indigenous/current/indigenous-embed-Q8_0.gguf"

Evaluation workflow before deploying:

# Test search quality with domain-specific queries
qmd query "medicine wheel teachings" --json
qmd query "relational accountability research" --json
qmd query "four directions ceremony" --json
# Compare results against expected documents

Evidence Quality

| Claim | Evidence Level | Source |
| --- | --- | --- |
| Exact model URIs used by QMD | ✅ Verified in source code | src/llm.ts lines defining DEFAULT_EMBED_MODEL, DEFAULT_RERANK_MODEL, DEFAULT_GENERATE_MODEL |
| Models can be swapped via env vars | ✅ Verified in source code | src/llm.ts constructor reads QMD_EMBED_MODEL, QMD_RERANK_MODEL, QMD_GENERATE_MODEL |
| EmbeddingGemma is 300M params, 768-dim | ✅ Verified on model card | google/embeddinggemma-300M HuggingFace page |
| Qwen3-Reranker is 0.6B, 28 layers, 32K context | ✅ Verified on model card | Qwen/Qwen3-Reranker-0.6B HuggingFace page |
| QMD fine-tuning pipeline uses LoRA SFT on Qwen3-1.7B | ✅ Verified in source code | finetune/README.md, finetune/configs/sft.yaml, finetune/train.py |
| Training data schema is JSONL with query/output pairs | ✅ Verified in source code | finetune/dataset/schema.py (Pydantic model TrainingExample) |
| ~2,290 training examples for query expansion | ✅ Stated in finetune README | finetune/README.md training results table |
| Sentence-transformers fine-tuning workflow | ⚠️ Standard practice, not QMD-specific | sentence-transformers documentation; not tested against embeddinggemma specifically |
| GGUF conversion of fine-tuned embeddinggemma | ⚠️ Theoretically viable, not tested | llama.cpp supports Gemma 3 architecture, but conversion of fine-tuned ST models needs validation |
| Mac Mini M4 can handle LoRA fine-tuning | ⚠️ Widely reported, not personally verified | Community reports of MLX and MPS fine-tuning on M-series Macs |
| Reranker fine-tuning feasibility | ⚠️ Inferred from architecture | Qwen3-Reranker is a standard causal LM; LoRA fine-tuning is standard but not documented in QMD |

Sources

Primary Sources (Verified in Source Code)

  1. QMD GitHub Repository – https://github.com/tobi/qmd (commit cfd640e)
    • src/llm.ts – Model URIs, configuration, embedding formatting
    • CLAUDE.md – Architecture overview, commands
    • package.json – Dependencies (node-llama-cpp 3.18.1, sqlite-vec 0.1.9)
    • finetune/README.md – Complete fine-tuning documentation
    • finetune/CLAUDE.md – Fine-tuning pipeline instructions
    • finetune/dataset/schema.py – Training data schema (Pydantic)
    • finetune/configs/sft.yaml – SFT hyperparameters
    • finetune/configs/sft_local.yaml – Local training config
    • finetune/convert_gguf.py – GGUF conversion script
    • finetune/Justfile – Training commands
    • finetune/pyproject.toml – Python dependencies for fine-tuning

HuggingFace Model Cards

  1. google/embeddinggemma-300M – https://huggingface.co/google/embeddinggemma-300M
  2. ggml-org/embeddinggemma-300M-GGUF – https://huggingface.co/ggml-org/embeddinggemma-300M-GGUF
  3. Qwen/Qwen3-Reranker-0.6B – https://huggingface.co/Qwen/Qwen3-Reranker-0.6B
  4. ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF – https://huggingface.co/ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF
  5. tobil/qmd-query-expansion-1.7B – https://huggingface.co/tobil/qmd-query-expansion-1.7B
  6. tobil/qmd-query-expansion-1.7B-gguf – https://huggingface.co/tobil/qmd-query-expansion-1.7B-gguf (inferred)

Technical References

  1. sentence-transformers documentation – https://www.sbert.net/docs/training/overview.html
  2. EmbeddingGemma paper – https://arxiv.org/abs/2509.20354
  3. HyDE technique – https://arxiv.org/abs/2212.10496
  4. MLX library – https://github.com/ml-explore/mlx
  5. mlx-lm – https://github.com/ml-explore/mlx-lm
  6. node-llama-cpp – https://node-llama-cpp.withcat.ai/
  7. llama.cpp GGUF conversion – https://github.com/ggerganov/llama.cpp

DeepWiki / Articles

  1. QMD DeepWiki – https://deepwiki.com/tobi/qmd/2-getting-started
  2. QMD Medium article – https://medium.com/coding-nexus/qmd-local-hybrid-search-engine-for-markdown-that-cuts-token-usage-by-95-e0f9d21f89af