Agent D: QMD Models & Fine-Tuning Potential for Indigenous Knowledge
Research Date: April 15, 2026
Researcher: Agent D (QMD Internals + HuggingFace Models)
Subject: QMD architecture, exact model inventory, and domain-specific fine-tuning for Guillaume Descoteaux-Isabelle's Indigenous knowledge corpus
Key Findings
- QMD uses exactly 3 GGUF models via node-llama-cpp, all running locally with no cloud APIs:
  - Embedding: `embeddinggemma-300M` (Google, ~300MB, Q8_0 quantization)
  - Reranking: `Qwen3-Reranker-0.6B` (Alibaba, ~640MB, Q8_0 quantization)
  - Query Expansion: `qmd-query-expansion-1.7B` (Tobi's custom SFT of Qwen3-1.7B, ~1.1GB, Q4_K_M)
- All 3 models can be swapped via environment variables (`QMD_EMBED_MODEL`, `QMD_RERANK_MODEL`, `QMD_GENERATE_MODEL`). This is the primary mechanism for deploying fine-tuned models.
- QMD already has a complete fine-tuning pipeline in its `finetune/` directory, but for the query expansion model only. The pipeline uses LoRA SFT on Qwen3-1.7B with HuggingFace's `trl`/`peft` libraries and converts the result to GGUF for deployment.
- The embedding model (embeddinggemma-300M) is the highest-impact fine-tuning target for domain-specific knowledge. It determines what QMD considers "semantically similar"; fine-tuning it on Indigenous knowledge pairs would directly improve vector search quality.
- The reranker (Qwen3-Reranker-0.6B) is the second-highest-impact target: it decides the final result ordering. Domain-specific reranking training (yes/no relevance judgments on Indigenous queries) would improve result quality.
- A Mac Mini M4 can handle fine-tuning of all three models via the PyTorch MPS backend or Apple MLX. The models are small enough (300M–1.7B params) for LoRA fine-tuning on 16–64GB unified memory.
QMD Architecture Overview
What QMD Is
QMD ("Query Markup Documents") is an on-device local search engine for markdown files, created by Tobi LΓΌtke (Shopify founder). It indexes markdown notes, meeting transcripts, and documentation, then searches them using a hybrid pipeline combining three techniques β all running locally with no cloud dependencies.
- License: MIT
- Runtime: Node.js 22+ or Bun
- Package: `@tobilu/qmd` (npm, v2.1.0 as of research date)
- Storage: SQLite (FTS5 + sqlite-vec extension)
- LLM Runtime: node-llama-cpp 3.18.1 (bindings for llama.cpp)
- Models: Auto-downloaded GGUF files from HuggingFace, cached at `~/.cache/qmd/models/`
Search Pipeline Architecture
QMD implements a 3-stage hybrid search pipeline:
User Query
    │
    ├──► Query Expansion (Qwen3-1.7B fine-tuned)
    │      Produces: hyde: / lex: / vec: structured expansions
    │
    ├──► BM25 Full-Text Search (SQLite FTS5)
    │      Fast keyword matching, no LLM needed
    │
    ├──► Vector Similarity Search (embeddinggemma-300M)
    │      Semantic search via cosine distance on embeddings
    │
    ├──► Reciprocal Rank Fusion (RRF)
    │      Combines BM25 + vector results with position weighting
    │
    └──► LLM Reranking (Qwen3-Reranker-0.6B)
           Yes/No logprob-based relevance scoring
Key Technical Details
- Chunking: 900 tokens/chunk with 15% overlap, prefers markdown heading boundaries
- AST-aware chunking: Optional tree-sitter based chunking for code files (`.ts`/`.js`/`.py`/`.go`/`.rs`)
- Embedding dimension: 768 (with MRL support for 512/256/128 truncation)
- Score fusion: Position-aware blending where the top 1-3 results get 75% RRF weight, 4-10 get 60%, and 11+ get 40%
- Context system: Hierarchical metadata that travels with search results, improving LLM contextual understanding
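To make the fusion step concrete, here is a minimal Python sketch of reciprocal rank fusion with the position-dependent blend weights described above. This is an illustration of the technique only, not QMD's actual implementation (QMD is TypeScript); the function name and the exact way the weights are applied are assumptions.

```python
# Illustrative sketch (not QMD's code): RRF over two rankings, with the RRF
# term weighted 75% for ranks 1-3, 60% for ranks 4-10, and 40% for ranks 11+.
from collections import defaultdict

def rrf_blend(bm25_ranking: list[str], vector_ranking: list[str], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            rrf = 1.0 / (k + rank)                               # standard RRF term
            weight = 0.75 if rank <= 3 else 0.60 if rank <= 10 else 0.40
            scores[doc_id] += weight * rrf                       # position-aware blend
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rrf_blend(["a", "b", "c"], ["b", "d", "a"]))
```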
QMD MCP Server
QMD exposes an MCP (Model Context Protocol) server for AI agent integration:
- Transport: stdio (default) or HTTP (`:8181`)
- Tools exposed: `query`, `get`, `multi_get`, `status`
- HTTP mode: Models stay loaded in VRAM across requests; contexts auto-disposed after 5 min idle
- Config: `qmd mcp` (stdio) or `qmd mcp --http --daemon` (HTTP background)
HuggingFace Models Used by QMD
Model 1: Embedding Model – embeddinggemma-300M
| Property | Value |
|---|---|
| GGUF URI | hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf |
| Source model | google/embeddinggemma-300M |
| Parameters | 300M |
| Architecture | Gemma 3 (T5Gemma initialization) |
| Embedding dim | 768 (MRL: 512/256/128) |
| Context length | 2048 tokens |
| Quantization | Q8_0 (~300MB on disk) |
| Training data | ~320B tokens, 100+ languages, web + code + synthetic |
| Precision note | Does NOT support float16 β requires float32 or bfloat16 |
| Paper | EmbeddingGemma: Powerful and Lightweight Text Representations |
| Override env var | QMD_EMBED_MODEL |
How QMD formats text for embedding:
For queries:
task: search result | query: {query}
For documents:
title: {title} | text: {text}
QMD also supports Qwen3-Embedding format (auto-detected via regex on model URI):
Instruct: Retrieve relevant documents for the given query
Query: {query}
Alternative embedding model supported: `Qwen/Qwen3-Embedding-0.6B`, which can be used by setting `QMD_EMBED_MODEL=hf:Qwen/Qwen3-Embedding-0.6B-GGUF/...`
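For illustration, a minimal Python sketch of how these prompt strings are assembled. The helper names are made up here; this is not QMD's API.

```python
# Illustrative helpers (not QMD's API) that build the prompt strings shown above.
def format_query_embeddinggemma(query: str) -> str:
    return f"task: search result | query: {query}"

def format_document_embeddinggemma(title: str, text: str) -> str:
    return f"title: {title} | text: {text}"

def format_query_qwen3(query: str) -> str:
    # Qwen3-Embedding style, auto-detected by QMD from the model URI
    return f"Instruct: Retrieve relevant documents for the given query\nQuery: {query}"

print(format_query_embeddinggemma("medicine wheel teachings"))
print(format_document_embeddinggemma("Four Directions", "The Medicine Wheel maps the four directions..."))
```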
Model 2: Reranking Model – Qwen3-Reranker-0.6B
| Property | Value |
|---|---|
| GGUF URI | hf:ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF/qwen3-reranker-0.6b-q8_0.gguf |
| Source model | Qwen/Qwen3-Reranker-0.6B |
| Parameters | 0.6B (600M) |
| Architecture | Qwen3 (28 layers, transformer) |
| Context length | 32K tokens |
| Quantization | Q8_0 (~640MB on disk) |
| Languages | 100+ languages |
| Override env var | QMD_RERANK_MODEL |
How the reranker works:
The reranker is a causal LM that produces yes/no logprob scores:
System: Judge whether the Document meets the requirements based on the
Query and the Instruct provided. Answer "yes" or "no".
User: <Instruct>: {instruction}
<Query>: {query}
<Document>: {document}
It outputs logprobs for "yes" and "no" tokens, computing:
score = exp(logprob_yes) / (exp(logprob_yes) + exp(logprob_no))
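A numerically stable Python sketch of that formula, for illustration only:

```python
# Softmax over the "yes"/"no" token logprobs returned by the reranker,
# shifted by the max logprob for numerical stability.
import math

def rerank_score(logprob_yes: float, logprob_no: float) -> float:
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)

print(rerank_score(-0.2, -3.1))  # ~0.95: the document is judged relevant
```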
Model 3: Query Expansion Model – qmd-query-expansion-1.7B
| Property | Value |
|---|---|
| GGUF URI | hf:tobil/qmd-query-expansion-1.7B-gguf/qmd-query-expansion-1.7B-q4_k_m.gguf |
| Source/base model | Qwen/Qwen3-1.7B |
| Parameters | ~1.7B (Qwen3-1.7B base with the LoRA adapter merged in) |
| Quantization | Q4_K_M (~1.1GB on disk) |
| Training method | LoRA SFT (rank 16, alpha 32, all projection layers) |
| Training data | ~2,290 examples |
| Override env var | QMD_GENERATE_MODEL |
| HF repos | tobil/qmd-query-expansion-1.7B (merged), tobil/qmd-query-expansion-1.7B-gguf (GGUF), tobil/qmd-query-expansion-1.7B-sft (adapter) |
Prompt format (Qwen3 chat template):
<|im_start|>user
/no_think Expand this search query: {query}<|im_end|>
<|im_start|>assistant
Output format:
hyde: A hypothetical document passage that answers the query
lex: keyword1 keyword2
lex: another keyword variation
vec: natural language semantic query
vec: alternative semantic reformulation
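A minimal Python sketch of parsing this output into structured expansions. This is not QMD's actual parser; it assumes one `hyde:`/`lex:`/`vec:` item per line, as in the format above.

```python
# Minimal parser sketch for the hyde:/lex:/vec: output format.
def parse_expansion(raw: str) -> dict[str, list[str]]:
    parsed: dict[str, list[str]] = {"hyde": [], "lex": [], "vec": []}
    for line in raw.strip().splitlines():
        prefix, _, rest = line.partition(":")
        if prefix.strip() in parsed and rest.strip():
            parsed[prefix.strip()].append(rest.strip())
    return parsed

example = """hyde: A hypothetical document passage that answers the query
lex: keyword1 keyword2
vec: natural language semantic query"""
print(parse_expansion(example))
```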
Model Size Summary
| Model | Params | Disk Size | VRAM Usage | Role |
|---|---|---|---|---|
| embeddinggemma-300M (Q8_0) | 300M | ~300MB | ~400MB | Embedding |
| Qwen3-Reranker-0.6B (Q8_0) | 600M | ~640MB | ~700MB | Reranking |
| qmd-query-expansion-1.7B (Q4_K_M) | 1.7B | ~1.1GB | ~1.5GB | Query Expansion |
| Total | ~2.6B | ~2GB | ~2.6GB | |
Can Models Be Swapped?
Yes. All three models can be overridden via environment variables:
# Use a custom fine-tuned embedding model
export QMD_EMBED_MODEL="/path/to/my-domain-embeddings.gguf"
# or from HuggingFace
export QMD_EMBED_MODEL="hf:myuser/my-model-GGUF/model.gguf"
# Use a custom reranker
export QMD_RERANK_MODEL="/path/to/my-reranker.gguf"
# Use a custom query expansion model
export QMD_GENERATE_MODEL="/path/to/my-expander.gguf"
Or via the SDK constructor:
const store = await createStore({
dbPath: './index.sqlite',
llm: new LlamaCpp({
embedModel: '/path/to/custom-embed.gguf',
rerankModel: '/path/to/custom-reranker.gguf',
generateModel: '/path/to/custom-expander.gguf',
}),
});
Fine-Tuning Potential for Domain Knowledge
Priority 1: Fine-Tune the Query Expansion Model (Highest Feasibility)
Why: QMD already has a complete, production-grade fine-tuning pipeline for this model. You can add domain-specific query expansion examples that teach the model how Indigenous knowledge concepts should be expanded.
Impact: Medium-High. When Guillaume searches for "medicine wheel", the expansion model would generate domain-aware expansions like:
hyde: The Medicine Wheel is a sacred circle representing the four directions, seasons, and stages of life in Indigenous cosmology. Each direction carries specific teachings and ceremonial significance.
lex: medicine wheel four directions
lex: sacred circle indigenous ceremony
vec: what are the teachings of the medicine wheel in indigenous tradition
vec: ceremonial significance of the four directions
What you need β Training data format (JSONL):
{"query": "medicine wheel", "output": [["hyde", "The Medicine Wheel represents..."], ["lex", "sacred circle four directions"], ["lex", "indigenous cosmology ceremony"], ["vec", "what are the teachings of the medicine wheel"], ["vec", "ceremonial significance of four directions and seasons"]]}
{"query": "relational accountability", "output": [["hyde", "Relational accountability in Indigenous research..."], ["lex", "relational accountability indigenous"], ["lex", "research as ceremony relationships"], ["vec", "what is relational accountability in indigenous methodology"], ["vec", "how does ceremonial research differ from extractive research"]]}
How many examples needed: QMD's own model was trained on ~2,290 examples. For domain adaptation, 200-500 high-quality examples covering your key concepts would be a strong start; roughly 100 examples is the minimum to see measurable improvement.
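A small Python helper for appending examples in this JSONL schema. The field names follow the examples above; validate the result against `finetune/dataset/schema.py` before training.

```python
# Helper for writing training examples in the query/output JSONL schema shown above.
import json
from pathlib import Path

def append_example(path: str, query: str, expansions: list[tuple[str, str]]) -> None:
    record = {"query": query, "output": [[kind, text] for kind, text in expansions]}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

append_example(
    "data/indigenous-knowledge.jsonl",
    "seven grandfather teachings",
    [
        ("hyde", "The Seven Grandfather Teachings describe wisdom, love, respect, bravery, honesty, humility, and truth..."),
        ("lex", "seven grandfather teachings anishinaabe"),
        ("vec", "what are the seven grandfather teachings and how are they practiced"),
    ],
)
```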
Exact training procedure (using QMD's own pipeline):
# 1. Clone QMD repo
git clone https://github.com/tobi/qmd.git
cd qmd/finetune
# 2. Create your domain training data
# Add your JSONL files to data/indigenous-knowledge.jsonl
# 3. Prepare data (dedup, format for Qwen3 chat template, split)
uv run dataset/prepare_data.py
# 4. Validate
just validate
# 5. Train locally (requires CUDA GPU)
uv run train.py sft --config configs/sft.yaml
# 5b. Or train via HuggingFace Jobs (~$1.50 for A10G, ~45 min)
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py
# 6. Evaluate
uv run eval.py outputs/sft
# 7. Convert to GGUF
uv run convert_gguf.py --size 1.7B
# 8. Deploy into QMD
export QMD_GENERATE_MODEL="/path/to/your-indigenous-expansion-1.7B-q4_k_m.gguf"
Priority 2: Fine-Tune the Embedding Model (Highest Impact)
Why: The embedding model determines what QMD considers semantically similar. The default embeddinggemma-300M was trained on general web text; it has no understanding of Indigenous knowledge concepts, relational science terminology, or ceremonial technology vocabulary. Fine-tuning it on domain pairs would dramatically improve vector search quality.
Impact: Very High. This is the single most impactful improvement. After fine-tuning, searching "relational accountability" would surface documents about "research as ceremony" and "Four Directions" rather than generic accountability documents.
Challenge: EmbeddingGemma-300M uses Sentence Transformers for training but QMD consumes GGUF format via llama.cpp. The workflow requires:
- Fine-tune the PyTorch model using sentence-transformers
- Convert back to GGUF
- Deploy via `QMD_EMBED_MODEL`
Training data format for embedding fine-tuning (pairs):
# Positive pairs (semantically similar)
training_pairs = [
("medicine wheel", "The sacred circle represents the four directions and life stages"),
("relational accountability", "Research as ceremony requires maintaining relationships with all participants"),
("seven grandfather teachings", "Wisdom, love, respect, bravery, honesty, humility, and truth guide behavior"),
("structural tension", "The creative force between current reality and desired vision"),
("two-eyed seeing", "Integrating Indigenous and Western knowledge systems"),
]
Training script for embedding fine-tuning:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Load the base embedding model
model = SentenceTransformer("google/embeddinggemma-300M")
# Prepare training data
train_examples = [
InputExample(texts=[query, positive_doc])
for query, positive_doc in training_pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# MultipleNegativesRankingLoss: best loss for retrieval fine-tuning
# Each pair is a positive; other items in the batch are treated as negatives
train_loss = losses.MultipleNegativesRankingLoss(model=model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=50,
output_path="./indigenous-embeddinggemma-300M",
show_progress_bar=True,
)
How many examples needed:
- Minimum viable: 500 query-document pairs
- Good quality: 2,000-5,000 pairs
- Excellent: 10,000+ pairs
- Data augmentation tip: Use an LLM to generate paraphrases of your existing knowledge base documents to multiply your training data
Converting fine-tuned embedding model to GGUF:
This is the hardest step. EmbeddingGemma is a Gemma 3 architecture model and requires:
- Save the fine-tuned model in HuggingFace format
- Use llama.cpp's `convert_hf_to_gguf.py` to convert to GGUF
- Quantize with `llama-quantize` to Q8_0
# After fine-tuning:
python convert_hf_to_gguf.py ./indigenous-embeddinggemma-300M \
--outfile indigenous-embeddinggemma-300M-f16.gguf --outtype f16
llama-quantize indigenous-embeddinggemma-300M-f16.gguf \
indigenous-embeddinggemma-300M-Q8_0.gguf Q8_0
# Deploy
export QMD_EMBED_MODEL="./indigenous-embeddinggemma-300M-Q8_0.gguf"
qmd embed # Re-embed all documents with the new model
⚠️ Critical Note: After changing the embedding model, you MUST re-run `qmd embed` to regenerate all embeddings. The old embeddings are incompatible with the new model.
Priority 3: Fine-Tune the Reranker (Advanced)
Why: The reranker makes final relevance judgments. Teaching it domain-specific relevance would improve result ordering.
Impact: Medium. Improves ranking quality but only affects the query command (not search or vsearch).
Training approach: The Qwen3-Reranker is a causal LM fine-tuned for yes/no relevance judgments. Domain adaptation would require:
- Curate query-document pairs labeled as relevant/not-relevant
- Fine-tune using the same yes/no judgment prompt format
- Convert to GGUF and deploy
Training data format:
{"query": "four directions teachings", "document": "The Medicine Wheel maps...", "relevant": true}
{"query": "four directions teachings", "document": "GPS navigation uses four...", "relevant": false}
This is more advanced and should be attempted after the embedding and expansion models show improvement.
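If you do attempt it, here is a hedged Python sketch of turning the labeled pairs into chat-style SFT examples using the reranker's yes/no judgment prompt from earlier. The `messages` layout and the instruction text are assumptions, not a documented QMD pipeline.

```python
# Sketch (assumption, not a documented QMD workflow): convert labeled
# query/document pairs into chat-style SFT examples using the reranker's
# yes/no judgment prompt from the "How the reranker works" section.
import json

SYSTEM = ('Judge whether the Document meets the requirements based on the '
          'Query and the Instruct provided. Answer "yes" or "no".')
INSTRUCT = "Retrieve documents relevant to the query about Indigenous knowledge."

def to_sft_example(query: str, document: str, relevant: bool) -> dict:
    user = f"<Instruct>: {INSTRUCT}\n<Query>: {query}\n<Document>: {document}"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": "yes" if relevant else "no"},
        ]
    }

rows = [
    {"query": "four directions teachings", "document": "The Medicine Wheel maps...", "relevant": True},
    {"query": "four directions teachings", "document": "GPS navigation uses four...", "relevant": False},
]
with open("data/reranker-sft.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(to_sft_example(row["query"], row["document"], row["relevant"])) + "\n")
```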
Other Trainable Models for the Use Case
1. NER Model for Indigenous Terminology Extraction
Purpose: Automatically identify and tag Indigenous concepts, place names, ceremony names, and relational terms in documents.
Recommended base: dslim/bert-base-NER or Jean-Baptiste/camembert-ner (for French-language content)
Training approach:
- Annotate 200-500 documents with custom entity types: `CEREMONY`, `DIRECTION`, `TEACHING`, `PLACE`, `RELATION`
- Fine-tune with HuggingFace's `token-classification` pipeline
- Runs easily on a Mac Mini M4 (BERT-base is 110M params)
Impact: Could auto-tag QMD documents with rich metadata, improving search context.
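A minimal Python sketch of the model setup with these custom entity types (BIO labels). The annotation example is illustrative, and the actual fine-tuning would follow the standard HuggingFace token-classification recipe.

```python
# Set up a token-classification model with the custom entity types above.
from transformers import AutoModelForTokenClassification, AutoTokenizer

entity_types = ["CEREMONY", "DIRECTION", "TEACHING", "PLACE", "RELATION"]
labels = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {l: i for i, l in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained(
    "dslim/bert-base-NER",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # replaces the original NER head with a fresh one
)

# One annotated example in the usual token/tag format (illustrative):
example = {
    "tokens": ["The", "Medicine", "Wheel", "faces", "the", "East"],
    "ner_tags": ["O", "B-CEREMONY", "I-CEREMONY", "O", "O", "B-DIRECTION"],
}
```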
2. Document Classification Model
Purpose: Auto-categorize documents into domains (Ceremony, Teaching, Governance, Land, Language, etc.)
Recommended base: distilbert-base-uncased (66M params) or google/embeddinggemma-300M with a classification head
Training approach:
- Label 100-300 documents by category
- Fine-tune a text classification model
- Use as a pre-processing step when indexing into QMD
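A minimal Python sketch of such a classifier fine-tune with the HuggingFace Trainer. Category names come from the list above; the example texts and hyperparameters are placeholders.

```python
# Fine-tune distilbert-base-uncased as a document classifier (illustrative data).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

categories = ["Ceremony", "Teaching", "Governance", "Land", "Language"]
label2id = {c: i for i, c in enumerate(categories)}

data = Dataset.from_dict({
    "text": ["The sunrise ceremony begins facing East...",
             "The treaty council agreed on shared stewardship..."],
    "label": [label2id["Ceremony"], label2id["Governance"]],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(categories),
    id2label={i: c for c, i in label2id.items()},
    label2id=label2id,
)

tokenized = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                          padding="max_length", max_length=256))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./indigenous-doc-classifier",
                           num_train_epochs=3, per_device_train_batch_size=8,
                           logging_steps=10),
    train_dataset=tokenized,
)
trainer.train()
```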
3. Summarization Model via Ollama + LoRA
Purpose: Generate culturally appropriate summaries of Indigenous knowledge documents.
Recommended approach:
- Start with `Qwen3-1.7B` or `Qwen3-4B` via Ollama
- Apply LoRA fine-tuning using `mlx-lm` on the Mac Mini
- Train on document → summary pairs from the knowledge base
Why Qwen3: QMD already uses the Qwen3 family, so there's architectural consistency. The fine-tuned model could also serve double duty as a better query expansion model.
4. Ollama Models for LoRA Adaptation
Models that can be LoRA-adapted for Indigenous knowledge use cases via mlx-lm on Mac Mini:
| Model | Size | Use Case | Mac Mini Feasibility |
|---|---|---|---|
| Qwen3-1.7B | 1.7B | Query expansion, summarization | ✅ Easy (16GB RAM) |
| Qwen3-4B | 4B | Better quality summaries | ✅ Comfortable (32GB RAM) |
| Gemma 3-4B | 4B | General understanding | ✅ Comfortable (32GB RAM) |
| Llama 3.2-3B | 3B | General purpose | ✅ Easy (16GB RAM) |
| Mistral-7B | 7B | High quality generation | ⚠️ Needs 32GB+ RAM |
Practical Fine-Tuning Workflow
Step 1: Data Preparation from Existing Knowledge Base
Assuming Guillaume has a QMD-indexed knowledge base of Indigenous wisdom documents:
# Extract all documents from QMD for training data preparation
qmd search "*" --all --json -c indigenous-knowledge > all_docs.json
# Or use multi-get to retrieve full documents
qmd multi-get "**/*.md" > all_documents.txt
Generating training pairs from existing documents:
Use an LLM (Claude, GPT-4, or a local model) to generate training data from your existing corpus:
# Pseudocode for generating query-expansion training data.
# llm, extract_keywords, rephrase_as_question, and save_to_jsonl are
# placeholders for your own LLM client and helper functions.
for document in corpus:
# Generate likely search queries for this document
queries = llm.generate(f"""
Given this document about Indigenous knowledge:
{document.text[:500]}
Generate 3 realistic search queries someone might use to find this document.
""")
for query in queries:
# Generate expansion in QMD format
expansion = {
"query": query,
"output": [
["hyde", document.text[:200]], # First 200 chars as hypothetical doc
["lex", extract_keywords(query)],
["vec", rephrase_as_question(query)],
]
}
save_to_jsonl(expansion)
For embedding model fine-tuning pairs:
# Generate query-document pairs from your knowledge base
# (corpus and llm are placeholders, as above)
pairs = []
for doc in corpus:
# Each document's title/heading β content is a natural positive pair
pairs.append((doc.title, doc.content[:500]))
# Generate synthetic queries for the document
synthetic_queries = llm.generate(f"Generate 3 search queries for: {doc.content[:300]}")
for q in synthetic_queries:
pairs.append((q, doc.content[:500]))
Step 2: Training Pipeline on Apple Silicon
For Query Expansion Model (Recommended First)
Option A: HuggingFace Jobs (easiest, ~$1.50)
cd qmd/finetune
# Place your JSONL data in data/
hf jobs uv run --flavor a10g-large --secrets HF_TOKEN --timeout 2h jobs/sft.py
Option B: Local on Mac Mini M4 via MLX
# Install mlx-lm
pip install mlx-lm
# Convert Qwen3-1.7B to MLX format
python -m mlx_lm.convert --hf-path Qwen/Qwen3-1.7B --mlx-path mlx_models/Qwen3-1.7B
# LoRA fine-tune
python -m mlx_lm.lora \
--model mlx_models/Qwen3-1.7B \
--train \
--data ./data/train/ \
--batch-size 2 \
--lora-layers 16 \
--iters 1000
Option C: Local on Mac Mini M4 via PyTorch MPS
cd qmd/finetune
# Modify configs/sft_local.yaml to use device: "mps" instead of "cuda"
# Reduce batch_size to 1-2 for memory
uv run train.py sft --config configs/sft_local.yaml
For Embedding Model
# Install sentence-transformers
pip install sentence-transformers torch
# Training script (saves to ./output/)
python train_indigenous_embeddings.py
# Convert to GGUF (requires llama.cpp)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && make
python convert_hf_to_gguf.py ../output/ --outfile indigenous-embed-f16.gguf --outtype f16
./llama-quantize indigenous-embed-f16.gguf indigenous-embed-Q8_0.gguf Q8_0
Step 3: Deploying Fine-Tuned Models Back into QMD
# Option 1: Environment variables (simple, per-session)
export QMD_EMBED_MODEL="$HOME/models/indigenous-embeddinggemma-Q8_0.gguf"
export QMD_GENERATE_MODEL="$HOME/models/indigenous-query-expansion-q4_k_m.gguf"
# Option 2: Shell profile (persistent)
echo 'export QMD_EMBED_MODEL="$HOME/models/indigenous-embeddinggemma-Q8_0.gguf"' >> ~/.zshrc
echo 'export QMD_GENERATE_MODEL="$HOME/models/indigenous-query-expansion-q4_k_m.gguf"' >> ~/.zshrc
# CRITICAL: Re-embed all documents after changing the embedding model
qmd embed # This will re-generate all vector embeddings
Step 4: Version Management of Fine-Tuned Models
Recommended directory structure:
~/models/qmd-indigenous/
├── v1/
│   ├── indigenous-embed-Q8_0.gguf
│   ├── indigenous-expansion-q4_k_m.gguf
│   └── training-metadata.json   # training date, data size, eval scores
├── v2/
│   ├── indigenous-embed-Q8_0.gguf
│   └── ...
└── current -> v2/   # symlink to active version
# Switch versions
export QMD_EMBED_MODEL="$HOME/models/qmd-indigenous/current/indigenous-embed-Q8_0.gguf"
Evaluation workflow before deploying:
# Test search quality with domain-specific queries
qmd query "medicine wheel teachings" --json
qmd query "relational accountability research" --json
qmd query "four directions ceremony" --json
# Compare results against expected documents
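A small Python harness can automate this comparison. The sketch below assumes `qmd query --json` emits a JSON array whose items include a file path field; the `path` key and the expected-document mapping are assumptions to adjust against the actual output shape.

```python
# Sketch of a before/after evaluation harness around `qmd query --json`.
import json
import subprocess

# Map domain queries to the document each one is expected to surface (illustrative paths).
test_cases = {
    "medicine wheel teachings": "notes/medicine-wheel.md",
    "relational accountability research": "notes/research-as-ceremony.md",
    "four directions ceremony": "notes/four-directions.md",
}

def hit_at_k(query: str, expected_path: str, k: int = 5) -> bool:
    out = subprocess.run(["qmd", "query", query, "--json"],
                         capture_output=True, text=True, check=True)
    results = json.loads(out.stdout)
    return any(expected_path in str(r.get("path", "")) for r in results[:k])

hits = sum(hit_at_k(q, p) for q, p in test_cases.items())
print(f"hit@5: {hits}/{len(test_cases)}")
```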
Evidence Quality
| Claim | Evidence Level | Source |
|---|---|---|
| Exact model URIs used by QMD | ✅ Verified in source code | src/llm.ts lines defining DEFAULT_EMBED_MODEL, DEFAULT_RERANK_MODEL, DEFAULT_GENERATE_MODEL |
| Models can be swapped via env vars | ✅ Verified in source code | src/llm.ts constructor reads QMD_EMBED_MODEL, QMD_RERANK_MODEL, QMD_GENERATE_MODEL |
| EmbeddingGemma is 300M params, 768-dim | ✅ Verified on model card | google/embeddinggemma-300M HuggingFace page |
| Qwen3-Reranker is 0.6B, 28 layers, 32K context | ✅ Verified on model card | Qwen/Qwen3-Reranker-0.6B HuggingFace page |
| QMD fine-tuning pipeline uses LoRA SFT on Qwen3-1.7B | ✅ Verified in source code | finetune/README.md, finetune/configs/sft.yaml, finetune/train.py |
| Training data schema is JSONL with query/output pairs | ✅ Verified in source code | finetune/dataset/schema.py (Pydantic model TrainingExample) |
| ~2,290 training examples for query expansion | ✅ Stated in finetune README | finetune/README.md training results table |
| Sentence-transformers fine-tuning workflow | ⚠️ Standard practice, not QMD-specific | sentence-transformers documentation; not tested against embeddinggemma specifically |
| GGUF conversion of fine-tuned embeddinggemma | ⚠️ Theoretically viable, not tested | llama.cpp supports the Gemma 3 architecture, but conversion of fine-tuned ST models needs validation |
| Mac Mini M4 can handle LoRA fine-tuning | ⚠️ Widely reported, not personally verified | Community reports of MLX and MPS fine-tuning on M-series Macs |
| Reranker fine-tuning feasibility | ⚠️ Inferred from architecture | Qwen3-Reranker is a standard causal LM; LoRA fine-tuning is standard but not documented in QMD |
Sources
Primary Sources (Verified in Source Code)
- QMD GitHub Repository – https://github.com/tobi/qmd (commit cfd640e)
  - src/llm.ts – Model URIs, configuration, embedding formatting
  - CLAUDE.md – Architecture overview, commands
  - package.json – Dependencies (node-llama-cpp 3.18.1, sqlite-vec 0.1.9)
  - finetune/README.md – Complete fine-tuning documentation
  - finetune/CLAUDE.md – Fine-tuning pipeline instructions
  - finetune/dataset/schema.py – Training data schema (Pydantic)
  - finetune/configs/sft.yaml – SFT hyperparameters
  - finetune/configs/sft_local.yaml – Local training config
  - finetune/convert_gguf.py – GGUF conversion script
  - finetune/Justfile – Training commands
  - finetune/pyproject.toml – Python dependencies for fine-tuning
HuggingFace Model Cards
- google/embeddinggemma-300M – https://huggingface.co/google/embeddinggemma-300M
- ggml-org/embeddinggemma-300M-GGUF – https://huggingface.co/ggml-org/embeddinggemma-300M-GGUF
- Qwen/Qwen3-Reranker-0.6B – https://huggingface.co/Qwen/Qwen3-Reranker-0.6B
- ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF – https://huggingface.co/ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF
- tobil/qmd-query-expansion-1.7B – https://huggingface.co/tobil/qmd-query-expansion-1.7B
- tobil/qmd-query-expansion-1.7B-gguf – https://huggingface.co/tobil/qmd-query-expansion-1.7B-gguf (inferred)
Technical References
- sentence-transformers documentation – https://www.sbert.net/docs/training/overview.html
- EmbeddingGemma paper – https://arxiv.org/abs/2509.20354
- HyDE technique – https://arxiv.org/abs/2212.10496
- MLX library – https://github.com/ml-explore/mlx
- mlx-lm – https://github.com/ml-explore/mlx-lm
- node-llama-cpp – https://node-llama-cpp.withcat.ai/
- llama.cpp GGUF conversion – https://github.com/ggerganov/llama.cpp
DeepWiki / Articles
- QMD DeepWiki – https://deepwiki.com/tobi/qmd/2-getting-started
- QMD Medium article – https://medium.com/coding-nexus/qmd-local-hybrid-search-engine-for-markdown-that-cuts-token-usage-by-95-e0f9d21f89af